Graphic showing how the Dark and Stormy Archives software builds descriptive summaries of web archive collections.

What’s in a Web Archive Collection? Summarization and Discovery of Archived Webpages

Michele C. Weigle, PhD, will speak on Thursday, November 17, 2022 at 2:00 PM ET. This program will be live-streamed globally, and archived, by NIH VideoCasting. Dr. Weigle is a Professor in the Department of Computer Science at Old Dominion University. Circulating Now interviewed her about her research and upcoming talk.

Circulating Now: Tell us a little about yourself. Where are you from? What do you do? What is your typical workday like? 

Michele C. Weigle: I grew up in St. Francisville, Louisiana, a small town on the Mississippi River, just north of Baton Rouge. Between graduate school and faculty positions, I’ve now spent over half of my life in the Carolinas and Virginia.

I am a Professor of Computer Science at Old Dominion University (ODU) in Norfolk, Virginia. I joined ODU in 2006 as an Assistant Professor studying computer networking and started doing research in web archiving in 2010. For the past few years, I’ve been teaching courses in Web Science and Data Visualization, and I also developed a Research Methods course that is required for all of our department’s PhD students. In addition to that, I serve as one of our department’s Graduate Program Directors, leading our curriculum committee and making sure that our graduate students are successfully progressing through our MS and PhD programs.

A typical day during the academic semester involves research, in terms of meeting with PhD students and writing and editing paper drafts, teaching and course prep, and my departmental service, which, honestly and unglamorously, often involves answering emails and signing forms for our graduate students. I’m married and have two school-age sons, so evenings are often filled with activities related to soccer and band.

CN: Your talk, “What’s in a Web Archive Collection? Summarization and Discovery of Archived Webpages,” will explore some of the projects in the Web Science and Digital Libraries (WS-DL) group at Old Dominion University. Tell us a little more about that group.

MW: My colleague Michael L. Nelson founded the WS-DL research group when he joined ODU in 2002. I joined the group in 2010, and between 2018 and 2020 we added four more faculty members. We currently have six faculty members, 20 PhD students, and several master’s and undergraduate students. As a whole, we study web archiving, social media, scholarly communication, information retrieval, human-computer interaction, accessible computing, and information visualization. Michael and I direct the group’s web archiving projects, where we study the process of web archiving and the impact of changing technologies on webpage capture and replay.

CN: What are some technical challenges you’ve seen in building web archive collections?

MW: One of the biggest challenges for web archiving is capturing today’s dynamic webpages at scale. As a research community, we have gotten good at capturing traditional webpages at scale, and we have some tools that can better capture more dynamic webpages, but these tools run much more slowly because dynamic webpages include JavaScript code that executes in the web browser, which then may cause other content to be loaded.

For collections themselves, another challenge is in helping curators apply rich metadata to the webpages in the collection. There may be some technical approaches that could automatically suggest categories or extract other types of metadata from the pages themselves, to help reduce the amount of manual entry required.

A particular interest of mine is in discovery. How can users find appropriate web archive collections, and then how can they identify relevant resources once inside a collection? Our “Dark and Stormy Archives” (DSA) Toolkit is one approach to help users gain a quick overview and understanding of what is contained in a web archive collection.

Graphic showing how the DSA software builds descriptive summaries of web archive collections.
The Dark and Stormy Archives Project intends to produce stories by selecting representative mementos from web archive collections. Each story consists of around 28 mementos telling the user about the collection. Each of these mementos is visualized as a surrogate, and together they become a story. Each story is thus a summary of summaries.

CN: Do you have a favorite collection, or one that you find interesting or particularly relevant to your work?

MW: I have been impressed with NLM’s Global Health Events web archive, hosted at Archive-It. With a collection this large, it is essential to have good metadata to allow users to facet and drill down on their topics of interest. In particular, I know that the webpages included in this collection regarding COVID-19 will continue to serve as excellent resources for future scholars and researchers studying this time period.

CN: What changes to web archiving do you see coming and how will these benefit researchers?

MW: For one, I hope that groups like ours can continue to study the needs of researchers and produce tools and techniques that can improve discovery of content in web archives. Additionally, we are continually learning new things about how the web archiving process, both in terms of capture and replay, interacts with the ever-changing technologies that power the Web. We hope to be able to address some of the challenges we discover, so that web archives can remain rich sources of study for researchers.

Michele C. Weigle’s presentation is part of our NLM History Talks, which promote awareness and use of the National Library of Medicine and other historical collections for research, education, and public service in biomedicine, the social sciences, and the humanities. All talks are live-streamed globally, and subsequently archived, by NIH VideoCasting. Stay informed about the lecture series on Twitter at #NLMHistTalk.
