A network of nodes and edges showing the domains and links between them for the NLM Domestic Violence Awareness and Prevention web archive collection.

Exploring the Data of Web Archives as Part of Data Science @ NLM

By Christie Moffatt ~

Over the past year (May 2020–May 2021) I participated in a National Library of Medicine (NLM) Data Science Mentorship program, which is part of a broader Data Science @ NLM training program designed to prepare staff to engage with and participate in the Library’s developing data science efforts. My mentor was NLM Computer Scientist Marie Gallagher, and our project was to gain a better understanding of and practical experience with tools and techniques for the exploration and study of the data of web archives.

NLM has been actively involved in building collections of web archives through the work of our Web Collecting and Archiving Group.  A team of archivists, librarians, and historians is primarily using Archive-It, a service of the Internet Archive, to collect on a broad range of topics in line with NLM collection development policies, including HIV/AIDS, the Opioid Epidemic, the 2014 Ebola Outbreak, and currently around the COVID-19 Pandemic.   As a member of this working group, I was interested to learn more about tools such as Archives Unleashed and the GLAM Workbench to better understand the work and needs of researchers, as well as explore the possibility of using these tools to support ongoing collection development and curation.   I had participated in an Archives Unleashed datathon in 2019, and recognized that I needed much more hands-on experience to better understand the nature of the tools, how to use them, and the broader picture of web archives data and research.  The NLM Data Science Mentorship program provided a wonderful opportunity to collaborate and learn more about the data of web archives, as well as project design, experimental thinking, science communication, and data storytelling.

The Archives Unleashed project, supported by The Andrew W. Mellon Foundation, provides a set of tools designed to lower barriers for researchers to explore web archives.  The Archives Unleashed tools are designed for different levels of experience including the Archives Unleashed Cloud (for beginners), Archives Unleashed Notebooks (for beginner/intermediate users), and Archives Unleashed Toolkit (for advanced users).  My mentor and I reviewed each of these tools and focused on the Archives Unleashed Cloud (migrating soon to Archive-It) to query the data of individual NLM web archive collections and obtain derivative data files for further analysis.

We uploaded the resulting derivative data files into a variety of data visualization and text analysis tools and learned a number of lessons: the value of a flexible computing environment in which to install software, the need for advanced data cleaning skills, and, more generally, the need for patience and flexibility. I also gained an appreciation for the complexity of the analysis tools and the need for more time to understand how the data is interpreted and presented.

In one experiment, following learning guides from Archives Unleashed, we loaded one of the derivative files (the GEXF file) into an open-source graph visualization program called Gephi to create a visualization of the network of nodes (domains) and edges (hyperlinks between them) for a small collection of sites related to the NLM exhibition Confronting Violence, Improving Women’s Lives.

Graph visualization in Gephi of network connections between domains in an NLM web archive collection related to the NLM exhibition Confronting Violence, Improving Women’s Lives.

If we look closely, we can see arrows between the domains, indicating hyperlink connections. The size of each label and node is significant: it represents how many times that domain is linked to. Researchers can use this visualization to see who is linking to whom and which domains are most popular in the collection. In this case, we found that forensicnurses.org, Twitter, Facebook, and YouTube are frequently linked to in the collection. We can also look specifically at forensicnurses.org and focus on the links to and from this particular domain, with safeta.org, then community.iafn.org, having the largest number of links.
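For readers who want to check the "most linked to" ranking outside Gephi, here is a minimal sketch in Python that reads a GEXF document and totals the inbound link weight for each domain. It is not the Archives Unleashed or Gephi implementation. The domain names come from the example above, but the edge weights in the sample are invented for illustration, and the function names are my own.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Toy GEXF document. The domains are named in the post; the edge weights are
# invented for illustration and are NOT the collection's real link counts.
SAMPLE_GEXF = """<gexf xmlns="http://www.gexf.net/1.2draft" version="1.2">
  <graph defaultedgetype="directed">
    <nodes>
      <node id="0" label="forensicnurses.org"/>
      <node id="1" label="safeta.org"/>
      <node id="2" label="community.iafn.org"/>
      <node id="3" label="twitter.com"/>
    </nodes>
    <edges>
      <edge id="0" source="1" target="0" weight="5"/>
      <edge id="1" source="2" target="0" weight="3"/>
      <edge id="2" source="0" target="3" weight="2"/>
      <edge id="3" source="1" target="3" weight="1"/>
    </edges>
  </graph>
</gexf>
"""


def local_name(tag):
    """Strip an XML namespace prefix, e.g. '{http://...}node' -> 'node'."""
    return tag.rsplit("}", 1)[-1]


def weighted_in_degree(gexf_text):
    """Return a Counter mapping each domain label to its total inbound link weight."""
    root = ET.fromstring(gexf_text)
    labels = {}
    inbound = Counter()
    for elem in root.iter():
        name = local_name(elem.tag)
        if name == "node":
            labels[elem.get("id")] = elem.get("label", elem.get("id"))
        elif name == "edge":
            # An edge without a weight attribute counts as a single link.
            inbound[elem.get("target")] += float(elem.get("weight", "1"))
    # Translate node ids back into readable domain labels.
    return Counter({labels.get(node_id, node_id): total
                    for node_id, total in inbound.items()})


for domain, total in weighted_in_degree(SAMPLE_GEXF).most_common():
    print(domain, total)
```

Domains with no inbound links do not appear in the result. Sorting by inbound weight approximates what sizing nodes by incoming links shows visually.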

In another experiment we used a derivative file containing the text extracted from HTML documents within the web archive (a CSV file). We explored this data using a web-based set of text analysis tools called Voyant Tools. Data cleanup of the text in our larger collections proved challenging, and we ended up creating a very small sample data set for this project. The text was still a challenge, with unrecognizable characters, though a bit easier to manage. We removed content in languages other than English or Spanish (which made sense for this data set) and removed file formats that were not text. You can see a big difference in the visualizations using the “before” and “after” versions of the derivative text file.
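The two filtering steps described above (keep only English and Spanish, drop non-text formats) could be scripted along these lines. This is only a sketch: the column names `url`, `mime_type`, `language`, and `content` and the sample rows are hypothetical, and a real derivative file may be laid out differently.

```python
import csv
import io

# Hypothetical column names and made-up rows, for illustration only.
SAMPLE_CSV = """url,mime_type,language,content
https://example.org/a,text/html,en,Vaccines are safe and effective.
https://example.org/b,text/html,es,Las vacunas son seguras.
https://example.org/c,text/html,fr,Les vaccins sont efficaces.
https://example.org/d,application/pdf,en,Not really text
"""

KEEP_LANGUAGES = {"en", "es"}


def clean_rows(csv_text, keep_languages=KEEP_LANGUAGES):
    """Keep only rows that are text and written in one of the wanted languages."""
    cleaned = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        if not row["mime_type"].startswith("text/"):
            continue  # skip PDFs, images, and other non-text formats
        if row["language"] not in keep_languages:
            continue  # skip content that is neither English nor Spanish
        cleaned.append(row)
    return cleaned


for row in clean_rows(SAMPLE_CSV):
    print(row["url"], row["language"])
```

Writing the kept rows back out with `csv.DictWriter` would produce an "after" file ready to load into a text analysis tool.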

A “before” and “after” data cleaning text visualization in Voyant, using data from a test NLM web archive collection related to HHS efforts to address hesitancy around COVID-19 vaccines.
With Voyant Tools researchers can visualize the text in multiple ways: a word cloud showing the most frequent words used in the collection, the context of the word or words used in a collection (for example, what text comes before or after the word “vaccine”), where in the text the terms of interest are most concentrated, and the terms highlighted in the text itself.  There are all kinds of ways to filter this data, and researchers can swap out the visualizations (there are 28 available) depending on what is most useful to their research.
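To give a sense of what sits behind two of these views, the sketch below counts the most frequent words and lists the words around a term of interest. It is a simplified stand-in, not how Voyant Tools is implemented. The sample text and the stopword list are made up for illustration.

```python
import re
from collections import Counter

# Made-up sample text, for illustration only.
SAMPLE_TEXT = (
    "The vaccine is free. Talk to your doctor about the vaccine "
    "and learn where to get a vaccine near you."
)

# A tiny, hand-picked stopword list; real tools ship much longer ones.
STOPWORDS = {"the", "is", "to", "your", "about", "and", "where", "a", "you"}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def top_terms(text, n=3):
    """Return the n most frequent non-stopword terms with their counts."""
    counts = Counter(t for t in tokenize(text) if t not in STOPWORDS)
    return counts.most_common(n)


def keyword_in_context(text, keyword, window=2):
    """Return (left, keyword, right) tuples with `window` words on each side."""
    tokens = tokenize(text)
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, tok, right))
    return hits


print(top_terms(SAMPLE_TEXT, 1))
for left, word, right in keyword_in_context(SAMPLE_TEXT, "vaccine"):
    print(f"{left:>12} [{word}] {right}")
```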

We also tested out a set of Jupyter notebooks, released in 2020 as part of the GLAM Workbench (Galleries, Libraries, Archives, and Museums) with funding from the International Internet Preservation Consortium (IIPC). Like Archives Unleashed Cloud, the notebooks (there are 16) are intended to be a starting point specifically for researchers who want to make use of web archives. The notebooks offer a range of options for examining content in the Internet Archive (and other archives) and, even easier on the researcher, run using Binder, a hosted environment that runs the notebooks for you, so you don’t need any software installed on your own computer.

Researchers can use a notebook, “Get full page screenshots from archived web pages,” for example, to examine visual changes in a website over time. In the visual below, we reviewed the CDC coronavirus homepage as it changed throughout 2020.

A display of three versions of a CDC webpage from January, February and March of 2020 showing how it got longer and more complex.
Screenshot of the target URL https://www.cdc.gov/coronavirus/2019-ncov/ over time using “Get full page screenshots from archived web pages” GLAM Workbench notebook.

Other notebooks in the collection allow researchers to discover changes to text on a webpage over time, for example, or to discover when a piece of text first appears in an archived webpage. In the example below, we used the notebook “Find when a piece of text appears in an archived web page” to discover the first time “social distancing” was used on the CDC coronavirus homepage.

A screenshot of a tool that locates specific phrases in webpage text across time.
Screenshot of results from searching for First Occurrence of the text “Social Distancing” on https://www.cdc.gov/coronavirus/2019-ncov/ using “Find when a piece of text appears in an archived web page” GLAM Workbench notebook.
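The core idea behind a "first occurrence" search can be sketched in a few lines: sort the captures by timestamp and stop at the first one that contains the phrase. This is not the GLAM Workbench notebook's code, which retrieves the captures from the web archive itself. The function names here are my own, and the captures are invented placeholders rather than real page contents or dates.

```python
from datetime import datetime

WAYBACK_PREFIX = "https://web.archive.org/web"


def wayback_url(timestamp, url):
    """Build a Wayback Machine URL from a 14-digit capture timestamp."""
    return f"{WAYBACK_PREFIX}/{timestamp}/{url}"


def first_occurrence(captures, phrase):
    """Return (timestamp, text) for the earliest capture containing phrase, or None.

    captures is an iterable of (timestamp, page_text) pairs, where timestamp
    is a YYYYMMDDhhmmss string, so string order matches date order.
    """
    needle = phrase.casefold()
    for timestamp, text in sorted(captures, key=lambda capture: capture[0]):
        if needle in text.casefold():
            return timestamp, text
    return None


# Invented captures of a placeholder page; NOT real archived content or dates.
captures = [
    ("20200301000000", "Practice social distancing where possible."),
    ("20200115000000", "Information for travelers."),
    ("20200215000000", "What you should know."),
]

hit = first_occurrence(captures, "Social Distancing")
if hit:
    timestamp, _ = hit
    when = datetime.strptime(timestamp, "%Y%m%d%H%M%S")
    print(when.date(), wayback_url(timestamp, "https://example.org/"))
```

The matching is case-insensitive, so “Social Distancing” also finds “social distancing”. Fetching and extracting the text of every capture is the slow part in practice.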

These notebooks are not without challenges themselves. While they are really easy to use, querying the entire Internet Archive for results takes time (sometimes hours), and the notebooks can time out. This work provided the opportunity to compare approaches to querying the data of web archives, as well as more lessons on patience and persistence.

The work we started during this mentorship is ongoing and the landscape of tools is evolving. There is definitely room for further exploration of analysis tools to better understand how researchers can use web archives and how web collecting organizations like NLM can support their work. We learned midway through the project that Archives Unleashed Cloud will be decommissioned at the end of June 2021 and migrated to Archive-It. I look forward to learning more about what opportunities this will bring for making NLM web archive collections available as data, whether through providing tidy derivative data sets for our researchers, or sharing notebooks querying the web archive data. Supporting researchers through description and transparency about the scope of a collection is also important, as well as helping them understand the nature of web content as historical materials (on this topic I read, and consulted many times throughout this project, Ian Milligan’s History in the Age of Abundance? How the Web Is Transforming Historical Research). There is much exciting and important work ahead.

Network graphic naming NLM staff and offices that supported the project.
Thanks to everyone across NLM for support on this project!

I’m grateful for this opportunity to work with a mentor to explore and learn about the bigger picture of working with web archives as data over this past year.  Many, many thanks to Marie Gallagher, the entire Data Science @ NLM training program team, and all those at NLM supporting this work.

An informal portrait of Christie Moffatt.
Christie Moffatt is Manager of the Digital Manuscripts Program in the History of Medicine Division at the National Library of Medicine and Chair of NLM’s Web Collecting and Archiving Working Group.
