Revealing Data: Why We Need Humans to Curate Web Collections

In this Revealing Data series we explore data in historical medical collections, and how preserving this data helps to ensure that generations of researchers can reexamine it, reveal new stories, and make new discoveries. Future researchers will likely want to use a wide range of approaches to examine the data of the web archive collections that libraries, archives, and others collect and preserve to document unfolding events. Today Circulating Now welcomes guest blogger Alexander Nwala (@acnwala), writing on his research using NLM web archive collections to compare different methods of selecting web content, and some of the difficulties encountered in generating seeds automatically.

I am a Computer Science PhD student and member of the Web Science and Digital Libraries research group at Old Dominion University, Norfolk, Virginia. For the past three years, I have been researching how to generate collections for stories and events under the supervision of Dr. Michael Nelson and Dr. Michele Weigle. There is a shortage of curators to build web archive collections in a world of rapidly unfolding events. A primary objective of my research is investigating how to automatically generate seeds (in the absence of domain knowledge) to create or augment web archive collections.

A screen capture of a news story in L'Economiste about Ebola dated May 25, 2018 — “DRC: New Ebola death confirmed” in *L’Economiste*, May 25, 2018

In the past four years there have been three outbreaks of Ebola virus in Western and Central Africa. Between March 2014 and June 2016, Guinea, Liberia, and Sierra Leone experienced the largest Ebola epidemic in history with over 11,000 deaths. More recently (2017–2018), the Democratic Republic of Congo in Central Africa has been grappling with another Ebola outbreak.

Two months after the World Health Organization declared the 2014 Ebola outbreak a Public Health Emergency of International Concern (PHEIC), the NLM Web Collecting and Archiving Working Group started collecting website URLs as part of the NLM Global Health Events web archive collection. The NLM Ebola virus web archive collection includes websites of organizations, journalists, healthcare workers, and scientists related to the 2014 Ebola virus discourse. More recently, as part of a new web archiving initiative, historian Christine Wenc and archivist Erin Mashni collaborated with NLM to collect URLs for the HIV/AIDS web archive collection. This collection consists of websites and social media archived to document HIV/AIDS in the early 21st century.

Anyone who has ever clicked a link and been presented with a disappointing 404 response indicating the absence of a resource understands the impermanence of web resources. You’ve probably seen many 404 pages, from the conventional (CNN) to the creative (Pixar).

The CNN 404 page says "Uh-Oh!" and provides a search and suggested articles.

The Pixar 404 page says "Aww...Don't Cry" illustrated by Sadness from the film Inside Out.

This decay of web links over time is known as link rot. Consequently, the preservation of elements of our collective digital heritage, ranging from disease outbreaks to elections, is critical, and this is a primary purpose of web archive collections such as the NLM Ebola virus and HIV/AIDS collections.

A flower pattern of dots, repeated 4 times with progressively more dots turning red representing broken links. — An illustration from Perma.cc showing how links rot (orange circles) over time, August 10, 2018

Web archive collections consist of groups of web pages that are copied into an archive so as to resist link rot on the live web. These collections share a common topic, e.g., “Ebola virus” or “2018 Winter Olympics”, and begin with an initial selection of website URLs called seeds. For example, below is a sample of five seeds selected by NLM for the Archive-It Ebola virus collection:

Quality seeds lead to quality archive collections. A collection of seeds with variations of “buy cheap Rolex watches” or “we sell gold nuggets” is not expected to yield a good web archive collection for the “2009 Swine Flu outbreak.” This is where curators come into the picture. Curators must ensure they select seeds that are relevant to the collection topic. This means curators such as those at NLM not only have the responsibility of searching for URLs to populate the seed list, but they also serve as filters to remove irrelevant URLs. This is a time-consuming process because it is mostly done manually. For example, it took several months to collect the NLM Ebola virus seeds. A natural question is: Can we automate the seed generation process? In other words, to what extent can a computer program replicate the creation of seeds for a web archive collection?

See: “Bootstrapping Web Archive Collections from Social Media,” for more details about automating the seed generation process.

Can We Automate the Seed Generation Process?

To answer this question I compared seed URLs from the NLM HIV/AIDS collection to those generated automatically on June 18, 2018 from search results on Google and Twitter.

TL;DR

Automating seed selection is hard. It is difficult to encode the various nuanced objectives of collections with a narrow scope; consequently, machine-generated seeds are not well suited for such collections. But machine-generated seeds are suitable for collections with a broad scope, especially when off-topic seeds are removed.

Generating seeds automatically from Google and Twitter

The Google seeds were generated automatically by issuing the query “hiv aids” to Google and extracting 115 links from the first five pages. Two sets of seeds (Twitter-Top and Twitter-Latest) were generated automatically by extracting URLs from tweets with the hashtag #hivaids from the Top (157 URLs) and Latest (146 URLs) Twitter search results. These automatically generated URLs constitute the seeds for the Google, Twitter-Top, and Twitter-Latest collections, in comparison with the human selected seeds of the NLM collection, for the purposes of this study.
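The post does not describe the tooling used to pull URLs out of tweets, so the following is only a minimal sketch of the general idea, assuming the tweet text has already been retrieved as plain strings. The function name `extract_seed_urls` and the regular expression are illustrative assumptions, not the method used in the study. Links in real tweets are often shortened (for example, through t.co) and would need to be resolved to their final destinations, which this sketch does not attempt.

```python
import re

# Hypothetical pattern: matches http(s) URLs up to the next whitespace or quote.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")


def extract_seed_urls(texts):
    """Return the unique URLs found in a list of strings, in first-seen order."""
    seen = set()
    seeds = []
    for text in texts:
        for match in URL_PATTERN.findall(text):
            # Drop punctuation that commonly trails a URL at the end of a sentence.
            url = match.rstrip(".,;:!?)")
            if url and url not in seen:
                seen.add(url)
                seeds.append(url)
    return seeds


if __name__ == "__main__":
    # Made-up tweet text for illustration only.
    sample_tweets = [
        "Get tested today https://example.org/testing. #hivaids",
        "Read more: https://example.org/testing and http://example.com/story?id=1 #hivaids",
    ]
    for seed in extract_seed_urls(sample_tweets):
        print(seed)
```

Deduplicating while preserving order keeps the seed list small and makes the later manual inspection step easier to follow.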

Seed comparison: Human curated NLM HIV/AIDS vs Google and Twitter generated HIV/AIDS seeds

The NLM seeds were compared to the seeds from Google and Twitter across four dimensions: precision, percentage of top-level sites, age distribution, and topical coverage.

Precision

A precision score of 0 indicates that no URL is relevant to HIV/AIDS, while a precision score of 1 indicates that all URLs under consideration are relevant. To measure precision, all URLs in the Google and Twitter collections were manually inspected and assigned a relevance score of 0 (irrelevant to the HIV/AIDS topic), 0.5 (somewhat related), or 1 (primarily related). The precision of a seed list is the sum of the relevance scores of its URLs divided by the number of URLs in the list.
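The precision calculation described above can be sketched in a few lines. This is a minimal illustration, assuming the manual relevance scores are already available as a list of numbers; the function name `precision` and the sample scores are hypothetical and are not data from the study.

```python
ALLOWED_SCORES = {0, 0.5, 1}


def precision(relevance_scores):
    """Sum of the manually assigned relevance scores divided by the seed list size."""
    if not relevance_scores:
        raise ValueError("the seed list must contain at least one URL")
    for score in relevance_scores:
        if score not in ALLOWED_SCORES:
            raise ValueError(f"unexpected relevance score: {score!r}")
    return sum(relevance_scores) / len(relevance_scores)


if __name__ == "__main__":
    # Made-up scores: two primarily related URLs, one somewhat related, one irrelevant.
    print(precision([1, 1, 0.5, 0]))
```

Because a somewhat related URL contributes 0.5, a list can fall short of a precision of 1 without containing any outright spam.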

Unsurprisingly, URLs from Google (0.98) outranked URLs from Twitter (0.70–0.77) in the precision ranking (Fig. 1). Also, tweets from Twitter’s Top search results (0.77) included more relevant URLs than tweets from the Latest search results (0.70), which suggests that popularity may be correlated with quality.

Screen captures of non-relevant results, including a Charmin ad and a petition from a cancer organization. — Two (Charmin & TPPKills) spam and/or promoted web pages found in the tweet collection.

These precision results show that we cannot blindly include URLs generated automatically in a seed list without inspection, especially if the URLs are from tweets, since hashtags are frequently used by spammers. However, Twitter and Google still offer an opportunity for generating good seeds, but URLs extracted from these sources should be screened to remove off-topic URLs.

Screen captures of relevant results from Seeing Red and loveyourself. — Two (Seeing Red & loveyourself) non-spam web pages found in the tweet collection.

Percentage of top-level sites

Almost half (47%) of the URLs in the NLM HIV/AIDS seed list are top-level sites (e.g., http://hiv-age.org/) (Fig. 2). This is in contrast to Twitter and Google, which tend to give deep links, which typically correspond to a narrow topic or specific story.
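One way to approximate this measure is to treat a URL as a top-level site when it has no path beyond the root and no query string or fragment. That definition is an assumption made for illustration, since the post does not spell out the exact rule used, and the helper names `is_top_level` and `top_level_percentage` are hypothetical.

```python
from urllib.parse import urlsplit


def is_top_level(url):
    """True when the URL points at a site root, e.g. http://hiv-age.org/."""
    parts = urlsplit(url)
    return parts.path in ("", "/") and not parts.query and not parts.fragment


def top_level_percentage(urls):
    """Percentage of URLs in the seed list that are top-level sites."""
    if not urls:
        raise ValueError("the seed list must contain at least one URL")
    top_level_count = sum(1 for url in urls if is_top_level(url))
    return 100.0 * top_level_count / len(urls)


if __name__ == "__main__":
    # The first URL is the example from the text; the second is a made-up deep link.
    seeds = ["http://hiv-age.org/", "https://example.org/news/2018/story.html"]
    print(top_level_percentage(seeds))
```

A deep link such as the second example usually corresponds to a single story, which is why a high share of deep links points to narrower coverage per seed.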

Age distribution

The date a web page was created can be used to rule out some of the topics it could discuss. For example, web pages created before January 2017 are not expected to discuss the policies President Trump implemented to combat HIV/AIDS, since the Trump administration began in January 2017. Of course, a page created in October 2016 could speculate about future events (2017) or review past events (1987), but we focus only on the date the page was created (2016). This means that without analyzing the content of seeds, we may be able to use the creation dates of the URLs to predict the topics to expect. As a result, we expect the creation dates of the NLM seeds to cover a broad range, providing a 21st century perspective on HIV/AIDS. The age distribution confirms this expectation: the NLM seeds produced the oldest documents.
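As a rough sketch of how an age comparison could be summarized, the snippet below assumes the creation date of each seed is already known; the post does not say how the dates were estimated, and that step is not shown. The function name `age_summary` and the sample dates are hypothetical. The reference date is June 18, 2018, the day the Google and Twitter seeds were generated.

```python
from datetime import date
from statistics import median


def age_summary(creation_dates, reference_date):
    """Summarize how old a set of pages is relative to a reference date."""
    if not creation_dates:
        raise ValueError("at least one creation date is required")
    ages_in_days = [(reference_date - created).days for created in creation_dates]
    if min(ages_in_days) < 0:
        raise ValueError("a creation date falls after the reference date")
    return {
        "oldest": min(creation_dates),
        "newest": max(creation_dates),
        "median_age_days": median(ages_in_days),
    }


if __name__ == "__main__":
    # Made-up creation dates for illustration only.
    sample_dates = [date(2018, 6, 8), date(2018, 6, 17), date(2010, 6, 18)]
    print(age_summary(sample_dates, date(2018, 6, 18)))
```

Comparing the oldest date and the median age across seed lists gives a quick sense of which collection reaches further back in time.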

Topical coverage

The seeds from Google and Twitter are a mix of various topics on HIV/AIDS, ranging from the worsening AIDS epidemic in Russia to a tweet encouraging men to get tested. The NLM HIV/AIDS collection is broadly scoped, so these seeds would be at home in the NLM collection. But a more narrowly scoped collection, such as one devoted to the 2018 Ebola outbreak as distinct from the 2016 or 2014 outbreaks, would be more challenging to create with seeds from Google and Twitter.

Why We Need Human Curators to Generate Seeds

In a web plagued by disappearing resources, archived collections stand as a valuable means of preserving some of the web resources important to the study of past events such as disease outbreaks. These archived collections start with seeds (URLs) hand-selected by curators. Human curators produce high-quality seeds by removing irrelevant URLs and adding URLs from credible and authoritative sources, but most importantly, they include only URLs that meet the collecting policy of the collection. The collecting policy could be broad (e.g., HIV/AIDS in the 21st century) or narrow (e.g., 2017 Opioid epidemic in Pennsylvania), and the curator has the ability to adapt to the multiple objectives of a collection. But this ability comes at a cost: it is time-consuming to collect these seeds.

Machines, on the other hand, are fast and can discover a lot of seeds quickly, but even if content analysis is applied to filter non-relevant URLs, it is still hard for them to adapt seamlessly to specific collecting policies (broad or narrow scope). Although search engines and social media can be used to generate seeds for collections with a broad scope, collections with nuance and subtlety are harder to automate. This is a primary objective of my research: to leverage the collective domain expertise of users on social media by extracting seeds from their social media “micro-collections” instead of simply using search engines and social media directly.

The web archive community has invested significant effort in building collections for specific topics through focused-crawling. But not much research effort addresses the discovery of seeds needed by focused-crawlers. My research addresses automating the seed generation stage of building web archive collections.

Learn more about Web Collecting at NLM.

We hear about data every day. In historical medical collections, data abounds, both quantitative and qualitative. In its format, scope, and biases, data inherently contains more information than its face value. This series, Revealing Data, explores how, by preserving the research data of the past and making it publicly available, the National Library of Medicine (NLM) helps to ensure that generations of researchers can reexamine it, reveal new stories, and make new discoveries. As the NLM becomes the new home of data science at the National Institutes of Health (NIH), Circulating Now explores what researchers from a variety of disciplines are learning from centuries of preserved data, and how their work can help us think about the future preservation and uses of the data we collect today.