Revealing Data: Using Term Frequency to Chart Influenza Reporting

Historical medical journals provide unique perspectives on the development of expert understanding of transmission, morbidity, and impact during an epidemic. Examining the ways that medical journals contributed to the spread of information, evaluation of interpretations, and creation of new knowledge in a specific historical process can contribute to current discussions about the relationship between expert perspectives and public understanding. Information on the Russian Influenza (1889–1890) in the British Medical Journal offers an excellent case study for evaluating historical significance and contemporary relevance. Circulating Now welcomes guest bloggers E. Thomas Ewing, Ian Hargreaves, Jessica King, Andrew Pregnall, and Tyler Talnagi who examine different dimensions of the role of medical journals in collecting, presenting, and interpreting knowledge of a disease outbreak in this first of three posts for our ongoing Revealing Data series.

Term frequency makes it possible to visualize the sequence of reporting about a disease outbreak, while also comparing the relative scope of epidemic reporting with the coverage of other diseases and medical topics during various time periods. As an analytical tool, term frequency works well for understanding influenza because this disease term was used relatively consistently across time, particularly in a medical journal, and because the epidemic nature of outbreaks makes it possible to associate increased frequency of use with increases in the number of cases and deaths.

Term frequency can be measured for the British Medical Journal using several databases, each of which provides a different way of counting the appearance of terms across pages, issues, and volumes.

A graph showing a peak in 1892 of 345 instances of the word influenza.
Figure 1. British Medical Journal, term frequency, all years

The History of Medicine project at Manchester University [access to some of the articles through this project is limited by subscription; your library may provide onsite access] makes it possible to visualize the appearance of keywords from the journal’s first issue through the early 2000s. As shown in Figure 1, influenza appears rarely in this journal for almost fifty years, until the start of the Russian influenza epidemic in late 1889. The year 1892 produces 345 results for this keyword search, the most in any year of publication. Influenza appeared regularly but less frequently through the early years of the twentieth century, until a second spike in results around the 1918 Spanish influenza epidemic. During the rest of the twentieth century, influenza appeared regularly, until a steady decline began in the 1950s, with brief increases in the last two decades.

A bar graph showing a peak in 1892 of the word influenza appearing in 18.7% of articles.
Figure 2. British Medical Journal, term frequency, 1885–1900

PubMed Central (PMC) at the National Library of Medicine allows for searching by journal title, publication date, and keyword. A search for the keyword influenza in the British Medical Journal produces more than 10,000 results. Narrowing the search by years indicates that in the five years before and ten years after the Russian influenza epidemic, from 1885 to 1899, the term influenza appears in 1,589 articles. Figure 2 shows the distribution of search results by year. The year 1892 accounted for more than one-fifth of all articles published with the word influenza during this fifteen-year period, more than twice as many as the year of the Russian influenza epidemic. This distribution results from two factors: first, the start of a new and even more deadly wave of the influenza epidemic in spring 1892, particularly in Great Britain, and second, the publication in this journal of medical studies of influenza cases from the previous three years.

A graph showing the changing frequency of the term influenza in the BMJ over the course of a year.
Figure 3. British Medical Journal, term frequency, all issues, July 1889–June 1890

The Medical Heritage Library provides access to volumes of the British Medical Journal available from the Internet Archive. A keyword search by volume allows for both a visualization of the term frequency and links directly to articles containing the keywords. Figure 3 shows the term frequency for “influenza” in two volumes of the British Medical Journal, from July 1889 to June 1890, one full year, with the Russian influenza in the middle. These two volumes combined have more than 4.5 million total words and nearly 200,000 unique word forms. Each volume has weekly issues, so dividing the visualization into twenty-five segments means that each segment covers roughly two issues. The index has been deleted from the text file, so the results are from articles alone. As the visualization indicates, the word influenza was absent from the British Medical Journal until the outbreak in December 1889 (segments 11–13), after which coverage increased significantly before gradually decreasing over the next several months.
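The segment-by-segment count described above can be approximated in a few lines of Python. This is only a minimal sketch: the function name `segment_counts`, the simple letters-only tokenizer, and the made-up sample text are assumptions for illustration, and the visualization tool behind Figure 3 may tokenize and divide the text differently.

```python
import re

def segment_counts(text, term, segments=25):
    """Count occurrences of `term` in each of `segments` equal-sized slices of `text`."""
    # Assumption: a word is any run of letters, compared case-insensitively.
    words = re.findall(r"[a-z]+", text.lower())
    counts = [0] * segments
    if not words:
        return counts
    for position, word in enumerate(words):
        if word == term:
            # Map the word's position to a segment index from 0 to segments - 1.
            counts[position * segments // len(words)] += 1
    return counts

# Made-up stand-in text, not real journal content: the term is absent at
# first and then appears repeatedly, as in a sudden outbreak.
sample = "cholera report " * 10 + "influenza epidemic " * 5 + "influenza cases report " * 5
print(segment_counts(sample, "influenza", segments=5))
```

With a real volume, `text` would be the full plain-text file with the index removed, and `segments=25` would match the division used in Figure 3.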

A chart indicating how often the word influenza was used in articles that included the word.
Figure 4. British Medical Journal, term frequency, only articles with influenza

Using Voyant to analyze the text files from PMC makes it possible to identify in which specific articles the word influenza appeared most frequently. As shown in Figure 4, the word influenza appears more than 800 times in 175 articles that include this word in the years 1889 and 1890. Because this chart only includes articles with the word influenza, it does not provide full chronological coverage of 1889 and 1890. The relatively higher frequency of the term in the first several articles indicates that these articles included the word influenza many times, which is a useful way to see where this disease received intense coverage. This approach also makes it possible to identify articles that require close reading to understand how the disease was being reported and interpreted.
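A rough equivalent of this per-article comparison can be sketched in Python. The function `rank_by_term`, the record identifiers, and the sample texts are hypothetical, and the calculation is a plain word count rather than a reproduction of Voyant's own method. Reporting a rate per 1,000 words alongside the raw count helps show when a long record dilutes a short article.

```python
import re

def rank_by_term(articles, term):
    """Return (article_id, raw_count, per_1000_words) rows, highest raw count first."""
    rows = []
    for article_id, text in articles.items():
        words = re.findall(r"[a-z]+", text.lower())
        raw = words.count(term)
        if raw == 0:
            continue  # keep only articles that actually contain the term
        rate = round(1000 * raw / len(words), 1)
        rows.append((article_id, raw, rate))
    return sorted(rows, key=lambda row: row[1], reverse=True)

# Hypothetical records with invented text, for illustration only.
articles = {
    "record-a": "influenza influenza epidemic in London",
    "record-b": "notes on cholera",
    "record-c": "influenza reported among many other unrelated items this week",
}
for row in rank_by_term(articles, "influenza"):
    print(row)
```

Articles at the top of such a ranking are natural candidates for close reading.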

A word cloud that represents relative frequency of words in articles about influenza, with medical, cases, Dr, Mr, January, epidemic, number, public and symptoms included.
Figure 5. Cirrus cloud, two articles about influenza, Jan 18 and Jan 25, 1890, in British Medical Journal

The two articles with the highest term frequency, included in PMC records 2207094 and 2207168, contain numerous reports under the heading “The Epidemic of Influenza” published in the January 18, 1890 and January 25, 1890 issues. Figure 5, created using the Cirrus cloud text visualization tool and focused on these two articles, offers a useful way to think about how influenza as a key term relates to the other words that appeared most frequently in these two lengthy articles. As suggested by this visualization of keyword frequency, some of the most common terms were logical given the nature of the reporting: cases, symptoms, disease, and health. We also see terms specific to this disease outbreak, particularly the word epidemic. It is interesting to note that the word Russia does not appear among the top 100 terms. This, however, makes sense because these articles appeared in mid-January 1890, after the disease had moved into Western Europe and Great Britain (which explains the appearance of the place names London and Paris).
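The ranking that underlies a word cloud of this kind can be sketched as a frequency count with common function words filtered out. The short stopword list, the `top_terms` helper, and the sample sentence below are assumptions made for illustration. Cirrus uses its own stopword list and sizing rules, so this is an approximation of the idea rather than of the tool.

```python
import re
from collections import Counter

# Assumption: a deliberately tiny stopword list; real tools use much longer ones.
STOPWORDS = {"the", "of", "and", "in", "a", "to", "was", "were", "had", "is"}

def top_terms(text, n=5):
    """Return the n most frequent non-stopword terms with their counts."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(word for word in words if word not in STOPWORDS)
    return counts.most_common(n)

# Invented sample sentence, not quoted from the journal.
sample = ("The epidemic of influenza in London. Cases of influenza were reported "
          "in Paris, and the symptoms of the cases in London were severe.")
print(top_terms(sample, 3))
```

Checking whether a word such as "russia" is missing from the returned list is a quick way to repeat the observation made above for any set of articles.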

Although these tools for counting term frequency are relatively easy to use and produce clear visualizations, their utility depends on a number of factors, including the accuracy of the text produced by the digitization process. In the case of the PMC articles, term frequency will also be shaped by the number of pages included per record. As indicated above, for example, PMC records 2207094 and 2207168 each include about 15 pages, only a fraction of which are the two articles about the influenza epidemic cited above. Other records might include an article on influenza that was just one or two paragraphs, making up less than one-tenth of a single page. However, even with these limitations, term frequency is a useful tool for appreciating how a medical journal reported on a disease outbreak as it happened.

We hear about data every day. In historical medical collections, data abounds, both quantitative and qualitative. In its format, scope, and biases, data inherently contains more information than its face value. This series, Revealing Data, explores how, by preserving the research data of the past and making it publicly available, the National Library of Medicine (NLM) helps to ensure that generations of researchers can reexamine it, reveal new stories, and make new discoveries. As the NLM becomes the new home of data science at the National Institutes of Health (NIH), Circulating Now explores what researchers from a variety of disciplines are learning from centuries of preserved data, and how their work can help us think about the future preservation and uses of the data we collect today.

Thomas Ewing is a professor of history at Virginia Tech and director of the Tracking the Russian flu project. Funding for this project, including support for undergraduate student researchers, was provided by the National Endowment for the Humanities and a 4VA research grant.

Ian Hargreaves graduated from Virginia Tech with degrees in Foreign Languages (German) and International Studies, and he is enrolled at The George Washington University Law School.

Jessica King graduated from Virginia Tech with majors in Communication and International Studies with a minor in German, and she is enrolled in the Virginia Tech graduate program in Communication.

Andrew Pregnall is a senior at Virginia Tech pursuing degrees in Microbiology and History.

Tyler Talnagi graduated from Virginia Tech with degrees in German and International Studies.

2 comments

  1. Replicating Figure 2 in Google NGrams (words in books) shows an interesting delay. It took longer for influenza mentions to peak but they also didn’t decline as quickly as BMJ. NGrams is always a nice quick starting point for word frequency searches. http://bit.ly/2DCO0kr
