Detail from Hospital The Modern Newspaper of Administrative Medicine and Institutional Life, 1921, showing graph used in advertisement.

Revealing Data: Investigating The Hospital’s file sizes

By Ashley Bowen ~

For researchers interested in the administration of British hospitals in the late 19th and early 20th centuries, The Hospital is a vital resource. The Hospital, a journal published in London from 1886 to 1921, carried the tag line “the modern newspaper of administrative medicine and institutional life.” It published an enormous variety of items of interest to physicians, nurses, hospital administrators, and public health professionals, ranging from medical research to notes on fire prevention and institutional kitchen management, reflections on “the dignity of medicine,” opinions about housing policy, and much more.

While it is possible to study the journal’s contents by skimming through its archives or searching by keyword and author, researchers can also evaluate the whole corpus, or set, of the journal’s articles. The Medical Journal Backfiles Digitization Project, an ongoing joint effort between the National Library of Medicine and the Wellcome Collection, has made it possible to download every article in the journal’s entire run as thousands of individual .txt files created in a standard (though unedited) format.

Hospital The Modern Newspaper of Administrative Medicine and Institutional Life, 1921.
Cover of Hospital, January 22, 1921

Approached this way, The Hospital can be analyzed for change in the contents over time. The process of digitizing the journal also created metadata available for researchers to evaluate in addition to their traditional content analysis. Metadata, according to UCLA’s Center for Digital Humanities, is the term applied to “information that describes information, objects, content, or documents.” Sometimes metadata refers to the contents of a file or publication and sometimes it refers to the item itself. This kind of data is not visible in the original item but represents another level of information about the article and the journal when considered in addition to the content.

Inspired by research that used file size as a way to identify the rate of change in Victorian newspapers, I wondered if there were any interesting trends in the size of the Hospital files digitized as part of the Medical Journal Backfiles Digitization Project. What might analyzing the file sizes reveal about trends in the journal’s content over the last 22 years of its run?

The process I used was simple but time-consuming. First, I downloaded the complete run of The Hospital from PMC’s Historical OCR collection. Second, I imported all the file names and file sizes into Excel. Each year went into a different sheet and each .txt file’s data went on its own row. Third, I converted all the file sizes to kB. Although the calculation itself was simple, it took time to find each file not already listed in kB and convert it. Fourth, I removed the units from the column and converted the values from text to numbers. Fifth, I used Excel to calculate the average, median, and mode file sizes in each year. Finally, I graphed these values using Excel’s built-in graphing capabilities to help visualize the size of articles in The Hospital from 1900 until it ceased publication in 1921.
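For readers who would rather script these steps than work through them by hand, the following is a minimal Python sketch of the same idea. It is not the process I used (that was Excel), and it rests on assumptions: that the downloaded .txt files sit in one folder per year (for example, a hypothetical hospital_txt/1900/), and that rounding sizes to the nearest whole kB before taking the mode is an acceptable stand-in for the sizes shown in a file listing. The function names sizes_by_year and summarize are illustrative only.

```python
from pathlib import Path
from statistics import mean, median, multimode


def sizes_by_year(root):
    """Map each year folder under root to a list of .txt file sizes in kB.

    Assumes a hypothetical layout of root/<year>/<article>.txt.
    """
    result = {}
    for year_dir in sorted(Path(root).iterdir()):
        if not year_dir.is_dir():
            continue
        # st_size is reported in bytes, so divide by 1024 to get kB.
        sizes = [p.stat().st_size / 1024 for p in year_dir.glob("*.txt")]
        if sizes:
            result[year_dir.name] = sizes
    return result


def summarize(sizes_kb):
    """Return the count, mean, median, and mode for one year's file sizes."""
    # Exact sizes rarely repeat, so the mode is taken over whole-kB values.
    # This rounding is an assumption of the sketch, not part of the source data.
    rounded = [round(size) for size in sizes_kb]
    return {
        "count": len(sizes_kb),
        "mean_kb": mean(sizes_kb),
        "median_kb": median(sizes_kb),
        # If several values tie for most frequent, report the smallest.
        "mode_kb": min(multimode(rounded)),
    }
```

Calling sizes_by_year on the download folder and then summarize on each year's list would yield one row of figures per year, which could be graphed with any plotting tool. Reading sizes directly in bytes also sidesteps the unit conversion and text-to-number cleanup that took so long in the spreadsheet.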

The findings suggest some interesting directions for future research. For example, the total number of individual items peaked in 1920. That year, the journal published nearly 500 more pieces of all kinds (articles, book reviews, obituaries, nursing notes, and more) than in the next highest year in the period evaluated. A researcher might investigate what factors contributed to this surge in publishing in 1920. The graph also shows that the journal published fewer articles during World War I than in other years. Perhaps the 1920 spike represents a rush of post-war publishing or the increased output of medical professionals returning to civilian life.

Line graph showing a dip in 1906, a moderate rise to 1910, a low point in 1917, and a spike to over 1700 articles in 1920.
Total number of articles in Hospital from 1900 to 1921.

Additionally, looking at average, median, and mode file sizes reveals trends in the journal’s content. The average file size is always larger than the median and mode, and the median and mode are often quite similar. However, I was particularly interested in the few years where the median file size was smaller than the mode, or most frequent, file size. This suggests that the distribution of item lengths in those years was skewed. It might be useful to consider the range of file sizes in each year to account for the large number of shorter pieces that appeared in some issues of The Hospital. A scholar interested in the journal might consider what the balance of longer-form articles to “quick hit” pieces suggests about the readership or the publication’s purpose. Likewise, historians often focus on the major research articles in journals, although they are not representative of the bulk of a journal’s content.
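A small worked example may make the relationship between these measures more concrete. The sizes below are invented for illustration and are not drawn from The Hospital; they simply show how a year with many short items, a cluster of mid-length pieces, and one very long article can produce a median below the mode and an average above both. The function name spread is hypothetical.

```python
from statistics import mean, median, mode

# Hypothetical file sizes in kB for one year: invented for illustration,
# not taken from the journal's actual data.
sizes_kb = [1, 1, 2, 2, 3, 3, 4, 6, 6, 6, 43]


def spread(sizes):
    """Return the four summary figures discussed above for a list of sizes."""
    return {
        "mean": mean(sizes),
        "median": median(sizes),
        "mode": mode(sizes),
        "range": max(sizes) - min(sizes),
    }


summary = spread(sizes_kb)
print(summary)  # mean 7, median 3, mode 6, range 42
```

Here a single 43 kB item pulls the average up to 7 kB even though more than half of the items are 3 kB or smaller, and the range of 42 kB signals how far apart the shortest and longest pieces are.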

Line graph showing the median and mode file sizes generally between 4.5 and 7, while the average size spikes in 1913, dips in 1915, and falls sharply to below the previous median in 1920.
Article sizes, in kB, in Hospital from 1900 to 1921.

Until very recently, historians have not often looked at the metadata associated with digital archives. Evaluating an archive in a quantitative way can provoke new questions or suggest alternative sources of evidence in support of a central argument.

We hear about data every day. In historical medical collections, data abounds, both quantitative and qualitative. In its format, scope, and biases, data inherently contains more information than its face value. This series, Revealing Data, explores how, by preserving the research data of the past and making it publicly available, the National Library of Medicine (NLM) helps to ensure that generations of researchers can reexamine it, reveal new stories, and make new discoveries. As the NLM becomes the new home of data science at the National Institutes of Health (NIH), Circulating Now explores what researchers from a variety of disciplines are learning from centuries of preserved data, and how their work can help us think about the future preservation and uses of the data we collect today.

Ashley Bowen, PhD is a guest curator for the Exhibition Program in the History of Medicine Division of the National Library of Medicine.

