Revealing Data: Navigating Historical Biomedical Technology Research with Digital Humanities

By Brice Bowrey

Historians must sometimes grapple with deafening silence in the archival record. They can just as easily find themselves drowning in the volume of documents and source materials from some repositories. The National Library of Medicine’s MEDLINE database presents the latter challenge. While contemporary scientists use the database to find current medical, biochemical, and public health literature, MEDLINE has indexed hundreds of thousands of articles published in the nineteenth and twentieth centuries. Historical research articles from MEDLINE could be valuable primary sources for social studies of science and medicine. Anyone can freely search the MEDLINE data using the web-based PubMed search tool, but even with filters, a search can return tens of thousands of potentially relevant results that the user must access and read one by one. It is arduous, if not impossible, to systematically locate, compare, and study thousands of specialized research papers using traditional methods of historical textual analysis if PubMed is the only means of accessing the documents.

To tackle the kinds of challenges posed by the MEDLINE database, digital humanists and computer scientists have pioneered methods of “distant reading.” Unlike traditional close reading, in which scholars manually examine a relatively small set of texts, distant reading leverages computation and machine learning to identify patterns, trends, and relationships within a set of documents on a much larger scale. Taking inspiration from projects like Voyant Tools and the Women Writers Vector Toolkit, I have developed an experimental application for distant reading late twentieth-century biomedical technology research papers from the MEDLINE database.

The PubMed Topic Modeling Toolkit (PTMT) is an open-source, self-hostable, interactive web application that allows users to query a pre-trained topic model of more than 320,000 historical biomedical technology abstracts. Topic modeling uses statistics and machine learning to identify clusters of related words and texts, grouping documents that share a “topic.” While this technique leverages quantitative tools, significant humanistic inquiry is still involved in producing and interpreting the model. The algorithm identifies related texts, but the user must determine how and why those texts are related. PTMT’s quantitative tools supplement close reading and other forms of analysis, rather than replace traditional methods.

A line chart showing variation of topic proportions from 1974 to 2001
There are challenges associated with using topic modeling to track change over time in a corpus, but these visualizations make it easier to consider historical questions.

One of PTMT’s greatest strengths is its ability to surface trends that warrant further investigation. For example, the largest topic in the model is associated with cardiac care—a subject that has catalyzed numerous historical and social studies of medical technology and twentieth-century healthcare. However, the second largest topic relates to eye surgery and cataract removal. If the scientific community was so invested in ocular surgery, can studying that procedure yield insights into societal understandings of acquired disability, medical responses to the post-industrial longevity boom, or the economics of geriatric care?

A word cloud drawn from word correlations, which may be biased.
The model finds a biased correlation between the phrase “African American people” and a topic related to high-risk sexual behaviors and sexually transmitted infections.

Alternatively, PTMT might reinforce or supplement findings from existing research. One striking example concerns topics related to race and ethnicity. A search for the phrase “African American people” returns a strong association with Topic 17—a topic containing words such as “condom,” “sexual,” “STDs,” and “prostitutes.” As with all machine learning tools, PTMT reflects the biases of its training data. In this case, however, that reflection is itself revealing: the result underscores historical research that has surfaced the persistent and pernicious hypersexualization of Black bodies in medical and popular discourse. PTMT is a starting point for exploring problems, progress, and patterns in late twentieth-century medical research. Its results should be taken as inspiration for new questions and studies.
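A phrase-to-topic lookup of this kind can be sketched in a few lines. The vocabulary and probability values below are invented placeholders, not PTMT’s actual model; the sketch only illustrates the idea of finding the topic that assigns a word its highest probability.

```python
# Hypothetical sketch of a word-to-topic lookup; not PTMT's actual code.
import numpy as np

# Toy topic-word probability matrix: rows are topics, columns are words.
vocab = ["condom", "sexual", "cardiac", "pacemaker"]
topic_word = np.array([
    [0.40, 0.35, 0.05, 0.20],  # topic 0
    [0.05, 0.10, 0.45, 0.40],  # topic 1
])

def strongest_topic(word: str) -> int:
    """Return the index of the topic that weights this word most heavily."""
    j = vocab.index(word)
    return int(topic_word[:, j].argmax())

print(strongest_topic("condom"))  # topic most associated with the word
```

In the real model, the association is statistical rather than semantic: the topic is simply where the phrase’s probability mass concentrates, which is why interpreting such associations demands historical judgment.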

Sustainability and access concerns also drove PTMT’s design and development. The project runs on an open-source Python framework called Streamlit. It does not rely on any external servers. The choice of framework and design placed some limitations on the application’s speed and user interface. In exchange, the public instance of PTMT runs for free and can continue running indefinitely, barring any changes to Hugging Face’s terms of service. Perhaps more importantly, anyone can download the project and run it locally by following the instructions in the README.md file. After the initial download and configuration, the application can run on a computer entirely disconnected from the Internet. Users familiar with Python can edit the local code to make improvements to the application and, in the spirit of open-source software development, contribute those changes back to the project for the benefit of all users.

In addition to the graphical user interface, the development work produced several scripts for performing other types of quantitative text analysis on the MEDLINE data. Although much of the code was unsuitable for the web application, the scripts are archived alongside the main project files on the data science platform Kaggle. The current release of PTMT is stable, but there remains ample opportunity for further advancement. Future research and development may occur on Hugging Face, the machine learning platform that hosts the project’s Git repository.

A heat map chart showing number of articles by geographic region.
This chart is a sample output from a data analysis script that was not integrated into the graphical application. I encourage viewers to interpret the quantitative data and measures of statistical significance in the project as numerical representations of differences and patterns, rather than as claims of absolute truth or reality.

As digital humanities and data science evolve, I anticipate that collaborative efforts will continue to make computational text analysis more accessible to humanists and the general public. I hope researchers interested in medical history will continue exploring computational techniques and extracting new ideas from historical MEDLINE data. However, MEDLINE is just one part of the vast collection the National Library of Medicine makes freely available and preserves for current and future generations. I look forward to seeing what future scholarship emerges at the intersection of history, medicine, and data science.


An informal portrait of a Black man outdoors. Brice Bowrey is a PhD Candidate in the Department of History at the University of Maryland, College Park and a 2023 NLM Michael E. DeBakey Fellow in the History of Medicine. He is currently studying the relationship between R&D practices and professional norms in the late twentieth-century U.S. medical device industry.
