Gone are the days when those in the humanities could say “no” to math. Algorithms, statistics, and other mathematical tools can change the way people convey and interpret the humanities.

Mathematics, in conjunction with computer coding and programming, allows humanities scholars to “zoom out” from a sample (say, every written word of William Shakespeare) and draw conclusions about the broader themes, interpretations, and aspects of that sample. Continuing with the example, such a zooming out could reveal how often Shakespeare focused on sexuality, poverty, monarchy, etc.

So the question is — what would something like that tell us? How would we be better off knowing the frequency with which Shakespeare discussed the British crown?

This is where there seems to be some contention among scholars. On the one side are those who swear by the numbers. Exactitude, these scholars say, yields concrete answers to difficult questions. On the opposite end of the spectrum are those who see statistical approaches as assuming too much. How, these scholars ask, can numbers reveal nuance?


The quantitative approach is nothing new, and I should be clear that the topic of discussion today is not the “quantitative approach” as historians know it. Yes, the approaches discussed here draw from numbers, but the quantitative approach generally results in a number: X number of slaves transported in the Transatlantic slave trade, X number of Native Americans present in the New World before European exploration, etc.

However, using mathematics (especially algorithms and computer coding and programming) to reveal textual trends is relatively new. The results of these approaches are not always number-focused, but rather text-focused. How often was the word “X” used by a certain history journal over the course of its existence? How often did Union soldiers discuss “death” in their letters? What issues did Churchill discuss in his speeches? Yes, numbers are inherently a part of this approach, but the driving focus is on the text.
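At bottom, questions like these begin with counting terms across a corpus. A minimal sketch in Python; the “letters” and search terms below are invented placeholders, not real sources:

```python
from collections import Counter
import re

def term_frequencies(texts, terms):
    """Count how often each term of interest appears across a corpus."""
    counts = Counter()
    for text in texts:
        # Lowercase and split into words, keeping internal apostrophes.
        words = re.findall(r"[a-z']+", text.lower())
        counts.update(w for w in words if w in terms)
    return counts

# Hypothetical corpus: two invented soldiers' letters.
letters = [
    "The crown weighs on us all; death marched beside us at dawn.",
    "We spoke of death again tonight, and of home.",
]
print(term_frequencies(letters, {"death", "crown", "home"}))
# Counter({'death': 2, 'crown': 1, 'home': 1})
```

Real projects scale the same idea to thousands of documents and add normalization (stemming, stop-word removal), but the core operation is exactly this kind of tally.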


If there is any counter-argument to my claim that textual/mathematical approaches are “relatively new,” it lies in the Index Thomisticus of Roberto Busa. Busa, an Italian Jesuit priest, began this project all the way back in the 1940s; his goal was to arrange the words of St. Thomas Aquinas in such a way as to allow scholars to interpret them based on the frequency with which Aquinas discussed certain topics, and on how individual words and correlating themes linked with others. As technology grew, so too did Busa’s project. Though Busa has since passed away, the project now exists online, and Aquinas’ words can be searched and understood in over ten languages.

Busa labeled his work as “textual informatics,” and his approaches have been adopted and adapted throughout the humanities.

Another term for this kind of work is “distant reading,” which journalist and author Kathryn Schulz defines as “understanding literature not by studying particular texts, but by aggregating and analyzing massive amounts of data.”

In other words, distant reading zooms out from the individual text, or collection of texts, to sample sizes far too large for any one human to read and interpret (say, 160,000 books or more) and interprets them in some way, shape, or form. For “distant readers,” understanding literature is more than just reading; it is analysis at a scale no single person could manage alone.

How can this happen? Again, by using digital tools (at least now, not in Busa’s 1940s) to scan texts, identify word and term frequencies, create relations between words, form groups, and produce lists, or topics, for scholars to then interpret.
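That pipeline — scan, count, relate, group — can be caricatured in a few lines of Python. This is a toy sketch over invented documents, not a real topic-modeling algorithm; it merely groups the longer words that co-occur within each document:

```python
import re

def tokenize(text):
    """Lowercase a text and split it into words."""
    return re.findall(r"[a-z']+", text.lower())

def co_occurrence_groups(texts, min_len=5):
    """Group words that appear together in the same document,
    a crude stand-in for the 'relations between words' step.
    Short words are dropped as a rough stop-word filter."""
    groups = []
    for text in texts:
        words = sorted({w for w in tokenize(text) if len(w) >= min_len})
        if words:
            groups.append(words)
    return groups

# Invented documents standing in for a scanned corpus.
docs = [
    "The monarchy and the crown dominated court politics.",
    "Poverty stalked the parish; hunger and poverty again.",
]
for group in co_occurrence_groups(docs):
    print(group)
```

A genuine topic model (LDA, for instance) would instead infer probabilistic word–topic associations across the whole corpus; the point here is only the shape of the workflow: machine-made word groups handed to a human for interpretation.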

According to digital whiz kid Scott Weingart (now the head of Carnegie Mellon’s digital humanities efforts), this approach, called topic modeling, “represents a class of computer programs that automagically extracts topics from texts.” However, according to Weingart, “It’s all text and no subtext…it’s a bridge, and often a helpful one.”

In other words, topic modeling and text mining assist the humanities in getting somewhere, but they are not the end-all, be-all.

(This is a good spot to define the difference between text mining and topic modeling — some good answers can be found here. Basically, “Text classification (text mining) is a form of supervised learning — the set of possible classes are known/defined in advance and don’t change” whereas “Topic modeling is a form of unsupervised learning (akin to clustering) — the set of possible topics are unknown apriori. They’re defined as part of generating the topic models. With a non-deterministic algorithm like LDA, you’ll get different topics each time you run the algorithm.”)
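That supervised/unsupervised distinction can be made concrete with a toy sketch. Every label, keyword, and document below is invented for illustration, and the “topic discovery” function is a deliberate caricature of unsupervised grouping, not LDA:

```python
def classify(text, labeled_keywords):
    """Text classification (supervised): the set of classes -- the
    dict's keys -- is defined in advance and does not change."""
    words = set(text.lower().split())
    scores = {label: len(words & kws) for label, kws in labeled_keywords.items()}
    return max(scores, key=scores.get)

def discover_topics(texts):
    """Toy topic discovery (unsupervised): the groupings emerge from
    the data alone. Each document either joins an existing group it
    shares vocabulary with, or founds a new one."""
    topics = []
    for text in texts:
        vocab = frozenset(text.lower().split())
        if not any(vocab & topic for topic in topics):
            topics.append(vocab)
    return topics

# Supervised: the classes ("war", "economy") are known a priori.
labels = {"war": {"battle", "soldier"}, "economy": {"market", "wages"}}
print(classify("the soldier left for battle", labels))  # -> war

# Unsupervised: nobody told the function there were two subjects.
docs = ["crown court monarchy", "wages market trade", "crown parliament"]
print(len(discover_topics(docs)))  # -> 2
```

Note also that real non-deterministic algorithms like LDA add a further wrinkle the toy version lacks: as the quotation above says, they can yield different topics on each run.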

Schulz questions whether or not they are useful at all, at least in literary studies. “Literature is an artificial universe,” Schulz says, “and the written word, unlike the natural world, can’t be counted on to obey a set of laws.”


Perhaps Schulz is right, but can’t topic modeling, text mining, and other distant reading approaches assist in understanding overarching trends that the naked eye (and mind) simply cannot identify? Do these approaches reduce narratives to numbers, or do the numbers boost the narratives? I tend to lean toward the latter.

In my own work, I can see the benefit of applying text mining and/or topic modeling to find overarching trends in the newspapers I research. How often do journalists discuss race, poverty, and the built urban environment, and what can this tell me about the narrative I am working on?

If there is any reluctance on my part to engage with distant reading tools, it harks back to the introduction: I tend to be one of those who say “no” to math. Not out of ignorance, and not as a declaration. You’ll never hear me say, “death to Math!” except perhaps in my head during a GRE exam.

It’s just that I’m a bit frightened to engage with something so arduous and unfamiliar. And I’m not alone here.

But we as historians, humanists, and scholars in general need to embrace all approaches, and do our best to understand the nitty-gritty of numbers and distant reading. We might not think it will be useful to us. We might not want to “obey a set of laws” that make us feel overly quantified. But how will we know if we do not try to learn the method?

As Weingart says, “The model requires us to make meaning out of them, they require interpretation, and without fully understanding the underlying algorithm, one cannot hope to properly interpret the results.”


