Non-consumptive research? Text mining? Welcome to the hotspot of humanities research at Stanford

Books aren't just sacks of raw data – or are they? "Text mining" and "non-consumptive research" may create an altogether different kind of literary history. Stanford is at the cutting edge of the humanities in the computer age.

Photo by L.A. Cicero: Matthew Jockers' work is not only new data, but a new idea of what research looks like in the humanities.


Terms like "non-consumptive research" and "text mining" seem designed to alienate the right-brained world of the humanities.  Nevertheless, the "quantification" of the humanities – treating books as "data" – is at the cutting edge of a new way of doing research and possibly a new way of thinking as well.

Here's an example:

Ian Watt's The Rise of the Novel (1957) is considered by many contemporary literary scholars to be the seminal work on the origins of the novel.  The problem is, the book considers the novels of only three writers – Daniel Defoe, Samuel Richardson and Henry Fielding.

So how does one evaluate the validity of Watt's claims about the development of the novel?  What keeps them from being simply speculative and anecdotal?

Stanford English lecturer Matthew Jockers, who co-directs Stanford's Literary Lab with Professor Franco Moretti, checked some of Watt's claims against 3,600 other 18th- and 19th-century novels.  If that sounds like a lot of hours hitting the library stacks, think again:  Jockers' project is part of the effort to digitize the humanities, and its work is done entirely on computers.  That's why the research is "non-consumptive" – you don't read the books, you search them – or, in the parlance of the researchers, you mine the texts.

"Traditionally we've studied literature piecemeal – one book at a time," said Jockers.  "But computation and the digitalization of libraries have now made it possible to study literature as a much larger system."

A leader in an emerging field

The Literary Lab effort is part of a bigger move that has positioned Stanford as one of the leaders in this new field.  In October, Stanford University Libraries announced a two-year $790,310 project with the Software Environment for the Advancement of Scholarly Research (SEASR, pronounced "Caesar"), funded principally by the Mellon Foundation, to take advantage of SEASR's newly designed software that allows the mining of thousands of books.  The Stanford effort will be joined by other researchers in the United States.

At about the same time, the libraries announced that Stanford had joined the HathiTrust, a consortium of academic and research libraries that are pooling digitized texts to preserve them.

As a result, "We're asking questions on a scale never asked before," said Jockers.

"Our colleagues in computational linguistics and natural language processing have developed lots and lots of software tools for doing complex linguistic analysis," said Jockers.  "What we're doing is taking those tools and technologies and applying them to a literary corpus and asking a new set of questions."

For example, Watt described the evolution of the novel from an aristocratic literary form to a more populist one in the 19th century.  How can you prove that was what happened?  Jockers decided to search for first names.  One might suppose that the "Mr. Knightleys" of Jane Austen's Emma would yield to more frequent use of first names by the 19th century's end – or even nicknames, such as Flossie or Jim.

Jockers, sitting at his desk, opens his database and does a quick search.  Voilà!  The name "Jim" is virtually non-existent in the literature before the 1870s.
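For readers curious what such a search might look like in practice, here is a minimal sketch in Python. It is not the Literary Lab's software: the function name `name_rate_by_decade`, the `(year, text)` corpus layout and the two sample sentences are all invented for illustration. It counts how often a name appears per 10,000 words in each decade.

```python
import re
from collections import defaultdict


def name_rate_by_decade(corpus, name):
    """Return {decade: occurrences of `name` per 10,000 words}.

    `corpus` is an iterable of (publication_year, full_text) pairs.
    This layout is an assumption made for the sketch.
    """
    hits = defaultdict(int)
    words = defaultdict(int)
    # Whole-word, case-sensitive match so "Jim" does not match "Jimmy".
    pattern = re.compile(r"\b" + re.escape(name) + r"\b")
    for year, text in corpus:
        decade = (year // 10) * 10
        hits[decade] += len(pattern.findall(text))
        words[decade] += len(text.split())
    return {d: 10000 * hits[d] / words[d] for d in sorted(words) if words[d]}


# Toy corpus: two invented sentences standing in for whole novels.
sample = [
    (1815, "Mr. Knightley called on Miss Woodhouse at Hartfield."),
    (1884, "Jim said so, and Jim was right about the raft."),
]
rates = name_rate_by_decade(sample, "Jim")
for decade, rate in rates.items():
    print(decade, rate)
```

Normalizing by word count matters because later decades may simply contain more text; a raw tally would confuse a bigger library with a change in naming habits.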

"This is brand new data," Jockers said of the findings.  "We haven't drawn any conclusions." 

"The work that we're doing is bringing new knowledge," he said.  The SEASR grant encourages the researchers to bring their findings to traditional humanities outlets, but the reception there has not always been warm.  After all, skeptics argue, you just can't treat novels like sacks of raw data and expect to crank out intelligent insights about aesthetics.

But, according to Moretti, "What the skeptics miss is indeed the big picture: the possibility that the study of 4,000 19th-century novels, instead of the usual 40 or 50, may give us not just a bigger, but a different, literary history.

"After all, this is exactly what happened to social history two generations ago, when it was completely transformed by the quantitative study of large archives. And this is the hope – and the aim – of our Literary Lab."

Jockers insisted that the results they gather will "still require interpretation" and therefore aren't light years away from more conventional research in the humanities.  However, instead of analyzing a poem or passage or book, Jockers and his colleagues and students now study libraries – without reading the books.

Following literary patterns 

The Lit Lab is also using the data to compare the styles of literary genres.  The research examines how the usage patterns of different words characterize different genres.  A comparison of gothic novels with "sensation novels" (lurid tales of society's underbelly) shows a marked inclination toward "locative prepositions" in the gothic novels – "where," "at," "towards." Not surprising, said Jockers, given that gothic novels are a place-centered genre, thriving on castles and dark places.
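A genre comparison of this kind can be approximated by measuring what share of a text's words come from a fixed list. The sketch below is an assumption-laden illustration, not the lab's method: the helper `locative_share` is hypothetical, the word list contains only the three examples quoted above, and the two passages are invented.

```python
import re

# Only the three examples named in the article; a real study would use a fuller list.
LOCATIVE_WORDS = {"where", "at", "towards"}


def locative_share(text, markers=LOCATIVE_WORDS):
    """Fraction of word tokens in `text` that appear in `markers`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return sum(token in markers for token in tokens) / len(tokens)


# Invented one-sentence stand-ins for each genre.
gothic = "At the castle, where the passage led towards the vault, she stopped."
sensation = "She hid the letter and lied to her husband about the money."

print(locative_share(gothic))
print(locative_share(sensation))
```

On real corpora the shares would be averaged over many novels per genre, and the difference would need a significance test before anyone called it a pattern.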

The early findings suggest that writing in a genre – whether the coming-of-age bildungsroman (think David Copperfield or Jane Eyre) or the detective novel – immediately restricts artistic freedom in ways writers would never guess.

The platform SEASR has provided also allows ways of ferreting out latent topics and themes by word selection (for example, words like ship, captain, ocean, sea and boat would indicate a nautical novel, such as Moby Dick) or rooting out the unconscious stylistic tics and word usages of an author – say Dickens.
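The nautical example can be mimicked with a simple keyword score. This is a deliberate simplification and not how SEASR works: finding truly latent topics usually calls for statistical topic models, whereas the sketch below only checks hand-picked word lists. The `theme_scores` function, the "gothic" list and the sample sentence are hypothetical; only the nautical words come from the article.

```python
import re
from collections import Counter

THEMES = {
    "nautical": {"ship", "captain", "ocean", "sea", "boat"},  # words from the article
    "gothic": {"castle", "vault", "ghost", "abbey"},  # hypothetical list for contrast
}


def theme_scores(text, themes=THEMES):
    """Return {theme: share of word tokens in `text` that belong to that theme}."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    total = sum(counts.values())
    if total == 0:
        return {name: 0.0 for name in themes}
    return {
        name: sum(counts[word] for word in words) / total
        for name, words in themes.items()
    }


# Invented sentence standing in for a novel.
passage = "The captain steered the ship across the sea."
scores = theme_scores(passage)
best = max(scores, key=scores.get)
print(best, scores)
```

The weakness of this approach is the one the article hints at: the researcher chooses the word lists in advance, so the result can only confirm or deny themes someone already thought to look for.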

"How does content drive style? We don't know yet," Jockers said.

Jockers' work is not only new data, but also a new idea of what research looks like in the humanities.  And if it still has kinks (for instance, it's vulnerable to the researcher's own preconceptions and prejudices, as always), the future will refine it.  A lot depends on asking the right questions.  (Stanford's "Mapping the Republic of Letters" is another example of quantitative research.)

"Will we succeed? Who knows?" said Moretti. "So far, what I see emerging from our early explorations is a much more confused literary landscape than the one we see when we focus only on the canon.

"Working on a digital archive is like entering a Greek myth: hybrids, metamorphoses, distortions everywhere. It's fascinating; but it's also difficult to extract a rational picture. That, however, is our project, and its pathos lies in keeping the rational aim in sight – by using statistics, network theory, computational linguistics, evolution – through a maze of crazy contradictory details and endless false starts or disappointments.

"Will we succeed? Will the digital archive allow us – or other researchers – to effect a paradigm shift in literary history? No one knows; and so the skeptics are right, in reiterating the 'so what?' question. In any case, crowning 40 years of work with a single big wager feels great: we are going to be right, or wrong. There is no middle road any more."