Hertz Fellow Erez Lieberman-Aiden Creates Word-Frequency Tool to Measure Quantitative Historical Shifts in Language

December 7, 2013
Hertz Staff

In a Scoreboard of Words, a Cultural Guide

By NATASHA SINGER, The New York Times

Data is the new oil. The next revolution will be social-mobile-local. Technology is evolving faster than ever.

Certain business catchphrases become so commonplace that they seem as if they must be true. But how do you measure the cultural signals behind such truisms?

For insight into whether data may now have more cultural currency than oil, I turned to a tool from Google that charts the yearly frequency of words and phrases contained in millions of books. Called the Google Books Ngram Viewer, it is an outgrowth of the company’s efforts to scan the world’s books. (Ngram is a technical term in which “N” stands for the number of words in a sequence; a single word like “America” is a one-gram while “the United States of America” is a five-gram.)
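To make the n-gram idea concrete, here is a minimal sketch of how a yearly frequency could be computed. It is an illustration only, not Google's implementation. The function names `ngrams` and `yearly_frequency`, the whitespace tokenization, the case-sensitive matching and the toy corpus are all assumptions made for this example.

```python
from collections import Counter


def ngrams(text, n):
    """Return every run of n consecutive words in text, joined by spaces.

    Assumption: words are split on whitespace only; a real tool would
    need a more careful tokenizer.
    """
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def yearly_frequency(corpus, phrase):
    """Map each year to the share of that year's n-grams equal to phrase.

    corpus is a hypothetical structure: {year: [text, text, ...]}.
    The n used for counting is the number of words in phrase.
    """
    n = len(phrase.split())
    result = {}
    for year, texts in corpus.items():
        counts = Counter()
        for text in texts:
            counts.update(ngrams(text, n))
        total = sum(counts.values())
        result[year] = counts[phrase] / total if total else 0.0
    return result


# Toy corpus, invented purely for illustration.
toy_corpus = {
    1950: ["oil is valuable", "data and oil"],
    1980: ["data data data oil"],
}
data_share = yearly_frequency(toy_corpus, "data")
oil_share = yearly_frequency(toy_corpus, "oil")
```

Plotting `data_share` and `oil_share` against the year would give a small-scale version of the kind of chart described here.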

I started my data-versus-oil quest with casual one-gram queries about the two words. The tool produced a chart showing that the word “data” appeared more often than “oil” in English-language texts as far back as 1953, and that its frequency followed a steep upward trajectory into the late 1980s. Of course, in the world of actual commerce, oil may have greater value than raw data. But in terms of book mentions, at least, the word-use graph suggests that data isn’t simply the new oil. It’s more like a decades-old front-runner.

“The appreciation of the importance of data has been emerging for decades, hand in hand with the computers that allow us to analyze it,” says Erez Aiden, a computer scientist who helped create the word-frequency tool and is now an assistant professor at the Baylor College of Medicine in Houston. “Maybe data has been the new oil for a little longer than we think.”

The analysis of large, complex data sets — or “Big Data” — to predict phenomena is becoming ubiquitous. But Google’s tool is an example of data analysis over a much larger time scale — an approach called “Long Data” — to find and follow cultural shifts. And it is ushering in a quantitative approach to understanding human history.

Now Mr. Aiden and his data science co-researcher, Jean-Baptiste Michel, have written a book, “Uncharted: Big Data as a Lens on Human Culture” (Riverhead Books), scheduled to be published this month. It recounts how, as graduate students at Harvard, they came up with the idea for measuring historical shifts in language and then took the concept to Google.

The two have since used this system to analyze centuries of word use, examining the spread of scientific concepts, technological innovations, political repression and even celebrity fame. To detect censorship in Germany under the Nazis, for instance, they tracked the mentions and omissions of well-known artists — reporting that Marc Chagall’s full name surfaced only once from 1936 to 1943 in the German book records, even as this Jewish painter’s name appeared with increasing frequency in English texts.

“Digitized data is really powerful when it becomes long enough over time so you can see trends in society and culture that you could not see before,” says Mr. Michel, who recently started a data science company, Quantified Labs. “You are getting a whole new vantage point on something.”

Of course, computational analysis of word frequency isn’t meant as a replacement for primary sources and records. It’s simply an instrument to allow researchers to more easily investigate panoramic views of history.

Mr. Michel and Mr. Aiden began seriously contemplating the idea of an automated word-frequency calculator in 2006, while working on a laborious analysis of changes to English grammar; that involved a research team painstakingly analyzing and quantifying how irregular verbs changed over time in Old and Middle English texts. It led them to imagine a kind of “robot historian” that could make the process more efficient by reading millions of books at once and tabulating the occurrence of words and phrases.

“We wanted to create a scientific measuring instrument, something like a telescope, but instead of pointing it at a star, you point it at human culture,” Mr. Michel recalls. The pair approached Peter Norvig, the director of research at Google, with a pie-in-the-sky proposal: to mine Google’s library of digital books so they could apply automated quantitative analysis to the typically qualitative study of history.

According to the book, Mr. Norvig was intrigued. But he challenged the graduate students by asking how such a system could work without violating copyright.

After some thought, Mr. Aiden and Mr. Michel proposed creating a kind of “shadow data set” that would contain frequency statistics on the most common words or phrases in the digitized books — but would not contain the books’ actual texts.
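The "shadow data set" idea can be sketched roughly as follows: keep only per-year counts for n-grams that occur often enough, and discard the text itself. This is a hedged illustration, not the researchers' actual pipeline. The function name `build_shadow_table`, the `min_count` cutoff, the lowercasing and the sample books are all assumptions made for this example.

```python
from collections import Counter, defaultdict


def build_shadow_table(books, n=1, min_count=2):
    """Build {ngram: {year: count}} from an iterable of (year, text) pairs.

    Only n-grams whose total count across all books reaches min_count
    are kept. The book texts are never stored in the result.
    Assumptions: whitespace tokenization, lowercased words, and an
    arbitrary min_count chosen for illustration.
    """
    per_year = defaultdict(Counter)
    totals = Counter()
    for year, text in books:
        words = text.lower().split()
        for i in range(len(words) - n + 1):
            gram = " ".join(words[i:i + n])
            per_year[gram][year] += 1
            totals[gram] += 1
    # Drop rare n-grams so the table holds statistics only on common ones.
    return {
        gram: dict(per_year[gram])
        for gram in totals
        if totals[gram] >= min_count
    }


# Invented sample input, for illustration only.
sample_books = [
    (1950, "oil and data"),
    (1980, "data data oil"),
    (1980, "rare word"),
]
shadow = build_shadow_table(sample_books, n=1, min_count=2)
```

In this sketch the table answers "how often did this word appear in a given year?" but contains no sentence from any book, which is the property the proposal relies on.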

The pair developed a prototype interface, called Bookworm, to prove their idea. Soon after, engineers at Google, including Jon Orwant and Will Brockman, built a public, web-based version of the tool.

“We were in,” Mr. Aiden and Mr. Michel write in the book. “Suddenly we had access to the biggest collection of words in history.”

Today, the Ngram Viewer contains words taken from about 7.5 million books, representing an estimated 6 percent of all books ever published. Academic researchers can tap into the data to conduct rigorous studies of linguistic shifts across decades or centuries. Members of the public may simply have fun watching how certain lingo rises and falls over time.

The system can also conduct quantitative checks on popular perceptions.

Consider our current notion that we live in a time when technology is evolving faster than ever. Mr. Aiden and Mr. Michel tested this belief by comparing the dates of invention of 147 technologies with the rates at which those innovations spread through English texts. They found that early 19th-century inventions, for instance, took 65 years to begin making a cultural impact, while turn-of-the-20th-century innovations took only 26 years. Their conclusion: the time it takes for society to learn about an invention has been shrinking by about 2.5 years every decade.

“You see it very quantitatively, going back centuries, the increasing speed with which technology is adopted,” Mr. Aiden says.

Still, they caution armchair linguists that the Ngram Viewer is a scientific tool whose results can be misinterpreted.

Witness a simple two-gram query for “fax machine.” Their book describes how the fax seems to pop up, “almost instantaneously, in the 1980s, soaring immediately to peak popularity.” But the machine was actually invented in the 1840s, the book reports. Back then it was called the “telefax.”

Certain concepts may persevere, even as the names for technologies change to suit the lexicon of their time.