The National Institute for Computational Sciences

Petascale Humanities: Supercomputing Global News Media

University of Illinois scientist uses advanced computing to study how global news media can forecast human behavior

By Caitlin Elizabeth Rockett

News abounds at lightning speeds—on the Internet and T.V., in newspapers, magazines, blogs, and social networking sites—but what do we get when we consume news? Scientist Kalev Leetaru believes news is capable of teaching us much more than just what happened in the world today.

“News gives you incredible information about people, places, and organizations,” said Leetaru, assistant director for Text and Digital Media Analytics at the Institute for Computing in the Humanities, Arts, and Social Science at the University of Illinois, Urbana-Champaign. “It also tells you about the relationships between them, about how people view each other.”

The field of digital humanities is growing, which is why Leetaru took his research to the resources at the University of Tennessee’s (UT) National Institute for Computational Sciences (NICS).

Using a large, shared memory supercomputer called Nautilus, Leetaru has analyzed the tone and geographic dimensions of a 30-year archive of global news to produce real-time forecasts of human behavior such as national conflicts and the movement of specific individuals.

A range of advanced analysis techniques were used to produce a network 2.4 petabytes in size containing more than 10 billion people, places, things, and activities connected by over 100 trillion relationships—more data than any current computing system can handle. By leveraging advanced supercomputers like Nautilus, Leetaru is able to push the envelope of the “petascale humanities,” letting the machine find interesting patterns in the bulk of data. With patterns in hand, he then recreates them using a more traditional targeted and smaller-scale approach that others can follow.

A paper detailing his findings—“Culturomics 2.0: Forecasting Large-Scale Human Behavior Using Global Media Tone in Time and Space”—was published in the September 5 edition of the open access, peer-reviewed electronic journal First Monday.

Tuning into tone

In January, Science published a paper detailing a study in “culturomics” where computers trace the frequency of words and phrases, quantifying cultural trends across seven languages and three centuries. While this work offered new insights, tracing words in books offers a limited picture of humanity.

“Books reflect a digested view of history, written with the benefit of hindsight,” Leetaru clarified. “This study is in real-time.” And thus “culturomics 2.0” was born.

Leetaru’s version of culturomics uses over 100 million news articles, including the Summary of World Broadcasts (SWB) news archive, a daily selection of print, broadcast, and Internet news from around the world. Created by the British Government in the lead-up to World War II and currently operating in partnership with the CIA, content in SWB is translated in a highly iterative manner that ensures preservation of the subtleties of language. The SWB archive provided a 30-year sample for Leetaru’s research, from January 1979 to July 2010, totaling 3.9 million news articles.

For comparison, Leetaru also incorporated the complete full text of the New York Times, 1945-2005, containing 5.9 million articles. In order to monitor news on the recent Egyptian conflicts, a manual process was used to update both the SWB and Times archives with all articles mentioning the countries in the study.

Leetaru found that nearly half of the news monitored around the world by SWB now comes from websites. To investigate how much insight is gained from SWB’s capacity to translate and access print and non-web broadcast media, an archive of English-only web-based news sites from around the world was also incorporated. Drawing on news from the Google News front page, main topic pages, and individual country feeds, this web crawl includes 10,000 to 100,000 articles per day from 2006 to 2011.

Perhaps the most distinctive aspect of Leetaru’s version of culturomics is the addition of spatial and tonal dimensions.

“Almost every Fortune 500 company monitors the tone of news and social media coverage about their products,” Leetaru explained. “There’s been a huge amount of research coming out of the business literature on the power of news tone to predict economic behavior, yet there hasn’t been as much work in using it to predict social behavior.”

World War II provides an example of how changes in media tone can serve as a social forecasting tool. On December 6, 1941, the U.S. counterpart to SWB produced an analytical report noting that Japanese radio broadcasts had increased their criticism of the U.S. and ceased appeals for peace. Pearl Harbor was bombed the next day. According to Leetaru, the news-monitoring service had tuned into the real news.

“They recognized the most valuable part about the news was not the factual parts, but the latent parts—the tone, the emotion.”

Still, how do you measure tone, and how do computers examine human behavior?

Computing human behavior

An SGI Altix UV 1000 system managed by UT’s Remote Data Analysis and Visualization Center, Nautilus is closely integrated with NICS, located on the campus of Oak Ridge National Laboratory. With 1,024 cores and 4 terabytes of global shared memory, Nautilus enables data analysis and visualization of information on one resource.

“The ability to have large shared memory and a platform to do visualizations all in one place makes Nautilus a wonderful resource,” Leetaru explained, adding, “And when something didn’t work, NICS was there to help me—from compiling to workflow.” Nautilus allowed Leetaru to grab whole terabytes of data and execute three key data mining techniques: tone mining, fulltext geocoding, and network analysis.

Leetaru looked at 1,500 dimensions of emotion before deciding tone was the most reliable metric for conflict. Tone mining creates a numeric measure of overall tone in a document. An algorithm counts the number of “positive” and “negative” words that appear and assigns a positive or negative value. Two methods of tone mining were used, each using dictionaries with pre-assigned positive and negative words.

The first method measures the density of positive and negative words in a document and subtracts one from the other to obtain a measure of overall tone. The second method uses special dictionaries where each word has been assigned a numeric score from extremely negative to extremely positive, capturing the fact that “loathe” has a more negative connotation than “dislike.” The average score of all words found in a document is used to offer a slightly more nuanced understanding of its tone.
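
To make the two approaches concrete, here is a minimal sketch in Python. It is not the code used in the study. The word lists, the numeric scores, and the function names are all hypothetical, and the article does not say whether the second method averages over every word or only over words found in the dictionary; this sketch assumes the latter.

```python
import re

# Hypothetical toy dictionaries; the lexicons used in the study are not
# described in the article.
POSITIVE = {"peace", "hope", "agree", "praise"}
NEGATIVE = {"attack", "loathe", "dislike", "crisis"}

# Hypothetical graded scores, from strongly negative to strongly positive.
SCORES = {
    "loathe": -3.0, "attack": -2.0, "crisis": -2.0, "dislike": -1.0,
    "agree": 1.0, "praise": 1.5, "hope": 2.0, "peace": 2.0,
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z']+", text.lower())


def density_tone(text):
    """Method 1: positive-word density minus negative-word density."""
    words = tokenize(text)
    if not words:
        return 0.0
    positive = sum(word in POSITIVE for word in words)
    negative = sum(word in NEGATIVE for word in words)
    return (positive - negative) / len(words)


def scored_tone(text):
    """Method 2: average graded score of the words found in the dictionary."""
    hits = [SCORES[word] for word in tokenize(text) if word in SCORES]
    return sum(hits) / len(hits) if hits else 0.0
```

Under these toy scores, scored_tone("They loathe the plan") comes out lower than scored_tone("They dislike the plan"), which is the distinction the graded dictionaries are meant to capture. The first method would rate both sentences identically, since each contains exactly one negative word.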

Geocoding uses algorithms that examine the text of a news article for possible location references, disambiguate them (does this document reference Cairo, Egypt, or Cairo, Illinois?), and ultimately output an approximate geographic coordinate for the location that can be displayed on a map.
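
The disambiguation step can be illustrated with a short Python sketch. The article does not describe how the study's geocoder resolves ambiguous names, so the approach here (picking the candidate whose context words overlap most with the article text) is an assumption, as are the tiny gazetteer, the approximate coordinates, and the function name.

```python
import re

# Hypothetical two-entry gazetteer with approximate coordinates. A real
# gazetteer would hold many thousands of place names.
GAZETTEER = {
    "cairo": [
        {"name": "Cairo, Egypt", "lat": 30.04, "lon": 31.24,
         "context": {"egypt", "egyptian", "nile"}},
        {"name": "Cairo, Illinois", "lat": 37.01, "lon": -89.18,
         "context": {"illinois", "ohio", "mississippi"}},
    ],
}


def geocode(text):
    """Return (name, lat, lon) for each known place name found in the text.

    Ambiguous names are resolved by counting how many of each candidate's
    context words also appear in the text. Ties go to the first candidate
    listed in the gazetteer.
    """
    words = set(re.findall(r"[a-z]+", text.lower()))
    results = []
    for place, candidates in GAZETTEER.items():
        if place in words:
            best = max(candidates, key=lambda c: len(c["context"] & words))
            results.append((best["name"], best["lat"], best["lon"]))
    return results
```

With this sketch, a sentence mentioning Cairo alongside "Egyptian" resolves to the Egyptian capital, while one mentioning the Ohio and Mississippi rivers resolves to the Illinois city.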

A third technique, network analysis, shows how global media groups the countries of Earth.

“Using global news coverage, you count how many times every city on Earth is mentioned with every other city in an article,” explained Leetaru. “Group those results by country and you have a network of how the world news media relates and frames all the countries on Earth.”
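
The counting Leetaru describes can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the study's implementation: the city-to-country table and the function name are hypothetical, each article is assumed to be already reduced to a list of the cities it mentions, and pairs of cities within the same country are dropped, a choice the article does not specify.

```python
from collections import Counter
from itertools import combinations

# Hypothetical lookup table; in practice this would come from the geocoder.
CITY_TO_COUNTRY = {
    "Cairo": "Egypt",
    "Alexandria": "Egypt",
    "Tunis": "Tunisia",
    "Benghazi": "Libya",
}


def country_network(articles):
    """Count city co-mentions per article, grouped into country pairs.

    `articles` is a list of city-name lists, one list per article.
    Returns a Counter mapping (country_a, country_b) to a co-mention count.
    """
    edges = Counter()
    for cities in articles:
        # Count each pair of distinct cities once per article.
        for city_a, city_b in combinations(sorted(set(cities)), 2):
            country_a = CITY_TO_COUNTRY[city_a]
            country_b = CITY_TO_COUNTRY[city_b]
            if country_a != country_b:  # assumption: skip same-country pairs
                edges[tuple(sorted((country_a, country_b)))] += 1
    return edges


articles = [
    ["Cairo", "Tunis"],
    ["Cairo", "Alexandria", "Tunis"],
    ["Benghazi", "Tunis"],
]
edges = country_network(articles)
```

The resulting edge weights form a graph of countries. Grouping that graph into clusters, such as the "civilizations" discussed later in the article, would be a separate step that is not shown here.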

Each of these techniques was used to examine the ability of tonality in news to forecast a number of human behaviors, leading to some thought-provoking results.

Results and the future

Arguably one of the most unexpected findings highlighted in Leetaru’s paper focuses on using news to map movement of a particular individual—in this case Osama bin Laden. Leetaru was able to estimate the militant leader’s hiding place as a 200-kilometer radius in Northern Pakistan, including Abbottabad where he was ultimately found.

“I never expected to pinpoint him so accurately,” admitted Leetaru. “But it’s fascinating—if you make a map of all the cities mentioned in articles about him over the last decade it leads to a 200-kilometer radius around where he was found. It begs the question, ‘Why did that work so well?’”

Leetaru was also able to use news to retroactively forecast revolutions in Egypt, Tunisia, and Libya and dissect the underpinnings.

“Certainly Tunisia played a huge role in pushing Egypt over the edge, but if you trace out the tonal curve of news concerning Egypt, you see that the real downspiral didn’t happen until after January first, the day of the Coptic Church bombing.”

Applying network analysis techniques to both the SWB and New York Times archives, Leetaru created world “civilizations,” essentially groups of countries that the news media tends to group together. The resulting map created from the SWB archive was markedly different from the map created from the New York Times archive.

While the SWB news led to seven civilizations, the Times news led to only five groups and a far greater portion of countries grouped with America.

“Each country’s media will depict the world differently,” explained Leetaru. “It’s a standard principle of journalism—you write for your audience. Still, it vividly reinforces that what we get here in the U.S. is a very U.S. centric view of the world.”

Leetaru emphasizes that while these are captivating findings, the real goal of his work is to encourage further study.

“The purpose of this paper is not to say, ‘Here’s the magic bullet that solves these problems,’ but more as a road map for future research,” he said. “I see it as diving beneath the ocean—we’ve been so focused on the surface that we’re only just beginning to start exploring the entire new world that’s underneath.”

For complete details on this research, see the full paper at First Monday.