How to "Read" A Million Novels in One Afternoon With Spark
A look at how Spark can handle massive volumes of data through topic modeling.
Join the DZone community and get the full member experience.Join For Free
The world communicates in text. Our work lives have us treading waist-deep in email, our hobbies often have blogs, our complaints go to Yelp, and even our personal lives are lived out via tweets, Facebook updates, and texts. There is a massive amount of information that can be heard from text — as long as you know how to listen.
Learning from large amounts of text data is less a question of will and more a question of feasibility. How do you “read” the equivalent of thousands or even millions of novels in an afternoon?
Topic models extract the key concepts in a set of documents. Each concept can be described by a list of keywords from most to least important. Then, each document can be connected to those concepts, or topics, to determine how representative that document is of that overall concept.
For example, given a corpus of news stories, topics emerge independent of any one document. One topic may be characterized by terms like “poll,” “campaign,” and “debate,” which an observer will quickly see is a topic about politics and elections. Another sample topic may be characterized by terms like “euro,” “prime minister,” and “fiscal” – an observer immediately sees as a topic about the European economy. Commonly, a given document is not wholly about only a single topic but rather a mix of topics. The topic model outputs a probability that the document is about each possible topic. An analyst interested in how an election may affect the European economy can now isolate the topics of interest and search directly for those documents that contain a mixture. In a very short period of time, what started as thousands or millions of documents can be whittled to only the most important few.
When the Data Gets Large
To build these topics, an algorithm is employed called Latent Dirichlet Allocation. The algorithm will need to store vectors representing each possible term multiplied by the number of documents. Then, it will iterate through hundreds of thousands of cycles, seeking to improve each abstract topic at each stage.
Apache Spark is optimal for building a pipeline for this task. Spark uses the pool of memory across many servers to break up the problem into many parallel components and uses the comparative speed of RAM to quickly iterate the algorithm. Because the task can be broken into an arbitrary number of smaller pieces and iteration can continue at speed, it handles very large amounts of text data as easily as a single machine handles a moderate amount.
Big Text Data: Congress Loves to Talk
To see how Spark can handle a massive amount of text data, consider the case of a University of Oklahoma doctoral student in Political Science. She wants to investigate how international politics have changed over the last 20 years, using congressional hearing transcripts from 1995 to 2015. However, as you likely would expect, Congress loves to talk. Over this 20-year period, there are more than 19,000 hearing documents. The average hearing runs approx. 32,000 words, with the longest at nearly 900,000 words long. In total, the documents run 613 million words, over 600 times longer than the entire Harry Potter book series.
To build a tool to allow her to investigate this mountain of text efficiently and effectively, Apache Spark was used to build a suite of topic models with different parameters. Using a cluster of five moderate-sized Amazon servers, she was able to build and cache 36 different topic models on the whole data set in less than 48 hours. Once built, the output models could be persisted to JSON and routed to an interactive application, where those models could be explored interactively. The different models provided a range of how frequently or infrequently terms were included in the model, as well as the number of topics into which the corpus was divided.
After choosing which model to load, a user is presented with a scatter plot of all topics. The topics are arranged such that similar topics are close together. For example, topics about wildlife conservation and industrial pollution may be relatively similar as they both commonly have some of the same words, like “lake,” “river,” or “fish.” Each topic is sized according to the relative share of total document space that is attributable to this topic. When a user clicks on a topic, a word cloud is created that communicates the most important terms, sized by how strongly they contribute to the uniqueness of that topic.
Additionally, other meta-data aspects, such as the year of the congressional documents, are preserved. This allows a breakdown of document space specific to a chosen topic. Thus, a user can see how prominent a topic was across time, both relative to the volume of text each year as well as in absolute terms.
From here, a topic can be broken down further into the top documents that most represent that topic (either across all hearing documents or specific to a certain year). Clicking on a given document provides the full range of meta data: what percent it represents the topic (called gamma), what date the hearing was held, which chamber it was held in (Senate, House, joint), and which committee actually held the hearing. Lastly, a user can see what other topics a given document also contains.
Without big data tools like Spark, it would be impossible to build interesting models on large amounts of text data. However, without interactive and exploratory applications to communicate those models to analysts and subject matter experts, the models wouldn’t matter in the first place. Using Spark to power interactive applications can help break open the world of unstructured text data and give an analyst the ability to read like a super-human.
Opinions expressed by DZone contributors are their own.