Topic Modeling the State of the Union Address With Python
Do you feel like partisanship is running amok? It’s not your imagination. As an example, the modern State of the Union has become hyper-partisan, and topic modeling quantifies that effect.
Topic modeling finds broad topics that occur in a body of text. Those topics are characterized by key terms that have some relationship to each other. Here are the four dominant topic groups found in State of the Union addresses since 1945.
How would you characterize the topic of each group? Their fluctuation over time rings true in light of the events that dominated each presidency.
It turns out that the fluctuation isn't just a function of current events. It illustrates the influence of television on the SOTU and the rise of modern partisanship. Can you guess how?
Three Eras: Legislative (Before TV), Cultural (Rise of TV), Amplified Partisan (Rise of Cable)
The topics illustrate three distinct eras of the State of the Union address. The blue topic reigns prior to 1960 and is filled with legislative and fiscal terms, reflecting the primary audience before television: Congress. The green topic shows a change in tone from legislative to cultural. Before television, the speech was addressed mainly to Congress, and the public got a summary in the next day's newspaper; as TV became more prevalent, the SOTU went directly to the people. In line with this hypothesis, the red and purple topics, indicative of modern popular political culture, start together and turn hyper-partisan from the Clinton years to today: red and purple flip-flop drastically as the president's party changes.
From Boring and Dry to Cultural Mirror: The Rise of Television
The blue topic consists of legislative terms like “expenditure”, “fiscal year”, and “recommend”. This topic dominates the late 40s through most of the 50s and declines as television ownership seems to hit a critical mass. The address had been televised since 1947, but in 1965, as TV reached ~80% of US households, it moved from daytime to the evening, which is probably not a coincidence. What had been a daytime presidential speech aimed primarily at members of Congress became, in 1965, a prime-time television event available to millions of people. The tone visibly shifts to the cultural issues of the day (the green line), highlighted by terms like “space”, “soviet union”, “Vietnam”, and “missile”. (It's fascinating how humans change technology and then technology changes us.)
Next? Modern Partisanship is Born With the Rise of Cable and the 24 Hour News Cycle
President Clinton’s first State of the Union stands in stark contrast to the gradual changes of the preceding decades. Throughout the Carter, Reagan, and Bush Sr. years, modern cultural topics (the red and purple lines) trend upward together. That trend changes sharply in 1993 with Bill Clinton’s first address and stays much the same throughout his presidency, with terms such as “college”, “parent”, “laughter”, and, ironically, “bipartisan” dominating the topic. The topics flip immediately in 2001 with Bush Jr.’s first speech. The tone shifts heavily to terms like “terrorist/terror”, “Iraq”, “oil”, “violence”, and “fighting” (the purple line). The trend is at its starkest for his speech in 2001, which was delivered in February of that year, more than six months before the 9/11 attacks. Then the topics immediately revert to a Clinton-esque pattern in 2009 for Obama’s first State of the Union, where they have largely remained.
The amplified nature of modern partisanship is jarring compared to the decades prior. Without annotations for the year and presidency, it would be difficult to locate exactly when Johnson left and Nixon started, even though the two were on opposite ends of the political spectrum during the height of the Vietnam War. It is equally difficult to see the difference between Carter and Reagan; the transition is smooth and the topics are largely unchanged. The eye can immediately locate the hyper-distinctions of Clinton to Bush to Obama. Why?
The pattern emerges as cable TV breaks the 50% mark and the 24-hour news cycle is born. Network TV is no longer a universal medium of communication. (With only 3 channels, the president was on all 3.) With cable there are many channels, including dedicated news channels that need to fill time with opinion programming, and each partisan outlet is able to speak inside a bubble directly to its own segmented audience. The modern era is characterized not just by partisanship, but by a kind of electrified, amplified partisanship that is literally unprecedented.
Topic Model Details
The graphics above represent a Latent Dirichlet Allocation (LDA) topic model with four topics, built on all presidential State of the Union addresses since 1945. The model treats one- to three-word phrases as terms and considers only terms that appear in more than 5% and less than 60% of the speeches, since the idea is to find the terms that distinguish one speech from another.
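For readers who want to see the mechanics, here is a minimal sketch of LDA fitted by collapsed Gibbs sampling in plain Python. It is not the code behind the graphics above (the article does not say which library or inference method was used); the function name `lda_gibbs`, the toy token lists, the two-topic setting, and the `alpha`/`beta` values are all assumptions made for illustration.

```python
import random


def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling. `docs` is a list of token lists."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)

    doc_topic = [[0] * n_topics for _ in docs]            # tokens in doc d assigned to topic k
    topic_word = [[0] * n_words for _ in range(n_topics)]  # times word v is assigned to topic k
    topic_total = [0] * n_topics                           # total tokens assigned to topic k
    assignments = []

    # Start from a random topic for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                k = assignments[d][i]
                # Take the token out of the counts...
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # ...and resample its topic given every other assignment.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta)
                    / (topic_total[t] + n_words * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1

    # Smoothed topic mix per document; each row sums to 1.
    theta = [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in counts]
        for counts, doc in zip(doc_topic, docs)
    ]
    # Three most frequently assigned terms for each topic.
    top_terms = [
        [vocab[v] for v in sorted(range(n_words), key=lambda v: -topic_word[k][v])[:3]]
        for k in range(n_topics)
    ]
    return theta, top_terms


# Hypothetical toy "speeches", already tokenized; not the real data set.
docs = [
    ["fiscal", "year", "expenditure", "recommend", "fiscal", "year"],
    ["expenditure", "fiscal", "recommend", "year"],
    ["space", "soviet", "missile", "vietnam", "space"],
    ["missile", "soviet", "space", "vietnam"],
]
theta, top_terms = lda_gibbs(docs, n_topics=2)
for row in theta:
    print([round(p, 2) for p in row])
print(top_terms)
```

Each row of `theta` is one speech's mix of topic weights, which is what a chart of topics over time plots year by year, and `top_terms` gives the key terms that characterize each topic. On real speeches a library implementation would be the more practical choice; the sketch only shows what fitting the model means.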
Interactive: Build Your Own Topic Model
Below is a link to a web-based data application where you can build your own topic model of the State of the Union addresses. You can choose the number of topics, as well as document frequency filters for the maximum and minimum share of documents a word can appear in. Using more topics is like zooming in on the body of text; using fewer is like zooming out. A maximum document frequency of 0.75 means that if a word appears in more than 75% of the speeches, it is not used because it is “too common”. In the same way, a minimum document frequency of 0.05 means that if a word appears in less than 5% of the speeches, it is “too rare” to contribute to an overall trend. (Note: the LDA model takes a few seconds to run.)
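The two document-frequency filters can be sketched in a few lines of plain Python. This is an illustration, not the application's code: the helper names `ngrams` and `select_terms`, the whitespace tokenization, and the toy speeches are assumptions, and the thresholds in the example are chosen to suit a four-document corpus rather than the real one.

```python
from collections import Counter


def ngrams(tokens, max_n=3):
    """Yield every 1- to max_n-word phrase in a list of tokens."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


def select_terms(speeches, min_df=0.05, max_df=0.75, max_n=3):
    """Keep terms whose share of speeches lies between min_df and max_df."""
    doc_freq = Counter()
    for text in speeches:
        # A set, so each term counts at most once per speech.
        doc_freq.update(set(ngrams(text.lower().split(), max_n)))
    n_docs = len(speeches)
    return {
        term
        for term, count in doc_freq.items()
        if min_df <= count / n_docs <= max_df
    }


# Hypothetical toy corpus, not the real addresses.
speeches = [
    "congress fiscal year budget",
    "congress fiscal year expenditure",
    "congress space program",
    "congress vietnam missile",
]
terms = select_terms(speeches, min_df=0.3, max_df=0.75)
print(sorted(terms))
# ['congress fiscal', 'congress fiscal year', 'fiscal', 'fiscal year', 'year']
```

Here “congress” appears in every speech and is dropped as too common, terms that appear in only one speech are dropped as too rare, and the phrases shared by two speeches survive.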
Link to the application: http://city.exaptive.com/reference/repos/demo/TopicModelSOTU.html
This data application was built using multiple technologies in the Exaptive platform (Python machine learning to build the models and D3.js to build the graphics). Here is a behind-the-scenes tour of how this and similar data applications can be built.
For more on text clustering, check out my prior post using R to do some different text analysis of the same data set. The data is available here and here.
Published at DZone with permission of Frank Evans, DZone MVB. See the original article here.