Who Wrote the Anti-Trump New York Times Op-Ed? Using Tidytext to Find Document Similarity
Let's check out who wrote the Anti-Trump New York Times Op-Ed and also explore using Tidytext to find document similarities.
Like a lot of people, I was intrigued by "I Am Part of the Resistance Inside the Trump Administration," an anonymous New York Times op-ed written by a "senior official in the Trump administration." And like many data scientists, I was curious about what role text mining could play.
This is a useful opportunity to demonstrate how to use the tidytext package that Julia Silge and I developed, and in particular to apply three methods:
- Using tf-idf to find words specific to each document (examined in more detail in Chapter 3 of our book)
- Using widyr to compute pairwise cosine similarity between documents
- Making the similarity interpretable by breaking it down by word
Since my goal is R education more than political analysis, I show all the code in this post.
Even in the less than 24 hours since the article was posted, I'm far from the first to run a text analysis on it. In particular, Mike Kearney has shared a great R analysis on GitHub (which pointed me towards CSPAN's Cabinet Twitter list), and Kanishka Misra has done some exciting work here.
Getting the text of the op-ed is doable with the rvest package.
```r
# Setup
library(tidyverse)
library(tidytext)
library(rvest)
theme_set(theme_light())

url <- "https://www.nytimes.com/2018/09/05/opinion/trump-white-house-anonymous-resistance.html"

# tail(-1) removes the first paragraph, which is an editorial header
op_ed <- read_html(url) %>%
  html_nodes(".e2kc3sl0") %>%
  html_text() %>%
  tail(-1) %>%
  data_frame(text = .)
```
The harder step is getting a set of documents representing "senior officials." An imperfect but fast approach is to collect text from their Twitter accounts. (If you find an interesting dataset of, say, government FOIA documents, I recommend you try extending this analysis!)
We can look at a combination of two (overlapping) Twitter lists containing administration staff members:
```r
library(rtweet)

cabinet_accounts <- lists_members(owner_user = "cspan", slug = "the-cabinet")
staff <- lists_members(owner_user = "digiphile", slug = "white-house-staff")

# Find the unique screen names from either list
accounts <- unique(c(cabinet_accounts$screen_name, staff$screen_name))

# Download up to ~3200 recent tweets from each account
tweets <- map_df(accounts, get_timeline, n = 3200)
```
This results in a set of 136,501 tweets from 69 Twitter handles. There's certainly no guarantee that the op-ed writer is among these accounts (or, if they are, that they even write their own tweets). But it still serves as an interesting case study in text analysis: how do we find the tweets with the closest use of language?
First, we need to tokenize the tweets: to turn them from full messages into individual words. We want to skip retweets, and we need a custom regular expression for splitting words and removing links (much as I did when analyzing Trump's Twitter account).
```r
# When multiple tweets across accounts are identical (common in government
# accounts), use distinct() to keep only the earliest
reg <- "([^A-Za-z\\d#@']|'(?![A-Za-z\\d#@]))"

tweet_words <- tweets %>%
  filter(!is_retweet) %>%
  arrange(created_at) %>%
  distinct(text, .keep_all = TRUE) %>%
  select(screen_name, status_id, text) %>%
  mutate(text = str_replace_all(text, "https?://t.co/[A-Za-z\\d]+|&amp;", "")) %>%
  unnest_tokens(word, text, token = "regex", pattern = reg) %>%
  filter(str_detect(word, "[a-z]"))
```
This parses the corpus of tweets into almost 1.5 million words.
Among this population of accounts, and ignoring "stop words" like "the" and "of," what are the most common words? We can use ggplot2 to visualize this.
```r
tweet_words %>%
  filter(!word %in% stop_words$word) %>%
  count(word, sort = TRUE) %>%
  head(16) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  coord_flip() +
  labs(y = "# of uses among staff Twitter accounts")
```
What words make up someone's "signature"? What makes up mine, Trump's, Mike Pence's, or the op-ed's?
We could start with the most common words someone uses. But there are some words, like "the" and "of," that just about everyone uses, as well as words like "president" that everyone in our dataset will use. So we also want to downweight words that appear across many documents. A common tool for balancing these two considerations and turning them into a "signature" vector is tf-idf: term frequency-inverse document frequency. This takes how frequently someone uses a term and weights it by the log of the inverse of the share of documents that mention it. For more details, see Chapter 3 of Text Mining with R.
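As a quick toy illustration (mine, not from the original analysis), the tf-idf weight for a single word can be computed by hand; tidytext uses the natural log for the idf term:

```r
# Toy example: tf-idf for one word, with made-up numbers.
n_docs <- 10          # documents in the corpus
tf <- 0.02            # the word makes up 2% of this document's words
docs_with_word <- 1   # it appears in only one document
idf <- log(n_docs / docs_with_word)
tf * idf              # a rare, frequently-used word scores high: ~0.046
```

A word used just as often but appearing in all ten documents would get idf = log(10/10) = 0, dropping out of the signature entirely.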
The bind_tf_idf() function from tidytext lets us compute tf-idf on a dataset of word counts like this one. Before we do, we bring in the op-ed as an additional document (since we're interested in treating it as one "special" document in our corpus).
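The code block that built `word_tf_idf` appears to have been lost in extraction. A sketch of what it likely looked like, judging from the output below (tokenizing the op-ed with the same regex and labeling it "op-ed" are my assumptions):

```r
# My reconstruction, not the original code: tokenize the op-ed with the
# same regex used for the tweets, label it "op-ed", then compute tf-idf
# treating each screen name (plus the op-ed) as one document
oped_words <- op_ed %>%
  mutate(screen_name = "op-ed") %>%
  unnest_tokens(word, text, token = "regex", pattern = reg)

word_tf_idf <- tweet_words %>%
  select(screen_name, word) %>%
  bind_rows(oped_words) %>%
  count(screen_name, word, sort = TRUE) %>%
  bind_tf_idf(word, screen_name, n) %>%
  arrange(desc(tf_idf))
```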
```
## # A tibble: 226,410 x 6
##    screen_name    word             n      tf   idf tf_idf
##    <chr>          <chr>        <int>   <dbl> <dbl>  <dbl>
##  1 joshpaciorek   #gogreen       170 0.0204   4.25 0.0868
##  2 ustraderep     ustr           147 0.0213   3.56 0.0757
##  3 deptvetaffairs #vantagepoint  762 0.0173   3.56 0.0614
##  4 deptofdefense  #knowyourmil   800 0.0185   3.15 0.0584
##  5 danscavino     #trumptrain    655 0.0201   2.86 0.0575
##  6 usun           @ambassadorpower 580 0.0154 3.56 0.0548
##  7 ustreasury     lew            690 0.0183   2.86 0.0523
##  8 hudgov         hud            566 0.0196   2.30 0.0451
##  9 ombpress       omb             38 0.0228   1.95 0.0444
## 10 secelainechao  'can             1 0.00990  4.25 0.0421
## # ... with 226,400 more rows
```
We can now see the words with the strongest associations to each user. For example, @JoshPaciorek (the VP's deputy press secretary) uses the hashtag #GoGreen (supporting Michigan State football) quite often; it makes up about 2% of his words (the tf, or term frequency, column). Since almost no one else uses it (giving it an inverse document frequency, idf, of about 4.25), it becomes a critical part of his tf-idf vector (his "signature").
We could take a look at the "signatures" of a few selected Twitter accounts.
```r
library(drlib)

selected <- c("realdonaldtrump", "mike_pence", "deptvetaffairs", "kellyannepolls")

word_tf_idf %>%
  filter(screen_name %in% selected) %>%
  group_by(screen_name) %>%
  top_n(12, tf_idf) %>%
  ungroup() %>%
  mutate(word = reorder_within(word, tf_idf, screen_name)) %>%
  ggplot(aes(word, tf_idf, fill = screen_name)) +
  geom_col(show.legend = FALSE) +
  scale_x_reordered() +
  coord_flip() +
  facet_wrap(~ screen_name, scales = "free_y") +
  labs(x = "",
       y = "tf-idf vectors of this word for this user",
       title = "tf-idf: top words for selected staff members")
```
This gives us a set of words that are quite specific to each account. For instance, @DeptVetAffairs uses hashtags like "#vantagepoint" and "#veteranoftheday" that almost no other account in this set would use. Words specific to Trump include "witch" (as in "witch hunt"), "fake" (as in "fake news"), and other phrases he tends to fixate on while other government officials don't. (See here for my text analysis of Trump's tweets as of August 2017.)
This shows how tf-idf offers us a vector (an association of each word with a number) that describes the unique signature of each document. To compare our documents (the op-ed with each Twitter account), we'll be comparing those vectors.
The widyr Package: Cosine Similarity
How can we compare two vectors to get a measure of document similarity? There are many approaches, but perhaps the most common for comparing tf-idf vectors is cosine similarity. This is a combination of a dot product (multiplying each term's score in document X by its score in document Y, then summing) and a normalization (dividing by the product of the vectors' magnitudes).
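As a sanity check (my toy example, with made-up numbers rather than real tf-idf scores), cosine similarity can be computed directly from that definition:

```r
# Cosine similarity of two small tf-idf vectors: dot product divided
# by the product of the vector magnitudes
cosine_similarity <- function(x, y) {
  sum(x * y) / (sqrt(sum(x ^ 2)) * sqrt(sum(y ^ 2)))
}

doc_x <- c(russia = 0.5, trump = 0.3, malign = 0)
doc_y <- c(russia = 0.4, trump = 0.1, malign = 0.2)

cosine_similarity(doc_x, doc_y)
```

Identical vectors give a similarity of 1, and vectors with no words in common give 0; the normalization means a long document and a short one can still score as similar if they weight the same words.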
My widyr package offers a convenient way to compute pairwise similarities on a tidy dataset:
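The call itself seems to have been dropped in extraction; judging from the filtered version used further down, it was presumably something like:

```r
# My reconstruction of the missing call: pairwise cosine similarity
# between every pair of documents, using tf_idf as the vector values
library(widyr)

word_tf_idf %>%
  pairwise_similarity(screen_name, word, tf_idf, sort = TRUE)
```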
```
## # A tibble: 2,415 x 3
##    item1          item2           similarity
##    <chr>          <chr>                <dbl>
##  1 vpcomdir       vppresssec           0.582
##  2 deptvetaffairs secshulkin           0.548
##  3 energy         secretaryperry       0.472
##  4 hudgov         secretarycarson      0.440
##  5 usedgov        betsydevosed         0.417
##  6 interior       secretaryzinke       0.386
##  7 secpricemd     secazar              0.381
##  8 secpompeo      statedept            0.347
##  9 vppresssec     vp                   0.343
## 10 usda           secretarysonny       0.337
## # ... with 2,405 more rows
```
The top results show that even this elementary method is able to match people to their positions. The VP press secretary and VP communications director unsurprisingly work closely together and tweet on similar topics. Similarly, it matches Shulkin, Perry, Carson, DeVos, and Zinke to their (current or former) Cabinet departments, and links the two consecutive Health and Human Services secretaries (Price and Azar) to each other.
It's worth seeing this document similarity metric in action, but it's not what you're here for. We're really interested in comparisons between the op-ed and the Twitter accounts. We can filter for those:
```r
# Look only at the similarity of the op-ed to the other documents
op_ed_similarity <- word_tf_idf %>%
  pairwise_similarity(screen_name, word, tf_idf, sort = TRUE) %>%
  filter(item1 == "op-ed")
```
```r
library(drlib)

op_ed_similarity %>%
  head(12) %>%
  mutate(item2 = reorder(item2, similarity)) %>%
  ggplot(aes(item2, similarity)) +
  geom_col() +
  scale_x_reordered() +
  coord_flip() +
  facet_wrap(~ item1, scales = "free_y") +
  labs(x = "",
       y = "cosine similarity between tf-idf vectors",
       subtitle = "Based on 69 selected staff accounts",
       title = "Twitter accounts using words similar to the NYTimes op-ed")
```
This unveils the most similar writer as... Trump himself.
Hmmm. While that would certainly be a scoop, it doesn't sound very likely to me. And the other top picks (the official White House account, the press secretary, and the vice president) also seem like suspicious guesses.
Interpreting Machine Learning: What Words Contributed to Scores?
The tf-idf method is a fairly basic one for text mining, but as a result it has a useful trait: it's based on a linear combination of one score per word. This means we can say exactly how much each word contributed to the tf-idf similarity between the article and a Twitter account. (Other machine learning methods allow interactions between words, which makes them harder to interpret.)
We'll try decomposing our tf-idf similarity to see how much each word contributed. You could think of this as asking, "If the op-ed hadn't used this word, how much lower would the similarity score be?"
```r
# This takes a little R judo, but it's worth the effort.
# First we normalize the tf-idf vector for each screen name,
# which is necessary for cosine similarity
tf_idf <- word_tf_idf %>%
  group_by(screen_name) %>%
  mutate(normalized = tf_idf / sqrt(sum(tf_idf ^ 2))) %>%
  ungroup()

# Then we join the op-ed words with the full corpus, and find
# the product of their normalized tf-idf in the other documents
word_combinations <- tf_idf %>%
  filter(screen_name == "op-ed") %>%
  select(-screen_name) %>%
  inner_join(tf_idf, by = "word", suffix = c("_oped", "_twitter")) %>%
  filter(screen_name != "op-ed") %>%
  mutate(contribution = normalized_oped * normalized_twitter) %>%
  arrange(desc(contribution)) %>%
  select(screen_name, word, tf_idf_oped, tf_idf_twitter, contribution)
```
```r
# Get the word contributions for the six most similar accounts
word_combinations %>%
  filter(screen_name %in% head(op_ed_similarity$item2)) %>%
  mutate(screen_name = reorder(screen_name, -contribution, sum)) %>%
  group_by(screen_name) %>%
  top_n(12, contribution) %>%
  ungroup() %>%
  mutate(word = reorder_within(word, contribution, screen_name)) %>%
  ggplot(aes(word, contribution, fill = screen_name)) +
  geom_col(show.legend = FALSE) +
  scale_x_reordered() +
  facet_wrap(~ screen_name, scales = "free_y") +
  coord_flip() +
  labs(x = "",
       y = "contribution to similarity score",
       title = "What caused each Twitter account to be similar to the article",
       subtitle = "For the 6 accounts with the highest similarity score")
```
Now the reasons for the tf-idf similarities become clearer.
The op-ed uses the word "Russia" five times. The press secretary and especially Trump mention Russia many times on their Twitter accounts, always in the context of defending Trump (as expected). Several accounts also get a high score simply because they mention the word "trump" so frequently.
Unfortunately, with a document this short and this topical, that's all it takes to get a high similarity score (a bag-of-words method can't understand context, such as whether Russia is being mentioned in a critical or a defensive light). This is one reason it's worth taking a closer look at what goes into an algorithm rather than treating it as a black box.
Having said that, there's one signature I think is notable.
Many others have noted "lodestar" as a telltale word in the piece. None of the Twitter accounts had used it. I'd like to focus on another word that did appear: "malign." Emphasis mine:
He complained for weeks about senior staff members letting him get boxed into further confrontation with Russia, and he expressed frustration that the United States continued to impose sanctions on the country for its *malign* behavior.
"malign" isn't as rare a word as "lodestar", but it's notable for being used in the exact same context (discussing russia or other countries' behavior) in a number of tweets from both secretary of state pompeo and the @statedepartment account. (pompeo has actually used the term "malign" an impressive seven times since may , though all but one were about iran rather than russia).
"malign behavior" has been common language for pompeo this whole year, as it has for other state department officials like jon huntsman . what's more, you don't need data science to notice the letter spends three paragraphs on foreign policy (and praises "the rest of the administration" on that front). i'm not a pundit or a political journalist, but i can't resist speculating a bit. pompeo is named by the weekly standard as one of four likely authors of the op-ed, but even if he's not the author my guess would be someone in the state department.
Conclusion: Opening the Black Box
It's worth emphasizing again that this is just my guess based on a single piece of language (it's nowhere close to the certainty of my analysis of Trump's Twitter account during the campaign, which was statistically significant enough that I'd be willing to consider it "proof").
I was fairly skeptical from the start that we could get strong results with document-comparison methods like this, especially on such a small article. That opinion mirrored those of people with much more expertise than I have.
But I'm satisfied with this analysis, both as a demonstration of tidytext methods and as a lesson in the importance of model interpretability. When we ran the tf-idf comparison, we knew something was wrong because @realDonaldTrump appeared at the top. But what if Trump hadn't been the one to mention Russia the most, or if another false positive had caused an account to rise to the top? Breaking similarity scores down by word is a useful way to interrogate our model and understand its output. (See here for a similar article about understanding the components of a model.)
Published at DZone with permission of David Robinson, DZone MVB. See the original article here.