What Twitter is saying about the presidential race
Election Pulse shows changes in Twitter sentiment toward the top two U.S. presidential candidates. The algorithm analyzes nearly 1 million tweets per day using state-of-the-art natural language processing and big data tools.
Comparing yesterday to the day before, sentiment has shifted toward
Comparing the past three days to the three days before that, sentiment has shifted toward
Comparing this week to the week before, sentiment has shifted toward
Election Pulse collects data from the Twitter Streaming API using the Public POST filter, which means it collects all tweets matching select search parameters. In this case, the search parameters are the two top presidential candidates' names and Twitter handles, as follows:
["trump", "clinton", "hillary","hillaryclinton", "realdonaldtrump", "hillary's", "trump's", "hillarys", "trumps"]
Twitter returns any tweet whose text, full URL, hashtags, or screen names match any one of these terms. This generally results in approximately 2 million tweets collected per day.
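To make the matching rule concrete, here is a minimal local sketch of how a term list like the one above could be checked against a tweet's text. It is only an approximation for illustration: the real matching happens on Twitter's side and also covers URLs, hashtags, and screen names, and the function name `matches_track_terms` is hypothetical, not part of Election Pulse or any Twitter library.

```python
import re

# Track terms copied from the list above.
TRACK_TERMS = [
    "trump", "clinton", "hillary", "hillaryclinton", "realdonaldtrump",
    "hillary's", "trump's", "hillarys", "trumps",
]


def matches_track_terms(text, terms=TRACK_TERMS):
    """Return True if any term appears as a whole token in the text.

    Assumption: matching is case-insensitive and token-based, so
    "@realDonaldTrump" matches but "trumpet" does not.
    """
    # Split on anything that is not a word character or an apostrophe,
    # then trim stray quote marks from the edges of each token.
    raw_tokens = re.split(r"[^\w']+", text.lower())
    tokens = {tok.strip("'") for tok in raw_tokens if tok}
    return any(term in tokens for term in terms)
```

Because the check works on whole tokens, a tweet mentioning "@realDonaldTrump" or "Hillary's" counts as a match, while one about "trumpet lessons" does not.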
Then Election Pulse filters on two criteria:
After the filtering process is complete, approximately 700,000 tweets per day remain, with each of the two filter criteria accounting for roughly half of the dropped tweets.
Each tweet is then analyzed using the Stanford CoreNLP Server to produce a sentiment score ranging from 1000 (very negative) to 3000 (very positive). CoreNLP uses state-of-the-art technology to tokenize, annotate, and apply a Recursive Neural Tensor Network on grammar that takes into account the context of each successive word in a sentence. The model achieves about 85% accuracy on single-sentence classification tasks, the best in the field (as a Berkeley grad, this is difficult to admit).
Election Pulse applies CoreNLP as follows:
Next, Election Pulse cleans the user location field to assign each user to a U.S. state. The algorithm also identifies the name of the presidential candidate referenced in the tweet and the day of the tweet. Both techniques use simple tokenizing (on white space and commas) and substring search to identify the user's state and the candidate up for discussion. Finally, it groups all tweets by candidate, state, and day and calculates the average sentiment score for each candidate, state, and day combination.
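The cleaning and grouping steps could be sketched roughly as follows. This is not the actual Election Pulse code: `STATE_LOOKUP` is a deliberately tiny, hypothetical table, the tweet dictionaries with `text`, `location`, `day`, and `score` keys are an assumed shape, and counting a tweet toward both candidates when it mentions both is an assumption the description above does not settle.

```python
import re
from collections import defaultdict

# Hypothetical lookup; a real table would cover every state and abbreviation.
STATE_LOOKUP = {
    "ca": "CA", "california": "CA",
    "ny": "NY", "new york": "NY",
    "tx": "TX", "texas": "TX",
}

# Candidate terms taken from the search parameters listed earlier.
CANDIDATE_TERMS = {
    "trump": {"trump", "trump's", "trumps", "realdonaldtrump"},
    "clinton": {"clinton", "hillary", "hillary's", "hillarys", "hillaryclinton"},
}


def assign_state(location):
    """Map a free-text location such as "Austin, TX" to a state code, or None."""
    if not location:
        return None
    # Try comma-separated parts first, last part first ("Austin, TX" -> "tx").
    parts = [part.strip().lower() for part in location.split(",")]
    for part in reversed(parts):
        if part in STATE_LOOKUP:
            return STATE_LOOKUP[part]
    # Fall back to individual whitespace-separated tokens.
    for token in location.lower().replace(",", " ").split():
        if token in STATE_LOOKUP:
            return STATE_LOOKUP[token]
    return None


def find_candidates(text):
    """Return the sorted names of the candidates mentioned in the tweet text."""
    tokens = {tok.strip("'") for tok in re.split(r"[^\w']+", text.lower()) if tok}
    return sorted(name for name, terms in CANDIDATE_TERMS.items() if tokens & terms)


def average_sentiment(tweets):
    """Average the score for each (candidate, state, day) combination."""
    totals = defaultdict(lambda: [0.0, 0])  # key -> [sum of scores, count]
    for tweet in tweets:
        state = assign_state(tweet["location"])
        if state is None:
            continue  # skip users who cannot be placed in a state
        for candidate in find_candidates(tweet["text"]):
            bucket = totals[(candidate, state, tweet["day"])]
            bucket[0] += tweet["score"]
            bucket[1] += 1
    return {key: total / count for key, (total, count) in totals.items()}
```

The result is a dictionary keyed by `(candidate, state, day)`, which maps directly onto the per-state, per-day averages described above.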
This analysis is computationally expensive: a single new MacBook Pro would need at least 5 hours to process one day's worth of tweets, tying up both my time and my computer. Instead, Election Pulse uses Apache Spark on Amazon EC2 instances to process the data in parallel in just a few minutes. Because 25 m3.large EC2 instances cost less than 50 cents/hour and can process about 1 million records every 3 minutes, the process is quick, efficient, and cheap (and the best part is that Spark was developed at Berkeley's AMPLab).
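As a back-of-the-envelope check on those figures, the short calculation below plugs in the numbers quoted above. It assumes the 50 cents/hour is the total for all 25 instances, treats it as an upper bound, and ignores cluster startup time and any minimum billing increment; the function names are purely illustrative.

```python
# Figures quoted in the text above.
RECORDS_PER_BATCH = 1_000_000   # records the cluster handles ...
MINUTES_PER_BATCH = 3           # ... in about this many minutes
CLUSTER_COST_PER_HOUR = 0.50    # dollars; assumed total for all 25 instances
LAPTOP_HOURS = 5                # lower bound for one laptop, per day of tweets


def cluster_minutes(num_tweets):
    """Estimated processing time on the cluster, in minutes."""
    return num_tweets / RECORDS_PER_BATCH * MINUTES_PER_BATCH


def cluster_cost_dollars(num_tweets):
    """Estimated compute cost for one run, in dollars."""
    return cluster_minutes(num_tweets) / 60 * CLUSTER_COST_PER_HOUR


def speedup_vs_laptop(num_tweets):
    """How many times faster the cluster is than the 5-hour laptop estimate."""
    return LAPTOP_HOURS * 60 / cluster_minutes(num_tweets)
```

For the roughly 700,000 tweets left after filtering, this works out to about 2.1 minutes of cluster time and under two cents of compute per day, which is consistent with "just a few minutes."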
If you have any questions or comments, you can tweet me @grahamimac.