Election Pulse

What Twitter is saying about the presidential race

Election Pulse shows changes in Twitter sentiment toward the top two U.S. presidential candidates. The algorithm analyzes nearly 1 million tweets per day using state-of-the-art natural language processing and big data tools.

Comparing yesterday to the day before, sentiment has shifted toward

Comparing the past three days to the three days before that, sentiment has shifted toward

Comparing this week to the week before, sentiment has shifted toward

Methodology

Election Pulse collects data from the Twitter Streaming API using the public POST statuses/filter endpoint, which returns all tweets matching a set of search parameters. In this case, the search parameters are the top two presidential candidates' names and Twitter handles, as follows:

["trump", "clinton", "hillary", "hillaryclinton", "realdonaldtrump", "hillary's", "trump's", "hillarys", "trumps"]

Twitter returns any tweet whose text, full URL, hashtags, or screen names match any one of these terms. This generally yields approximately 2 million tweets per day.
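As a rough illustration, the term list could be packed into a request like this. Twitter's streaming filter takes tracked terms as a comma-separated `track` parameter; the function name `build_filter_params` is hypothetical, and the authenticated request itself is left out.

```python
# Search terms copied from the list above.
TRACK_TERMS = [
    "trump", "clinton", "hillary", "hillaryclinton", "realdonaldtrump",
    "hillary's", "trump's", "hillarys", "trumps",
]


def build_filter_params(terms):
    """Return form parameters for a streaming filter request.

    Hypothetical helper: it only builds the comma-separated `track`
    value and does not contact Twitter.
    """
    return {"track": ",".join(terms)}


params = build_filter_params(TRACK_TERMS)
```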

Then Election Pulse filters on two criteria:

  1. The algorithm filters out tweets in which neither presidential candidate was mentioned (i.e., the match was only in the URL), or in which both presidential candidates were mentioned (in which case it would be difficult to decide which candidate is the target of the tweet's sentiment).
  2. The algorithm filters out tweets in which the location of the user, based on the user's profile, does not match any recognizable U.S. state. Note that this method also filters out Twitter users who specify only a city, and not a state, in their profile, or who specify only "United States", though these cases are in the minority. This was done to simplify data processing, given the large variety in self-reported place name spellings and abbreviations.

After the filtering process is complete, approximately 700,000 tweets per day remain, with roughly half of the dropouts attributable to each of the two filter criteria.
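The two filters can be sketched roughly as follows. This is an illustration and not the project's actual code: the term-to-candidate mapping, the abbreviated state table, and all function names are assumptions made for the example.

```python
import re

# Hypothetical mapping from search terms to a candidate label.
CANDIDATE_TERMS = {
    "trump": "trump", "trump's": "trump", "trumps": "trump",
    "realdonaldtrump": "trump",
    "clinton": "clinton", "hillary": "clinton", "hillary's": "clinton",
    "hillarys": "clinton", "hillaryclinton": "clinton",
}

# Abbreviated for illustration; a real table would cover every state.
US_STATES = {
    "california": "CA", "ca": "CA",
    "new york": "NY", "ny": "NY",
    "texas": "TX", "tx": "TX",
}


def candidates_mentioned(text):
    """Return the set of candidates named in the tweet text."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return {CANDIDATE_TERMS[t.strip("'")] for t in tokens
            if t.strip("'") in CANDIDATE_TERMS}


def state_from_location(location):
    """Return a state code, or None if no state is recognized."""
    if not location:
        return None
    # Try whole comma-separated parts first ("New York"), then single tokens.
    parts = [p.strip().lower() for p in location.split(",")]
    tokens = location.replace(",", " ").lower().split()
    for candidate in parts + tokens:
        if candidate in US_STATES:
            return US_STATES[candidate]
    return None


def keep_tweet(text, location):
    """Keep tweets naming exactly one candidate from a recognizable state."""
    return (len(candidates_mentioned(text)) == 1
            and state_from_location(location) is not None)
```

Note that matching two-letter abbreviations case-insensitively can misfire on ordinary words such as "in" or "or"; a fuller version would need to guard against that.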

Each tweet is then analyzed with the Stanford CoreNLP Server to produce a sentiment score ranging from 1000 (very negative) to 3000 (very positive). CoreNLP tokenizes and annotates the text, then applies a Recursive Neural Tensor Network to the grammatical structure of each sentence, taking into account the context of each successive word. The model achieves about 85% accuracy on single-sentence classification tasks, the best in the field (as a Berkeley grad, this is difficult to admit).

Election Pulse applies CoreNLP as follows:

  1. Election Pulse replaces each emoji with the keywords that represent it in Unicode's Full Emoji list v3. The keywords are separated by spaces, and at the end of the keyword list the algorithm inserts a period, so each emoji is converted to its own sentence of keywords. For example, the tweet "I like Trump! 😁" would be replaced with "I like Trump! eye face grin smile." so it can be properly processed by the CoreNLP Server.
  2. The algorithm passes CoreNLP the text of the tweet, including any usernames and hashtags in the tweet, but not the username of the person who created the tweet. CoreNLP then splits the tweet into individual sentences and evaluates the sentiment of each sentence as Negative (1), Neutral (2), or Positive (3). The sentiment score for the tweet is the average of the sentence scores, multiplied by 1000. In the example from step 1, the tweet "I like Trump! eye face grin smile." would likely rate as a 3 for both sentences, so the sentiment score for the tweet would be (3 + 3) * 1000 / 2 = 3000.
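Both steps can be sketched in a few lines. This is a minimal illustration under stated assumptions: the one-entry keyword table stands in for the full Unicode list, the example emoji is assumed to be U+1F601 (grinning face with smiling eyes), and the per-sentence labels are supplied by hand instead of coming from a CoreNLP call.

```python
# Hypothetical one-entry excerpt of an emoji-to-keywords table.
EMOJI_KEYWORDS = {"\U0001F601": ["eye", "face", "grin", "smile"]}


def replace_emojis(text):
    """Turn each known emoji into its own keyword sentence."""
    pieces = []
    for ch in text:
        if ch in EMOJI_KEYWORDS:
            pieces.append(" " + " ".join(EMOJI_KEYWORDS[ch]) + ". ")
        else:
            pieces.append(ch)
    # Collapse the extra whitespace introduced around the keyword sentences.
    return " ".join("".join(pieces).split())


def tweet_score(sentence_labels):
    """Average per-sentence labels (1, 2, or 3) and scale by 1000."""
    if not sentence_labels:
        raise ValueError("at least one sentence label is required")
    return sum(sentence_labels) * 1000 / len(sentence_labels)


cleaned = replace_emojis("I like Trump! \U0001F601")
score = tweet_score([3, 3])  # both sentences labeled Positive
```

Iterating character by character only handles single-code-point emojis; sequences such as flags or skin-tone variants would need a longest-match lookup.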

Next, Election Pulse cleans the user location field to assign each user to a U.S. state. The algorithm also identifies the presidential candidate referenced in the tweet and the day of the tweet. Both steps use simple tokenizing (on white space and commas) and substring search to identify the user's state and the candidate under discussion. Finally, it groups all tweets by candidate, state, and day and calculates the average sentiment score for each candidate, state, and day combination.
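The final grouping step amounts to a keyed average. The sketch below shows it on a single machine with made-up records; the function name and the sample data are hypothetical.

```python
from collections import defaultdict


def average_by_group(records):
    """Average scores per (candidate, state, day).

    `records` is an iterable of (candidate, state, day, score) tuples.
    """
    totals = defaultdict(lambda: [0.0, 0])  # key -> [score sum, tweet count]
    for candidate, state, day, score in records:
        entry = totals[(candidate, state, day)]
        entry[0] += score
        entry[1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


# Made-up sample records for illustration only.
sample = [
    ("trump", "TX", "2016-10-01", 3000),
    ("trump", "TX", "2016-10-01", 1000),
    ("clinton", "NY", "2016-10-01", 2000),
]
averages = average_by_group(sample)
```

Keeping a running sum and count per key means the same logic carries over to a distributed reduce, since partial sums and counts from different workers can simply be added together.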

This analysis is computationally expensive: a single new MacBook Pro laptop would need at least 5 hours to process one day's worth of tweets, which would tie up quite a bit of my time and my available computing power. Instead, Election Pulse uses Apache Spark and Amazon EC2 instances to process the data in parallel in just a few minutes. Because 25 m3.large EC2 instances cost less than 50 cents/hour and can process about 1 million records every 3 minutes, the process is quick, efficient, and cheap (and the best part is that Spark was developed at Berkeley's AMPLab).
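A back-of-the-envelope check of those figures, as a sketch. It assumes the 50 cents/hour is the price of the whole 25-instance cluster, that the roughly 700,000 filtered tweets are what gets processed, and that cluster startup time and billing granularity can be ignored.

```python
RECORDS_PER_BATCH = 1_000_000   # records the cluster handles per batch
MINUTES_PER_BATCH = 3           # time for one batch
CLUSTER_COST_PER_HOUR = 0.50    # upper bound, assumed to cover all 25 instances
TWEETS_PER_DAY = 700_000        # filtered tweets per day
LAPTOP_HOURS = 5                # lower bound for a single laptop


def cluster_minutes(records):
    """Estimated cluster processing time in minutes."""
    return records / RECORDS_PER_BATCH * MINUTES_PER_BATCH


def cluster_cost(records):
    """Estimated upper bound on cost in dollars."""
    return cluster_minutes(records) / 60 * CLUSTER_COST_PER_HOUR


minutes = cluster_minutes(TWEETS_PER_DAY)   # about 2.1 minutes
cost = cluster_cost(TWEETS_PER_DAY)         # under 2 cents
speedup = LAPTOP_HOURS * 60 / minutes       # well over 100x
```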

If you have any questions or comments, you can tweet me @grahamimac.