2012 US Presidential Election Timeline in Twitter

I was looking at the spikes in Twitter traffic about Obama and Romney during the 2012 US presidential election. So far this is just the volume of tweets containing “obama” or “romney” (inclusive or, case insensitive), but in the future I would like to make similar plots for sentiment.


The data is based on 315,000,000 tweets, which came to more than 1 TB including all the metadata.  Only the subset matching “obama” or “romney” was used for the chart, but it still took 11 hours to tabulate the results on my desktop computer (the one that I mentioned in the previous post).
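The tabulation step can be sketched roughly as follows. This is not the R code from the gist, just a minimal Python illustration of counting matching tweets per hour. It assumes one JSON record per line with a "body" text field and an ISO 8601 "postedTime" field; those field names and the function name `hourly_counts` are assumptions made for the example.

```python
import json
import re
from collections import Counter

# Case-insensitive match for either candidate's name (inclusive or).
KEYWORDS = re.compile(r"obama|romney", re.IGNORECASE)

def hourly_counts(lines):
    """Count matching tweets per hour from an iterable of JSON strings.

    Assumes each record has a "body" field and an ISO 8601 "postedTime"
    field (hypothetical field names for this sketch).
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed records
        if not isinstance(record, dict):
            continue
        body = record.get("body", "")
        posted = record.get("postedTime", "")
        if not isinstance(body, str) or not isinstance(posted, str):
            continue
        if KEYWORDS.search(body) and len(posted) >= 13:
            counts[posted[:13]] += 1  # bucket key looks like "2012-11-06T20"
    return counts
```

Passing an open file handle as `lines` streams the records one at a time, which matters when the input is far larger than memory.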

This plot was made using R and Inkscape.  Here’s the input data table, and here’s a gist of the R code.


RAID setup

I recently got a new computer for work, a 32 GB, 8 core, RAID enabled machine (woohoo!).  I wanted to test exactly how fast the RAID-0 setup actually was.  Transferring about 820 GB from a USB-3 external HD:

time cp -rp /media/ts-backup-2/data_gnip/ /raid/
real    176m14.507s
user    0m23.533s
sys     44m27.038s

or about 3 hours, which works out to roughly 280 GB per hour, 4.7 GB per minute,
or 78 MB per second.  I did the same copy procedure to a regular hdd (7200RPM,
bla, bla), and the result was pretty much the same, which leads me to believe
(hope) that the speed was limited by the usb hdd.
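The rate arithmetic is easy to get slightly wrong by hand, so here is a small sketch that derives the figures from the `real` time reported above. It assumes the transfer was exactly 820 GB (the post only says "about") and uses decimal units (1 GB = 1000 MB); the function name `throughput` is made up for the example.

```python
def throughput(total_gb, minutes, seconds):
    """Return (GB per hour, GB per minute, MB per second) for a transfer."""
    elapsed_s = minutes * 60 + seconds
    gb_per_s = total_gb / elapsed_s
    # Decimal units: 1 GB is treated as 1000 MB.
    return gb_per_s * 3600, gb_per_s * 60, gb_per_s * 1000

# "real 176m14.507s" from the cp timing, with the assumed 820 GB total.
per_hour, per_min, per_sec = throughput(820, 176, 14.507)
print(f"{per_hour:.0f} GB/h, {per_min:.1f} GB/min, {per_sec:.0f} MB/s")
```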

One thing that was taking a lot of time was simply running wc on all the Twitter
log files, so now I’m going to compare running wc on the RAID and on the regular hdd:

time wc /raid/data_gnip/tweets_2*txt
348649707  21225017895 877863706316 total

real    219m18.297s
user    213m20.851s
sys     5m25.995s

which is about 3 hrs 40 minutes, or about 1.6 million lines (rather big json
records) per minute or 26,000 lines per second.

comparing that with the regular hdd:

time wc /data/data_gnip/tweets_2*txt

348649707  21225017895 877863706316 total

real    223m9.239s
user    216m21.881s
sys     6m3.174s

So I’m not really seeing the benefit of RAID at this point, but I’m still hoping
to find it.  Since this was a sequential read, one file after another, I thought
I might see the benefit with parallel reads:

time find /raid -name "tweets_2*txt" | parallel wc

real    67m58.278s
user    425m54.987s
sys     7m9.929s

So that was a lot faster (3x), and you can see the multiple
core effect in that the user CPU time is higher than the real time.

time find /data -name "tweets_2*txt" | parallel wc

real    305m27.366s
user    265m43.306s
sys     6m49.296s

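Aha, I finally found a way to see the speedup from RAID: with parallel reads, the
regular disk took about 305 minutes, slower than its own sequential run, while
the RAID took about 68 minutes.  During this run the CPUs in the system monitor
seemed less saturated, and more jagged/noisy.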
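The same idea of counting lines in several files at once can be sketched in Python. This is not what GNU parallel does internally, only an illustration of fanning files out to workers. It uses threads for simplicity, and the names `count_lines` and `count_lines_parallel` are made up for the example. Note that it counts only newlines, whereas plain wc also counts words and bytes.

```python
from concurrent.futures import ThreadPoolExecutor

def count_lines(path, chunk_size=1 << 20):
    """Count newline bytes in one file, reading it in 1 MiB chunks."""
    total = 0
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            total += chunk.count(b"\n")
    return total

def count_lines_parallel(paths, workers=8):
    """Count lines in several files concurrently; return {path: count}."""
    paths = list(paths)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so zip pairs each path with its count.
        return dict(zip(paths, pool.map(count_lines, paths)))
```

A call such as `count_lines_parallel(glob.glob("/raid/data_gnip/tweets_2*txt"))` would mirror the command above. If the work turns out to be CPU-bound rather than disk-bound, swapping in `ProcessPoolExecutor` is one option to try.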

Still, in the process of waiting for these programs to terminate, I found an
approximate line counter on a forum.  It takes a partial line count from a
portion of a big file and then extrapolates the approximate total based on
the file size.
Thanks to Erik Aronesty for providing this perl script!
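To illustrate the extrapolation idea, here is a minimal Python sketch. It is not Erik Aronesty's alc.pl, whose details may differ. It samples only the start of the file and assumes line lengths there are representative of the rest; the name `approx_line_count` and the 1 MiB default sample size are choices made for this example.

```python
import os

def approx_line_count(path, sample_bytes=1 << 20):
    """Estimate a file's line count from the newline density of its first bytes."""
    size = os.path.getsize(path)
    if size == 0:
        return 0
    with open(path, "rb") as handle:
        sample = handle.read(sample_bytes)
    newlines = sample.count(b"\n")
    if len(sample) >= size:
        return newlines  # the whole file was read, so the count is exact
    # Extrapolate: newlines per sampled byte, scaled up to the full file size.
    return round(newlines * size / len(sample))
```

Because only a fixed-size sample is read per file, the running time depends on the number of files, not on their total size.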

time alc.pl /raid/data_gnip/tweets_2*txt

359757377 total
real    0m32.885s
user    0m0.496s
sys     0m0.800s

which is off by about 11 million out of 350 million (an error of roughly 3%), but it finished in 33 seconds instead of the 68 minutes that the parallel wc took (faster by more than a factor of 100).
To be complete, here’s the performance of alc.pl on the regular disk:

time alc.pl /data/data_gnip/tweets_2*txt

359757377 total

real    0m36.171s
user    0m0.639s
sys     0m0.684s

VERDICT: RAID is definitely faster if you use it wisely, but if you just want an approximate line count, you can save a lot of time by using a clever perl script. Of course I’ll be doing more than counting lines, so the RAID should come in handy in the future, but it’s definitely possible to over-engineer things if I’m not careful.

Interspeech 2012, Day 4

Today, the keynote speaker was Garet Lahvis, who presented on analyzing prosody in ultrasonic rat vocalizations.  I found this pretty interesting.  Even though the signal processing aspect of the study was very basic, the novel part is looking at animals instead of humans, and (non-human) lab rats that have been removed from their natural environment for several generations.  Also, an interesting aspect of this study is the implication for emotion research.  One of the interesting findings is that for a certain strain of rats, b6, positive valence emotion vocalization recordings had similar effects on other rats as food and morphine, and similar for negative valence events like foot shocks.  Another strain of rats, BALB, had less response to emotional vocalizations from other rats.  This study had implications for both substance abuse studies and autism research.

One of the interesting posters I saw was on building speech recognition language models from twitter.

Interspeech 2012, day 3

Today the keynote speaker was Michael Riley from Google, who talked about OpenFst and some of the new aspects of their work.  Most of the talk was a bit too basic for people who know about FSTs and too advanced for people who don’t, but it was still interesting as a refresher.  I was most interested in the pushdown automata/transducers that they added to OpenFst recently.  The discussion about some of the tradeoffs in the design of FSTs was also interesting, e.g., time vs. space, static vs. dynamic machines, and centralized vs. distributed algorithms, as was the discussion about the different types of artificial languages and how they relate to parser types.  The idea was that with a pushdown automaton, the determinization/minimization/epsilon removal is no longer unique, and different strategies result in different types of parsers, like Earley, CYK, etc.  I think I would get more out of this talk if I refreshed my knowledge of parser types.

One of the sessions that caught my attention was the language modelling session.  I missed the talk on paraphrastic language models, but I read some of the paper and it seems promising enough to read in more detail.  The IBM talk on neural network language models was interesting, and it was pretty funny that some of their models took 50 days to train.  The main results of this work were training speedups rather than performance increases.  The low rank plus sparse talk was also interesting and presented a very compelling model.  I want to reexamine some of my work in light of this (the fuzzy logic model of emotion dimensions seems analogous to the continuous low-rank component, and the emo20q question-answer based model analogous to the discrete, sparse component).

The reception at the Portland Art Museum was very nice.  The museum had a lot of beautiful art, and there was a tango band in the upstairs ballroom that I enjoyed with a researcher from Toyota who was a tango fan.  Here are some pictures of the reception.

After the reception, the SAILers went out for drinks.  Here are some of the pictures that are fit for the public 🙂


Interspeech 2012, day 2

Today the keynote speaker was Roger Dannenberg, who presented on music understanding and the future of music performance.  It was a pretty interesting talk, and it even included him playing his trumpet in a demo.  It covered both instrumental and speech/voice based applications and made a good point about how big a market this area is.

I didn’t make any notes about the presentations today but I was asked to be a reporter for the student industry academic lunch (that is why I decided to blog about the rest of the conference in the first place–glad I did b/c it helps to reinforce the memory of all the things that I saw, even though it was a bit of work).

The student lunch at the table I was assigned to was hosted by Murat Saraclar (Google and Bogazici University) and Gareth Jones (Dublin City University) and the topic of discussion was spoken term detection and spoken document retrieval.  The students that were attending were: Tsung-Hsien Wen (National Taiwan University), Deepak Krishnarajanaga Thotappa (IIT Guwahati),  Haiyang Li (Harbin Institute of Technology), Atta Norouzian (McGill), Hungya Lee (National Taiwan University), Maria Eskevich (Dublin City University), and myself, Abe Kazemzadeh (USC SAIL Lab). We spent most of the time doing introductions and learning about each other’s research, but later it became more of a conversational discussion.  Some topics that came up in the discussion were:

  • the differences between term detection and document retrieval
  • differences in data sources, e.g., broadcast news vs. meeting data
  • discriminative language models
  • the work of Amit Singhal at Google
  • Malach corpus of spoken video interviews
  • an older NIST overview paper on spoken document retrieval

Here are some pictures of the student lunch:

Also, today was the student reception at the Crystal Ballroom.  I got side-tracked and ended up at a concert by Tango Alpha Tango at the Crystal Hotel, which was next to the place that I was supposed to go.  The concert was pretty good and I did finally make it to the tail end of the student reception.  The dance floor of the Crystal Ballroom was on springs or something, so it was almost like a trampoline.  Unfortunately no one was dancing.

Interspeech 2012, day 1

Today, I missed the beginning of the keynote b/c I was finding where my poster needed to go, so I didn’t follow much of the part that I saw.  Actually I skipped most of the morning presentations so I could practice.  I did go to the welcome ceremony, and they had a sort of introductory keynote that was very interesting. The speaker, Rex Ziak, is a Lewis and Clark scholar, and he gave a presentation about how language and translation played a big role in Lewis and Clark’s expedition from Saint Louis to the West Coast.  Supposedly Thomas Jefferson, who commissioned the expedition, also instituted a linguistic component of the mission: to collect data on Native American languages by building a Swadesh list of common words.  Unfortunately, the story ended in disaster.  When Jefferson retired and returned to his estate to study the linguistic data, a thief had ransacked the chest full of papers and, realizing that it had no monetary value, dumped the contents into a river.

Coincidentally, my paper was also about collecting ethnographic data, but using a more computational methodology and only in English and only about emotion words.  For my presentation, I got a fair number of people interested in my poster (emo20qPosterForInterspeech2012) about recent work on emo20q.  One of the interesting things is seeing who knows about the game of twenty questions.  Sometimes, I described the experiment by explaining emo20q as “twenty questions played with emotion words instead of physical objects”.  The problem was that some people had never heard of twenty questions (it is a very international conference).  So far, I can’t generalize about which cultures have the twenty questions game and which don’t; it seemed like Chinese and Germans don’t have the game, but it could also vary from person to person.  I tried googling for “which cultures play twenty questions” and I found a fun but unrelated youtube link.  Other than that, I didn’t find any information online.

One talk that I found interesting was “Prosodic Entrainment in an Information-Driven Dialog System”.  I liked this one b/c it not only tried to measure different types of entrainment, but used them to modify the dialog system behavior (mainly to correct error states caused by hyperarticulation).  There was also a paper about using the semantic web for unsupervised natural language semantic parsing.  This seemed interesting in theory but I didn’t really get into the presentation, so perhaps I’ll be more entertained by reading the paper.

Interspeech 2012, day 0: tutorial on topic models for acoustic data

The second tutorial I attended was about topic models for acoustic data, given by Bhiksha Raj and Paris Smaragdis. Normally topic models are applied to text data, but the examples for audio are very compelling. The applications include noise removal, speaker separation, reverberation removal, and creation of features for speech recognition. There were lots of cool, even mind-blowing, demos. The one drawback was that the references were not given inline but rather on one big slide in a small font at the very end, so they were not very useful. One of the main things that I learned and want to try in my work is the use of entropic priors instead of Dirichlet priors in the project I’m presenting here. I was only able to find the first half of the presentation online.