Visualizing the Prime Ministerial Debates
April 29th, 2010
For the first time, the main Prime Ministerial candidates for the 2010 UK General Election will take part in three live debates. Since the BBC have kindly made the full transcripts available, I decided to have a go at analyzing the data and creating a visual representation in the form of word clouds. I am currently working on my own visualization software, but in the meantime these have been done using Wordle.
Preparing the data
The BBC only provides the data in PDF form – to analyze it we need it in text form. Although this is easily done with Acrobat Reader’s “Save as Text” function, the output it produces is not really suitable for automatic processing, so some work has to be done by hand. This basically involves making sure each speaker’s comments are headed by their name, followed by a special character that separates the name from the comment (here I have used ‘@’). This took about 15 minutes.
Download the raw text of the first debate.
Having done that, a command line tool such as awk can be used to split the data by speaker. For example, the following command outputs Clegg’s comments into a separate file:
awk 'BEGIN {RS=""; FS="[@]"} $1=="NC" { print $2 }' debate.txt > clegg.txt
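The same split can be sketched in Python, which is handy if the rest of the analysis is done there anyway. This is only a rough equivalent of the awk command, assuming that records are separated by blank lines and that each record looks like a speaker code, then ‘@’, then the comment. The function name split_by_speaker, the sample text and the speaker code GB are made up for illustration; only NC comes from the command above.

```python
import re
from collections import defaultdict

def split_by_speaker(raw, sep="@"):
    """Group comments by speaker code.

    Assumes blank-line separated records of the form "CODE@comment text".
    """
    comments = defaultdict(list)
    for record in re.split(r"\n\s*\n", raw):
        record = record.strip()
        if sep not in record:
            continue  # skip anything that is not a "speaker@comment" record
        speaker, _, comment = record.partition(sep)
        # collapse line breaks inside a comment into single spaces
        comments[speaker.strip()].append(" ".join(comment.split()))
    return dict(comments)

# Hypothetical sample in the assumed format; the real debate.txt may differ.
sample = (
    "NC@I believe we can do\nsomething different.\n\n"
    "GB@Let me answer that.\n\n"
    "NC@Thank you."
)
by_speaker = split_by_speaker(sample)
print(by_speaker["NC"])
```

Joining the list for one speaker with newlines and writing it to a file would give roughly what the awk command writes to clegg.txt.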
Parsing the data
Python’s Natural Language Toolkit (NLTK) provides all the functions needed to analyze the text data, such as tokenizing the text by word and even categorizing each word by type, such as proper nouns and prepositions. For example, having extracted Nick Clegg’s speech as above and read the file into a string called text, the following commands split the input into sentences and then tokenize each sentence into words, producing a complete word list:
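from __future__ import division
import nltk, re, pprint

# text holds the contents of clegg.txt as a single string
sentences = nltk.sent_tokenize(text)
tokens = []
for s in sentences:
    tokens.extend(nltk.word_tokenize(s))
# lower-case every token so that "Tax" and "tax" count as the same word
words = [t.lower() for t in tokens]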
We can then tag each word with its part of speech (POS) and extract a list by tag. For example, using the word tokens above, the following extracts the words tagged NN, which marks singular common nouns in the Penn Treebank tag set that NLTK uses by default:
# this operation takes some time to execute
taggedwords = nltk.pos_tag(words)
# keep only the words whose tag is 'NN'
nouns = [word for (word, tag) in taggedwords if tag == 'NN']
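Filtering on 'NN' alone leaves out plural and proper nouns, which carry the tags NNS, NNP and NNPS. A possible variation is sketched below. It runs on a small hand-written list in the (word, tag) shape that nltk.pos_tag returns, so it does not need NLTK; the sample words, their tags and the helper name extract_nouns are assumptions for illustration, not output from the debate data.

```python
# Hypothetical sample in the (word, tag) shape that nltk.pos_tag returns;
# the tags were written by hand for illustration, not produced by NLTK.
tagged_sample = [
    ("the", "DT"), ("schools", "NNS"), ("need", "VBP"),
    ("a", "DT"), ("fair", "JJ"), ("deal", "NN"),
    ("in", "IN"), ("britain", "NNP"),
]

def extract_nouns(tagged, include_proper=True):
    """Return the words whose Penn Treebank tag marks them as a noun."""
    noun_tags = {"NN", "NNS"}  # singular and plural common nouns
    if include_proper:
        noun_tags |= {"NNP", "NNPS"}  # singular and plural proper nouns
    return [word for (word, tag) in tagged if tag in noun_tags]

all_nouns = extract_nouns(tagged_sample)
common_nouns = extract_nouns(tagged_sample, include_proper=False)
print(all_nouns)
print(common_nouns)
```

With the real data, taggedwords from the step above would take the place of tagged_sample.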
Such a list is enough to use with Wordle, but it’s also straightforward to create a word frequency list for use with other software:
# this operation takes some time to execute
nounfrequencies = nltk.FreqDist(nouns)
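To hand the counts to other software they usually need to be written out as text. The sketch below uses collections.Counter from the standard library in place of FreqDist so that it runs without NLTK, and formats each entry as word:count. That layout is an assumption about what the receiving tool expects and may need adjusting; the helper name frequency_lines and the sample words are invented for illustration.

```python
from collections import Counter

def frequency_lines(words, top=None):
    """Return "word:count" strings, most frequent word first."""
    counts = Counter(words)
    return ["%s:%d" % (word, n) for word, n in counts.most_common(top)]

# Hypothetical noun list standing in for the nouns extracted earlier.
sample_nouns = ["tax", "schools", "tax", "police", "schools", "tax"]
lines = frequency_lines(sample_nouns)
print("\n".join(lines))
```

Passing top=50, for example, would keep only the fifty most frequent words, which is often enough for a word cloud.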
Going further
Word frequency analyses are fairly straightforward, but NLTK is a powerful library that allows for much more detailed and informative analysis based on grammar and sentence structure. It would be interesting to see the results of a more sophisticated approach.
Gallery of word clouds from the first debate



