Visualizing the Prime Ministerial Debates

April 29th, 2010
Nouns used by Gordon Brown

For the first time, the main Prime Ministerial candidates in the 2010 UK General Election will take part in three live debates. Since the BBC have kindly made the full transcripts available, I decided to have a go at analyzing the data and creating a visual representation in the form of word clouds. I am currently working on my own visualization software, but in the meantime these have been done using Wordle.

Preparing the data

Adjectives and Adverbs used by Nick Clegg

The BBC only provides the data in PDF form – to analyze it we need it in text form. Although this is easily done with Acrobat Reader’s “Save as Text” function, the output it produces is not really suitable for automatic processing, so some work has to be done by hand. This basically involves making sure each speaker’s comments are headed by their name and some kind of special character to split each record (here I have used ‘@’), which took about 15 minutes or so.

Download the raw text of the first debate.

Having done that, a command line tool such as awk can be used to split the data by speaker. For example, the following command outputs Clegg’s comments into a separate file:

Listing :
awk 'BEGIN {RS=""; FS="[@]"} $1=="NC" { print $2 }' debate.txt > clegg.txt
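
The same split can also be sketched in Python, which is handy if the rest of the processing happens there anyway. This is a minimal sketch that assumes the layout implied by the awk command: records separated by blank lines, each starting with a speaker code followed by ‘@’. The function name split_by_speaker, the sample text and the code ‘GB’ are made up for illustration.

```python
import re
from collections import defaultdict

def split_by_speaker(transcript, separator="@"):
    """Group comments by speaker code from blank-line-separated records."""
    comments = defaultdict(list)
    # Assumed layout: one record per paragraph, paragraphs separated by blank lines.
    for record in re.split(r"\n\s*\n", transcript.strip()):
        speaker, found, comment = record.partition(separator)
        if found:  # skip paragraphs that contain no separator
            # Collapse line breaks inside a comment into single spaces.
            comments[speaker.strip()].append(" ".join(comment.split()))
    return dict(comments)

# Made-up sample in the assumed layout, not text from the actual debate.
sample = (
    "NC@First comment\nspanning two lines.\n\n"
    "GB@Another comment.\n\n"
    "NC@A second comment."
)
by_speaker = split_by_speaker(sample)
print(by_speaker["NC"])
```

Joining the list for one speaker with newlines gives roughly the kind of text that the awk command writes to clegg.txt.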

Parsing the data

Verbs used by David Cameron

Python’s Natural Language Toolkit (NLTK) provides all the functions needed to analyze the text data, such as tokenizing the text by word and even categorizing each word by type, such as proper nouns and prepositions. For example, having extracted Nick Clegg’s speech as above and read the file as a string using Python, the following commands split the input into sentences and then tokenize each sentence, producing a complete word list.

Listing :
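import nltk

# read Clegg's comments, extracted with awk above, as a single string
with open('clegg.txt') as f:
    text = f.read()

# split the text into sentences, then each sentence into word tokens
sentences = nltk.sent_tokenize(text)
tokens = []
for s in sentences:
    tokens.extend(nltk.word_tokenize(s))
# lower-case every token so that counts are case-insensitive
words = [t.lower() for t in tokens]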

We can then tag each word with its part of speech (POS) and extract a list accordingly. For example, using the word tokens above, the following extracts the nouns:

Listing :
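# this operation takes some time to execute
taggedwords = nltk.pos_tag(words)
# 'NN' matches singular common nouns only; use tag.startswith('NN')
# to include plurals and proper nouns as well
nouns = [word for (word, tag) in taggedwords if tag == 'NN']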
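
The same filtering idea extends to the other word classes shown in the word clouds (verbs, adjectives and adverbs). Below is a minimal sketch that assumes the tagger returns Penn Treebank tags, as nltk.pos_tag does by default. The names TAG_PREFIXES and words_by_class are made up for illustration, and the sample pairs are hand-written rather than real tagger output.

```python
# Penn Treebank tag prefixes for a few broad word classes.
TAG_PREFIXES = {
    "nouns": ("NN",),       # NN, NNS, NNP, NNPS
    "verbs": ("VB",),       # VB, VBD, VBG, VBN, VBP, VBZ
    "adjectives": ("JJ",),  # JJ, JJR, JJS
    "adverbs": ("RB",),     # RB, RBR, RBS
}

def words_by_class(tagged, word_class):
    """Return the words whose tag starts with a prefix listed for word_class."""
    prefixes = TAG_PREFIXES[word_class]
    return [word for (word, tag) in tagged if tag.startswith(prefixes)]

# Hand-written (word, tag) pairs for illustration only;
# real input would be the list returned by nltk.pos_tag.
sample_tagged = [
    ("the", "DT"), ("economy", "NN"), ("is", "VBZ"), ("growing", "VBG"),
    ("slowly", "RB"), ("jobs", "NNS"), ("fair", "JJ"),
]

print(words_by_class(sample_tagged, "nouns"))
print(words_by_class(sample_tagged, "verbs"))
```

Matching on a prefix rather than an exact tag keeps plurals, proper nouns and inflected verb forms in the list, which is usually what a word cloud wants.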

Such a list is enough to use with Wordle, but it’s also straightforward to create a word frequency list for use with other software.

Listing :
# count how many times each noun occurs
nounfrequencies = nltk.FreqDist(nouns)
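
To hand the frequencies to other software, they need to be written out in some plain format. The sketch below writes one tab-separated "word, count" line per word, most frequent first. It uses collections.Counter from the standard library as a stand-in for nltk.FreqDist so that it runs without NLTK. The function name write_frequencies, the tab-separated layout and the sample words are assumptions for illustration; the format a particular tool expects may differ.

```python
import csv
import io
from collections import Counter

def write_frequencies(words, out, limit=None):
    """Write 'word<TAB>count' lines to out, most frequent first."""
    counts = Counter(words)
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    # most_common(None) returns every word, ordered by descending count.
    for word, count in counts.most_common(limit):
        writer.writerow([word, count])
    return counts

# Made-up noun list for illustration; real input would be the nouns list above.
sample_nouns = ["economy", "tax", "economy", "jobs", "tax", "economy"]

buffer = io.StringIO()  # an open file object works the same way
counts = write_frequencies(sample_nouns, buffer)
print(buffer.getvalue())
```

Passing an open file instead of the StringIO buffer saves the list to disk, and limit restricts the output to the top few words.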

Going further

Word frequency analyses are fairly straightforward. However, NLTK is a powerful library and allows for much more detailed and informative analysis based on grammar and sentence structure. It would be interesting to see the results of a more sophisticated approach.

Gallery of word clouds from the first debate
