<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Planetmarshall &#187; python</title>
	<atom:link href="http://www.planetmarshall.co.uk/tag/python/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.planetmarshall.co.uk</link>
	<description>Andrew Marshall&#039;s blog</description>
	<lastBuildDate>Thu, 10 Nov 2011 17:33:06 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>Visualizing the Prime Ministerial Debates</title>
		<link>http://www.planetmarshall.co.uk/2010/04/visualizing-the-prime-ministerial-debates/</link>
		<comments>http://www.planetmarshall.co.uk/2010/04/visualizing-the-prime-ministerial-debates/#comments</comments>
		<pubDate>Thu, 29 Apr 2010 08:30:19 +0000</pubDate>
		<dc:creator>Andrew</dc:creator>
				<category><![CDATA[Imaging]]></category>
		<category><![CDATA[Software]]></category>
		<category><![CDATA[politics]]></category>
		<category><![CDATA[python]]></category>

		<guid isPermaLink="false">http://www.planetmarshall.co.uk/?p=805</guid>
		<description><![CDATA[For the first time, the main Prime Ministerial candidates in the 2010 UK General Election will take part in three live debates. Since the BBC have kindly made the full transcripts available, I decided to have a go at analyzing &#8230; <a href="http://www.planetmarshall.co.uk/2010/04/visualizing-the-prime-ministerial-debates/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[
<a href="http://www.planetmarshall.co.uk/wp-content/gallery/debates/brown_nouns.png" title="Nouns used by Gordon Brown in the first debate" class="shutterset_singlepic131" >
	<img class="ngg-singlepic ngg-right" src="http://www.planetmarshall.co.uk/wp-content/gallery/cache/131__140x_brown_nouns.png" alt="Nouns used by Gordon Brown" title="Nouns used by Gordon Brown" />
</a>

<p class="pm_first">For the first time, the main Prime Ministerial candidates in the 2010 UK General Election will take part in <a title="Debates page from the BBC" href="http://news.bbc.co.uk/1/hi/uk_politics/election_2010/the_debates/default.stm" target="_blank">three live debates</a>. Since the BBC have kindly made the full transcripts available, I decided to have a go at analyzing the data and creating a <a title="Jump to image gallery" href="#gallery" target="_self">visual representation</a> in the form of word clouds. I am currently working on my own visualization software, but in the meantime these have been done using <a title="Wordle" href="http://www.wordle.net/" target="_blank">Wordle</a>.</p>
<p>    <span id="more-805"></span><br />
<h3>Preparing the data</h3>
<p>
<a href="http://www.planetmarshall.co.uk/wp-content/gallery/debates/clegg_adj.png" title="Adjectives and Adverbs used by Nick Clegg in the first debate" class="shutterset_singlepic138" >
	<img class="ngg-singlepic ngg-left" src="http://www.planetmarshall.co.uk/wp-content/gallery/cache/138__140x_clegg_adj.png" alt="Adjectives and Adverbs used by Nick Clegg" title="Adjectives and Adverbs used by Nick Clegg" />
</a>
The BBC provides the transcripts only as PDF files, so to analyze them we first need them as text. Acrobat Reader&#8217;s &quot;Save as Text&quot; function does the conversion easily, but the output it produces is not really suitable for automatic processing, so some work has to be done by hand. This involves making sure each speaker&#8217;s comments are headed by their name, followed by a special character that separates the name from the comment (here I have used &#8216;@&#8217;), which took about 15 minutes.</p>
<p><a title="First Prime Ministerial debate in raw text form" href="http://bit.ly/9QRrXx" target="_blank">Download </a>the raw text of the first debate.</p>
<p>Having done that, a command line tool such as <a href="http://www.gnu.org/manual/gawk/gawk.html" target="_blank">awk</a> can be used to split the data by speaker. For example, the following command outputs Clegg&#8217;s comments into a separate file:</p>
<pre class="brush: bash; title: ; notranslate">
awk 'BEGIN {RS=&quot;&quot;; FS=&quot;[@]&quot;} $1==&quot;NC&quot; { print $2 }' debate.txt &gt; clegg.txt
</pre>
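<p>The same split can be sketched in Python. This is only a rough equivalent of the awk command, assuming the hand-edited file holds one record per speaker turn, with records separated by blank lines and each record starting with the speaker label and an &#8216;@&#8217;. The function name <code>comments_by_speaker</code>, the labels other than NC and the sample text are made up for illustration.</p>
<pre class="brush: python; title: ; notranslate">
import re

def comments_by_speaker(text, speaker):
    # Records are separated by blank lines, like awk's paragraph mode (RS='')
    records = re.split(r'\n\s*\n', text.strip())
    comments = []
    for record in records:
        # Text before the first '@' is the speaker label, the rest is the comment
        label, sep, comment = record.partition('@')
        if sep and label.strip() == speaker:
            comments.append(comment.strip())
    return comments

# Made-up sample in the assumed format; not taken from the transcript
sample = 'GB@First made-up comment.\n\nNC@Second made-up comment\nspanning two lines.\n\nNC@Third made-up comment.'
clegg = comments_by_speaker(sample, 'NC')
</pre>
<p>Unlike the awk version, which prints only the second field, this keeps everything after the first &#8216;@&#8217; in a record, so a stray &#8216;@&#8217; inside a comment does not truncate it.</p>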
<h3>Parsing the data</h3>
<p>
<a href="http://www.planetmarshall.co.uk/wp-content/gallery/debates/cameron_verbs.png" title="Verbs used by David Cameron in the first debate" class="shutterset_singlepic137" >
	<img class="ngg-singlepic ngg-right" src="http://www.planetmarshall.co.uk/wp-content/gallery/cache/137__140x_cameron_verbs.png" alt="Verbs used by David Cameron" title="Verbs used by David Cameron" />
</a>
<a title="Python homepage" href="http://www.python.org/" target="_blank">Python</a>&#8217;s <a title="The Natural Language Toolkit" href="http://www.nltk.org/" target="_blank">Natural Language Toolkit</a> provides all the functions needed to analyze the text data, such as tokenizing the text by word and even categorizing each word by type, such as proper nouns and prepositions. For example, having extracted Nick Clegg&#8217;s speech as above and read the file as a string using Python, the following commands split the input into sentences and then tokenize each sentence into words, producing a complete lower-case word list.</p>
<pre class="brush: python; title: ; notranslate">
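import nltk

# Read the comments extracted with awk above into a single string
with open('clegg.txt') as f:
    text = f.read()

# Split into sentences, then split each sentence into word tokens
# (the NLTK tokenizer models may need to be fetched first with nltk.download())
sentences = nltk.sent_tokenize(text)
tokens = []
for s in sentences:
    tokens.extend(nltk.word_tokenize(s))
# Lower-case the tokens so that later counts ignore capitalization
words = [t.lower() for t in tokens]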
</pre>
<p>We can then categorize each word with a <a title="Wikipedia page on POS tagging" href="http://en.wikipedia.org/wiki/Part-of-speech_tagging">POS tag</a> and extract the words of a given type. For example, using the word tokens above, the following extracts the nouns (the words tagged <code>NN</code>):</p>
<pre class="brush: py; gutter: false; toolbar: false;"># this operation takes some time to execute
taggedwords=nltk.pos_tag(words)
# pos_tag returns (word, tag) pairs; keep the words tagged 'NN' (noun, singular or mass)
nouns=[word for (word,tag) in taggedwords if tag == 'NN']</pre>
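<p>The same filter extends to the other word types shown in the gallery. The sketch below assumes the Penn Treebank tags that <code>nltk.pos_tag</code> produces by default for English, in which noun tags start with <code>NN</code>, verb tags with <code>VB</code>, adjective tags with <code>JJ</code> and adverb tags with <code>RB</code>. The helper name <code>words_with_tag_prefix</code> and the hand-tagged sample are made up for illustration, so the example runs without NLTK. Matching on the prefix is a design choice that differs from the exact <code>NN</code> match above: it also picks up plural and proper nouns.</p>
<pre class="brush: python; title: ; notranslate">
def words_with_tag_prefix(tagged, prefixes):
    # str.startswith accepts a tuple, so several tag families can be matched at once
    prefixes = tuple(prefixes)
    return [word for (word, tag) in tagged if tag.startswith(prefixes)]

# Hand-tagged sample standing in for the output of nltk.pos_tag(words)
tagged_sample = [('we', 'PRP'), ('need', 'VBP'), ('fair', 'JJ'),
                 ('taxes', 'NNS'), ('now', 'RB'), ('britain', 'NN')]

nouns = words_with_tag_prefix(tagged_sample, ['NN'])
verbs = words_with_tag_prefix(tagged_sample, ['VB'])
adjectives_and_adverbs = words_with_tag_prefix(tagged_sample, ['JJ', 'RB'])
</pre>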
<p>Such a list is enough to use with Wordle, but it&#8217;s also straightforward to create a word frequency list for use with other software.</p>
<pre class="brush: python; title: ; notranslate">
nounfrequencies = nltk.FreqDist(nouns)
</pre>
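<p>To hand the counts to other software they need to be written out in some agreed format. As a minimal sketch, the following produces tab-separated <code>word</code>/<code>count</code> lines, most frequent first. It uses <code>collections.Counter</code> from the standard library rather than <code>nltk.FreqDist</code> so that it runs on its own; the function name <code>frequency_lines</code>, the sample list and the tab-separated format are assumptions, so adjust the format to whatever the target program expects.</p>
<pre class="brush: python; title: ; notranslate">
from collections import Counter

def frequency_lines(words, top=None):
    # Count each word, then list the words from most to least frequent
    counts = Counter(words)
    return ['{}\t{}'.format(word, count) for word, count in counts.most_common(top)]

# Made-up sample standing in for the noun list extracted above
sample_nouns = ['tax', 'school', 'tax', 'police', 'school', 'tax']
lines = frequency_lines(sample_nouns)
top_two = frequency_lines(sample_nouns, top=2)
</pre>
<p>Joining the result with newlines and writing it to a file gives a list that a spreadsheet or another visualization tool can read.</p>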
<h3>Going further</h3>
<p>Word frequency analyses are fairly straightforward, but NLTK is a powerful library that allows for much more detailed and informative analysis based on grammar and sentence structure. It would be interesting to see the results of a more sophisticated approach.<br />
  <br /><a name="gallery"></a></p>
<h3>Gallery of word clouds from the first debate</h3>

<div class="ngg-galleryoverview" id="ngg-gallery-13-805">


	
	<!-- Thumbnails -->
		
	<div id="ngg-image-131" class="ngg-gallery-thumbnail-box"  >
		<div class="ngg-gallery-thumbnail" >
			<a href="http://www.planetmarshall.co.uk/wp-content/gallery/debates/brown_nouns.png" title="Nouns used by Gordon Brown in the first debate" class="shutterset_set_13" >
								<img title="Nouns used by Gordon Brown" alt="Nouns used by Gordon Brown" src="http://www.planetmarshall.co.uk/wp-content/gallery/debates/thumbs/thumbs_brown_nouns.png" width="100" height="64" />
							</a>
		</div>
	</div>
	
		
 		
	<div id="ngg-image-132" class="ngg-gallery-thumbnail-box"  >
		<div class="ngg-gallery-thumbnail" >
			<a href="http://www.planetmarshall.co.uk/wp-content/gallery/debates/cameron_nouns.png" title="Nouns used by David Cameron in the first debate" class="shutterset_set_13" >
								<img title="Nouns used by David Cameron" alt="Nouns used by David Cameron" src="http://www.planetmarshall.co.uk/wp-content/gallery/debates/thumbs/thumbs_cameron_nouns.png" width="100" height="64" />
							</a>
		</div>
	</div>
	
		
 		
	<div id="ngg-image-133" class="ngg-gallery-thumbnail-box"  >
		<div class="ngg-gallery-thumbnail" >
			<a href="http://www.planetmarshall.co.uk/wp-content/gallery/debates/clegg_nouns.png" title="Nouns used by Nick Clegg in the first debate" class="shutterset_set_13" >
								<img title="Nouns used by Nick Clegg" alt="Nouns used by Nick Clegg" src="http://www.planetmarshall.co.uk/wp-content/gallery/debates/thumbs/thumbs_clegg_nouns.png" width="100" height="64" />
							</a>
		</div>
	</div>
	
		
 		
	<div id="ngg-image-135" class="ngg-gallery-thumbnail-box"  >
		<div class="ngg-gallery-thumbnail" >
			<a href="http://www.planetmarshall.co.uk/wp-content/gallery/debates/brown_verbs.png" title="Verbs used by Gordon Brown in the first debate" class="shutterset_set_13" >
								<img title="Verbs used by Gordon Brown" alt="Verbs used by Gordon Brown" src="http://www.planetmarshall.co.uk/wp-content/gallery/debates/thumbs/thumbs_brown_verbs.png" width="100" height="64" />
							</a>
		</div>
	</div>
	
		
 		
	<div id="ngg-image-137" class="ngg-gallery-thumbnail-box"  >
		<div class="ngg-gallery-thumbnail" >
			<a href="http://www.planetmarshall.co.uk/wp-content/gallery/debates/cameron_verbs.png" title="Verbs used by David Cameron in the first debate" class="shutterset_set_13" >
								<img title="Verbs used by David Cameron" alt="Verbs used by David Cameron" src="http://www.planetmarshall.co.uk/wp-content/gallery/debates/thumbs/thumbs_cameron_verbs.png" width="100" height="65" />
							</a>
		</div>
	</div>
	
		
 		
	<div id="ngg-image-139" class="ngg-gallery-thumbnail-box"  >
		<div class="ngg-gallery-thumbnail" >
			<a href="http://www.planetmarshall.co.uk/wp-content/gallery/debates/clegg_verbs.png" title="Verbs used by Nick Clegg in the first debate" class="shutterset_set_13" >
								<img title="Verbs used by Nick Clegg" alt="Verbs used by Nick Clegg" src="http://www.planetmarshall.co.uk/wp-content/gallery/debates/thumbs/thumbs_clegg_verbs.png" width="100" height="65" />
							</a>
		</div>
	</div>
	
		
 		
	<div id="ngg-image-134" class="ngg-gallery-thumbnail-box"  >
		<div class="ngg-gallery-thumbnail" >
			<a href="http://www.planetmarshall.co.uk/wp-content/gallery/debates/brown_adj.png" title="Adjectives and Adverbs used by Gordon Brown in the first debate" class="shutterset_set_13" >
								<img title="Adjectives and Adverbs used by Gordon Brown" alt="Adjectives and Adverbs used by Gordon Brown" src="http://www.planetmarshall.co.uk/wp-content/gallery/debates/thumbs/thumbs_brown_adj.png" width="100" height="63" />
							</a>
		</div>
	</div>
	
		
 		
	<div id="ngg-image-136" class="ngg-gallery-thumbnail-box"  >
		<div class="ngg-gallery-thumbnail" >
			<a href="http://www.planetmarshall.co.uk/wp-content/gallery/debates/cameron_adj.png" title="Adjectives and Adverbs used by David Cameron in the first debate" class="shutterset_set_13" >
								<img title="Adjectives and Adverbs used by David Cameron" alt="Adjectives and Adverbs used by David Cameron" src="http://www.planetmarshall.co.uk/wp-content/gallery/debates/thumbs/thumbs_cameron_adj.png" width="100" height="65" />
							</a>
		</div>
	</div>
	
		
 		
	<div id="ngg-image-138" class="ngg-gallery-thumbnail-box"  >
		<div class="ngg-gallery-thumbnail" >
			<a href="http://www.planetmarshall.co.uk/wp-content/gallery/debates/clegg_adj.png" title="Adjectives and Adverbs used by Nick Clegg in the first debate" class="shutterset_set_13" >
								<img title="Adjectives and Adverbs used by Nick Clegg" alt="Adjectives and Adverbs used by Nick Clegg" src="http://www.planetmarshall.co.uk/wp-content/gallery/debates/thumbs/thumbs_clegg_adj.png" width="100" height="65" />
							</a>
		</div>
	</div>
	
		
 	 	
	<!-- Pagination -->
 	<div class="ngg-clear"></div> 	
</div>


]]></content:encoded>
			<wfw:commentRss>http://www.planetmarshall.co.uk/2010/04/visualizing-the-prime-ministerial-debates/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

