<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>PlanetMarshall &#187; CUDA</title>
	<atom:link href="http://www.planetmarshall.co.uk/index.php/tag/cuda/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.planetmarshall.co.uk</link>
	<description>Andrew Marshall's blog.</description>
	<lastBuildDate>Mon, 05 Jul 2010 10:21:08 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.1</generator>
		<item>
		<title>Silverlight and CUDA interop</title>
		<link>http://www.planetmarshall.co.uk/2010/01/silverlight-and-cuda-interop/</link>
		<comments>http://www.planetmarshall.co.uk/2010/01/silverlight-and-cuda-interop/#comments</comments>
		<pubDate>Fri, 15 Jan 2010 02:11:58 +0000</pubDate>
		<dc:creator>Andrew</dc:creator>
				<category><![CDATA[Software]]></category>
		<category><![CDATA[C#]]></category>
		<category><![CDATA[CUDA]]></category>
		<category><![CDATA[Silverlight]]></category>

		<guid isPermaLink="false">http://www.planetmarshall.co.uk/?p=389</guid>
		<description><![CDATA[Update &#8211; source code now available Microsoft have recently released a beta of Silverlight 4, which has limited support for native interoperation using COM. Potentially, this example could be applied to any number of native interop scenarios, however for this example I have chosen to use Nvidia&#8217;s CUDA technology. Disclaimer : This is an example [...]]]></description>
			<content:encoded><![CDATA[
<a href="http://www.planetmarshall.co.uk/wp-content/gallery/cuda-interop/mandrill.jpg" title="" class="thickbox" rel="singlepic78" >
	<img class="ngg-singlepic ngg-right" src="http://www.planetmarshall.co.uk/wp-content/gallery/cache/78__x96_mandrill.jpg" alt="mandrill" title="mandrill" />
</a>
<br />
<p><em>Update &#8211; <a href="#source">source code</a> now available</em></p>
<p class="pm_first">Microsoft have recently released a beta of Silverlight 4, which has limited support for native interoperation using COM. Potentially, this approach could be applied to any number of native interop scenarios; for this example, however, I have chosen to use NVIDIA&#8217;s CUDA technology.</p>
<blockquote><p>Disclaimer: This is an example of what can be done, not necessarily (and in all likelihood not) an example of how it should be done.</p></blockquote>
<h3>About CUDA</h3>
<p>Up until around 2001, PC graphics cards, though powerful, implemented a fixed-function pipeline that limited their use to whatever was exposed by the APIs, usually Direct3D or OpenGL. The addition of a programmable pixel pipeline led to the use of graphics cards for more general computation tasks: at first using shaders directly, then through higher-level GPU-specific programming languages such as Brook, SH, and later NVIDIA&#8217;s CUDA. Most of this work was, and is, documented by the <a title="GPGPU" href="http://gpgpu.org/" target="_blank">GPGPU</a> group. <a href="http://www.nvidia.com/object/cuda_home.html#" target="_blank">NVIDIA&#8217;s website</a> shows CUDA being used in a wide variety of applications, but in practice it is best employed in so-called &#8220;<a title="Wikipedia : Embarrassingly Parallel" href="http://en.wikipedia.org/wiki/Embarrassingly_parallel" target="_blank">embarrassingly parallel</a>&#8221; problems.<br />
<span id="more-389"></span></p>
<h3>The demonstration application</h3>
<p>The demo below shows a Silverlight 4 beta application that implements a recursive Gaussian filter. Note that this is not the same algorithm provided by the sample in the CUDA SDK, but a more efficient method, which is described in detail in <a href="#young">[1]</a> for those interested. The main advantage of a filter implemented in this way is that the computation time is independent of the width of the filter.<br />
To enable CUDA interop, you&#8217;ll need a CUDA-compatible graphics card. Then do the following:</p>
<ol>
<li>Install the MFC COM application (link below). The installer should register the application with COM automatically.</li>
<li>Right click on the Silverlight App and install it for running outside of the browser. The CUDA option should now be available from the Combo box.</li>
</ol>
<p>Source code : <a title="Download source code" name="source" href="http://planetmarshall.co.uk/silverlight/cuda_interop/SilverlightCudaInteropDemo.zip">SilverlightCudaInteropDemo.zip</a><br />
<a title="Install CUDA Server application" href="http://planetmarshall.co.uk/silverlight/cuda_interop/CudaServer.msi">Install MFC COM Application (5.5 MB)</a><br />
<div class="pm_header"><a onclick="pm_toggleCodeBlock(this,'4c837824da8f3')">&#x25bc;</a> Silverlight Application</div><div id="4c837824da8f3" style="display:" class="silverlightControlHost"><object data="data:application/x-silverlight-2," type="application/x-silverlight-2" width="520" height="580"><param name="source" value="http://www.planetmarshall.co.uk/silverlight/cuda_interop/SlCudaInteropDemo.xap"/><param name="background" value="#212121" /><!--<param name="minRuntimeVersion" value="2.0.31005.0" />--><param name="enableHtmlAccess" value="true" /><a href="http://go.microsoft.com/fwlink/?LinkID=124807" style="text-decoration: none;"><img src="http://storage.timheuer.com/sl4wp-ph.png" alt="Install Microsoft Silverlight" style="border-style: none; width:520px; height:580px"/></a></object><iframe style='visibility:hidden;height:0;width:0;border:0px'></iframe></div><br />
<h3>The native component</h3>
<p>The native component takes the form of a COM Automation server, implemented as a client side MFC application.</p>
<blockquote><p>Note: Make sure you run Visual Studio with Administrator privileges, otherwise registering the automation server with COM will fail.</p></blockquote>
<p>MFC and Automation are beyond the scope of this article, but this is the basic process I followed:</p>
<ol>
<li>Create an MFC Dialog application using the Wizard. Make sure to enable Automation support.</li>
<li>Add a method to the automation interface using the Add Method wizard from the Class View.</li>
<li>Add a dual interface using this <a title="TN065: Dual-Interface Support for OLE Automation Servers" href="http://msdn.microsoft.com/en-us/library/4h56szat%28VS.100%29.aspx" target="_blank">Technical Note</a> from MSDN.</li>
<li>If you get link errors, make sure to include the output of MIDL in the application class (the one that contains <code>InitInstance</code>). I couldn&#8217;t find any reference to this step, but it&#8217;s how the samples work.</li>
<li>Make sure that the run-time library options passed to nvcc and MSVC match, i.e. they should all use either DLL or static linking, not a mixture of both.</li>
<li>If you get stuck, take a look at the <a title="MFC Samples" href="http://msdn.microsoft.com/en-us/library/482ck6x8%28VS.100%29.aspx" target="_blank">MFC Samples</a>, particularly <a title="ACDUAL Sample: Adds Dual Interfaces to an Automation Application" href="http://msdn.microsoft.com/en-us/library/xfx55tf8%28VS.100%29.aspx" target="_blank">acdual</a>.</li>
</ol>
<p>When you pass a native array through COM Automation, it is converted to a <a title="Array Manipulation Functions from MSDN" href="http://msdn.microsoft.com/en-us/library/ms221145%28VS.100%29.aspx" target="_blank"><code>SAFEARRAY</code></a> on the native side. Note that I couldn&#8217;t find any documentation on this; I discovered it through experience. The code snippets below show sending and receiving array data between Silverlight and the MFC application.</p>

<div class="wp_syntax">
<div class="wp_header"><a onclick="pm_toggleCodeBlock(this,'4c837824e0d0e')">&#x25ba;</a> Listing : Calling COM from Silverlight</div><div id="4c837824e0d0e" style="display:none;" class="code"><div class="csharp pm_syntax"><span class="co1">// note that ComAutomationFactory has become AutomationFactory</span><br />
<span class="co1">// in Silverlight 4 RC</span><br />
<span class="kw4">dynamic</span> cuda <span class="sy0">=</span> AutomationFactory.<span class="me1">CreateObject</span><span class="br0">&#40;</span><span class="st0">&quot;CudaServer.Application&quot;</span><span class="br0">&#41;</span><span class="sy0">;</span><br />
<span class="kw4">float</span><span class="br0">&#91;</span><span class="br0">&#93;</span> data <span class="sy0">=</span> <span class="kw3">new</span> <span class="br0">&#91;</span><span class="br0">&#93;</span> <span class="br0">&#123;</span>1.0f, 3.14f <span class="br0">&#125;</span><span class="sy0">;</span><br />
<span class="kw4">dynamic</span> retData <span class="sy0">=</span> cuda.<span class="me1">Process</span><span class="br0">&#40;</span> data <span class="br0">&#41;</span><span class="sy0">;</span><br />
<span class="co1">// retData is a managed float array</span></div></div></div>


<div class="wp_syntax">
<div class="wp_header"><a onclick="pm_toggleCodeBlock(this,'4c837824e36e7')">&#x25ba;</a> Listing : Returning data to Silverlight from MFC via COM</div><div id="4c837824e36e7" style="display:none;" class="code"><div class="cpp pm_syntax">VARIANT CCudaServer<span class="sy4">::</span><span class="me2">Process</span><span class="br0">&#40;</span>VARIANT <span class="sy3">&amp;</span>data<span class="br0">&#41;</span><br />
<span class="br0">&#123;</span><br />
&nbsp; SAFEARRAY <span class="sy2">*</span>pSrcData <span class="sy1">=</span> &nbsp;data.<span class="me1">parray</span><span class="sy4">;</span><br />
<br />
&nbsp; <span class="co1">// this will copy the safe array into the variant</span><br />
&nbsp; CComVariant var<span class="br0">&#40;</span>pSrcData<span class="br0">&#41;</span><span class="sy4">;</span><br />
<br />
&nbsp; <span class="co1">// when we return the VARIANT containing the SAFEARRAY</span><br />
&nbsp; <span class="co1">// it will be marshaled to Silverlight as a managed array</span><br />
&nbsp; VARIANT retVal<span class="sy4">;</span><br />
&nbsp; VariantInit<span class="br0">&#40;</span> <span class="sy3">&amp;</span>retVal <span class="br0">&#41;</span><span class="sy4">;</span><br />
&nbsp; var.<span class="me1">Detach</span><span class="br0">&#40;</span> <span class="sy3">&amp;</span>retVal <span class="br0">&#41;</span><span class="sy4">;</span><br />
&nbsp; retVal.<span class="me1">vt</span> <span class="sy1">=</span> VT_ARRAY <span class="sy3">|</span> VT_R4<span class="sy4">;</span><br />
&nbsp; <span class="kw1">return</span> retVal<span class="sy4">;</span><br />
<span class="br0">&#125;</span></div></div></div>

<h3>Using MEF to implement the application</h3>
<p>The <a title="Managed Extensibility Framework at Codeplex" href="http://www.codeplex.com/MEF" target="_blank">Managed Extensibility Framework</a> is an extensible plugin framework for .NET applications and Silverlight. I have used it to dynamically discover implementations of <code>IProcessorProvider</code> based on the permissions available to the Silverlight application. The figure below shows the component structure of the application.</p>

<!-- collapsible header -->

<div class="pm_header"><a onclick="pm_toggleCodeBlock(this,'4c837824dfc0d')">&#x25ba;</a> Figure : Component diagram for demo application</div>
<div id="4c837824dfc0d" style="display:none;">
<a href="http://www.planetmarshall.co.uk/wp-content/gallery/cuda-interop/slcuda_component.png" title="Component diagram for demo application" class="thickbox" rel="singlepic77" >
	<img class="ngg-singlepic" src="http://www.planetmarshall.co.uk/wp-content/gallery/cache/77__475x_slcuda_component.png" alt="slcuda_component" title="slcuda_component" />
</a>
</div>

<h3>Performance notes</h3>
<h4>Silverlight</h4>
<p>Unlike the <a title="My Reaction-Diffusion simulator" href="http://www.planetmarshall.co.uk/index.php/2009/03/reaction-diffusion-models/">Reaction Diffusion simulation</a>, for this application I have chosen to use Silverlight&#8217;s <a title="WriteableBitmap in Silverlight 3, from MSDN" href="http://msdn.microsoft.com/en-us/library/system.windows.media.imaging.writeablebitmap%28VS.95%29.aspx" target="_blank"><code>WriteableBitmap</code></a>, introduced in Silverlight 3, rather than <a title="Joe Stegman's PNG Encoder for Silverlight" href="http://blogs.msdn.com/jstegman/archive/2008/04/21/dynamic-image-generation-in-silverlight.aspx" target="_blank">dynamic PNG encoding</a>. This revealed an interesting performance issue when using a typical double loop to iterate over the pixels. Initial timings showed that the vast majority of the time was spent updating the <code>WriteableBitmap</code> rather than actually performing the image processing. The initial update loop used the <code>PixelWidth</code> and <code>PixelHeight</code> properties to bound the loop counters, and took about 200ms to run.</p>

<div class="wp_syntax">
<div class="wp_header"><a onclick="pm_toggleCodeBlock(this,'4c837824e72c4')">&#x25ba;</a> Listing : Updating bitmap using property accessors</div><div id="4c837824e72c4" style="display:none;" class="code"><div class="csharp pm_syntax"><span class="kw1">for</span> <span class="br0">&#40;</span><span class="kw4">int</span> j <span class="sy0">=</span> <span class="nu0">0</span><span class="sy0">;</span> j <span class="sy0">&lt;</span> bmp.<span class="me1">PixelHeight</span><span class="sy0">;</span> <span class="sy0">++</span>j<span class="br0">&#41;</span><br />
<span class="br0">&#123;</span><br />
&nbsp; <span class="kw1">for</span> <span class="br0">&#40;</span><span class="kw4">int</span> i <span class="sy0">=</span> <span class="nu0">0</span><span class="sy0">;</span> i <span class="sy0">&lt;</span> bmp.<span class="me1">PixelWidth</span><span class="sy0">;</span> <span class="sy0">++</span>i<span class="br0">&#41;</span><br />
&nbsp; <span class="br0">&#123;</span><br />
&nbsp; &nbsp; <span class="co1">// update pixels</span><br />
&nbsp; <span class="br0">&#125;</span><br />
<span class="br0">&#125;</span></div></div></div>

<p>By caching the bitmap properties in local variables, the timing was reduced to ~5ms. Needless to say I was shocked by how much of a difference such a seemingly trivial change made.</p>

<div class="wp_syntax">
<div class="wp_header"><a onclick="pm_toggleCodeBlock(this,'4c837824e9a7a')">&#x25ba;</a> Listing : Updating bitmap with cached variables</div><div id="4c837824e9a7a" style="display:none;" class="code"><div class="csharp pm_syntax"><span class="kw4">int</span> pxWidth <span class="sy0">=</span> bmp.<span class="me1">PixelWidth</span><span class="sy0">;</span><br />
<span class="kw4">int</span> pxHeight <span class="sy0">=</span> bmp.<span class="me1">PixelHeight</span><span class="sy0">;</span><br />
<span class="kw1">for</span> <span class="br0">&#40;</span><span class="kw4">int</span> j <span class="sy0">=</span> <span class="nu0">0</span><span class="sy0">;</span> j <span class="sy0">&lt;</span> pxHeight<span class="sy0">;</span> <span class="sy0">++</span>j<span class="br0">&#41;</span><br />
<span class="br0">&#123;</span><br />
&nbsp; <span class="kw1">for</span> <span class="br0">&#40;</span><span class="kw4">int</span> i <span class="sy0">=</span> <span class="nu0">0</span><span class="sy0">;</span> i <span class="sy0">&lt;</span> pxWidth<span class="sy0">;</span> <span class="sy0">++</span>i<span class="br0">&#41;</span><br />
&nbsp; <span class="br0">&#123;</span><br />
&nbsp; &nbsp; <span class="co1">// update pixels</span><br />
&nbsp; <span class="br0">&#125;</span><br />
<span class="br0">&#125;</span></div></div></div>

<h4>OLE Automation</h4>
<p>The guidelines for building performant automation code are much the same as those for other unmanaged interop scenarios in .NET: avoid chatty interfaces. Note that this is exactly what I have not done here. In fact, the time it takes CUDA to perform the image processing is dwarfed by the time it takes to marshal the data between Silverlight and COM. This can be mitigated somewhat by splitting the blur call into two operations: one to load the image, which is called only upon initialization, and one to perform the blur.</p>
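<p>To make the cost of a chatty interface concrete, here is a minimal C++ sketch that only counts how many floats would cross the Silverlight/COM boundary under each design. <code>FakeBlurServer</code>, its methods, and the two cost functions are hypothetical names invented for this illustration. There is no real COM, MFC, or CUDA code here, and the blur is a placeholder that returns the image unchanged.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the automation server. It only counts how many
// floats cross the (simulated) COM boundary; no real COM or CUDA is involved.
class FakeBlurServer {
public:
    // Single-call design: the image is marshalled in and out on every call.
    std::vector<float> process(const std::vector<float>& image, float sigma) {
        marshalled_ += image.size();  // input copied across the boundary
        image_ = image;
        return blur(sigma);           // blur() accounts for the output copy
    }

    // Split design, step 1: marshal the image in once, at initialization.
    void loadImage(const std::vector<float>& image) {
        marshalled_ += image.size();
        image_ = image;
    }

    // Split design, step 2: only the result crosses the boundary.
    std::vector<float> blur(float /*sigma*/) {
        marshalled_ += image_.size();
        return image_;                // placeholder: no filtering is done
    }

    std::size_t marshalled() const { return marshalled_; }

private:
    std::vector<float> image_;
    std::size_t marshalled_ = 0;
};

// Floats marshalled when every blur sends the whole image again.
std::size_t costSingleCall(std::size_t pixels, int calls) {
    assert(calls >= 0);
    FakeBlurServer server;
    const std::vector<float> image(pixels, 0.5f);
    for (int i = 0; i < calls; ++i) {
        server.process(image, 2.0f);
    }
    return server.marshalled();
}

// Floats marshalled when the image is loaded once and then blurred repeatedly.
std::size_t costSplitCalls(std::size_t pixels, int calls) {
    assert(calls >= 0);
    FakeBlurServer server;
    server.loadImage(std::vector<float>(pixels, 0.5f));
    for (int i = 0; i < calls; ++i) {
        server.blur(2.0f);
    }
    return server.marshalled();
}
```

<p>Under these assumptions, ten blurs of a 1000-pixel image move 20000 floats through the single-call design and 11000 through the split design. The saving approaches half as the number of blur calls grows, because the result still has to come back every time.</p>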
<h4>CUDA</h4>
<p>CUDA operations are extremely sensitive to data alignment and the order in which threads access data. Kernels should be written in such a way that threads access adjacent data elements, meaning that the row major access pattern familiar to C and C# developers would produce suboptimal performance (sometimes by as much as an order of magnitude). Instead, array accesses should be performed in a manner more reminiscent of FORTRAN. In addition, 2D arrays should be padded out so that threads access data elements that are correctly aligned (see the CUDA documentation for the correct alignment values). A full exposition of performance optimization for CUDA is beyond the scope of this article. There are many examples in the <a title="Learn More about CUDA - NVIDIA" href="http://www.nvidia.com/object/cuda_education.html" target="_blank">NVIDIA documentation</a>, although the terminology can be somewhat opaque. One of the clearest explanations I have found is this <a title="Supercomputing 2007 CUDA Tutorial" href="http://gpgpu.org/sc2007" target="_blank">presentation</a> from Mark Harris at Supercomputing 2007.</p>

<div class="wp_syntax">
<div class="wp_header"><a onclick="pm_toggleCodeBlock(this,'4c837824ec3da')">&#x25ba;</a> Listing : Row major access pattern</div><div id="4c837824ec3da" style="display:none;" class="code"><div class="cuda pm_syntax"><span class="kw2">__global__</span> <span class="kw4">void</span> kernel<span class="br0">&#40;</span> <span class="kw4">float</span> <span class="sy0">*</span>destData<span class="sy0">,</span> <span class="kw4">float</span> <span class="sy0">*</span>srcData<span class="sy0">,</span> <span class="kw4">int</span> stride<span class="sy0">,</span> <span class="kw4">int</span> height <span class="br0">&#41;</span><br />
<span class="br0">&#123;</span><br />
&nbsp; <span class="co1">// suboptimal access. Each thread accesses elements</span><br />
&nbsp; <span class="co1">// in a striding pattern</span><br />
&nbsp; <span class="kw4">int</span> rowStart <span class="sy0">=</span> <span class="br0">&#40;</span><span class="kw3">blockDim</span>.<span class="me1">x</span><span class="sy0">*</span><span class="kw3">blockIdx</span>.<span class="me1">x</span><span class="sy0">+</span><span class="kw3">threadIdx</span>.<span class="me1">x</span><span class="br0">&#41;</span><span class="sy0">*</span>stride<span class="sy0">;</span><br />
&nbsp; <span class="kw1">for</span> <span class="br0">&#40;</span> <span class="kw4">int</span> i <span class="sy0">=</span> rowStart<span class="sy0">;</span> i <span class="sy0">&lt;</span> rowStart<span class="sy0">+</span>stride<span class="sy0">;</span> <span class="sy0">++</span>i <span class="br0">&#41;</span> <span class="br0">&#123;</span><br />
&nbsp; &nbsp; destData<span class="br0">&#91;</span>i<span class="br0">&#93;</span> <span class="sy0">=</span> srcData<span class="br0">&#91;</span>i<span class="br0">&#93;</span><span class="sy0">;</span><br />
&nbsp; <span class="br0">&#125;</span><br />
<span class="br0">&#125;</span></div></div></div>


<div class="wp_syntax">
<div class="wp_header"><a onclick="pm_toggleCodeBlock(this,'4c837824edf27')">&#x25ba;</a> Listing : Column major access pattern</div><div id="4c837824edf27" style="display:none;" class="code"><div class="cuda pm_syntax"><span class="kw2">__global__</span> <span class="kw4">void</span> kernel<span class="br0">&#40;</span> <span class="kw4">float</span> <span class="sy0">*</span>destData<span class="sy0">,</span> <span class="kw4">float</span> <span class="sy0">*</span>srcData<span class="sy0">,</span> <span class="kw4">int</span> stride<span class="sy0">,</span> <span class="kw4">int</span> height <span class="br0">&#41;</span><br />
<span class="br0">&#123;</span><br />
&nbsp; <span class="co1">// optimal access pattern, each thread accesses adjacent elements</span><br />
&nbsp; <span class="kw4">int</span> colStart <span class="sy0">=</span> <span class="kw3">blockDim</span>.<span class="me1">x</span><span class="sy0">*</span><span class="kw3">blockIdx</span>.<span class="me1">x</span><span class="sy0">+</span><span class="kw3">threadIdx</span>.<span class="me1">x</span><span class="sy0">;</span><br />
&nbsp; <span class="co1">// assumes stride is padded to the alignment, in this case 16*sizeof(float) = 64 bytes</span><br />
&nbsp; <span class="kw1">for</span> <span class="br0">&#40;</span> <span class="kw4">int</span> i <span class="sy0">=</span> colStart <span class="sy0">;</span> i <span class="sy0">&lt;</span> colStart<span class="sy0">+</span><span class="br0">&#40;</span>stride<span class="sy0">*</span>height<span class="br0">&#41;</span><span class="sy0">;</span> i<span class="sy0">+=</span>stride <span class="br0">&#41;</span> <span class="br0">&#123;</span><br />
&nbsp; &nbsp; destData<span class="br0">&#91;</span>i<span class="br0">&#93;</span> <span class="sy0">=</span> srcData<span class="br0">&#91;</span>i<span class="br0">&#93;</span><span class="sy0">;</span><br />
&nbsp; <span class="br0">&#125;</span><br />
<span class="br0">&#125;</span></div></div></div>
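<p>The padding mentioned above can be sketched on the host side in plain C++. This is only an illustration: <code>paddedStride</code> and <code>paddedIndex</code> are hypothetical helper names, and the alignment of 16 elements used in the note below is an assumption taken from the comment in the listing above, not a value checked against any particular device. In real code the CUDA runtime's <code>cudaMallocPitch</code> can choose the padding for you.</p>

```cpp
#include <cassert>
#include <cstddef>

// Round a row width (in elements) up to the next multiple of `alignment`
// elements, so that every row of a 2D array starts on an aligned boundary.
// The right alignment is device dependent; see the CUDA documentation.
std::size_t paddedStride(std::size_t width, std::size_t alignment) {
    assert(alignment > 0);
    return (width + alignment - 1) / alignment * alignment;
}

// Linear index of element (row, col) in a buffer whose rows are `stride`
// elements apart. Columns in [width, stride) are padding and are never read.
std::size_t paddedIndex(std::size_t row, std::size_t col, std::size_t stride) {
    assert(col < stride);
    return row * stride + col;
}
```

<p>For example, a 500 pixel wide image with a 16 element alignment gets a stride of 512, so element (2, 3) lives at index 2 * 512 + 3 = 1027. The kernels above would then be passed the padded stride rather than the image width.</p>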

<h3>References</h3>
<ol>
<li><span class="p1"><a name="young"></a>Young, I.T. &amp; van Vliet, L.J., 1995. <a title="Recursive Implementation of the Gaussian Filter" href="http://www.sciencedirect.com/science?_ob=ArticleURL&amp;_udi=B6V18-3YS90HC-D&amp;_user=10&amp;_coverDate=06%2F30%2F1995&amp;_rdoc=2&amp;_fmt=high&amp;_orig=browse&amp;_srch=doc-info%28%23toc%235668%231995%23999559997%23172292%23FLP%23display%23Volume%29&amp;_cdi=5668&amp;_sort=d&amp;_docanchor=&amp;_ct=11&amp;_acct=C000050221&amp;_version=1&amp;_urlVersion=0&amp;_userid=10&amp;md5=cdfad44c178fc20739d26562c5f26e04" target="_blank">Recursive implementation of the Gaussian filter</a>. <em>Signal Processing</em>, 44, pp. 139-151.</span></li>
</ol>
]]></content:encoded>
			<wfw:commentRss>http://www.planetmarshall.co.uk/2010/01/silverlight-and-cuda-interop/feed/</wfw:commentRss>
		<slash:comments>8</slash:comments>
		</item>
	</channel>
</rss>
