Silverlight and CUDA interop

Update – source code now available

Microsoft have recently released a beta of Silverlight 4, which has limited support for native interoperation using COM. The approach shown here could potentially be applied to any number of native interop scenarios; for this example I have chosen to use NVIDIA’s CUDA technology.

Disclaimer: This is an example of what can be done, not necessarily (and in all likelihood not) an example of how it should be done.

Up until around 2001, PC graphics cards, though powerful, implemented a fixed-function pipeline that limited their use to whatever was exposed by the APIs, usually Direct3D or OpenGL. The addition of a programmable pixel pipeline led to the use of graphics cards for more general computation tasks: at first using shaders directly, followed by higher-level, GPU-specific programming languages such as Brook, Sh, and later NVIDIA’s CUDA. Most of this work was, and is, documented by the GPGPU group. NVIDIA’s website shows CUDA being used in a wide variety of applications, but in practice it is best employed in so-called “embarrassingly parallel” problems.

The demonstration application

The demo below shows a Silverlight 4 beta application that implements a recursive Gaussian filter. Note that this is not the algorithm used by the sample in the CUDA SDK but a more efficient method, which is described in detail in [1] for those interested. The main advantage of a filter implemented in this way is that the computation time is independent of the width of the filter.
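
To illustrate why a recursive filter's cost does not depend on its width, here is a minimal C++ sketch. It is not the third-order filter from [1]; it swaps in a much simpler first-order recursive smoother, and the function name recursiveSmooth and the parameter alpha are hypothetical. The point is only that each pass does a fixed amount of work per sample, however wide the effective filter is.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, not taken from the demo source: a first-order
// recursive smoother applied forward and then backward along a signal.
// alpha is in (0, 1]; a smaller alpha gives a wider effective filter.
std::vector<float> recursiveSmooth(const std::vector<float>& src, float alpha)
{
    assert(alpha > 0.0f && alpha <= 1.0f);
    std::vector<float> dst(src);
    const std::size_t n = dst.size();
    if (n < 2) {
        return dst;
    }

    // Forward pass: each output depends only on the current input
    // and the previous output, so the cost per sample is constant.
    for (std::size_t i = 1; i < n; ++i) {
        dst[i] = alpha * dst[i] + (1.0f - alpha) * dst[i - 1];
    }

    // Backward pass: removes the one-sided lag of the forward pass.
    for (std::size_t i = n - 1; i-- > 0; ) {
        dst[i] = alpha * dst[i] + (1.0f - alpha) * dst[i + 1];
    }
    return dst;
}
```

Calling recursiveSmooth(signal, 0.5f) and recursiveSmooth(signal, 0.05f) performs the same number of multiplications and additions, whereas a direct convolution would need a kernel that grows with the filter width.
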
To enable CUDA interop, you’ll need a CUDA-compatible graphics card. Then do the following:

1. Install the MFC COM application (link below). The installer should register the application with COM automatically.
2. Right-click the Silverlight app and install it to run outside the browser. The CUDA option should now be available in the combo box.

Source code : SilverlightCudaInteropDemo.zip
Install MFC COM Application (5.5 Mb)
[Embedded Silverlight demo: cuda_interop/SlCudaInteropDemo.xap]

The native component

The native component takes the form of a COM Automation server, implemented as a client side MFC application.

Note: Make sure you run Visual Studio with Administrator privileges, otherwise registering the automation server with COM will fail.

MFC and Automation are beyond the scope of this article, but this is the basic process I followed:

1. Create an MFC Dialog application using the Wizard. Make sure to enable Automation support.
2. Add a method to the automation interface using the Add Method wizard from the Class View.
3. Add a dual interface using this Technical Note from MSDN.
4. If you get link errors, make sure to include the output of MIDL in the application class (the one that contains InitInstance). I couldn’t find any reference to this step, but it’s how the samples work.
5. Make sure that the run-time library options passed to nvcc and msvc match, i.e. they should all use either DLL or static linking, not a mixture of both.
6. If you get stuck, take a look at the MFC samples, particularly acdual.

When you pass a .NET array through COM Automation, it arrives on the native side as a SAFEARRAY wrapped in a VARIANT. I couldn’t find any documentation on this; I discovered it through experience. The code snippets below show sending and receiving array data between Silverlight and the MFC application.

// note that ComAutomationFactory has become AutomationFactory
// in Silverlight 4 RC
dynamic cuda = AutomationFactory.CreateObject("CudaServer.Application");
float[] data = new [] {1.0f, 3.14f };
dynamic retData = cuda.Process( data );
// retData is a managed float array

VARIANT CCudaServer::Process(VARIANT &data)
{
    SAFEARRAY *pSrcData = data.parray;

    // this will copy the safe array into the variant
    CComVariant var(pSrcData);

    // when we return the VARIANT containing the SAFEARRAY
    // it will be marshaled to Silverlight as a managed array
    VARIANT retVal;
    VariantInit( &retVal );
    var.Detach( &retVal );
    retVal.vt = VT_ARRAY | VT_R4;
    return retVal;
}
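
The snippet above only copies the array back. To actually work on the floats, the SAFEARRAY has to be locked first. The following is a sketch under assumptions, not the demo's code: the function names scaleInPlace and ProcessSafeArray are hypothetical, the array is assumed to be one-dimensional and of type VT_R4, and the Windows-specific part is guarded so that the plain processing step can be compiled anywhere.

```cpp
#include <cstddef>

// Hypothetical processing step: scales every element in place.
void scaleInPlace(float* values, std::size_t count, float factor)
{
    for (std::size_t i = 0; i < count; ++i) {
        values[i] *= factor;
    }
}

#ifdef _WIN32
#include <windows.h>
#include <oleauto.h>

// Locks the SAFEARRAY, processes its floats in place, and unlocks it.
// Assumes a one-dimensional array of VT_R4 elements.
HRESULT ProcessSafeArray(SAFEARRAY* psa, float factor)
{
    if (psa == NULL || SafeArrayGetDim(psa) != 1) {
        return E_INVALIDARG;
    }

    VARTYPE vt = VT_EMPTY;
    HRESULT hr = SafeArrayGetVartype(psa, &vt);
    if (FAILED(hr)) return hr;
    if (vt != VT_R4) return E_INVALIDARG;

    LONG lower = 0;
    LONG upper = -1;
    hr = SafeArrayGetLBound(psa, 1, &lower);
    if (FAILED(hr)) return hr;
    hr = SafeArrayGetUBound(psa, 1, &upper);
    if (FAILED(hr)) return hr;

    // Pointer to the array's element storage, valid until unaccessed.
    float* values = NULL;
    hr = SafeArrayAccessData(psa, reinterpret_cast<void**>(&values));
    if (FAILED(hr)) return hr;

    scaleInPlace(values, static_cast<std::size_t>(upper - lower + 1), factor);

    return SafeArrayUnaccessData(psa);
}
#endif
```

In a method like Process, a call such as ProcessSafeArray(data.parray, 2.0f) would sit before the array is copied into the returned VARIANT.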


Using MEF to implement the application

The Managed Extensibility Framework (MEF) is a plugin framework for .NET applications and Silverlight. I have used it to dynamically discover implementations of IProcessorProvider based on the permissions available to the Silverlight application. The figure below shows the component structure of the application.

Performance notes

Silverlight

Unlike the Reaction Diffusion simulation, for this application I have chosen to use Silverlight’s WriteableBitmap, introduced in Silverlight 3, rather than dynamic PNG encoding. This revealed an interesting performance issue when using a typical double loop to iterate over the pixels. Initial timings revealed that the vast majority of the time was spent in updating the WriteableBitmap rather than actually performing the image processing. The initial update loop used the PixelWidth and PixelHeight properties to bound the loop counters, taking about 200ms to iterate over the loop.

for (int j = 0; j < bmp.PixelHeight; ++j)
{
    for (int i = 0; i < bmp.PixelWidth; ++i)
    {
        // update pixels
    }
}


Caching the bitmap properties in local variables reduced the timing to ~5ms. Needless to say, I was shocked by how much of a difference such a seemingly trivial change made.

int pxWidth = bmp.PixelWidth;
int pxHeight = bmp.PixelHeight;
for (int j = 0; j < pxHeight; ++j)
{
    for (int i = 0; i < pxWidth; ++i)
    {
        // update pixels
    }
}


OLE Automation

The guidelines for building performant automation code are much the same as those for other unmanaged interop scenarios in .NET: avoid chatty interfaces. Note that this is exactly what I have not done here. In fact, the time it takes CUDA to perform the image processing is dwarfed by the time it takes to marshal the data between Silverlight and COM. This can be mitigated somewhat by splitting the blur call into two operations: one to load the image, which is called only upon initialization, and one to perform the blur.
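
The load/blur split described above might look something like the following C++ sketch of the server-side state. Everything here is hypothetical: the class name BlurSession, its methods, and the transfer counter are illustrative only, and the blur itself is replaced by a pass-through so that the shape of the interface is the only thing on show.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical server-side object; the names are illustrative and are
// not taken from the demo's source code.
class BlurSession
{
public:
    BlurSession() : floatsReceived_(0) {}

    // Expensive call: the whole image crosses the interop boundary.
    // Intended to be called once, at initialization.
    void LoadImage(const std::vector<float>& pixels)
    {
        image_ = pixels;
        floatsReceived_ += pixels.size();
    }

    // Cheap on the way in: only the filter width is sent.
    // The result still has to be marshaled back to the caller.
    std::vector<float> Blur(float sigma) const
    {
        (void)sigma;    // placeholder: a real server would run the filter here
        return image_;  // stand-in for the processed image
    }

    // Number of floats sent to the server so far, for illustration.
    std::size_t FloatsReceived() const { return floatsReceived_; }

private:
    std::vector<float> image_;
    std::size_t floatsReceived_;
};
```

With this shape, moving the blur slider repeatedly calls Blur without re-sending the source image, so only the result is marshaled on each update.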

CUDA

CUDA operations are extremely sensitive to data alignment and the order in which threads access data. Kernels should be written so that adjacent threads access adjacent data elements. This means that the row-major access pattern familiar to C and C# developers, with each thread walking along its own row, produces suboptimal performance (sometimes by as much as an order of magnitude). Instead, array accesses should be performed in a manner more reminiscent of FORTRAN. In addition, 2D arrays should be padded so that threads access data elements that are correctly aligned (see the CUDA documentation for the correct alignment values). A full exposition of performance optimization for CUDA is beyond the scope of this article. There are many examples in the NVIDIA documentation, although the terminology can be somewhat opaque. One of the clearest explanations I have found is this presentation from Mark Harris at Supercomputing 2007.

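__global__ void kernel( float *destData, float *srcData, int stride, int height )
{
    // assumed mapping: one thread per row
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if ( row >= height ) return;
    int rowStart = row * stride;

    // suboptimal access. Each thread walks along its own row, so
    // adjacent threads access elements in a striding pattern
    for ( int i = rowStart; i < rowStart+stride; ++i ) {
        destData[i] = srcData[i];
    }
}

__global__ void kernel( float *destData, float *srcData, int stride, int height )
{
    // assumed mapping: one thread per column
    int colStart = blockIdx.x * blockDim.x + threadIdx.x;
    if ( colStart >= stride ) return;

    // better access. Adjacent threads access adjacent elements.
    // The stride is assumed to be padded to the required alignment,
    // in this case, 16*sizeof(float)= 64 bytes
    for ( int i = colStart ; i < colStart+(stride*height); i+=stride ) {
        destData[i] = srcData[i];
    }
}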
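
The difference between the two kernels comes down to index arithmetic, which can be checked on the host. The sketch below assumes that rowStart is the thread index times the stride and that colStart is the thread index itself; the function names are hypothetical, and the 16-float alignment simply mirrors the 64-byte figure in the comment above.

```cpp
#include <cstddef>

// Element touched by thread t on its k-th loop iteration when each
// thread walks along its own row (the first kernel's pattern).
std::size_t rowPerThreadIndex(std::size_t t, std::size_t k, std::size_t stride)
{
    return t * stride + k;
}

// Element touched by thread t on its k-th loop iteration when each
// thread walks down its own column (the second kernel's pattern).
std::size_t columnPerThreadIndex(std::size_t t, std::size_t k, std::size_t stride)
{
    return k * stride + t;
}

// Rounds a row width up to a multiple of 'alignment' elements,
// e.g. 16 floats = 64 bytes.
std::size_t paddedStride(std::size_t width, std::size_t alignment)
{
    return (width + alignment - 1) / alignment * alignment;
}
```

On any given iteration, threads 0 and 1 are a full stride apart in the first pattern but one element apart in the second, which is what allows the hardware to combine their accesses into a single memory transaction. Padding each row with paddedStride keeps the start of every row on an aligned boundary.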


References

1. Young, I.T. & van Vliet, L.J., 1995. Recursive implementation of the Gaussian filter. Signal Processing, 44, pp.139-151.
• http://blogs.dotnet-braunschweig.de/Florian Florian Mätschke

Nice article!

Great work!

• Sam

I would like to see your CUDA application in action but I get an error: “This application was created for an expired Beta of Silverlight”.
I just got the latest Silverlight 4 build, 4.0.50303.0.

• http://www.planetmarshall.co.uk Andrew

Hi Sam,

I’m in the process of updating the sample to the Silverlight 4 RC, it should be done shortly.

Regards,
Andrew.

• Leo Camello

Is it possible to do things like 3D rendering using this technique?

• http://www.planetmarshall.co.uk Andrew

Leo,
While it would be technically possible to render a scene in DirectX, say, to an off screen bitmap and then pass it back for display in Silverlight via COM, there’s no way to escape the bandwidth problem that I observed in the image processing example.
You are always limited by the time it takes to transfer the data between COM and Silverlight, which I could never reduce to much less than 40ms on my system.

You might be better off investigating something like Kit3D which provides a native Silverlight 3D engine.

http://kit3d.codeplex.com/

• Leo Camello

Thank you Andrew! =]

I’ll try to investigate this Kit3D

• mr_x

Cool article.
You have even inspired our teacher for bsc/bse diploma subject for us, don’t do this again