Building your own supercomputer
If you fancy getting stuck into tasks such as this, you could buy dedicated hardware from the likes of HP or Cray, but this is probably overkill, and would certainly be tremendously expensive. The Cray XK6, for example, can deliver more than one petaflop of performance, but system prices start at around half a million dollars. A cheaper option is to make use of hosted computing services such as Microsoft Azure or Amazon Web Services. But if you want to own and control your own hardware, a home-brew approach can provide a usable measure of supercomputing power at a comparatively realistic price.
What does a homemade supercomputer look like? As we’ve noted, there’s no formal definition of a supercomputer. One thing that’s likely to characterise your hardware, however, is parallelisation: historically, parallel processing is the means that has allowed supercomputers to achieve their exceptional levels of performance.
Almost every modern CPU on the market has two or more physical cores built directly into the chip package, so arguably you could install a mainstream CPU in a regular motherboard and call it a supercomputer. Indeed, a modern Core i7 system will deliver computing power on a similar scale to that of a real supercomputer from 20 years ago, such as the Intel Paragon, which cost a million dollars and filled half a room.
However, the term supercomputer implies something beyond the norm, and these days, an eight-core system is comparatively run-of-the-mill. A 16-core system might qualify. A 48-core system? Now we’re getting somewhere.
How do you go about assembling a system like this? One option is to invest in a motherboard that supports multiple processors. Another is to combine many computers into a cluster that functions as a single supercomputer. Alternatively, you could look beyond the CPU to add-on cards that place huge quantities of raw number-crunching power in the hands of the CPU. Or you could use the hundreds of stream processors on a graphics card to the same end. Let’s look at each of these approaches in turn.
Mainstream desktop chips aren’t ordinarily used in multiprocessor configurations, and you’ll find very little hardware support for doing so. If you want to run multiple CPUs in parallel, you’re basically limited to workstation or server architectures. On Intel hardware, this means LGA 2011 chips, most of which come under the Xeon brand. If you prefer AMD, you can use the still-supported Socket G34 platform, or the newer Socket C32 that supports the latest Opteron models.
None of this is cheap – the hardware is aimed at businesses, which are typically willing to pay for heavy-duty hardware. Dual-socket LGA 2011 motherboards start at around $350, and processors at $300+ each for the Core i7-3820. Move up to the top-of-the-range eight-core Xeon E5-2690 and you’re looking at considerably more.
This approach has one major benefit, however: Windows is designed to “just work” in multiprocessor environments, so any program that can make sensible use of a dual-core processor should automatically scale up to run in a 16-core environment. This makes a multiprocessor model appealing if you want to use your supercomputer to run mainstream multithreaded applications such as 3D-rendering tools or media encoders.
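To make the scaling idea concrete, here is a minimal Python sketch of a program that sizes its pool of workers to however many cores the operating system reports, so the same code spreads across two cores or 16 without changes. The function names (sum_of_squares, split_range, parallel_sum_of_squares) are hypothetical, and the sum-of-squares job is only a stand-in for real work such as rendering or encoding.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def sum_of_squares(start, stop):
    """CPU-bound stand-in for real work: sum i*i for start <= i < stop."""
    return sum(i * i for i in range(start, stop))


def split_range(n, parts):
    """Split range(n) into `parts` contiguous (start, stop) chunks."""
    step, extra = divmod(n, parts)
    chunks, start = [], 0
    for i in range(parts):
        # The first `extra` chunks take one additional item each.
        stop = start + step + (1 if i < extra else 0)
        chunks.append((start, stop))
        start = stop
    return chunks


def parallel_sum_of_squares(n, workers=None, executor_cls=ThreadPoolExecutor):
    """Fan the chunks out to a pool sized to the reported core count."""
    workers = workers or os.cpu_count() or 1
    chunks = split_range(n, workers)
    with executor_cls(max_workers=workers) as pool:
        # zip(*chunks) turns [(s0, e0), (s1, e1)] into (s0, s1), (e0, e1).
        partials = list(pool.map(sum_of_squares, *zip(*chunks)))
    return sum(partials)
```

One caveat: threads in the standard CPython interpreter take turns on CPU-bound work, so this default shows the structure rather than a real speed-up. Passing executor_cls=ProcessPoolExecutor (also from concurrent.futures) runs each chunk in its own process, which is what actually occupies every core.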
Forming a cluster
The multiprocessor approach has limitations. Once you’ve installed your two expensive processors in your expensive motherboard, there’s almost no scope to expand organically; you could install more RAM, or swap out your processors for a pair of more powerful models, but basically what you have is a closed system. A more flexible approach is clustering.
A cluster is a group of computers, typically connected via a local area network, which acts as if it were a single system. Clusters can be used for all sorts of purposes, such as providing load balancing and fault tolerance for network services, but the model lends itself particularly well to supercomputing applications. Indeed, a clustering approach has been the basis of most of the best-known supercomputers in history, including Fujitsu’s world-beating K computer.
The philosophy behind supercomputing clustering is simple. One physical (or virtual) machine is configured as the “master” system or the “head node”, and it’s on this system that the main application code runs. The other nodes do nothing but sit and wait for the master system to delegate workloads to them; when these are received, they do the work and return the results as quickly as possible.
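The head-node pattern can be sketched in a few lines of Python. This is an illustration under loud assumptions rather than real cluster software: threads stand in for separate machines, in-memory queues stand in for the network, and the "work" is simply squaring a number. The names worker_node and head_node are hypothetical.

```python
import queue
import threading


def worker_node(tasks, results):
    """A node: wait for work, do it, return the result, repeat."""
    while True:
        task = tasks.get()
        if task is None:  # sentinel from the head node: shut down
            break
        task_id, number = task
        results.put((task_id, number * number))


def head_node(numbers, node_count=3):
    """The head node: delegate one task per number, then collect results."""
    tasks, results = queue.Queue(), queue.Queue()
    nodes = [threading.Thread(target=worker_node, args=(tasks, results))
             for _ in range(node_count)]
    for node in nodes:
        node.start()
    for task_id, number in enumerate(numbers):
        tasks.put((task_id, number))
    for _ in nodes:
        tasks.put(None)  # one shutdown sentinel per node
    answers = [None] * len(numbers)
    for _ in numbers:
        # Results arrive in whatever order nodes finish, so file by id.
        task_id, value = results.get()
        answers[task_id] = value
    for node in nodes:
        node.join()
    return answers
```

Calling head_node([1, 2, 3, 4, 5]) hands five tasks to three workers and reassembles the answers in their original order. Tagging each task with an id matters because nothing guarantees that nodes finish in the order the work was handed out.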
A computational cluster can be seen as a macrocosm of a multiprocessor system, with multiple computers working on their individual tasks in parallel.
The difference is that nodes can be added to your cluster, or removed, as easily as connecting a new PC to a network; and, what’s more, there’s no requirement at all for the node hardware to use any particular architecture. If you wanted, you could assemble a cluster from a mix of systems including laptops, workstations and high-performance servers. The only requirement is that each node is running suitable client software.
Arguably, the best-known examples of computing clusters are the SETI@home and Folding@home projects – but the term “cluster” more usually implies a centrally managed system (projects that combine the power of remote computers are referred to instead as “grid computing”).
The nodes of a cluster are also usually connected via a much faster link than a regular internet connection, to minimise the latency involved in sending workloads back and forth. In your home cluster, that might be Gigabit Ethernet; the K computer uses a proprietary interconnect called “Tofu”, which provides 100GB/sec of bandwidth.
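For a rough sense of what the interconnect means in practice, the Python sketch below estimates how long it takes simply to move a workload across each link. It assumes theoretical peak rates (one gigabit per second is 125MB/sec), ignores latency and protocol overhead entirely, and uses a hypothetical 1GB payload, so treat the results as best-case figures.

```python
def transfer_seconds(payload_bytes, link_bytes_per_sec):
    """Best-case time to push a payload across a link at its peak rate."""
    return payload_bytes / link_bytes_per_sec


# Theoretical peak rates in bytes per second.
GIGABIT_ETHERNET = 1e9 / 8   # 1 gigabit/sec = 125MB/sec
TOFU = 100e9                 # the 100GB/sec figure quoted for the K computer

PAYLOAD = 1e9                # hypothetical 1GB workload

home_cluster = transfer_seconds(PAYLOAD, GIGABIT_ETHERNET)
k_computer = transfer_seconds(PAYLOAD, TOFU)
print(f"Gigabit Ethernet: {home_cluster:.2f}s, Tofu: {k_computer:.2f}s")
```

On these assumptions the same payload takes eight seconds over Gigabit Ethernet and a hundredth of a second at the quoted Tofu rate, which is why a home cluster suits jobs where each node computes for a long time on a small amount of data.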
Windows-based clusters can be assembled quite easily using the Windows HPC Server 2008 operating system, and Microsoft provides guidelines for creating “cluster-aware” applications that will make use of cluster resources when run on such a system. Alternatively, there are various free Linux distributions that are designed for clustering, such as openMosix and ClusterKnoppix. These provide a user-friendly experience that makes it almost effortless to set up a cluster of any size using the popular Beowulf system.
Whichever route you choose, however, one limitation that you’re likely to encounter is a dearth of pre-existing applications that are designed to make use of cluster resources. This isn’t necessarily a problem, as supercomputer tasks are typically carried out by bespoke code (see Supercomputer coding, p86).
Intel’s Knights Corner squeezes 48 CPU cores into a PCI Express card
The cluster approach is flexible, but quite wasteful – it basically means leaving an entire computer switched on and drawing power when you’re typically making use of only a few functions of the processor. A more energy-efficient approach is to mount a large number of processor cores on one expansion card and use these cores as a virtual cluster.
This was the thinking behind Intel’s ill-fated Larrabee project, which sought to integrate 32 x86 cores – processor cores such as you might find in a regular PC – onto a single PCI Express card. An early demonstration of the hardware showed a Larrabee card achieving performance of just over one teraflop, and the idea was that its huge parallel-processing power could be used to render complex, high-quality graphics in real-time.
Larrabee couldn’t be made to work as a graphics-orientated product, and the project was officially shelved in 2010. But Intel kept working on a more general-purpose Larrabee-type architecture – called the Many Integrated Core architecture, or MIC for short – which could be used for any sort of parallel processing. A prototype 32-core PCI Express card, codenamed Knights Ferry, was trialled in 2010 at the Leibniz Supercomputing Centre and at CERN, and proved capable of providing around 750 gigaflops of computing power. Its successor, codenamed Knights Corner, is expected to go on general sale later this year, and will probably sport 48 cores or more.
Knights Corner looks set to be a neat and power-efficient way to turn your desktop PC into a supercomputer, but it’s a specialist market, so hardware costs are likely to be steep: it could actually work out cheaper to buy an entire cluster of multicore PCs. And the applications you run will need to be written specifically for parallelised execution.
Your last option for supercomputing is to eschew conventional CPU cores entirely, and instead exploit the power of your graphics card. After all, the shaders in a GPU (or stream processors, as they’re also called) are designed to carry out large numbers of calculations in parallel at very high speeds – which is exactly what supercomputers are traditionally best at doing. As we’ve noted above, supercomputers have often been used by professional studios for rendering 3D scenes.
GPUs offer far greater parallelism than CPUs. While a high-end CPU might have eight cores, even a mid-range desktop graphics card typically has more than 100 stream processors, and today’s high-end models have more than 2000. This enables a top-of-the-range AMD Radeon HD 7970 to turn over nearly four teraflops – almost 40 times the computational power of a Core i7-980X. Note that GPU performance is typically cited in terms of “single-precision” calculations, which can lead to rounding errors. Working with double-precision values, for accuracy comparable to that of a CPU, roughly halves performance.
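The arithmetic behind those figures is easy to check. The Python sketch below works only from the rounded numbers quoted above (four teraflops, a 40-fold lead, and a halving for double precision), so the outputs are ballpark values rather than measured benchmarks, and the function names are hypothetical.

```python
GPU_SINGLE_PRECISION_TFLOPS = 4.0  # "nearly four teraflops", rounded up
SPEEDUP_OVER_CPU = 40              # "almost 40 times", rounded up
DOUBLE_PRECISION_FACTOR = 0.5      # "roughly halves performance"


def implied_cpu_gflops(gpu_tflops, speedup):
    """CPU throughput implied by the GPU figure and the quoted ratio."""
    return gpu_tflops * 1000 / speedup


def double_precision_tflops(single_tflops, factor):
    """GPU throughput once the double-precision penalty is applied."""
    return single_tflops * factor


cpu_gflops = implied_cpu_gflops(GPU_SINGLE_PRECISION_TFLOPS, SPEEDUP_OVER_CPU)
gpu_double = double_precision_tflops(GPU_SINGLE_PRECISION_TFLOPS,
                                     DOUBLE_PRECISION_FACTOR)
print(f"Implied CPU: about {cpu_gflops:.0f} gigaflops")
print(f"GPU at double precision: about {gpu_double:.0f} teraflops")
```

In other words, the comparison implies a CPU delivering roughly 100 gigaflops, and a GPU that still manages around two teraflops after the double-precision penalty.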
Even so, using graphics hardware is vastly more economical than conventional processors. The reason GPU stream processors are so cheap by comparison to CPUs is that they’re massively simpler – their capabilities are largely limited to performing straightforward mathematical operations. A GPU would be very ill-suited to running applications, but for supercomputing it’s just the ticket.
Since GPU architectures are fundamentally different to CPU designs, applications must be written specifically to use the GPU as a computing resource (an approach known as GPGPU, short for “general-purpose computing on graphics processing units”). However, this needn’t mean learning a whole new programming paradigm. Nvidia cards use what’s called the Compute Unified Device Architecture (CUDA), which means that they can be programmed in a variant of C – and, with recent hardware, C++ – with extensions to access GPU-specific functions.
Windows programmers can alternatively make use of a library of DirectX functions called DirectCompute, which sends tasks to the graphics hardware. A third option is OpenCL, which can be used to create GPU-bound functions in a C-like language. Both frameworks will work on any AMD or Nvidia graphics card, and even with Intel’s integrated GPUs, so your code needn’t be tied to any platform.
If you choose to take the GPU route, you can start very cheaply with mainstream hardware. But both Nvidia and AMD also offer premium cards designed specifically for GPGPU applications (branded “Tesla” and “FireStream” respectively). These include performance optimisations that are potentially valuable to the supercomputing market, such as improved performance in double-precision calculations, giving them even more of a lead over conventional desktop processors. These cards aren’t cheap – a Tesla model with 512 stream processors could cost around $4000. But it’s still cheaper than 512 CPU