Multicore processors are causing much consternation in the software-development community. Traditional single-threaded programs essentially gain nothing by running on microprocessors with multiple cores. Indeed, the program might even run worse. Multicore processors are in vogue because the power-dissipation penalties of higher clock speeds force CPU architects to find alternatives. The newly popular alternative is to integrate multiple processor cores in a single chip, clock the cores at a lower frequency, and tell programmers to rewrite their software.
Traditional programming languages and techniques have difficulty taking advantage of multiple processor cores. Even using a fairly recent language like Java, which is inherently multithreaded, doesn't guarantee that the underlying virtual machine or operating system will map application-level threads onto multiple cores. Sometimes, multiple threads are simply time-sliced on a single processor. Another problem is that multithreaded programs are harder to visualize, implement, debug, and verify. Subtle threadlock bugs can lurk for years before discovery. (See MPR 4/30/07-02, "The Dread of Threads.")
If programmers are having trouble managing a relative handful of threads, how will they cope with manycore processors and massively parallel processors? Whereas today's most powerful PC processors have four cores per chip, some embedded processors already have hundreds or even thousands of cores per chip. Massive integration is the future of microprocessor design. Explicitly coded multithreading probably isn't the answer for massive parallelism.
The solution that seems to be emerging is explicitly coded data-level parallelism. There isn't an industry-standard, widely accepted programming language for this solution yet, although several parties are working on it. Instead, there are different languages or language extensions from different sources. Usually those sources are working independently.
Two examples I've analyzed recently are Nvidia's CUDA (Compute Unified Device Architecture) and RapidMind's Multicore Development Platform. CUDA works only with Nvidia's graphics processors. RapidMind works with graphics processors from both Nvidia and ATI, as well as with x86 processors and IBM's Cell Broadband Engine. Both CUDA and RapidMind require programmers to learn proprietary C/C++ extensions and techniques. Programmers must rewrite at least some of their code to expose hidden data-level parallelism. (See MPR 1/28/08-01, "Parallel Processing With CUDA," and MPR 11/26/07-01, "Parallel Processing For the x86.")
Sequential vs. Parallel Thinking
Ironically, chip designers and FPGA developers often find it easier to visualize parallelism than conventionally trained software programmers do. Hardware-design languages such as Verilog and VHDL embody the concept that multiple parts of a machine can operate simultaneously. Likewise for FPGA development tools. These languages and tools can express concepts of timing and synchronization to ensure that everything meshes together without chaos. In contrast, most software languages are structured around sequential processing. One thing happens after another, instead of multiple things happening at once.
Sequential processing breeds sequential thinking. Programmers tend to solve problems by writing algorithms that are step-by-step solutions, like cake recipes. Put all the ingredients into the bowl, then run the mixer, then dump the batter into a cake pan. Bake the cake, then apply the frosting. If 10,000 cakes need baking, the conventionally trained programmer will wrap that sequential algorithm inside a loop with 10,000 iterations. The loop won't run any faster on a quad-core processor than it will run on a single-core processor.
The answer is to "think parallel." Instead of visualizing a solution as a step-by-step process carried out by a series of sequential instructions, programmers must carefully analyze the data on which the algorithms operate. How much of that data can a program manipulate in parallel? Are there data dependencies that prevent parallelism? If so, what is the smallest chunk of data that's free of dependencies? That chunk becomes the smallest piece of the problem that can run independently of other pieces on an individual processor core. Instead of baking 10,000 cakes one at a time, distribute the task among dozens or hundreds or thousands of bakers. Explicit data-level parallelism requires programmers to perform this analysis before coding the solution. It's a different way of thinking. Do it once, however, and the principles quickly fall into place.
The concept of "thinking parallel" didn't fully crystallize for me until recently. As I talked with engineers at Nvidia and RapidMind, suddenly I remembered a program I had written 20 years ago. Using Microsoft QuickBASIC, MS-DOS, and PCs with Intel 286 processors, I wrote a massively parallel program in the 1980s. I can't claim credit for brilliant foresight, because it happened completely by accident. In fact, I didn't grasp the potential parallelism until after my program was finished, tested, and deployed at the company where I worked. Impossible? Here's the story.
Indexing the Civil War
In the late 1980s, I joined a small startup that published books and periodicals. One of my first jobs was to edit a trilogy of Civil War history books. Great fun for a Civil War buff like me. But when it came time to compile the index, I was flummoxed. History books are stuffed with names of people and places. I could see that the index would be tons of work. Even with the help of an executive secretary, the task consumed a whole week. And that tedious effort was only for the first book in the trilogy. Two more books to go! There had to be a better way.
The result was a program named R.I.P.Robotic Indexing Program. I wrote the program in QuickBASIC instead of much faster Turbo Pascal because QuickBASIC had better string handling, and this was a string-intensive program. R.I.P. had to "read" the whole book and compare each word against an index list I had prepared while editing the book. For each match, the program noted the post-layout page number, extracted 100 words of text surrounding the matched word, and saved this information in a text file. Later, I reviewed each match for accuracy, using the 100 words surrounding each hit to evaluate the context. (Manual evaluation was unavoidable. The program couldn't determine whether a match such as "Lee" referred to General Robert E. Lee, a different person named Lee, or even a town named Lee. Disambiguation would have required a level of artificial intelligence that's still elusive today.)
When I began writing R.I.P., I discovered that mimicking the way a human book editor compiles an index wasn't the most efficient method for a computer. A person reads the book once, looking for matches in the whole index list. But that method led to some ugly and inefficient program code. I determined that my program would be more efficient if it read the entire book once for each word in the index list. If the index contained 400 words, then R.I.P. would read the book 400 times. Obviously, a (sane) editor wouldn't do it that way.
With high hopes, I turned my finished program loose on the second book in the Civil War trilogy. It worked! Well, sort of. R.I.P. crawled like a groggy snail on the 14MHz 286 processor in my 1980s PC. The book was so thick, and my index so comprehensive, that I estimated R.I.P. would take weeks to finish the job.
Division of Labor Saves the Day
Then came my first revelation. It occurred to me that if I divided a 400-word index into two parts and ran R.I.P. on two different computers, they would finish the job twice as fast. This division of labor was possible because I had written the program to read the entire book for each index word. There were no data dependencies among the words. The computers needn't communicate with each other. As far as they knew, each instance of R.I.P. was compiling a 200-word index for a book. When the two computers were done, I could integrate their results into a single 400-word index. By accident, I had written a program capable of two-way data parallelism.
Mere seconds after this revelation came the really big brainstorm. If I divided a 400-word index into four parts, then I could run R.I.P. on four different computers and finish the job four times faster. On eight computers, it would finish eight times faster, and so on. In theory, I could have as many as 400 computers working simultaneously on a 400-word index.
In practice, my ambition for massive parallelism was limited by the number of computers available at our small company. Pretty soon R.I.P. was running on about ten computers. As employees departed in the evening, I commandeered their PCs, launched R.I.P., and let the computers churn all night. Next morning, I collected the results on floppy disks and brought them back to my office. (We didn't have a LAN in those days.)
This unorthodox solution worked so well that the third volume in the Civil War trilogy boasted the biggest index of all. Every proper noun in the book was indexed, even if it occurred only once. Every person, every place name, every regiment, every ship.
That was 20 years ago. I had almost forgotten about R.I.P. until multicore processors began appearing and software developers began wringing their hands over the new programming challenge. Of course, a multicore processor is simply a multiprocessor system on a chip, so the challenge isn't really new. But it's new to most programmers. It requires a whole new way of thinking.
Better Tools for Data Parallelism
Today, the new reality is that sequential, single-threaded programming is becoming obsolete for applications needing higher performance. Even explicitly coded multithreaded programming is too limiting. It's workable on processors with a few cores, but it gets too hairy on manycore or massively parallel processors with dozens or hundreds of cores. In the long run, explicitly coded data parallelism appears to be the only practical answer. This reality is seeping into the heads of software developers very slowly.
To put it bluntly, programs with little or no exploitable data parallelism are reaching a performance plateau. Their performance will improve only incrementally, in step with the incremental improvements in microprocessor clock frequencies. CPU architects are now investing the benefits of Moore's law in multicore chips. If a program can't take advantage of multiple cores, the programmers must find another approach to higher performance. (Such as writing better sequential code.) The long, free ride on higher-frequency microprocessors is endingor, at least, it's dramatically slowing.
Unfortunately, software-development tools and languages aren't keeping up. Programmers desperately need an industry-standard language that lets them express data parallelism without binding the code to a specific CPU architecture or microarchitecture. Ideally, a program like R.I.P. would be able to query the system to discover the number of processors available. Then it would automatically distribute the workload among as many of those processors as possible, while avoiding any conflicts caused by data dependencies. Nobody wants to write data-parallel code for a quad-core processor only to rewrite it a year later for a six-core processor, then rewrite it again for an eight-core processor, and so on. The software must adapt to the hardware.
One new project working in this direction is the Parallel Computing Lab at the University of California at Berkeley, directed by David A. Patterson, a computer-science professor and famous coauthor of Computer Architecture: A Quantitative Approach. According to EE Times, Intel and Microsoft are funding the five-year project with $10 million. (See EE Times 2/21/08, "Berkeley Researcher Defines Parallel Path.")
An entirely new programming language could solve the problem, but it would have to overcome considerable industry resistance. In the 1990s, establishing Java required the marketing muscle of a rising Sun Microsystems, over stiff opposition from Microsoft and others. Even mighty Microsoft is having trouble popularizing its newest language, C#. Most likely, the eventual solution will be a more-or-less standard extension to C++, much as C++ added object-oriented programming to procedural C. And even that compromise won't be easy to achieve. Ultimately, however, the irresistible demand for higher performance will overcome the immovable object of industry inertia.