More, Faster, Easier
Highly parallel processors are promising performance on the order of tera operations per second. Programmability gives these processors an advantage over hard-wired accelerators, but it's not yet clear whether hundreds of cores on a chip are good for anything except data-intensive workloads. General-purpose computers are still waiting for a solution that provides frequency-like performance increases. And, speaking of performance in general-purpose computing, I think most of us wish its evolution would continue, making our computers faster and better while keeping the compatibility that preserves the value of our expensive programs.
Take, for example, the evolution of transportation: from the horse and carriage, to the automobile, tram, train, ship, airplane, to rocket-powered shuttles, to the space station. We have invented many means of transportation that speed up and improve the way we move from one place to another. We can travel more and do it in a shorter time, and most vehicles are becoming easier to use.
Until a few years ago, the computer was like that. Increase its memory and it can do more; raise its frequency and it does it faster; make a better interface and it's easier to use.
Frequency was especially important. Not just because it sold desktops and notebooks. It allowed us to go past the barrier of the program, Alan Turing's algorithm. To be more precise, past the basic operations at the foundation of our math and logic and program control. Yes, we can fuse several instructions and refer to them by one mnemonic. But it doesn't matter how we name a group of instructions or how elegant it may look to compiler writers. Whether invoked individually or as a group, an integer addition, a logic operation, or a conditional branch must still be executed. To add, you still need to bring two operands into the adder and write the result to a register or memory.
We can't make these primitive operations happen faster except via higher frequency. Unfortunately, frequency has just crashed into the silicon wall of semiconductor physics. There may be hope: like the internal-combustion engine yielding to the jet engine for speed, we may still develop a technology that will carry frequency forward. After all, computers have already transitioned from vacuum tubes to transistors. But until frequency increases come back, we can speed up the algorithm only by executing parallel sequences of the primitive operations making it up.
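To make the point concrete, here is a minimal sketch of my own (not drawn from any particular processor or toolchain) that sums a list two ways. The function names `sequential_sum` and `tree_sum` are hypothetical. Both versions execute the same number of additions; what changes is the length of the chain of additions that must wait on one another.

```python
def sequential_sum(values):
    """Add left to right: each addition depends on the previous result."""
    total = values[0]
    adds = 0
    for v in values[1:]:
        total += v
        adds += 1
    return total, adds


def tree_sum(values):
    """Pairwise reduction: additions within one level are independent,
    so in principle each level could run in a single parallel step."""
    level = list(values)
    adds = 0   # total additions performed (the work does not shrink)
    depth = 0  # number of dependent levels (the critical path)
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            nxt.append(level[i] + level[i + 1])
            adds += 1
        if len(level) % 2:
            nxt.append(level[-1])  # odd element carries over unchanged
        level = nxt
        depth += 1
    return (level[0] if level else 0), adds, depth


if __name__ == "__main__":
    data = list(range(1, 9))
    print(sequential_sum(data))  # seven additions, each waiting on the last
    print(tree_sum(data))        # seven additions arranged in three levels
```

For eight values, the sequential version needs seven dependent steps, while the tree needs only three levels. No primitive addition got faster; the sequences were simply arranged so they could run side by side.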
Here is our status. We know how to program single engines, but we can't increase their frequency. We know how to design hundreds of cores on chip, but we don't have productive software to program them.
Still, we're lucky to have parallel computing thrust upon us. The human brain is not a single processor operating in the THz range. Parallel computing is forcing us to dissect the single-processor implementation of the algorithm more than before. This is not just a hardware task. Nevertheless, instead of the breakthroughs one would hope to see in concepts and software, hardware architecture design has taken the lead again, providing hundreds of processor variants. The development was predictable. Compared with the difficulties encountered many years ago, better tools have made it relatively easy to implement a processor. We now have a better grasp of the basic principles of computer architecture, and good ISA definitions are easier to produce. Universities have joined processor companies in announcing new architectures. A few examples are Stream Processors from Stanford University, RAW from MIT (TILE64 from Tilera), and TRIPS from the University of Texas at Austin.
Yet compared with past achievements, no real progress has been made in software. Earlier progress may have been possible because, many years ago, there were fewer architectures around.
During the seventies and eighties, development difficulties and costs of implementation allowed the creation of just a few, mostly single-processor general-purpose architectures. The few targets were a blessing. They held the promise of volume sales of system and application programs. They focused large numbers of programmers on developing better tools and even larger numbers on using them. Software was able to take advantage of increasing memory sizes and higher processor performance. This allowed application programmers to leave behind assembly language in favor of less efficient but more productive higher-level languages like C and C++. Programmers could accomplish more, more quickly, and with visual languages, it became easier to work.
By comparison, we'll find that the many core ISAs being offered today in parallel configurations are mostly incompatible with each other. They have fragmented the target for software tools and applications. Processor vendors can offer only slightly modified single-processor tools, since ISVs have no strong incentive to support any specific processor.
Application programmers must now isolate "by hand" independent sections of code and assign them to many processors. They need to optimize the code to take advantage of, or cope with, special on-chip buses and memory configurations. This type of work used to be done by patient mathematicians and other scientists for whom parallel computing was the only way to get timely results.
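As an illustration of what this "by hand" work looks like, here is a minimal sketch using only Python's standard library. The helper names (`split_into_chunks`, `sum_of_squares`, `parallel_sum_of_squares`) and the default worker count of four are my own assumptions, chosen for the example. Note that the programmer, not the tool, decides where the data is cut and how the partial results are recombined.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(data, workers):
    """Cut data into at most `workers` contiguous, independent slices."""
    size = max(1, -(-len(data) // workers))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def sum_of_squares(chunk):
    """The independent section: it reads only its own slice."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(data, workers=4):
    """Assign one slice to each worker, then combine the partial results."""
    chunks = split_into_chunks(data, workers)
    # Assumption: a thread pool stands in for "many processors". In CPython,
    # threads do not speed up CPU-bound code; the point is the decomposition.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(sum_of_squares, chunks))
    return sum(partials)


if __name__ == "__main__":
    print(parallel_sum_of_squares(list(range(10))))
```

Even in this toy case, the chunk size, the number of workers, and the final combining step are all chosen manually. On a real many-core part, the same decisions would also have to account for the on-chip buses and memory configuration.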
Undeterred, using established architecture concepts and lessons from scientific programming, a number of hardware and software designers have created good data-intensive engines and code aimed at high-performance applications such as audio/video, infrastructure, networking, and imaging. Most of the cores' ISAs are traditional, without support for parallel structures. The programming tools are still C and C++, good for programming the single core.
General-purpose processing, claimed by a few parallel engines, has yet to be demonstrated. Where does this leave the millions of programmers writing code for general-purpose processors such as the x86?
Intel, AMD, and others employing several on-chip cores, admittedly delivering uneven performance gains, may have the best answer. These companies have a good chance of educating and taking with them the millions of programmers as, through careful research and attention to compatibility, the long threads on a few cores turn into shorter threads on many cores.
Embedded and general-purpose computer designers have always made different choices in system implementation, and their employment of parallel processing continues to show the effect of the different workloads.
In the embedded world most processors target data-intensive applications. Marketing messages praise the fully programmable parallel engines' ability to track technology, compared with the less flexible heterogeneous configurations containing hard-wired cores.
Some of the massively parallel processors, requiring heavy support in application and system code, may be very difficult to program. The complexity of reprogramming them or adding new features and standards may come close to the development time and expense of heterogeneous architectures combining a simple parallel section, hard-wired accelerators, and a processor that can run the OS. Today's success will be assured only for the few homogeneous parallel processors that, through good features, tools, and excellent sales, will make it into embedded applications.
Desktop and notebook processor designs must be more conservative. They will evolve at a slower pace, preferring to use a few cores instead of many tens or hundreds. Research concentrates on communication among a handful of cores and intelligent caches. Software development tools are affected minimally, but the overall performance improvements are spottier and less formidable than those claimed by the massively parallel processors in data-intensive applications.
We're waiting for desktop performance to increase, to continue its all-purpose support for learning and thinking and working. Until then, I guess we'll play games and watch movies in high definition.