Wednesday, 24 September 2014

What is Parallel Computing?
A Not Too Serious Explanation.

Before I explain parallel computing, it's important to understand that:

You can run, but you can't hide.

I'll come back to this later.

Suppose you have a lot of work to be done, and want to get it done much faster, so you hire 100 workers. If the work is 100 separate jobs that don't depend on each other, and they all take the same amount of time and can be easily parceled out to the workers, then you'll get it done about 100 times faster. This is so easy that it is called embarrassingly parallel. Just because it is embarrassing doesn't mean you shouldn't do it, and in fact it is probably exactly what you should do. A different option would be to parallelize each job, as discussed below, and then run these parallelized jobs one after another. However, as will be shown, this is probably a less efficient way of doing the work. Occasionally this isn't true, because doing all of a job on one processor may require storing many intermediate results on disk, while the parallel version can spread the intermediate results across the RAM of the different processors, and RAM is much faster than disk. However, if the program isn't spending a lot of time using the disk then embarrassingly parallel is the smart way to go. Assume this is what you should do unless you analyze the situation and determine that it isn't. Embarrassingly parallel is simple, and if you can get the workers to do it for free then it is the cheapest solution as well.
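As a rough sketch (assuming OpenMP, and a made-up run_job function standing in for one of the 100 independent jobs), the whole thing can be as simple as one parallel loop:

    #include <omp.h>
    #include <stdio.h>

    /* Hypothetical stand-in for one of the 100 independent jobs. */
    static void run_job(int job_id)
    {
        printf("worker %d finished job %d\n", omp_get_thread_num(), job_id);
    }

    int main(void)
    {
        /* Each iteration is independent, so the loop can simply be
           split among the available workers (threads). */
        #pragma omp parallel for
        for (int job = 0; job < 100; job++)
            run_job(job);
        return 0;
    }

Compiled with OpenMP support (e.g., gcc -fopenmp), each worker just takes a share of the 100 jobs, with no coordination beyond the loop split.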

What if the jobs take widely different amounts of time, but still have no interaction? If you have enough of them, just start handing them out until every worker has one, and then when workers are done they come back and request another. This will work fairly well, especially if you can put longer jobs first and shorter ones later. This form of parallelization is still pretty simple. Many of the departmental computers that have more than one processor are run in this fashion, servicing jobs as they arrive. This is also how many airlines run their check-in queue: in queuing theory this is known as a "single queue multiple server" system, which is more efficient than the "multiple queue multiple server" systems common in checkout lines at grocery stores.
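In code, this single-queue idea corresponds to giving the loop a dynamic schedule: a worker that finishes one job comes back for the next unstarted one. A sketch, again assuming the made-up run_job and replacing the parallel loop in the sketch above:

    /* Jobs are handed out one at a time from a single queue: whenever a
       worker (thread) finishes a job, it comes back for the next one. */
    #pragma omp parallel for schedule(dynamic, 1)
    for (int job = 0; job < 100; job++)
        run_job(job);          /* the jobs may take widely different times */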
Suppose instead that the work you have is a single job, but it takes a very long time. Now you have to reorganize the job somehow, breaking it into pieces that can be done concurrently. For example, if the job is to build a house, it can be broken up into plumbing, electrical work, etc. However, while many tasks can be done at the same time, some have specific orderings, such as putting in the foundation before the walls can go up. If all of the workers are there all of the time, then there will be periods when most of them are just waiting around for some task (such as the foundation) to be finished. Not very cost-effective, and you are not getting the job done 100 times faster. Such is the life of a parallel programmer.
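If you want to see what such an ordering looks like in code, here is a hedged OpenMP-tasks sketch; the house-building functions are invented for the analogy, and a real project would have many more tasks and dependences:

    #include <stdio.h>
    #include <omp.h>

    static void pour_foundation(void)   { puts("foundation done"); }
    static void frame_walls(void)       { puts("walls done"); }
    static void rough_in_plumbing(void) { puts("plumbing done"); }

    int main(void)
    {
        #pragma omp parallel        /* the crew of workers */
        #pragma omp single          /* one worker hands out the tasks */
        {
            #pragma omp task
            pour_foundation();

            /* Walls and plumbing may not start until the foundation
               task has finished. */
            #pragma omp taskwait

            #pragma omp task
            frame_walls();          /* these two can proceed at the same time */
            #pragma omp task
            rough_in_plumbing();
        }
        return 0;
    }

While the foundation is the only task available, most of the crew has nothing to do, which is exactly the waiting-around cost described above.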

Definition: Parallel computing is the use of two or more processors (cores, computers) in combination to solve a single problem.
The programmer has to figure out how to break the problem into pieces, and has to figure out how the pieces relate to each other. For example, a parallel program to play chess might look at all the possible first moves it could make. Each different first move could be explored by a different processor, to see how the game would continue from that point. At the end, these results have to be combined to figure out which is the best first move. Actually, the situation is even more complicated, because if the program is looking ahead several moves, then different starts can end up at the same board position. To be efficient, the program would have to keep track of this, so that if one processor had already evaluated that position, then others would not waste time duplicating the effort. This is how most parallel chess-playing systems work, including the famous IBM Deep Blue machine that beat Kasparov.
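As a toy illustration of that bookkeeping (a deliberately simplified sketch, not how Deep Blue or any real engine was coded), the processors can share a table of position hashes: before evaluating a position, a worker checks whether someone has already claimed it.

    #include <stdbool.h>
    #include <omp.h>

    #define TABLE_SIZE 65536
    static unsigned long claimed[TABLE_SIZE];    /* 0 means "empty slot" */

    /* Returns true if some worker already claimed this position, so the
       caller can skip re-evaluating it; otherwise records the claim.
       Real chess programs use far fancier transposition tables. */
    static bool already_claimed(unsigned long position_hash)
    {
        unsigned long slot = position_hash % TABLE_SIZE;
        bool duplicate;
        #pragma omp critical(position_table)     /* one worker at a time */
        {
            duplicate = (claimed[slot] == position_hash);
            if (!duplicate)
                claimed[slot] = position_hash;   /* collisions just overwrite */
        }
        return duplicate;
    }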

Is Parallel Computing Easier or Harder than Serial Computing?

(Standard computing is also known as "serial computing", so you've been doing serial computing for years!)

Easier:

Most of the universe is inherently parallel, with numerous things happening simultaneously. You may put your pants on one leg at a time, but millions of people are putting their pants on simultaneously. Many problems have an abundance of natural parallelism that a programmer can exploit. Sometimes people are forced to rethink the problem when they try to develop a parallel program, and they change their entire approach in order to directly utilize the inherent parallelism.

Harder:

  • Blatant poor performance: it is easier to see that you're not doing well. On a serial computer the processor may wait as it moves data from RAM to cache, but it basically appears busy all the time; it is only after you do a more detailed analysis that you determine it is actually performing poorly. (See "Using Cache Saves Cash".) On a parallel computer with 100 processors, most people quickly notice that a program is not running 100 times faster. The serial program may not have been very good either (see "Why Algorithms Experts Deserve Big Bucks"), but that is generally not as obvious as the fact that many of the parallel processors are often idle.
  • Little experience: most programmers have little or no experience with parallel computing, and there are few parallel programs to use off-the-shelf or even good examples to copy from. Hence people often have to reinvent the parallel wheel (see "Parallelism needs Classes for the Masses"). People developing parallel systems software are similarly behind on their learning curves. Often they reuse material developed for serial systems, even when it causes performance problems (see "Serial Sins in a Parallel World").
  • Poor portability: a program may work on one machine, but when it is ported to a new machine that machine may be so different that drastic changes are needed just to let the program run. Most serial computers have the same basic organization, but this is not so for parallel computers. Some parallel computers, such as the popular cluster systems, are essentially just a collection of computers linked together with Ethernet. They use simple commands, similar to read and write, to communicate among the processors. (The most common system for doing this communication is MPI.) Such parallel computers are known as message-passing systems, or distributed memory computers. Clusters are usually quite inexpensive because they are built from commodity parts, and they have become quite common in businesses and universities. Distributed memory computers with better performance have faster interconnections among the processors, and software better tuned to support parallelism, but this increases the cost because many of the parts are non-standard. (A small sketch of the message-passing style appears after this list.)
    Other parallel computers, such as some of the systems from SGI and Cray, are shared-memory systems, where every processor can directly read and write data from every memory location. Think of this as being like everyone using a blackboard to do their calculations, where everyone has chalk and an eraser and can decide to write anywhere on the board. For these systems, there is a standard known as OpenMP, which is a collection of language extensions to Fortran and C/C++. Shared memory computers must use many non-commodity parts to support the sharing, and hence tend to be more expensive than distributed memory systems. However, it is often easier to write a program for shared memory systems than for distributed memory ones.
    Yet other computers exploit graphics processors, GPUs, to achieve parallelism. However, while a GPU may have 100 cores, there are serious restrictions on their programmability, and serious performance problems if there is much data movement. Further, the languages used for GPUs are rapidly evolving, so it is unclear whether you should use CUDA, OpenCL, OpenACC, the accelerator extensions being incorporated into OpenMP, etc.
  • Amdahl's law: if the foundation takes 5% of the time, and only one worker can do it while everything else has to wait until the foundation is done, then you can never get the job done in less than 5% of the original time, no matter how many workers you have. Incidentally, this helps explain why it is much easier to be a successful parallel programmer on a small system. In such a situation, if everything else parallelizes perfectly, then 100 workers would take 5% (serial) + 95%/100 (perfect parallel) = 5.95% of the original time, compared to 100%/100 = 1% of the original time if everything were perfectly parallel. Here efficiency is (time for 1 worker)/[(number of workers)*(time they take)]; you can think of this as the wages if one worker does it, divided by the total wages when the group does it. Since 1/(100*0.0595) ≈ 0.17, the efficiency is only about 17%. 20 workers would work at about 51% efficiency, and 10 workers at about 69%. The fewer the workers, the higher the efficiency. (The short program after this list reproduces these numbers.)
  • Weakest links: if a group of workers all depend on each other, then the group can go no faster than the slowest worker. If one worker is talking on the cellphone trying to arrange a date, everyone slows down, especially if they are trying to listen in. Other weak links can be the compiler, operating system, communication system, etc.
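To make the message-passing style concrete, here is the small MPI sketch promised above (the partial sum is just a stand-in for real work): each process holds its own data, and the pieces have to be combined with an explicit communication call. On a shared-memory system the partial results could instead simply be written into a common array, the blackboard.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which processor am I? */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many processors?  */

        /* Each process sums its own share of the integers 0..999,
           a stand-in for working on its own slice of a real problem. */
        long local = 0;
        for (int i = rank; i < 1000; i += size)
            local += i;

        /* There is no shared blackboard, so the pieces are combined by
           passing messages; MPI_Reduce collects the sum onto rank 0. */
        long total = 0;
        MPI_Reduce(&local, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("total = %ld\n", total);

        MPI_Finalize();
        return 0;
    }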
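And the Amdahl's law numbers are easy to check; this little program (the 5% serial fraction is the foundation example above) prints the speedup and efficiency for 10, 20, and 100 workers, giving roughly 69%, 51%, and 17%.

    #include <stdio.h>

    int main(void)
    {
        const double serial = 0.05;           /* 5% of the job cannot be parallelized */
        const int workers[] = {10, 20, 100};

        for (int i = 0; i < 3; i++) {
            int p = workers[i];
            /* fraction of the one-worker time that p workers need */
            double time    = serial + (1.0 - serial) / p;
            double speedup = 1.0 / time;
            double eff     = speedup / p;     /* (time for 1) / (p * time for p) */
            printf("%3d workers: speedup %5.2f, efficiency %2.0f%%\n",
                   p, speedup, 100.0 * eff);
        }
        return 0;
    }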

Why can't you hide?

In the past, when someone needed to solve bigger problems they could often just wait a bit, and a faster computer would come out. This was mainly due to Moore's law, which people interpreted to mean that the speed of computers would double about every two years. However, this isn't true any longer: if you compare the GHz speed of the processors in a desktop computer two years ago with the speed now, it has barely increased, if at all. Basically this happened because manufacturers can't make reliable low-cost processors that run significantly faster.
What has doubled, however, is the number of processors on a single chip. Some smartphones now have 8 cores (processors), you can buy CPU chips with 12 cores, and this number will soon increase. These are called "multi-core" or "many-core" chips. There are also graphics processing units (GPUs) with over 100 highly specialized processors. This is closer to what Moore was talking about, because what he really said was that the number of transistors would keep doubling. Eventually that too will slow down, but not for at least a decade.
If there are only a couple of cores in your computer then it is pretty easy to find tasks that can be done simultaneously, such as waiting for keystrokes and running a browser, i.e., embarrassingly parallel. However, as the number of processors keeps increasing, the parallelization problems mentioned above become increasingly serious. There are now parallel computers that have > 1,000,000 cores, and people are planning for ones that will have > 100,000,000.
Overall, this means that there is a massive need to make use of the parallelism in multi-core chips for almost any problem, and to use many of these combined together for large problems. Job security for me! (and you, if you learn parallel computing).
