Bark of the byte

Running with bytes and other wild creatures

Part 0: Concurrent versus parallel

No matter how you arrive at concurrent programming, whether it is because you want two independent systems to collaborate on some task, to use separate threads of execution within a program, or to perform some task asynchronously, you end up in a twilight zone called ‘the difference between concurrent and parallel’. 

The purpose of this post is to describe the definitions that we will use for concurrent and parallel, not to define a universally applicable definition. Because of the conflicting information and definitions out there, it was actually pretty difficult to find commonality and agreement in defining these terms, so I ended up doing a deeper dive than I expected. If you are interested in how I came to these definitions, then you can read about it in this post. However, if you just came here as a reference page from another post, then just stay on this page, as the forging of the definitions is only relevant if you really want to get into the details. 

The main reason that we even need to concern ourselves with the difference between concurrent and parallel is performance, as will be demonstrated in this post.

The definitions

The following table shows the definitions. Don't worry if it doesn't immediately make sense; everything is discussed in detail in the rest of this blog post.

System
  Concurrent: Several operations occur in overlapping time periods.
    Examples: a multitasking single-core computer; a multiple-core computer; a distributed network of computers; bees buzzing around a hive.
  Parallel: Many operations occur simultaneously.
    Examples: a multiple-core computer; a distributed network of computers; bees buzzing around a hive.

Computing / processing
  Concurrent: Several computations are executed during overlapping time periods.
    Examples: a multitasking single-core processor; a multiple-core processor.
  Parallel: Many computations are executed simultaneously.
    Example: a multiple-core processor.

Algorithm
  Concurrent: A set of instructions that are executed in overlapping time periods.
    Examples: a single-threaded program that interleaves tasks by executing timers at particular time intervals; a web server; a deterministic parallel Quicksort algorithm.
  Parallel: A set of instructions that are executed simultaneously.
    Examples: a web server (non-deterministically parallel); a deterministic Quicksort algorithm (deterministically parallel).


‘Overlapping time periods’ vs ‘simultaneously’

Looking at the descriptions for concurrent and parallel, they differ only in that concurrent has the phrase ‘overlapping time periods’ where parallel has the word ‘simultaneously’. So, if we understand the difference between the two, then we understand the difference between concurrent and parallel (in the context of this blog). Note that concurrent contains parallel. That means, for example, that all parallel systems are concurrent, but not necessarily that all concurrent systems are parallel. That is, operations that occur simultaneously occur in overlapping time periods, but operations that occur in overlapping time periods do not necessarily occur in parallel. So, what is an example of a concurrent processor that is not parallel? The answer is a single core processor that can multitask by switching between tasks. This is best seen with a few examples.

Take as an example a program that needs to move two blocks simultaneously (from the perspective of the user). Here is a graphic representation; go ahead and click ‘Go’.




Think of the movement of the block from the beginning to the end as a set of instructions to achieve a specific task. For example, the movement of the blue block from start to finish might represent a program to calculate the payroll for a month, while the movement of the red block from start to finish might represent a program for the compression of an object in memory. Each step of the blue block could represent some payroll calculation that can be performed in one step by the CPU. Each step of the red block might indicate some compression calculation that can be performed in one step. It just so happens that they each take the same number of steps to complete.

Now, there are two points to note: -

  1. Depending on the performance of your device, the single-core gave the impression of moving both blocks at the same time by switching between them very quickly. 
  2. The dual-core moved the blocks twice as fast as the single-core. This is because in each cycle the dual-core was able to move both blocks (one core moving each block), while the single-core was only able to move one block and then swap to the other block on the next step.

If you click the checkbox after Step so that it is checked and click Go multiple times, you will notice that the single-core moved one block at a time, while the dual-core moved both blocks simultaneously. The table below shows this:

Key: M=move *=could not be processed this cycle

Single-core blue: M * M * M * M * M *
Single-core red: * M * M * M * M * M
Dual-core blue: M M M M M M M M M M
Dual-core red: M M M M M M M M M M

As you can see, there are twice as many M (moves) for the dual-core. The single-core processor only performed a single instruction per cycle, so every second column of each row has a * because the other block was being moved. Each core of the dual-core machine can handle one operation at exactly the same time as the other, so both blocks are moved in every step. 
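The traces above can be sketched in code. Here is a minimal simulation (the task names and step counts are illustrative, not a model of any real scheduler) of a single core alternating between two tasks versus a dual core driving one task per core:

```python
# A single core executes one instruction per cycle, so it must alternate
# between the two tasks; a dual core executes one instruction per task
# per cycle. "M" marks a move, "*" a cycle where the task could not run.

def single_core(tasks, cycles):
    """One instruction per cycle, alternating (round-robin) between tasks."""
    traces = {t: [] for t in tasks}
    for cycle in range(cycles):
        active = tasks[cycle % len(tasks)]  # whose turn is it this cycle?
        for t in tasks:
            traces[t].append("M" if t == active else "*")
    return traces

def dual_core(tasks, cycles):
    """One core per task: every task moves on every cycle."""
    return {t: ["M"] * cycles for t in tasks}

single = single_core(["blue", "red"], 10)
dual = dual_core(["blue", "red"], 10)

print("Single-core blue:", " ".join(single["blue"]))
print("Single-core red: ", " ".join(single["red"]))
print("Dual-core blue:  ", " ".join(dual["blue"]))
print("Dual-core red:   ", " ".join(dual["red"]))
```

In 10 cycles the single core manages 5 moves per block while the dual core manages 10, matching the table: twice as many Ms for the dual core.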

Let’s take a look at those phrases again: concurrent = ‘overlapping time periods’ while parallel = ‘simultaneously’. Well, from the beginning of the run to the end of the run, the single-core moved both blocks in the same time period, so the time period for moving the blue block overlapped the time period for moving the red block. However, it did not move, and was not capable of moving, both blocks simultaneously (at the exact same time). The dual-core also moved both blocks in the same time period, but it additionally moved both blocks simultaneously. So, the single-core multitasking processing was concurrent but not parallel, while the dual-core processing was both concurrent and parallel.

So, the single-core is really giving the appearance of running in parallel while just swapping from one instruction to the next very quickly. In fact, if we make the single-core processor cycle twice as fast as the dual-core processors (or halve the speed of the dual-core processors), the overall speed is the same and the single-core is a close simulation of the dual-core. Click Go.




Now, depending on the performance of your device, there will be little noticeable difference between the single core and the multi-core results. Both appear to move the blocks at the same time. This means that the single core had to be twice as fast (do twice as many cycles) to achieve the same distance as the multicore. This is what is happening:

Key: M=move *=could not be processed this cycle
Single-core blue: M * M * M * M * M *
Single-core red: * M * M * M * M * M
Dual-core blue: M M M M M
Dual-core red: M M M M M


The main purpose of this example was to show that, given sufficient processing speed, a concurrent non-parallel system can appear the same to the user as a parallel system. Regarding performance, it was an unfair race because the multi-core was limited to half the number of processing steps of the single-core.

Single-core concurrent vs multi-core parallel performance

Now, given the same processor speeds, will the dual-core programs always run faster than the single-core? No.

Say, for example, the only task is to move the blue box from beginning to end. There is only one blue box and it has to transition through a set of sequential steps. Applying a second processor to this task will not help at all. Moving one box is an example of a sequential algorithm – a set of instructions that are executed in sequence. Moving two boxes is a concurrent algorithm, because we have two processes that can be run concurrently. There are times when even concurrent algorithms do not benefit from running in parallel. Let’s move our boxes again, but this time there is a twist: after moving a box, we need to wait one step before the box can be moved again. Click Go below to see what happens:




Notice two points:

  1. The single-core and the dual-core completed at the same time.
  2. The dual-core speed was halved, while the single-core remained the same.

If you check Step and Go multiple times, you will see what is shown by the table below:

Key: M=move W=wait *=could not be processed this cycle
Single-core blue: M * M * M * M * M *
Single-core red: * M * M * M * M * M
Dual-core blue: M W M W M W M W M W
Dual-core red: M W M W M W M W M W

Basically, the single-core performance is unaffected because where a task would be waiting, it is blocked in any case. On the other hand, the dual-core’s performance is halved because it is now waiting where it would previously have been moving. Note how the black Waiting box appeared when you stepped for the dual-core, but not the single-core. The single-core is always busy on one task or the other, while the dual-core has to wait every second step. Let’s think of something a little closer to familiar concepts. For example, suppose the algorithm for each block is:

  1. Move (M).
  2. Send a request to an off-processor camera to take a snapshot of the block in position (Sr).
  3. After n cycles the snapshot completes and the task receives the completion response (W…Rr).
  4. Go to 1.

This type of sequence is common in programming, for example when a program: -

  • Writes a message to a web service and waits for a response.
  • Sends a request to a USB device and waits for a response.
  • Writes to a database and waits for a response.
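This waiting is exactly why even a single thread of execution can benefit from concurrency: while one task is blocked, another can run. Here is a small sketch using Python's standard thread pool (the snapshot function and its 0.2-second delay are invented stand-ins for any blocking request/response call):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def snapshot(block):
    """Stand-in for any blocking request/response call (web service,
    USB device, database). The 0.2 s delay is invented for illustration."""
    time.sleep(0.2)  # while this task waits, the CPU is free
    return f"{block} done"

blocks = ["blue", "red"]

# Sequential: the total time is roughly the sum of the waits.
start = time.perf_counter()
one_after_another = [snapshot(b) for b in blocks]
t_seq = time.perf_counter() - start

# Concurrent: the waits overlap, so the total is roughly one wait.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=2) as pool:
    overlapped = list(pool.map(snapshot, blocks))
t_con = time.perf_counter() - start

print(f"sequential: {t_seq:.2f}s  concurrent: {t_con:.2f}s")
```

The two waits overlap in the concurrent version, so it finishes in roughly half the time of the sequential version even on a single core, because sleeping threads consume no CPU.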

OK, click Go and see what happens.




Note the following: -

  1. Unless you changed it, the number of wait steps was 1, which is the same as the previous example.
  2. The dual-core processor is back to performing better than the single-core.

Select Step, click Go multiple times, and then take a look at this table:

Key: M=move W=wait Sr=Send request Rr=Receive response *=could not be processed this cycle
Single-core blue: M Sr W * Rr M Sr W * * Rr
Single-core red: * * M Sr W * * Rr M Sr W
Dual-core blue: M Sr W Rr M Sr W Rr M Sr W
Dual-core red: M Sr W Rr M Sr W Rr M Sr W

The 11 steps shown in the table will repeat exactly for the rest of the run. In these 11 steps, the single-core moved each block twice, while the dual-core moved each block three times. The single-core was penalised with a * at 7 steps where a task would have run but the processor was servicing the other task. This penalty was bad enough that the single wait step was not sufficient to catch up.

CPU-bound performance

Now let’s take a look at what happens when the snapshot process takes 3 cycles. Change the Wait to 3. Hmm, they both performed the same. Let’s take a look at the steps in slow-motion again and then look at the table.

Key: M=move W=wait Sr=Send request Rr=Receive response *=could not be processed this cycle
Single-core blue: M Sr W W W Rr M Sr W W W
Single-core red: * * M Sr W W W * Rr M Sr
Dual-core blue: M Sr W W W Rr M Sr W W W
Dual-core red: M Sr W W W Rr M Sr W W W

There are sufficient waits (W) for the single-core to move the red box to catch up and recover from any collisions (*). Also, the fact that there are more waits means that collisions are less likely, because there is a greater chance that a task will be waiting when another task needs to process an instruction.

The main reason for showing this is as background for the original question. We asked whether it is always better to run a program on multiple processors, and the answer is ‘no’, as we have seen in the above examples. 

So, how do we tell? Well, as a start we found that when a processor spends much of its time waiting, the benefit of running operations in parallel decreases. So, would a busy processor be a good indication? I actually commonly come across the recommendation that if a processor is running at 100% CPU on a one-core machine, then this is a good indication that a program can benefit from being run in parallel.

However, it is not that clear-cut. It is very easy to write a tight sequential algorithm that would not benefit from running on multiple CPUs at all. Such a program would use 100% of one core, but that does not mean it is a good candidate for running in parallel (or even concurrently).
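For example, a loop in which every step depends on the result of the previous step will happily saturate one core yet offers a second core nothing to do (the update rule below is arbitrary, invented for illustration; any x = f(x) loop behaves the same way):

```python
# Each step consumes the previous result, so the computation is an
# unbreakable chain: 100% of one core, 0% benefit from a second one.

def iterate(seed, steps):
    x = seed
    for _ in range(steps):
        # Arbitrary 64-bit mixing step; the point is the data dependency.
        x = (x * 6364136223846793005 + 1442695040888963407) % (1 << 64)
    return x

result = iterate(42, 100_000)
print(result)
```

Splitting the loop between two cores is impossible: the second half of the steps cannot start until the first half has produced its final value.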

Well, how about the inverse: if the processor is not using 100% of a core, then it is not a good candidate? Consider the example below, where the blue block is moved every third cycle and the red block every second.

Key: M=move W=wait Sr=Send request Rr=Receive response *=could not be processed this cycle

Single-core blue: M W W M W W M W W M W
Single-core red: * M W * M W * M W * M
Dual-core blue: M W W M W W M W W M W
Dual-core red: M W M W M W M W M W M

The single-core is nowhere near its CPU limit: it uses only 66% of the CPU (every third step both tasks are waiting). However, it would still benefit from being run in a parallel manner – red moves 1.5 times as far in the same time on the dual-core as on the single-core. 

Rather than basing your decision on CPU usage, there are actually two main situations to identify: -

  1. As we have noticed, when there is a high number of collisions on a single-core (a * when not waiting) where an instruction would have run but could not because another task was being processed, then we will benefit from moving these tasks onto separate processors so that they can be processed in parallel. It is often very difficult to determine this in practice, as we will discuss in the next section.  
  2. A processor-intensive algorithm that can be subdivided into independently operating sub-algorithms.

We will now investigate points 1 and 2 in more detail. 

Point 1 - Scheduling, context switching, locks, non-determinism

You may have noticed that in our simulation, the single-core gives priority to a task that previously processed an instruction that was not a wait – i.e. the busy task. For example, if during a step blue processed M and red waited (W), then on the subsequent step blue would process Sr and red would be *. So, the sequence would be move red, send request red, move blue, send request blue. You may think that this is unfair to blue, so what happens if we change this algorithm to round-robin, which would result in move red, move blue, send request red, send request blue, and so on? Will this improve the performance? Well, there is a handy checkbox called Longest waiting that we can select. Go ahead and check it, make sure that Wait is still 3 and Step is not checked, and click Go.

Surprise. This is actually worse than the previous algorithm. Let’s look at the table:

Key: M=move W=wait Sr=Send request Rr=Receive response *=could not be processed this cycle
Single-core blue: M * Sr W W W Rr * M * Sr
Single-core red: * M * Sr W W W Rr * M *
Dual-core blue: M Sr W W W Rr M Sr W W W
Dual-core red: M Sr W W W Rr M Sr W W W

Note that the number of collisions is much higher with this algorithm so it is not possible to catch up to the dual core processor during the waits.

This touches on the problem of how the processor decides which task to service for optimal system performance. One would have thought that the round-robin approach would be the fairest, but it actually performed worse than processing the non-wait instructions of the active task and then swapping to the other task. This is a complex area and, thankfully, there are clever scheduling algorithms which try to keep everything balanced and optimal.

More importantly for us, it shows that not only do we have to deal with the complexity of our own programs, but we also have the underlying architecture to deal with. This only scratches the surface: there are caches and other implementation details that affect how well a program scales in performance when run on multiple processors in parallel. We will investigate some of these complexities in upcoming posts. 

Finally, not all tasks are as independent as these examples of moving boxes. Returning to the blue block representing payroll and the red block compressing an object: what if the object that is being compressed is the same object that the payroll task is working on? The compression could still be done, so long as the payroll task was not using that part of the object. But then we need some way to prevent them both from accessing and modifying the same part of the object at the same time. This has implications for parallel performance, because there will be times when both processes want to access the same part of the object at once. Enter the world of non-determinism, locks, mutexes, semaphores, signals, barriers and friends. We will investigate these in upcoming posts. 
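As a small taste of what is coming, here is a minimal sketch of that situation (the payroll and compression ‘work’ is made up; the point is the lock): two threads share one object, and a lock serialises access so that neither thread sees or corrupts a half-finished update.

```python
import threading

# Hypothetical shared object: the payroll task updates the salary while
# the compression task reads it to produce a packed copy.

records = {"salary": 1000, "packed": None}
lock = threading.Lock()

def payroll():
    for _ in range(1000):
        with lock:  # only one thread may touch records at a time
            records["salary"] += 1

def compress():
    for _ in range(1000):
        with lock:
            records["packed"] = str(records["salary"]).encode()

threads = [threading.Thread(target=payroll), threading.Thread(target=compress)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(records["salary"])  # 2000: no increments were lost
```

Every time one thread holds the lock, the other must wait. That waiting is exactly the kind of serialisation that erodes parallel performance when tasks share data.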

The point of this section is that in our wild world it is often very difficult to predict whether a process will benefit from spawning tasks on separate threads of execution. The algorithms might be quite different and the points of collision hard to predict. The details of the underlying architecture also have an impact on our programs. The concurrent tasks within our programs may have dependencies that need to be marshalled, and the points at which these dependencies occur might be non-deterministic.

This is why it is often something of an art form to get the balance right. When extending a program to enhance parallel processing, it is always best to first measure, then make the change, then measure again to see whether the change was beneficial. The good news is that there are a number of helpful guidelines to follow, and new developments that make this type of programming easier and easier. 

Point 2 - Deterministic side-effect free parallel programs

I will not go into a lot of detail here as it will be covered in subsequent posts, but there is a class of program that is (ideally) free from most of the headaches of the previous section: side-effect free parallel programs. These are programs written in such a manner that they automatically scale to run in parallel across the system's processors, without our having to worry about how the different threads of execution will interact. 
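A minimal sketch of the idea: map a pure function over independent inputs. Because the function has no side effects, the runtime is free to execute the calls in any order on any number of workers, and the result is always identical to a sequential map. (The sketch uses a thread pool for simplicity; for CPU-bound work in Python you would typically use a process pool to get true parallelism.)

```python
from concurrent.futures import ThreadPoolExecutor

def square(n):
    """A pure function: the result depends only on the input,
    and no shared state is read or written."""
    return n * n

data = list(range(10))

# The executor may run these calls in any order, on any workers, yet the
# result is deterministic because square is side-effect free.
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel_result = list(pool.map(square, data))

print(parallel_result)
```

This is the deterministic parallelism from the definitions table: the same inputs always produce the same outputs, regardless of how the work is scheduled.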

OK, to wrap it all up and test the concepts, you may want to play a few mind games.
