Bark of the byte

Running with bytes and other wild creatures

Part 2: Clobbered

In part 1 we looked at sequential computing. Now we leave those calm waters where events happen one by one and in an orderly manner.

HMS Optimize

When we were last on board HMS Optimize, the cadet named Copper had a very successful trial run of the Materializer CHS (Concurrent Helper System). Today the system goes live in anger, which means that the other two Space Cadets on board will be using the system: Chef and Jade. Each cadet has specific objects that they need to materialize: Copper needs large copper bricks, Chef needs small saccharide cubes and Jade needs medium jade spheres.

Top Brass is watching. Tensions are high. For background, have a quick glance at page 2 of the Materializer CHS manual. Click ‘Join’ to hook into the live feed and watch events unfold.

 

Disaster! Hardly any objects were materialized correctly. Even when the correct object was materialized, it was often sent to the wrong cadet. People were nearly poisoned when the chef used jade and copper instead of saccharide. Rescue vessels melted on the runway when the supporting struts that were supposed to be hard tempered copper dissolved into a sugary sweet tasty treat. For a whole day, HMS Optimize was an uncontrollable mess. An emergency meeting was held and, after careful analysis of the Materializer log files, it was concluded that the main causes of the problem were:

1. The Cadets’ recipe ingredients interleaved and overwrote each other, so the factory received invalid recipes.

2. The Cadets’ requests to the factory overlapped. They requested the factory to materialize when there were partial recipes and when the factory was already materializing. Fortunately, the factory had some in-built safeguards against this. If the recipe was invalid or already in the process of materializing it just ignored the request and sent a response to the cadet that it was complete, so the cadet went on to their next recipe.

Well, Top Brass was not happy. Demotion, deportation, decommissioning and recycling were all on the cards. An emergency search party was sent out to look for the Alchemist, whose role was made redundant by the Materializer CHS and who recently went missing, leaving only a note saying 'I, the Alchemist, raze!'.

After the debriefing, the cadets quickly called the Materializer CHS Support But No Refunds line. The operator drone listened to their saga and laughed a slow mechanical laugh. ‘Well, it is clear that you didn’t read page 256 of the appendix. You should have used the Orb of Power’.

Discussion

What happened in the above saga was: -

  1. events happened in overlapping time periods AND
  2. the occurrence of the events had an overlapping effect AND
  3. there was no way to tell what the results of the events occurring at the same time would be.

Note that in our case the events of interest were the setting of the recipe ingredients and starting the factory process. Let's look at each of points 1 to 3.

First, the events happening in overlapping time periods. This is concurrency. If the events could be changed so that they did not occur in overlapping time periods, our problems would not have happened. For example, if one cadet sent all their instructions, kicked off the materializer and waited for completion before the next cadet started setting their ingredients, then the events would no longer be happening in overlapping time periods and sanity would have prevailed.

Second, the occurrence of the events had an overlapping effect. If the events did not interfere with each other then all would have been fine. For example, if there were three distinct factories with three separate recipes, one for each cadet then it would not matter when a cadet set an ingredient as it would not overwrite or interfere with the others.

Third, there was no way to tell what the results of the events occurring at the same time would be. If the instructions were always perfectly interleaved, i.e. size from cadet 1, size from cadet 2, size from cadet 3, shape from cadet 1, shape from cadet 2, etc., then we could design the system to work around those patterns. For example, the factory could predict the interleaving and order of the requests. Unfortunately, as is often the case in the real world, the events from the cadets arrived at random, and we are rarely able to impose order on the world around us, so this is often not an option. It is, however, useful to identify situations where we cannot predict what will happen (non-determinism) coupled with concurrency, so that we can control the events in overlapping time periods (1 above) and/or the effect of those events (2 above) to save the day.

If that did not fully make sense then don’t worry as it will be covered in detail in the upcoming posts.

A code example

This behaviour is commonly observed in multi-threaded programming where the threads access and modify a shared resource:

Consider the following code, where DoWork is run by several threads at the same time. global_threadCount is the number of threads and global_workerIterations is the number of iterations each thread performs. More importantly, global_totalOperations is the total number of operations performed by all threads combined, and global_results holds each thread's calculation result.

        const int global_threadCount = 2;             // number of worker threads
        const int global_workerIterations = 10000000; // iterations per thread
        int global_totalOperations = 0;               // shared counter updated by all threads
        double[] global_results = new double[global_threadCount]; // one slot per thread
        void DoWork( object parm )
        {            
            const double divideBy = global_workerIterations/10;
            double runningCalc = 0;
            for( int i = 0; i < global_workerIterations; i++ )
            {
                runningCalc = runningCalc + i / divideBy;
                global_totalOperations++; // unsynchronized read-modify-write on shared state
            }
            global_results[(int)parm] = runningCalc; // each thread writes only to its own slot
        }


Now we would expect global_results to be the same for each thread, because they each do identical calculations in isolation. Also, they index global_results separately (use different buckets in the array), so they are quite separate. What about global_totalOperations? It would be sensible to think that it should equal the number of threads times the number of iterations in each thread (global_threadCount * global_workerIterations). So, what happens when we run this test:

        [TestMethod]
        public void Conc2_multiThreadContention()
        {
            var threads = new List<Thread>();
            for( int count = 0; count < global_threadCount; count++ )
            {
                Thread thread = new Thread( DoWork );
                thread.Start(count);                
                threads.Add( thread );
            }
            for( int count = 0; count < global_threadCount; count++ )
            {
                Thread thread = threads[count];
                thread.Join();
            }
            double expectedResult = 0;
            for( int count = 0; count < global_threadCount; count++ )
            {
                if( count == 0 )
                {
                    expectedResult = global_results[count];
                }
                else
                {
                    Assert.AreEqual( expectedResult, global_results[count] );
                }
            }
            Assert.AreEqual( global_threadCount * global_workerIterations, global_totalOperations );
        }

This code creates and starts the threads, waits for them to stop, checks that they all calculated the same result and finally checks that the expected number of operations ran.

Can you guess where the test fails? Well, it fails on the final AreEqual. The test expected 20000000, but the result was 12758377. Why is this? Well, it boils down to the fact that, like the setting of the recipe on HMS Optimize, multiple steps are involved in incrementing an integer. When an integer is incremented, a few operations occur: first the value is read from memory into a processor register, second the value in the register is incremented, and finally the value is written back to memory.
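As a sketch (not actual compiler output), the single statement global_totalOperations++ effectively expands into three separate steps, and another thread can run between any of them:

```csharp
using System;

// Sketch only: what global_totalOperations++ effectively does.
// None of these three steps is atomic with respect to the others.
int globalTotalOperations = 0;

int register = globalTotalOperations; // 1. read the value from memory into a register
register = register + 1;              // 2. increment the value in the register
globalTotalOperations = register;     // 3. write the value back to memory

Console.WriteLine(globalTotalOperations); // 1
```

With two threads running those three steps at the same time, the interleaving below is possible: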

    Thread 1         Thread 2         Memory (global_totalOperations)
    Read 0                            0
                     Read 0           0
    Increment 1                       0
                     Increment 1      0
    Write 1                           1
                     Write 1          1

Note that:

  • After two increments global_totalOperations has the value 1.
  • The final write overwrites the previous value (which is also 1).
  • At no point were two operations carried out at exactly the same time. So, this problem will happen whether running on a single-core processor that is context switching between threads or on a multi-core processor running operations in parallel.

Much like the drama that occurred on HMS Optimize: -

  1. Events happened in overlapping time periods (reading, incrementing and writing) AND
  2. the occurrence of the events had an overlapping effect (both threads wrote to the same memory location) AND
  3. there was no way to tell what the results of the events occurring at the same time would be.

To expand on point 3, this is not deterministic. When I run the same program again, it is unlikely that I will get the same value for global_totalOperations. It is even possible that the test will pass if the first thread finishes processing before the second thread starts, in which case global_workerIterations needs to be increased to cause the failure (or Barriers can be used to synchronize the startup of the threads).
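As a sketch of that last suggestion (using illustrative names and counts, not the post's constants), a Barrier can hold every thread at the top of its loop until all of them are ready, making it far more likely that the increments actually overlap:

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

// Sketch: synchronize thread startup with a Barrier so the loops overlap.
// threadCount, iterations and totalOperations are illustrative, not the
// global_* constants from the test above.
int threadCount = 2;
int iterations = 1_000_000;
int totalOperations = 0;
var barrier = new Barrier(threadCount); // all participants must arrive before any proceeds

var threads = new List<Thread>();
for (int t = 0; t < threadCount; t++)
{
    var thread = new Thread(() =>
    {
        barrier.SignalAndWait(); // block here until every thread is ready to start
        for (int i = 0; i < iterations; i++)
        {
            totalOperations++;   // unsynchronized increment: updates can be lost
        }
    });
    thread.Start();
    threads.Add(thread);
}
threads.ForEach(worker => worker.Join());

// With the starts synchronized, totalOperations is very likely to come out
// below threadCount * iterations.
Console.WriteLine(totalOperations);
```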

In the next post we investigate how to cage the beast.
