OS2: Process Synchronization-Unit 1: Race Condition

A race condition or race hazard is a flaw in a system or process whereby the output of the process is unexpectedly and critically dependent on the sequence or timing of other events. The term originates with the idea of two signals racing each other to influence the output first. Race conditions arise in software when separate processes or threads of execution depend on some shared state. 2.0 Objectives At the end of this unit you should be able to: • Define Race condition • Describe some real life examples of race condition • Describe computer security in view of race condition

Race Condition

As said earlier, race condition is a flaw in a system or process whereby the output of the process is unexpectedly and critically dependent on the sequence or timing of other events. To illustrate this, let us look at this simple example:

Let us assume that two threads T1 and T2 each wants to increment the value of a global integer by one. Ideally, the following sequence of operations would take place:

1. Integer i = 0;
2. T1 reads the value of i from memory into a register: 0
3. T1 increments the value of i in the register: (register contents) + 1 = 1
4. T1 stores the value of the register in memory: 1
5. T2 reads the value of i from memory into a register: 1
6. T2 increments the value of i in the register: (register contents) + 1 = 2
7. T2 stores the value of the register in memory: 2
8. Integer i = 2

In the case shown above, the final value of i is 2, as expected. However, if the two threads run simultaneously without locking or synchronization, the outcome of the operation could be wrong. The alternative sequence of operations below demonstrates this scenario:

1. Integer i = 0;
2. T1 reads the value of i from memory into a register: 0
3. T2 reads the value of i from memory into a register: 0
4. T1 increments the value of i in the register: (register contents) + 1 = 1
5. T2 increments the value of i in the register: (register contents) + 1 = 1
6. T1 stores the value of the register in memory: 1
7. T2 stores the value of the register in memory: 1
8. Integer i = 1

The final value of i is 1 instead of the expected result of 2. This occurs because the increment operations of the second case are non-atomic. Atomic operations are those that cannot be interrupted while accessing some resource, such as a memory location. In the first case, T1 was not interrupted while accessing the variable i, so its operation was atomic.

For another example, consider the following two tasks, in pseudocode:

global integer A = 0;

// increments the value of A and print “RX”

{

A = A + 1;  print “RX”;

}

// prints out only the even numbers

// is activated every second

{

if (A is divisible by 2)

{

print A;

}

}

Output would look something like:

0

0

0

RX

RX

2

RX

RX

4

4

Now consider this chain of events, which might occur next:

1. timeout occurs, activating task Timeout
2. task Timeout evaluates A and finds it is divisible by 2, so elects to execute the “print A” next.
3. data is received on the serial port, causing an interrupt and a switch to task

2. control returns to task Timeout
3. task timeout executes print A, using the current value of A, which is 5.

Mutexes are used to address this problem in concurrent programming.

3.2

Real life examples

3.2.1 File systems

In file systems, two or more programs may “collide” in their attempts to modify or access a file, which could result in data corruption. File locking provides a commonly-used solution. A more cumbersome remedy involves reorganizing the system in such a way that one unique process (running a daemon or the like) has exclusive access to the file, and all other processes that need to access the data in that file do so only via interprocess communication with that one process (which of course requires synchronization at the process level).

A different form of race condition exists in file systems where unrelated programs may affect each other by suddenly using up available resources such as disk space (or memory, or processor cycles). Software not carefully designed to anticipate and handle this rare situation may then become quite fragile and unpredictable. Such a risk may be overlooked for a long time in a system that seems very reliable. But eventually enough data may accumulate or enough other software may be added to critically destabilize many parts of a system. Probably the best known example of this occurred with the near- loss of the Mars Rover “Spirit” not long after landing, but this is a commonly overlooked hazard in many computer systems. A solution is for software to request and reserve all the resources it will need before beginning a task; if this request fails then the task is postponed, avoiding the many points where failure could have occurred. (Alternately, each of those points can be equipped with error handling, or the success of the entire task can be verified afterwards, before continuing on.) A more common but incorrect approach is to simply verify that enough disk space (for example) is available before starting a task; this is not adequate because in complex systems the actions of other running programs can be unpredictable.

3.2.2 Networking

In networking, consider a distributed chat network like Internet relay chat (IRC), where a user acquires channel-operator privileges in any channel he starts. If two users on different servers, on different ends of the same network, try to start the same-named channel at the same time, each user’s respective server will grant channel-operator privileges to each user, since neither server will yet have received the other server’s signal that it has allocated that channel. (Note that this problem has been largely solved by various IRC server implementations.)

In this case of a race condition, the concept of the “shared resource” covers the state of the network (what channels exist, as well as what users started them and therefore have what privileges), which each server can freely change as long as it signals the other servers on the network about the changes so that they can update their conception of the state of the network. However, the latency across the network makes possible the kind of race condition described. In this case, heading off race conditions by imposing a form of control over access to the shared resource—say, appointing one server to control who holds what privileges—would mean turning the distributed network into a centralized one (at least for that one part of the network operation). Where users find such a solution unacceptable, a pragmatic solution can have the system:

1. Recognize when a race condition has occurred; and
2. Repair the ill effects.

3.2.3 Life-Critical Systems

Software flaws in life-critical systems can be disastrous. Race conditions were among the flaws in the Therac-25 radiation therapy machine, which led to the death of five patients and injuries to several more. Another example is the Energy Management System provided by GE Energy and used by Ohio-based FirstEnergy Corp. (and by many other power facilities as well). A race condition existed in the alarm subsystem; when three sagging power lines were tripped simultaneously, the condition prevented alerts from being raised to the monitoring technicians, delaying their awareness of the problem. This software flaw eventually led to the North American Blackout of 2003. (GE Energy later developed a software patch to correct the previously undiscovered error.)

3.3

Computer Security

A specific kind of race condition involves checking for a predicate (e.g. for authentication), then acting on the predicate, while the state can change between the time of check and the time of use. When this kind of bug exists in security-conscious code, a security vulnerability called a time-of-check-to-time-of-use (TOCTTOU) bug is created.

3.4

Asynchronous Finite State Machines

Even after ensuring that single bit transitions occur between states, the asynchronous machine will fail if multiple inputs change at the same time. The solution to this is to design a machine so that each state is sensitive to only one input change.

4.0

Conclusion

In this unit you have learnt about race condition, its cause, life examples and computer examples of race condition. In the next unit(s) you will be exposed to ways of preventing the occurrence of race condition especially unexpected race condition.

YOU