## CS444/544 Operating Systems II Prof. Sibin Mohan

Spring 2022 | Lec. 13: Locks 2

Adapted from content originally created by: Prof. Yeongjin Jang

### 2<sup>nd</sup> Candidate: xchg\_lock Result

- Consistent!


























### Back to xchg

- Atomic xchg instruction loads/stores data at the same time
  - There is no gap for race condition
- But it could cause cache contention!
  - Many threads update the same 'lock' variable
  - Multiple CPUs cache '**lock**' variable
  - Update to lock invalidates cache!
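A minimal sketch of such an xchg-based spinlock, assuming C11 `<stdatomic.h>` rather than hand-written assembly. The helper name `xchg` and the function bodies are illustrative assumptions, not the course's actual `lock` program; on x86, compilers typically lower `atomic_exchange` to an `xchg` instruction.

```c
#include <stdatomic.h>

/* Hypothetical helper: atomically store `value` into *lock and return the
 * old value. Load and store happen as one indivisible step. */
static inline int xchg(atomic_int *lock, int value) {
    return atomic_exchange(lock, value);
}

/* Spin until the old value was 0, meaning this thread flipped 0 -> 1. */
static void xchg_lock(atomic_int *lock) {
    while (xchg(lock, 1) != 0) {
        /* Each failed attempt still writes 1 to the shared variable,
         * which invalidates the copies cached by other CPUs. */
    }
}

/* Release the lock by swapping 0 back in. */
static void xchg_unlock(atomic_int *lock) {
    xchg(lock, 0);
}
```

Note that the loop writes to `lock` on every iteration, even while another thread holds it; that unconditional write is the source of the cache contention listed above.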

<pre>
[jangye@os2 (master) ~/test/lock-example$] ./lock xchg
Counting 10000 with 30 threads using XCHG_LOCK
Count: 300000, elapsed Time: 946.416 ms
</pre>

# The Problem with xchg

### xchg and Cache Coherence

- xchg always updates the value
  - Every xchg instruction swaps "1" into the memory location, **lock**
- The spinning while loop keeps **lock** loaded into each CPU's **cache**
- Every update causes cache invalidations for all other threads!



May 23, 2022



















# Hang on a minute...

#### What If...






#### Multiple Threads > Multiple Cache Invalidations!

- Previous example was for two threads
- In our **lock** implementation, we have **30** threads!
- Only one thread can be in the critical section
- Remaining 29 threads → causing cache invalidations!!!
- xchg implementation can be **really slow**!
- How slow?

#### Let's Measure the Cache Misses

- `perf` → Linux command-line tool for monitoring hardware events
  - e.g., cache misses

Single CPU (`taskset -c 1`): no cache coherence invalidations, 84,130 L1 cache misses

<pre>
[jangye@os2 (master) ~/test/lock-example$] taskset -c 1 ./perf-lock.sh xchg
Counting 10000 with 30 threads using XCHG_LOCK
Count: 300000, elapsed Time: 3612.080 ms
Performance counter stats for './lock xchg':
        84,130 L1-dcache-load-misses:u
        3.613420345 seconds time elapsed
        3.571214000 seconds user
        0.032928000 seconds sys
</pre>

30 CPUs: many cache coherence invalidations, 16,825,378 L1 cache misses

<pre>
[jangye@os2 (master) ~/test/lock-example$] ./perf-lock.sh xchg
Counting 10000 with 30 threads using XCHG_LOCK
Count: 300000, elapsed Time: 943.568 ms
Performance counter stats for './lock xchg':
        16,825,378 L1-dcache-load-misses:u
        0.946774344 seconds time elapsed
        23.707364000 seconds user
        0.097770000 seconds sys
</pre>

#### 200x worse!
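A quick check of that factor, using the two `L1-dcache-load-misses` counts reported by `perf` for the 30-CPU run and the single-CPU run:

$\frac{16{,}825{,}378}{84{,}130} \approx 200$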

#### Test-and-Set (xchg)

#### Pros

- Synchronizes threads well!

#### Cons

- Slow
- Lots of cache misses

#### How do we solve this? Can we solve it?

#### Solution | test and test-and-set

- Why update the lock if its value is already '1'?
- 'test and test-and-set'
  - Check the value first!



#### Test and Test-and-set in x86

- `lock cmpxchg [update-value], [memory]`
  - Compare the value in [memory] with %eax (the *test*)
  - If matched, exchange the value in [memory] with [update-value] (the *test-and-set*)
  - Otherwise, do not perform the exchange
  - Must be used with the 'lock' prefix for thread synchronization
- xchg(lock, 1)
  - Lock = 1
  - Returns old value of the lock
- cmpxchg(lock, 0, 1)
  - Arguments: lock, test value, update value
  - Returns old value of the lock


- xchg (with a memory operand) is an atomic operation in x86
- **cmpxchg** by itself is **not** an atomic operation in x86
  - Must be used with the lock prefix to guarantee atomicity
  - i.e., `lock cmpxchg`
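The two interfaces, xchg(lock, 1) and cmpxchg(lock, 0, 1), can be sketched in C as follows. This is an illustration under assumptions: it uses C11 `<stdatomic.h>` instead of inline assembly, and the wrapper names and signatures simply mirror the slide notation. On x86, compilers typically lower `atomic_compare_exchange_strong` to `lock cmpxchg`.

```c
#include <stdatomic.h>

/* Hypothetical wrapper: unconditionally store update_value, return old value. */
static inline int xchg(atomic_int *lock, int update_value) {
    return atomic_exchange(lock, update_value);
}

/* Hypothetical wrapper: store update_value only if *lock == test_value.
 * Returns the old value of *lock in both the success and the failure case. */
static inline int cmpxchg(atomic_int *lock, int test_value, int update_value) {
    int expected = test_value;
    /* On failure, the C11 call writes the current value of *lock into
     * `expected`; on success, `expected` still holds the old value. */
    atomic_compare_exchange_strong(lock, &expected, update_value);
    return expected;
}
```

With lock == 0, `cmpxchg(lock, 0, 1)` returns 0 and leaves lock == 1; calling it again returns 1 and changes nothing, whereas `xchg(lock, 1)` would write 1 again.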

### 3<sup>rd</sup> Candidate: cmpxchg\_lock

- cmpxchg\_lock
  - Use cmpxchg to set lock = 1
  - Do not update if lock == 1
  - Only write 1 to lock if lock == 0
- xchg\_unlock
  - Use xchg to set lock = 0
  - Because there is only 1 writer (the thread holding the lock), this always succeeds
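A possible shape for this candidate, written with C11 atomics as a stand-in for the `lock cmpxchg` and `xchg` instructions. The `cmpxchg` helper and both function bodies are assumptions made for illustration, not the course's actual source.

```c
#include <stdatomic.h>

/* Hypothetical helper mirroring cmpxchg(lock, test, update):
 * writes `update` only if *lock == test, and returns the old value. */
static inline int cmpxchg(atomic_int *lock, int test, int update) {
    int expected = test;
    atomic_compare_exchange_strong(lock, &expected, update);
    return expected;
}

/* Acquire: retry until the old value was 0, i.e. we changed 0 -> 1. */
static void cmpxchg_lock(atomic_int *lock) {
    while (cmpxchg(lock, 0, 1) != 0) {
        /* lock was 1: the comparison failed, so logically nothing changed */
    }
}

/* Release: only the lock holder calls this, so a plain swap to 0 suffices. */
static void xchg_unlock(atomic_int *lock) {
    atomic_exchange(lock, 0);
}
```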



#### 3<sup>rd</sup> Candidate: cmpxchg\_lock Cache Results

- Consistent!
- But still showing lots of cache misses → more than xchg! Why?

#### Intel CPU is TOO COMPLEX

This *[cmpxchg]* instruction can be used with a LOCK prefix to allow the instruction to be executed atomically. To simplify the interface to the processor's bus, the destination operand receives a write cycle without regard to the result of the comparison. The destination operand is written back if the comparison fails; otherwise, the source operand is written into the destination. (The processor never produces a locked read without also producing a locked write.)

- cmpxchg → designed to be a test and test-and-set instruction
- Intel CPU complexity → it always writes to the destination, regardless of the result of the comparison

#### Lame! 😳

#### Let's implement test and test-and-set in software instead

#### 4<sup>th</sup> Candidate: Test and Test & Set [Software?]


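- tts\_xchg\_lock
- Wait until lock becomes 0
- After lock == 0
  - xchg (lock, 1)
  - The write is attempted only after lock was seen as 0; the lock is acquired only if xchg returns 0
- Why xchg, why not **\*lock = 1** directly?
  - The while check and the write are not atomic together: another thread may take the lock in between
  - Load/Store must happen at the same time!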
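The steps above can be sketched as follows, assuming C11 atomics in place of the raw `xchg` instruction. The function bodies are an illustration of the technique, not the course's actual code.

```c
#include <stdatomic.h>

/* Test and test-and-set: spin on plain reads, write only when lock looks free. */
static void tts_xchg_lock(atomic_int *lock) {
    for (;;) {
        /* Test: read-only spin. Reads hit the local cached copy and do not
         * invalidate the copies held by other CPUs. */
        while (atomic_load(lock) == 1) {
        }
        /* Test-and-set: the lock looked free, so try to take it atomically.
         * Another thread may have won the race since the read above, so the
         * old value must be checked. */
        if (atomic_exchange(lock, 1) == 0) {
            return; /* old value was 0: we own the lock */
        }
    }
}

/* Release by swapping 0 back in. */
static void xchg_unlock(atomic_int *lock) {
    atomic_exchange(lock, 0);
}
```

Replacing the `atomic_exchange` with a plain store of 1 would let two threads that both observed 0 enter the critical section together, which is why the old value returned by the exchange is checked.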



#### 4<sup>th</sup> Candidate TTS Result

- Consistent!

```
[jangye@os2 (master) ~/test/lock-example$] ./perf-lock.sh tts
Counting 10000 with 30 threads using TTS_LOCK...
Count: 300000, elapsed Time: 498.578 ms
Performance counter stats for './lock tts':
        14,426,153 L1-dcache-load-misses:u
        0.501079419 seconds time elapsed
        14.039150000 seconds user
        0.105730000 seconds sys
```

- Fewer cache misses (by a bit): about 14.4 million vs. 16.8 million for xchg
- Faster (~500 ms vs. 900 ~ 1200 ms)


#### Still Slow and Many Cache Misses..

- Why do we still have so many misses?
- A thread **acquires** the lock [update  $0 \rightarrow 1$ ]
  - Invalidate caches in 29 other cores
- A thread **releases** the lock [update  $1 \rightarrow 0$ ]
  - Invalidate caches in 29 other cores



#### Still Slow and Many Cache Misses.

- 29 other cores are all reading the variable lock
  - Immediately after an invalidation, they load the data into their caches again
  - Then it is invalidated again by the next lock or release
  - This happens every 3~4 cycles



#### 5<sup>th</sup> Candidate: Backoff Lock

- Too much contention on reading the lock while only 1 thread runs the critical section
  - All other 29 cores are running → while (\*lock == 1);
  - This is the slowdown factor
- Idea: can we slow down that check?
  - After a CPU finds that the value of the lock is '1', make it wait for a while before checking again
  - Say, exponential backoff



#### 5<sup>th</sup> Candidate: Backoff Lock

- backoff\_cmpxchg\_lock(lock)
- Try cmpxchg
  - If it succeeds, the lock is acquired
  - If it fails, wait before retrying
    - Wait 1 cycle (pause) for 1<sup>st</sup> trial
    - Wait 2 cycles for 2<sup>nd</sup> trial
    - Wait 4 cycles for 3<sup>rd</sup> trial
    - ...
    - Wait 65536 cycles for 17<sup>th</sup> trial
    - Wait 65536 cycles for 18<sup>th</sup> trial (the wait is capped at 65536)
- https://en.wikipedia.org/wiki/Exponential_backoff
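One way to sketch this backoff scheme in C is shown below. Several pieces are assumptions: the names `MAX_BACKOFF`, `cpu_pause`, and `next_backoff` are made up for illustration, the inline assembly requires GCC or Clang, and one `pause` per loop iteration only approximates "waiting N cycles".

```c
#include <stdatomic.h>

#define MAX_BACKOFF 65536u /* cap taken from the 17th/18th trial above */

/* Hypothetical helper: one short delay step. Uses the x86 `pause` hint when
 * available, otherwise just a compiler barrier so the loop is not removed. */
static inline void cpu_pause(void) {
#if defined(__x86_64__) || defined(__i386__)
    __asm__ volatile("pause");
#else
    __asm__ volatile("" ::: "memory");
#endif
}

/* Double the wait after each failed trial, up to the cap. */
static unsigned next_backoff(unsigned wait) {
    return wait < MAX_BACKOFF ? wait * 2 : MAX_BACKOFF;
}

static void backoff_cmpxchg_lock(atomic_int *lock) {
    unsigned wait = 1; /* 1st trial waits 1 step */
    for (;;) {
        int expected = 0;
        /* Try cmpxchg: write 1 only if lock == 0. */
        if (atomic_compare_exchange_strong(lock, &expected, 1)) {
            return; /* success: lock acquired */
        }
        /* Failure: stay away from the lock variable for `wait` steps. */
        for (unsigned i = 0; i < wait; i++) {
            cpu_pause();
        }
        wait = next_backoff(wait);
    }
}
```

While a thread is backing off it does not touch `lock` at all, so the longer the lock stays busy, the less traffic the waiting cores generate on that cache line.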

### 5<sup>th</sup> Candidate: Backoff Result

- Consistent!
- Faster than pthread\_mutex()!
- Much fewer cache misses
- Faster! [less than 200 ms!]

<pre>
Counting 10000 with 30 threads using BACKOF…
Count: 300000, elapsed Time: 196.576 ms
Performance counter stats for './lock backo…
        232,980 L1-dcache-load-misses…
        0.198420582 seconds time elapsed
        4.143351000 seconds user
        0.128103000 seconds sys
</pre>

Mutex run (output truncated in the source):

<pre>
…57.688 ms
Performance counter stats for './lock mutex':
        … L1-dcache-load-misses:u
</pre>

| Lock    | Cache Misses [approx.] | Time [ms] |
|---------|------------------------|-----------|
| xchg    | 17 million             | 944       |
| cmpxchg | 19 million             | 1124      |
| tts     | 14 million             | 500       |

#### Summary


- A mutex can be implemented as a **spinlock**
  - Waits until lock == 0 with a while loop (which is why it's called a spinlock)
- Naïve code implementation never works
  - Load/Store must be atomic
- **xchg** is a "test and set" atomic instruction
  - Consistent; however, many cache misses, slow! (950 ms)
- **lock cmpxchg** is a "**test and test&set**" **atomic** instruction
  - But Intel's implementation always writes to the destination, just like xchg... **slow**! (1150 ms)
- We can implement test-and-test-and-set (tts) with while + xchg
  - Faster! (500 ms)
- We can also implement **exponential backoff** to reduce contention
  - Much faster! (200 ms)