In computer science, false sharing is a performance-degrading usage pattern that can arise in systems with distributed, coherent caches at the size of the smallest resource block managed by the caching mechanism. When a system participant attempts to periodically access data that is not being altered by another party, but that data shares a cache block with data that is being altered, the caching protocol may force the first participant to reload the whole cache block despite a lack of logical necessity.[1] The caching system is unaware of activity within this block and forces the first participant to bear the caching system overhead required by true shared access of a resource.

Multiprocessor CPU caches

By far the most common usage of this term is in modern multiprocessor CPU caches, where memory is cached in lines of some small power of two word size (e.g., 64 aligned, contiguous bytes). If two processors operate on independent data in the same memory address region storable in a single line, the cache coherency mechanisms in the system may force the whole line across the bus or interconnect with every data write, forcing memory stalls in addition to wasting system bandwidth. In some cases, the elimination of false sharing can result in order-of-magnitude performance improvements.[2] False sharing is an inherent artifact of automatically synchronized cache protocols and can also exist in environments such as distributed file systems or databases, but current prevalence is limited to RAM caches.

Example

#include <iostream>
#include <thread>
#include <new>
#include <atomic>
#include <chrono>
#include <latch>
#include <vector>

using namespace std;
using namespace chrono;

#if defined(__cpp_lib_hardware_interference_size)
// default cacheline size from runtime
constexpr size_t CL_SIZE = hardware_constructive_interference_size;
#else
// most common cacheline size otherwise
constexpr size_t CL_SIZE = 64;
#endif

int main()
{
    vector<jthread> threads;
    int hc = jthread::hardware_concurrency();
    hc = hc <= CL_SIZE ? hc : CL_SIZE;
    for( int nThreads = 1; nThreads <= hc; ++nThreads )
    {
        // synchronize beginning of threads coarse on kernel level
        latch coarseSync( nThreads );
        // fine synch via atomic in userspace
        atomic_uint fineSync( nThreads );
        // as much chars as would fit into a cacheline
        struct alignas(CL_SIZE) { char shareds[CL_SIZE]; } cacheLine;
        // sum of all threads execution times
        atomic_int64_t nsSum( 0 );
        for( int t = 0; t != nThreads; ++t )
            threads.emplace_back(
                [&]( char volatile &c )
                {
                    coarseSync.arrive_and_wait(); // synch beginning of thread execution on kernel-level
                    if( fineSync.fetch_sub( 1, memory_order::relaxed ) != 1 ) // fine-synch on user-level
                        while( fineSync.load( memory_order::relaxed ) );
                    auto start = high_resolution_clock::now();
                    for( size_t r = 10'000'000; r--; )
                        c = c + 1;
                    nsSum += duration_cast<nanoseconds>( high_resolution_clock::now() - start ).count();
                }, ref( cacheLine.shareds[t] ) );
        threads.resize( 0 ); // join all threads
        cout << nThreads << ": " << (int)(nsSum / (1.0e7 * nThreads) + 0.5) << endl;
    }
}

This code shows the effect of false sharing. It creates an increasing number of threads from one thread to the number of physical threads in the system. Each thread sequentially increments one byte of a cache line, which as a whole is shared among all threads. The higher the level of contention between threads, the longer each increment takes. This are the results on a Zen4 system with 16 cores and 32 threads:

1: 1
2: 4
3: 6
4: 9
5: 11
6: 13
7: 15
8: 17
9: 16
10: 18
11: 21
12: 25
13: 29
14: 35
15: 39
16: 41
17: 43
18: 44
19: 48
20: 49
21: 51
22: 53
23: 58
24: 61
25: 68
26: 75
27: 79
28: 82
29: 85
30: 88
31: 91
32: 94

As you can see, on the system in question it can take up to a 100 nanoseconds to complete an increment operation on the shared cache line, which corresponds to approx. 420 clock cycles on this CPU.

Mitigation

There are ways of mitigating the effects of false sharing. For instance, false sharing in CPU caches can be prevented by reordering variables or adding padding (unused bytes) between variables. However, some of these program changes may increase the size of the objects, leading to higher memory use.[2] Compile-time data transformations can also mitigate false-sharing.[3] However, some of these transformations may not always be allowed. For instance, the C++ programming language standard draft of C++23 mandates that data members must be laid out so that later members have higher addresses.[4]

There are tools for detecting false sharing.[5][6] There are also systems that both detect and repair false sharing in executing programs. However, these systems incur some execution overhead.[7][8]

References

  1. ^ Patterson, David (2012). Computer organization and design: the hardware/software interface. Waltham, MA: Morgan Kaufmann. p. 537. ISBN 978-0-12-374750-1. OCLC 746618653.
  2. ^ a b Bolosky, William J.; Scott, Michael L. (1993-09-22). "False sharing and its effect on shared memory performance". Sedms'93: USENIX Systems on USENIX Experiences with Distributed and Multiprocessor Systems. 4. Retrieved 11 July 2021.
  3. ^ Jeremiassen, Tor E.; Eggers, Susan J. (1995). "Reducing false sharing on shared memory multiprocessors through compile time data transformations". ACM SIGPLAN Notices. Association for Computing Machinery (ACM). 30 (8): 179–188. doi:10.1145/209937.209955. ISSN 0362-1340.
  4. ^ "Working Draft, Standard for Programming Language C++ [class]". eel.is. Retrieved 2021-07-11.
  5. ^ "perf-c2c(1)". Linux manual page. 2016-09-01. Retrieved 2021-08-08.
  6. ^ Chabbi, Milind; Wen, Shasha; Liu, Xu (2018-02-10). "Featherlight on-the-fly false-sharing detection". Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. New York, NY, USA: ACM. pp. 152–167. doi:10.1145/3178487.3178499. ISBN 9781450349826.
  7. ^ Nanavati, Mihir; Spear, Mark; Taylor, Nathan; Rajagopalan, Shriram; Meyer, Dutch T.; Aiello, William; Warfield, Andrew (2013). "Whose cache line is it anyway?". Proceedings of the 8th ACM European Conference on Computer Systems. New York, New York, USA: ACM Press. pp. 141–154. doi:10.1145/2465351.2465366. ISBN 9781450319942.
  8. ^ Liu, Tongping; Berger, Emery D. (2011-10-18). "SHERIFF: precise detection and automatic mitigation of false sharing". ACM SIGPLAN Notices. Association for Computing Machinery (ACM). 46 (10): 3–18. doi:10.1145/2076021.2048070. ISSN 0362-1340.