.. SPDX-License-Identifier: GPL-2.0

=============
False Sharing
=============

What is False Sharing
=====================
False sharing is related to the cache mechanism that maintains the
coherence of a cache line stored in multiple CPUs' caches; the
academic definition for it is in [1]_. Consider a struct with a
refcount and a string::

  struct foo {
          refcount_t refcount;
          ...
          char name[16];
  } ____cacheline_internodealigned_in_smp;

Members 'refcount' (A) and 'name' (B) _share_ one cache line like
below::

              +-----------+                     +-----------+
              |   CPU 0   |                     |   CPU 1   |
              +-----------+                     +-----------+
                    |                                 |
                    |                                 |
                    V                                 V
        +----------------------+          +----------------------+
        | A      B             | Cache 0  | A      B             | Cache 1
        +----------------------+          +----------------------+
                    |                                 |
  ------------------+---------------------------------+-----------------
                    |                                 |
                +----------------------+
                |                      |
                +----------------------+
   Main Memory  | A      B             |
                +----------------------+

'refcount' is modified frequently, but 'name' is set once at object
creation time and is never modified.
When many CPUs access
'foo' at the same time, with 'refcount' being frequently bumped by
only one CPU and 'name' being read by the other CPUs, all those
reading CPUs have to reload the whole cache line over and over due to
the 'sharing', even though 'name' is never changed.

There are many real-world cases of performance regressions caused by
false sharing. One of these is the rw_semaphore 'mmap_lock' inside
struct mm_struct, whose cache line layout change triggered a
regression that Linus analyzed in [2]_.

There are two key factors for harmful false sharing:

* A global datum accessed (shared) by many CPUs
* In the concurrent accesses to the data, there is at least one write
  operation: write/write or write/read cases.

The sharing could be from totally unrelated kernel components, or
different code paths of the same kernel component.


False Sharing Pitfalls
======================
In the past, when a platform had only one or a few CPUs, hot data
members could be purposely put in the same cache line to make them
cache hot and save cacheline/TLB, like a lock and the data protected
by it. But on recent large systems with hundreds of CPUs, this may
not work when the lock is heavily contended, as the lock owner CPU
could write to the data while other CPUs are busy spinning on the
lock.

Looking at past cases, there are several frequently occurring patterns
of false sharing:

* a lock (spinlock/mutex/semaphore) and the data protected by it are
  purposely put in one cache line.
* global data being put together in one cache line. Some kernel
  subsystems have many global parameters of small size (4 bytes),
  which can easily be grouped together and put into one cache line.
* data members of a big data structure randomly sitting together
  without being noticed (a cache line is usually 64 bytes or more),
  like in the 'mem_cgroup' struct.

The following 'mitigation' section provides real-world examples.

False sharing could easily happen unless it is intentionally checked
for, and it is valuable to run specific tools on performance-critical
workloads to detect cases where false sharing affects performance,
and optimize accordingly.


How to detect and analyze False Sharing
========================================
perf record/report/stat are widely used for performance tuning, and
once hotspots are detected, tools like 'perf-c2c' and 'pahole' can
be further used to detect and pinpoint the possible false sharing
data structures. 'addr2line' is also good at decoding the instruction
pointer when there are multiple layers of inline functions.

perf-c2c can capture the cache lines with the most false sharing hits,
the decoded functions (with file and line number) accessing those
cache lines, and the in-line offset of the data. Simple commands
are::

  $ perf c2c record -ag sleep 3
  $ perf c2c report --call-graph none -k vmlinux

When running the above during a test of will-it-scale's tlb_flush1
case, perf reports something like::

  Total records                     :    1658231
  Locked Load/Store Operations      :      89439
  Load Operations                   :     623219
  Load Local HITM                   :      92117
  Load Remote HITM                  :        139

  #----------------------------------------------------------------------
      4        0     2374        0       0        0  0xff1100088366d880
  #----------------------------------------------------------------------
    0.00%   42.29%   0.00%   0.00%   0.00%    0x8     1       1  0xffffffff81373b7b     0   231   129   5312   64  [k] __mod_lruvec_page_state    [kernel.vmlinux]  memcontrol.h:752   1
    0.00%   13.10%   0.00%   0.00%   0.00%    0x8     1       1  0xffffffff81374718     0   226    97   3551   64  [k] folio_lruvec_lock_irqsave  [kernel.vmlinux]  memcontrol.h:752   1
    0.00%   11.20%   0.00%   0.00%   0.00%    0x8     1       1  0xffffffff812c29bf     0   170   136    555   64  [k] lru_add_fn                 [kernel.vmlinux]  mm_inline.h:41     1
    0.00%    7.62%   0.00%   0.00%   0.00%    0x8     1       1  0xffffffff812c3ec5     0   175   108    632   64  [k] release_pages              [kernel.vmlinux]  mm_inline.h:41     1
    0.00%   23.29%   0.00%   0.00%   0.00%   0x10     1       1  0xffffffff81372d0a     0   234   279   1051   64  [k] __mod_memcg_lruvec_state   [kernel.vmlinux]  memcontrol.c:736   1

A nice introduction for perf-c2c is
[3]_.

'pahole' decodes data structure layouts delimited at cache line
granularity. Users can match the offset in perf-c2c output with
pahole's decoding to locate the exact data members. For global
data, users can search for the data address in System.map.


Possible Mitigations
====================
False sharing does not always need to be mitigated. False sharing
mitigations should balance performance gains with complexity and
space consumption. Sometimes, lower performance is OK, and it's
unnecessary to hyper-optimize every rarely used data structure or
a cold data path.

False sharing hurting performance is seen more frequently as core
counts increase. Because of these detrimental effects, many patches
have been proposed across a variety of subsystems (like networking
and memory management) and merged. Some common mitigations (with
examples) are:

* Separate hot global data into its own dedicated cache line, even if
  it is just a 'short' type. The downside is more consumption of
  memory, cache lines and TLB entries.

  - Commit 91b6d3256356 ("net: cache align tcp_memory_allocated, tcp_sockets_allocated")

* Reorganize the data structure and separate the interfering members
  into different cache lines. One downside is that it may introduce
  new false sharing of other members.

  - Commit 802f1d522d5f ("mm: page_counter: re-layout structure to reduce false sharing")

* Replace a 'write' with a 'read' when possible, especially in loops.
  For some global variables, use compare(read)-then-write instead of
  an unconditional write. For example, use::

      if (!test_bit(XXX))
              set_bit(XXX);

  instead of directly calling "set_bit(XXX);", and similarly for
  atomic_t data::

      if (atomic_read(XXX) == AAA)
              atomic_set(XXX, BBB);

  - Commit 7b1002f7cfe5 ("bcache: fixup bcache_dev_sectors_dirty_add() multithreaded CPU false sharing")
  - Commit 292648ac5cf1 ("mm: gup: allow FOLL_PIN to scale in SMP")

* Turn hot global data into 'per-cpu data + global data' when
  possible, or reasonably increase the threshold for syncing per-cpu
  data to global data, to reduce or postpone the 'write' to that
  global data.

  - Commit 520f897a3554 ("ext4: use percpu_counters for extent_status cache hits/misses")
  - Commit 56f3547bfa4d ("mm: adjust vm_committed_as_batch according to vm overcommit policy")

Surely, all mitigations should be carefully verified to not cause
side effects.
To avoid introducing false sharing when coding, it's better to:

* Be aware of cache line boundaries
* Group mostly read-only fields together
* Group things that are written at the same time together
* Separate frequently read and frequently written fields onto
  different cache lines

and it is better to add a comment stating the false sharing
consideration.

One note is that sometimes, even after severe false sharing is
detected and solved, performance may still show no obvious
improvement, as the hotspot shifts to a new place.


Miscellaneous
=============
One open issue is that the kernel has an optional data structure
randomization mechanism, which also randomizes the cache line
sharing situation of data members.


.. [1] https://en.wikipedia.org/wiki/False_sharing
.. [2] https://lore.kernel.org/lkml/CAHk-=whoqV=cX5VC80mmR9rr+Z+yQ6fiQZm36Fb-izsanHg23w@mail.gmail.com/
.. [3] https://joemario.github.io/blog/2016/09/01/c2c-blog/