.. SPDX-License-Identifier: GPL-2.0

=============
False Sharing
=============

What is False Sharing
=====================
False sharing arises from the cache coherence mechanism that keeps a
cache line consistent across multiple CPUs' caches; an academic
definition is given in [1]_. Consider a struct with a refcount and a
string::

	struct foo {
		refcount_t refcount;
		...
		char name[16];
	} ____cacheline_internodealigned_in_smp;

Members 'refcount' (A) and 'name' (B) _share_ one cache line like below::

                +-----------+                     +-----------+
                |   CPU 0   |                     |   CPU 1   |
                +-----------+                     +-----------+
               /                                        |
              /                                         |
             V                                          V
         +----------------------+             +----------------------+
         | A      B             | Cache 0     | A       B            | Cache 1
         +----------------------+             +----------------------+
                             |                  |
  ---------------------------+------------------+-----------------------------
                             |                  |
                           +----------------------+
                           |                      |
                           +----------------------+
              Main Memory  | A       B            |
                           +----------------------+

'refcount' is modified frequently, but 'name' is set once at object
creation time and is never modified.  When many CPUs access 'foo' at
the same time, with 'refcount' being frequently bumped by only one CPU
and 'name' being read by the other CPUs, all those reading CPUs have to
reload the whole cache line over and over due to the 'sharing', even
though 'name' is never changed.

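The effect can be demonstrated with a minimal userspace sketch (not
from the kernel tree; it mirrors 'foo' with plain C types, and all
names are illustrative)::

	/*
	 * Thread 0 keeps bumping 'refcount' while thread 1 only reads
	 * 'name'.  As both members live in one cache line, each bump
	 * invalidates the reader's copy, forcing a reload.
	 */
	#include <pthread.h>
	#include <string.h>

	static struct foo {
		volatile long refcount;	/* written by thread 0 */
		char name[16];		/* only read by thread 1 */
	} f;

	static void *writer(void *arg)
	{
		for (long i = 0; i < 100000000L; i++)
			f.refcount++;	/* invalidates the line */
		return NULL;
	}

	static void *reader(void *arg)
	{
		volatile char c;

		for (long i = 0; i < 100000000L; i++)
			c = f.name[0];	/* has to reload the line */
		return NULL;
	}

	int main(void)
	{
		pthread_t t0, t1;

		strcpy(f.name, "foo");
		pthread_create(&t0, NULL, writer, NULL);
		pthread_create(&t1, NULL, reader, NULL);
		pthread_join(t0, NULL);
		pthread_join(t1, NULL);
		return 0;
	}

Padding 'name' onto its own cache line (what the kernel's
____cacheline_aligned_in_smp annotation does) typically makes the
reader run much faster.
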
There are many real-world cases of performance regressions caused by
false sharing.  One of these involves the rw_semaphore 'mmap_lock'
inside struct mm_struct, whose cache line layout change triggered a
regression that Linus analyzed in [2]_.

There are two key factors for harmful false sharing:

* A global datum accessed (shared) by many CPUs.
* Among the concurrent accesses to the data, there is at least one
  write operation: write/write or write/read cases.

The sharing can come from totally unrelated kernel components, or
from different code paths of the same kernel component.


False Sharing Pitfalls
======================
In the past, when a platform had only one or a few CPUs, hot data
members could be purposely put in the same cache line to make them
cache hot and save cache lines/TLB entries, like a lock and the data
protected by it.  But on recent large systems with hundreds of CPUs,
this may backfire when the lock is heavily contended, as the lock
owner CPU writes to the data while other CPUs busily spin on the lock.

Looking at past cases, there are several frequently occurring patterns
for false sharing:

* A lock (spinlock/mutex/semaphore) and the data protected by it are
  purposely put in one cache line.
* Global data are put together in one cache line. Some kernel
  subsystems have many global parameters of small size (4 bytes),
  which can easily be grouped together and put into one cache line.
* Data members of a big data structure randomly sit together
  without being noticed (a cache line is usually 64 bytes or more),
  like in the 'mem_cgroup' struct.

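As a hypothetical sketch of the first pattern, consider a lock placed
next to the data it protects::

	struct foo_stat {
		spinlock_t lock;	/* busily spun on by waiting CPUs */
		long count;		/* written by the lock holder, on the
					 * same cache line as 'lock', so every
					 * update disturbs the spinners */
	};
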
The following 'Possible Mitigations' section provides real-world
examples.

False sharing can easily creep in unless it is intentionally checked
for, and it is valuable to run specific tools on performance-critical
workloads to detect false sharing that affects performance and to
optimize accordingly.


How to detect and analyze False Sharing
========================================
perf record/report/stat are widely used for performance tuning.  Once
hotspots are detected, tools like 'perf-c2c' and 'pahole' can be used
to detect and pinpoint the data structures possibly suffering from
false sharing.  'addr2line' is also good at decoding the instruction
pointer when there are multiple layers of inline functions.

perf-c2c can capture the cache lines with the most false sharing hits,
the decoded functions (with file name and line number) accessing those
cache lines, and the in-line offset of the data. Simple commands are::

  $ perf c2c record -ag sleep 3
  $ perf c2c report --call-graph none -k vmlinux

Running the above while testing will-it-scale's tlb_flush1 case,
perf reports something like::

  Total records                     :    1658231
  Locked Load/Store Operations      :      89439
  Load Operations                   :     623219
  Load Local HITM                   :      92117
  Load Remote HITM                  :        139

  #----------------------------------------------------------------------
      4        0     2374        0        0        0  0xff1100088366d880
  #----------------------------------------------------------------------
    0.00%   42.29%    0.00%    0.00%    0.00%    0x8     1       1  0xffffffff81373b7b         0       231       129     5312        64  [k] __mod_lruvec_page_state    [kernel.vmlinux]  memcontrol.h:752   1
    0.00%   13.10%    0.00%    0.00%    0.00%    0x8     1       1  0xffffffff81374718         0       226        97     3551        64  [k] folio_lruvec_lock_irqsave  [kernel.vmlinux]  memcontrol.h:752   1
    0.00%   11.20%    0.00%    0.00%    0.00%    0x8     1       1  0xffffffff812c29bf         0       170       136      555        64  [k] lru_add_fn                 [kernel.vmlinux]  mm_inline.h:41     1
    0.00%    7.62%    0.00%    0.00%    0.00%    0x8     1       1  0xffffffff812c3ec5         0       175       108      632        64  [k] release_pages              [kernel.vmlinux]  mm_inline.h:41     1
    0.00%   23.29%    0.00%    0.00%    0.00%   0x10     1       1  0xffffffff81372d0a         0       234       279     1051        64  [k] __mod_memcg_lruvec_state   [kernel.vmlinux]  memcontrol.c:736   1

A nice introduction to perf-c2c is [3]_.

'pahole' decodes data structure layouts and marks cache line
boundaries.  Users can match the offset in the perf-c2c output
against pahole's decoding to locate the exact data members.  For
global data, users can search for the data address in System.map.

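With a vmlinux built with debug info, the layout of a structure seen
in perf-c2c output can be dumped with, for example::

  $ pahole -C mem_cgroup vmlinux

pahole prints each member with its offset and size and marks every
cache line boundary, so the offsets reported by perf-c2c can be
mapped to concrete members.

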
Possible Mitigations
====================
False sharing does not always need to be mitigated.  False sharing
mitigations should balance performance gains with complexity and
space consumption.  Sometimes, lower performance is OK, and it's
unnecessary to hyper-optimize every rarely used data structure or
cold data path.

Cases of false sharing hurting performance are seen more frequently as
core counts increase.  Because of these detrimental effects, many
patches have been proposed across a variety of subsystems (like
networking and memory management) and merged.  Some common mitigations
(with examples) are:

* Separate hot global data into its own dedicated cache line, even if
  it is just a 'short' type. The downside is more consumption of
  memory, cache lines and TLB entries.
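
  A minimal sketch of this approach ('hot_counter' is an illustrative
  name)::

	/* hot, frequently written; give it a cache line of its own */
	static atomic_long_t hot_counter ____cacheline_aligned_in_smp;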

  - Commit 91b6d3256356 ("net: cache align tcp_memory_allocated, tcp_sockets_allocated")

* Reorganize the data structure and separate the interfering members
  into different cache lines.  One downside is that this may introduce
  new false sharing among other members.
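
  A sketch of such a re-layout (a hypothetical struct, for
  illustration only)::

	struct bar {
		atomic_t counter;	/* hot, frequently written */
		...
		/* read-mostly members below start a new cache line,
		 * away from the frequently written 'counter' */
		unsigned long flags ____cacheline_aligned_in_smp;
	};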

  - Commit 802f1d522d5f ("mm: page_counter: re-layout structure to reduce false sharing")

* Replace a 'write' with a 'read' when possible, especially in loops.
  For some global variable, use compare(read)-then-write instead of
  an unconditional write; for example, use::

	if (!test_bit(XXX))
		set_bit(XXX);

  instead of directly "set_bit(XXX);", and similarly for atomic_t data::

	if (atomic_read(XXX) == AAA)
		atomic_set(XXX, BBB);

  - Commit 7b1002f7cfe5 ("bcache: fixup bcache_dev_sectors_dirty_add() multithreaded CPU false sharing")
  - Commit 292648ac5cf1 ("mm: gup: allow FOLL_PIN to scale in SMP")

* Turn hot global data into 'per-cpu data + global data' when possible,
  or reasonably increase the threshold for syncing per-cpu data to
  the global data, to reduce or postpone the 'write' to that global
  data.
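
  A sketch using the kernel's percpu_counter API ('nr_foo' and 'batch'
  are illustrative)::

	static struct percpu_counter nr_foo;

	/* updates accumulate in the CPU-local counter and are only
	 * folded into the shared count every 'batch' increments */
	percpu_counter_add_batch(&nr_foo, 1, batch);

	/* a fast read of the shared count; may lag the precise sum */
	s64 total = percpu_counter_read(&nr_foo);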

  - Commit 520f897a3554 ("ext4: use percpu_counters for extent_status cache hits/misses")
  - Commit 56f3547bfa4d ("mm: adjust vm_committed_as_batch according to vm overcommit policy")

Surely, all mitigations should be carefully verified to not cause side
effects.  To avoid introducing false sharing when coding, it's better
to:

* Be aware of cache line boundaries
* Group mostly read-only fields together
* Group things that are written at the same time together
* Separate frequently read and frequently written fields onto
  different cache lines

and better add a comment stating the false sharing consideration, as
in the sketch below.
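
A sketch of such a layout and comment (the struct and its members are
hypothetical)::

	struct baz {
		/* read-mostly fields, set once at init time */
		int id;
		char name[16];

		/*
		 * Hot, frequently written; placed on its own cache
		 * line to avoid false sharing with the fields above.
		 */
		atomic_t refcnt ____cacheline_aligned_in_smp;
	};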

Note that sometimes, even after severe false sharing is detected
and resolved, performance may still show no obvious improvement, as
the hotspot simply shifts to a new place.


Miscellaneous
=============
One open issue is that the kernel has an optional data structure
randomization mechanism, which also randomizes which data members
share a cache line.


.. [1] https://en.wikipedia.org/wiki/False_sharing
.. [2] https://lore.kernel.org/lkml/CAHk-=whoqV=cX5VC80mmR9rr+Z+yQ6fiQZm36Fb-izsanHg23w@mail.gmail.com/
.. [3] https://joemario.github.io/blog/2016/09/01/c2c-blog/