.. SPDX-License-Identifier: GPL-2.0

=============
False Sharing
=============

What is False Sharing
=====================
False sharing is related to the cache mechanism that maintains the
coherence of a cache line stored in multiple CPUs' caches; the
academic definition for it is in [1]_. Consider a struct with a
refcount and a string::

  struct foo {
          refcount_t refcount;
          ...
          char name[16];
  } ____cacheline_internodealigned_in_smp;

Members 'refcount' (A) and 'name' (B) _share_ one cache line like
below::

              +-----------+                     +-----------+
              |   CPU 0   |                     |   CPU 1   |
              +-----------+                     +-----------+
                    |                                 |
                    |                                 |
                    V                                 V
        +----------------------+          +----------------------+
        | A      B             | Cache 0  | A      B             | Cache 1
        +----------------------+          +----------------------+
                    |                                 |
  ------------------+---------------------------------+-----------------
                    |                                 |
                +----------------------+
                |                      |
                +----------------------+
   Main Memory  | A      B             |
                +----------------------+

'refcount' is modified frequently, but 'name' is set once at object
creation time and is never modified.
When many CPUs access
'foo' at the same time, with 'refcount' being frequently bumped by
only one CPU and 'name' being read by the other CPUs, all those
reading CPUs have to reload the whole cache line over and over due to
the 'sharing', even though 'name' is never changed.

There are many real-world cases of performance regressions caused by
false sharing. One of these is the rw_semaphore 'mmap_lock' inside
struct mm_struct, whose cache line layout change triggered a
regression that Linus analyzed in [2]_.

There are two key factors for harmful false sharing:

* A global datum accessed (shared) by many CPUs
* In the concurrent accesses to the data, there is at least one write
  operation: write/write or write/read cases.

The sharing could be from totally unrelated kernel components, or
different code paths of the same kernel component.


False Sharing Pitfalls
======================
In the past, when a platform had only one or a few CPUs, hot data
members could be purposely put in the same cache line to make them
cache hot and save cacheline/TLB, like a lock and the data protected
by it. But on recent large systems with hundreds of CPUs, this may
not work when the lock is heavily contended, as the lock owner CPU
could write to the data while other CPUs are busy spinning on the
lock.

Looking at past cases, there are several frequently occurring patterns
of false sharing:

* a lock (spinlock/mutex/semaphore) and the data protected by it are
  purposely put in one cache line.
* global data being put together in one cache line. Some kernel
  subsystems have many global parameters of small size (4 bytes),
  which can easily be grouped together and put into one cache line.
* data members of a big data structure randomly sitting together
  without being noticed (a cache line is usually 64 bytes or more),
  like in the 'mem_cgroup' struct.

The following 'mitigation' section provides real-world examples.

False sharing could easily happen unless it is intentionally checked
for, and it is valuable to run specific tools on performance-critical
workloads to detect cases where false sharing affects performance,
and optimize accordingly.


How to detect and analyze False Sharing
========================================
perf record/report/stat are widely used for performance tuning, and
once hotspots are detected, tools like 'perf-c2c' and 'pahole' can
be further used to detect and pinpoint the possible false sharing
data structures. 'addr2line' is also good at decoding the instruction
pointer when there are multiple layers of inline functions.

perf-c2c can capture the cache lines with the most false sharing hits,
the decoded functions (with file and line number) accessing those
cache lines, and the in-line offset of the data. Simple commands
are::

  $ perf c2c record -ag sleep 3
  $ perf c2c report --call-graph none -k vmlinux

When running the above during a test of will-it-scale's tlb_flush1
case, perf reports something like::

  Total records                     :    1658231
  Locked Load/Store Operations      :      89439
  Load Operations                   :     623219
  Load Local HITM                   :      92117
  Load Remote HITM                  :        139

  #----------------------------------------------------------------------
      4        0     2374        0       0        0  0xff1100088366d880
  #----------------------------------------------------------------------
    0.00%   42.29%   0.00%   0.00%   0.00%    0x8     1       1  0xffffffff81373b7b     0   231   129   5312   64  [k] __mod_lruvec_page_state    [kernel.vmlinux]  memcontrol.h:752   1
    0.00%   13.10%   0.00%   0.00%   0.00%    0x8     1       1  0xffffffff81374718     0   226    97   3551   64  [k] folio_lruvec_lock_irqsave  [kernel.vmlinux]  memcontrol.h:752   1
    0.00%   11.20%   0.00%   0.00%   0.00%    0x8     1       1  0xffffffff812c29bf     0   170   136    555   64  [k] lru_add_fn                 [kernel.vmlinux]  mm_inline.h:41     1
    0.00%    7.62%   0.00%   0.00%   0.00%    0x8     1       1  0xffffffff812c3ec5     0   175   108    632   64  [k] release_pages              [kernel.vmlinux]  mm_inline.h:41     1
    0.00%   23.29%   0.00%   0.00%   0.00%   0x10     1       1  0xffffffff81372d0a     0   234   279   1051   64  [k] __mod_memcg_lruvec_state   [kernel.vmlinux]  memcontrol.c:736   1

A nice introduction for perf-c2c is
[3]_.

'pahole' decodes data structure layouts delimited at cache line
granularity. Users can match the offset in perf-c2c output with
pahole's decoding to locate the exact data members. For global
data, users can search for the data address in System.map.


Possible Mitigations
====================
False sharing does not always need to be mitigated. False sharing
mitigations should balance performance gains with complexity and
space consumption. Sometimes, lower performance is OK, and it's
unnecessary to hyper-optimize every rarely used data structure or
a cold data path.

False sharing hurting performance is seen more frequently as core
counts increase. Because of these detrimental effects, many patches
have been proposed across a variety of subsystems (like networking
and memory management) and merged. Some common mitigations (with
examples) are:

* Separate hot global data into its own dedicated cache line, even if
  it is just a 'short' type. The downside is more consumption of
  memory, cache lines and TLB entries.

  - Commit 91b6d3256356 ("net: cache align tcp_memory_allocated, tcp_sockets_allocated")

* Reorganize the data structure and separate the interfering members
  into different cache lines. One downside is that it may introduce
  new false sharing of other members.

  - Commit 802f1d522d5f ("mm: page_counter: re-layout structure to reduce false sharing")

* Replace a 'write' with a 'read' when possible, especially in loops.
  For some global variables, use compare(read)-then-write instead of
  an unconditional write. For example, use::

      if (!test_bit(XXX))
              set_bit(XXX);

  instead of directly calling "set_bit(XXX);", and similarly for
  atomic_t data::

      if (atomic_read(XXX) == AAA)
              atomic_set(XXX, BBB);

  - Commit 7b1002f7cfe5 ("bcache: fixup bcache_dev_sectors_dirty_add() multithreaded CPU false sharing")
  - Commit 292648ac5cf1 ("mm: gup: allow FOLL_PIN to scale in SMP")

* Turn hot global data into 'per-cpu data + global data' when
  possible, or reasonably increase the threshold for syncing per-cpu
  data to global data, to reduce or postpone the 'write' to that
  global data.

  - Commit 520f897a3554 ("ext4: use percpu_counters for extent_status cache hits/misses")
  - Commit 56f3547bfa4d ("mm: adjust vm_committed_as_batch according to vm overcommit policy")

Surely, all mitigations should be carefully verified to not cause
side effects.
To avoid introducing false sharing when coding, it's better to:

* Be aware of cache line boundaries
* Group mostly read-only fields together
* Group things that are written at the same time together
* Separate frequently read and frequently written fields onto
  different cache lines

and it is better to add a comment stating the false sharing
consideration.

One note is that sometimes, even after severe false sharing is
detected and solved, performance may still show no obvious
improvement, as the hotspot shifts to a new place.


Miscellaneous
=============
One open issue is that the kernel has an optional data structure
randomization mechanism, which also randomizes the cache line
sharing situation of data members.


.. [1] https://en.wikipedia.org/wiki/False_sharing
.. [2] https://lore.kernel.org/lkml/CAHk-=whoqV=cX5VC80mmR9rr+Z+yQ6fiQZm36Fb-izsanHg23w@mail.gmail.com/
.. [3] https://joemario.github.io/blog/2016/09/01/c2c-blog/