.. SPDX-License-Identifier: GPL-2.0

============================
Glock internal locking rules
============================

This documents the basic principles of the glock state machine
internals. Each glock (struct gfs2_glock in fs/gfs2/incore.h)
has two main (internal) locks:

 1. A spinlock (gl_lockref.lock) which protects the internal state such
    as gl_state, gl_target and the list of holders (gl_holders)
 2. A non-blocking bit lock, GLF_LOCK, which is used to prevent other
    threads from making calls to the DLM, etc. at the same time. If a
    thread takes this lock, it must then call run_queue (usually via the
    workqueue) when it releases it in order to ensure any pending tasks
    are completed.

The gl_holders list contains all the queued lock requests (not
just the holders) associated with the glock. If there are any
held locks, then they will be contiguous entries at the head
of the list. Locks are granted in strictly the order that they
are queued.

There are three lock states that users of the glock layer can request,
namely shared (SH), deferred (DF) and exclusive (EX). Those translate
to the following DLM lock modes:

==========  =============  ==================================================
Glock mode  DLM lock mode  Meaning
==========  =============  ==================================================
UN          IV/NL          Unlocked (no DLM lock associated with glock) or NL
SH          PR             Protected read
DF          CW             Concurrent write
EX          EX             Exclusive
==========  =============  ==================================================
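
As a minimal sketch, the mapping in the table above can be written as a
lookup function. The enum and function names below are hypothetical and
chosen for illustration only; they are not the identifiers used by the
GFS2 or DLM code.

.. code-block:: c

    #include <stdbool.h>

    /* Hypothetical names for the glock modes and DLM modes in the table. */
    enum glock_mode { GL_UN, GL_SH, GL_DF, GL_EX };
    enum dlm_mode { MODE_IV, MODE_NL, MODE_PR, MODE_CW, MODE_EX };

    /*
     * Translate a glock mode into a DLM lock mode.
     * has_dlm_lock only matters for GL_UN: an unlocked glock maps to IV
     * when no DLM lock is associated with it, and to NL otherwise.
     */
    enum dlm_mode glock_to_dlm_mode(enum glock_mode mode, bool has_dlm_lock)
    {
        switch (mode) {
        case GL_SH:
            return MODE_PR;     /* protected read */
        case GL_DF:
            return MODE_CW;     /* concurrent write */
        case GL_EX:
            return MODE_EX;     /* exclusive */
        case GL_UN:
        default:
            return has_dlm_lock ? MODE_NL : MODE_IV;
        }
    }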

Thus DF is basically a shared mode which is incompatible with the "normal"
shared lock mode, SH. In GFS2 the DF mode is used exclusively for direct I/O
operations. The glocks are basically a lock plus some routines which deal
with cache management. The following rules apply for the cache:

==========      ==========   ==============   ==========   ==============
Glock mode      Cache data   Cache Metadata   Dirty Data   Dirty Metadata
==========      ==========   ==============   ==========   ==============
    UN             No              No             No            No
    SH             Yes             Yes            No            No
    DF             No              Yes            No            No
    EX             Yes             Yes            Yes           Yes
==========      ==========   ==============   ==========   ==============
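
The compatibility of the modes and the cache rules above can be modelled
with a small table and a predicate. This is only a sketch: the type,
array and function names are hypothetical, and the compatibility check
encodes just what the text states (SH and DF are both shared modes that
exclude each other, and EX excludes everything but UN).

.. code-block:: c

    #include <stdbool.h>

    enum glock_mode { GL_UN, GL_SH, GL_DF, GL_EX };

    struct cache_rules {
        bool cache_data;
        bool cache_metadata;
        bool dirty_data;
        bool dirty_metadata;
    };

    /* One row per glock mode, copied from the cache rules table. */
    const struct cache_rules glock_cache_rules[] = {
        [GL_UN] = { false, false, false, false },
        [GL_SH] = { true,  true,  false, false },
        [GL_DF] = { false, true,  false, false },
        [GL_EX] = { true,  true,  true,  true  },
    };

    /* Can two nodes hold the same glock in modes a and b at once? */
    bool glock_modes_compatible(enum glock_mode a, enum glock_mode b)
    {
        if (a == GL_UN || b == GL_UN)
            return true;    /* an unlocked glock conflicts with nothing */
        if (a == GL_EX || b == GL_EX)
            return false;   /* EX excludes every other holder */
        return a == b;      /* SH with SH, DF with DF, never SH with DF */
    }

For example, ``glock_cache_rules[GL_DF].cache_data`` is false, which
reflects that data must not be cached while direct I/O holds the glock.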

These rules are implemented using the various glock operations which
are defined for each type of glock. Not all types of glocks use
all the modes. For example, only inode glocks use the DF mode.

Table of glock operations and per type constants:

=============      =============================================================
Field              Purpose
=============      =============================================================
go_xmote_th        Called before remote state change (e.g. to sync dirty data)
go_xmote_bh        Called after remote state change (e.g. to refill cache)
go_inval           Called if remote state change requires invalidating the cache
go_demote_ok       Returns boolean value of whether it is OK to demote a glock
                   (e.g. checks timeout, and that there is no cached data)
go_lock            Called for the first local holder of a lock
go_unlock          Called on the final local unlock of a lock
go_dump            Called to print content of object for debugfs file, or on
                   error to dump glock to the log.
go_type            The type of the glock, ``LM_TYPE_*``
go_callback        Called if the DLM sends a callback to drop this lock
go_flags           GLOF_ASPACE is set, if the glock has an address space
                   associated with it
=============      =============================================================

The minimum hold time for each lock is the time after a remote lock
grant for which we ignore remote demote requests. This is in order to
prevent a situation where locks are being bounced around the cluster
from node to node with none of the nodes making any progress. This
tends to show up most with shared mmapped files which are being written
to by multiple nodes. Delaying the demotion in response to a
remote callback gives the userspace program time to make
some progress before the pages are unmapped.
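
The effect of the minimum hold time can be sketched as a delay
computation. The function and parameter names are hypothetical, times
are in arbitrary ticks, and the real code may differ in detail (for
example in how it deals with timer wrap-around).

.. code-block:: c

    #include <stdint.h>

    /*
     * Return how long to wait before acting on a remote demote request.
     * granted_at: time at which the lock was granted to this node
     * min_hold:   minimum hold time for this glock
     * now:        time at which the demote request arrived
     */
    uint64_t demote_delay(uint64_t granted_at, uint64_t min_hold, uint64_t now)
    {
        uint64_t hold_until = granted_at + min_hold;

        if (now < hold_until)
            return hold_until - now;    /* still inside the hold window */
        return 0;                       /* window expired: demote at once */
    }

With a grant at tick 100 and a minimum hold time of 10 ticks, a demote
request arriving at tick 104 would be delayed by 6 ticks, while one
arriving at tick 110 or later would not be delayed.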

There is a plan to try and remove the go_lock and go_unlock callbacks
if possible, in order to try and speed up the fast path through the locking.
Also, eventually we hope to make the glock "EX" mode locally shared
such that any local locking will be done with the i_mutex as required
rather than via the glock.

Locking rules for glock operations:

=============    ======================    =============================
Operation        GLF_LOCK bit lock held    gl_lockref.lock spinlock held
=============    ======================    =============================
go_xmote_th           Yes                       No
go_xmote_bh           Yes                       No
go_inval              Yes                       No
go_demote_ok          Sometimes                 Yes
go_lock               Yes                       No
go_unlock             Yes                       No
go_dump               Sometimes                 Yes
go_callback           Sometimes (N/A)           Yes
=============    ======================    =============================

.. Note::

   Operations must not drop either the bit lock or the spinlock
   if it is held on entry. go_dump and go_demote_ok must never block.
   Note that go_dump will only be called if the glock's state
   indicates that it is caching uptodate data.

Glock locking order within GFS2:

 1. i_rwsem (if required)
 2. Rename glock (for rename only)
 3. Inode glock(s)
    (Parents before children, inodes at "same level" with same parent in
    lock number order)
 4. Rgrp glock(s) (for (de)allocation operations)
 5. Transaction glock (via gfs2_trans_begin) for non-read operations
 6. i_rw_mutex (if required)
 7. Page lock  (always last, very important!)
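
For inodes at the same level, the rule in item 3 amounts to sorting the
glocks by lock number before acquiring them. The following is a minimal
sketch using ``qsort()``; the structure and function names are
hypothetical, and ascending order is an assumption made for the example.

.. code-block:: c

    #include <stdint.h>
    #include <stdlib.h>

    /* Hypothetical stand-in for an inode glock that is about to be taken. */
    struct inode_lock_req {
        uint64_t lock_number;
    };

    /* qsort() comparator: order requests by ascending lock number. */
    static int cmp_lock_number(const void *a, const void *b)
    {
        const struct inode_lock_req *x = a;
        const struct inode_lock_req *y = b;

        if (x->lock_number < y->lock_number)
            return -1;
        return x->lock_number > y->lock_number;
    }

    /* Put sibling inode glocks into a single, consistent acquisition order. */
    void sort_sibling_locks(struct inode_lock_req *reqs, size_t n)
    {
        qsort(reqs, n, sizeof(*reqs), cmp_lock_number);
    }

Because every node sorts the same way, two nodes that need the same pair
of sibling inodes always request them in the same order, which avoids an
ABBA deadlock between them.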

There are two glocks per inode. One deals with access to the inode
itself (locking order as above), and the other, known as the iopen
glock, is used in conjunction with the i_nlink field in the inode to
determine the lifetime of the inode in question. Locking of inodes
is on a per-inode basis. Locking of rgrps is on a per-rgrp basis.
In general we prefer to lock local locks prior to cluster locks.

Glock Statistics
----------------

The stats are divided into two sets: those relating to the
super block and those relating to an individual glock. The
super block stats are gathered on a per-CPU basis in order to
try and reduce the overhead of gathering them. They are also
further divided by glock type. All timings are in nanoseconds.

In the case of both the super block and glock statistics,
the same information is gathered in each case. The super
block timing statistics are used to provide default values for
the glock timing statistics, so that newly created glocks
should have, as far as possible, a sensible starting point.
The per-glock counters are initialised to zero when the
glock is created. The per-glock statistics are lost when
the glock is ejected from memory.

The statistics are divided into three pairs of mean and
variance, plus two counters. The mean/variance pairs are
smoothed exponential estimates and the algorithm used is
one which will be very familiar to those used to the calculation
of round trip times in network code. See "TCP/IP Illustrated,
Volume 1", W. Richard Stevens, sect 21.3, "Round-Trip Time Measurement",
p. 299 and onwards. Also, Volume 2, Sect. 25.10, p. 838 and onwards.
Unlike the TCP/IP Illustrated case, the mean and variance are
not scaled, but are in units of integer nanoseconds.
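
As a sketch of the kind of estimator described above, the following
updates one mean/variance pair from a new sample. The gains of 1/8 for
the mean and 1/4 for the variance are assumptions borrowed from the
usual TCP round trip time estimator, where the "variance" is really a
smoothed mean deviation. The structure and function names are
hypothetical.

.. code-block:: c

    #include <stdint.h>

    struct smoothed_stat {
        int64_t mean;       /* smoothed mean, in nanoseconds */
        int64_t variance;   /* smoothed deviation estimate, in nanoseconds */
    };

    /* Fold one new sample (in nanoseconds) into the running estimates. */
    void smoothed_stat_update(struct smoothed_stat *s, int64_t sample)
    {
        int64_t delta = sample - s->mean;
        int64_t abs_delta = delta < 0 ? -delta : delta;

        /* mean += (sample - mean) / 8 */
        s->mean += delta / 8;
        /* variance += (|sample - old mean| - variance) / 4 */
        s->variance += (abs_delta - s->variance) / 4;
    }

Starting from a mean and variance of zero, a single sample of 800 ns
moves the mean to 100 ns and the variance estimate to 200 ns, so one
outlier has only a limited effect on the estimates.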

The three pairs of mean/variance measure the following
things:

 1. DLM lock time (non-blocking requests)
 2. DLM lock time (blocking requests)
 3. Inter-request time (again to the DLM)

A non-blocking request is one which will complete right
away, whatever the state of the DLM lock in question. That
currently means any request where (a) the current state of
the lock is exclusive, i.e. a lock demotion, (b) the requested
state is either null or unlocked (again, a demotion), or (c) the
"try lock" flag is set. A blocking request covers all the other
lock requests.
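
The three conditions can be expressed as a single predicate. This is a
sketch only: the enum, the function name and the boolean parameter are
hypothetical, and the null and unlocked states are both represented by
``GL_UN`` here.

.. code-block:: c

    #include <stdbool.h>

    enum glock_mode { GL_UN, GL_SH, GL_DF, GL_EX };

    /*
     * Classify a DLM request as non-blocking using rules (a) to (c).
     * cur:      current state of the lock
     * req:      requested state
     * try_lock: true if the "try lock" flag is set on the request
     */
    bool request_is_nonblocking(enum glock_mode cur, enum glock_mode req,
                                bool try_lock)
    {
        if (cur == GL_EX)
            return true;    /* (a) any change from EX is a demotion */
        if (req == GL_UN)
            return true;    /* (b) request for the null/unlocked state */
        return try_lock;    /* (c) try locks do not wait */
    }

Any request for which this returns false counts as blocking, for example
a plain request to go from UN to EX.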

There are two counters. The first is there primarily to show
how many lock requests have been made, and thus how much data
has gone into the mean/variance calculations. The other counter
counts the queuing of holders at the top layer of the glock
code. Hopefully that number will be a lot larger than the number
of DLM lock requests issued.

So why gather these statistics? There are several reasons
we'd like to get a better idea of these timings:

1. To be able to better set the glock "min hold time"
2. To spot performance issues more easily
3. To improve the algorithm for selecting resource groups for
   allocation (to base it on lock wait time, rather than blindly
   using a "try lock")

Due to the smoothing action of the updates, a step change in
some input quantity being sampled will only fully be taken
into account after 8 samples (or 4 for the variance) and this
needs to be carefully considered when interpreting the
results.

Knowing both the time it takes a lock request to complete and
the average time between lock requests for a glock means we
can compute the total percentage of the time for which the
node is able to use a glock vs. time that the rest of the
cluster has its share. That will be very useful when setting
the lock min hold time.

Great care has been taken to ensure that we
measure exactly the quantities that we want, as accurately
as possible. There are always inaccuracies in any
measuring system, but I hope this is as accurate as we
can reasonably make it.

Per sb stats can be found here::

    /sys/kernel/debug/gfs2/<fsname>/sbstats

Per glock stats can be found here::

    /sys/kernel/debug/gfs2/<fsname>/glstats

This assumes that debugfs is mounted on /sys/kernel/debug and
that <fsname> is replaced with the name of the gfs2 filesystem
in question.

The abbreviations used in the output are as follows:

=========  ================================================================
srtt       Smoothed round trip time for non blocking dlm requests
srttvar    Variance estimate for srtt
srttb      Smoothed round trip time for (potentially) blocking dlm requests
srttvarb   Variance estimate for srttb
sirt       Smoothed inter request time (for dlm requests)
sirtvar    Variance estimate for sirt
dlm        Number of dlm requests made (dcnt in glstats file)
queue      Number of glock requests queued (qcnt in glstats file)
=========  ================================================================

The sbstats file contains a set of these stats for each glock type (so 8 lines
for each type) and for each cpu (one column per cpu). The glstats file contains
a set of these stats for each glock in a similar format to the glocks file, but
using the format mean/variance for each of the timing stats.
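
A tool that reads the glstats file needs to split each timing field into
its mean and variance. The following is a minimal sketch for a single
``name:mean/variance`` token. That exact token layout, and the field
name used in the note below, are assumptions made for illustration and
should be checked against the output of the running kernel.

.. code-block:: c

    #include <stdio.h>

    struct timing_stat {
        char name[32];
        unsigned long long mean;        /* nanoseconds */
        unsigned long long variance;    /* nanoseconds */
    };

    /* Parse one "name:mean/variance" token. Returns 0 on success, -1 on error. */
    int parse_timing_stat(const char *token, struct timing_stat *out)
    {
        /* %31[^:] reads the name up to (not including) the colon. */
        int n = sscanf(token, "%31[^:]:%llu/%llu",
                       out->name, &out->mean, &out->variance);

        return n == 3 ? 0 : -1;
    }

Given a made-up token such as ``rtt:1200/300``, the function would fill
in a mean of 1200 and a variance of 300. Plain counters, which have no
``/variance`` part, are rejected and need separate handling.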

The gfs2_glock_lock_time tracepoint prints out the current values of the stats
for the glock in question, along with some additional information on each dlm
reply that is received:

======   =======================================
status   The status of the dlm request
flags    The dlm request flags
tdiff    The time taken by this specific request
======   =======================================

(remaining fields as per above list)