.. SPDX-License-Identifier: GPL-2.0

============================
Glock internal locking rules
============================

This documents the basic principles of the glock state machine
internals. Each glock (struct gfs2_glock in fs/gfs2/incore.h)
has two main (internal) locks:

 1. A spinlock (gl_lockref.lock) which protects the internal state such
    as gl_state, gl_target and the list of holders (gl_holders).
 2. A non-blocking bit lock, GLF_LOCK, which is used to prevent other
    threads from making calls to the DLM, etc. at the same time. If a
    thread takes this lock, it must then call run_queue (usually via the
    workqueue) when it releases it in order to ensure any pending tasks
    are completed.

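The GLF_LOCK rule can be pictured with the following userspace sketch.
It is an illustration only, not the GFS2 code: the ``demo_*`` names, the
bit position and the use of C11 atomics are all assumptions, and the real
run_queue is usually invoked via the workqueue rather than directly.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical bit position; the real GLF_LOCK value is not assumed here. */
#define DEMO_GLF_LOCK (1UL << 0)

struct demo_glock {
	atomic_ulong flags;        /* stands in for the glock flag word */
	unsigned int queue_runs;   /* counts calls to demo_run_queue() */
};

/* Non-blocking attempt to take the bit lock: never waits, just reports. */
static bool demo_trylock(struct demo_glock *gl)
{
	unsigned long old = atomic_fetch_or(&gl->flags, DEMO_GLF_LOCK);

	return !(old & DEMO_GLF_LOCK);
}

/* Placeholder for the work that processes any pending requests. */
static void demo_run_queue(struct demo_glock *gl)
{
	gl->queue_runs++;
}

/* Whoever took the bit lock runs the queue when releasing it. */
static void demo_unlock(struct demo_glock *gl)
{
	atomic_fetch_and(&gl->flags, ~DEMO_GLF_LOCK);
	demo_run_queue(gl);
}
```

A second demo_trylock() on a locked glock fails at once instead of
waiting, which is why the releasing thread has to run the queue: a request
that arrived in the meantime would otherwise never be looked at.
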
The gl_holders list contains all the queued lock requests (not
just the holders) associated with the glock. If there are any
held locks, then they will be contiguous entries at the head
of the list. Locks are granted strictly in the order in which they
are queued, except for those marked LM_FLAG_PRIORITY which are
used only during recovery, and even then only for journal locks.

There are three lock states that users of the glock layer can request,
namely shared (SH), deferred (DF) and exclusive (EX). These translate
to the following DLM lock modes:

==========  =============  ==================================================
Glock mode  DLM lock mode  Description
==========  =============  ==================================================
UN          IV/NL          Unlocked (no DLM lock associated with glock) or NL
SH          PR             Protected read
DF          CW             Concurrent write
EX          EX             Exclusive
==========  =============  ==================================================

Thus DF is basically a shared mode which is incompatible with the "normal"
shared lock mode, SH. In GFS2 the DF mode is used exclusively for direct I/O
operations. A glock is essentially a lock plus some routines which deal
with cache management. The following rules apply for the cache:

==========      ==========   ==============   ==========   ==============
Glock mode      Cache data   Cache Metadata   Dirty Data   Dirty Metadata
==========      ==========   ==============   ==========   ==============
    UN             No              No             No            No
    SH             Yes             Yes            No            No
    DF             No              Yes            No            No
    EX             Yes             Yes            Yes           Yes
==========      ==========   ==============   ==========   ==============

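The two tables above, together with the statement that DF and SH are
incompatible, can be written down as a small lookup table. This is only a
sketch: the ``demo_*`` names are hypothetical, and treating UN as compatible
with every mode is an assumption based on the usual behaviour of the DLM
null mode rather than something stated here.

```c
#include <stdbool.h>

/* Hypothetical names for the four glock modes in the tables above. */
enum demo_mode { DEMO_UN, DEMO_SH, DEMO_DF, DEMO_EX };

struct demo_mode_rules {
	const char *dlm_mode;   /* DLM lock mode from the first table */
	bool cache_data;        /* may cache data */
	bool cache_meta;        /* may cache metadata */
	bool dirty_data;        /* may hold dirty data */
	bool dirty_meta;        /* may hold dirty metadata */
};

static const struct demo_mode_rules demo_rules[] = {
	[DEMO_UN] = { "IV/NL", false, false, false, false },
	[DEMO_SH] = { "PR",    true,  true,  false, false },
	[DEMO_DF] = { "CW",    false, true,  false, false },
	[DEMO_EX] = { "EX",    true,  true,  true,  true  },
};

/*
 * May two nodes hold the glock in modes a and b at the same time?
 * SH and DF are each shared with themselves but not with each other,
 * and EX excludes everything except UN (assumed, see above).
 */
static bool demo_modes_compatible(enum demo_mode a, enum demo_mode b)
{
	if (a == DEMO_UN || b == DEMO_UN)
		return true;
	return a == b && a != DEMO_EX;
}
```

For instance, demo_rules[DEMO_DF].cache_data is false, which matches the
use of DF for direct I/O: the data bypasses the page cache while the
metadata may still be cached.
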
These rules are implemented using the various glock operations which
are defined for each type of glock. Not all types of glocks use
all the modes. For example, only inode glocks use the DF mode.

Table of glock operations and per type constants:

=============      =============================================================
Field              Purpose
=============      =============================================================
go_xmote_th        Called before remote state change (e.g. to sync dirty data)
go_xmote_bh        Called after remote state change (e.g. to refill cache)
go_inval           Called if remote state change requires invalidating the cache
go_demote_ok       Returns boolean value of whether it is OK to demote a glock
                   (e.g. checks timeout, and that there is no cached data)
go_lock            Called for the first local holder of a lock
go_unlock          Called on the final local unlock of a lock
go_dump            Called to print content of object for debugfs file, or on
                   error to dump glock to the log.
go_type            The type of the glock, ``LM_TYPE_*``
go_callback        Called if the DLM sends a callback to drop this lock
go_flags           GLOF_ASPACE is set, if the glock has an address space
                   associated with it
=============      =============================================================

The minimum hold time for each lock is the time after a remote lock
grant for which we ignore remote demote requests. This is in order to
prevent a situation where locks are being bounced around the cluster
from node to node with none of the nodes making any progress. This
tends to show up most with shared mmapped files which are being written
to by multiple nodes. Delaying the demotion in response to a
remote callback gives the userspace program time to make
some progress before the pages are unmapped.

There is a plan to try and remove the go_lock and go_unlock callbacks
if possible, in order to try and speed up the fast path through the locking.
Also, eventually we hope to make the glock "EX" mode locally shared
such that any local locking will be done with the i_mutex as required
rather than via the glock.

Locking rules for glock operations:

=============    ======================    =============================
Operation        GLF_LOCK bit lock held    gl_lockref.lock spinlock held
=============    ======================    =============================
go_xmote_th           Yes                       No
go_xmote_bh           Yes                       No
go_inval              Yes                       No
go_demote_ok          Sometimes                 Yes
go_lock               Yes                       No
go_unlock             Yes                       No
go_dump               Sometimes                 Yes
go_callback           Sometimes (N/A)           Yes
=============    ======================    =============================

.. Note::

   Operations must not drop either the bit lock or the spinlock
   if it is held on entry. go_dump and go_demote_ok must never block.
   Note that go_dump will only be called if the glock's state
   indicates that it is caching up-to-date data.

Glock locking order within GFS2:

 1. i_rwsem (if required)
 2. Rename glock (for rename only)
 3. Inode glock(s)
    (Parents before children, inodes at "same level" with same parent in
    lock number order)
 4. Rgrp glock(s) (for (de)allocation operations)
 5. Transaction glock (via gfs2_trans_begin) for non-read operations
 6. i_rw_mutex (if required)
 7. Page lock (always last, very important!)

There are two glocks per inode. One deals with access to the inode
itself (locking order as above), and the other, known as the iopen
glock, is used in conjunction with the i_nlink field in the inode to
determine the lifetime of the inode in question. Locking of inodes
is on a per-inode basis. Locking of rgrps is on a per-rgrp basis.
In general we prefer to lock local locks prior to cluster locks.

Glock Statistics
----------------

The stats are divided into two sets: those relating to the
super block and those relating to an individual glock. The
super block stats are done on a per cpu basis in order to
try and reduce the overhead of gathering them. They are also
further divided by glock type. All timings are in nanoseconds.

The same information is gathered for both the super block and
the glock statistics. The super
block timing statistics are used to provide default values for
the glock timing statistics, so that newly created glocks
should have, as far as possible, a sensible starting point.
The per-glock counters are initialised to zero when the
glock is created. The per-glock statistics are lost when
the glock is ejected from memory.

The statistics are divided into three pairs of mean and
variance, plus two counters. The mean/variance pairs are
smoothed exponential estimates and the algorithm used is
one which will be very familiar to anyone who has calculated
round trip times in network code. See "TCP/IP Illustrated,
Volume 1", W. Richard Stevens, sect 21.3, "Round-Trip Time Measurement",
p. 299 and onwards. Also, Volume 2, Sect. 25.10, p. 838 and onwards.
Unlike the TCP/IP Illustrated case, the mean and variance are
not scaled, but are in units of integer nanoseconds.

The three pairs of mean/variance measure the following
things:

 1. DLM lock time (non-blocking requests)
 2. DLM lock time (blocking requests)
 3. Inter-request time (again to the DLM)

A non-blocking request is one which will complete right
away, whatever the state of the DLM lock in question. That
currently means any request where (a) the current state of
the lock is exclusive, i.e. a lock demotion, (b) the requested
state is either null or unlocked (again, a demotion), or (c) the
"try lock" flag is set. A blocking request covers all the other
lock requests.

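The three conditions above amount to a simple predicate. The following is
a sketch under assumptions, not the GFS2 implementation: the ``demo_*``
names and the flag value are hypothetical, and the "null" and "unlocked"
DLM states are folded into a single unlocked state for brevity.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the glock states. */
enum demo_state {
	DEMO_ST_UNLOCKED,
	DEMO_ST_SHARED,
	DEMO_ST_DEFERRED,
	DEMO_ST_EXCLUSIVE,
};

/* Hypothetical "try lock" request flag. */
#define DEMO_FLAG_TRY 0x1u

/* Decide which pair of timing stats a DLM request should be counted in. */
static bool demo_request_is_blocking(enum demo_state cur,
				     enum demo_state req,
				     unsigned int flags)
{
	if (cur == DEMO_ST_EXCLUSIVE)   /* (a) demotion from exclusive */
		return false;
	if (req == DEMO_ST_UNLOCKED)    /* (b) demotion to null/unlocked */
		return false;
	if (flags & DEMO_FLAG_TRY)      /* (c) "try lock" request */
		return false;
	return true;                    /* everything else may block */
}
```

With this predicate, a request that moves from shared to exclusive without
the try flag counts as blocking, whereas the same request with the try flag
set does not.
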
There are two counters. The first is there primarily to show
how many lock requests have been made, and thus how much data
has gone into the mean/variance calculations. The other counter
counts the queuing of holders at the top layer of the glock
code. Hopefully that number will be a lot larger than the number
of dlm lock requests issued.

So why gather these statistics? There are several reasons
we'd like to get a better idea of these timings:

1. To be able to better set the glock "min hold time"
2. To spot performance issues more easily
3. To improve the algorithm for selecting resource groups for
   allocation (to base it on lock wait time, rather than blindly
   using a "try lock")

Due to the smoothing action of the updates, a step change in
some input quantity being sampled will only fully be taken
into account after 8 samples (or 4 for the variance) and this
needs to be carefully considered when interpreting the
results.

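The smoothing can be sketched as follows. This assumes gains of 1/8 for
the mean and 1/4 for the variance estimate, which is what the sample counts
above suggest and what Stevens describes for TCP round trip times. The
``demo_*`` names are hypothetical, the "variance" is tracked as a smoothed
mean deviation, and the kernel code may round differently from the plain
integer division used here.

```c
#include <stdint.h>

/* One mean/variance pair, in integer nanoseconds (unscaled). */
struct demo_stat {
	int64_t mean;
	int64_t var;
};

/* Fold one new sample into the smoothed estimates. */
static void demo_update_stat(struct demo_stat *s, int64_t sample)
{
	int64_t delta = sample - s->mean;
	int64_t abs_delta = delta < 0 ? -delta : delta;

	s->mean += delta / 8;                /* assumed gain of 1/8 */
	s->var += (abs_delta - s->var) / 4;  /* assumed gain of 1/4 */
}
```

Starting from zero, a first sample of 800 ns moves the mean to only 100 ns,
so a single reading says little about the new level of the input. This is
the reason a step change has to persist for several samples before the
statistics reflect it.
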
Knowing both the time it takes a lock request to complete and
the average time between lock requests for a glock means we
can compute the total percentage of the time for which the
node is able to use a glock vs. time that the rest of the
cluster has its share. That will be very useful when setting
the lock min hold time.

Great care has been taken to ensure that we
measure exactly the quantities that we want, as accurately
as possible. There are always inaccuracies in any
measuring system, but we hope this is as accurate as we
can reasonably make it.

Per sb stats can be found here::

    /sys/kernel/debug/gfs2/<fsname>/sbstats

Per glock stats can be found here::

    /sys/kernel/debug/gfs2/<fsname>/glstats

This assumes that debugfs is mounted on /sys/kernel/debug and
that <fsname> is replaced with the name of the gfs2 filesystem
in question.

The abbreviations used in the output are as follows:

=========  ================================================================
srtt       Smoothed round trip time for non blocking dlm requests
srttvar    Variance estimate for srtt
srttb      Smoothed round trip time for (potentially) blocking dlm requests
srttvarb   Variance estimate for srttb
sirt       Smoothed inter request time (for dlm requests)
sirtvar    Variance estimate for sirt
dlm        Number of dlm requests made (dcnt in glstats file)
queue      Number of glock requests queued (qcnt in glstats file)
=========  ================================================================

The sbstats file contains a set of these stats for each glock type (so 8 lines
for each type) and for each cpu (one column per cpu). The glstats file contains
a set of these stats for each glock in a similar format to the glocks file, but
using the format mean/variance for each of the timing stats.

The gfs2_glock_lock_time tracepoint prints out the current values of the stats
for the glock in question, along with some additional information on each dlm
reply that is received:

======   =======================================
status   The status of the dlm request
flags    The dlm request flags
tdiff    The time taken by this specific request
======   =======================================

(remaining fields as per above list)