.. SPDX-License-Identifier: GPL-2.0

============================
Glock internal locking rules
============================

This documents the basic principles of the glock state machine
internals. Each glock (struct gfs2_glock in fs/gfs2/incore.h)
has two main (internal) locks:

 1. A spinlock (gl_lockref.lock) which protects the internal state such
    as gl_state, gl_target and the list of holders (gl_holders)
 2. A non-blocking bit lock, GLF_LOCK, which is used to prevent other
    threads from making calls to the DLM, etc. at the same time. If a
    thread takes this lock, it must then call run_queue (usually via the
    workqueue) when it releases it in order to ensure any pending tasks
    are completed.

The gl_holders list contains all the queued lock requests (not
just the holders) associated with the glock. If there are any
held locks, then they will be contiguous entries at the head
of the list. Locks are granted in strictly the order that they
are queued, except for those marked LM_FLAG_PRIORITY which are
used only during recovery, and even then only for journal locks.
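
The GLF_LOCK rule described above (take the bit lock without blocking, and
run the queue when releasing it) can be modelled with a small user-space
sketch. This is only an illustration under stated assumptions: every
``demo_*`` name and the ``DEMO_GLF_LOCK`` value are hypothetical, the model
is single-threaded, and the spinlock that protects the queued state in the
real code is omitted. It is not the GFS2 implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define DEMO_GLF_LOCK 0x1UL	/* hypothetical bit value, not the kernel's */

struct demo_glock {
	atomic_ulong flags;	/* holds the DEMO_GLF_LOCK bit */
	int pending;		/* work queued while the bit lock was held */
	int completed;		/* work that has been processed */
};

static void demo_glock_init(struct demo_glock *gl)
{
	atomic_init(&gl->flags, 0UL);
	gl->pending = 0;
	gl->completed = 0;
}

/* Try to take the bit lock. Never blocks; returns true on success. */
static bool demo_trylock(struct demo_glock *gl)
{
	unsigned long old = atomic_fetch_or(&gl->flags, DEMO_GLF_LOCK);

	return !(old & DEMO_GLF_LOCK);
}

/* Stand-in for run_queue: process whatever is still pending. */
static void demo_run_queue(struct demo_glock *gl)
{
	gl->completed += gl->pending;
	gl->pending = 0;
}

/* Release the bit lock, then run the queue so pending work is not lost. */
static void demo_unlock(struct demo_glock *gl)
{
	assert(atomic_load(&gl->flags) & DEMO_GLF_LOCK);
	atomic_fetch_and(&gl->flags, ~DEMO_GLF_LOCK);
	demo_run_queue(gl);
}
```

A second ``demo_trylock()`` call fails while the bit is set, and any work
that arrived in the meantime is picked up by ``demo_unlock()``.
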

There are three lock states that users of the glock layer can request,
namely shared (SH), deferred (DF) and exclusive (EX). Those translate
to the following DLM lock modes:

========== ======== ==================================================
Glock mode DLM mode Description
========== ======== ==================================================
UN         IV/NL    Unlocked (no DLM lock associated with glock) or NL
SH         PR       Protected read
DF         CW       Concurrent write
EX         EX       Exclusive
========== ======== ==================================================

Thus DF is basically a shared mode which is incompatible with the "normal"
shared lock mode, SH. In GFS2 the DF mode is used exclusively for direct I/O
operations. The glocks are basically a lock plus some routines which deal
with cache management.
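
The mode translation above, and the point that SH and DF are each shared
but mutually incompatible, can be written down as a short sketch. The
``demo_*`` names are hypothetical and do not appear in the GFS2 source. The
compatibility rule follows the usual DLM semantics for PR, CW and EX, and UN
is simplified to NL here.

```c
#include <assert.h>
#include <stdbool.h>

enum demo_glock_mode { DEMO_UN, DEMO_SH, DEMO_DF, DEMO_EX };

/* DLM mode name for each glock mode, following the table above. */
static const char *demo_dlm_mode(enum demo_glock_mode m)
{
	assert(m >= DEMO_UN && m <= DEMO_EX);

	switch (m) {
	case DEMO_SH:
		return "PR";
	case DEMO_DF:
		return "CW";
	case DEMO_EX:
		return "EX";
	default:
		return "NL";	/* UN: no DLM lock at all, or NL */
	}
}

/* True if two nodes may hold the same glock in modes a and b at once. */
static bool demo_modes_compatible(enum demo_glock_mode a,
				  enum demo_glock_mode b)
{
	if (a == DEMO_UN || b == DEMO_UN)
		return true;	/* an unlocked node conflicts with nothing */
	if (a == DEMO_EX || b == DEMO_EX)
		return false;	/* EX excludes every other holder */
	return a == b;		/* SH with SH, DF with DF; SH and DF conflict */
}
```

For example, ``demo_modes_compatible(DEMO_SH, DEMO_DF)`` is false, while two
DF holders or two SH holders can coexist.
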
The following rules apply for the cache:

========== ========== ============== ========== ==============
Glock mode Cache data Cache Metadata Dirty Data Dirty Metadata
========== ========== ============== ========== ==============
UN         No         No             No         No
SH         Yes        Yes            No         No
DF         No         Yes            No         No
EX         Yes        Yes            Yes        Yes
========== ========== ============== ========== ==============

These rules are implemented using the various glock operations which
are defined for each type of glock. Not all types of glocks use
all the modes. Only inode glocks use the DF mode for example.

Table of glock operations and per type constants:

============= =============================================================
Field         Purpose
============= =============================================================
go_xmote_th   Called before remote state change (e.g. to sync dirty data)
go_xmote_bh   Called after remote state change (e.g. to refill cache)
go_inval      Called if remote state change requires invalidating the cache
go_demote_ok  Returns boolean value of whether it's ok to demote a glock
              (e.g. checks timeout, and that there is no cached data)
go_lock       Called for the first local holder of a lock
go_unlock     Called on the final local unlock of a lock
go_dump       Called to print content of object for debugfs file, or on
              error to dump glock to the log.
go_type       The type of the glock, ``LM_TYPE_*``
go_callback   Called if the DLM sends a callback to drop this lock
go_flags      GLOF_ASPACE is set, if the glock has an address space
              associated with it
============= =============================================================

The minimum hold time for each lock is the time after a remote lock
grant for which we ignore remote demote requests. This is in order to
prevent a situation where locks are being bounced around the cluster
from node to node with none of the nodes making any progress. This
tends to show up most with shared mmapped files which are being written
to by multiple nodes. Delaying the demotion in response to a
remote callback gives the userspace program time to make
some progress before the pages are unmapped.

There is a plan to try and remove the go_lock and go_unlock callbacks
if possible, in order to try and speed up the fast path through the locking.
Also, eventually we hope to make the glock "EX" mode locally shared
such that any local locking will be done with the i_mutex as required
rather than via the glock.

Locking rules for glock operations:

=============    ======================    =============================
Operation        GLF_LOCK bit lock held    gl_lockref.lock spinlock held
=============    ======================    =============================
go_xmote_th      Yes                       No
go_xmote_bh      Yes                       No
go_inval         Yes                       No
go_demote_ok     Sometimes                 Yes
go_lock          Yes                       No
go_unlock        Yes                       No
go_dump          Sometimes                 Yes
go_callback      Sometimes (N/A)           Yes
=============    ======================    =============================

.. Note::

   Operations must not drop either the bit lock or the spinlock
   if it is held on entry. go_dump and go_demote_ok must never block.
   Note that go_dump will only be called if the glock's state
   indicates that it is caching up-to-date data.

Glock locking order within GFS2:

 1. i_rwsem (if required)
 2. Rename glock (for rename only)
 3. Inode glock(s)
    (Parents before children, inodes at "same level" with same parent in
    lock number order)
 4. Rgrp glock(s) (for (de)allocation operations)
 5. Transaction glock (via gfs2_trans_begin) for non-read operations
 6. i_rw_mutex (if required)
 7. Page lock (always last, very important!)

There are two glocks per inode. One deals with access to the inode
itself (locking order as above), and the other, known as the iopen
glock, is used in conjunction with the i_nlink field in the inode to
determine the lifetime of the inode in question. Locking of inodes
is on a per-inode basis. Locking of rgrps is on a per rgrp basis.
In general we prefer to lock local locks prior to cluster locks.

Glock Statistics
----------------

The stats are divided into two sets: those relating to the
super block and those relating to an individual glock. The
super block stats are done on a per cpu basis in order to
try and reduce the overhead of gathering them. They are also
further divided by glock type. All timings are in nanoseconds.

In the case of both the super block and glock statistics,
the same information is gathered in each case. The super
block timing statistics are used to provide default values for
the glock timing statistics, so that newly created glocks
should have, as far as possible, a sensible starting point.
The per-glock counters are initialised to zero when the
glock is created. The per-glock statistics are lost when
the glock is ejected from memory.

The statistics are divided into three pairs of mean and
variance, plus two counters. The mean/variance pairs are
smoothed exponential estimates and the algorithm used is
one which will be very familiar to those used to calculating
round trip times in network code. See "TCP/IP Illustrated,
Volume 1", W. Richard Stevens, sect 21.3, "Round-Trip Time Measurement",
p. 299 and onwards. Also, Volume 2, Sect. 25.10, p. 838 and onwards.
Unlike the TCP/IP Illustrated case, the mean and variance are
not scaled, but are in units of integer nanoseconds.

The three pairs of mean/variance measure the following
things:

 1. DLM lock time (non-blocking requests)
 2. DLM lock time (blocking requests)
 3. Inter-request time (again to the DLM)

A non-blocking request is one which will complete right
away, whatever the state of the DLM lock in question. That
currently means any requests when (a) the current state of
the lock is exclusive, i.e. a lock demotion (b) the requested
state is either null or unlocked (again, a demotion) or (c) the
"try lock" flag is set. A blocking request covers all the other
lock requests.

There are two counters. The first is there primarily to show
how many lock requests have been made, and thus how much data
has gone into the mean/variance calculations. The other counter
is counting queuing of holders at the top layer of the glock
code. Hopefully that number will be a lot larger than the number
of dlm lock requests issued.

So why gather these statistics? There are several reasons
we'd like to get a better idea of these timings:

1. To be able to better set the glock "min hold time"
2. To spot performance issues more easily
3. To improve the algorithm for selecting resource groups for
   allocation (to base it on lock wait time, rather than blindly
   using a "try lock")

Due to the smoothing action of the updates, a step change in
some input quantity being sampled will only fully be taken
into account after 8 samples (or 4 for the variance) and this
needs to be carefully considered when interpreting the
results.

Knowing both the time it takes a lock request to complete and
the average time between lock requests for a glock means we
can compute the total percentage of the time for which the
node is able to use a glock vs. time that the rest of the
cluster has its share. That will be very useful when setting
the lock min hold time.

Great care has been taken to ensure that we
measure exactly the quantities that we want, as accurately
as possible. There are always inaccuracies in any
measuring system, but I hope this is as accurate as we
can reasonably make it.
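
As a rough illustration of the smoothed mean/variance update described
above, the following user-space C sketch applies a gain of 1/8 to the mean
and 1/4 to the variance estimate, in line with the sample counts quoted
earlier. The names ``demo_stat`` and ``demo_update_stat`` are hypothetical
and are not taken from the GFS2 source; the in-kernel code may differ in
detail (for instance in how it rounds).

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical pair of smoothed estimates, in integer nanoseconds. */
struct demo_stat {
	int64_t mean;	/* e.g. srtt, srttb or sirt */
	int64_t var;	/* the matching variance estimate */
};

/*
 * Fold one new sample into the estimates, in the style of the TCP
 * round trip time estimator but without any scaling of the values.
 */
static void demo_update_stat(struct demo_stat *s, int64_t sample)
{
	int64_t delta;

	assert(s != NULL);

	/* Error between the new sample and the current mean. */
	delta = sample - s->mean;

	/* The mean moves one eighth of the way towards the sample. */
	s->mean += delta / 8;

	/* The variance estimate moves one quarter of the way to |delta|. */
	s->var += (llabs(delta) - s->var) / 4;
}
```

Starting from zero, a first sample of 800 ns leaves a mean of 100 ns and a
variance estimate of 200 ns, which shows why a single outlier has only a
limited effect on the reported figures.
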

Per sb stats can be found here::

    /sys/kernel/debug/gfs2/<fsname>/sbstats

Per glock stats can be found here::

    /sys/kernel/debug/gfs2/<fsname>/glstats

This assumes that debugfs is mounted on /sys/kernel/debug and
that <fsname> is replaced with the name of the gfs2 filesystem
in question.

The abbreviations used in the output are as follows:

=========  ================================================================
srtt       Smoothed round trip time for non blocking dlm requests
srttvar    Variance estimate for srtt
srttb      Smoothed round trip time for (potentially) blocking dlm requests
srttvarb   Variance estimate for srttb
sirt       Smoothed inter request time (for dlm requests)
sirtvar    Variance estimate for sirt
dlm        Number of dlm requests made (dcnt in glstats file)
queue      Number of glock requests queued (qcnt in glstats file)
=========  ================================================================

The sbstats file contains a set of these stats for each glock type (so 8 lines
for each type) and for each cpu (one column per cpu).
The glstats file contains
a set of these stats for each glock in a similar format to the glocks file, but
using the format mean/variance for each of the timing stats.

The gfs2_glock_lock_time tracepoint prints out the current values of the stats
for the glock in question, along with some additional information on each dlm
reply that is received:

======   =======================================
status   The status of the dlm request
flags    The dlm request flags
tdiff    The time taken by this specific request
======   =======================================

(remaining fields as per above list)