.. SPDX-License-Identifier: GPL-2.0

============================
Glock internal locking rules
============================

This documents the basic principles of the glock state machine
internals. Each glock (struct gfs2_glock in fs/gfs2/incore.h)
has two main (internal) locks:

 1. A spinlock (gl_lockref.lock) which protects the internal state such
    as gl_state, gl_target and the list of holders (gl_holders)
 2. A non-blocking bit lock, GLF_LOCK, which is used to prevent other
    threads from making calls to the DLM, etc. at the same time. If a
    thread takes this lock, it must then call run_queue (usually via the
    workqueue) when it releases it in order to ensure any pending tasks
    are completed.

The gl_holders list contains all the queued lock requests (not
just the holders) associated with the glock. If there are any
held locks, then they will be contiguous entries at the head
of the list. Locks are granted in strictly the order that they
are queued.

There are three lock states that users of the glock layer can request,
namely shared (SH), deferred (DF) and exclusive (EX).
Those translate to the following DLM lock modes:

==========  ========  ==================================================
Glock mode  DLM mode  Description
==========  ========  ==================================================
UN          IV/NL     Unlocked (no DLM lock associated with glock) or NL
SH          PR        Protected read
DF          CW        Concurrent write
EX          EX        Exclusive
==========  ========  ==================================================

Thus DF is basically a shared mode which is incompatible with the "normal"
shared lock mode, SH. In GFS2 the DF mode is used exclusively for direct I/O
operations. The glocks are basically a lock plus some routines which deal
with cache management.
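
The mode translation and the SH/DF incompatibility described above can be
sketched as follows. This is only an illustration: the names
``GLOCK_TO_DLM``, ``COMPATIBLE`` and ``glock_modes_compatible`` are invented
for this example and are not part of GFS2 or the DLM. The compatibility sets
follow the usual DLM rules for the NL, PR, CW and EX modes.

```python
# Illustrative only: these names are hypothetical, not GFS2 or DLM APIs.

# Glock mode -> DLM lock mode, as given in the table above.
# UN is mapped to NL here; it can also mean "no DLM lock at all" (IV).
GLOCK_TO_DLM = {"UN": "NL", "SH": "PR", "DF": "CW", "EX": "EX"}

# For each DLM mode, the set of modes another node may hold concurrently.
COMPATIBLE = {
    "NL": {"NL", "PR", "CW", "EX"},
    "PR": {"NL", "PR"},
    "CW": {"NL", "CW"},
    "EX": {"NL"},
}


def glock_modes_compatible(a: str, b: str) -> bool:
    """Return True if two nodes can hold glock modes a and b at once."""
    return GLOCK_TO_DLM[b] in COMPATIBLE[GLOCK_TO_DLM[a]]
```

With these definitions, two SH holders or two DF holders can coexist, while
an SH holder and a DF holder cannot, which is why DF behaves as a second,
separate shared mode.
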

The following rules apply for the cache:

==========  ==========  ==============  ==========  ==============
Glock mode  Cache data  Cache Metadata  Dirty Data  Dirty Metadata
==========  ==========  ==============  ==========  ==============
UN          No          No              No          No
SH          Yes         Yes             No          No
DF          No          Yes             No          No
EX          Yes         Yes             Yes         Yes
==========  ==========  ==============  ==========  ==============

These rules are implemented using the various glock operations which
are defined for each type of glock. Not all types of glocks use
all the modes. Only inode glocks use the DF mode for example.

Table of glock operations and per type constants:

=============  =============================================================
Field          Purpose
=============  =============================================================
go_xmote_th    Called before remote state change (e.g. to sync dirty data)
go_xmote_bh    Called after remote state change (e.g. to refill cache)
go_inval       Called if remote state change requires invalidating the cache
go_demote_ok   Returns boolean value of whether it's ok to demote a glock
               (e.g. checks timeout, and that there is no cached data)
go_lock        Called for the first local holder of a lock
go_unlock      Called on the final local unlock of a lock
go_dump        Called to print content of object for debugfs file, or on
               error to dump glock to the log.
go_type        The type of the glock, ``LM_TYPE_*``
go_callback    Called if the DLM sends a callback to drop this lock
go_flags       GLOF_ASPACE is set, if the glock has an address space
               associated with it
=============  =============================================================

The minimum hold time for each lock is the time after a remote lock
grant for which we ignore remote demote requests. This is in order to
prevent a situation where locks are being bounced around the cluster
from node to node with none of the nodes making any progress. This
tends to show up most with shared mmapped files which are being written
to by multiple nodes. By delaying the demotion in response to a
remote callback, that gives the userspace program time to make
some progress before the pages are unmapped.

There is a plan to try and remove the go_lock and go_unlock callbacks
if possible, in order to try and speed up the fast path through the locking.
Also, eventually we hope to make the glock "EX" mode locally shared
such that any local locking will be done with the i_mutex as required
rather than via the glock.

Locking rules for glock operations:

=============  ======================  =============================
Operation      GLF_LOCK bit lock held  gl_lockref.lock spinlock held
=============  ======================  =============================
go_xmote_th    Yes                     No
go_xmote_bh    Yes                     No
go_inval       Yes                     No
go_demote_ok   Sometimes               Yes
go_lock        Yes                     No
go_unlock      Yes                     No
go_dump        Sometimes               Yes
go_callback    Sometimes (N/A)         Yes
=============  ======================  =============================

.. Note::

   Operations must not drop either the bit lock or the spinlock
   if it is held on entry. go_dump and go_demote_ok must never block.
   Note that go_dump will only be called if the glock's state
   indicates that it is caching up-to-date data.

Glock locking order within GFS2:

 1. i_rwsem (if required)
 2. Rename glock (for rename only)
 3. Inode glock(s)
    (Parents before children, inodes at "same level" with same parent in
    lock number order)
 4. Rgrp glock(s) (for (de)allocation operations)
 5. Transaction glock (via gfs2_trans_begin) for non-read operations
 6. i_rw_mutex (if required)
 7. Page lock (always last, very important!)

There are two glocks per inode. One deals with access to the inode
itself (locking order as above), and the other, known as the iopen
glock, is used in conjunction with the i_nlink field in the inode to
determine the lifetime of the inode in question. Locking of inodes
is on a per-inode basis. Locking of rgrps is on a per rgrp basis.
In general we prefer to lock local locks prior to cluster locks.

Glock Statistics
----------------

The stats are divided into two sets: those relating to the
super block and those relating to an individual glock. The
super block stats are done on a per cpu basis in order to
try and reduce the overhead of gathering them. They are also
further divided by glock type. All timings are in nanoseconds.

In the case of both the super block and glock statistics,
the same information is gathered in each case.
The super
block timing statistics are used to provide default values for
the glock timing statistics, so that newly created glocks
should have, as far as possible, a sensible starting point.
The per-glock counters are initialised to zero when the
glock is created. The per-glock statistics are lost when
the glock is ejected from memory.

The statistics are divided into three pairs of mean and
variance, plus two counters. The mean/variance pairs are
smoothed exponential estimates and the algorithm used is
one which will be very familiar to those used to calculation
of round trip times in network code. See "TCP/IP Illustrated,
Volume 1", W. Richard Stevens, sect 21.3, "Round-Trip Time Measurement",
p. 299 and onwards. Also, Volume 2, Sect. 25.10, p. 838 and onwards.
Unlike the TCP/IP Illustrated case, the mean and variance are
not scaled, but are in units of integer nanoseconds.

The three pairs of mean/variance measure the following
things:

 1. DLM lock time (non-blocking requests)
 2. DLM lock time (blocking requests)
 3. Inter-request time (again to the DLM)

A non-blocking request is one which will complete right
away, whatever the state of the DLM lock in question. That
currently means any requests when (a) the current state of
the lock is exclusive, i.e. a lock demotion (b) the requested
state is either null or unlocked (again, a demotion) or (c) the
"try lock" flag is set. A blocking request covers all the other
lock requests.

There are two counters. The first is there primarily to show
how many lock requests have been made, and thus how much data
has gone into the mean/variance calculations. The other counter
is counting queuing of holders at the top layer of the glock
code. Hopefully that number will be a lot larger than the number
of dlm lock requests issued.

So why gather these statistics? There are several reasons
we'd like to get a better idea of these timings:

1. To be able to better set the glock "min hold time"
2. To spot performance issues more easily
3. To improve the algorithm for selecting resource groups for
   allocation (to base it on lock wait time, rather than blindly
   using a "try lock")

Due to the smoothing action of the updates, a step change in
some input quantity being sampled will only fully be taken
into account after 8 samples (or 4 for the variance) and this
needs to be carefully considered when interpreting the
results.

Knowing both the time it takes a lock request to complete and
the average time between lock requests for a glock means we
can compute the total percentage of the time for which the
node is able to use a glock vs. time that the rest of the
cluster has its share. That will be very useful when setting
the lock min hold time.

Great care has been taken to ensure that we
measure exactly the quantities that we want, as accurately
as possible. There are always inaccuracies in any
measuring system, but I hope this is as accurate as we
can reasonably make it.
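
To make the smoothing behaviour concrete, here is a minimal sketch of an
unscaled, integer RTT-style estimator. It assumes a gain of 1/8 for the mean
and 1/4 for the variance term, which is inferred from the "8 samples (or 4
for the variance)" remark above and from the usual TCP round trip time
estimator. The function name ``update_stats`` is hypothetical, and the
in-kernel code may differ in detail.

```python
from typing import Tuple


def update_stats(mean: int, var: int, sample: int) -> Tuple[int, int]:
    """Apply one smoothing step; all values are integer nanoseconds.

    Assumed gains: 1/8 for the mean, 1/4 for the variance estimate
    (which, as in TCP, tracks the mean absolute deviation).
    """
    delta = sample - mean
    mean += delta >> 3              # move 1/8 of the way towards the sample
    var += (abs(delta) - var) >> 2  # move 1/4 of the way towards |delta|
    return mean, var
```

Starting from a mean and variance of zero, a first sample of 8000 ns moves
the mean to only 1000 ns and the variance estimate to 2000 ns. A step change
in the input therefore shows up gradually over several samples, which is the
effect to bear in mind when interpreting the results.
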

Per sb stats can be found here::

    /sys/kernel/debug/gfs2/<fsname>/sbstats

Per glock stats can be found here::

    /sys/kernel/debug/gfs2/<fsname>/glstats

This assumes that debugfs is mounted on /sys/kernel/debug and also
that <fsname> is replaced with the name of the gfs2 filesystem
in question.

The abbreviations used in the output are as follows:

=========  ================================================================
srtt       Smoothed round trip time for non blocking dlm requests
srttvar    Variance estimate for srtt
srttb      Smoothed round trip time for (potentially) blocking dlm requests
srttvarb   Variance estimate for srttb
sirt       Smoothed inter request time (for dlm requests)
sirtvar    Variance estimate for sirt
dlm        Number of dlm requests made (dcnt in glstats file)
queue      Number of glock requests queued (qcnt in glstats file)
=========  ================================================================

The sbstats file contains a set of these stats for each glock type (so 8 lines
for each type) and for each cpu (one column per cpu).
The glstats file contains
a set of these stats for each glock in a similar format to the glocks file, but
using the format mean/variance for each of the timing stats.

The gfs2_glock_lock_time tracepoint prints out the current values of the stats
for the glock in question, along with some additional information on each dlm
reply that is received:

======  =======================================
status  The status of the dlm request
flags   The dlm request flags
tdiff   The time taken by this specific request
======  =======================================

(remaining fields as per above list)
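
As a rough userspace illustration of the mean/variance format mentioned
above, the sketch below extracts ``label:mean/variance`` tokens from a line
of text. The exact layout of a glstats line is not specified in this
document, so the token shape is an assumption and the helper name
``parse_timing_pairs`` is hypothetical. Check the labels against the real
file before relying on it.

```python
import re
from typing import Dict, Tuple

# Assumed token shape: "<label>:<mean>/<variance>", e.g. "srtt:120/35".
_PAIR = re.compile(r"(\w+):(\d+)/(\d+)(?!\w)")


def parse_timing_pairs(line: str) -> Dict[str, Tuple[int, int]]:
    """Return {label: (mean_ns, variance_ns)} for each matching token."""
    return {
        m.group(1): (int(m.group(2)), int(m.group(3)))
        for m in _PAIR.finditer(line)
    }
```

For example, the invented line ``srtt:120/35 srttb:900/410 sirt:5000/1200``
yields one (mean, variance) tuple per label. Tokens that do not match the
assumed shape, such as plain counters, are ignored.
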