=========================================================
Cluster-wide Power-up/power-down race avoidance algorithm
=========================================================

This file documents the algorithm which is used to coordinate CPU and
cluster setup and teardown operations and to manage hardware coherency
controls safely.

The section "Rationale" explains what the algorithm is for and why it is
needed. "Basic model" explains general concepts using a simplified view
of the system. The other sections explain the actual details of the
algorithm in use.


Rationale
---------

In a system containing multiple CPUs, it is desirable to have the
ability to turn off individual CPUs when the system is idle, reducing
power consumption and thermal dissipation.

In a system containing multiple clusters of CPUs, it is also desirable
to have the ability to turn off entire clusters.

Turning entire clusters off and on is a risky business, because it
involves performing potentially destructive operations affecting a group
of independently running CPUs, while the OS continues to run.
This means that we need some coordination in order to ensure that
critical cluster-level operations are only performed when it is truly
safe to do so.

Simple locking may not be sufficient to solve this problem, because
mechanisms like Linux spinlocks may rely on coherency mechanisms which
are not immediately enabled when a cluster powers up. Since enabling or
disabling those mechanisms may itself be a non-atomic operation (such as
writing some hardware registers and invalidating large caches), other
methods of coordination are required in order to guarantee safe
power-down and power-up at the cluster level.

The mechanism presented in this document describes a coherent memory
based protocol for performing the needed coordination. It aims to be as
lightweight as possible, while providing the required safety properties.


Basic model
-----------

Each cluster and CPU is assigned a state, as follows:

    - DOWN
    - COMING_UP
    - UP
    - GOING_DOWN

::

        +---------> UP ----------+
        |                        v

    COMING_UP                GOING_DOWN

        ^                        |
        +--------- DOWN <--------+


DOWN:
    The CPU or cluster is not coherent, and is either powered off or
    suspended, or is ready to be powered off or suspended.

COMING_UP:
    The CPU or cluster has committed to moving to the UP state.
    It may be part way through the process of initialisation and
    enabling coherency.

UP:
    The CPU or cluster is active and coherent at the hardware
    level. A CPU in this state is not necessarily being used
    actively by the kernel.

GOING_DOWN:
    The CPU or cluster has committed to moving to the DOWN
    state. It may be part way through the process of teardown and
    coherency exit.
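
The four-state cycle of the basic model can be sketched in C as follows.
This is only an illustration of the diagram above, not code taken from the
kernel: the enum and the helper names (``basic_next``,
``basic_transition_ok``, ``basic_guaranteed_coherent``) are hypothetical
and exist only for this example.

```c
#include <assert.h>
#include <stdbool.h>

/* The four states of the basic model (illustrative values only). */
enum basic_state { DOWN, COMING_UP, UP, GOING_DOWN };

/* Return the single successor of a state in the basic model's cycle. */
static inline enum basic_state basic_next(enum basic_state s)
{
    switch (s) {
    case DOWN:       return COMING_UP;
    case COMING_UP:  return UP;
    case UP:         return GOING_DOWN;
    case GOING_DOWN: return DOWN;
    }
    assert(!"invalid basic_state");
    return DOWN;
}

/* A transition is legal only if it follows an arrow in the diagram. */
static inline bool basic_transition_ok(enum basic_state from,
                                       enum basic_state to)
{
    return basic_next(from) == to;
}

/*
 * Only UP guarantees hardware coherency: COMING_UP and GOING_DOWN may be
 * part way through enabling or exiting coherency, and DOWN is not coherent.
 */
static inline bool basic_guaranteed_coherent(enum basic_state s)
{
    return s == UP;
}
```

The point of the sketch is that every state has exactly one successor, so
there is no shortcut from UP to DOWN or from DOWN to UP that skips the
intermediate "committed" states.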


Each CPU has one of these states assigned to it at any point in time.
The CPU states are described in the "CPU state" section, below.

Each cluster is also assigned a state, but it is necessary to split the
state value into two parts (the "cluster" state and "inbound" state) and
to introduce additional states in order to avoid races between different
CPUs in the cluster simultaneously modifying the state. The
cluster-level states are described in the "Cluster state" section.

To help distinguish the CPU states from cluster states in this
discussion, the state names are given a `CPU_` prefix for the CPU states,
and a `CLUSTER_` or `INBOUND_` prefix for the cluster states.


CPU state
---------

In this algorithm, each individual core in a multi-core processor is
referred to as a "CPU". CPUs are assumed to be single-threaded:
therefore, a CPU can only be doing one thing at a single point in time.

This means that CPUs fit the basic model closely.

The algorithm defines the following states for each CPU in the system:

    - CPU_DOWN
    - CPU_COMING_UP
    - CPU_UP
    - CPU_GOING_DOWN

::

            cluster setup and
            CPU setup complete          policy decision
          +-----------> CPU_UP ------------+
          |                                v

    CPU_COMING_UP                   CPU_GOING_DOWN

          ^                                |
          +----------- CPU_DOWN <----------+
     policy decision           CPU teardown complete
    or hardware event


The definitions of the four states correspond closely to the states of
the basic model.

Transitions between states occur as follows.

A trigger event (spontaneous) means that the CPU can transition to the
next state as a result of making local progress only, with no
requirement for any external event to happen.


CPU_DOWN:
    A CPU reaches the CPU_DOWN state when it is ready for
    power-down. On reaching this state, the CPU will typically
    power itself down or suspend itself, via a WFI instruction or a
    firmware call.

    Next state:
        CPU_COMING_UP
    Conditions:
        none

    Trigger events:
        a) an explicit hardware power-up operation, resulting
           from a policy decision on another CPU;

        b) a hardware event, such as an interrupt.


CPU_COMING_UP:
    A CPU cannot start participating in hardware coherency until the
    cluster is set up and coherent. If the cluster is not ready,
    then the CPU will wait in the CPU_COMING_UP state until the
    cluster has been set up.

    Next state:
        CPU_UP
    Conditions:
        The CPU's parent cluster must be in CLUSTER_UP.
    Trigger events:
        Transition of the parent cluster to CLUSTER_UP.

    Refer to the "Cluster state" section for a description of the
    CLUSTER_UP state.


CPU_UP:
    When a CPU reaches the CPU_UP state, it is safe for the CPU to
    start participating in local coherency.

    This is done by jumping to the kernel's CPU resume code.

    Note that the definition of this state is slightly different
    from the basic model definition: CPU_UP does not mean that the
    CPU is coherent yet, but it does mean that it is safe to resume
    the kernel. The kernel handles the rest of the resume
    procedure, so the remaining steps are not visible as part of the
    race avoidance algorithm.

    The CPU remains in this state until an explicit policy decision
    is made to shut down or suspend the CPU.

    Next state:
        CPU_GOING_DOWN
    Conditions:
        none
    Trigger events:
        explicit policy decision


CPU_GOING_DOWN:
    While in this state, the CPU exits coherency, including any
    operations required to achieve this (such as cleaning data
    caches).

    Next state:
        CPU_DOWN
    Conditions:
        local CPU teardown complete
    Trigger events:
        (spontaneous)


Cluster state
-------------

A cluster is a group of connected CPUs with some common resources.
Because a cluster contains multiple CPUs, it can be doing multiple
things at the same time. This has some implications. In particular, a
CPU can start up while another CPU is tearing the cluster down.

In this discussion, the "outbound side" is the view of the cluster state
as seen by a CPU tearing the cluster down. The "inbound side" is the
view of the cluster state as seen by a CPU setting the cluster up.

In order to enable safe coordination in such situations, it is important
that a CPU which is setting up the cluster can advertise its state
independently of the CPU which is tearing down the cluster. For this
reason, the cluster state is split into two parts:

    "cluster" state: The global state of the cluster; or the state
    on the outbound side:

        - CLUSTER_DOWN
        - CLUSTER_UP
        - CLUSTER_GOING_DOWN

    "inbound" state: The state of the cluster on the inbound side.

        - INBOUND_NOT_COMING_UP
        - INBOUND_COMING_UP


    The different pairings of these states result in six possible
    states for the cluster as a whole::

                                  CLUSTER_UP
                +==========> INBOUND_NOT_COMING_UP -------------+
                #                                               |
                                                                |
           CLUSTER_UP     <----+                                |
        INBOUND_COMING_UP      |                                v

                ^             CLUSTER_GOING_DOWN       CLUSTER_GOING_DOWN
                #              INBOUND_COMING_UP <=== INBOUND_NOT_COMING_UP

          CLUSTER_DOWN         |                                |
        INBOUND_COMING_UP <----+                                |
                                                                |
                ^                                               |
                +===========     CLUSTER_DOWN      <------------+
                             INBOUND_NOT_COMING_UP

    Transitions -----> can only be made by the outbound CPU, and
    only involve changes to the "cluster" state.

    Transitions ===##> can only be made by the inbound CPU, and only
    involve changes to the "inbound" state, except where there is no
    further transition possible on the outbound side (i.e., the
    outbound CPU has put the cluster into the CLUSTER_DOWN state).

    The race avoidance algorithm does not provide a way to determine
    which exact CPUs within the cluster play these roles.
    This must be decided in advance by some other means. Refer to the
    section "Last man and first man selection" for more explanation.


    CLUSTER_DOWN/INBOUND_NOT_COMING_UP is the only state where the
    cluster can actually be powered down.

    The parallelism of the inbound and outbound CPUs is observed by
    the existence of two different paths from CLUSTER_GOING_DOWN/
    INBOUND_NOT_COMING_UP (corresponding to GOING_DOWN in the basic
    model) to CLUSTER_DOWN/INBOUND_COMING_UP (corresponding to
    COMING_UP in the basic model). The second path avoids cluster
    teardown completely.

    CLUSTER_UP/INBOUND_COMING_UP is equivalent to UP in the basic
    model. The final transition to CLUSTER_UP/INBOUND_NOT_COMING_UP
    is trivial and merely resets the state machine ready for the
    next cycle.

    Details of the allowable transitions follow.

    The next state in each case is notated

        <cluster state>/<inbound state> (<transitioner>)

    where the <transitioner> is the side on which the transition
    can occur; either the inbound or the outbound side.
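
As a rough illustration, the arrows of the cluster state diagram can be
written down as a table of (from, to, transitioner) entries and checked
mechanically. The sketch below is not the kernel implementation: the enum
values are arbitrary, and ``struct pair``, ``struct transition``,
``transition_allowed`` and ``can_power_off`` are hypothetical names
introduced only for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum cluster_state { CLUSTER_DOWN, CLUSTER_UP, CLUSTER_GOING_DOWN };
enum inbound_state { INBOUND_NOT_COMING_UP, INBOUND_COMING_UP };
enum side { SIDE_INBOUND, SIDE_OUTBOUND };  /* the <transitioner> */

struct pair { enum cluster_state cluster; enum inbound_state inbound; };
struct transition { struct pair from, to; enum side who; };

/* One entry per arrow in the diagram. */
static const struct transition allowed[] = {
    { {CLUSTER_DOWN, INBOUND_NOT_COMING_UP},
      {CLUSTER_DOWN, INBOUND_COMING_UP}, SIDE_INBOUND },
    { {CLUSTER_DOWN, INBOUND_COMING_UP},
      {CLUSTER_UP, INBOUND_COMING_UP}, SIDE_INBOUND },
    { {CLUSTER_UP, INBOUND_COMING_UP},
      {CLUSTER_UP, INBOUND_NOT_COMING_UP}, SIDE_INBOUND },
    { {CLUSTER_UP, INBOUND_NOT_COMING_UP},
      {CLUSTER_GOING_DOWN, INBOUND_NOT_COMING_UP}, SIDE_OUTBOUND },
    { {CLUSTER_GOING_DOWN, INBOUND_NOT_COMING_UP},
      {CLUSTER_DOWN, INBOUND_NOT_COMING_UP}, SIDE_OUTBOUND },
    { {CLUSTER_GOING_DOWN, INBOUND_NOT_COMING_UP},
      {CLUSTER_GOING_DOWN, INBOUND_COMING_UP}, SIDE_INBOUND },
    { {CLUSTER_GOING_DOWN, INBOUND_COMING_UP},
      {CLUSTER_UP, INBOUND_COMING_UP}, SIDE_OUTBOUND },
    { {CLUSTER_GOING_DOWN, INBOUND_COMING_UP},
      {CLUSTER_DOWN, INBOUND_COMING_UP}, SIDE_OUTBOUND },
};

/* True if `who` may move the cluster from (fc, fi) to (tc, ti). */
static inline bool transition_allowed(enum cluster_state fc,
                                      enum inbound_state fi,
                                      enum cluster_state tc,
                                      enum inbound_state ti,
                                      enum side who)
{
    for (size_t i = 0; i < sizeof(allowed) / sizeof(allowed[0]); i++) {
        const struct transition *t = &allowed[i];

        if (t->from.cluster == fc && t->from.inbound == fi &&
            t->to.cluster == tc && t->to.inbound == ti && t->who == who)
            return true;
    }
    return false;
}

/* Power-off is possible only in CLUSTER_DOWN/INBOUND_NOT_COMING_UP. */
static inline bool can_power_off(enum cluster_state c, enum inbound_state i)
{
    return c == CLUSTER_DOWN && i == INBOUND_NOT_COMING_UP;
}
```

Note that in this table every outbound entry changes only the "cluster"
half of the pair, while the only inbound entry that touches the "cluster"
half starts from CLUSTER_DOWN, matching the exception described above.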


CLUSTER_DOWN/INBOUND_NOT_COMING_UP:
    Next state:
        CLUSTER_DOWN/INBOUND_COMING_UP (inbound)
    Conditions:
        none

    Trigger events:
        a) an explicit hardware power-up operation, resulting
           from a policy decision on another CPU;

        b) a hardware event, such as an interrupt.


CLUSTER_DOWN/INBOUND_COMING_UP:

    In this state, an inbound CPU sets up the cluster, including
    enabling of hardware coherency at the cluster level and any
    other operations (such as cache invalidation) which are required
    in order to achieve this.

    The purpose of this state is to do sufficient cluster-level
    setup to enable other CPUs in the cluster to enter coherency
    safely.

    Next state:
        CLUSTER_UP/INBOUND_COMING_UP (inbound)
    Conditions:
        cluster-level setup and hardware coherency complete
    Trigger events:
        (spontaneous)


CLUSTER_UP/INBOUND_COMING_UP:

    Cluster-level setup is complete and hardware coherency is
    enabled for the cluster.
    Other CPUs in the cluster can safely enter coherency.

    This is a transient state, leading immediately to
    CLUSTER_UP/INBOUND_NOT_COMING_UP. All other CPUs on the cluster
    should treat these two states as equivalent.

    Next state:
        CLUSTER_UP/INBOUND_NOT_COMING_UP (inbound)
    Conditions:
        none
    Trigger events:
        (spontaneous)


CLUSTER_UP/INBOUND_NOT_COMING_UP:

    Cluster-level setup is complete and hardware coherency is
    enabled for the cluster. Other CPUs in the cluster can safely
    enter coherency.

    The cluster will remain in this state until a policy decision is
    made to power the cluster down.

    Next state:
        CLUSTER_GOING_DOWN/INBOUND_NOT_COMING_UP (outbound)
    Conditions:
        none
    Trigger events:
        policy decision to power down the cluster


CLUSTER_GOING_DOWN/INBOUND_NOT_COMING_UP:

    An outbound CPU is tearing the cluster down. The selected CPU
    must wait in this state until all CPUs in the cluster are in the
    CPU_DOWN state.

    When all CPUs are in the CPU_DOWN state, the cluster can be torn
    down, for example by cleaning data caches and exiting
    cluster-level coherency.

    To avoid unnecessary teardown operations, the outbound CPU
    should check the inbound cluster state for asynchronous
    transitions to INBOUND_COMING_UP. Alternatively, individual
    CPUs can be checked for entry into CPU_COMING_UP or CPU_UP.


    Next states:

        CLUSTER_DOWN/INBOUND_NOT_COMING_UP (outbound)
            Conditions:
                cluster torn down and ready to power off
            Trigger events:
                (spontaneous)

        CLUSTER_GOING_DOWN/INBOUND_COMING_UP (inbound)
            Conditions:
                none

            Trigger events:
                a) an explicit hardware power-up operation,
                   resulting from a policy decision on another
                   CPU;

                b) a hardware event, such as an interrupt.


CLUSTER_GOING_DOWN/INBOUND_COMING_UP:

    The cluster is (or was) being torn down, but another CPU has
    come online in the meantime and is trying to set up the cluster
    again.

    If the outbound CPU observes this state, it has two choices:

        a) back out of teardown, restoring the cluster to the
           CLUSTER_UP state;

        b) finish tearing the cluster down and put the cluster
           in the CLUSTER_DOWN state; the inbound CPU will
           set up the cluster again from there.

    Choice (a) permits the removal of some latency by avoiding
    unnecessary teardown and setup operations in situations where
    the cluster is not really going to be powered down.


    Next states:

        CLUSTER_UP/INBOUND_COMING_UP (outbound)
            Conditions:
                cluster-level setup and hardware
                coherency complete

            Trigger events:
                (spontaneous)

        CLUSTER_DOWN/INBOUND_COMING_UP (outbound)
            Conditions:
                cluster torn down and ready to power off

            Trigger events:
                (spontaneous)


Last man and First man selection
--------------------------------

The CPU which performs cluster tear-down operations on the outbound side
is commonly referred to as the "last man".

The CPU which performs cluster setup on the inbound side is commonly
referred to as the "first man".

The race avoidance algorithm documented above does not provide a
mechanism to choose which CPUs should play these roles.


Last man:

When shutting down the cluster, all the CPUs involved are initially
executing Linux and hence coherent. Therefore, ordinary spinlocks can
be used to select a last man safely, before the CPUs become
non-coherent.


First man:

Because CPUs may power up asynchronously in response to external wake-up
events, a dynamic mechanism is needed to make sure that only one CPU
attempts to play the first man role and do the cluster-level
initialisation: any other CPUs must wait for this to complete before
proceeding.

Cluster-level initialisation may involve actions such as configuring
coherency controls in the bus fabric.

The current implementation in mcpm_head.S uses a separate mutual exclusion
mechanism to do this arbitration. This mechanism is documented in
detail in vlocks.txt.
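
For the last man case, one possible approach is to count the CPUs still
in use in the cluster and let the CPU that brings the count to zero take
the last man role. The sketch below is a single-threaded model under that
assumption; ``struct cluster_usage`` and ``cpu_leave_is_last_man`` are
hypothetical names, and in real code the counter update would have to be
made while holding a spinlock, which is safe at that point because the
CPUs involved are still coherent. It says nothing about first man
selection, which needs a mechanism that works without coherency.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-cluster bookkeeping; assumed to be lock-protected. */
struct cluster_usage {
    int cpus_in_use;  /* CPUs that have not yet decided to go down */
};

/*
 * Called by a CPU that has decided to power down.
 * Returns true if the caller is the last man for its cluster.
 */
static inline bool cpu_leave_is_last_man(struct cluster_usage *u)
{
    assert(u->cpus_in_use > 0);
    u->cpus_in_use--;
    return u->cpus_in_use == 0;
}
```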


Features and Limitations
------------------------

Implementation:

    The current ARM-based implementation is split between
    arch/arm/common/mcpm_head.S (low-level inbound CPU operations) and
    arch/arm/common/mcpm_entry.c (everything else):

    __mcpm_cpu_going_down() signals the transition of a CPU to the
    CPU_GOING_DOWN state.

    __mcpm_cpu_down() signals the transition of a CPU to the CPU_DOWN
    state.

    A CPU transitions to CPU_COMING_UP and then to CPU_UP via the
    low-level power-up code in mcpm_head.S. This could
    involve CPU-specific setup code, but in the current
    implementation it does not.

    __mcpm_outbound_enter_critical() and __mcpm_outbound_leave_critical()
    handle transitions from CLUSTER_UP to CLUSTER_GOING_DOWN
    and from there to CLUSTER_DOWN or back to CLUSTER_UP (in
    the case of an aborted cluster power-down).

    These functions are more complex than the __mcpm_cpu_*()
    functions due to the extra inter-CPU coordination which
    is needed for safe transitions at the cluster level.
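
The outbound-side decisions described in the "Cluster state" section can
be modelled roughly as below. This is a simplified, single-threaded sketch
and not the code in mcpm_entry.c: it ignores cache maintenance and
waiting, and the ``model_*`` functions, ``struct cluster_model``,
``enum teardown_verdict`` and ``NR_CPUS_IN_CLUSTER`` are hypothetical
names chosen for this example.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS_IN_CLUSTER 4  /* hypothetical cluster size */

enum cpu_state { CPU_DOWN, CPU_COMING_UP, CPU_UP, CPU_GOING_DOWN };
enum cluster_state { CLUSTER_DOWN, CLUSTER_UP, CLUSTER_GOING_DOWN };
enum inbound_state { INBOUND_NOT_COMING_UP, INBOUND_COMING_UP };
enum teardown_verdict { TEARDOWN_PROCEED, TEARDOWN_WAIT, TEARDOWN_ABORT };

struct cluster_model {
    enum cpu_state cpus[NR_CPUS_IN_CLUSTER];
    enum cluster_state cluster;
    enum inbound_state inbound;
};

/* Outbound side: publish the final "cluster" state of the attempt. */
static inline void model_outbound_leave_critical(struct cluster_model *c,
                                                 enum cluster_state state)
{
    assert(state == CLUSTER_DOWN || state == CLUSTER_UP);
    c->cluster = state;
}

/* Outbound side: CPU `self` (the last man) tries to begin teardown. */
static inline enum teardown_verdict
model_outbound_enter_critical(struct cluster_model *c, int self)
{
    bool must_wait = false;

    /* Advertise CLUSTER_GOING_DOWN before inspecting anything else. */
    c->cluster = CLUSTER_GOING_DOWN;

    /* An inbound CPU is already setting the cluster up: back out. */
    if (c->inbound == INBOUND_COMING_UP)
        goto abort;

    for (int i = 0; i < NR_CPUS_IN_CLUSTER; i++) {
        if (i == self)
            continue;
        if (c->cpus[i] == CPU_COMING_UP || c->cpus[i] == CPU_UP)
            goto abort;        /* a CPU has been woken up again */
        if (c->cpus[i] == CPU_GOING_DOWN)
            must_wait = true;  /* local teardown still in progress */
    }
    return must_wait ? TEARDOWN_WAIT : TEARDOWN_PROCEED;

abort:
    model_outbound_leave_critical(c, CLUSTER_UP);
    return TEARDOWN_ABORT;
}
```

A caller in this model would repeat the call while it returns
``TEARDOWN_WAIT``, perform the cluster teardown on ``TEARDOWN_PROCEED``
and then call ``model_outbound_leave_critical(c, CLUSTER_DOWN)``.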

    A cluster transitions from CLUSTER_DOWN back to CLUSTER_UP via
    the low-level power-up code in mcpm_head.S. This
    typically involves platform-specific setup code,
    provided by the platform-specific power_up_setup
    function registered via mcpm_sync_init.

Deep topologies:

    As currently described and implemented, the algorithm does not
    support CPU topologies involving more than two levels (i.e.,
    clusters of clusters are not supported). The algorithm could be
    extended by replicating the cluster-level states for the
    additional topological levels, and modifying the transition
    rules for the intermediate (non-outermost) cluster levels.


Colophon
--------

Originally created and documented by Dave Martin for Linaro Limited, in
collaboration with Nicolas Pitre and Achin Gupta.

Copyright (C) 2012-2013 Linaro Limited
Distributed under the terms of Version 2 of the GNU General Public
License, as defined in linux/COPYING.