xref: /openbmc/linux/Documentation/arch/arm/vlocks.rst (revision c900529f3d9161bfde5cca0754f83b4d3c3e0220)
1e790a4ceSJonathan Corbet======================================
2e790a4ceSJonathan Corbetvlocks for Bare-Metal Mutual Exclusion
3e790a4ceSJonathan Corbet======================================
4e790a4ceSJonathan Corbet
5e790a4ceSJonathan CorbetVoting Locks, or "vlocks" provide a simple low-level mutual exclusion
6e790a4ceSJonathan Corbetmechanism, with reasonable but minimal requirements on the memory
7e790a4ceSJonathan Corbetsystem.
8e790a4ceSJonathan Corbet
9e790a4ceSJonathan CorbetThese are intended to be used to coordinate critical activity among CPUs
10e790a4ceSJonathan Corbetwhich are otherwise non-coherent, in situations where the hardware
11e790a4ceSJonathan Corbetprovides no other mechanism to support this and ordinary spinlocks
12e790a4ceSJonathan Corbetcannot be used.
13e790a4ceSJonathan Corbet
14e790a4ceSJonathan Corbet
15e790a4ceSJonathan Corbetvlocks make use of the atomicity provided by the memory system for
16e790a4ceSJonathan Corbetwrites to a single memory location.  To arbitrate, every CPU "votes for
17e790a4ceSJonathan Corbetitself", by storing a unique number to a common memory location.  The
18e790a4ceSJonathan Corbetfinal value seen in that memory location when all the votes have been
19e790a4ceSJonathan Corbetcast identifies the winner.
20e790a4ceSJonathan Corbet
21e790a4ceSJonathan CorbetIn order to make sure that the election produces an unambiguous result
22e790a4ceSJonathan Corbetin finite time, a CPU will only enter the election in the first place if
23e790a4ceSJonathan Corbetno winner has been chosen and the election does not appear to have
24e790a4ceSJonathan Corbetstarted yet.
25e790a4ceSJonathan Corbet
26e790a4ceSJonathan Corbet
27e790a4ceSJonathan CorbetAlgorithm
28e790a4ceSJonathan Corbet---------
29e790a4ceSJonathan Corbet
30e790a4ceSJonathan CorbetThe easiest way to explain the vlocks algorithm is with some pseudo-code::
31e790a4ceSJonathan Corbet
32e790a4ceSJonathan Corbet
33e790a4ceSJonathan Corbet	int currently_voting[NR_CPUS] = { 0, };
34e790a4ceSJonathan Corbet	int last_vote = -1; /* no votes yet */
35e790a4ceSJonathan Corbet
36e790a4ceSJonathan Corbet	bool vlock_trylock(int this_cpu)
37e790a4ceSJonathan Corbet	{
38e790a4ceSJonathan Corbet		/* signal our desire to vote */
39e790a4ceSJonathan Corbet		currently_voting[this_cpu] = 1;
40e790a4ceSJonathan Corbet		if (last_vote != -1) {
41e790a4ceSJonathan Corbet			/* someone already volunteered himself */
42e790a4ceSJonathan Corbet			currently_voting[this_cpu] = 0;
43e790a4ceSJonathan Corbet			return false; /* not ourself */
44e790a4ceSJonathan Corbet		}
45e790a4ceSJonathan Corbet
46e790a4ceSJonathan Corbet		/* let's suggest ourself */
47e790a4ceSJonathan Corbet		last_vote = this_cpu;
48e790a4ceSJonathan Corbet		currently_voting[this_cpu] = 0;
49e790a4ceSJonathan Corbet
50e790a4ceSJonathan Corbet		/* then wait until everyone else is done voting */
51e790a4ceSJonathan Corbet		for_each_cpu(i) {
52e790a4ceSJonathan Corbet			while (currently_voting[i] != 0)
53e790a4ceSJonathan Corbet				/* wait */;
54e790a4ceSJonathan Corbet		}
55e790a4ceSJonathan Corbet
56e790a4ceSJonathan Corbet		/* result */
57e790a4ceSJonathan Corbet		if (last_vote == this_cpu)
58e790a4ceSJonathan Corbet			return true; /* we won */
59e790a4ceSJonathan Corbet		return false;
60e790a4ceSJonathan Corbet	}
61e790a4ceSJonathan Corbet
62e790a4ceSJonathan Corbet	bool vlock_unlock(void)
63e790a4ceSJonathan Corbet	{
64e790a4ceSJonathan Corbet		last_vote = -1;
65e790a4ceSJonathan Corbet	}
66e790a4ceSJonathan Corbet
67e790a4ceSJonathan Corbet
68e790a4ceSJonathan CorbetThe currently_voting[] array provides a way for the CPUs to determine
69e790a4ceSJonathan Corbetwhether an election is in progress, and plays a role analogous to the
70e790a4ceSJonathan Corbet"entering" array in Lamport's bakery algorithm [1].
71e790a4ceSJonathan Corbet
72e790a4ceSJonathan CorbetHowever, once the election has started, the underlying memory system
73e790a4ceSJonathan Corbetatomicity is used to pick the winner.  This avoids the need for a static
74e790a4ceSJonathan Corbetpriority rule to act as a tie-breaker, or any counters which could
75e790a4ceSJonathan Corbetoverflow.
76e790a4ceSJonathan Corbet
77e790a4ceSJonathan CorbetAs long as the last_vote variable is globally visible to all CPUs, it
78e790a4ceSJonathan Corbetwill contain only one value that won't change once every CPU has cleared
79e790a4ceSJonathan Corbetits currently_voting flag.
80e790a4ceSJonathan Corbet
81e790a4ceSJonathan Corbet
82e790a4ceSJonathan CorbetFeatures and limitations
83e790a4ceSJonathan Corbet------------------------
84e790a4ceSJonathan Corbet
85e790a4ceSJonathan Corbet * vlocks are not intended to be fair.  In the contended case, it is the
86e790a4ceSJonathan Corbet   _last_ CPU which attempts to get the lock which will be most likely
87e790a4ceSJonathan Corbet   to win.
88e790a4ceSJonathan Corbet
89e790a4ceSJonathan Corbet   vlocks are therefore best suited to situations where it is necessary
90e790a4ceSJonathan Corbet   to pick a unique winner, but it does not matter which CPU actually
91e790a4ceSJonathan Corbet   wins.
92e790a4ceSJonathan Corbet
93e790a4ceSJonathan Corbet * Like other similar mechanisms, vlocks will not scale well to a large
94e790a4ceSJonathan Corbet   number of CPUs.
95e790a4ceSJonathan Corbet
96e790a4ceSJonathan Corbet   vlocks can be cascaded in a voting hierarchy to permit better scaling
97e790a4ceSJonathan Corbet   if necessary, as in the following hypothetical example for 4096 CPUs::
98e790a4ceSJonathan Corbet
99e790a4ceSJonathan Corbet	/* first level: local election */
100e790a4ceSJonathan Corbet	my_town = towns[(this_cpu >> 4) & 0xf];
101e790a4ceSJonathan Corbet	I_won = vlock_trylock(my_town, this_cpu & 0xf);
102e790a4ceSJonathan Corbet	if (I_won) {
103e790a4ceSJonathan Corbet		/* we won the town election, let's go for the state */
104e790a4ceSJonathan Corbet		my_state = states[(this_cpu >> 8) & 0xf];
105e790a4ceSJonathan Corbet		I_won = vlock_lock(my_state, this_cpu & 0xf));
106e790a4ceSJonathan Corbet		if (I_won) {
107e790a4ceSJonathan Corbet			/* and so on */
108e790a4ceSJonathan Corbet			I_won = vlock_lock(the_whole_country, this_cpu & 0xf];
109e790a4ceSJonathan Corbet			if (I_won) {
110e790a4ceSJonathan Corbet				/* ... */
111e790a4ceSJonathan Corbet			}
112e790a4ceSJonathan Corbet			vlock_unlock(the_whole_country);
113e790a4ceSJonathan Corbet		}
114e790a4ceSJonathan Corbet		vlock_unlock(my_state);
115e790a4ceSJonathan Corbet	}
116e790a4ceSJonathan Corbet	vlock_unlock(my_town);
117e790a4ceSJonathan Corbet
118e790a4ceSJonathan Corbet
119e790a4ceSJonathan CorbetARM implementation
120e790a4ceSJonathan Corbet------------------
121e790a4ceSJonathan Corbet
122e790a4ceSJonathan CorbetThe current ARM implementation [2] contains some optimisations beyond
123e790a4ceSJonathan Corbetthe basic algorithm:
124e790a4ceSJonathan Corbet
125e790a4ceSJonathan Corbet * By packing the members of the currently_voting array close together,
126e790a4ceSJonathan Corbet   we can read the whole array in one transaction (providing the number
127e790a4ceSJonathan Corbet   of CPUs potentially contending the lock is small enough).  This
128e790a4ceSJonathan Corbet   reduces the number of round-trips required to external memory.
129e790a4ceSJonathan Corbet
130e790a4ceSJonathan Corbet   In the ARM implementation, this means that we can use a single load
131e790a4ceSJonathan Corbet   and comparison::
132e790a4ceSJonathan Corbet
133e790a4ceSJonathan Corbet	LDR	Rt, [Rn]
134e790a4ceSJonathan Corbet	CMP	Rt, #0
135e790a4ceSJonathan Corbet
136e790a4ceSJonathan Corbet   ...in place of code equivalent to::
137e790a4ceSJonathan Corbet
138e790a4ceSJonathan Corbet	LDRB	Rt, [Rn]
139e790a4ceSJonathan Corbet	CMP	Rt, #0
140e790a4ceSJonathan Corbet	LDRBEQ	Rt, [Rn, #1]
141e790a4ceSJonathan Corbet	CMPEQ	Rt, #0
142e790a4ceSJonathan Corbet	LDRBEQ	Rt, [Rn, #2]
143e790a4ceSJonathan Corbet	CMPEQ	Rt, #0
144e790a4ceSJonathan Corbet	LDRBEQ	Rt, [Rn, #3]
145e790a4ceSJonathan Corbet	CMPEQ	Rt, #0
146e790a4ceSJonathan Corbet
147e790a4ceSJonathan Corbet   This cuts down on the fast-path latency, as well as potentially
148e790a4ceSJonathan Corbet   reducing bus contention in contended cases.
149e790a4ceSJonathan Corbet
150e790a4ceSJonathan Corbet   The optimisation relies on the fact that the ARM memory system
151e790a4ceSJonathan Corbet   guarantees coherency between overlapping memory accesses of
152e790a4ceSJonathan Corbet   different sizes, similarly to many other architectures.  Note that
153e790a4ceSJonathan Corbet   we do not care which element of currently_voting appears in which
154e790a4ceSJonathan Corbet   bits of Rt, so there is no need to worry about endianness in this
155e790a4ceSJonathan Corbet   optimisation.
156e790a4ceSJonathan Corbet
157e790a4ceSJonathan Corbet   If there are too many CPUs to read the currently_voting array in
158*d56b699dSBjorn Helgaas   one transaction then multiple transactions are still required.  The
159e790a4ceSJonathan Corbet   implementation uses a simple loop of word-sized loads for this
160e790a4ceSJonathan Corbet   case.  The number of transactions is still fewer than would be
161e790a4ceSJonathan Corbet   required if bytes were loaded individually.
162e790a4ceSJonathan Corbet
163e790a4ceSJonathan Corbet
164e790a4ceSJonathan Corbet   In principle, we could aggregate further by using LDRD or LDM, but
165e790a4ceSJonathan Corbet   to keep the code simple this was not attempted in the initial
166e790a4ceSJonathan Corbet   implementation.
167e790a4ceSJonathan Corbet
168e790a4ceSJonathan Corbet
169e790a4ceSJonathan Corbet * vlocks are currently only used to coordinate between CPUs which are
170e790a4ceSJonathan Corbet   unable to enable their caches yet.  This means that the
171e790a4ceSJonathan Corbet   implementation removes many of the barriers which would be required
172e790a4ceSJonathan Corbet   when executing the algorithm in cached memory.
173e790a4ceSJonathan Corbet
174e790a4ceSJonathan Corbet   packing of the currently_voting array does not work with cached
175e790a4ceSJonathan Corbet   memory unless all CPUs contending the lock are cache-coherent, due
176e790a4ceSJonathan Corbet   to cache writebacks from one CPU clobbering values written by other
177e790a4ceSJonathan Corbet   CPUs.  (Though if all the CPUs are cache-coherent, you should be
178e790a4ceSJonathan Corbet   probably be using proper spinlocks instead anyway).
179e790a4ceSJonathan Corbet
180e790a4ceSJonathan Corbet
181e790a4ceSJonathan Corbet * The "no votes yet" value used for the last_vote variable is 0 (not
182e790a4ceSJonathan Corbet   -1 as in the pseudocode).  This allows statically-allocated vlocks
183e790a4ceSJonathan Corbet   to be implicitly initialised to an unlocked state simply by putting
184e790a4ceSJonathan Corbet   them in .bss.
185e790a4ceSJonathan Corbet
186e790a4ceSJonathan Corbet   An offset is added to each CPU's ID for the purpose of setting this
187e790a4ceSJonathan Corbet   variable, so that no CPU uses the value 0 for its ID.
188e790a4ceSJonathan Corbet
189e790a4ceSJonathan Corbet
190e790a4ceSJonathan CorbetColophon
191e790a4ceSJonathan Corbet--------
192e790a4ceSJonathan Corbet
193e790a4ceSJonathan CorbetOriginally created and documented by Dave Martin for Linaro Limited, for
194e790a4ceSJonathan Corbetuse in ARM-based big.LITTLE platforms, with review and input gratefully
195e790a4ceSJonathan Corbetreceived from Nicolas Pitre and Achin Gupta.  Thanks to Nicolas for
196e790a4ceSJonathan Corbetgrabbing most of this text out of the relevant mail thread and writing
197e790a4ceSJonathan Corbetup the pseudocode.
198e790a4ceSJonathan Corbet
199e790a4ceSJonathan CorbetCopyright (C) 2012-2013  Linaro Limited
200e790a4ceSJonathan CorbetDistributed under the terms of Version 2 of the GNU General Public
201e790a4ceSJonathan CorbetLicense, as defined in linux/COPYING.
202e790a4ceSJonathan Corbet
203e790a4ceSJonathan Corbet
204e790a4ceSJonathan CorbetReferences
205e790a4ceSJonathan Corbet----------
206e790a4ceSJonathan Corbet
207e790a4ceSJonathan Corbet[1] Lamport, L. "A New Solution of Dijkstra's Concurrent Programming
208e790a4ceSJonathan Corbet    Problem", Communications of the ACM 17, 8 (August 1974), 453-455.
209e790a4ceSJonathan Corbet
210e790a4ceSJonathan Corbet    https://en.wikipedia.org/wiki/Lamport%27s_bakery_algorithm
211e790a4ceSJonathan Corbet
212e790a4ceSJonathan Corbet[2] linux/arch/arm/common/vlock.S, www.kernel.org.
213