=============================
Guidance for writing policies
=============================

Try to keep transactionality out of the policy.  The core is careful to
avoid asking the policy about any block that is migrating.  This is a
pain, but it makes the policies easier to write.

Mappings are loaded into the policy at construction time.

Every bio that is mapped by the target is referred to the policy.
The policy can return a simple HIT or MISS or issue a migration.

Currently there's no way for the policy to issue background work,
e.g. to start writing back dirty blocks that are going to be evicted
soon.

Because we map bios rather than requests, it's easy for the policy
to get fooled by many small bios.  For this reason the core target
issues periodic ticks to the policy.  It's suggested that the policy
doesn't update state (e.g. hit counts) for a block more than once
per tick.  The core ticks by watching bios complete, and so tries to
detect when the IO scheduler has let the IOs run.
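The once-per-tick suggestion can be illustrated with a small user-space
sketch.  This is not kernel code and not how any shipped policy is
implemented; the class ``TickLimitedCounter`` and its methods ``on_tick``
and ``on_bio`` are hypothetical names chosen for illustration only.

```python
from collections import defaultdict


class TickLimitedCounter:
    """Hypothetical helper: counts at most one hit per block per tick."""

    def __init__(self):
        self.tick = 0                 # advanced when the (simulated) core ticks
        self.hits = defaultdict(int)  # block -> hit count
        self.last_tick = {}           # block -> tick in which it was last counted

    def on_tick(self):
        # Stand-in for the core telling the policy that a tick has elapsed.
        self.tick += 1

    def on_bio(self, block):
        # Count this bio only if the block was not already counted this tick.
        if self.last_tick.get(block) != self.tick:
            self.hits[block] += 1
            self.last_tick[block] = self.tick
        return self.hits[block]


counter = TickLimitedCounter()
for _ in range(100):      # 100 small bios hitting the same block in one tick
    counter.on_bio(7)
counter.on_tick()
counter.on_bio(7)         # first bio of the next tick
print(counter.hits[7])    # 2
```

In this sketch the hundred small bios count as a single hit, so a burst of
small bios to one block cannot inflate that block's statistics.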


Overview of supplied cache replacement policies
===============================================

multiqueue (mq)
---------------

This policy is now an alias for smq (see below).

The following tunables are accepted, but have no effect::

	'sequential_threshold <#nr_sequential_ios>'
	'random_threshold <#nr_random_ios>'
	'read_promote_adjustment <value>'
	'write_promote_adjustment <value>'
	'discard_promote_adjustment <value>'

Stochastic multiqueue (smq)
---------------------------

This policy is the default.

The stochastic multiqueue (smq) policy addresses some of the problems
with the multiqueue (mq) policy.

Compared with mq, the smq policy promises lower memory usage,
improved performance and increased adaptability in the face of changing
workloads.  smq also has no cumbersome tuning knobs.

Users may switch from "mq" to "smq" simply by reloading, with the new
policy name, a DM table that is using the cache target.  Doing so will
cause all of the mq policy's hints to be dropped.  Also, performance of
the cache may degrade slightly until smq recalculates the origin
device's hotspots that should be cached.

Memory usage
^^^^^^^^^^^^

The mq policy used a lot of memory: 88 bytes per cache block on a
64-bit machine.

smq uses 28-bit indexes to implement its data structures rather than
pointers.  It avoids storing an explicit hit count for each block.  It
has a 'hotspot' queue, rather than a pre-cache, which uses a quarter of
the entries (each hotspot block covers a larger area than a single
cache block).

All this means smq uses ~25 bytes per cache block.  Still a lot of
memory, but a substantial improvement nonetheless.
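To put the per-block figures in context, the following sketch works out the
policy overhead for one example configuration.  The 1 TiB cache size and
256 KiB cache block size are assumptions picked for the arithmetic, and
``policy_overhead_bytes`` is a hypothetical helper.  Only the 88 and ~25
byte figures come from the text above.

```python
KIB = 1024
MIB = 1024 * KIB
TIB = 1024 * 1024 * MIB

MQ_BYTES_PER_BLOCK = 88    # from the text, on a 64-bit machine
SMQ_BYTES_PER_BLOCK = 25   # approximate figure from the text


def policy_overhead_bytes(cache_bytes, block_bytes, bytes_per_block):
    """Memory used by the policy for a cache of the given size."""
    nr_blocks = cache_bytes // block_bytes
    return nr_blocks * bytes_per_block


cache_bytes = 1 * TIB      # assumed cache device size
block_bytes = 256 * KIB    # assumed cache block size

mq = policy_overhead_bytes(cache_bytes, block_bytes, MQ_BYTES_PER_BLOCK)
smq = policy_overhead_bytes(cache_bytes, block_bytes, SMQ_BYTES_PER_BLOCK)
print(mq // MIB, smq // MIB)   # 352 100
```

Under these assumptions the cache has 4194304 blocks, so mq would need
about 352 MiB and smq about 100 MiB.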

Level balancing
^^^^^^^^^^^^^^^

mq placed entries in different levels of the multiqueue structures
based on their hit count (~ln(hit count)).  This meant the bottom
levels generally had the most entries, and the top ones had very
few.  Having unbalanced levels like this reduced the efficacy of the
multiqueue.

smq does not maintain a hit count; instead it swaps hit entries with
the least recently used entry from the level above.  The overall
ordering is a side effect of this stochastic process.  With this
scheme we can decide how many entries occupy each multiqueue level,
resulting in better promotion/demotion decisions.
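The swap can be sketched in a few lines.  This is a simplified model, not
the kernel's data structure: the list-of-lists layout, the function name
``promote_on_hit``, the choice to put the demoted entry at the most recently
used end, and the handling of the top level are all assumptions of the
sketch.

```python
def promote_on_hit(levels, entry):
    """Swap `entry` with the LRU entry of the level above it.

    `levels[0]` is the bottom level.  Within each level, index 0 is the
    least recently used entry and the last index the most recently used
    (an assumption of this sketch).
    """
    for i, level in enumerate(levels):
        if entry in level:
            break
    else:
        raise KeyError(entry)

    if i + 1 == len(levels) or not levels[i + 1]:
        # Top level, or nothing to swap with: just refresh recency.
        level.remove(entry)
        level.append(entry)
        return

    above = levels[i + 1]
    victim = above.pop(0)   # LRU entry of the level above moves down
    level.remove(entry)
    level.append(victim)
    above.append(entry)     # the hit entry moves up one level


levels = [["a", "b"], ["c", "d"], ["e", "f"]]
promote_on_hit(levels, "a")
print(levels)                     # [['b', 'c'], ['d', 'a'], ['e', 'f']]
print([len(l) for l in levels])   # [2, 2, 2]
```

Because every promotion is paired with a demotion, the number of entries in
each level stays fixed, which is the property the text relies on.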

Adaptability
^^^^^^^^^^^^

The mq policy maintained a hit count for each cache block.  For a
different block to get promoted to the cache its hit count had to
exceed the lowest currently in the cache.  This meant it could take a
long time for the cache to adapt between varying IO patterns.

smq doesn't maintain hit counts, so a lot of this problem just goes
away.  In addition it tracks performance of the hotspot queue, which
is used to decide which blocks to promote.  If the hotspot queue is
performing badly then it starts moving entries more quickly between
levels.  This lets it adapt to new IO patterns very quickly.

Performance
^^^^^^^^^^^

Testing smq shows substantially better performance than mq.

cleaner
-------

The cleaner writes back all dirty blocks in a cache to decommission it.

Examples
========

The syntax for a table is::

	cache <metadata dev> <cache dev> <origin dev> <block size>
	<#feature_args> [<feature arg>]*
	<policy> <#policy_args> [<policy arg>]*

The syntax to send a message using the dmsetup command is::

	dmsetup message <mapped device> 0 sequential_threshold 1024
	dmsetup message <mapped device> 0 random_threshold 8

Using dmsetup::

	dmsetup create blah --table "0 268435456 cache /dev/sdb /dev/sdc \
	    /dev/sdd 512 0 mq 4 sequential_threshold 1024 random_threshold 8"

This creates a 128 GiB mapped device named 'blah' with the
sequential_threshold set to 1024 and the random_threshold set to 8.
As noted above, mq is now an alias for smq, so both tunables are
accepted but have no effect.
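The following sketch shows how the fields of the table line fit together,
in particular that ``<#policy_args>`` counts words, so two key/value pairs
give a count of 4.  The helper ``cache_table`` is hypothetical and only
builds the string.  It does not call dmsetup or touch any device.  The size
calculation relies on device-mapper table lengths being expressed in
512-byte sectors.

```python
SECTOR_BYTES = 512  # device-mapper table lengths are in 512-byte sectors


def cache_table(start, length, metadata_dev, cache_dev, origin_dev,
                block_size, features, policy, policy_args):
    """Build a cache target table line (hypothetical helper)."""
    # Each key/value pair contributes two words to <#policy_args>.
    arg_words = [str(w) for pair in policy_args.items() for w in pair]
    fields = [start, length, "cache", metadata_dev, cache_dev, origin_dev,
              block_size, len(features), *features,
              policy, len(arg_words), *arg_words]
    return " ".join(str(f) for f in fields)


table = cache_table(0, 268435456, "/dev/sdb", "/dev/sdc", "/dev/sdd",
                    512, [], "mq",
                    {"sequential_threshold": 1024, "random_threshold": 8})
print(table)
size_gib = 268435456 * SECTOR_BYTES // 2**30
print(size_gib, "GiB")   # 128 GiB
```

The printed table matches the string passed to ``dmsetup create`` in the
example above, and 268435456 sectors of 512 bytes come to 128 GiB.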