=============
dm-log-writes
=============

This target takes two devices: one that all IO is passed to normally, and one
that all of the write operations are logged to.  This is intended for file
system developers wishing to verify the integrity of metadata or data as the
file system is written to.  A log_write_entry is written for every WRITE
request, and the target can take arbitrary data from userspace to insert into
the log.  The data in the WRITE requests is copied into the log so that the
replay happens exactly as the writes happened originally.

Log Ordering
============

We log things in order of completion once we are sure the write is no longer in
cache.  This means that normal WRITE requests are not actually logged until the
next REQ_PREFLUSH request.  This makes it easier for userspace to replay the
log in a way that correlates to what is on disk rather than what is in cache,
and so to detect improper waiting/flushing.

This works by attaching all WRITE requests to a list once the write completes.
Once we see a REQ_PREFLUSH request we splice this list onto the request and once
the FLUSH request completes we log all of the WRITEs and then the FLUSH.  Only
WRITEs that have completed by the time the REQ_PREFLUSH is issued are added, in
order to simulate the worst case scenario with regard to power failures.
Consider the following example (W means write, C means complete):

	W1,W2,W3,C3,C2,Wflush,C1,Cflush

The log would show the following:

	W3,W2,flush,W1....

Again, this is to simulate what is actually on disk.  It allows us to detect
cases where a power failure at a particular point in time would create an
inconsistent file system.
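
The ordering rule can be modelled with a few lines of shell.  This is only a
sketch: the helper names ``append`` and ``simulate_log`` are made up for the
illustration, it assumes a single flush in flight at a time, and it says nothing
about how the kernel code is actually structured.

```shell
#!/bin/sh
# Toy model of the ordering rule described above; an illustration only,
# not the kernel implementation.
# Events: W<n> write issued, C<n> write completed,
#         Wflush flush issued, Cflush flush completed.

# Append item $2 to the comma-separated list $1 and print the result.
append() {
    if [ -z "$1" ]; then printf '%s' "$2"; else printf '%s,%s' "$1" "$2"; fi
}

simulate_log() {
    pending=""    # writes completed, waiting for the next flush
    attached=""   # writes spliced onto the flush currently in flight
    log=""        # order in which entries reach the log
    for ev in $(printf '%s' "$1" | tr ',' ' '); do
        case "$ev" in
            Wflush) attached=$pending; pending="" ;;
            Cflush)
                if [ -n "$attached" ]; then
                    log=$(append "$log" "$attached")
                fi
                log=$(append "$log" "flush")
                attached="" ;;
            C*) pending=$(append "$pending" "W${ev#C}") ;;
            W*) ;;   # issuing a write logs nothing by itself
        esac
    done
    printf 'log=%s pending=%s\n' "$log" "$pending"
}

simulate_log 'W1,W2,W3,C3,C2,Wflush,C1,Cflush'
# prints: log=W3,W2,flush pending=W1
```

W1 completed after the flush was issued, so in this model it stays pending and
would only reach the log with the next flush.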

Any REQ_FUA requests bypass this flushing mechanism and are logged as soon as
they complete, since those requests bypass the device cache.

REQ_OP_DISCARD requests are treated like WRITE requests.  Otherwise the log
would contain all of the DISCARD requests first, then the WRITE requests, and
then the FLUSH request.  Consider the following example:

	WRITE block 1, DISCARD block 1, FLUSH

If we logged DISCARD when it completed, the replay would look like this:

	DISCARD 1, WRITE 1, FLUSH

which isn't quite what happened and wouldn't be caught during the log replay.
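
To see why the order matters, the two sequences can be replayed against a
one-block model.  This is a minimal sketch; ``replay_block`` and the state names
are hypothetical and exist only for the illustration.

```shell
#!/bin/sh
# Toy model: track the state of a single block while replaying operations
# in the order given.  Not related to the real replay tool.
replay_block() {
    state="empty"
    for op in "$@"; do
        case "$op" in
            WRITE)   state="data" ;;
            DISCARD) state="discarded" ;;
            FLUSH)   ;;   # a flush does not change the block contents
        esac
    done
    printf '%s\n' "$state"
}

replay_block WRITE DISCARD FLUSH    # what really happened; prints: discarded
replay_block DISCARD WRITE FLUSH    # misordered log;       prints: data
```

The misordered log leaves stale data in the block, so a replay would end up
with contents the original device never had after the FLUSH.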

Target interface
================

i) Constructor

   log-writes <dev_path> <log_dev_path>

   ============= ==============================================
   dev_path      Device that all of the IO will go to normally.
   log_dev_path  Device where the log entries are written to.
   ============= ==============================================
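
A table line for this constructor can be assembled as in the following sketch.
``build_table`` is a hypothetical helper name, and the sector count and device
paths are placeholders.

```shell
#!/bin/sh
# Hypothetical helper: print a device-mapper table line of the form
#   <start> <length> log-writes <dev_path> <log_dev_path>
# $1 = length in 512-byte sectors, $2 = dev_path, $3 = log_dev_path
build_table() {
    printf '0 %s log-writes %s %s\n' "$1" "$2" "$3"
}

build_table 2097152 /dev/sdb /dev/sdc
# prints: 0 2097152 log-writes /dev/sdb /dev/sdc

# On a real system the length would normally come from the device itself:
#   dmsetup create log \
#     --table "$(build_table "$(blockdev --getsz /dev/sdb)" /dev/sdb /dev/sdc)"
```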

ii) Status

    <#logged entries> <highest allocated sector>

    =========================== ========================
    #logged entries             Number of logged entries
    highest allocated sector    Highest allocated sector
    =========================== ========================
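
The two status fields can be pulled out of a status line with a small helper.
This sketch assumes the line has the shape that ``dmsetup status <name>`` prints
for a single-target device, that is start, length and target type followed by
the target-specific fields.  ``parse_status`` is a hypothetical name and the
numbers in the sample line are invented.

```shell
#!/bin/sh
# Hypothetical helper: extract the log-writes status fields from one line.
# Assumed input layout:
#   <start> <length> log-writes <#logged entries> <highest allocated sector>
parse_status() {
    # Split the line on whitespace into positional parameters.
    set -- $1
    if [ "$3" != "log-writes" ]; then
        echo "not a log-writes target" >&2
        return 1
    fi
    printf 'entries=%s highest_sector=%s\n' "$4" "$5"
}

parse_status "0 2097152 log-writes 57 4242"
# prints: entries=57 highest_sector=4242

# With a live device this might be used as:
#   parse_status "$(dmsetup status log)"
```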

iii) Messages

    mark <description>

	You can use a dmsetup message to set an arbitrary mark in a log.
	For example, say you want to fsck a file system after every
	write, but first you need to replay up to the mkfs to make sure
	you are fsck'ing something reasonable.  You would do something
	like this::

	  mkfs.btrfs -f /dev/mapper/log
	  dmsetup message log 0 mark mkfs
	  <run test>

	This would allow you to replay the log up to the mkfs mark and
	then replay from that point on, running the fsck check at
	whatever interval you want.

	Every log has a mark at the end labeled "dm-log-writes-end".

Userspace component
===================

There is a userspace tool that will replay the log for you in various ways.
It can be found here: https://github.com/josefbacik/log-writes

Example usage
=============

Say you want to test fsync on your file system.  You would do something like
this::

  TABLE="0 $(blockdev --getsz /dev/sdb) log-writes /dev/sdb /dev/sdc"
  dmsetup create log --table "$TABLE"
  mkfs.btrfs -f /dev/mapper/log
  dmsetup message log 0 mark mkfs

  mount /dev/mapper/log /mnt/btrfs-test
  <some test that does fsync at the end>
  dmsetup message log 0 mark fsync
  md5sum /mnt/btrfs-test/foo
  umount /mnt/btrfs-test

  dmsetup remove log
  replay-log --log /dev/sdc --replay /dev/sdb --end-mark fsync
  mount /dev/sdb /mnt/btrfs-test
  md5sum /mnt/btrfs-test/foo
  <verify that the md5sums match>

Another option is to do a complicated file system operation and verify that the
file system is consistent during the entire operation.  You could do this with::

  TABLE="0 $(blockdev --getsz /dev/sdb) log-writes /dev/sdb /dev/sdc"
  dmsetup create log --table "$TABLE"
  mkfs.btrfs -f /dev/mapper/log
  dmsetup message log 0 mark mkfs

  mount /dev/mapper/log /mnt/btrfs-test
  <fsstress to dirty the fs>
  btrfs filesystem balance /mnt/btrfs-test
  umount /mnt/btrfs-test
  dmsetup remove log

  replay-log --log /dev/sdc --replay /dev/sdb --end-mark mkfs
  btrfsck /dev/sdb
  replay-log --log /dev/sdc --replay /dev/sdb --start-mark mkfs \
      --fsck "btrfsck /dev/sdb" --check fua

This replays the log until it sees a FUA request and then runs the fsck
command.  If the fsck passes, it replays to the next FUA, and so on until the
log is fully replayed or the fsck command exits abnormally.
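
The replay-and-check loop can be pictured with the following toy model.  It is
a sketch of the behaviour described above, not of the replay-log source:
``replay_with_check``, ``always_ok`` and ``fails_from_5`` are hypothetical
names, and log entries are reduced to plain words where ``FUA`` stands for a
FUA request.

```shell
#!/bin/sh
# Toy model of "replay to each FUA, then check"; not the replay-log source.
# $1 is the name of a check command (stand-in for the --fsck command); the
# remaining arguments are log entries, where the word FUA marks a FUA request.
replay_with_check() {
    check_cmd=$1
    shift
    replayed=0
    for entry in "$@"; do
        replayed=$((replayed + 1))      # "replay" one entry
        if [ "$entry" = "FUA" ] && ! "$check_cmd" "$replayed"; then
            echo "check failed after $replayed entries"
            return 1
        fi
    done
    echo "replayed $replayed entries, all checks passed"
}

# Hypothetical checks: one that always passes, one that fails from entry 5 on.
always_ok() { return 0; }
fails_from_5() { [ "$1" -lt 5 ]; }

replay_with_check always_ok W W FUA W FUA W
# prints: replayed 6 entries, all checks passed
replay_with_check fails_from_5 W W FUA W FUA W || true
# prints: check failed after 5 entries
```

Stopping at the first failing check points at the first FUA in the log after
which the check no longer passes.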