=============
dm-log-writes
=============

This target takes 2 devices, one to pass all IO to normally, and one to log all
of the write operations to. This is intended for file system developers wishing
to verify the integrity of metadata or data as the file system is written to.
There is a log_write_entry written for every WRITE request and the target is
able to take arbitrary data from userspace to insert into the log. The data
that is in the WRITE requests is copied into the log to make the replay happen
exactly as it happened originally.

Log Ordering
============

We log things in order of completion once we are sure the write is no longer in
cache. This means that normal WRITE requests are not actually logged until the
next REQ_PREFLUSH request. This is to make it easier for userspace to replay
the log in a way that correlates to what is on disk and not what is in cache,
to make it easier to detect improper waiting/flushing.

This works by attaching all WRITE requests to a list once the write completes.
Once we see a REQ_PREFLUSH request we splice this list onto the request and once
the FLUSH request completes we log all of the WRITEs and then the FLUSH.
Only completed WRITEs, at the time the REQ_PREFLUSH is issued, are added in order to
simulate the worst case scenario with regard to power failures. Consider the
following example (W means write, C means complete):

    W1,W2,W3,C3,C2,Wflush,C1,Cflush

The log would show the following:

    W3,W2,flush,W1....

Again, this is to simulate what is actually on disk; this allows us to detect
cases where a power failure at a particular point in time would create an
inconsistent file system.

Any REQ_FUA requests bypass this flushing mechanism and are logged as soon as
they complete as those requests will obviously bypass the device cache.

Any REQ_OP_DISCARD requests are treated like WRITE requests. Otherwise we would
have all the DISCARD requests, and then the WRITE requests and then the FLUSH
request. Consider the following example:

    WRITE block 1, DISCARD block 1, FLUSH

If we logged DISCARD when it completed, the replay would look like this:

    DISCARD 1, WRITE 1, FLUSH

which isn't quite what happened and wouldn't be caught during the log replay.
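
The ordering rules in this section can be explored with a small userspace model.
The sketch below is only an illustration of the rules as described here, not the
kernel implementation; the names ``LogModel`` and ``replay_order`` and the event
tuples are hypothetical and exist only for this example.

```python
from dataclasses import dataclass, field


@dataclass
class LogModel:
    """Toy model of the log ordering rules (not the kernel code)."""
    kinds: dict = field(default_factory=dict)     # request name -> kind
    pending: list = field(default_factory=list)   # completed, awaiting a flush
    attached: dict = field(default_factory=dict)  # flush name -> spliced list
    log: list = field(default_factory=list)       # resulting log order

    def submit(self, name, kind):
        self.kinds[name] = kind
        if kind == "flush":
            # Splice everything that has completed so far onto the flush.
            self.attached[name] = self.pending
            self.pending = []

    def complete(self, name):
        kind = self.kinds[name]
        if kind == "flush":
            # Log the spliced requests first, then the flush itself.
            self.log.extend(self.attached.pop(name))
            self.log.append(name)
        elif kind == "fua":
            # FUA requests bypass the flush mechanism entirely.
            self.log.append(name)
        else:
            # "write" and "discard" are handled identically.
            self.pending.append(name)


def replay_order(events):
    """Feed ("submit", name, kind) / ("complete", name) events to the model.

    Returns (log, pending): the logged order so far, and the completed
    requests still waiting for the next flush.
    """
    model = LogModel()
    for event in events:
        if event[0] == "submit":
            model.submit(event[1], event[2])
        else:
            model.complete(event[1])
    return model.log, model.pending
```

Feeding the sequence W1,W2,W3,C3,C2,Wflush,C1,Cflush into ``replay_order``
yields the log W3,W2,flush, with W1 left pending until the next flush, which
matches the example above.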

Target interface
================

i) Constructor

    log-writes <dev_path> <log_dev_path>

    ============= ==============================================
    dev_path      Device that all of the IO will go to normally.
    log_dev_path  Device where the log entries are written to.
    ============= ==============================================

ii) Status

    <#logged entries> <highest allocated sector>

    =========================== ========================
    #logged entries             Number of logged entries
    highest allocated sector    Highest allocated sector
    =========================== ========================

iii) Messages

    mark <description>

        You can use a dmsetup message to set an arbitrary mark in a log.
        For example say you want to fsck a file system after every
        write, but first you need to replay up to the mkfs to make sure
        we're fsck'ing something reasonable, you would do something like
        this::

          mkfs.btrfs -f /dev/mapper/log
          dmsetup message log 0 mark mkfs
          <run test>

        This would allow you to replay the log up to the mkfs mark and
        then replay from that point on doing the fsck check in the
        interval that you want.

        Every log has a mark at the end labeled "dm-log-writes-end".

Userspace component
===================

There is a userspace tool that will replay the log for you in various ways.
It can be found here: https://github.com/josefbacik/log-writes

Example usage
=============

Say you want to test fsync on your file system.
You would do something like this::

  TABLE="0 $(blockdev --getsz /dev/sdb) log-writes /dev/sdb /dev/sdc"
  dmsetup create log --table "$TABLE"
  mkfs.btrfs -f /dev/mapper/log
  dmsetup message log 0 mark mkfs

  mount /dev/mapper/log /mnt/btrfs-test
  <some test that does fsync at the end>
  dmsetup message log 0 mark fsync
  md5sum /mnt/btrfs-test/foo
  umount /mnt/btrfs-test

  dmsetup remove log
  replay-log --log /dev/sdc --replay /dev/sdb --end-mark fsync
  mount /dev/sdb /mnt/btrfs-test
  md5sum /mnt/btrfs-test/foo
  <verify md5sums are correct>

Another option is to do a complicated file system operation and verify the file
system is consistent during the entire operation.
You could do this with::

  TABLE="0 $(blockdev --getsz /dev/sdb) log-writes /dev/sdb /dev/sdc"
  dmsetup create log --table "$TABLE"
  mkfs.btrfs -f /dev/mapper/log
  dmsetup message log 0 mark mkfs

  mount /dev/mapper/log /mnt/btrfs-test
  <fsstress to dirty the fs>
  btrfs filesystem balance /mnt/btrfs-test
  umount /mnt/btrfs-test
  dmsetup remove log

  replay-log --log /dev/sdc --replay /dev/sdb --end-mark mkfs
  btrfsck /dev/sdb
  replay-log --log /dev/sdc --replay /dev/sdb --start-mark mkfs \
        --fsck "btrfsck /dev/sdb" --check fua

And that will replay the log until it sees a FUA request, run the fsck command
and if the fsck passes it will replay to the next FUA, until it is completed or
the fsck command exits abnormally.
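
The replay-and-check loop described above can be sketched as a toy model. This
is not how replay-log is implemented; ``Entry``, ``replay_checking_fua`` and the
``apply``/``check`` callbacks are hypothetical names used only for illustration.
The sketch also assumes that the FUA entry itself is applied before the check
runs, which the text above does not specify.

```python
from typing import Callable, NamedTuple, Optional, Sequence


class Entry(NamedTuple):
    """Hypothetical in-memory stand-in for one log entry."""
    kind: str          # "write", "flush" or "mark"
    payload: str       # block data, or the mark description
    fua: bool = False  # True if the request carried REQ_FUA


def replay_checking_fua(
    entries: Sequence[Entry],
    apply: Callable[[Entry], None],
    check: Callable[[], bool],
    start_mark: str,
) -> Optional[int]:
    """Replay entries after start_mark, running check() after every FUA entry.

    Returns the index of the entry at which check() failed, or None if the
    whole log was replayed with every check passing.
    """
    started = False
    for index, entry in enumerate(entries):
        if not started:
            # Skip everything up to and including the start mark.
            started = entry.kind == "mark" and entry.payload == start_mark
            continue
        if entry.kind == "mark":
            continue  # marks carry no data to replay
        apply(entry)
        # Assumption: the check runs after the FUA entry has been applied.
        if entry.fua and not check():
            return index
    if not started:
        raise ValueError(f"mark {start_mark!r} not found in log")
    return None
```

In this model ``check`` plays the role of the ``--fsck`` command: replay stops
at the first FUA entry for which it reports a failure.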