=============
dm-log-writes
=============

This target takes two devices: one that all I/O is passed to normally, and one
that all of the write operations are logged to.  It is intended for file system
developers who want to verify the integrity of metadata or data as the file
system is written to.  A log_write_entry is written for every WRITE request,
and the target can also take arbitrary data from userspace and insert it into
the log.  The data in each WRITE request is copied into the log so that the
replay happens exactly as the original writes did.

Log Ordering
============

We log things in order of completion once we are sure the write is no longer in
cache.  This means that normal WRITE requests are not actually logged until the
next REQ_PREFLUSH request.  This makes it easier for userspace to replay the
log in a way that correlates to what is on disk rather than what is in cache,
and so makes it easier to detect improper waiting/flushing.

This works by attaching all WRITE requests to a list once the write completes.
Once we see a REQ_PREFLUSH request we splice this list onto the request, and
once the FLUSH request completes we log all of the WRITEs and then the FLUSH.
Only WRITEs that have completed at the time the REQ_PREFLUSH is issued are
added, in order to simulate the worst case scenario with regard to power
failures.  Consider the following example (W means write, C means complete):

        W1,W2,W3,C3,C2,Wflush,C1,Cflush

The log would show the following:

        W3,W2,flush,W1....

Again, this is to simulate what is actually on disk.  It allows us to detect
cases where a power failure at a particular point in time would create an
inconsistent file system.

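The ordering rule above can be sketched as a small shell simulation.  This is
only an illustration of the rule as described here, not the kernel
implementation: the function name ``simulate_log_order`` and the event names
(taken from the example trace) are assumptions made for this sketch.

```shell
#!/bin/sh
# Simulate the log ordering for a comma-separated event trace.
# Wn = write n issued, Cn = write n completed,
# Wflush = flush issued, Cflush = flush completed.
# (Hypothetical helper, not part of dm-log-writes or its tools.)
simulate_log_order() {
    pending=""    # writes that completed but are not yet covered by a flush
    attached=""   # writes spliced onto the in-flight flush
    log=""        # resulting log order
    old_ifs=$IFS
    IFS=,
    for ev in $1; do
        case "$ev" in
            Wflush) attached=$pending; pending="" ;;   # splice completed writes
            Cflush) log="$log$attached flush"; attached="" ;;
            C*)     pending="$pending W${ev#C}" ;;     # remember completion order
            W*)     ;;                                 # issuing a write logs nothing
        esac
    done
    IFS=$old_ifs
    echo "logged:$log"
    echo "pending:$pending"
}

simulate_log_order "W1,W2,W3,C3,C2,Wflush,C1,Cflush"
```

For the example trace this prints ``logged: W3 W2 flush`` followed by
``pending: W1``: W1 completed after the flush was issued, so it would only be
logged together with a later flush.
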
Any REQ_FUA requests bypass this flushing mechanism and are logged as soon as
they complete, as those requests bypass the device cache.

Any REQ_OP_DISCARD requests are treated like WRITE requests.  Otherwise we would
have all the DISCARD requests, and then the WRITE requests and then the FLUSH
request.  Consider the following example:

        WRITE block 1, DISCARD block 1, FLUSH

If we logged DISCARD when it completed, the replay would look like this:

        DISCARD 1, WRITE 1, FLUSH

which isn't quite what happened and wouldn't be caught during the log replay.

Target interface
================

i) Constructor

   log-writes <dev_path> <log_dev_path>

   ============= ==============================================
   dev_path      Device that all of the IO will go to normally.
   log_dev_path  Device where the log entries are written to.
   ============= ==============================================

ii) Status

    <#logged entries> <highest allocated sector>

    =========================== ========================
    #logged entries             Number of logged entries
    highest allocated sector    Highest allocated sector
    =========================== ========================

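As a rough illustration of reading these two fields from a script, the
following sketch parses one line of ``dmsetup status`` output.  It assumes the
line has the form ``<start> <length> log-writes <#logged entries> <highest
allocated sector>``, since dmsetup prints the start, length and target type
before the target's own status.  The function name
``parse_log_writes_status`` and the sample numbers are made up for this
example.

```shell
#!/bin/sh
# Print the two dm-log-writes status fields from one `dmsetup status` line.
# Assumed line format: "<start> <length> log-writes <entries> <sector>".
# (Hypothetical helper, not part of dmsetup.)
parse_log_writes_status() {
    # Intentionally split the status line into positional parameters.
    set -- $1
    if [ "$#" -ne 5 ] || [ "$3" != "log-writes" ]; then
        echo "unexpected status line" >&2
        return 1
    fi
    echo "logged_entries=$4 highest_allocated_sector=$5"
}

# Sample line with made-up numbers; on a live system one would pass
# "$(dmsetup status log)" instead.
parse_log_writes_status "0 2097152 log-writes 42 8192"
```

Rejecting lines whose target type is not ``log-writes`` keeps the helper from
silently misreading the status of some other target.
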
iii) Messages

    mark <description>

        You can use a dmsetup message to set an arbitrary mark in a log.
        For example, say you want to fsck a file system after every
        write, but first you need to replay up to the mkfs to make sure
        you are fsck'ing something reasonable.  You would do something
        like this::

          mkfs.btrfs -f /dev/mapper/log
          dmsetup message log 0 mark mkfs
          <run test>

        This would allow you to replay the log up to the mkfs mark and
        then replay from that point on, doing the fsck check at the
        interval that you want.

        Every log has a mark at the end labeled "dm-log-writes-end".

Userspace component
===================

There is a userspace tool that will replay the log for you in various ways.
It can be found here: https://github.com/josefbacik/log-writes

Example usage
=============

Say you want to test fsync on your file system.  You would do something like
this::

  TABLE="0 $(blockdev --getsz /dev/sdb) log-writes /dev/sdb /dev/sdc"
  dmsetup create log --table "$TABLE"
  mkfs.btrfs -f /dev/mapper/log
  dmsetup message log 0 mark mkfs

  mount /dev/mapper/log /mnt/btrfs-test
  <some test that does fsync at the end>
  dmsetup message log 0 mark fsync
  md5sum /mnt/btrfs-test/foo
  umount /mnt/btrfs-test

  dmsetup remove log
  replay-log --log /dev/sdc --replay /dev/sdb --end-mark fsync
  mount /dev/sdb /mnt/btrfs-test
  md5sum /mnt/btrfs-test/foo
  <verify md5sum's are correct>

Another option is to do a complicated file system operation and verify the file
system is consistent during the entire operation.  You could do this with::

  TABLE="0 $(blockdev --getsz /dev/sdb) log-writes /dev/sdb /dev/sdc"
  dmsetup create log --table "$TABLE"
  mkfs.btrfs -f /dev/mapper/log
  dmsetup message log 0 mark mkfs

  mount /dev/mapper/log /mnt/btrfs-test
  <fsstress to dirty the fs>
  btrfs filesystem balance /mnt/btrfs-test
  umount /mnt/btrfs-test
  dmsetup remove log

  replay-log --log /dev/sdc --replay /dev/sdb --end-mark mkfs
  btrfsck /dev/sdb
  replay-log --log /dev/sdc --replay /dev/sdb --start-mark mkfs \
        --fsck "btrfsck /dev/sdb" --check fua

This will replay the log until it sees a FUA request and then run the fsck
command.  If the fsck passes, it will replay to the next FUA, and so on until
the replay is complete or the fsck command exits abnormally.
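
The control flow of such a checked replay can be sketched as below.  This is
a toy model of the loop just described, not the replay-log tool itself: the
function name ``replay_with_checks``, the entry names and the use of ``true``
and ``false`` as stand-ins for the fsck command are all assumptions made for
illustration.

```shell
#!/bin/sh
# Walk a space-separated list of logged entry kinds and run a check command
# after every "fua" entry, stopping at the first failing check.
# (Hypothetical helper; entry names and check command are placeholders.)
replay_with_checks() {
    entries=$1
    check_cmd=$2
    n=0
    for kind in $entries; do
        n=$((n + 1))
        # A real replay would write this entry to the replay device here.
        if [ "$kind" = "fua" ]; then
            if ! $check_cmd; then
                echo "check failed after entry $n"
                return 1
            fi
            echo "check passed after entry $n"
        fi
    done
    echo "replayed $n entries"
}

# "true" stands in for an fsck command that always passes.
replay_with_checks "write write fua write flush fua" true
```

With a check command that fails, for example ``false``, the loop stops at the
first FUA entry and returns a non-zero status, which mirrors the replay
stopping when the fsck command exits abnormally.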