================
RAID 4/5/6 cache
================

A RAID 4/5/6 array can include an extra disk that serves as a data cache
alongside the normal RAID disks. The cache disk does not change the role of the
RAID disks; it caches data destined for them. The cache can run in
write-through mode (supported since kernel 4.4) or write-back mode (supported
since kernel 4.10). mdadm (since version 3.4) has an option '--write-journal'
to create an array with a cache. Please refer to the mdadm manual for details.
By default (when the RAID array starts), the cache is in write-through mode. A
user can switch it to write-back mode by::

        echo "write-back" > /sys/block/md0/md/journal_mode

And switch it back to write-through mode by::

        echo "write-through" > /sys/block/md0/md/journal_mode

In both modes, all writes to the array hit the cache disk first. This means
the cache disk must be fast and durable under sustained writes.

write-through mode
==================

This mode mainly fixes the 'write hole' issue. For a RAID 4/5/6 array, an
unclean shutdown can leave some stripes in an inconsistent state, e.g., data
and parity don't match. The reason is that a stripe write involves several RAID
disks, and the writes may not have reached all of them before the unclean
shutdown. We call an array degraded if it has inconsistent data. MD tries to
resync the array to bring it back to a normal state. But until the resync
completes, any system crash exposes the array to real data corruption. This
problem is called the 'write hole'.

The write-through cache writes all data to the cache disk first. After the data
is safe on the cache disk, it is flushed to the RAID disks. This two-step write
guarantees that MD can recover correct data after an unclean shutdown even if
the array is degraded. Thus the cache closes the 'write hole'.
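
To make the 'write hole' concrete, here is a minimal sketch; it is not MD code.
It assumes a toy stripe of two data blocks and one XOR parity block (RAID 4/5
style). The names ``xor_blocks``, ``disks`` and the one-entry ``journal`` are
hypothetical and exist only for illustration. The sketch shows that a crash
between the data write and the parity write leaves stale parity, so rebuilding
a lost disk returns wrong data, while replaying a journaled copy of the stripe
first restores consistency.

```python
def xor_blocks(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length blocks (toy RAID 4/5 parity)."""
    return bytes(x ^ y for x, y in zip(a, b))

# A consistent stripe: two data blocks plus their parity.
d0, d1 = b"\x11\x22", b"\x33\x44"
disks = {"d0": d0, "d1": d1, "p": xor_blocks(d0, d1)}

# New data for d0. The (hypothetical) journal records data and parity first.
new_d0 = b"\xaa\xbb"
journal = {"d0": new_d0, "p": xor_blocks(new_d0, d1)}

# Crash: the data write reached its disk, the parity write did not.
disks["d0"] = new_d0

# Without the journal: the disk holding d1 is lost and rebuilt from d0 and
# the stale parity.
rebuilt_without_journal = xor_blocks(disks["d0"], disks["p"])

# With the journal: replay the logged stripe, then rebuild.
disks.update(journal)
rebuilt_with_journal = xor_blocks(disks["d0"], disks["p"])

print(rebuilt_without_journal == d1)  # False: d1 was never written, yet it is lost
print(rebuilt_with_journal == d1)     # True
```

Note that the corrupted block in the first case is ``d1``, which the
interrupted write never touched. That is what makes the 'write hole' dangerous.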

In write-through mode, MD reports IO completion to the upper layer (usually a
filesystem) only after the data is safe on the RAID disks, so a cache disk
failure doesn't cause data loss. Of course, a cache disk failure means the
array is exposed to the 'write hole' again.

In write-through mode, the cache disk doesn't need to be big. A few hundred
megabytes is enough.

write-back mode
===============

Write-back mode fixes the 'write hole' issue too, since all write data is
cached on the cache disk. But the main goal of the write-back cache is to speed
up writes. If a write covers all RAID disks of a stripe, we call it a
full-stripe write. For non-full-stripe writes, MD must read old data before the
new parity can be calculated. These synchronous reads hurt write throughput.
Writes that are sequential but not dispatched at the same time suffer from this
overhead too. The write-back cache aggregates the data and flushes it to the
RAID disks only after the data becomes a full-stripe write. This completely
avoids the overhead, so it's very helpful for some workloads. A typical example
is a workload that does sequential writes followed by fsync.

In write-back mode, MD reports IO completion to the upper layer (usually a
filesystem) right after the data hits the cache disk. The data is flushed to
the RAID disks later, once specific conditions are met. So a cache disk failure
will cause data loss.

In write-back mode, MD also caches data in memory. The memory cache holds the
same data stored on the cache disk, so a power loss doesn't cause data loss.
The memory cache size affects the array's performance. A large size is
recommended. A user can configure the size by::

        echo "2048" > /sys/block/md0/md/stripe_cache_size

A cache disk that is too small makes write aggregation less efficient in this
mode, depending on the workload.
It's recommended to use a cache disk of at least several gigabytes in
write-back mode.

The implementation
==================

The write-through and write-back caches use the same disk format. The cache
disk is organized as a simple write log. The log consists of 'meta data' and
'data' pairs. The meta data describes the data. It also includes a checksum and
a sequence ID for recovery identification. Data can be IO data or parity data.
Data is checksummed too. The checksum is stored in the meta data ahead of the
data. The checksum is an optimization because it lets MD write meta data and
data freely without worrying about their order. The MD superblock has a field
pointing to the valid meta data at the log head.

The log implementation is pretty straightforward. The difficult part is the
order in which MD writes data to the cache disk and the RAID disks.
Specifically, in write-through mode, MD calculates parity for the IO data,
writes both IO data and parity to the log, writes the data and parity to the
RAID disks once both are safely in the log, and finally completes the IO. Reads
just go to the RAID disks as usual.

In write-back mode, MD writes IO data to the log and reports IO completion. The
data is also fully cached in memory at that time, which means reads must query
the memory cache. When certain conditions are met, MD flushes the data to the
RAID disks. MD calculates parity for the data and writes the parity to the log.
After this is finished, MD writes both data and parity to the RAID disks, and
then MD can release the memory cache. The flush conditions can be: the stripe
becomes a full-stripe write, free cache disk space is low, or free in-kernel
memory cache space is low.

After an unclean shutdown, MD does recovery. MD reads all meta data and data
from the log. The sequence ID and checksum help MD detect corrupted meta data
and data.
If MD finds a stripe with data and valid parities (1 parity for
RAID 4/5 and 2 for RAID 6), MD writes the data and parities to the RAID disks.
If the parities are incomplete, they are discarded. If part of the data is
corrupted, it is discarded too. MD then loads the valid data and writes it to
the RAID disks in the normal way.
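
The recovery rules above can be sketched as follows. This is an illustrative
model only, not the kernel's on-disk format or recovery code. The record layout
and the names ``make_record``, ``recover`` and ``PARITIES_NEEDED`` are
hypothetical, and ``zlib.crc32`` merely stands in for whatever checksum the
real log uses. The sketch also assumes that a stripe with any corrupted record
is never replayed with its logged parity.

```python
import zlib
from collections import defaultdict

# Parity blocks a stripe needs before it can be replayed as-is.
PARITIES_NEEDED = {"raid4": 1, "raid5": 1, "raid6": 2}

def make_record(seq, stripe, kind, payload):
    """Build a hypothetical log record; kind is "data" or "parity"."""
    return {"seq": seq, "stripe": stripe, "kind": kind,
            "payload": payload, "csum": zlib.crc32(payload)}

def recover(log, level, first_seq):
    """Classify logged stripes after an unclean shutdown.

    Returns (replay, rewrite):
      replay  -- stripe -> (data, parity) that can go straight to the disks
      rewrite -- stripe -> valid data that must take the normal write path
    """
    stripes = defaultdict(lambda: {"data": [], "parity": [], "damaged": False})
    expected = first_seq
    for rec in log:
        if rec["seq"] != expected:
            break  # sequence gap: everything from here on is stale
        expected += 1
        entry = stripes[rec["stripe"]]
        if zlib.crc32(rec["payload"]) != rec["csum"]:
            entry["damaged"] = True  # corrupted payload is discarded
            continue
        entry[rec["kind"]].append(rec["payload"])

    replay, rewrite = {}, {}
    for stripe, entry in stripes.items():
        if not entry["data"]:
            continue  # nothing valid to write for this stripe
        complete = len(entry["parity"]) == PARITIES_NEEDED[level]
        if complete and not entry["damaged"]:
            replay[stripe] = (entry["data"], entry["parity"])
        else:
            rewrite[stripe] = entry["data"]  # logged parity is dropped
    return replay, rewrite

log = [
    make_record(10, 0, "data", b"A"),
    make_record(11, 0, "parity", b"P"),
    make_record(12, 1, "data", b"B"),  # stripe 1: parity never reached the log
]
replay, rewrite = recover(log, "raid5", first_seq=10)
print(sorted(replay), sorted(rewrite))  # [0] [1]
```

In this example stripe 0 has data plus its single RAID 5 parity and is replayed
directly, while stripe 1 has data only and is written again through the normal
path, where parity is recomputed.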