#
2ded3703 |
| 17-Nov-2016 |
Song Liu <songliubraving@fb.com> |
md/r5cache: State machine for raid5-cache write back mode
This patch adds a state machine for raid5-cache. With a log device, the raid456 array can operate in two different modes (r5c_journal_mode):
- write-back (R5C_MODE_WRITE_BACK)
- write-through (R5C_MODE_WRITE_THROUGH)
Existing code of raid5-cache only has write-through mode. For write-back cache, it is necessary to extend the state machine.
With write-back cache, every stripe can operate in two different phases:
- caching
- writing-out
In the caching phase, the stripe handles writes as:
- write to journal
- return IO
In the writing-out phase, the stripe behaves as a stripe in write-through mode (R5C_MODE_WRITE_THROUGH).
STRIPE_R5C_CACHING is added to sh->state to differentiate caching and writing-out phase.
Please note: this is a "no-op" patch for raid5-cache write-through mode.
The following detailed explanation is copied from the raid5-cache.c:
/*
 * raid5 cache state machine
 *
 * With the RAID cache, each stripe works in two phases:
 *	- caching phase
 *	- writing-out phase
 *
 * These two phases are controlled by bit STRIPE_R5C_CACHING:
 *   if STRIPE_R5C_CACHING == 0, the stripe is in writing-out phase
 *   if STRIPE_R5C_CACHING == 1, the stripe is in caching phase
 *
 * When there is no journal, or the journal is in write-through mode,
 * the stripe is always in writing-out phase.
 *
 * For write-back journal, the stripe is sent to caching phase on write
 * (r5c_handle_stripe_dirtying). r5c_make_stripe_write_out() kicks off
 * the write-out phase by clearing STRIPE_R5C_CACHING.
 *
 * Stripes in caching phase do not write the raid disks. Instead, all
 * writes are committed from the log device. Therefore, a stripe in
 * caching phase handles writes as:
 *	- write to log device
 *	- return IO
 *
 * Stripes in writing-out phase handle writes as:
 *	- calculate parity
 *	- write pending data and parity to journal
 *	- write data and parity to raid disks
 *	- return IO for pending writes
 */
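The phase bit described in the comment above can be illustrated with a small user-space sketch. It is not the kernel code: the names `sketch_stripe`, `sketch_mode`, `SKETCH_CACHING_BIT` and the three functions are hypothetical stand-ins for `sh->state`, `r5c_journal_mode`, `STRIPE_R5C_CACHING`, `r5c_handle_stripe_dirtying` and `r5c_make_stripe_write_out()`, and the real functions check more conditions than shown here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the journal mode and the stripe state bit. */
enum sketch_mode { SKETCH_NO_JOURNAL, SKETCH_WRITE_THROUGH, SKETCH_WRITE_BACK };
#define SKETCH_CACHING_BIT (1u << 0)

struct sketch_stripe {
	unsigned int state;	/* bit set: caching phase, bit clear: writing-out phase */
};

/* On a write, only a write-back journal sends the stripe to the caching phase. */
void sketch_handle_write(struct sketch_stripe *sh, enum sketch_mode mode)
{
	if (mode == SKETCH_WRITE_BACK)
		sh->state |= SKETCH_CACHING_BIT;
}

/* Kick off the writing-out phase by clearing the caching bit. */
void sketch_make_write_out(struct sketch_stripe *sh)
{
	sh->state &= ~SKETCH_CACHING_BIT;
}

/* A stripe in the caching phase writes only to the log, never to the raid disks. */
bool sketch_writes_raid_disks(const struct sketch_stripe *sh)
{
	return (sh->state & SKETCH_CACHING_BIT) == 0;
}
```

With no journal or a write-through journal the bit is never set, so the stripe always takes the writing-out path, which matches the "no-op" note for write-through mode.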
Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>
|
#
c757ec95 |
| 17-Nov-2016 |
Song Liu <songliubraving@fb.com> |
md/r5cache: Check array size in r5l_init_log
Currently, r5l_write_stripe checks meta size for each stripe write, which is not necessary.
With this patch, r5l_init_log checks maximal meta size of the array, which is (r5l_meta_block + raid_disks x r5l_payload_data_parity). If this is too big to fit in one page, r5l_init_log aborts.
With the current meta data layout, r5l_log supports raid_disks up to 203.
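The limit of 203 can be reproduced with a short calculation. The sizes below are assumptions that do not appear in the commit message: a 4096-byte page, a 32-byte r5l_meta_block header and 20 bytes per r5l_payload_data_parity entry. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed sizes for illustration; none of them are stated in the commit message. */
#define SKETCH_PAGE_SIZE	4096u
#define SKETCH_META_BLOCK_SIZE	32u	/* assumed size of the meta block header */
#define SKETCH_PAYLOAD_SIZE	20u	/* assumed size of one data/parity payload */

/* Returns 0 if the maximal meta size for raid_disks fits in one page, -1 otherwise. */
int sketch_check_meta_size(unsigned int raid_disks)
{
	size_t need = SKETCH_META_BLOCK_SIZE +
		      (size_t)raid_disks * SKETCH_PAYLOAD_SIZE;

	return need <= SKETCH_PAGE_SIZE ? 0 : -1;
}

/* Largest raid_disks value that still passes the check above. */
unsigned int sketch_max_raid_disks(void)
{
	return (SKETCH_PAGE_SIZE - SKETCH_META_BLOCK_SIZE) / SKETCH_PAYLOAD_SIZE;
}
```

Under these assumptions (4096 - 32) / 20 rounds down to 203, which agrees with the figure in the commit message. Doing the check once at init time is what lets the per-stripe check in r5l_write_stripe go away.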
Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>
|
#
354b445b |
| 16-Nov-2016 |
Shaohua Li <shli@fb.com> |
raid5-cache: fix lockdep warning
lockdep reports a warning about the rcu_dereference usage. Use the normal rdev access pattern to avoid the warning.
Signed-off-by: Shaohua Li <shli@fb.com>
|
Revision tags: v4.4.32, v4.4.31 |
|
#
3fd880af |
| 02-Nov-2016 |
JackieLiu <liuyun01@kylinos.cn> |
raid5-cache: restrict the use area of the log_offset variable
We can calculate this offset from ctx->meta_total_blocks, without passing it in as a function argument.
Signed-off-by: JackieLiu <liuyun01@kylinos.cn> Signed-off-by: Shaohua Li <shli@fb.com>
|
#
70fd7614 |
| 01-Nov-2016 |
Christoph Hellwig <hch@lst.de> |
block,fs: use REQ_* flags directly
Remove the WRITE_* and READ_SYNC wrappers, and just use the flags directly. Where applicable this also drops usage of the bio_set_op_attrs wrapper.
Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
|
Revision tags: v4.4.30, v4.4.29, v4.4.28 |
|
#
9a8b27fa |
| 27-Oct-2016 |
Shaohua Li <shli@fb.com> |
raid5-cache: correct condition for empty metadata write
As long as we recover one metadata block, we should write the empty metadata block. The original code could corrupt recovery if only one meta block is valid.
Reported-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn> Signed-off-by: Shaohua Li <shli@fb.com>
|
#
56056c2e |
| 24-Oct-2016 |
Zhengyuan Liu <liuzhengyuan@kylinos.cn> |
md/raid5: write an empty meta-block when creating log super-block
If the superblock points to an invalid meta block, r5l_load_log sets create_super to true and creates a new superblock. This runtime path always happens if no write I/O has been issued to the array since it was created. Writing an empty meta block when the log superblock is first created avoids this unnecessary work.
Another reason is the correctness of log recovery. Currently the code below guarantees that log recovery is correct:
	if (ctx.seq > log->last_cp_seq + 1) {
		int ret;

		ret = r5l_log_write_empty_meta_block(log, ctx.pos, ctx.seq + 10);
		if (ret)
			return ret;
		log->seq = ctx.seq + 11;
		log->log_start = r5l_ring_add(log, ctx.pos, BLOCK_SECTORS);
		r5l_write_super(log, ctx.pos);
	} else {
		log->log_start = ctx.pos;
		log->seq = ctx.seq;
	}
Suppose we have just created an array with a journal device, so log->log_start and log->last_checkpoint are both 0. We then write three meta blocks, all valid except the middle one, and a crash happens. After recovery, ctx.seq equals log->last_cp_seq + 1 and log->log_start is set to the position of the invalid middle meta block. This leads to problems that this patch avoids.
Signed-off-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn> Signed-off-by: Shaohua Li <shli@fb.com>
|
#
28cd88e2 |
| 23-Oct-2016 |
Zhengyuan Liu <liuzhengyuan@kylinos.cn> |
md/raid5: initialize next_checkpoint field before use
This field was not initialized when we load/recover the log; it was assigned only when IO to the raid disks finished. So r5l_quiesce could use a wrong next_checkpoint to reclaim log space, which would confuse the reclaimable space calculation.
Signed-off-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn> Signed-off-by: Shaohua Li <shli@fb.com>
|
Revision tags: v4.4.27, v4.7.10, openbmc-4.4-20161021-1, v4.7.9, v4.4.26, v4.7.8, v4.4.25, v4.4.24, v4.7.7, v4.8, v4.4.23, v4.7.6, v4.7.5, v4.4.22, v4.4.21, v4.7.4, v4.7.3, v4.4.20 |
|
#
8e018c21 |
| 25-Aug-2016 |
Shaohua Li <shli@fb.com> |
raid5-cache: fix a deadlock in superblock write
There is a potential deadlock in superblock write. Discard could zero data, so before discard we must make sure the superblock is updated to the new log tail. Updating the superblock (either by calling md_update_sb() directly or by depending on the md thread) must hold reconfig_mutex. On the other hand, raid5_quiesce is called with reconfig_mutex held. The first step of raid5_quiesce() is waiting for all IO to finish, hence waiting for the reclaim thread, while the reclaim thread is calling this function and waiting for reconfig_mutex. So there is a deadlock. We work around this issue with a trylock. The downside of the solution is that we could miss a discard if we can't take reconfig_mutex. But this should happen rarely (mainly when the raid array is stopped), so a missed discard shouldn't be a big problem.
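The trylock workaround can be sketched in user-space C. This is an analogy rather than the kernel code: a C11 `atomic_flag` stands in for reconfig_mutex, and `sketch_trylock`, `sketch_unlock` and `sketch_write_super_and_discard` are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Stand-in for reconfig_mutex; a real mutex would be used in practice. */
atomic_flag sketch_reconfig_lock = ATOMIC_FLAG_INIT;

/* Returns true if the lock was free and is now held by the caller. */
bool sketch_trylock(void)
{
	return !atomic_flag_test_and_set(&sketch_reconfig_lock);
}

void sketch_unlock(void)
{
	atomic_flag_clear(&sketch_reconfig_lock);
}

/*
 * Reclaim-side helper: never blocks on the lock. If the lock is held
 * (for example by a quiesce that is itself waiting for reclaim), the
 * discard is skipped instead of deadlocking.
 * Returns true if the discard was issued, false if it was skipped.
 */
bool sketch_write_super_and_discard(void)
{
	if (!sketch_trylock())
		return false;
	/* ... update the superblock to the new log tail, then discard ... */
	sketch_unlock();
	return true;
}
```

The caller treats a `false` return as harmless: the log space is simply not discarded this time, which is the trade-off the commit message accepts.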
Cc: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
|
Revision tags: v4.7.2, v4.4.19, openbmc-4.4-20160819-1, v4.7.1, v4.4.18, v4.4.17 |
|
#
1eff9d32 |
| 05-Aug-2016 |
Jens Axboe <axboe@fb.com> |
block: rename bio bi_rw to bi_opf
Since commit 63a4cc24867d, bio->bi_rw contains flags in the lower portion and the op code in the higher portion. This means that old code that relies on manually setting bi_rw is most likely going to be broken. Instead of letting that brokenness linger, rename the member, to force old and out-of-tree code to break at compile time instead of at runtime.
No intended functional changes in this commit.
Signed-off-by: Jens Axboe <axboe@fb.com>
|
Revision tags: openbmc-4.4-20160804-1, v4.4.16, v4.7, openbmc-4.4-20160722-1, openbmc-20160722-1, openbmc-20160713-1, v4.4.15, v4.6.4, v4.6.3, v4.4.14, v4.6.2, v4.4.13, openbmc-20160606-1 |
|
#
28a8f0d3 |
| 05-Jun-2016 |
Mike Christie <mchristi@redhat.com> |
block, drivers, fs: rename REQ_FLUSH to REQ_PREFLUSH
To avoid confusion between REQ_OP_FLUSH, which is handled by request_fn drivers, and upper layers requesting the block layer perform a flush sequence along with possibly a WRITE, this patch renames REQ_FLUSH to REQ_PREFLUSH.
Signed-off-by: Mike Christie <mchristi@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
|
#
796a5cf0 |
| 05-Jun-2016 |
Mike Christie <mchristi@redhat.com> |
md: use bio op accessors
Separate the op from the rq_flag_bits and have md set/get the bio using bio_set_op_attrs/bio_op.
Signed-off-by: Mike Christie <mchristi@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
|
#
4e49ea4a |
| 05-Jun-2016 |
Mike Christie <mchristi@redhat.com> |
block/fs/drivers: remove rw argument from submit_bio
This has callers of submit_bio/submit_bio_wait set bio->bi_rw instead of passing it in. This makes their use consistent with generic_make_request and with how we set the other bio fields.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Fixed up fs/ext4/crypto.c
Signed-off-by: Jens Axboe <axboe@fb.com>
|
Revision tags: v4.6.1, v4.4.12, openbmc-20160521-1, v4.4.11, openbmc-20160518-1, v4.6, v4.4.10, openbmc-20160511-1, openbmc-20160505-1, v4.4.9 |
|
#
85ad1d13 |
| 03-May-2016 |
Guoqing Jiang <gqjiang@suse.com> |
md: set MD_CHANGE_PENDING in a atomic region
Some code waits for a metadata update by:
1. flagging that it is needed (MD_CHANGE_DEVS or MD_CHANGE_CLEAN)
2. setting MD_CHANGE_PENDING and waking the management thread
3. waiting for MD_CHANGE_PENDING to be cleared
If the first two are done without locking, the code in md_update_sb() which checks if it needs to repeat might test if an update is needed before step 1, then clear MD_CHANGE_PENDING after step 2, resulting in the wait returning early.
So make sure all places that set MD_CHANGE_PENDING do so atomically; bit_clear_unless (suggested by Neil) is introduced for this purpose.
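The idea behind a "clear these bits unless those bits are set" helper can be sketched with C11 atomics. This is a user-space illustration under assumptions, not the kernel's bit_clear_unless macro: the flag values and the names `sketch_request_update` and `sketch_clear_unless` are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical flag values, for illustration only. */
#define SKETCH_CHANGE_DEVS	(1ul << 0)
#define SKETCH_CHANGE_CLEAN	(1ul << 1)
#define SKETCH_CHANGE_PENDING	(1ul << 2)

/* Steps 1 and 2 as one atomic update: flag the change and mark it pending. */
void sketch_request_update(atomic_ulong *flags, unsigned long what)
{
	atomic_fetch_or(flags, what | SKETCH_CHANGE_PENDING);
}

/*
 * Atomically clear the bits in `clear` unless any bit in `test` is set.
 * Returns true if the bits were cleared, false if a `test` bit blocked it.
 */
bool sketch_clear_unless(atomic_ulong *flags, unsigned long clear,
			 unsigned long test)
{
	unsigned long old = atomic_load(flags);

	do {
		if (old & test)
			return false;	/* a new update was requested: keep PENDING */
	} while (!atomic_compare_exchange_weak(flags, &old, old & ~clear));

	return true;
}
```

Because the test and the clear happen in a single compare-and-exchange, the updater cannot clear the pending bit in the window between a waiter flagging a change and setting the pending bit, which is the early-return race described above.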
Cc: Martin Kepplinger <martink@posteo.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: <linux-kernel@vger.kernel.org> Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
|
Revision tags: v4.4.8 |
|
#
c888a8f9 |
| 13-Apr-2016 |
Jens Axboe <axboe@fb.com> |
block: kill off q->flush_flags
Now that we converted everything to the newer block write cache interface, kill off the queue flush_flags and queueable flush entries.
Signed-off-by: Jens Axboe <axboe@fb.com>
|
Revision tags: v4.4.7, openbmc-20160329-2, openbmc-20160329-1, openbmc-20160321-1, v4.4.6, v4.5, v4.4.5, v4.4.4, v4.4.3, openbmc-20160222-1, v4.4.2, openbmc-20160212-1, openbmc-20160210-1, openbmc-20160202-2, openbmc-20160202-1, v4.4.1, openbmc-20160127-1, openbmc-20160120-1, v4.4 |
|
#
16a43f6a |
| 06-Jan-2016 |
Shaohua Li <shli@fb.com> |
raid5-cache: handle journal hotadd in quiesce
Handle journal hotadd in quiesce to avoid creating duplicated threads.
Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
|
#
a62ab49e |
| 06-Jan-2016 |
Shaohua Li <shli@fb.com> |
md: set MD_HAS_JOURNAL in correct places
Set MD_HAS_JOURNAL when an array is loaded or the journal is initialized. This avoids setting the flag too early during journal disk hot-add.
Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
|
#
5036c390 |
| 20-Dec-2015 |
Christoph Hellwig <hch@lst.de> |
raid5: allow r5l_io_unit allocations to fail
And propagate the error up the stack so we can add the stripe to no_stripes_list and retry our log operation later. This avoids blocking raid5d due to reclaim, and it allows us to get rid of the deadlock-prone GFP_NOFAIL allocation.
shli: add missing mempool_destroy()
Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: NeilBrown <neilb@suse.com>
|
#
e8deb638 |
| 20-Dec-2015 |
Christoph Hellwig <hch@lst.de> |
raid5-cache: use a mempool for the metadata block
We only have a limited number in flight, so use a page based mempool.
Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: NeilBrown <neilb@suse.com>
|
#
c38d29b3 |
| 20-Dec-2015 |
Christoph Hellwig <hch@lst.de> |
raid5-cache: use a bio_set
This allows us to make guaranteed forward progress.
Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: NeilBrown <neilb@suse.com>
|
#
f6b6ec5c |
| 20-Dec-2015 |
Shaohua Li <shli@fb.com> |
raid5-cache: add journal hot add/remove support
Add support for journal disk hot add/remove. The md part is mostly trivial checks. The raid5 part is a little tricky. For hot-remove, we can't wait for pending writes as it's called from raid5d; the wait would cause a deadlock. We simply fail the hot-remove. A hot-remove retry can succeed eventually, since if the journal disk is faulty all pending writes will fail and finish. For hot-add, since an array that supports a journal but has no journal disk is marked read-only, it is safe to hot-add a journal without stopping IO (it should be read IO only, while the journal only handles write IO).
Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
|
#
ad66d445 |
| 20-Dec-2015 |
Christoph Hellwig <hch@lst.de> |
raid5-cache: free meta_page earlier
Once the I/O completed we don't need the meta page anymore. As the iounits can live on for a long time this reduces memory pressure a bit.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
|
#
3848c0bc |
| 20-Dec-2015 |
Christoph Hellwig <hch@lst.de> |
raid5-cache: simplify r5l_move_io_unit_list
It's only used for one kind of move, so make that explicit. Also clean up the code a bit by using list_for_each_safe.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
|
Revision tags: openbmc-20151217-1, openbmc-20151210-1, openbmc-20151202-1, openbmc-20151123-1, openbmc-20151118-1, openbmc-20151104-1, v4.3, openbmc-20151102-1, openbmc-20151028-1 |
|
#
7dde2ad3 |
| 08-Oct-2015 |
Shaohua Li <shli@fb.com> |
raid5-cache: start raid5 readonly if journal is missing
If a raid array is expected to have a journal (e.g., journal is set in the MD superblock feature map) and the array is started without a journal disk, start the array readonly.
Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
|
#
6e74a9cf |
| 08-Oct-2015 |
Shaohua Li <shli@fb.com> |
raid5-cache: IO error handling
There are 3 places where raid5-cache dispatches IO. The discard IO error doesn't matter, so we ignore it. The superblock write IO error can be handled in MD core. The remaining ones are log write and flush. When such an IO error happens, we mark the log disk faulty and fail all write IO. Read IO is still allowed to run. Userspace gets a notification too, and the corresponding daemon can, for example, choose to set the raid array readonly.
Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
|