History log of /openbmc/linux/drivers/md/raid1.c (Results 151 – 175 of 1017)
Revision (<<< Hide revision tags) (Show revision tags >>>) Date Author Comments
# 11353b9d 14-Mar-2017 Zhilong Liu <zlliu@suse.com>

md/raid1: fix a trivial typo in comments

raid1.c: fix a trivial typo in comments of freeze_array().

Cc: Jack Wang <jack.wang.usish@gmail.com>
Cc: Guoqing Jiang <gqjiang@suse.com>
Cc: John Stoffel <

md/raid1: fix a trivial typo in comments

raid1.c: fix a trivial typo in comments of freeze_array().

Cc: Jack Wang <jack.wang.usish@gmail.com>
Cc: Guoqing Jiang <gqjiang@suse.com>
Cc: John Stoffel <john@stoffel.org>
Acked-by: Coly Li <colyli@suse.de>
Signed-off-by: Zhilong Liu <zlliu@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>

show more ...


# 61eb2b43 28-Feb-2017 Shaohua Li <shli@fb.com>

md/raid1/10: fix potential deadlock

Neil Brown pointed out a potential deadlock in raid 10 code with
bio_split/chain. The raid1 code could have the same issue, but recent
barrier rework makes it les

md/raid1/10: fix potential deadlock

Neil Brown pointed out a potential deadlock in raid 10 code with
bio_split/chain. The raid1 code could have the same issue, but recent
barrier rework makes it less likely to happen. The deadlock happens in
below sequence:

1. generic_make_request(bio), this will set current->bio_list
2. raid10_make_request will split bio to bio1 and bio2
3. __make_request(bio1), wait_barrer, add underlayer disk bio to
current->bio_list
4. __make_request(bio2), wait_barrer

If raise_barrier happens between 3 & 4, since wait_barrier runs at 3,
raise_barrier waits for IO completion from 3. And since raise_barrier
sets barrier, 4 waits for raise_barrier. But IO from 3 can't be
dispatched because raid10_make_request() doesn't finished yet.

The solution is to adjust the IO ordering. Quotes from Neil:
"
It is much safer to:

if (need to split) {
split = bio_split(bio, ...)
bio_chain(...)
make_request_fn(split);
generic_make_request(bio);
} else
make_request_fn(mddev, bio);

This way we first process the initial section of the bio (in 'split')
which will queue some requests to the underlying devices. These
requests will be queued in generic_make_request.
Then we queue the remainder of the bio, which will be added to the end
of the generic_make_request queue.
Then we return.
generic_make_request() will pop the lower-level device requests off the
queue and handle them first. Then it will process the remainder
of the original bio once the first section has been fully processed.
"

Note, this only happens in read path. In write path, the bio is flushed to
underlaying disks either by blk flush (from schedule) or offladed to raid1/10d.
It's queued in current->bio_list.

Cc: Coly Li <colyli@suse.de>
Cc: stable@vger.kernel.org (v3.14+, only the raid10 part)
Suggested-by: NeilBrown <neilb@suse.com>
Reviewed-by: Jack Wang <jinpu.wang@profitbricks.com>
Signed-off-by: Shaohua Li <shli@fb.com>

show more ...


Revision tags: v4.10.1
# c9483634 23-Feb-2017 Guoqing Jiang <gqjiang@suse.com>

md: move funcs from pers->resize to update_size

raid1_resize and raid5_resize should also check the
mddev->queue if run underneath dm-raid.

And both set_capacity and revalidate_disk are used in
per

md: move funcs from pers->resize to update_size

raid1_resize and raid5_resize should also check the
mddev->queue if run underneath dm-raid.

And both set_capacity and revalidate_disk are used in
pers->resize such as raid1, raid10 and raid5. So
move them from personality file to common code.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>

show more ...


Revision tags: v4.10
# 3f07c014 08-Feb-2017 Ingo Molnar <mingo@kernel.org>

sched/headers: Prepare for new header dependencies before moving code to <linux/sched/signal.h>

We are going to split <linux/sched/signal.h> out of <linux/sched.h>, which
will have to be picked up f

sched/headers: Prepare for new header dependencies before moving code to <linux/sched/signal.h>

We are going to split <linux/sched/signal.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.

Create a trivial placeholder <linux/sched/signal.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.

Include the new header in the files that are going to need it.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>

show more ...


# 1ec49223 21-Feb-2017 Shaohua Li <shli@fb.com>

md/raid1: fix write behind issues introduced by bio_clone_bioset_partial

There are two issues, introduced by commit 8e58e32(md/raid1: use
bio_clone_bioset_partial() in case of write behind):
- bio_c

md/raid1: fix write behind issues introduced by bio_clone_bioset_partial

There are two issues, introduced by commit 8e58e32(md/raid1: use
bio_clone_bioset_partial() in case of write behind):
- bio_clone_bioset_partial() uses bytes instead of sectors as parameters
- in writebehind mode, we return bio if all !writemostly disk bios finish,
which could happen before writemostly disk bios run. So all
writemostly disk bios should have their bvec. Here we just make sure
all bios are cloned instead of fast cloned.

Reviewed-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Shaohua Li <shli@fb.com>

show more ...


# aff8da09 21-Feb-2017 Shaohua Li <shli@fb.com>

md/raid1: handle flush request correctly

I got a warning triggered in align_to_barrier_unit_end. It's a flush
request so sectors == 0. The flush request happens to work well without
the new barrier

md/raid1: handle flush request correctly

I got a warning triggered in align_to_barrier_unit_end. It's a flush
request so sectors == 0. The flush request happens to work well without
the new barrier patch, but we'd better handle it explictly.

Cc: NeilBrown <neilb@suse.com>
Acked-by: Coly Li <colyli@suse.de>
Signed-off-by: Shaohua Li <shli@fb.com>

show more ...


# af5f42a7 20-Feb-2017 Shaohua Li <shli@fb.com>

md/raid1: fix a use-after-free bug

Commit fd76863 (RAID1: a new I/O barrier implementation to remove resync
window) introduces a user-after-free bug.

Signed-off-by: Shaohua Li <shli@fb.com>


# 824e47da 17-Feb-2017 colyli@suse.de <colyli@suse.de>

RAID1: avoid unnecessary spin locks in I/O barrier code

When I run a parallel reading performan testing on a md raid1 device with
two NVMe SSDs, I observe very bad throughput in supprise: by fio wit

RAID1: avoid unnecessary spin locks in I/O barrier code

When I run a parallel reading performan testing on a md raid1 device with
two NVMe SSDs, I observe very bad throughput in supprise: by fio with 64KB
block size, 40 seq read I/O jobs, 128 iodepth, overall throughput is
only 2.7GB/s, this is around 50% of the idea performance number.

The perf reports locking contention happens at allow_barrier() and
wait_barrier() code,
- 41.41% fio [kernel.kallsyms] [k] _raw_spin_lock_irqsave
- _raw_spin_lock_irqsave
+ 89.92% allow_barrier
+ 9.34% __wake_up
- 37.30% fio [kernel.kallsyms] [k] _raw_spin_lock_irq
- _raw_spin_lock_irq
- 100.00% wait_barrier

The reason is, in these I/O barrier related functions,
- raise_barrier()
- lower_barrier()
- wait_barrier()
- allow_barrier()
They always hold conf->resync_lock firstly, even there are only regular
reading I/Os and no resync I/O at all. This is a huge performance penalty.

The solution is a lockless-like algorithm in I/O barrier code, and only
holding conf->resync_lock when it has to.

The original idea is from Hannes Reinecke, and Neil Brown provides
comments to improve it. I continue to work on it, and make the patch into
current form.

In the new simpler raid1 I/O barrier implementation, there are two
wait barrier functions,
- wait_barrier()
Which calls _wait_barrier(), is used for regular write I/O. If there is
resync I/O happening on the same I/O barrier bucket, or the whole
array is frozen, task will wait until no barrier on same barrier bucket,
or the whold array is unfreezed.
- wait_read_barrier()
Since regular read I/O won't interfere with resync I/O (read_balance()
will make sure only uptodate data will be read out), it is unnecessary
to wait for barrier in regular read I/Os, waiting in only necessary
when the whole array is frozen.

The operations on conf->nr_pending[idx], conf->nr_waiting[idx], conf->
barrier[idx] are very carefully designed in raise_barrier(),
lower_barrier(), _wait_barrier() and wait_read_barrier(), in order to
avoid unnecessary spin locks in these functions. Once conf->
nr_pengding[idx] is increased, a resync I/O with same barrier bucket index
has to wait in raise_barrier(). Then in _wait_barrier() if no barrier
raised in same barrier bucket index and array is not frozen, the regular
I/O doesn't need to hold conf->resync_lock, it can just increase
conf->nr_pending[idx], and return to its caller. wait_read_barrier() is
very similar to _wait_barrier(), the only difference is it only waits when
array is frozen. For heavy parallel reading I/Os, the lockless I/O barrier
code almostly gets rid of all spin lock cost.

This patch significantly improves raid1 reading peroformance. From my
testing, a raid1 device built by two NVMe SSD, runs fio with 64KB
blocksize, 40 seq read I/O jobs, 128 iodepth, overall throughput
increases from 2.7GB/s to 4.6GB/s (+70%).

Changelog
V4:
- Change conf->nr_queued[] to atomic_t.
- Define BARRIER_BUCKETS_NR_BITS by (PAGE_SHIFT - ilog2(sizeof(atomic_t)))
V3:
- Add smp_mb__after_atomic() as Shaohua and Neil suggested.
- Change conf->nr_queued[] from atomic_t to int.
- Change conf->array_frozen from atomic_t back to int, and use
READ_ONCE(conf->array_frozen) to check value of conf->array_frozen
in _wait_barrier() and wait_read_barrier().
- In _wait_barrier() and wait_read_barrier(), add a call to
wake_up(&conf->wait_barrier) after atomic_dec(&conf->nr_pending[idx]),
to fix a deadlock between _wait_barrier()/wait_read_barrier and
freeze_array().
V2:
- Remove a spin_lock/unlock pair in raid1d().
- Add more code comments to explain why there is no racy when checking two
atomic_t variables at same time.
V1:
- Original RFC patch for comments.

Signed-off-by: Coly Li <colyli@suse.de>
Cc: Shaohua Li <shli@fb.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Guoqing Jiang <gqjiang@suse.com>
Reviewed-by: Neil Brown <neilb@suse.de>
Signed-off-by: Shaohua Li <shli@fb.com>

show more ...


# fd76863e 17-Feb-2017 colyli@suse.de <colyli@suse.de>

RAID1: a new I/O barrier implementation to remove resync window

'Commit 79ef3a8aa1cb ("raid1: Rewrite the implementation of iobarrier.")'
introduces a sliding resync window for raid1 I/O barrier, th

RAID1: a new I/O barrier implementation to remove resync window

'Commit 79ef3a8aa1cb ("raid1: Rewrite the implementation of iobarrier.")'
introduces a sliding resync window for raid1 I/O barrier, this idea limits
I/O barriers to happen only inside a slidingresync window, for regular
I/Os out of this resync window they don't need to wait for barrier any
more. On large raid1 device, it helps a lot to improve parallel writing
I/O throughput when there are background resync I/Os performing at
same time.

The idea of sliding resync widow is awesome, but code complexity is a
challenge. Sliding resync window requires several variables to work
collectively, this is complexed and very hard to make it work correctly.
Just grep "Fixes: 79ef3a8aa1" in kernel git log, there are 8 more patches
to fix the original resync window patch. This is not the end, any further
related modification may easily introduce more regreassion.

Therefore I decide to implement a much simpler raid1 I/O barrier, by
removing resync window code, I believe life will be much easier.

The brief idea of the simpler barrier is,
- Do not maintain a global unique resync window
- Use multiple hash buckets to reduce I/O barrier conflicts, regular
I/O only has to wait for a resync I/O when both them have same barrier
bucket index, vice versa.
- I/O barrier can be reduced to an acceptable number if there are enough
barrier buckets

Here I explain how the barrier buckets are designed,
- BARRIER_UNIT_SECTOR_SIZE
The whole LBA address space of a raid1 device is divided into multiple
barrier units, by the size of BARRIER_UNIT_SECTOR_SIZE.
Bio requests won't go across border of barrier unit size, that means
maximum bio size is BARRIER_UNIT_SECTOR_SIZE<<9 (64MB) in bytes.
For random I/O 64MB is large enough for both read and write requests,
for sequential I/O considering underlying block layer may merge them
into larger requests, 64MB is still good enough.
Neil also points out that for resync operation, "we want the resync to
move from region to region fairly quickly so that the slowness caused
by having to synchronize with the resync is averaged out over a fairly
small time frame". For full speed resync, 64MB should take less then 1
second. When resync is competing with other I/O, it could take up a few
minutes. Therefore 64MB size is fairly good range for resync.

- BARRIER_BUCKETS_NR
There are BARRIER_BUCKETS_NR buckets in total, which is defined by,
#define BARRIER_BUCKETS_NR_BITS (PAGE_SHIFT - 2)
#define BARRIER_BUCKETS_NR (1<<BARRIER_BUCKETS_NR_BITS)
this patch makes the bellowed members of struct r1conf from integer
to array of integers,
- int nr_pending;
- int nr_waiting;
- int nr_queued;
- int barrier;
+ int *nr_pending;
+ int *nr_waiting;
+ int *nr_queued;
+ int *barrier;
number of the array elements is defined as BARRIER_BUCKETS_NR. For 4KB
kernel space page size, (PAGE_SHIFT - 2) indecates there are 1024 I/O
barrier buckets, and each array of integers occupies single memory page.
1024 means for a request which is smaller than the I/O barrier unit size
has ~0.1% chance to wait for resync to pause, which is quite a small
enough fraction. Also requesting single memory page is more friendly to
kernel page allocator than larger memory size.

- I/O barrier bucket is indexed by bio start sector
If multiple I/O requests hit different I/O barrier units, they only need
to compete I/O barrier with other I/Os which hit the same I/O barrier
bucket index with each other. The index of a barrier bucket which a
bio should look for is calculated by sector_to_idx() which is defined
in raid1.h as an inline function,
static inline int sector_to_idx(sector_t sector)
{
return hash_long(sector >> BARRIER_UNIT_SECTOR_BITS,
BARRIER_BUCKETS_NR_BITS);
}
Here sector_nr is the start sector number of a bio.

- Single bio won't go across boundary of a I/O barrier unit
If a request goes across boundary of barrier unit, it will be split. A
bio may be split in raid1_make_request() or raid1_sync_request(), if
sectors returned by align_to_barrier_unit_end() is smaller than
original bio size.

Comparing to single sliding resync window,
- Currently resync I/O grows linearly, therefore regular and resync I/O
will conflict within a single barrier units. So the I/O behavior is
similar to single sliding resync window.
- But a barrier unit bucket is shared by all barrier units with identical
barrier uinit index, the probability of conflict might be higher
than single sliding resync window, in condition that writing I/Os
always hit barrier units which have identical barrier bucket indexs with
the resync I/Os. This is a very rare condition in real I/O work loads,
I cannot imagine how it could happen in practice.
- Therefore we can achieve a good enough low conflict rate with much
simpler barrier algorithm and implementation.

There are two changes should be noticed,
- In raid1d(), I change the code to decrease conf->nr_pending[idx] into
single loop, it looks like this,
spin_lock_irqsave(&conf->device_lock, flags);
conf->nr_queued[idx]--;
spin_unlock_irqrestore(&conf->device_lock, flags);
This change generates more spin lock operations, but in next patch of
this patch set, it will be replaced by a single line code,
atomic_dec(&conf->nr_queueud[idx]);
So we don't need to worry about spin lock cost here.
- Mainline raid1 code split original raid1_make_request() into
raid1_read_request() and raid1_write_request(). If the original bio
goes across an I/O barrier unit size, this bio will be split before
calling raid1_read_request() or raid1_write_request(), this change
the code logic more simple and clear.
- In this patch wait_barrier() is moved from raid1_make_request() to
raid1_write_request(). In raid_read_request(), original wait_barrier()
is replaced by raid1_read_request().
The differnece is wait_read_barrier() only waits if array is frozen,
using different barrier function in different code path makes the code
more clean and easy to read.
Changelog
V4:
- Add alloc_r1bio() to remove redundant r1bio memory allocation code.
- Fix many typos in patch comments.
- Use (PAGE_SHIFT - ilog2(sizeof(int))) to define BARRIER_BUCKETS_NR_BITS.
V3:
- Rebase the patch against latest upstream kernel code.
- Many fixes by review comments from Neil,
- Back to use pointers to replace arraries in struct r1conf
- Remove total_barriers from struct r1conf
- Add more patch comments to explain how/why the values of
BARRIER_UNIT_SECTOR_SIZE and BARRIER_BUCKETS_NR are decided.
- Use get_unqueued_pending() to replace get_all_pendings() and
get_all_queued()
- Increase bucket number from 512 to 1024
- Change code comments format by review from Shaohua.
V2:
- Use bio_split() to split the orignal bio if it goes across barrier unit
bounday, to make the code more simple, by suggestion from Shaohua and
Neil.
- Use hash_long() to replace original linear hash, to avoid a possible
confilict between resync I/O and sequential write I/O, by suggestion from
Shaohua.
- Add conf->total_barriers to record barrier depth, which is used to
control number of parallel sync I/O barriers, by suggestion from Shaohua.
- In V1 patch the bellowed barrier buckets related members in r1conf are
allocated in memory page. To make the code more simple, V2 patch moves
the memory space into struct r1conf, like this,
- int nr_pending;
- int nr_waiting;
- int nr_queued;
- int barrier;
+ int nr_pending[BARRIER_BUCKETS_NR];
+ int nr_waiting[BARRIER_BUCKETS_NR];
+ int nr_queued[BARRIER_BUCKETS_NR];
+ int barrier[BARRIER_BUCKETS_NR];
This change is by the suggestion from Shaohua.
- Remove some inrelavent code comments, by suggestion from Guoqing.
- Add a missing wait_barrier() before jumping to retry_write, in
raid1_make_write_request().
V1:
- Original RFC patch for comments

Signed-off-by: Coly Li <colyli@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Guoqing Jiang <gqjiang@suse.com>
Reviewed-by: Neil Brown <neilb@suse.de>
Signed-off-by: Shaohua Li <shli@fb.com>

show more ...


# d7a10308 14-Feb-2017 Ming Lei <tom.leiming@gmail.com>

md: fast clone bio in bio_clone_mddev()

Firstly bio_clone_mddev() is used in raid normal I/O and isn't
in resync I/O path.

Secondly all the direct access to bvec table in raid happens on
resync I/O

md: fast clone bio in bio_clone_mddev()

Firstly bio_clone_mddev() is used in raid normal I/O and isn't
in resync I/O path.

Secondly all the direct access to bvec table in raid happens on
resync I/O except for write behind of raid1, in which we still
use bio_clone() for allocating new bvec table.

So this patch replaces bio_clone() with bio_clone_fast()
in bio_clone_mddev().

Also kill bio_clone_mddev() and call bio_clone_fast() directly, as
suggested by Christoph Hellwig.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Shaohua Li <shli@fb.com>

show more ...


# 8e58e327 14-Feb-2017 Ming Lei <tom.leiming@gmail.com>

md/raid1: use bio_clone_bioset_partial() in case of write behind

Write behind need to replace pages in bio's bvecs, and we have
to clone a fresh bio with new bvec table, so use the introduced
bio_cl

md/raid1: use bio_clone_bioset_partial() in case of write behind

Write behind need to replace pages in bio's bvecs, and we have
to clone a fresh bio with new bvec table, so use the introduced
bio_clone_bioset_partial() for it.

For other bio_clone_mddev() cases, we will use fast clone since
they don't need to touch bvec table.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Shaohua Li <shli@fb.com>

show more ...


# dc3b17cc 02-Feb-2017 Jan Kara <jack@suse.cz>

block: Use pointer to backing_dev_info from request_queue

We will want to have struct backing_dev_info allocated separately from
struct request_queue. As the first step add pointer to backing_dev_in

block: Use pointer to backing_dev_info from request_queue

We will want to have struct backing_dev_info allocated separately from
struct request_queue. As the first step add pointer to backing_dev_info
to request_queue and convert all users touching it. No functional
changes in this patch.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>

show more ...


# 309bd96a 25-Jan-2017 Christoph Hellwig <hch@lst.de>

md: cleanup bio op / flags handling in raid1_write_request

No need for the local variables, the bio is still live and we can just
assign the bits we want directly. Make me wonder why we can't assig

md: cleanup bio op / flags handling in raid1_write_request

No need for the local variables, the bio is still live and we can just
assign the bits we want directly. Make me wonder why we can't assign
all the bio flags to start with.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

show more ...


# 394ed8e4 04-Jan-2017 Shaohua Li <shli@fb.com>

md: cleanup mddev flag clear for takeover

Commit 6995f0b (md: takeover should clear unrelated bits) clear
unrelated bits, but it's quite fragile. To avoid error in the future,
define a macro for uns

md: cleanup mddev flag clear for takeover

Commit 6995f0b (md: takeover should clear unrelated bits) clear
unrelated bits, but it's quite fragile. To avoid error in the future,
define a macro for unsupported mddev flags for each raid type and use it
to clear unsupported mddev flags. This should be less error-prone.

Suggested-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>

show more ...


Revision tags: v4.9
# 3b046a97 05-Dec-2016 Robert LeBlanc <robert@leblancnet.us>

md/raid1: Refactor raid1_make_request

Refactor raid1_make_request to make read and write code in their own
functions to clean up the code.

Signed-off-by: Robert LeBlanc <robert@leblancnet.us>
Signe

md/raid1: Refactor raid1_make_request

Refactor raid1_make_request to make read and write code in their own
functions to clean up the code.

Signed-off-by: Robert LeBlanc <robert@leblancnet.us>
Signed-off-by: Shaohua Li <shli@fb.com>

show more ...


# 2953079c 08-Dec-2016 Shaohua Li <shli@fb.com>

md: separate flags for superblock changes

The mddev->flags are used for different purposes. There are a lot of
places we check/change the flags without masking unrelated flags, we
could check/change

md: separate flags for superblock changes

The mddev->flags are used for different purposes. There are a lot of
places we check/change the flags without masking unrelated flags, we
could check/change unrelated flags. These usage are most for superblock
write, so spearate superblock related flags. This should make the code
clearer and also fix real bugs.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>

show more ...


# 6995f0b2 08-Dec-2016 Shaohua Li <shli@fb.com>

md: takeover should clear unrelated bits

When we change level from raid1 to raid5, the MD_FAILFAST_SUPPORTED bit
will be accidentally set, but raid5 doesn't support it. The same is true
for the MD_H

md: takeover should clear unrelated bits

When we change level from raid1 to raid5, the MD_FAILFAST_SUPPORTED bit
will be accidentally set, but raid5 doesn't support it. The same is true
for the MD_HAS_JOURNAL bit.

Fix: 46533ff (md: Use REQ_FAILFAST_* on metadata writes where appropriate)
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>

show more ...


Revision tags: openbmc-4.4-20161121-1, v4.4.33
# 212e7eb7 17-Nov-2016 NeilBrown <neilb@suse.com>

md/raid1: add failfast handling for writes.

When writing to a fastfail device we use MD_FASTFAIL unless
it is the only device being written to.

For resync/recovery, assume there was a working devic

md/raid1: add failfast handling for writes.

When writing to a fastfail device we use MD_FASTFAIL unless
it is the only device being written to.

For resync/recovery, assume there was a working device to
read from so always use REQ_FASTFAIL_DEV.

If a write for resync/recovery fails, we just fail the
device - there is not much else to do.

If a normal failfast write fails, but the device cannot be
failed (must be only one left), we queue for write error
handling. This will call narrow_write_error() to retry the
write synchronously and without any FAILFAST flags.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>

show more ...


# 2e52d449 17-Nov-2016 NeilBrown <neilb@suse.com>

md/raid1: add failfast handling for reads.

If a device is marked FailFast and it is not the only device
we can read from, we mark the bio with REQ_FAILFAST_* flags.

If this does fail, we don't try

md/raid1: add failfast handling for reads.

If a device is marked FailFast and it is not the only device
we can read from, we mark the bio with REQ_FAILFAST_* flags.

If this does fail, we don't try read repair but just allow
failure. If it was the last device it doesn't fail of
course, so the retry happens on the same device - this time
without FAILFAST. A subsequent failure will not retry but
will just pass up the error.

During resync we may use FAILFAST requests and on a failure
we will simply use the other device(s).

During recovery we will only use FAILFAST in the unusual
case were there are multiple places to read from - i.e. if
there are > 2 devices. If we get a failure we will fail the
device and complete the resync/recovery with remaining
devices.

The new R1BIO_FailFast flag is set on read reqest to suggest
the a FAILFAST request might be acceptable. The rdev needs
to have FailFast set as well for the read to actually use
REQ_FAILFAST_*.

We need to know there are at least two working devices
before we can set R1BIO_FailFast, so we mustn't stop looking
at the first device we find. So the "min_pending == 0"
handling to not exit early, but too always choose the
best_pending_disk if min_pending == 0.

The spinlocked region in raid1_error() in enlarged to ensure
that if two bios, reading from two different devices, fail
at the same time, then there is no risk that both devices
will be marked faulty, leaving zero "In_sync" devices.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>

show more ...


# 46533ff7 17-Nov-2016 NeilBrown <neilb@suse.com>

md: Use REQ_FAILFAST_* on metadata writes where appropriate

This can only be supported on personalities which ensure
that md_error() never causes an array to enter the 'failed'
state. i.e. if marki

md: Use REQ_FAILFAST_* on metadata writes where appropriate

This can only be supported on personalities which ensure
that md_error() never causes an array to enter the 'failed'
state. i.e. if marking a device Faulty would cause some
data to be inaccessible, the device is status is left as
non-Faulty. This is true for RAID1 and RAID10.

If we get a failure writing metadata but the device doesn't
fail, it must be the last device so we re-write without
FAILFAST to improve chance of success. We also flag the
device as LastDev so that future metadata updates don't
waste time on failfast writes.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>

show more ...


Revision tags: v4.4.32
# 578b54ad 13-Nov-2016 NeilBrown <neilb@suse.com>

md/raid1, raid10: add blktrace records when IO is delayed

Both raid1 and raid10 will sometimes delay handling an IO request,
such as when resync is happening or there are too many requests queued.

md/raid1, raid10: add blktrace records when IO is delayed

Both raid1 and raid10 will sometimes delay handling an IO request,
such as when resync is happening or there are too many requests queued.

Add some blktrace messsages so we can see when that is happening when
looking for performance artefacts.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>

show more ...


# 109e3765 17-Nov-2016 NeilBrown <neilb@suse.com>

md: add block tracing for bio_remapping

The block tracing infrastructure (accessed with blktrace/blkparse)
supports the tracing of mapping bios from one device to another.
This is currently used whe

md: add block tracing for bio_remapping

The block tracing infrastructure (accessed with blktrace/blkparse)
supports the tracing of mapping bios from one device to another.
This is currently used when a bio in a partition is mapped to the
whole device, when bios are mapped by dm, and for mapping in md/raid5.
Other md personalities do not include this tracing yet, so add it.

When a read-error is detected we redirect the request to a different device.
This could justifiably be seen as a new mapping for the originial bio,
or a secondary mapping for the bio that errors. This patch uses
the second option.

When md is used under dm-raid, the mappings are not traced as we do
not have access to the block device number of the parent.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>

show more ...


Revision tags: v4.4.31
# f2c771a6 08-Nov-2016 NeilBrown <neilb@suse.com>

md/raid1: fix: IO can block resync indefinitely

While performing a resync/recovery, raid1 divides the
array space into three regions:
- before the resync
- at or shortly after the resync point
-

md/raid1: fix: IO can block resync indefinitely

While performing a resync/recovery, raid1 divides the
array space into three regions:
- before the resync
- at or shortly after the resync point
- much further ahead of the resync point.

Write requests to the first or third do not need to wait. Write
requests to the middle region do need to wait if resync requests are
pending.

If there are any active write requests in the middle region, resync
will wait for them.

Due to an accounting error, there is a small range of addresses,
between conf->next_resync and conf->start_next_window, where write
requests will *not* be blocked, but *will* be counted in the middle
region. This can effectively block resync indefinitely if filesystem
writes happen repeatedly to this region.

As ->next_window_requests is incremented when the sector is after
conf->start_next_window + NEXT_NORMALIO_DISTANCE
the same boundary should be used for determining when write requests
should wait.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>

show more ...


# 5e2c7a36 04-Nov-2016 NeilBrown <neilb@suse.com>

md/raid1: abort delayed writes when device fails.

When writing to an array with a bitmap enabled, the writes are grouped
in batches which are preceded by an update to the bitmap.

It is quite likely

md/raid1: abort delayed writes when device fails.

When writing to an array with a bitmap enabled, the writes are grouped
in batches which are preceded by an update to the bitmap.

It is quite likely if that a drive develops a problem which is not
media related, that the bitmap write will be the first to report an
error and cause the device to be marked faulty (as the bitmap write is
at the start of a batch).

In this case, there is point submiting the subsequent writes to the
failed device - that just wastes times.

So re-check the Faulty state of a device before submitting a
delayed write.

This requires that we keep the 'rdev', rather than the 'bdev' in the
bio, then swap in the bdev just before final submission.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>

show more ...


# 1d41c216 01-Nov-2016 NeilBrown <neilb@suse.com>

md/raid1: change printk() to pr_*()

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>


12345678910>>...41