- 29 7月, 2015 1 次提交
-
-
由 Christoph Hellwig 提交于
Currently we have two different ways to signal an I/O error on a BIO: (1) by clearing the BIO_UPTODATE flag (2) by returning a Linux errno value to the bi_end_io callback The first one has the drawback of only communicating a single possible error (-EIO), and the second one has the drawback of not beeing persistent when bios are queued up, and are not passed along from child to parent bio in the ever more popular chaining scenario. Having both mechanisms available has the additional drawback of utterly confusing driver authors and introducing bugs where various I/O submitters only deal with one of them, and the others have to add boilerplate code to deal with both kinds of error returns. So add a new bi_error field to store an errno value directly in struct bio and remove the existing mechanisms to clean all this up. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NHannes Reinecke <hare@suse.de> Reviewed-by: NNeilBrown <neilb@suse.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
- 02 6月, 2015 1 次提交
-
-
由 Tejun Heo 提交于
Currently, a bdi (backing_dev_info) embeds single wb (bdi_writeback) and the role of the separation is unclear. For cgroup support for writeback IOs, a bdi will be updated to host multiple wb's where each wb serves writeback IOs of a different cgroup on the bdi. To achieve that, a wb should carry all states necessary for servicing writeback IOs for a cgroup independently. This patch moves bdi->state into wb. * enum bdi_state is renamed to wb_state and the prefix of all enums is changed from BDI_ to WB_. * Explicit zeroing of bdi->state is removed without adding zeoring of wb->state as the whole data structure is zeroed on init anyway. * As there's still only one bdi_writeback per backing_dev_info, all uses of bdi->state are mechanically replaced with bdi->wb.state introducing no behavior changes. Signed-off-by: NTejun Heo <tj@kernel.org> Reviewed-by: NJan Kara <jack@suse.cz> Cc: Jens Axboe <axboe@kernel.dk> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: drbd-dev@lists.linbit.com Cc: Neil Brown <neilb@suse.de> Cc: Alasdair Kergon <agk@redhat.com> Cc: Mike Snitzer <snitzer@redhat.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
- 22 4月, 2015 1 次提交
-
-
由 NeilBrown 提交于
This option is not well justified and testing suggests that it hardly ever makes any difference. The comment suggests there might be a need to wait for non-resync activity indicated by ->nr_waiting, however raise_barrier() already waits for all of that. So just remove it to simplify reasoning about speed limiting. This allows us to remove a 'FIXME' comment from raid5.c as that never used the flag. Signed-off-by: NNeilBrown <neilb@suse.de>
-
- 25 2月, 2015 1 次提交
-
-
由 Tomáš Hodek 提交于
When a drive is marked write-mostly it should only be the target of reads if there is no other option. This behaviour was broken by commit 9dedf603 md/raid1: read balance chooses idlest disk for SSD which causes a write-mostly device to be *preferred* is some cases. Restore correct behaviour by checking and setting best_dist_disk and best_pending_disk rather than best_disk. We only need to test one of these as they are both changed from -1 or >=0 at the same time. As we leave min_pending and best_dist unchanged, any non-write-mostly device will appear better than the write-mostly device. Reported-by: NTomáš Hodek <tomas.hodek@volny.cz> Reported-by: NDark Penguin <darkpenguin@yandex.ru> Signed-off-by: NNeilBrown <neilb@suse.de> Link: http://marc.info/?l=linux-raid&m=135982797322422 Fixes: 9dedf603 Cc: stable@vger.kernel.org (3.6+)
-
- 23 2月, 2015 3 次提交
-
-
由 Goldwyn Rodrigues 提交于
Algorithm: 1. Node 1 issues mdadm --manage /dev/mdX --add /dev/sdYY which issues ioctl(ADD_NEW_DISC with disc.state set to MD_DISK_CLUSTER_ADD) 2. Node 1 sends NEWDISK with uuid and slot number 3. Other nodes issue kobject_uevent_env with uuid and slot number (Steps 4,5 could be a udev rule) 4. In userspace, the node searches for the disk, perhaps using blkid -t SUB_UUID="" 5. Other nodes issue either of the following depending on whether the disk was found: ioctl(ADD_NEW_DISK with disc.state set to MD_DISK_CANDIDATE and disc.number set to slot number) ioctl(CLUSTERED_DISK_NACK) 6. Other nodes drop lock on no-new-devs (CR) if device is found 7. Node 1 attempts EX lock on no-new-devs 8. If node 1 gets the lock, it sends METADATA_UPDATED after unmarking the disk as SpareLocal 9. If not (get no-new-dev lock), it fails the operation and sends METADATA_UPDATED 10. Other nodes understand if the device is added or not by reading the superblock again after receiving the METADATA_UPDATED message. Signed-off-by: NLidong Zhong <lzhong@suse.com> Signed-off-by: NGoldwyn Rodrigues <rgoldwyn@suse.com>
-
由 Goldwyn Rodrigues 提交于
set choose_first true for cluster read in read balance when the area is resyncing. Signed-off-by: NLidong Zhong <lzhong@suse.com> Signed-off-by: NGoldwyn Rodrigues <rgoldwyn@suse.com>
-
由 Goldwyn Rodrigues 提交于
If there is a resync going on, all nodes must suspend writes to the range. This is recorded in the suspend_info/suspend_list. If there is an I/O within the ranges of any of the suspend_info, should_suspend will return 1. Signed-off-by: NGoldwyn Rodrigues <rgoldwyn@suse.com>
-
- 16 2月, 2015 1 次提交
-
-
由 Nate Dailey 提交于
This modifies raid1's narrow_write_error to round up block_sectors to the device's logical block size. This prevents sd complaining about "Bad block number requested" for non-512-byte sector disks. Signed-off-by: NNate Dailey <nate.dailey@stratus.com> Signed-off-by: NNeilBrown <neilb@suse.de>
-
- 04 2月, 2015 4 次提交
-
-
由 NeilBrown 提交于
Now that the ->stop function only frees the private data, rename is accordingly. Also pass in the private pointer as an arg rather than using mddev->private. This flexibility will be useful in level_store(). Finally, don't clear ->private. It doesn't make sense to clear it seeing that isn't what we free, and it is no longer necessary to clear ->private (it was some time ago before ->to_remove was introduced). Setting ->to_remove in ->free() is a bit of a wart, but not a big problem at the moment. Signed-off-by: NNeilBrown <neilb@suse.de>
-
由 NeilBrown 提交于
Each md personality has a 'stop' operation which does two things: 1/ it finalizes some aspects of the array to ensure nothing is accessing the ->private data 2/ it frees the ->private data. All the steps in '1' can apply to all arrays and so can be performed in common code. This is useful as in the case where we change the personality which manages an array (in level_store()), it would be helpful to do step 1 early, and step 2 later. So split the 'step 1' functionality out into a new mddev_detach(). Signed-off-by: NNeilBrown <neilb@suse.de>
-
由 NeilBrown 提交于
There is no locking around calls to merge_bvec_fn(), so it is possible that calls which coincide with a level (or personality) change could go wrong. So create a central dispatch point for these functions and use rcu_read_lock(). If the array is suspended, reject any merge that can be rejected. If not, we know it is safe to call the function. Signed-off-by: NNeilBrown <neilb@suse.de>
-
由 NeilBrown 提交于
There is currently no locking around calls to the 'congested' bdi function. If called at an awkward time while an array is being converted from one level (or personality) to another, there is a tiny chance of running code in an unreferenced module etc. So add a 'congested' function to the md_personality operations structure, and call it with appropriate locking from a central 'mddev_congested'. When the array personality is changing the array will be 'suspended' so no IO is processed. If mddev_congested detects this, it simply reports that the array is congested, which is a safe guess. As mddev_suspend calls synchronize_rcu(), mddev_congested can avoid races by included the whole call inside an rcu_read_lock() region. This require that the congested functions for all subordinate devices can be run under rcu_lock. Fortunately this is the case. Signed-off-by: NNeilBrown <neilb@suse.de>
-
- 14 10月, 2014 2 次提交
-
-
由 NeilBrown 提交于
My editor shows much of this is RED. Signed-off-by: NNeilBrown <neilb@suse.de>
-
由 NeilBrown 提交于
process_checks() always returns '0', so change it to 'void'. Signed-off-by: NNeilBrown <neilb@suse.de>
-
- 09 10月, 2014 2 次提交
-
-
由 NeilBrown 提交于
Using {set,clear}_bit is more consistent than shifting and masking. No functional change. Signed-off-by: NNeilBrown <neilb@suse.de>
-
由 NeilBrown 提交于
Signed-off-by: NNeilBrown <neilb@suse.de>
-
- 22 9月, 2014 8 次提交
-
-
由 NeilBrown 提交于
If a devices is being recovered it is not InSync and is not Faulty. If a read error is experienced on that device, fix_read_error() will be called, but it ignores non-InSync devices. So it will neither fix the error nor fail the device. It is incorrect that fix_read_error() ignores non-InSync devices. It should only ignore Faulty devices. So fix it. This became a bug when we allowed reading from a device that was being recovered. It is suitable for any subsequent -stable kernel. Fixes: da8840a7 Cc: stable@vger.kernel.org (v3.5+) Reported-by: NAlexander Lyakas <alex.bolshoy@gmail.com> Tested-by: NAlexander Lyakas <alex.bolshoy@gmail.com> Signed-off-by: NNeilBrown <neilb@suse.de>
-
由 NeilBrown 提交于
Both normal IO and resync IO can be retried with reschedule_retry() and so be counted into ->nr_queued, but only normal IO gets counted in ->nr_pending. Before the recent improvement to RAID1 resync there could only possibly have been one or the other on the queue. When handling a read failure it could only be normal IO. So when handle_read_error() called freeze_array() the fact that freeze_array only compares ->nr_queued against ->nr_pending was safe. But now that these two types can interleave, we can have both normal and resync IO requests queued, so we need to count them both in nr_pending. This error can lead to freeze_array() hanging if there is a read error, so it is suitable for -stable. Fixes: 79ef3a8a cc: stable@vger.kernel.org (v3.13+) Reported-by: NBrassow Jonathan <jbrassow@redhat.com> Signed-off-by: NNeilBrown <neilb@suse.de>
-
由 NeilBrown 提交于
raise_barrier() uses next_resync as part of its calculations, so it really should be updated first, instead of afterwards. next_resync is always used under resync_lock so update it under resync lock to, just before it is used. That is safest. This could cause normal IO and resync IO to interact badly so it suitable for -stable. Fixes: 79ef3a8a cc: stable@vger.kernel.org (v3.13+) Signed-off-by: NNeilBrown <neilb@suse.de>
-
由 NeilBrown 提交于
next_resync is (approximately) the location for the next resync request. However it does *not* reliably determine the earliest location at which resync might be happening. This is because resync requests can complete out of order, and we only limit the number of current requests, not the distance from the earliest pending request to the latest. mddev->curr_resync_completed is a reliable indicator of the earliest position at which resync could be happening. It is updated less frequently, but is actually reliable which is more important. So use it to determine if a write request is before the region being resynced and so safe from conflict. This error can allow resync IO to interfere with normal IO which could lead to data corruption. Hence: stable. Fixes: 79ef3a8a cc: stable@vger.kernel.org (v3.13+) Signed-off-by: NNeilBrown <neilb@suse.de>
-
由 NeilBrown 提交于
The resync/recovery process for raid1 was recently changed so that writes could happen in parallel with resync providing they were in different regions of the device. There is a problem though: While a write request will always wait for conflicting resync to complete, a resync request will *not* always wait for conflicting writes to complete. Two changes are needed to fix this: 1/ raise_barrier (which waits until it is safe to do resync) must wait until current_window_requests is zero 2/ wait_battier (which waits at the start of a new write request) must update current_window_requests if the request could possible conflict with a concurrent resync. As concurrent writes and resync can lead to data loss, this patch is suitable for -stable. Fixes: 79ef3a8a Cc: stable@vger.kernel.org (v3.13+) Cc: majianpeng <majianpeng@gmail.com> Signed-off-by: NNeilBrown <neilb@suse.de>
-
由 NeilBrown 提交于
If there are outstanding writes when close_sync is called, the change to ->start_next_window might cause them to decrement the wrong counter when they complete. Fix this by merging the two counters into the one that will be decremented. Having an incorrect value in a counter can cause raise_barrier() to hangs, so this is suitable for -stable. Fixes: 79ef3a8a cc: stable@vger.kernel.org (v3.13+) Signed-off-by: NNeilBrown <neilb@suse.de>
-
由 NeilBrown 提交于
commit 79ef3a8a made it possible for reads to happen concurrently with resync. This means that we need to be more careful where read_balancing is allowed during resync - we can no longer be sure that any resync that has already started will definitely finish. So keep read_balancing to before recovery_cp, which is conservative but safe. This bug makes it possible to read from a device that doesn't have up-to-date data, so it can cause data corruption. So it is suitable for any kernel since 3.11. Fixes: 79ef3a8a cc: stable@vger.kernel.org (v3.13+) Signed-off-by: NNeilBrown <neilb@suse.de>
-
由 NeilBrown 提交于
r1_bio->start_next_window is not initialised in the READ case, so allow_barrier may incorrectly decrement conf->current_window_requests which can cause raise_barrier() to block forever. Fixes: 79ef3a8a cc: stable@vger.kernel.org (v3.13+) Reported-by: NBrassow Jonathan <jbrassow@redhat.com> Signed-off-by: NNeilBrown <neilb@suse.de>
-
- 31 7月, 2014 1 次提交
-
-
由 NeilBrown 提交于
Currently we don't abort recovery on a write error if the write error to the recovering device was triggerd by normal IO (as opposed to recovery IO). This means that for one bitmap region, the recovery might write to the recovering device for a few sectors, then not bother for subsequent sectors (as it never writes to failed devices). In this case the bitmap bit will be cleared, but it really shouldn't. The result is that if the recovering device fails and is then re-added (after fixing whatever hardware problem triggerred the failure), the second recovery won't redo the region it was in the middle of, so some of the device will not be recovered properly. If we abort the recovery, the region being processes will be cancelled (bit not cleared) and the whole region will be retried. As the bug can result in data corruption the patch is suitable for -stable. For kernels prior to 3.11 there is a conflict in raid10.c which will require care. Original-from: jiao hui <jiaohui@bwstor.com.cn> Reported-and-tested-by: Njiao hui <jiaohui@bwstor.com.cn> Signed-off-by: NNeilBrown <neilb@suse.de> Cc: stable@vger.kernel.org
-
- 09 4月, 2014 1 次提交
-
-
由 NeilBrown 提交于
When performing a user-request check/repair (MD_RECOVERY_REQUEST is set) on a raid1, we allocate multiple bios each with their own set of pages. If the page allocations for one bio fails, we currently do *not* free the pages allocated for the previous bios, nor do we free the bio itself. This patch frees all the already-allocate pages, and makes sure that all the bios are freed as well. This bug can cause a memory leak which can ultimately OOM a machine. It was introduced in 3.10-rc1. Fixes: a0787606 Cc: Kent Overstreet <koverstreet@google.com> Cc: stable@vger.kernel.org (3.10+) Reported-by: NRussell King - ARM Linux <linux@arm.linux.org.uk> Signed-off-by: NNeilBrown <neilb@suse.de>
-
- 05 2月, 2014 1 次提交
-
-
由 NeilBrown 提交于
commit 30bc9b53 md/raid1: fix bio handling problems in process_checks() Move the bio_reset() to a point before where BIO_UPTODATE is checked, so that check now always report that the bio is uptodate, even if it is not. This causes process_check() to sometimes treat read-errors as successful matches so the good data isn't written out. This patch preserves the flag until it is needed. Bug was introduced in 3.11, but backported to 3.10-stable (as it fixed an even worse bug). So suitable for any -stable since 3.10. Reported-and-tested-by: NMichael Tokarev <mjt@tls.msk.ru> Cc: stable@vger.kernel.org (3.10+) Fixed: 30bc9b53Signed-off-by: NNeilBrown <neilb@suse.de>
-
- 14 1月, 2014 1 次提交
-
-
由 NeilBrown 提交于
The new iobarrier implementation in raid1 (which keeps normal writes and resync activity separate) counts every request what is not before the current resync point in either next_window_requests or current_window_requests. It flags that the request is counted by setting ->start_next_window. allow_barrier follows this model exactly and decrements one of the *_window_requests if and only if ->start_next_window is set. However wait_barrier(), which increments *_window_requests uses a slightly different test for setting -.start_next_window (which is set from the return value of this function). So there is a possibility of the counts getting out of sync, and this leads to the resync hanging. So change wait_barrier() to return a non-zero value in exactly the same cases that it increments *_window_requests. But was introduced in 3.13-rc1. Reported-by: NBruno Wolff III <bruno@wolff.to> URL: https://bugzilla.kernel.org/show_bug.cgi?id=68061 Fixes: 79ef3a8a Cc: majianpeng <majianpeng@gmail.com> Signed-off-by: NNeilBrown <neilb@suse.de>
-
- 24 11月, 2013 1 次提交
-
-
由 Kent Overstreet 提交于
Immutable biovecs are going to require an explicit iterator. To implement immutable bvecs, a later patch is going to add a bi_bvec_done member to this struct; for now, this patch effectively just renames things. Signed-off-by: NKent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: "Ed L. Cashin" <ecashin@coraid.com> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Lars Ellenberg <drbd-dev@lists.linbit.com> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Matthew Wilcox <willy@linux.intel.com> Cc: Geoff Levand <geoff@infradead.org> Cc: Yehuda Sadeh <yehuda@inktank.com> Cc: Sage Weil <sage@inktank.com> Cc: Alex Elder <elder@inktank.com> Cc: ceph-devel@vger.kernel.org Cc: Joshua Morris <josh.h.morris@us.ibm.com> Cc: Philip Kelleher <pjk1939@linux.vnet.ibm.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Cc: Neil Brown <neilb@suse.de> Cc: Alasdair Kergon <agk@redhat.com> Cc: Mike Snitzer <snitzer@redhat.com> Cc: dm-devel@redhat.com Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: linux390@de.ibm.com Cc: Boaz Harrosh <bharrosh@panasas.com> Cc: Benny Halevy <bhalevy@tonian.com> Cc: "James E.J. Bottomley" <JBottomley@parallels.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Nicholas A. Bellinger" <nab@linux-iscsi.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Chris Mason <chris.mason@fusionio.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: Jaegeuk Kim <jaegeuk.kim@samsung.com> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Dave Kleikamp <shaggy@kernel.org> Cc: Joern Engel <joern@logfs.org> Cc: Prasad Joshi <prasadjoshi.linux@gmail.com> Cc: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Ben Myers <bpm@sgi.com> Cc: xfs@oss.sgi.com Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Len Brown <len.brown@intel.com> Cc: Pavel Machek <pavel@ucw.cz> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Herton Ronaldo Krzesinski <herton.krzesinski@canonical.com> Cc: Ben Hutchings <ben@decadent.org.uk> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Guo Chao <yan@linux.vnet.ibm.com> Cc: Tejun Heo <tj@kernel.org> Cc: Asai Thambi S P <asamymuthupa@micron.com> Cc: Selvan Mani <smani@micron.com> Cc: Sam Bradshaw <sbradshaw@micron.com> Cc: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Cc: "Roger Pau Monné" <roger.pau@citrix.com> Cc: Jan Beulich <jbeulich@suse.com> Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Cc: Ian Campbell <Ian.Campbell@citrix.com> Cc: Sebastian Ott <sebott@linux.vnet.ibm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Jiang Liu <jiang.liu@huawei.com> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Jerome Marchand <jmarchand@redhat.com> Cc: Joe Perches <joe@perches.com> Cc: Peng Tao <tao.peng@emc.com> Cc: Andy Adamson <andros@netapp.com> Cc: fanchaoting <fanchaoting@cn.fujitsu.com> Cc: Jie Liu <jeff.liu@oracle.com> Cc: Sunil Mushran <sunil.mushran@gmail.com> Cc: "Martin K. Petersen" <martin.petersen@oracle.com> Cc: Namjae Jeon <namjae.jeon@samsung.com> Cc: Pankaj Kumar <pankaj.km@samsung.com> Cc: Dan Magenheimer <dan.magenheimer@oracle.com> Cc: Mel Gorman <mgorman@suse.de>6
-
- 19 11月, 2013 4 次提交
-
-
由 majianpeng 提交于
There is an iobarrier in raid1 because of contention between normal IO and resync IO. It suspends all normal IO when resync/recovery happens. However if normal IO is out side the resync window, there is no contention. So this patch changes the barrier mechanism to only block IO that could contend with the resync that is currently happening. We partition the whole space into five parts. |---------|-----------|------------|----------------|-------| start next_resync start_next_window end_window start + RESYNC_WINDOW = next_resync next_resync + NEXT_NORMALIO_DISTANCE = start_next_window start_next_window + NEXT_NORMALIO_DISTANCE = end_window Firstly we introduce some concepts: 1 - RESYNC_WINDOW: For resync, there are 32 resync requests at most at the same time. A sync request is RESYNC_BLOCK_SIZE(64*1024). So the RESYNC_WINDOW is 32 * RESYNC_BLOCK_SIZE, that is 2MB. 2 - NEXT_NORMALIO_DISTANCE: the distance between next_resync and start_next_window. It also indicates the distance between start_next_window and end_window. It is currently 3 * RESYNC_WINDOW_SIZE but could be tuned if this turned out not to be optimal. 3 - next_resync: the next sector at which we will do sync IO. 4 - start: a position which is at most RESYNC_WINDOW before next_resync. 5 - start_next_window: a position which is NEXT_NORMALIO_DISTANCE beyond next_resync. Normal-io after this position doesn't need to wait for resync-io to complete. 6 - end_window: a position which is 2 * NEXT_NORMALIO_DISTANCE beyond next_resync. This also doesn't need to wait, but is counted differently. 7 - current_window_requests: the count of normalIO between start_next_window and end_window. 8 - next_window_requests: the count of normalIO after end_window. NormalIO will be partitioned into four types: NormIO1: the end sector of bio is smaller or equal the start NormIO2: the start sector of bio larger or equal to end_window NormIO3: the start sector of bio larger or equal to start_next_window. NormIO4: the location between start_next_window and end_window |--------|-----------|--------------------|----------------|-------------| | start | next_resync | start_next_window | end_window | NormIO1 NormIO4 NormIO4 NormIO3 NormIO2 For NormIO1, we don't need any io barrier. For NormIO4, we used a similar approach to the original iobarrier mechanism. The normalIO and resyncIO must be kept separate. For NormIO2/3, we add two fields to struct r1conf: "current_window_requests" and "next_window_requests". They indicate the count of active requests in the two window. For these, we don't wait for resync io to complete. For resync action, if there are NormIO4s, we must wait for it. If not, we can proceed. But if resync action reaches start_next_window and current_window_requests > 0 (that is there are NormIO3s), we must wait until the current_window_requests becomes zero. When current_window_requests becomes zero, start_next_window also moves forward. Then current_window_requests will replaced by next_window_requests. There is a problem which when and how to change from NormIO2 to NormIO3. Only then can sync action progress. We add a field in struct r1conf "start_next_window". A: if start_next_window == MaxSector, it means there are no NormIO2/3. So start_next_window = next_resync + NEXT_NORMALIO_DISTANCE B: if current_window_requests == 0 && next_window_requests != 0, it means start_next_window move to end_window There is another problem which how to differentiate between old NormIO2(now it is NormIO3) and NormIO2. For example, there are many bios which are NormIO2 and a bio which is NormIO3. NormIO3 firstly completed, so the bios of NormIO2 became NormIO3. We add a field in struct r1bio "start_next_window". This is used to record the position conf->start_next_window when the call to wait_barrier() is made in make_request(). In allow_barrier(), we check the conf->start_next_window. If r1bio->stat_next_window == conf->start_next_window, it means there is no transition between NormIO2 and NormIO3. If r1bio->start_next_window != conf->start_next_window, it mean there was a transition between NormIO2 and NormIO3. There can only have been one transition. So it only means the bio is old NormIO2. For one bio, there may be many r1bio's. So we make sure all the r1bio->start_next_window are the same value. If we met blocked_dev in make_request(), it must call allow_barrier and wait_barrier. So the former and the later value of conf->start_next_window will be change. If there are many r1bio's with differnet start_next_window, for the relevant bio, it depend on the last value of r1bio. It will cause error. To avoid this, we must wait for previous r1bios to complete. Signed-off-by: NJianpeng Ma <majianpeng@gmail.com> Signed-off-by: NNeilBrown <neilb@suse.de>
-
由 majianpeng 提交于
In a subsequent patch, we'll use some const parameters. Using macros will make the code clearly. Signed-off-by: NJianpeng Ma <majianpeng@gmail.com> Signed-off-by: NNeilBrown <neilb@suse.de>
-
由 majianpeng 提交于
raid1: Replace raise_barrier/lower_barrier with freeze_array/unfreeze_array when reconfiguring the array. We used to use raise_barrier to suspend normal IO while we reconfigure the array. However raise_barrier will soon only suspend some normal IO, not all. So we need something else. Change it to use freeze_array. But freeze_array not only suspends normal io, it also suspends resync io. For the place where call raise_barrier for reconfigure, it isn't a problem. Signed-off-by: NJianpeng Ma <majianpeng@gmail.com> Signed-off-by: NNeilBrown <neilb@suse.de>
-
由 majianpeng 提交于
Because the following patch will rewrite the content between normal IO and resync IO. So we used a parameter to indicate whether raid is in freeze array. Signed-off-by: NJianpeng Ma <majianpeng@gmail.com> Signed-off-by: NNeilBrown <neilb@suse.de>
-
- 09 11月, 2013 1 次提交
-
-
由 Kent Overstreet 提交于
Someone cut and pasted md's md_trim_bio() into xen-blkfront.c. Come on, we should know better than this. Signed-off-by: NKent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Neil Brown <neilb@suse.de> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 24 10月, 2013 1 次提交
-
-
由 Lukasz Dorau 提交于
Since: commit 7ceb17e8 md: Allow devices to be re-added to a read-only array. spares are activated on a read-only array. In case of raid1 and raid10 personalities it causes that not-in-sync devices are marked in-sync without checking if recovery has been finished. If a read-only array is degraded and one of its devices is not in-sync (because the array has been only partially recovered) recovery will be skipped. This patch adds checking if recovery has been finished before marking a device in-sync for raid1 and raid10 personalities. In case of raid5 personality such condition is already present (at raid5.c:6029). Bug was introduced in 3.10 and causes data corruption. Cc: stable@vger.kernel.org Signed-off-by: NPawel Baldysiak <pawel.baldysiak@intel.com> Signed-off-by: NLukasz Dorau <lukasz.dorau@intel.com> Signed-off-by: NNeilBrown <neilb@suse.de>
-
- 18 7月, 2013 1 次提交
-
-
由 NeilBrown 提交于
Recent change to use bio_copy_data() in raid1 when repairing an array is faulty. The underlying may have changed the bio in various ways using bio_advance and these need to be undone not just for the 'sbio' which is being copied to, but also the 'pbio' (primary) which is being copied from. So perform the reset on all bios that were read from and do it early. This also ensure that the sbio->bi_io_vec[j].bv_len passed to memcmp is correct. This fixes a crash during a 'check' of a RAID1 array. The crash was introduced in 3.10 so this is suitable for 3.10-stable. Cc: stable@vger.kernel.org (3.10) Reported-by: NJoe Lawrence <joe.lawrence@stratus.com> Signed-off-by: NNeilBrown <neilb@suse.de>
-
- 14 6月, 2013 1 次提交
-
-
由 Jonathan Brassow 提交于
DM RAID: Add ability to restore transiently failed devices on resume This patch adds code to the resume function to check over the devices in the RAID array. If any are found to be marked as failed and their superblocks can be read, an attempt is made to reintegrate them into the array. This allows the user to refresh the array with a simple suspend and resume of the array - rather than having to load a completely new table, allocate and initialize all the structures and throw away the old instantiation. Signed-off-by: NJonathan Brassow <jbrassow@redhat.com> Signed-off-by: NNeilBrown <neilb@suse.de>
-
- 13 6月, 2013 3 次提交
-
-
由 H. Peter Anvin 提交于
There are cases where the kernel will believe that the WRITE SAME command is supported by a block device which does not, in fact, support WRITE SAME. This currently happens for SATA drivers behind a SAS controller, but there are probably a hundred other ways that can happen, including drive firmware bugs. After receiving an error for WRITE SAME the block layer will retry the request as a plain write of zeroes, but mdraid will consider the failure as fatal and consider the drive failed. This has the effect that all the mirrors containing a specific set of data are each offlined in very rapid succession resulting in data loss. However, just bouncing the request back up to the block layer isn't ideal either, because the whole initial request-retry sequence should be inside the write bitmap fence, which probably means that md needs to do its own conversion of WRITE SAME to write zero. Until the failure scenario has been sorted out, disable WRITE SAME for raid1, raid5, and raid10. [neilb: added raid5] This patch is appropriate for any -stable since 3.7 when write_same support was added. Cc: stable@vger.kernel.org Signed-off-by: NH. Peter Anvin <hpa@linux.intel.com> Signed-off-by: NNeilBrown <neilb@suse.de>
-
由 NeilBrown 提交于
Various places in raid1 and raid10 are calling raise_barrier when they really should call freeze_array. The former is only intended to be called from "make_request". The later has extra checks for 'nr_queued' and makes a call to flush_pending_writes(), so it is safe to call it from within the management thread. Using raise_barrier will sometimes deadlock. Using freeze_array should not. As 'freeze_array' currently expects one request to be pending (in handle_read_error - the only previous caller), we need to pass it the number of pending requests (extra) to ignore. The deadlock was made particularly noticeable by commits 050b6615 (raid10) and 6b740b8d (raid1) which appeared in 3.4, so the fix is appropriate for any -stable kernel since then. This patch probably won't apply directly to some early kernels and will need to be applied by hand. Cc: stable@vger.kernel.org Reported-by: NAlexander Lyakas <alex.bolshoy@gmail.com> Signed-off-by: NNeilBrown <neilb@suse.de>
-
由 Alex Lyakas 提交于
md/raid1: consider WRITE as successful only if at least one non-Faulty and non-rebuilding drive completed it. Without that fix, the following scenario could happen: - RAID1 with drives A and B; drive B was freshly-added and is rebuilding - Drive A fails - WRITE request arrives to the array. It is failed by drive A, so r1_bio is marked as R1BIO_WriteError, but the rebuilding drive B succeeds in writing it, so the same r1_bio is marked as R1BIO_Uptodate. - r1_bio arrives to handle_write_finished, badblocks are disabled, md_error()->error() does nothing because we don't fail the last drive of raid1 - raid_end_bio_io() calls call_bio_endio() - As a result, in call_bio_endio(): if (!test_bit(R1BIO_Uptodate, &r1_bio->state)) clear_bit(BIO_UPTODATE, &bio->bi_flags); this code doesn't clear the BIO_UPTODATE flag, and the whole master WRITE succeeds, back to the upper layer. So we returned success to the upper layer, even though we had written the data onto the rebuilding drive only. But when we want to read the data back, we would not read from the rebuilding drive, so this data is lost. [neilb - applied identical change to raid10 as well] This bug can result in lost data, so it is suitable for any -stable kernel. Cc: stable@vger.kernel.org Signed-off-by: NAlex Lyakas <alex@zadarastorage.com> Signed-off-by: NNeilBrown <neilb@suse.de>
-