1. 06 1月, 2016 2 次提交
  2. 01 11月, 2015 10 次提交
  3. 31 10月, 2015 1 次提交
    • R
      md/raid5: fix locking in handle_stripe_clean_event() · b8a9d66d
      Roman Gushchin 提交于
      After commit 566c09c5 ("raid5: relieve lock contention in get_active_stripe()")
      __find_stripe() is called under conf->hash_locks + hash.
      But handle_stripe_clean_event() calls remove_hash() under
      conf->device_lock.
      
      Under some cirscumstances the hash chain can be circuited,
      and we get an infinite loop with disabled interrupts and locked hash
      lock in __find_stripe(). This leads to hard lockup on multiple CPUs
      and following system crash.
      
      I was able to reproduce this behavior on raid6 over 6 ssd disks.
      The devices_handle_discard_safely option should be set to enable trim
      support. The following script was used:
      
      for i in `seq 1 32`; do
          dd if=/dev/zero of=large$i bs=10M count=100 &
      done
      
      neilb: original was against a 3.x kernel.  I forward-ported
        to 4.3-rc.  This verison is suitable for any kernel since
        Commit: 59fc630b ("RAID5: batch adjacent full stripe write")
        (v4.1+).  I'll post a version for earlier kernels to stable.
      Signed-off-by: NRoman Gushchin <klamm@yandex-team.ru>
      Fixes: 566c09c5 ("raid5: relieve lock contention in get_active_stripe()")
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: <stable@vger.kernel.org> # 3.13 - 4.2
      b8a9d66d
  4. 24 10月, 2015 4 次提交
    • S
      raid5: log reclaim support · 0576b1c6
      Shaohua Li 提交于
      This is the reclaim support for raid5 log. A stripe write will have
      following steps:
      
      1. reconstruct the stripe, read data/calculate parity. ops_run_io
      prepares to write data/parity to raid disks
      2. hijack ops_run_io. stripe data/parity is appending to log disk
      3. flush log disk cache
      4. ops_run_io run again and do normal operation. stripe data/parity is
      written in raid array disks. raid core can return io to upper layer.
      5. flush cache of all raid array disks
      6. update super block
      7. log disk space used by the stripe can be reused
      
      In practice, several stripes consist of an io_unit and we will batch
      several io_unit in different steps, but the whole process doesn't
      change.
      
      It's possible io return just after data/parity hit log disk, but then
      read IO will need read from log disk. For simplicity, IO return happens
      at step 4, where read IO can directly read from raid disks.
      
      Currently reclaim run if there is specific reclaimable space (1/4 disk
      size or 10G) or we are out of space. Reclaim is just to free log disk
      spaces, it doesn't impact data consistency. The size based force reclaim
      is to make sure log isn't too big, so recovery doesn't scan log too
      much.
      
      Recovery make sure raid disks and log disk have the same data of a
      stripe. If crash happens before 4, recovery might/might not recovery
      stripe's data/parity depending on if data/parity and its checksum
      matches. In either case, this doesn't change the syntax of an IO write.
      After step 3, stripe is guaranteed recoverable, because stripe's
      data/parity is persistent in log disk. In some cases, log disk content
      and raid disks content of a stripe are the same, but recovery will still
      copy log disk content to raid disks, this doesn't impact data
      consistency. space reuse happens after superblock update and cache
      flush.
      
      There is one situation we want to avoid. A broken meta in the middle of
      a log causes recovery can't find meta at the head of log. If operations
      require meta at the head persistent in log, we must make sure meta
      before it persistent in log too. The case is stripe data/parity is in
      log and we start write stripe to raid disks (before step 4). stripe
      data/parity must be persistent in log before we do the write to raid
      disks. The solution is we restrictly maintain io_unit list order. In
      this case, we only write stripes of an io_unit to raid disks till the
      io_unit is the first one whose data/parity is in log.
      
      The io_unit list order is important for other cases too. For example,
      some io_unit are reclaimable and others not. They can be mixed in the
      list, we shouldn't reuse space of an unreclaimable io_unit.
      
      Includes fixes to problems which were...
      Reported-by: Nkbuild test robot <fengguang.wu@intel.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      Signed-off-by: NNeilBrown <neilb@suse.com>
      0576b1c6
    • S
      raid5: add basic stripe log · f6bed0ef
      Shaohua Li 提交于
      This introduces a simple log for raid5. Data/parity writing to raid
      array first writes to the log, then write to raid array disks. If
      crash happens, we can recovery data from the log. This can speed up
      raid resync and fix write hole issue.
      
      The log structure is pretty simple. Data/meta data is stored in block
      unit, which is 4k generally. It has only one type of meta data block.
      The meta data block can track 3 types of data, stripe data, stripe
      parity and flush block. MD superblock will point to the last valid
      meta data block. Each meta data block has checksum/seq number, so
      recovery can scan the log correctly. We store a checksum of stripe
      data/parity to the metadata block, so meta data and stripe data/parity
      can be written to log disk together. otherwise, meta data write must
      wait till stripe data/parity is finished.
      
      For stripe data, meta data block will record stripe data sector and
      size. Currently the size is always 4k. This meta data record can be made
      simpler if we just fix write hole (eg, we can record data of a stripe's
      different disks together), but this format can be extended to support
      caching in the future, which must record data address/size.
      
      For stripe parity, meta data block will record stripe sector. It's
      size should be 4k (for raid5) or 8k (for raid6). We always store p
      parity first. This format should work for caching too.
      
      flush block indicates a stripe is in raid array disks. Fixing write
      hole doesn't need this type of meta data, it's for caching extension.
      Signed-off-by: NShaohua Li <shli@fb.com>
      Signed-off-by: NNeilBrown <neilb@suse.com>
      f6bed0ef
    • S
      raid5: add a new state for stripe log handling · b70abcb2
      Shaohua Li 提交于
      When a stripe finishes construction, we write the stripe to raid in
      ops_run_io normally. With log, we do a bunch of other operations before
      the stripe is written to raid. Mainly write the stripe to log disk,
      flush disk cache and so on. The operations are still driven by raid5d
      and run in the stripe state machine. We introduce a new state for such
      stripe (trapped into log). The stripe is in this state from the time it
      first enters ops_run_io (finish construction) to the time it is written
      to raid. Since we know the state is only for log, we bypass other
      check/operation in handle_stripe.
      Signed-off-by: NShaohua Li <shli@fb.com>
      Signed-off-by: NNeilBrown <neilb@suse.com>
      b70abcb2
    • S
      raid5: export some functions · 6d036f7d
      Shaohua Li 提交于
      Next several patches use some raid5 functions, rename them with raid5
      prefix and export out.
      Signed-off-by: NShaohua Li <shli@fb.com>
      Signed-off-by: NNeilBrown <neilb@suse.com>
      6d036f7d
  5. 12 10月, 2015 1 次提交
    • G
      md-cluster: Use a small window for resync · c40f341f
      Goldwyn Rodrigues 提交于
      Suspending the entire device for resync could take too long. Resync
      in small chunks.
      
      cluster's resync window (32M) is maintained in r1conf as
      cluster_sync_low and cluster_sync_high and processed in
      raid1's sync_request(). If the current resync is outside the cluster
      resync window:
      
      1. Set the cluster_sync_low to curr_resync_completed.
      2. Check if the sync will fit in the new window, if not issue a
         wait_barrier() and set cluster_sync_low to sector_nr.
      3. Set cluster_sync_high to cluster_sync_low + resync_window.
      4. Send a message to all nodes so they may add it in their suspension
         list.
      
      bitmap_cond_end_sync is modified to allow to force a sync inorder
      to get the curr_resync_completed uptodate with the sector passed.
      Signed-off-by: NGoldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      c40f341f
  6. 02 10月, 2015 3 次提交
  7. 01 9月, 2015 9 次提交
    • N
      md/raid5: ensure device failure recorded before write request returns. · c3cce6cd
      NeilBrown 提交于
      When a write to one of the devices of a RAID5/6 fails, the failure is
      recorded in the metadata of the other devices so that after a restart
      the data on the failed drive wont be trusted even if that drive seems
      to be working again (maybe a cable was unplugged).
      
      Similarly when we record a bad-block in response to a write failure,
      we must not let the write complete until the bad-block update is safe.
      
      Currently there is no interlock between the write request completing
      and the metadata update.  So it is possible that the write will
      complete, the app will confirm success in some way, and then the
      machine will crash before the metadata update completes.
      
      This is an extremely small hole for a racy to fit in, but it is
      theoretically possible and so should be closed.
      
      So:
       - set MD_CHANGE_PENDING when requesting a metadata update for a
         failed device, so we can know with certainty when it completes
       - queue requests that completed when MD_CHANGE_PENDING is set to
         only be processed after the metadata update completes
       - call raid_end_bio_io() on bios in that queue when the time comes.
      Signed-off-by: NNeilBrown <neilb@suse.com>
      c3cce6cd
    • N
      md/raid5: use bio_list for the list of bios to return. · 34a6f80e
      NeilBrown 提交于
      This will make it easier to splice two lists together which will
      be needed in future patch.
      Signed-off-by: NNeilBrown <neilb@suse.com>
      34a6f80e
    • N
      md/raid5: handle possible race as reshape completes. · 6cbd8148
      NeilBrown 提交于
      It is possible (though unlikely) for a reshape to be
      interrupted between the time that end_reshape is called
      and the time when raid5_finish_reshape is called.
      
      This can leave conf->reshape_progress set to MaxSector,
      but mddev->reshape_position not.
      
      This combination confused reshape_request() when ->reshape_backwards.
      As conf->reshape_progress is so high, it seems the reshape hasn't
      really begun.  But assuming MaxSector is a valid address only
      leads to sorrow.
      
      So ensure reshape_position and reshape_progress both agree,
      and add an extra check in reshape_request() just in case they don't.
      Signed-off-by: NNeilBrown <neilb@suse.com>
      6cbd8148
    • N
      md: be careful when testing resync_max against curr_resync_completed. · c5e19d90
      NeilBrown 提交于
      While it generally shouldn't happen, it is not impossible for
      curr_resync_completed to exceed resync_max.
      This can particularly happen when reshaping RAID5 - the current
      status isn't copied to curr_resync_completed promptly, so when it
      is, it can exceed resync_max.
      This happens when the reshape is 'frozen', resync_max is set low,
      and reshape is re-enabled.
      
      Taking a difference between two unsigned numbers is always dangerous
      anyway, so add a test to behave correctly if
         curr_resync_completed > resync_max
      Signed-off-by: NNeilBrown <neilb@suse.com>
      c5e19d90
    • N
      md/raid5: remove incorrect "min_t()" when calculating writepos. · c74c0d76
      NeilBrown 提交于
      This code is calculating:
        writepos, which is the furthest along address (device-space) that we
           *will* be writing to
        readpos, which is the earliest address that we *could* possible read
           from, and
        safepos, which is the earliest address in the 'old' section that we
           might read from after a crash when the reshape position is
           recovered from metadata.
      
        The first is a precise calculation, so clipping at zero doesn't
        make sense.  As the reshape position is now guaranteed to always be
        a multiple of reshape_sectors and as we already BUG_ON when
        reshape_progress is zero, there is no point in this min_t() call.
      
        The readpos and safepos are worst case - actual value depends on
        precise geometry.  That worst case could be negative, which is only
        a problem because we are storing the value in an unsigned.
        So leave the min_t() for those.
      Signed-off-by: NNeilBrown <neilb@suse.com>
      
      c74c0d76
    • N
      md/raid5: strengthen check on reshape_position at run. · 05256d98
      NeilBrown 提交于
      When reshaping, we work in units of the largest chunk size.
      If changing from a larger to a smaller chunk size, that means we
      reshape more than one stripe at a time.  So the required alignment
      of reshape_position needs to take into account both the old
      and new chunk size.
      
      This means that both 'here_new' and 'here_old' are calculated with
      respect to the same (maximum) chunk size, so testing if they are the
      same when delta_disks is zero becomes pointless.
      Signed-off-by: NNeilBrown <neilb@suse.com>
      05256d98
    • N
      md/raid5: switch to use conf->chunk_sectors in place of mddev->chunk_sectors where possible · 3cb5edf4
      NeilBrown 提交于
      The chunk_sectors and new_chunk_sectors fields of mddev can be changed
      any time (via sysfs) that the reconfig mutex can be taken.  So raid5
      keeps internal copies in 'conf' which are stable except for a short
      locked moment when reshape stops/starts.
      
      So any access that does not hold reconfig_mutex should use the 'conf'
      values, not the 'mddev' values.
      Several don't.
      
      This could result in corruption if new values were written at awkward
      times.
      
      Also use min() or max() rather than open-coding.
      Signed-off-by: NNeilBrown <neilb@suse.com>
      3cb5edf4
    • N
      md/raid5: always set conf->prev_chunk_sectors and ->prev_algo · 5cac6bcb
      NeilBrown 提交于
      These aren't really needed when no reshape is happening,
      but it is safer to have them always set to a meaningful value.
      The next patch will use ->prev_chunk_sectors without checking
      if a reshape is happening (because that makes the code simpler),
      and this patch makes that safe.
      Signed-off-by: NNeilBrown <neilb@suse.com>
      5cac6bcb
    • N
      md/raid5: consider updating reshape_position at start of reshape. · 92140480
      NeilBrown 提交于
      md/raid5 only updates ->reshape_position (which is stored in
      metadata and is authoritative) occasionally, but particularly
      when getting closed to ->resync_max as it must be correct
      when ->resync_max is reached.
      
      When mdadm tries to stop an array which is reshaping it will:
       - freeze the reshape,
       - set resync_max to where the reshape has reached.
       - unfreeze the reshape.
      When this happens, the reshape is aborted and then restarted.
      
      The restart doesn't check that resync_max is close, and so doesn't
      update ->reshape_position like it should.
      This results in the reshape stopping, but ->reshape_position being
      incorrect.
      
      So on that first call to reshape_request, make sure ->reshape_position
      is updated if needed.
      Signed-off-by: NNeilBrown <neilb@suse.com>
      92140480
  8. 14 8月, 2015 3 次提交
    • K
      block: kill merge_bvec_fn() completely · 8ae12666
      Kent Overstreet 提交于
      As generic_make_request() is now able to handle arbitrarily sized bios,
      it's no longer necessary for each individual block driver to define its
      own ->merge_bvec_fn() callback. Remove every invocation completely.
      
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Lars Ellenberg <drbd-dev@lists.linbit.com>
      Cc: drbd-user@lists.linbit.com
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Yehuda Sadeh <yehuda@inktank.com>
      Cc: Sage Weil <sage@inktank.com>
      Cc: Alex Elder <elder@kernel.org>
      Cc: ceph-devel@vger.kernel.org
      Cc: Alasdair Kergon <agk@redhat.com>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Cc: dm-devel@redhat.com
      Cc: Neil Brown <neilb@suse.de>
      Cc: linux-raid@vger.kernel.org
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
      Acked-by: NeilBrown <neilb@suse.de> (for the 'md' bits)
      Acked-by: NMike Snitzer <snitzer@redhat.com>
      Signed-off-by: NKent Overstreet <kent.overstreet@gmail.com>
      [dpark: also remove ->merge_bvec_fn() in dm-thin as well as
       dm-era-target, and resolve merge conflicts]
      Signed-off-by: NDongsu Park <dpark@posteo.net>
      Signed-off-by: NMing Lin <ming.l@ssi.samsung.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      8ae12666
    • K
      md/raid5: get rid of bio_fits_rdev() · 7140aafc
      Kent Overstreet 提交于
      Remove bio_fits_rdev() as sufficient merge_bvec_fn() handling is now
      performed by blk_queue_split() in md_make_request().
      
      Cc: Neil Brown <neilb@suse.de>
      Cc: linux-raid@vger.kernel.org
      Acked-by: NNeilBrown <neilb@suse.de>
      Signed-off-by: NKent Overstreet <kent.overstreet@gmail.com>
      [dpark: add more description in commit message]
      Signed-off-by: NDongsu Park <dpark@posteo.net>
      Signed-off-by: NMing Lin <ming.l@ssi.samsung.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      7140aafc
    • M
      md/raid5: split bio for chunk_aligned_read · 7ef6b12a
      Ming Lin 提交于
      If a read request fits entirely in a chunk, it will be passed directly to the
      underlying device (providing it hasn't failed of course).  If it doesn't fit,
      the slightly less efficient path that uses the stripe_cache is used.
      Requests that get to the stripe cache are always completely split up as
      necessary.
      
      So with RAID5, ripping out the merge_bvec_fn doesn't cause it to stop work,
      but could cause it to take the less efficient path more often.
      
      All that is needed to manage this is for 'chunk_aligned_read' do some bio
      splitting, much like the RAID0 code does.
      
      Cc: Neil Brown <neilb@suse.de>
      Cc: linux-raid@vger.kernel.org
      Acked-by: NNeilBrown <neilb@suse.de>
      Signed-off-by: NMing Lin <ming.l@ssi.samsung.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      7ef6b12a
  9. 12 8月, 2015 1 次提交
    • S
      block: don't access bio->bi_error after bio_put() · 9b81c842
      Sasha Levin 提交于
      Commit 4246a0b6 ("block: add a bi_error field to struct bio") has added a few
      dereferences of 'bio' after a call to bio_put(). This causes use-after-frees
      such as:
      
      [521120.719695] BUG: KASan: use after free in dio_bio_complete+0x2b3/0x320 at addr ffff880f36b38714
      [521120.720638] Read of size 4 by task mount.ocfs2/9644
      [521120.721212] =============================================================================
      [521120.722056] BUG kmalloc-256 (Not tainted): kasan: bad access detected
      [521120.722968] -----------------------------------------------------------------------------
      [521120.722968]
      [521120.723915] Disabling lock debugging due to kernel taint
      [521120.724539] INFO: Slab 0xffffea003cdace00 objects=32 used=25 fp=0xffff880f36b38600 flags=0x46fffff80004080
      [521120.726037] INFO: Object 0xffff880f36b38700 @offset=1792 fp=0xffff880f36b38800
      [521120.726037]
      [521120.726974] Bytes b4 ffff880f36b386f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      [521120.727898] Object ffff880f36b38700: 00 88 b3 36 0f 88 ff ff 00 00 d8 de 0b 88 ff ff  ...6............
      [521120.728822] Object ffff880f36b38710: 02 00 00 f0 00 00 00 00 00 00 00 00 00 00 00 00  ................
      [521120.729705] Object ffff880f36b38720: 01 00 00 00 00 00 00 00 00 00 00 00 01 00 00 00  ................
      [521120.730623] Object ffff880f36b38730: 00 00 00 00 00 00 00 00 01 00 00 00 00 02 00 00  ................
      [521120.731621] Object ffff880f36b38740: 00 02 00 00 01 00 00 00 d0 f7 87 ad ff ff ff ff  ................
      [521120.732776] Object ffff880f36b38750: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      [521120.733640] Object ffff880f36b38760: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      [521120.734508] Object ffff880f36b38770: 01 00 03 00 01 00 00 00 88 87 b3 36 0f 88 ff ff  ...........6....
      [521120.735385] Object ffff880f36b38780: 00 73 22 ad 02 88 ff ff 40 13 e0 3c 00 ea ff ff  .s".....@..<....
      [521120.736667] Object ffff880f36b38790: 00 02 00 00 00 04 00 00 00 00 00 00 00 00 00 00  ................
      [521120.737596] Object ffff880f36b387a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      [521120.738524] Object ffff880f36b387b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      [521120.739388] Object ffff880f36b387c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      [521120.740277] Object ffff880f36b387d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      [521120.741187] Object ffff880f36b387e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      [521120.742233] Object ffff880f36b387f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      [521120.743229] CPU: 41 PID: 9644 Comm: mount.ocfs2 Tainted: G    B           4.2.0-rc6-next-20150810-sasha-00039-gf909086 #2420
      [521120.744274]  ffff880f36b38000 ffff880d89c8f638 ffffffffb6e9ba8a ffff880101c0e5c0
      [521120.745025]  ffff880d89c8f668 ffffffffad76a313 ffff880101c0e5c0 ffffea003cdace00
      [521120.745908]  ffff880f36b38700 ffff880f36b38798 ffff880d89c8f690 ffffffffad772854
      [521120.747063] Call Trace:
      [521120.747520] dump_stack (lib/dump_stack.c:52)
      [521120.748053] print_trailer (mm/slub.c:653)
      [521120.748582] object_err (mm/slub.c:660)
      [521120.749079] kasan_report_error (include/linux/kasan.h:20 mm/kasan/report.c:152 mm/kasan/report.c:194)
      [521120.750834] __asan_report_load4_noabort (mm/kasan/report.c:250)
      [521120.753580] dio_bio_complete (fs/direct-io.c:478)
      [521120.755752] do_blockdev_direct_IO (fs/direct-io.c:494 fs/direct-io.c:1291)
      [521120.759765] __blockdev_direct_IO (fs/direct-io.c:1322)
      [521120.761658] blkdev_direct_IO (fs/block_dev.c:162)
      [521120.762993] generic_file_read_iter (mm/filemap.c:1738)
      [521120.767405] blkdev_read_iter (fs/block_dev.c:1649)
      [521120.768556] __vfs_read (fs/read_write.c:423 fs/read_write.c:434)
      [521120.772126] vfs_read (fs/read_write.c:454)
      [521120.773118] SyS_pread64 (fs/read_write.c:607 fs/read_write.c:594)
      [521120.776062] entry_SYSCALL_64_fastpath (arch/x86/entry/entry_64.S:186)
      [521120.777375] Memory state around the buggy address:
      [521120.778118]  ffff880f36b38600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      [521120.779211]  ffff880f36b38680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      [521120.780315] >ffff880f36b38700: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      [521120.781465]                          ^
      [521120.782083]  ffff880f36b38780: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      [521120.783717]  ffff880f36b38800: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
      [521120.784818] ==================================================================
      
      This patch fixes a few of those places that I caught while auditing the patch, but the
      original patch should be audited further for more occurences of this issue since I'm
      not too familiar with the code.
      Signed-off-by: NSasha Levin <sasha.levin@oracle.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      9b81c842
  10. 03 8月, 2015 1 次提交
    • N
      md/raid5: don't let shrink_slab shrink too far. · 49895bcc
      NeilBrown 提交于
      I have a report of drop_one_stripe() called from
      raid5_cache_scan() apparently finding ->max_nr_stripes == 0.
      
      This should not be allowed.
      
      So add a test to keep max_nr_stripes above min_nr_stripes.
      
      Also use a 'mask' rather than a 'mod' in drop_one_stripe
      to ensure 'hash' is valid even if max_nr_stripes does reach zero.
      
      
      Fixes: edbe83ab ("md/raid5: allow the stripe_cache to grow and shrink.")
      Cc: stable@vger.kernel.org (4.1 - please release with 2d5b569b)
      Reported-by: NTomas Papan <tomas.papan@gmail.com>
      Signed-off-by: NNeilBrown <neilb@suse.com>
      49895bcc
  11. 29 7月, 2015 2 次提交
    • J
      block: manipulate bio->bi_flags through helpers · b7c44ed9
      Jens Axboe 提交于
      Some places use helpers now, others don't. We only have the 'is set'
      helper, add helpers for setting and clearing flags too.
      
      It was a bit of a mess of atomic vs non-atomic access. With
      BIO_UPTODATE gone, we don't have any risk of concurrent access to the
      flags. So relax the restriction and don't make any of them atomic. The
      flags that do have serialization issues (reffed and chained), we
      already handle those separately.
      Signed-off-by: NJens Axboe <axboe@fb.com>
      b7c44ed9
    • C
      block: add a bi_error field to struct bio · 4246a0b6
      Christoph Hellwig 提交于
      Currently we have two different ways to signal an I/O error on a BIO:
      
       (1) by clearing the BIO_UPTODATE flag
       (2) by returning a Linux errno value to the bi_end_io callback
      
      The first one has the drawback of only communicating a single possible
      error (-EIO), and the second one has the drawback of not beeing persistent
      when bios are queued up, and are not passed along from child to parent
      bio in the ever more popular chaining scenario.  Having both mechanisms
      available has the additional drawback of utterly confusing driver authors
      and introducing bugs where various I/O submitters only deal with one of
      them, and the others have to add boilerplate code to deal with both kinds
      of error returns.
      
      So add a new bi_error field to store an errno value directly in struct
      bio and remove the existing mechanisms to clean all this up.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NHannes Reinecke <hare@suse.de>
      Reviewed-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      4246a0b6
  12. 24 7月, 2015 1 次提交
  13. 22 7月, 2015 1 次提交
    • N
      md/raid5: avoid races when changing cache size. · 2d5b569b
      NeilBrown 提交于
      Cache size can grow or shrink due to various pressures at
      any time.  So when we resize the cache as part of a 'grow'
      operation (i.e. change the size to allow more devices) we need
      to blocks that automatic growing/shrinking.
      
      So introduce a mutex.  auto grow/shrink uses mutex_trylock()
      and just doesn't bother if there is a blockage.
      Resizing the whole cache holds the mutex to ensure that
      the correct number of new stripes is allocated.
      
      This bug can result in some stripes not being freed when an
      array is stopped.  This leads to the kmem_cache not being
      freed and a subsequent array can try to use the same kmem_cache
      and get confused.
      
      Fixes: edbe83ab ("md/raid5: allow the stripe_cache to grow and shrink.")
      Cc: stable@vger.kernel.org (4.1 - please delay until 2 weeks after release of 4.2)
      Signed-off-by: NNeilBrown <neilb@suse.com>
      2d5b569b
  14. 17 6月, 2015 1 次提交