1. 14 August 2015 (5 commits)
    • block: kill merge_bvec_fn() completely · 8ae12666
      Authored by Kent Overstreet
      As generic_make_request() is now able to handle arbitrarily sized bios,
      it's no longer necessary for each individual block driver to define its
      own ->merge_bvec_fn() callback. Remove every invocation completely.
      
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Lars Ellenberg <drbd-dev@lists.linbit.com>
      Cc: drbd-user@lists.linbit.com
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Yehuda Sadeh <yehuda@inktank.com>
      Cc: Sage Weil <sage@inktank.com>
      Cc: Alex Elder <elder@kernel.org>
      Cc: ceph-devel@vger.kernel.org
      Cc: Alasdair Kergon <agk@redhat.com>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Cc: dm-devel@redhat.com
      Cc: Neil Brown <neilb@suse.de>
      Cc: linux-raid@vger.kernel.org
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
      Acked-by: NeilBrown <neilb@suse.de> (for the 'md' bits)
      Acked-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
      [dpark: also remove ->merge_bvec_fn() in dm-thin as well as
       dm-era-target, and resolve merge conflicts]
      Signed-off-by: Dongsu Park <dpark@posteo.net>
      Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      8ae12666
    • md/raid5: get rid of bio_fits_rdev() · 7140aafc
      Authored by Kent Overstreet
      Remove bio_fits_rdev() as sufficient merge_bvec_fn() handling is now
      performed by blk_queue_split() in md_make_request().
      
      Cc: Neil Brown <neilb@suse.de>
      Cc: linux-raid@vger.kernel.org
      Acked-by: NeilBrown <neilb@suse.de>
      Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
      [dpark: add more description in commit message]
      Signed-off-by: Dongsu Park <dpark@posteo.net>
      Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      7140aafc
    • md/raid5: split bio for chunk_aligned_read · 7ef6b12a
      Authored by Ming Lin
      If a read request fits entirely in a chunk, it will be passed directly to the
      underlying device (providing it hasn't failed of course).  If it doesn't fit,
      the slightly less efficient path that uses the stripe_cache is used.
      Requests that get to the stripe cache are always completely split up as
      necessary.
      
      So with RAID5, ripping out the merge_bvec_fn doesn't cause it to stop
      working, but could cause it to take the less efficient path more often.

      All that is needed to manage this is for 'chunk_aligned_read' to do some
      bio splitting, much like the RAID0 code does.
      
      Cc: Neil Brown <neilb@suse.de>
      Cc: linux-raid@vger.kernel.org
      Acked-by: NeilBrown <neilb@suse.de>
      Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      7ef6b12a
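      A minimal C sketch of the splitting this commit adds to chunk_aligned_read(),
      closely following the patch's approach; raid5_read_one_chunk() here stands in
      for the pre-existing aligned-read path, and the details should be treated as
      illustrative rather than verbatim:

              static struct bio *chunk_aligned_read(struct mddev *mddev,
                                                    struct bio *raid_bio)
              {
                      struct bio *split;

                      do {
                              sector_t sector = raid_bio->bi_iter.bi_sector;
                              unsigned chunk_sects = mddev->chunk_sectors;
                              unsigned sectors = chunk_sects - (sector & (chunk_sects - 1));

                              if (sectors < bio_sectors(raid_bio)) {
                                      /* The front part fits in one chunk; the rest
                                       * stays in raid_bio and is chained to split. */
                                      split = bio_split(raid_bio, sectors, GFP_NOIO, fs_bio_set);
                                      bio_chain(split, raid_bio);
                              } else
                                      split = raid_bio;

                              if (!raid5_read_one_chunk(mddev, split)) {
                                      /* This piece must go through the stripe cache;
                                       * resubmit whatever is left over. */
                                      if (split != raid_bio)
                                              generic_make_request(raid_bio);
                                      return split;
                              }
                      } while (split != raid_bio);

                      return NULL;
              }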
    • bcache: remove driver private bio splitting code · 749b61da
      Authored by Kent Overstreet
      The bcache driver has always accepted arbitrarily large bios and split
      them internally.  Now that every driver must accept arbitrarily large
      bios, this code isn't necessary anymore.
      
      Cc: linux-bcache@vger.kernel.org
      Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
      [dpark: add more description in commit message]
      Signed-off-by: Dongsu Park <dpark@posteo.net>
      Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      749b61da
    • block: make generic_make_request handle arbitrarily sized bios · 54efd50b
      Authored by Kent Overstreet
      The way the block layer is currently written, it goes to great lengths
      to avoid having to split bios; upper layer code (such as bio_add_page())
      checks what the underlying device can handle and tries to always create
      bios that don't need to be split.
      
      But this approach becomes unwieldy and eventually breaks down with
      stacked devices and devices with dynamic limits, and it adds a lot of
      complexity. If the block layer could split bios as needed, we could
      eliminate a lot of complexity elsewhere - particularly in stacked
      drivers. Code that creates bios can then create whatever size bios are
      convenient, and more importantly stacked drivers don't have to deal with
      both their own bio size limitations and the limitations of the
      (potentially multiple) devices underneath them.  In the future this will
      let us delete merge_bvec_fn and a bunch of other code.
      
      We do this by adding calls to blk_queue_split() to the various
      make_request functions that need it - a few can already handle arbitrary
      size bios. Note that we add the call _after_ any call to
      blk_queue_bounce(); this means that blk_queue_split() and
      blk_recalc_rq_segments() don't need to be concerned with bouncing
      affecting segment merging.
      
      Some make_request_fn() callbacks were simple enough to audit and verify
      they don't need blk_queue_split() calls. The skipped ones are:
      
       * nfhd_make_request (arch/m68k/emu/nfblock.c)
       * axon_ram_make_request (arch/powerpc/sysdev/axonram.c)
       * simdisk_make_request (arch/xtensa/platforms/iss/simdisk.c)
       * brd_make_request (ramdisk - drivers/block/brd.c)
       * mtip_submit_request (drivers/block/mtip32xx/mtip32xx.c)
       * loop_make_request
       * null_queue_bio
       * bcache's make_request fns
      
      Some others are almost certainly safe to remove now, but will be left
      for future patches.
      
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Ming Lei <ming.lei@canonical.com>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Alasdair Kergon <agk@redhat.com>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Cc: dm-devel@redhat.com
      Cc: Lars Ellenberg <drbd-dev@lists.linbit.com>
      Cc: drbd-user@lists.linbit.com
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Geoff Levand <geoff@infradead.org>
      Cc: Jim Paris <jim@jtan.com>
      Cc: Philip Kelleher <pjk1939@linux.vnet.ibm.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Oleg Drokin <oleg.drokin@intel.com>
      Cc: Andreas Dilger <andreas.dilger@intel.com>
      Acked-by: NeilBrown <neilb@suse.de> (for the 'md/md.c' bits)
      Acked-by: Mike Snitzer <snitzer@redhat.com>
      Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
      [dpark: skip more mq-based drivers, resolve merge conflicts, etc.]
      Signed-off-by: Dongsu Park <dpark@posteo.net>
      Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      54efd50b
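      For a driver with its own make_request_fn, the conversion described above
      amounts to one extra call near the top of the function.  A hedged sketch,
      using the blk_queue_split() signature introduced by this patch (the driver
      body is illustrative):

              static void example_make_request(struct request_queue *q, struct bio *bio)
              {
                      /* Let the core split the bio into pieces this queue can
                       * handle; q->bio_split is the per-queue bio_set added by
                       * this patch.  Where bouncing is used, this is called
                       * after blk_queue_bounce(). */
                      blk_queue_split(q, &bio, q->bio_split);

                      /* ... driver-specific handling of the (possibly smaller) bio ... */
              }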
  2. 12 August 2015 (1 commit)
    • block: don't access bio->bi_error after bio_put() · 9b81c842
      Authored by Sasha Levin
      Commit 4246a0b6 ("block: add a bi_error field to struct bio") has added a few
      dereferences of 'bio' after a call to bio_put(). This causes use-after-frees
      such as:
      
      [521120.719695] BUG: KASan: use after free in dio_bio_complete+0x2b3/0x320 at addr ffff880f36b38714
      [521120.720638] Read of size 4 by task mount.ocfs2/9644
      [521120.721212] =============================================================================
      [521120.722056] BUG kmalloc-256 (Not tainted): kasan: bad access detected
      [521120.722968] -----------------------------------------------------------------------------
      [521120.722968]
      [521120.723915] Disabling lock debugging due to kernel taint
      [521120.724539] INFO: Slab 0xffffea003cdace00 objects=32 used=25 fp=0xffff880f36b38600 flags=0x46fffff80004080
      [521120.726037] INFO: Object 0xffff880f36b38700 @offset=1792 fp=0xffff880f36b38800
      [521120.726037]
      [521120.726974] Bytes b4 ffff880f36b386f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      [521120.727898] Object ffff880f36b38700: 00 88 b3 36 0f 88 ff ff 00 00 d8 de 0b 88 ff ff  ...6............
      [521120.728822] Object ffff880f36b38710: 02 00 00 f0 00 00 00 00 00 00 00 00 00 00 00 00  ................
      [521120.729705] Object ffff880f36b38720: 01 00 00 00 00 00 00 00 00 00 00 00 01 00 00 00  ................
      [521120.730623] Object ffff880f36b38730: 00 00 00 00 00 00 00 00 01 00 00 00 00 02 00 00  ................
      [521120.731621] Object ffff880f36b38740: 00 02 00 00 01 00 00 00 d0 f7 87 ad ff ff ff ff  ................
      [521120.732776] Object ffff880f36b38750: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      [521120.733640] Object ffff880f36b38760: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      [521120.734508] Object ffff880f36b38770: 01 00 03 00 01 00 00 00 88 87 b3 36 0f 88 ff ff  ...........6....
      [521120.735385] Object ffff880f36b38780: 00 73 22 ad 02 88 ff ff 40 13 e0 3c 00 ea ff ff  .s".....@..<....
      [521120.736667] Object ffff880f36b38790: 00 02 00 00 00 04 00 00 00 00 00 00 00 00 00 00  ................
      [521120.737596] Object ffff880f36b387a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      [521120.738524] Object ffff880f36b387b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      [521120.739388] Object ffff880f36b387c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      [521120.740277] Object ffff880f36b387d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      [521120.741187] Object ffff880f36b387e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      [521120.742233] Object ffff880f36b387f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      [521120.743229] CPU: 41 PID: 9644 Comm: mount.ocfs2 Tainted: G    B           4.2.0-rc6-next-20150810-sasha-00039-gf909086 #2420
      [521120.744274]  ffff880f36b38000 ffff880d89c8f638 ffffffffb6e9ba8a ffff880101c0e5c0
      [521120.745025]  ffff880d89c8f668 ffffffffad76a313 ffff880101c0e5c0 ffffea003cdace00
      [521120.745908]  ffff880f36b38700 ffff880f36b38798 ffff880d89c8f690 ffffffffad772854
      [521120.747063] Call Trace:
      [521120.747520] dump_stack (lib/dump_stack.c:52)
      [521120.748053] print_trailer (mm/slub.c:653)
      [521120.748582] object_err (mm/slub.c:660)
      [521120.749079] kasan_report_error (include/linux/kasan.h:20 mm/kasan/report.c:152 mm/kasan/report.c:194)
      [521120.750834] __asan_report_load4_noabort (mm/kasan/report.c:250)
      [521120.753580] dio_bio_complete (fs/direct-io.c:478)
      [521120.755752] do_blockdev_direct_IO (fs/direct-io.c:494 fs/direct-io.c:1291)
      [521120.759765] __blockdev_direct_IO (fs/direct-io.c:1322)
      [521120.761658] blkdev_direct_IO (fs/block_dev.c:162)
      [521120.762993] generic_file_read_iter (mm/filemap.c:1738)
      [521120.767405] blkdev_read_iter (fs/block_dev.c:1649)
      [521120.768556] __vfs_read (fs/read_write.c:423 fs/read_write.c:434)
      [521120.772126] vfs_read (fs/read_write.c:454)
      [521120.773118] SyS_pread64 (fs/read_write.c:607 fs/read_write.c:594)
      [521120.776062] entry_SYSCALL_64_fastpath (arch/x86/entry/entry_64.S:186)
      [521120.777375] Memory state around the buggy address:
      [521120.778118]  ffff880f36b38600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      [521120.779211]  ffff880f36b38680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      [521120.780315] >ffff880f36b38700: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      [521120.781465]                          ^
      [521120.782083]  ffff880f36b38780: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      [521120.783717]  ffff880f36b38800: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
      [521120.784818] ==================================================================
      
      This patch fixes a few of those places that I caught while auditing the patch, but the
      original patch should be audited further for more occurrences of this issue since I'm
      not too familiar with the code.
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      9b81c842
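      The pattern of the fix is simply to copy the error out before dropping the
      reference.  A hedged sketch (the function name is illustrative):

              static int complete_and_release(struct bio *bio)
              {
                      int err = bio->bi_error;        /* read the error first...  */

                      bio_put(bio);                   /* ...bio may be freed here */

                      /* Only the local 'err' may be used from this point on;
                       * touching bio->bi_error again would be a use-after-free. */
                      return err;
              }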
  3. 29 July 2015 (2 commits)
    • block: manipulate bio->bi_flags through helpers · b7c44ed9
      Authored by Jens Axboe
      Some places use helpers now, others don't. We only have the 'is set'
      helper; add helpers for setting and clearing flags too.

      It was a bit of a mess of atomic vs non-atomic access. With
      BIO_UPTODATE gone, we don't have any risk of concurrent access to the
      flags. So relax the restriction and don't make any of them atomic. The
      flags that do have serialization issues (reffed and chained) we
      already handle separately.
      Signed-off-by: Jens Axboe <axboe@fb.com>
      b7c44ed9
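      The helpers this commit introduces are plain, non-atomic bit operations on
      bio->bi_flags.  A sketch of their shape (illustrative, not a verbatim copy
      of the patch):

              static inline bool bio_flagged(struct bio *bio, unsigned int bit)
              {
                      return (bio->bi_flags & (1UL << bit)) != 0;
              }

              static inline void bio_set_flag(struct bio *bio, unsigned int bit)
              {
                      bio->bi_flags |= (1UL << bit);          /* plain store, not atomic */
              }

              static inline void bio_clear_flag(struct bio *bio, unsigned int bit)
              {
                      bio->bi_flags &= ~(1UL << bit);
              }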
    • block: add a bi_error field to struct bio · 4246a0b6
      Authored by Christoph Hellwig
      Currently we have two different ways to signal an I/O error on a BIO:
      
       (1) by clearing the BIO_UPTODATE flag
       (2) by returning a Linux errno value to the bi_end_io callback
      
      The first one has the drawback of only communicating a single possible
      error (-EIO), and the second one has the drawbacks of not being persistent
      when bios are queued up and of not being passed along from child to parent
      bio in the ever more popular chaining scenario.  Having both mechanisms
      available has the additional drawback of utterly confusing driver authors
      and introducing bugs where various I/O submitters only deal with one of
      them, and the others have to add boilerplate code to deal with both kinds
      of error returns.
      
      So add a new bi_error field to store an errno value directly in struct
      bio and remove the existing mechanisms to clean all this up.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Hannes Reinecke <hare@suse.de>
      Reviewed-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      4246a0b6
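      After this change an error is reported by storing an errno in the bio and
      completing it, and the completion callback reads the same field.  A hedged
      sketch of both sides (function names are illustrative):

              /* Submitter/driver side: fail a bio with a specific errno. */
              static void fail_bio(struct bio *bio, int error)
              {
                      bio->bi_error = error;          /* e.g. -EIO, -ENOMEM, ...      */
                      bio_endio(bio);                 /* no error argument any longer */
              }

              /* Completion side: ->bi_end_io now takes only the bio. */
              static void example_end_io(struct bio *bio)
              {
                      if (bio->bi_error)
                              pr_err("I/O failed: %d\n", bio->bi_error);
                      bio_put(bio);
              }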
  4. 17 July 2015 (1 commit)
  5. 11 July 2015 (1 commit)
  6. 01 July 2015 (2 commits)
  7. 26 June 2015 (4 commits)
  8. 25 June 2015 (3 commits)
    • md: clear Blocked flag on failed devices when array is read-only. · ab16bfc7
      Authored by Neil Brown
      The Blocked flag indicates that a device has failed but that this
      fact hasn't been recorded in the metadata yet.  Writes to such
      devices cannot be allowed until the metadata has been updated.
      
      On a read-only array, the Blocked flag will never be cleared.
      This prevents the device from being removed from the array.

      If the metadata is being handled by the kernel
      (i.e. !mddev->external), then we can be sure that if the array is
      switched to writable, a metadata update will happen and will
      record the failure.  So we don't need the flag set.

      If metadata is externally managed, it is up to the external manager
      to clear the 'blocked' flag.
      Reported-by: XiaoNi <xni@redhat.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      ab16bfc7
    • md: unlock mddev_lock on an error path. · 9a8c0fa8
      Authored by NeilBrown
      This error path returns while still holding the lock - bad.
      
      Fixes: 6791875e ("md: make reconfig_mutex optional for writes to md sysfs files.")
      Cc: stable@vger.kernel.org (v4.0+)
      Signed-off-by: NeilBrown <neilb@suse.com>
      9a8c0fa8
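      The shape of the fix is the usual unlock-before-return on the early-exit
      path.  A hedged sketch in the style of an md sysfs store routine (the
      precondition is illustrative):

              static ssize_t example_store(struct mddev *mddev, const char *buf, size_t len)
              {
                      int err = mddev_lock(mddev);

                      if (err)
                              return err;

                      if (!mddev->pers) {             /* illustrative precondition     */
                              mddev_unlock(mddev);    /* the previously missing unlock */
                              return -EINVAL;
                      }

                      /* ... perform the update ... */
                      mddev_unlock(mddev);
                      return len;
              }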
    • md: clear mddev->private when it has been freed. · bd691922
      Authored by NeilBrown
      If ->private is set when ->run is called, it is assumed to be
      a 'config' prepared as part of 'reshape'.

      So it is important when we free that config, that we also clear ->private.
      This is not often a problem as the mddev will normally be discarded
      shortly after the config is freed.
      However if an 'assemble' races with a final close, the assemble can use
      the old mddev which has a stale ->private.  This leads to various
      sorts of crashes.
      
      So clear ->private after calling ->free().
      Reported-by: Nate Clark <nate@neworld.us>
      Cc: stable@vger.kernel.org (v4.0+)
      Fixes: afa0f557 ("md: rename ->stop to ->free")
      Signed-off-by: NeilBrown <neilb@suse.com>
      bd691922
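      The fix amounts to clearing the pointer right after handing it to the
      personality's ->free().  A hedged sketch of that spot (the surrounding
      context is illustrative):

              static void example_md_stop(struct mddev *mddev)
              {
                      struct md_personality *pers = mddev->pers;

                      mddev->pers = NULL;
                      pers->free(mddev, mddev->private);
                      mddev->private = NULL;  /* the fix: leave no stale config
                                               * pointer for a racing assemble */
              }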
  9. 24 June 2015 (2 commits)
  10. 18 June 2015 (5 commits)
  11. 17 June 2015 (7 commits)
    • dm space map metadata: fix occasional leak of a metadata block on resize · 6096d91a
      Authored by Joe Thornber
      The metadata space map has a simplified 'bootstrap' mode that is
      operational when extending the space maps.  Whilst in this mode it's
      possible for some refcount decrement operations to become queued (e.g., as
      a result of shadowing one of the bitmap indexes).  These decrements were
      not being applied when switching out of bootstrap mode.

      The effect of this bug was the leaking of a 4k metadata block.  This is
      detected by the latest version of thin_check as a non-fatal error.
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org
      6096d91a
    • md: fix a build warning · 4e023612
      Authored by Firo Yang
      The build produces a warning like this:
      
      drivers/md/md.c: In function "update_array_info":
      drivers/md/md.c:6394:26: warning: logical not is only applied
      to the left hand side of comparison [-Wlogical-not-parentheses]
            !mddev->persistent  != info->not_persistent||
      
      Fix it as Neil Brown said:
      mddev->persistent != !info->not_persistent ||
      Signed-off-by: Firo Yang <firogm@gmail.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      4e023612
    • md/raid5: ignore released_stripes check · 713bc5c2
      Authored by Shaohua Li
      The conf->released_stripes list isn't always related to whether there are
      free stripes pending.  Active stripes can be in the list too, and even
      free stripes may have been active very recently.
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      713bc5c2
    • md/raid5: per hash value and exclusive wait_for_stripe · e9e4c377
      Authored by Yuanhan Liu
      I noticed heavy spin lock contention at get_active_stripe() with fsmark
      multi-threaded write workloads.

      Here is where this hot contention comes from.  We have a limited number
      of stripes, and it's a multi-threaded write workload, so those stripes
      are taken quickly, which puts later processes to sleep waiting for free
      stripes.  When enough stripes (>= 1/4 of the total) are released, all
      processes are woken and try to get the lock.  But only one of them can
      hold each hash lock at a time, leaving the other processes spinning
      while trying to acquire it.

      Thus, it's ineffective to wake up all processes and let them battle for
      a lock that only one of them can hold at a time.  Instead, we can make
      the wake-up exclusive: wake up one process only.  That avoids the heavy
      spin lock contention naturally.

      To do the exclusive wake-up, we have to split wait_for_stripe into
      multiple wait queues, making it per hash value, just like the hash locks.

      Here are some test results I got with this patch applied (each test was
      run 3 times):
      
      `fsmark.files_per_sec'
      =====================
      
      next-20150317                 this patch
      -------------------------     -------------------------
      metric_value     ±stddev      metric_value     ±stddev     change      testbox/benchmark/testcase-params
      -------------------------     -------------------------   --------     ------------------------------
            25.600     ±0.0              92.700     ±2.5          262.1%     ivb44/fsmark/1x-64t-4BRD_12G-RAID5-btrfs-4M-30G-fsyncBeforeClose
            25.600     ±0.0              77.800     ±0.6          203.9%     ivb44/fsmark/1x-64t-9BRD_6G-RAID5-btrfs-4M-30G-fsyncBeforeClose
            32.000     ±0.0              93.800     ±1.7          193.1%     ivb44/fsmark/1x-64t-4BRD_12G-RAID5-ext4-4M-30G-fsyncBeforeClose
            32.000     ±0.0              81.233     ±1.7          153.9%     ivb44/fsmark/1x-64t-9BRD_6G-RAID5-ext4-4M-30G-fsyncBeforeClose
            48.800     ±14.5             99.667     ±2.0          104.2%     ivb44/fsmark/1x-64t-4BRD_12G-RAID5-xfs-4M-30G-fsyncBeforeClose
             6.400     ±0.0              12.800     ±0.0          100.0%     ivb44/fsmark/1x-64t-3HDD-RAID5-btrfs-4M-40G-fsyncBeforeClose
            63.133     ±8.2              82.800     ±0.7           31.2%     ivb44/fsmark/1x-64t-9BRD_6G-RAID5-xfs-4M-30G-fsyncBeforeClose
           245.067     ±0.7             306.567     ±7.9           25.1%     ivb44/fsmark/1x-64t-4BRD_12G-RAID5-f2fs-4M-30G-fsyncBeforeClose
            17.533     ±0.3              21.000     ±0.8           19.8%     ivb44/fsmark/1x-1t-3HDD-RAID5-xfs-4M-40G-fsyncBeforeClose
           188.167     ±1.9             215.033     ±3.1           14.3%     ivb44/fsmark/1x-1t-4BRD_12G-RAID5-btrfs-4M-30G-NoSync
           254.500     ±1.8             290.733     ±2.4           14.2%     ivb44/fsmark/1x-1t-9BRD_6G-RAID5-btrfs-4M-30G-NoSync
      
      `time.system_time'
      =====================
      
      next-20150317                 this patch
      -------------------------    -------------------------
      metric_value     ±stddev     metric_value     ±stddev     change       testbox/benchmark/testcase-params
      -------------------------    -------------------------    --------     ------------------------------
          7235.603     ±1.2             185.163     ±1.9          -97.4%     ivb44/fsmark/1x-64t-4BRD_12G-RAID5-btrfs-4M-30G-fsyncBeforeClose
          7666.883     ±2.9             202.750     ±1.0          -97.4%     ivb44/fsmark/1x-64t-9BRD_6G-RAID5-btrfs-4M-30G-fsyncBeforeClose
         14567.893     ±0.7             421.230     ±0.4          -97.1%     ivb44/fsmark/1x-64t-3HDD-RAID5-btrfs-4M-40G-fsyncBeforeClose
          3697.667     ±14.0            148.190     ±1.7          -96.0%     ivb44/fsmark/1x-64t-4BRD_12G-RAID5-xfs-4M-30G-fsyncBeforeClose
          5572.867     ±3.8             310.717     ±1.4          -94.4%     ivb44/fsmark/1x-64t-9BRD_6G-RAID5-ext4-4M-30G-fsyncBeforeClose
          5565.050     ±0.5             313.277     ±1.5          -94.4%     ivb44/fsmark/1x-64t-4BRD_12G-RAID5-ext4-4M-30G-fsyncBeforeClose
          2420.707     ±17.1            171.043     ±2.7          -92.9%     ivb44/fsmark/1x-64t-9BRD_6G-RAID5-xfs-4M-30G-fsyncBeforeClose
          3743.300     ±4.6             379.827     ±3.5          -89.9%     ivb44/fsmark/1x-64t-3HDD-RAID5-ext4-4M-40G-fsyncBeforeClose
          3308.687     ±6.3             363.050     ±2.0          -89.0%     ivb44/fsmark/1x-64t-3HDD-RAID5-xfs-4M-40G-fsyncBeforeClose
      
      Where,
      
           1x: where 'x' means iterations or loop, corresponding to the 'L' option of fsmark
      
           1t, 64t: where 't' means thread
      
           4M: means the single file size, corresponding to the '-s' option of fsmark
           40G, 30G, 120G: means the total test size
      
           4BRD_12G: BRD is the ramdisk, where '4' means 4 ramdisks and '12G' means
                     the size of each ramdisk, so 48G in total.  We made a
                     RAID on those ramdisks.
      
      As you can see, though there is not much performance gain for the hard
      disk workloads, the system time drops heavily, by up to 97%.  And as
      expected, performance increases a lot, by up to 260%, for the fast
      devices (ramdisks).

      v2: use bits instead of an array to note which wait queues need to be
          woken up.
      Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      e9e4c377
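      A generic sketch of the exclusive-wake-up pattern this commit applies, using
      long-standing waitqueue primitives rather than the patch's exact raid5 code
      (the array size and helper names are illustrative; callers still recheck
      their condition in a loop as usual):

              #include <linux/wait.h>
              #include <linux/sched.h>
              #include <linux/spinlock.h>

              #define NR_HASH_QUEUES 8        /* illustrative: one queue per hash lock */

              /* One wait queue per hash value, each set up with init_waitqueue_head(). */
              static wait_queue_head_t wait_for_stripe[NR_HASH_QUEUES];

              static void wait_for_free_stripe(spinlock_t *hash_lock, int hash)
              {
                      DEFINE_WAIT(w);

                      /* Exclusive waiters are woken one at a time by wake_up(). */
                      prepare_to_wait_exclusive(&wait_for_stripe[hash], &w,
                                                TASK_UNINTERRUPTIBLE);
                      spin_unlock_irq(hash_lock);
                      schedule();
                      spin_lock_irq(hash_lock);
                      finish_wait(&wait_for_stripe[hash], &w);
              }

              static void stripe_released(int hash)
              {
                      wake_up(&wait_for_stripe[hash]);        /* wakes at most one waiter */
              }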
    • md/raid5: split wait_for_stripe and introduce wait_for_quiescent · b1b46486
      Authored by Yuanhan Liu
      I noticed heavy spin lock contention at get_active_stripe(), introduced
      at the wake-up stage, where a bunch of processes try to re-acquire the
      spin lock.

      After giving this issue some thought, I found the contention could be
      relieved (and even avoided) if we turn wait_for_stripe into a per-hash
      wait queue, one for each lock hash, and make the wake-up exclusive: wake
      up one process each time, which avoids the lock contention naturally.

      Before hacking on wait_for_stripe, I found it actually has two usages:
      waiting for the array to enter or leave the quiescent state, and also
      waiting for an available stripe in each of the hash lists.
      
      So this patch splits the first usage off into a separate wait_queue,
      wait_for_quiescent, and the next patch will turn the second usage into
      one waitqueue for each hash value, and make it exclusive, to relieve
      the lock contention.
      
      v2: wake_up(wait_for_quiescent) when (active_stripes == 0)
          Commit log refactor suggestion from Neil.
      Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      b1b46486
    • md: convert to kstrto*() · 4c9309c0
      Authored by Alexey Dobriyan
      Convert away from deprecated simple_strto*() functions.
      
      Add "fit into sector_t" checks.
      Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      4c9309c0
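      A hedged sketch of the conversion's direction: replace simple_strtoull()
      with a kstrto*() call and reject values that do not fit in sector_t (the
      helper name is illustrative):

              #include <linux/kernel.h>       /* kstrtoull()  */
              #include <linux/types.h>        /* sector_t     */

              static int parse_sectors(const char *buf, sector_t *sectors)
              {
                      unsigned long long v;
                      int err = kstrtoull(buf, 10, &v);

                      if (err)
                              return err;             /* -EINVAL or -ERANGE        */
                      if (v != (sector_t)v)
                              return -EINVAL;         /* doesn't fit into sector_t */
                      *sectors = v;
                      return 0;
              }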
    • md/raid10: make sync_request_write() call bio_copy_data() · c31df25f
      Authored by Kent Overstreet
      Refactor sync_request_write() of md/raid10 to use bio_copy_data()
      instead of open coding bio_vec iterations.
      
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Neil Brown <neilb@suse.de>
      Cc: linux-raid@vger.kernel.org
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Acked-by: NeilBrown <neilb@suse.de>
      Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
      [dpark: add more description in commit message]
      Signed-off-by: Dongsu Park <dpark@posteo.net>
      Signed-off-by: Ming Lin <mlin@kernel.org>
      Signed-off-by: NeilBrown <neilb@suse.de>
      c31df25f
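      The heart of the refactor is replacing the hand-rolled bio_vec walk with a
      single block-layer call; a hedged one-line sketch (fbio/tbio here denote the
      "first" good bio and the target bio of the resync; the names are
      illustrative):

              /* Before: nested loops kmap()ing and memcpy()ing page by page.
               * After: copy the payload of fbio (source) into tbio (destination). */
              bio_copy_data(tbio, fbio);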
  12. 12 June 2015 (7 commits)
    • md: make sure MD_RECOVERY_DONE is clear before starting recovery/resync · ea358cd0
      Authored by NeilBrown
      MD_RECOVERY_DONE is normally cleared by md_check_recovery after a
      resync etc finished.  However it is possible for raid5_start_reshape
      to race and start a reshape before MD_RECOVERY_DONE is cleared.  This
      can lead to multiple reshapes running at the same time, which isn't
      good.

      So make sure it is cleared before starting a reshape, and also clear
      it when reaping a thread, just to be safe.
      Signed-off-by: NeilBrown <neilb@suse.de>
      ea358cd0
    • md: Close race when setting 'action' to 'idle'. · 8e8e2518
      Authored by NeilBrown
      Checking ->sync_thread without holding the mddev_lock()
      isn't really safe, even after flushing the workqueue which
      ensures md_start_sync() has been run.
      
      While this code is waiting for the lock, md_check_recovery could reap
      the thread itself, and then start another thread (e.g. recovery might
      finish, then reshape starts).  When this thread gets the lock,
      md_start_sync() hasn't run, so it doesn't get reaped, but
      MD_RECOVERY_RUNNING gets cleared.  This allows two threads to start,
      which leads to confusion.
      
      So don't bother if MD_RECOVERY_RUNNING isn't set, but if it is, do
      the flush, the test, and the reap all under the mddev_lock to
      avoid any race with md_check_recovery.
      Signed-off-by: NeilBrown <neilb@suse.de>
      Fixes: 6791875e ("md: make reconfig_mutex optional for writes to md sysfs files.")
      Cc: stable@vger.kernel.org (v4.0+)
      8e8e2518
    • md: don't return 0 from array_state_store · c008f1d3
      Authored by NeilBrown
      Returning zero from a 'store' function is bad.
      The return value should be either 'len' (the length of the string)
      or an error.
      
      So use 'len' if 'err' is zero.
      
      Fixes: 6791875e ("md: make reconfig_mutex optional for writes to md sysfs files.")
      Signed-off-by: NeilBrown <neilb@suse.de>
      Cc: stable@vger.kernel.org (v4.0+)
      c008f1d3
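      The expected sysfs ->store() contract, which the fix restores: consume the
      whole string and return its length on success, otherwise a negative errno.
      A hedged sketch (the update helper is hypothetical):

              static ssize_t example_state_store(struct mddev *mddev, const char *buf, size_t len)
              {
                      int err = apply_new_state(mddev, buf);  /* hypothetical helper */

                      if (err)
                              return err;     /* negative errno on failure      */
                      return len;             /* whole string consumed: never 0 */
              }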
    • dm thin metadata: fix a race when entering fail mode · b1f11aff
      Authored by Joe Thornber
      In dm_thin_find_block() the ->fail_io flag was checked outside the
      metadata device's root_lock, causing dm_thin_find_block() to race with
      the setting of this flag.
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      b1f11aff
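      The shape of the fix is to test the flag while holding the lock that
      protects it.  A hedged sketch around the metadata root_lock (the lookup
      helper is illustrative):

              int dm_thin_find_block(struct dm_thin_device *td, dm_block_t block,
                                     int can_issue_io, struct dm_thin_lookup_result *result)
              {
                      int r = -EINVAL;
                      struct dm_pool_metadata *pmd = td->pmd;

                      down_read(&pmd->root_lock);
                      if (!pmd->fail_io)      /* now checked under root_lock */
                              r = __find_block(td, block, can_issue_io, result);
                      up_read(&pmd->root_lock);

                      return r;
              }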
    • dm thin: fail messages with EOPNOTSUPP when pool cannot handle messages · fd467696
      Authored by Mike Snitzer
      Use an EOPNOTSUPP, rather than EINVAL, error code when a user attempts to
      send a message to a pool that cannot handle messages.  Otherwise userspace
      is led to believe the message failed due to an invalid argument.
      Reported-by: Zdenek Kabelac <zkabelac@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      fd467696
    • dm thin: range discard support · 34fbcf62
      Authored by Joe Thornber
      Previously, REQ_DISCARD bios were split into block-sized chunks
      before submission to the thin target.  There are a couple of issues with
      this:
      
       - If the block size is small, a large discard request can
         get broken up into a great many bios which is both slow and causes
         a lot of memory pressure.
      
       - The thin pool block size and the discard granularity for the
         underlying data device need to be compatible if we want to pass
         the discard down.
      
      This patch relaxes the block size granularity for thin devices.  It
      makes use of the recent range locking added to the bio_prison to
      quiesce a whole range of thin blocks before unmapping them.  Once a
      thin range has been unmapped, the discard can then be passed down to
      the data device for those sub-ranges where the data blocks are no
      longer used (i.e. they weren't shared in the first place).
      
      This patch also doesn't make any apologies about open-coding portions
      of block core as a means of supporting async discard completions in the
      near-term -- if/when late bio splitting lands it'll all get cleaned up.
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      34fbcf62
    • dm thin metadata: add dm_thin_remove_range() · 6550f075
      Authored by Joe Thornber
      Removes a range of blocks from the btree.
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      6550f075