1. 06 Oct, 2017 1 commit
  2. 03 Oct, 2017 1 commit
  3. 11 Sep, 2017 6 commits
    • dax: remove the pmem_dax_ops->flush abstraction · c3ca015f
      Authored by Mikulas Patocka
      Commit abebfbe2 ("dm: add ->flush() dax operation support") is
      buggy. A DM device may be composed of multiple underlying devices and
      all of them need to be flushed. That commit just routes the flush
      request to the first device and ignores the other devices.
      
      It could be fixed by adding more complex logic to the device mapper. But
      there is only one implementation of the method pmem_dax_ops->flush - that
      is pmem_dax_flush() - and it calls arch_wb_cache_pmem(). Consequently, we
      don't need the pmem_dax_ops->flush abstraction at all, we can call
      arch_wb_cache_pmem() directly from dax_flush() because dax_dev->ops->flush
      can't ever reach anything different from arch_wb_cache_pmem().
      
      It should also be pointed out that some uses of persistent memory need
      to flush only a very small amount of data (such as one cache line),
      and it would be overkill to go through the device mapper machinery for
      a single flushed cache line.
      
      Fix this by removing the pmem_dax_ops->flush abstraction and calling
      arch_wb_cache_pmem() directly from dax_flush(). Also, remove the device
      mapper code that forwards the flushes.
      
      Fixes: abebfbe2 ("dm: add ->flush() dax operation support")
      Cc: stable@vger.kernel.org
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Reviewed-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm integrity: use init_completion instead of COMPLETION_INITIALIZER_ONSTACK · b5e8ad92
      Authored by Arnd Bergmann
      The new lockdep support for completions caused the stack usage
      in dm-integrity to explode, in the case of write_journal from 504 bytes
      to 1120 (using arm gcc-7.1.1):
      
      drivers/md/dm-integrity.c: In function 'write_journal':
      drivers/md/dm-integrity.c:827:1: error: the frame size of 1120 bytes is larger than 1024 bytes [-Werror=frame-larger-than=]
      
      The problem is that not only the size of 'struct completion' grows
      significantly, but we end up having multiple copies of it on the stack
      when we assign it from a local variable after the initial declaration.
      
      COMPLETION_INITIALIZER_ONSTACK() is the right thing to use when we
      want to declare and initialize a completion on the stack. However,
      this driver doesn't do that and instead initializes the completion
      just before it is used.
      
      In this case, init_completion() does the same thing more efficiently,
      and drops the stack usage for the function above down to 496 bytes.
      While the other functions in this file are not bad enough to cause
      a warning, they benefit equally from the change, so I do the change
      across the entire file. In the one place where we reuse a completion,
      I picked the cheaper reinit_completion() over init_completion().
      
      Fixes: cd8084f9 ("locking/lockdep: Apply crossrelease to completions")
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Acked-by: Mikulas Patocka <mpatocka@redhat.com>
      Acked-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
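The two initialization styles discussed in the dm-integrity completion commit can be illustrated in user space. This is a minimal sketch, not the dm-integrity code: `struct mock_completion` and every function name are hypothetical stand-ins, and the 256-byte padding only imitates a structure that has grown large. Whether a temporary copy is actually materialized on the stack depends on the compiler.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for a completion that has grown large. */
struct mock_completion {
	unsigned int done;
	char debug_state[256];	/* imitates extra debugging state */
};

/* Style 1: build a temporary and assign it to an existing object.
 * The compiler may need a second copy of the structure on the stack. */
static void init_by_assignment(struct mock_completion *c)
{
	assert(c != NULL);
	*c = (struct mock_completion){ .done = 0 };
}

/* Style 2: initialize the existing object in place, no temporary. */
static void init_in_place(struct mock_completion *c)
{
	assert(c != NULL);
	c->done = 0;
	memset(c->debug_state, 0, sizeof(c->debug_state));
}

/* Cheaper re-arm for an object that was already initialized once. */
static void reinit_in_place(struct mock_completion *c)
{
	assert(c != NULL);
	c->done = 0;
}

/* Returns true when both styles leave the object in the same state. */
static bool patterns_agree(void)
{
	struct mock_completion a, b;

	memset(&a, 0xff, sizeof(a));
	memset(&b, 0xff, sizeof(b));
	init_by_assignment(&a);
	init_in_place(&b);
	return a.done == b.done &&
	       memcmp(a.debug_state, b.debug_state, sizeof(a.debug_state)) == 0;
}
```

Both styles produce the same object state; the difference the commit cares about is only how much stack the function needs to get there.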
    • dm integrity: make blk_integrity_profile structure const · 7c373d66
      Authored by Bhumika Goyal
      Make this structure const as it is only stored in the profile field of a
      blk_integrity structure. This field is of type const, so make the
      structure const as well.
      Signed-off-by: Bhumika Goyal <bhumirks@gmail.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm integrity: do not check integrity for failed read operations · b7e326f7
      Authored by Hyunchul Lee
      Even when a read operation fails, dm_integrity_map_continue() still calls
      integrity_metadata() to check integrity.  In this case, just complete
      the failed read without checking.
      
      This also makes it so read I/O errors do not generate integrity warnings
      in the kernel log.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Hyunchul Lee <cheol.lee@lge.com>
      Acked-by: Milan Broz <gmazyland@gmail.com>
      Acked-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm log writes: fix >512b sectorsize support · 228bb5b2
      Authored by Josef Bacik
      512b sectors vs device's physical sectorsize was not maintained
      consistently and as such the support for >512b sector devices has bugs.
      The log metadata expects native sectorsize but 512b sectors were being
      stored.  Also, device's sectorsize was assumed when assigning the
      bi_sector for blocks that were being logged.
      
      Fix this up by adding two helpers to convert between bio and dev
      sectors, and use these in the appropriate places to fix the problem and
      make it clear which units go where.  Doing so allows dm-log-writes use
      with 4k devices.
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
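The pair of conversion helpers described in the >512b sectorsize commit might look roughly like the sketch below. It assumes the device sector size is a power of two stored as a shift; `struct log_ctx`, its `sectorshift` field, and both helper names are illustrative and are not taken from the patch text.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SHIFT 9	/* block-layer (bio) sectors are 512 bytes */

/* Hypothetical per-target context holding only what the sketch needs. */
struct log_ctx {
	unsigned int sectorshift;	/* log2 of device sector size, 12 for 4k */
};

/* Convert a count of 512-byte bio sectors to device-sized sectors. */
static uint64_t bio_to_dev_sectors(const struct log_ctx *lc, uint64_t sectors)
{
	assert(lc->sectorshift >= SECTOR_SHIFT);
	return sectors >> (lc->sectorshift - SECTOR_SHIFT);
}

/* Convert a count of device-sized sectors to 512-byte bio sectors. */
static uint64_t dev_to_bio_sectors(const struct log_ctx *lc, uint64_t sectors)
{
	assert(lc->sectorshift >= SECTOR_SHIFT);
	return sectors << (lc->sectorshift - SECTOR_SHIFT);
}
```

On a 4k device, 16 bio sectors correspond to 2 device sectors; on a 512-byte device both helpers are the identity. Keeping the two directions in named helpers makes it visible at each call site which unit a value is in.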
    • dm log writes: don't use all the cpu while waiting to log blocks · 0c79c620
      Authored by Josef Bacik
      The check to see if the logging kthread needs to go to sleep is wrong,
      it checks lc->pending_blocks, which will be non-0 if there are any
      blocks that are pending, whether they are ready to be logged or not.
      What we really want is to go to sleep until it's time to log blocks, so
      change this check so we do actually go to sleep in between flushes.
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
  4. 08 Sep, 2017 1 commit
    • bcache: initialize dirty stripes in flash_dev_run() · 175206cf
      Authored by Tang Junhui
      bcache uses a Proportional-Derivative (PD) controller algorithm to control
      the writeback rate to cached devices. In the PD controller algorithm, dirty
      stripes of thin flash devices should not be counted, because flash only
      volumes never write back dirty data.
      
      Currently the dirty stripe counter for a thin flash device is not
      initialized when the thin flash device starts. This means the subsequent
      calculation in the PD controller references an undefined dirty stripes
      number, and all cached devices attached to the same cache set as the thin
      flash device may get an inaccurate writeback rate.
      
      This patch calls bch_sectors_dirty_init() in flash_dev_run() to
      correctly initialize the dirty stripe counter when the thin flash device
      starts to run. This patch also makes the following parameter type change,
       -void bch_sectors_dirty_init(struct cached_dev *dc);
       +void bch_sectors_dirty_init(struct bcache_device *);
      to call this function conveniently in flash_dev_run().
      
      (Commit log is composed by Coly Li)
      Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
      Reviewed-by: Coly Li <colyli@suse.de>
      Cc: stable@vger.kernel.org
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  5. 06 Sep, 2017 14 commits
    • bcache: fix bch_hprint crash and improve output · 9276717b
      Authored by Michael Lyle
      Most importantly, solve a crash where %llu was used to format signed
      numbers.  This would cause a buffer overflow when reading sysfs
      writeback_rate_debug, as only 20 bytes were allocated for this and
      %llu writes 20 characters plus a null.
      
      Always use the units mechanism rather than having different output
      paths for simplicity.
      
      Also, correct problems with display output where 1.10 was a larger
      number than 1.09, by multiplying by 10 and then dividing by 1024 instead
      of dividing by 100.  (Remainders of >= 1000 would print as .10).
      
      Minor changes: Always display the decimal point instead of trying to
      omit it based on number of digits shown.  Decide what units to use
      based on 1000 as a threshold, not 1024 (in other words, always print
      at most 3 digits before the decimal point).
      Signed-off-by: Michael Lyle <mlyle@lyle.org>
      Reported-by: Dmitry Yu Okunev <dyokunev@ut.mephi.ru>
      Acked-by: Kent Overstreet <kent.overstreet@gmail.com>
      Reviewed-by: Coly Li <colyli@suse.de>
      Cc: stable@vger.kernel.org
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
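The formatting rules listed in the bch_hprint commit (always use a unit, always print one decimal digit, switch units at 1000, derive the decimal digit as remainder * 10 / 1024) can be sketched in user space as follows. This is an illustrative reconstruction under those stated rules, not the kernel function: the name `hprint_sketch`, the explicit buffer-length parameter, and the unit table are assumptions.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Format v with a binary unit suffix and exactly one decimal digit.
 * Returns what snprintf returns (length that would be written). */
static int hprint_sketch(char *buf, size_t len, int64_t v)
{
	static const char units[] = "?kMGTPEZY";
	/* Work on the magnitude so signed values are never fed to %llu. */
	uint64_t q = v < 0 ? -(uint64_t)v : (uint64_t)v;
	unsigned int t;
	int u = 0;

	assert(buf != NULL && len > 0);

	/* Divide by 1024 at least once, and again while more than three
	 * digits would remain; keep the last remainder for the decimal. */
	do {
		u++;
		t = (unsigned int)(q & 1023);
		q >>= 10;
	} while (q >= 1000);

	/* t * 10 / 1024 is always in 0..9, so only one digit is printed. */
	return snprintf(buf, len, "%s%llu.%u%c", v < 0 ? "-" : "",
			(unsigned long long)q, t * 10 / 1024, units[u]);
}
```

With this sketch, 1536 formats as "1.5k" and -2048 as "-2.0k". Passing the buffer length to `snprintf` bounds the write regardless of the input, which is the kind of overflow the commit describes for the 20-byte buffer.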
    • bcache: Update continue_at() documentation · 7b6a8570
      Authored by Dan Carpenter
      continue_at() doesn't have a return statement anymore.
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Acked-by: Coly Li <colyli@suse.de>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: silence static checker warning · da22f0ee
      Authored by Dan Carpenter
      In olden times, closure_return() used to have a hidden return built in.
      We removed the hidden return but forgot to add a new return here.  If
      "c" were NULL we would oops on the next line, but fortunately "c" is
      never NULL.  Let's just remove the if statement.
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Reviewed-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: fix for gc and write-back race · 9baf3097
      Authored by Tang Junhui
      gc and write-back race with each other (see my earlier email
      "bcache get stucked"):
      gc thread                               write-back thread
      |                                       |bch_writeback_thread()
      |bch_gc_thread()                        |
      |                                       |==>read_dirty()
      |==>bch_btree_gc()                      |
      |==>btree_root() //get btree root       |
      |                //node write lock      |
      |==>bch_btree_gc_root()                 |
      |                                       |==>read_dirty_submit()
      |                                       |==>write_dirty()
      |                                       |==>continue_at(cl,
      |                                       |               write_dirty_finish,
      |                                       |               system_wq);
      |                                       |==>write_dirty_finish()//execute
      |                                       |               //in system_wq
      |                                       |==>bch_btree_insert()
      |                                       |==>bch_btree_map_leaf_nodes()
      |                                       |==>__bch_btree_map_nodes()
      |                                       |==>btree_root //try to get btree
      |                                       |              //root node read
      |                                       |              //lock
      |                                       |-----stuck here
      |==>bch_btree_set_root()
      |==>bch_journal_meta()
      |==>bch_journal()
      |==>journal_try_write()
      |==>journal_write_unlocked() //journal_full(&c->journal)
      |                            //condition satisfied
      |==>continue_at(cl, journal_write, system_wq); //try to execute
      |                               //journal_write in system_wq
      |                               //but work queue is executing
      |                               //write_dirty_finish()
      |==>closure_sync(); //wait for journal_write to finish
      |                   //and wake up gc,
      |-------------stuck here
      |==>release root node write lock
      
      This patch allocates a separate workqueue for the write-back thread to
      avoid this race.
      
      (Commit log re-organized by Coly Li to pass checkpatch.pl checking)
      Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
      Acked-by: Coly Li <colyli@suse.de>
      Cc: stable@vger.kernel.org
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: increase the number of open buckets · 89b1fc54
      Authored by Tang Junhui
      Currently we allocate only 6 open buckets for each cache set, but in
      practice about 10 backend devices are usually attached to each cache
      set, and each bcache device is typically accessed by about 10 threads
      in the application layer. So 6 open buckets are too few. As a result,
      the same thread writes data to different buckets, which makes
      write-back inefficient, uses buckets inefficiently, and makes it very
      easy to run out of them.
      
      I added a debug message in bch_open_buckets_alloc() to print bucket
      allocation info, and tested with ten bcache devices on one cache set,
      each bcache device accessed by ten threads.
      
      The debug messages show that, after the modification, a bucket is more
      likely to be assigned to the same thread, and data from the same thread
      is more likely to be written to the same bucket. Usually the same thread
      always reads and writes the same backend device, so this is good for
      write-back and also improves the usage efficiency of buckets.
      Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
      Reviewed-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: Correct return value for sysfs attach errors · 77fa100f
      Authored by Tony Asleson
      If you encounter any errors in bch_cached_dev_attach it will return
      a negative error code.  The variable 'v' which stores the result is
      unsigned, thus user space sees a very large value returned for bytes
      written, which can cause incorrect user space behavior.  Use one
      signed variable throughout the function to preserve the error return
      capability.
      Signed-off-by: Tony Asleson <tasleson@redhat.com>
      Acked-by: Coly Li <colyli@suse.de>
      Cc: stable@vger.kernel.org
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
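The effect described in the sysfs attach commit can be reproduced with fixed-width types. This sketch assumes the unsigned variable was narrower than the return type (32-bit widened to 64-bit), which the commit message does not state; `attach_fails()` and both `store_*` functions are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_EINVAL 22	/* illustrative error number */

/* Hypothetical attach routine that fails with a negative error code. */
static int attach_fails(void)
{
	return -SKETCH_EINVAL;
}

/* Buggy shape: the result is parked in an unsigned 32-bit variable and
 * then widened, so the caller sees a huge positive "bytes written". */
static int64_t store_buggy(void)
{
	uint32_t v = (uint32_t)attach_fails();

	return (int64_t)v;	/* zero-extended */
}

/* Fixed shape: a signed variable keeps the error code negative. */
static int64_t store_fixed(void)
{
	int32_t v = attach_fails();

	assert(v < 0);
	return (int64_t)v;	/* sign-extended */
}
```

Under these assumptions the buggy path returns 4294967274 (2^32 - 22) while the fixed path returns -22, which is what user space needs in order to recognize the failure.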
    • bcache: correct cache_dirty_target in __update_writeback_rate() · a8394090
      Authored by Tang Junhui
      __update_writeback_rate() uses a Proportional-Derivative (PD) controller
      algorithm to control the writeback rate. A dirty target number is used in
      this PD controller to control the writeback rate. A larger target number
      makes the writeback rate smaller; conversely, a smaller target
      number makes the writeback rate larger.
      
      bcache uses the following steps to calculate the target number,
      1) cache_sectors = all-buckets-of-cache-set * buckets-size
      2) cache_dirty_target = cache_sectors * cached-device-writeback_percent
      3) target = cache_dirty_target *
      (sectors-of-cached-device/sectors-of-all-cached-devices-of-this-cache-set)
      
      The calculation at step 1) for cache_sectors is incorrect: it does
      not account for dirty blocks occupied by flash only volumes.
      
      A flash only volume can be regarded as a bcache device without a cached
      device. All data sectors allocated for it are persistent on the cache
      device and marked dirty; they are not touched by the bcache writeback and
      garbage collection code. So data blocks of flash only volumes should be
      ignored when calculating cache_sectors of the cache set.
      
      The current code does not subtract dirty sectors of flash only volumes,
      which results in a larger target number from the above 3 steps. As a
      consequence, the cache device's writeback rate is smaller than the
      correct value, and writeback is slower on all cached devices.
      
      This patch fixes the incorrect slower writeback rate by subtracting
      dirty sectors of flash only volumes in __update_writeback_rate().
      
      (Commit log composed by Coly Li to pass checkpatch.pl checking)
      Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
      Reviewed-by: Coly Li <colyli@suse.de>
      Cc: stable@vger.kernel.org
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
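The three-step target calculation from the cache_dirty_target commit, with the subtraction it introduces, can be written out as plain integer arithmetic. This is a sketch under assumptions: the function name, parameter names, percent scaling by 100, and the underflow guard are all illustrative and not taken from the kernel source.

```c
#include <assert.h>
#include <stdint.h>

/* Compute the per-device dirty target, in sectors. Pass 0 for
 * flash_only_dirty_sectors to see the uncorrected result. */
static uint64_t dirty_target_sketch(uint64_t nbuckets, uint64_t bucket_sectors,
				    uint64_t flash_only_dirty_sectors,
				    unsigned int writeback_percent,
				    uint64_t dev_sectors,
				    uint64_t all_dev_sectors)
{
	uint64_t cache_sectors, cache_dirty_target;

	assert(all_dev_sectors != 0);

	/* Step 1: all buckets of the cache set times the bucket size,
	 * minus the sectors pinned by flash only volumes (the fix). */
	cache_sectors = nbuckets * bucket_sectors;
	if (flash_only_dirty_sectors < cache_sectors)
		cache_sectors -= flash_only_dirty_sectors;
	else
		cache_sectors = 0;

	/* Step 2: apply the writeback percentage. */
	cache_dirty_target = cache_sectors * writeback_percent / 100;

	/* Step 3: scale by this device's share of all cached devices. */
	return cache_dirty_target * dev_sectors / all_dev_sectors;
}
```

For example, with 1000 buckets of 1024 sectors, 24000 flash only dirty sectors, a writeback percent of 10, and a device holding half of all cached sectors, the target is 50000 sectors; without the subtraction it would be 51200, that is, a larger target and therefore a slower writeback rate.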
    • bcache: gc does not work when triggering by manual command · 0b43f49d
      Authored by Tang Junhui
      I tried to execute the following command to trigger the gc thread:
      [root@localhost internal]# echo 1 > trigger_gc
      But it did not work. Debugging gc_should_run() shows that gc runs only
      when invalidating or when sectors_to_gc < 0. So set sectors_to_gc to -1 to
      meet the condition when gc is triggered by manual command.
      
      (Code comments added by Coly Li)
      Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
      Reviewed-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
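The condition described in the trigger_gc commit can be modeled in a few lines. This is a simplified sketch: the structure, its field names, and the plain (non-atomic) counter are assumptions made for illustration, and only the "invalidating or sectors_to_gc < 0" condition comes from the commit message.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical reduced view of a cache set. */
struct cache_set_sketch {
	bool invalidating;	/* some cache needs gc to invalidate buckets */
	long sectors_to_gc;	/* counts down as sectors are written */
};

/* gc runs only when invalidating or when the countdown went negative. */
static bool gc_should_run_sketch(const struct cache_set_sketch *c)
{
	assert(c != 0);
	return c->invalidating || c->sectors_to_gc < 0;
}

/* Manual trigger: force the countdown negative so the check passes. */
static void trigger_gc_sketch(struct cache_set_sketch *c)
{
	assert(c != 0);
	c->sectors_to_gc = -1;
}
```

Before the manual trigger, a cache set with a positive countdown and no invalidation pending does not satisfy the check, so waking the gc thread alone has no effect; after the trigger, the check passes.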
    • bcache: Don't reinvent the wheel but use existing llist API · 09b3efec
      Authored by Byungchul Park
      Although llist provides proper APIs, they are not used. Use them.
      Signed-off-by: Byungchul Park <byungchul.park@lge.com>
      Acked-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: do not subtract sectors_to_gc for bypassed IO · 69daf03a
      Authored by Tang Junhui
      Bypassed IOs use no buckets, so do not subtract them from sectors_to_gc,
      which would otherwise trigger the gc thread.
      Signed-off-by: tang.junhui <tang.junhui@zte.com.cn>
      Acked-by: Coly Li <colyli@suse.de>
      Reviewed-by: Eric Wheeler <bcache@linux.ewheeler.net>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Cc: stable@vger.kernel.org
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: fix sequential large write IO bypass · c81ffa32
      Authored by Tang Junhui
      Sequential write IOs were tested with bs=1M by FIO in writeback cache
      mode. These IOs were expected to be bypassed, but they were not.
      Debugging the code, we found this in check_should_bypass():
          if (!congested &&
              mode == CACHE_MODE_WRITEBACK &&
              op_is_write(bio_op(bio)) &&
              (bio->bi_opf & REQ_SYNC))
              goto rescale;
      This means that in writeback mode a write IO with the REQ_SYNC flag is
      not bypassed even when it is a large sequential IO. That is not the
      correct behavior, so this patch removes this code.
      Signed-off-by: tang.junhui <tang.junhui@zte.com.cn>
      Reviewed-by: Kent Overstreet <kent.overstreet@gmail.com>
      Reviewed-by: Eric Wheeler <bcache@linux.ewheeler.net>
      Cc: stable@vger.kernel.org
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: Fix leak of bdev reference · 4b758df2
      Authored by Jan Kara
      If blkdev_get_by_path() in register_bcache() fails, we try to lookup the
      block device using lookup_bdev() to detect which situation we are in to
      properly report error. However we never drop the reference returned to
      us from lookup_bdev(). Fix that.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Acked-by: Coly Li <colyli@suse.de>
      Cc: stable@vger.kernel.org
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • md/raid5: preserve STRIPE_ON_UNPLUG_LIST in break_stripe_batch_list · 184a09eb
      Authored by Dennis Yang
      In release_stripe_plug(), if a stripe_head has its STRIPE_ON_UNPLUG_LIST
      set, it indicates that this stripe_head is already in the raid5_plug_cb
      list and release_stripe() would be called instead to drop a reference
      count. Otherwise, the STRIPE_ON_UNPLUG_LIST bit would be set for this
      stripe_head and it will get queued into the raid5_plug_cb list.
      
      Since break_stripe_batch_list() did not preserve STRIPE_ON_UNPLUG_LIST,
      a stripe could be re-added to the plug list while it is still on that list
      in the following situation. If stripe_head A is added to another
      stripe_head B's batch list, A will have its
      batch_head != NULL and be added to the plug list. After that,
      stripe_head B gets handled and calls break_stripe_batch_list() to
      reset the state of all the batched stripe_heads (including A, which is
      still on the plug list) and reset their batch_head to NULL.
      If another write request comes in before the plug list gets processed
      and gets stripe_head A, A will have its batch_head == NULL
      (cleared by calling break_stripe_batch_list() on B) and be added to the
      plug list once again.
      Signed-off-by: Dennis Yang <dennisyang@qnap.com>
      Cc: stable@vger.kernel.org (v4.1+)
      Signed-off-by: Shaohua Li <shli@fb.com>
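The idea of preserving one flag while resetting the rest of a state word, as in the STRIPE_ON_UNPLUG_LIST commit, can be sketched with a plain bitmask. The bit numbers, the helper names, and the non-atomic update are all assumptions for illustration; the kernel operates on the real stripe state with its own bit definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit numbers, chosen only for this sketch. */
enum {
	SK_STRIPE_HANDLE = 0,
	SK_STRIPE_INSYNC = 1,
	SK_STRIPE_ON_UNPLUG_LIST = 2,
};

/* Reset a state word: keep only the bits in keep_mask, then set new_bits. */
static uint32_t reset_state_sketch(uint32_t state, uint32_t keep_mask,
				   uint32_t new_bits)
{
	return (state & keep_mask) | new_bits;
}

/* A stripe is queued on the plug list only if it is not already marked. */
static bool would_queue_on_plug_list(uint32_t state)
{
	return !(state & (1u << SK_STRIPE_ON_UNPLUG_LIST));
}

/* Returns true when a stripe already on the plug list would be queued
 * again after its state was reset with the given keep_mask. */
static bool double_queue_after_reset(uint32_t keep_mask)
{
	uint32_t state = (1u << SK_STRIPE_ON_UNPLUG_LIST) |
			 (1u << SK_STRIPE_HANDLE);

	assert(!would_queue_on_plug_list(state));
	state = reset_state_sketch(state, keep_mask, 1u << SK_STRIPE_INSYNC);
	return would_queue_on_plug_list(state);
}
```

With an empty keep mask the flag is lost and the stripe would be queued a second time; keeping `1u << SK_STRIPE_ON_UNPLUG_LIST` in the mask prevents that.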
    • md/raid5: fix a race condition in stripe batch · 3664847d
      Authored by Shaohua Li
      We have a race condition in the scenario below. Say we have 3 consecutive
      stripes, sh1, sh2 and sh3, where sh1 is the stripe_head of sh2 and sh3:
      
      CPU1				CPU2				CPU3
      handle_stripe(sh3)
      				stripe_add_to_batch_list(sh3)
      				-> lock(sh2, sh3)
      				-> lock batch_lock(sh1)
      				-> add sh3 to batch_list of sh1
      				-> unlock batch_lock(sh1)
      								clear_batch_ready(sh1)
      								-> lock(sh1) and batch_lock(sh1)
      								-> clear STRIPE_BATCH_READY for all stripes in batch_list
      								-> unlock(sh1) and batch_lock(sh1)
      ->clear_batch_ready(sh3)
      -->test_and_clear_bit(STRIPE_BATCH_READY, sh3)
      --->return 0 as sh->batch == NULL
      				-> sh3->batch_head = sh1
      				-> unlock (sh2, sh3)
      
      On CPU1, handle_stripe will continue to handle sh3 even though it is in
      the batch stripe list of sh1. By moving the sh3->batch_head assignment
      under batch_lock, we make it impossible to clear STRIPE_BATCH_READY
      before batch_head is set.
      
      Thanks to Stephane for helping debug this tricky issue.
      Reported-and-tested-by: Stephane Thiell <sthiell@stanford.edu>
      Cc: stable@vger.kernel.org (v4.1+)
      Signed-off-by: Shaohua Li <shli@fb.com>
  6. 01 Sep, 2017 1 commit
    • md/bitmap: disable bitmap_resize for file-backed bitmaps. · e8a27f83
      Authored by NeilBrown
      bitmap_resize() does not work for file-backed bitmaps.
      The buffer_heads are allocated and initialized when
      the bitmap is read from the file, but resize doesn't
      read from the file, it loads from the internal bitmap.
      When it comes time to write the new bitmap, the bh is
      non-existent and we crash.
      
      The common case when growing an array involves making the array larger,
      and that normally means making the bitmap larger.  Doing
      that inside the kernel is possible, but would need more code.
      It is probably easier to require people who use file-backed
      bitmaps to remove them and re-add after a reshape.
      
      So this patch disables the resizing of arrays which have
      file-backed bitmaps.  This is better than crashing.
      Reported-by: Zhilong Liu <zlliu@suse.com>
      Fixes: d60b479d ("md/bitmap: add bitmap_resize function to allow bitmap resizing.")
      Cc: stable@vger.kernel.org (v3.5+).
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
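The approach taken in the bitmap_resize commit, refusing an unsupported operation up front instead of crashing later, can be sketched as a simple guard. The structure, the `file_backed` flag, and the choice of `-EINVAL` are assumptions for illustration; the commit message does not say which error code is returned.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical reduced bitmap descriptor. */
struct bitmap_sketch {
	bool file_backed;	/* bitmap lives in a file, not in the array */
	unsigned long chunks;	/* current number of bitmap chunks */
};

/* Resize the bitmap, or refuse when it is file-backed. */
static int bitmap_resize_sketch(struct bitmap_sketch *b,
				unsigned long new_chunks)
{
	assert(b != 0);

	/* Refusing here is better than crashing at write-out time. */
	if (b->file_backed)
		return -EINVAL;

	b->chunks = new_chunks;
	return 0;
}
```

A caller with a file-backed bitmap gets an error and an unchanged bitmap, matching the commit's suggestion that such users remove the bitmap and re-add it after the reshape.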
  7. 28 Aug, 2017 14 commits
  8. 26 Aug, 2017 2 commits