1. 19 3月, 2018 1 次提交
    • C
      bcache: fix cached_dev->count usage for bch_cache_set_error() · 804f3c69
      Coly Li 提交于
      When bcache metadata I/O fails, bcache will call bch_cache_set_error()
      to retire the whole cache set. The expected behavior to retire a cache
      set is to unregister the cache set, and unregister all backing device
      attached to this cache set, then remove sysfs entries of the cache set
      and all attached backing devices, finally release memory of structs
      cache_set, cache, cached_dev and bcache_device.
      
      In my testing when journal I/O failure triggered by disconnected cache
      device, sometimes the cache set cannot be retired, and its sysfs
      entry /sys/fs/bcache/<uuid> still exits and the backing device also
      references it. This is not expected behavior.
      
      When metadata I/O failes, the call senquence to retire whole cache set is,
              bch_cache_set_error()
              bch_cache_set_unregister()
              bch_cache_set_stop()
              __cache_set_unregister()     <- called as callback by calling
                                              clousre_queue(&c->caching)
              cache_set_flush()            <- called as a callback when refcount
                                              of cache_set->caching is 0
              cache_set_free()             <- called as a callback when refcount
                                              of catch_set->cl is 0
              bch_cache_set_release()      <- called as a callback when refcount
                                              of catch_set->kobj is 0
      
      I find if kernel thread bch_writeback_thread() quits while-loop when
      kthread_should_stop() is true and searched_full_index is false, clousre
      callback cache_set_flush() set by continue_at() will never be called. The
      result is, bcache fails to retire whole cache set.
      
      cache_set_flush() will be called when refcount of closure c->caching is 0,
      and in function bcache_device_detach() refcount of closure c->caching is
      released to 0 by clousre_put(). In metadata error code path, function
      bcache_device_detach() is called by cached_dev_detach_finish(). This is a
      callback routine being called when cached_dev->count is 0. This refcount
      is decreased by cached_dev_put().
      
      The above dependence indicates, cache_set_flush() will be called when
      refcount of cache_set->cl is 0, and refcount of cache_set->cl to be 0
      when refcount of cache_dev->count is 0.
      
      The reason why sometimes cache_dev->count is not 0 (when metadata I/O fails
      and bch_cache_set_error() called) is, in bch_writeback_thread(), refcount
      of cache_dev is not decreased properly.
      
      In bch_writeback_thread(), cached_dev_put() is called only when
      searched_full_index is true and cached_dev->writeback_keys is empty, a.k.a
      there is no dirty data on cache. In most of run time it is correct, but
      when bch_writeback_thread() quits the while-loop while cache is still
      dirty, current code forget to call cached_dev_put() before this kernel
      thread exits. This is why sometimes cache_set_flush() is not executed and
      cache set fails to be retired.
      
      The reason to call cached_dev_put() in bch_writeback_rate() is, when the
      cache device changes from clean to dirty, cached_dev_get() is called, to
      make sure during writeback operatiions both backing and cache devices
      won't be released.
      
      Adding following code in bch_writeback_thread() does not work,
         static int bch_writeback_thread(void *arg)
              }
      
      +       if (atomic_read(&dc->has_dirty))
      +               cached_dev_put()
      +
              return 0;
       }
      because writeback kernel thread can be waken up and start via sysfs entry:
              echo 1 > /sys/block/bcache<N>/bcache/writeback_running
      It is difficult to check whether backing device is dirty without race and
      extra lock. So the above modification will introduce potential refcount
      underflow in some conditions.
      
      The correct fix is, to take cached dev refcount when creating the kernel
      thread, and put it before the kernel thread exits. Then bcache does not
      need to take a cached dev refcount when cache turns from clean to dirty,
      or to put a cached dev refcount when cache turns from ditry to clean. The
      writeback kernel thread is alwasy safe to reference data structure from
      cache set, cache and cached device (because a refcount of cache device is
      taken for it already), and no matter the kernel thread is stopped by I/O
      errors or system reboot, cached_dev->count can always be used correctly.
      
      The patch is simple, but understanding how it works is quite complicated.
      
      Changelog:
      v2: set dc->writeback_thread to NULL in this patch, as suggested by Hannes.
      v1: initial version for review.
      Signed-off-by: NColy Li <colyli@suse.de>
      Reviewed-by: NHannes Reinecke <hare@suse.com>
      Reviewed-by: NMichael Lyle <mlyle@lyle.org>
      Cc: Michael Lyle <mlyle@lyle.org>
      Cc: Junhui Tang <tang.junhui@zte.com.cn>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      804f3c69
  2. 09 3月, 2018 1 次提交
  3. 28 2月, 2018 1 次提交
    • C
      bcache: correct flash only vols (check all uuids) · 02aa8a8b
      Coly Li 提交于
      Commit 2831231d ("bcache: reduce cache_set devices iteration by
      devices_max_used") adds c->devices_max_used to reduce iteration of
      c->uuids elements, this value is updated in bcache_device_attach().
      
      But for flash only volume, when calling flash_devs_run(), the function
      bcache_device_attach() is not called yet and c->devices_max_used is not
      updated. The unexpected result is, the flash only volume won't be run
      by flash_devs_run().
      
      This patch fixes the issue by iterate all c->uuids elements in
      flash_devs_run(). c->devices_max_used will be updated properly when
      bcache_device_attach() gets called.
      
      [mlyle: commit subject edited for character limit]
      
      Fixes: 2831231d ("bcache: reduce cache_set devices iteration by devices_max_used")
      Reported-by: NTang Junhui <tang.junhui@zte.com.cn>
      Signed-off-by: NColy Li <colyli@suse.de>
      Reviewed-by: NMichael Lyle <mlyle@lyle.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      02aa8a8b
  4. 08 2月, 2018 3 次提交
    • T
      bcache: fix for data collapse after re-attaching an attached device · 73ac105b
      Tang Junhui 提交于
      back-end device sdm has already attached a cache_set with ID
      f67ebe1f-f8bc-4d73-bfe5-9dc88607f119, then try to attach with
      another cache set, and it returns with an error:
      [root]# cd /sys/block/sdm/bcache
      [root]# echo 5ccd0a63-148e-48b8-afa2-aca9cbd6279f > attach
      -bash: echo: write error: Invalid argument
      
      After that, execute a command to modify the label of bcache
      device:
      [root]# echo data_disk1 > label
      
      Then we reboot the system, when the system power on, the back-end
      device can not attach to cache_set, a messages show in the log:
      Feb  5 12:05:52 ceph152 kernel: [922385.508498] bcache:
      bch_cached_dev_attach() couldn't find uuid for sdm in set
      
      In sysfs_attach(), dc->sb.set_uuid was assigned to the value
      which input through sysfs, no matter whether it is success
      or not in bch_cached_dev_attach(). For example, If the back-end
      device has already attached to an cache set, bch_cached_dev_attach()
      would fail, but dc->sb.set_uuid was changed. Then modify the
      label of bcache device, it will call bch_write_bdev_super(),
      which would write the dc->sb.set_uuid to the super block, so we
      record a wrong cache set ID in the super block, after the system
      reboot, the cache set couldn't find the uuid of the back-end
      device, so the bcache device couldn't exist and use any more.
      
      In this patch, we don't assigned cache set ID to dc->sb.set_uuid
      in sysfs_attach() directly, but input it into bch_cached_dev_attach(),
      and assigned dc->sb.set_uuid to the cache set ID after the back-end
      device attached to the cache set successful.
      Signed-off-by: NTang Junhui <tang.junhui@zte.com.cn>
      Reviewed-by: NMichael Lyle <mlyle@lyle.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      73ac105b
    • T
      bcache: fix for allocator and register thread race · 682811b3
      Tang Junhui 提交于
      After long time running of random small IO writing,
      I reboot the machine, and after the machine power on,
      I found bcache got stuck, the stack is:
      [root@ceph153 ~]# cat /proc/2510/task/*/stack
      [<ffffffffa06b2455>] closure_sync+0x25/0x90 [bcache]
      [<ffffffffa06b6be8>] bch_journal+0x118/0x2b0 [bcache]
      [<ffffffffa06b6dc7>] bch_journal_meta+0x47/0x70 [bcache]
      [<ffffffffa06be8f7>] bch_prio_write+0x237/0x340 [bcache]
      [<ffffffffa06a8018>] bch_allocator_thread+0x3c8/0x3d0 [bcache]
      [<ffffffff810a631f>] kthread+0xcf/0xe0
      [<ffffffff8164c318>] ret_from_fork+0x58/0x90
      [<ffffffffffffffff>] 0xffffffffffffffff
      [root@ceph153 ~]# cat /proc/2038/task/*/stack
      [<ffffffffa06b1abd>] __bch_btree_map_nodes+0x12d/0x150 [bcache]
      [<ffffffffa06b1bd1>] bch_btree_insert+0xf1/0x170 [bcache]
      [<ffffffffa06b637f>] bch_journal_replay+0x13f/0x230 [bcache]
      [<ffffffffa06c75fe>] run_cache_set+0x79a/0x7c2 [bcache]
      [<ffffffffa06c0cf8>] register_bcache+0xd48/0x1310 [bcache]
      [<ffffffff812f702f>] kobj_attr_store+0xf/0x20
      [<ffffffff8125b216>] sysfs_write_file+0xc6/0x140
      [<ffffffff811dfbfd>] vfs_write+0xbd/0x1e0
      [<ffffffff811e069f>] SyS_write+0x7f/0xe0
      [<ffffffff8164c3c9>] system_call_fastpath+0x16/0x1
      The stack shows the register thread and allocator thread
      were getting stuck when registering cache device.
      
      I reboot the machine several times, the issue always
      exsit in this machine.
      
      I debug the code, and found the call trace as bellow:
      register_bcache()
         ==>run_cache_set()
            ==>bch_journal_replay()
               ==>bch_btree_insert()
                  ==>__bch_btree_map_nodes()
                     ==>btree_insert_fn()
                        ==>btree_split() //node need split
                           ==>btree_check_reserve()
      In btree_check_reserve(), It will check if there is enough buckets
      of RESERVE_BTREE type, since allocator thread did not work yet, so
      no buckets of RESERVE_BTREE type allocated, so the register thread
      waits on c->btree_cache_wait, and goes to sleep.
      
      Then the allocator thread initialized, the call trace is bellow:
      bch_allocator_thread()
      ==>bch_prio_write()
         ==>bch_journal_meta()
            ==>bch_journal()
               ==>journal_wait_for_write()
      In journal_wait_for_write(), It will check if journal is full by
      journal_full(), but the long time random small IO writing
      causes the exhaustion of journal buckets(journal.blocks_free=0),
      In order to release the journal buckets,
      the allocator calls btree_flush_write() to flush keys to
      btree nodes, and waits on c->journal.wait until btree nodes writing
      over or there has already some journal buckets space, then the
      allocator thread goes to sleep. but in btree_flush_write(), since
      bch_journal_replay() is not finished, so no btree nodes have journal
      (condition "if (btree_current_write(b)->journal)" never satisfied),
      so we got no btree node to flush, no journal bucket released,
      and allocator sleep all the times.
      
      Through the above analysis, we can see that:
      1) Register thread wait for allocator thread to allocate buckets of
         RESERVE_BTREE type;
      2) Alloctor thread wait for register thread to replay journal, so it
         can flush btree nodes and get journal bucket.
         then they are all got stuck by waiting for each other.
      
      Hua Rui provided a patch for me, by allocating some buckets of
      RESERVE_BTREE type in advance, so the register thread can get bucket
      when btree node splitting and no need to waiting for the allocator
      thread. I tested it, it has effect, and register thread run a step
      forward, but finally are still got stuck, the reason is only 8 bucket
      of RESERVE_BTREE type were allocated, and in bch_journal_replay(),
      after 2 btree nodes splitting, only 4 bucket of RESERVE_BTREE type left,
      then btree_check_reserve() is not satisfied anymore, so it goes to sleep
      again, and in the same time, alloctor thread did not flush enough btree
      nodes to release a journal bucket, so they all got stuck again.
      
      So we need to allocate more buckets of RESERVE_BTREE type in advance,
      but how much is enough?  By experience and test, I think it should be
      as much as journal buckets. Then I modify the code as this patch,
      and test in the machine, and it works.
      
      This patch modified base on Hua Rui’s patch, and allocate more buckets
      of RESERVE_BTREE type in advance to avoid register thread and allocate
      thread going to wait for each other.
      
      [patch v2] ca->sb.njournal_buckets would be 0 in the first time after
      cache creation, and no journal exists, so just 8 btree buckets is OK.
      Signed-off-by: NHua Rui <huarui.dev@gmail.com>
      Signed-off-by: NTang Junhui <tang.junhui@zte.com.cn>
      Reviewed-by: NMichael Lyle <mlyle@lyle.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      682811b3
    • C
      bcache: set error_limit correctly · 7ba0d830
      Coly Li 提交于
      Struct cache uses io_errors for two purposes,
      - Error decay: when cache set error_decay is set, io_errors is used to
        generate a small piece of delay when I/O error happens.
      - I/O errors counter: in order to generate big enough value for error
        decay, I/O errors counter value is stored by left shifting 20 bits (a.k.a
        IO_ERROR_SHIFT).
      
      In function bch_count_io_errors(), if I/O errors counter reaches cache set
      error limit, bch_cache_set_error() will be called to retire the whold cache
      set. But current code is problematic when checking the error limit, see the
      following code piece from bch_count_io_errors(),
      
       90     if (error) {
       91             char buf[BDEVNAME_SIZE];
       92             unsigned errors = atomic_add_return(1 << IO_ERROR_SHIFT,
       93                                                 &ca->io_errors);
       94             errors >>= IO_ERROR_SHIFT;
       95
       96             if (errors < ca->set->error_limit)
       97                     pr_err("%s: IO error on %s, recovering",
       98                            bdevname(ca->bdev, buf), m);
       99             else
      100                     bch_cache_set_error(ca->set,
      101                                         "%s: too many IO errors %s",
      102                                         bdevname(ca->bdev, buf), m);
      103     }
      
      At line 94, errors is right shifting IO_ERROR_SHIFT bits, now it is real
      errors counter to compare at line 96. But ca->set->error_limit is initia-
      lized with an amplified value in bch_cache_set_alloc(),
      1545         c->error_limit  = 8 << IO_ERROR_SHIFT;
      
      It means by default, in bch_count_io_errors(), before 8<<20 errors happened
      bch_cache_set_error() won't be called to retire the problematic cache
      device. If the average request size is 64KB, it means bcache won't handle
      failed device until 512GB data is requested. This is too large to be an I/O
      threashold. So I believe the correct error limit should be much less.
      
      This patch sets default cache set error limit to 8, then in
      bch_count_io_errors() when errors counter reaches 8 (if it is default
      value), function bch_cache_set_error() will be called to retire the whole
      cache set. This patch also removes bits shifting when store or show
      io_error_limit value via sysfs interface.
      
      Nowadays most of SSDs handle internal flash failure automatically by LBA
      address re-indirect mapping. If an I/O error can be observed by upper layer
      code, it will be a notable error because that SSD can not re-indirect
      map the problematic LBA address to an available flash block. This situation
      indicates the whole SSD will be failed very soon. Therefore setting 8 as
      the default io error limit value makes sense, it is enough for most of
      cache devices.
      
      Changelog:
      v2: add reviewed-by from Hannes.
      v1: initial version for review.
      Signed-off-by: NColy Li <colyli@suse.de>
      Reviewed-by: NHannes Reinecke <hare@suse.com>
      Reviewed-by: NTang Junhui <tang.junhui@zte.com.cn>
      Reviewed-by: NMichael Lyle <mlyle@lyle.org>
      Cc: Junhui Tang <tang.junhui@zte.com.cn>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      7ba0d830
  5. 09 1月, 2018 3 次提交
    • C
      bcache: fix misleading error message in bch_count_io_errors() · 5138ac67
      Coly Li 提交于
      Bcache only does recoverable I/O for read operations by calling
      cached_dev_read_error(). For write opertions there is no I/O recovery for
      failed requests.
      
      But in bch_count_io_errors() no matter read or write I/Os, before errors
      counter reaches io error limit, pr_err() always prints "IO error on %,
      recoverying". For write requests this information is misleading, because
      there is no I/O recovery at all.
      
      This patch adds a parameter 'is_read' to bch_count_io_errors(), and only
      prints "recovering" by pr_err() when the bio direction is READ.
      Signed-off-by: NColy Li <colyli@suse.de>
      Reviewed-by: NMichael Lyle <mlyle@lyle.org>
      Reviewed-by: NTang Junhui <tang.junhui@zte.com.cn>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      5138ac67
    • C
      bcache: reduce cache_set devices iteration by devices_max_used · 2831231d
      Coly Li 提交于
      Member devices of struct cache_set is used to reference all attached
      bcache devices to this cache set. If it is treated as array of pointers,
      size of devices[] is indicated by member nr_uuids of struct cache_set.
      
      nr_uuids is calculated in drivers/md/super.c:bch_cache_set_alloc(),
      	bucket_bytes(c) / sizeof(struct uuid_entry)
      Bucket size is determined by user space tool "make-bcache", by default it
      is 1024 sectors (defined in bcache-tools/make-bcache.c:main()). So default
      nr_uuids value is 4096 from the above calculation.
      
      Every time when bcache code iterates bcache devices of a cache set, all
      the 4096 pointers are checked even only 1 bcache device is attached to the
      cache set, that's a wast of time and unncessary.
      
      This patch adds a member devices_max_used to struct cache_set. Its value
      is 1 + the maximum used index of devices[] in a cache set. When iterating
      all valid bcache devices of a cache set, use c->devices_max_used in
      for-loop may reduce a lot of useless checking.
      
      Personally, my motivation of this patch is not for performance, I use it
      in bcache debugging, which helps me to narrow down the scape to check
      valid bcached devices of a cache set.
      Signed-off-by: NColy Li <colyli@suse.de>
      Reviewed-by: NMichael Lyle <mlyle@lyle.org>
      Reviewed-by: NTang Junhui <tang.junhui@zte.com.cn>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      2831231d
    • T
      bcache: stop writeback thread after detaching · 8d29c442
      Tang Junhui 提交于
      Currently, when a cached device detaching from cache, writeback thread is
      not stopped, and writeback_rate_update work is not canceled. For example,
      after the following command:
      echo 1 >/sys/block/sdb/bcache/detach
      you can still see the writeback thread. Then you attach the device to the
      cache again, bcache will create another writeback thread, for example,
      after below command:
      echo  ba0fb5cd-658a-4533-9806-6ce166d883b9 > /sys/block/sdb/bcache/attach
      then you will see 2 writeback threads.
      This patch stops writeback thread and cancels writeback_rate_update work
      when cached device detaching from cache.
      
      Compare with patch v1, this v2 patch moves code down into the register
      lock for safety in case of any future changes as Coly and Mike suggested.
      
      [edit by mlyle: commit log spelling/formatting]
      Signed-off-by: NTang Junhui <tang.junhui@zte.com.cn>
      Reviewed-by: NMichael Lyle <mlyle@lyle.org>
      Signed-off-by: NMichael Lyle <mlyle@lyle.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      8d29c442
  6. 07 1月, 2018 1 次提交
  7. 31 10月, 2017 2 次提交
  8. 16 10月, 2017 2 次提交
    • Y
      bcache: Remove redundant set_capacity · e89d6759
      Yijing Wang 提交于
      set_capacity() has been called in bcache_device_init(),
      remove the redundant one.
      Signed-off-by: NYijing Wang <wangyijing@huawei.com>
      Reviewed-by: NEric Wheeler <bcache@linux.ewheeler.net>
      Acked-by: NColy Li <colyli@suse.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      e89d6759
    • C
      bcache: rewrite multiple partitions support · 1dbe32ad
      Coly Li 提交于
      Current partition support of bcache is confusing and buggy. It tries to
      trace non-continuous device minor numbers by an ida bit string, and
      mistakenly mixed bcache device index with minor numbers. This design
      generates several negative results,
      - Index of bcache device name is not consecutive under /dev/. If there are
        3 bcache devices, they name will be,
        /dev/bcache0, /dev/bcache16, /dev/bcache32
        Only bcache code indexes bcache device name is such an interesting way.
      - First minor number of each bcache device is traced by ida bit string.
        One bcache device will occupy 16 bits, this is not a good idea. Indeed
        only one bit is enough.
      - Because minor number and bcache device index are mixed, a device index
        is allocated by ida_simple_get(), but an first minor number is sent into
        ida_simple_remove() to release the device. It confused original author
        too.
      
      Root cause of the above errors is, bcache code should not handle device
      minor numbers at all! A standard process to support multiple partitions in
      Linux kernel is,
      - Device driver provides major device number, and indexes multiple device
        instances.
      - Device driver does not allocat nor trace device minor number, only
        provides a first minor number of a given device instance, and sets how
        many minor numbers (paritions) the device instance may have.
      All rested stuffs are handled by block layer code, most of the details can
      be found from block/{genhd, partition-generic}.c files.
      
      This patch re-writes multiple partitions support for bcache. It makes
      whole things to be more clear, and uses ida bit string in a more efficeint
      way.
      - Ida bit string only traces bcache device index, not minor number. For a
        bcache device with 128 partitions, only one bit in ida bit string is
        enough.
      - Device minor number and device index are separated in concept. Device
        index is used for /dev node naming, and ida bit string trace. Minor
        number is calculated from device index and only used to initialize
        first_minor of a bcache device.
      - It does not follow any standard for 16 partitions on a bcache device.
        This patch sets 128 partitions on single bcache device at max, this is
        the limitation from GPT (GUID Partition Table) and supported by fdisk.
      
      Considering a typical device minor number is 20 bits width, each bcache
      device may have 128 partitions (7 bits), there can be 8192 bcache devices
      existing on system. For most common deployment for a single server in
      now days, it should be enough.
      
      [minor spelling fixes in commit message by Michael Lyle]
      Signed-off-by: NColy Li <colyli@suse.de>
      Cc: Eric Wheeler <bcache@lists.ewheeler.net>
      Cc: Junhui Tang <tang.junhui@zte.com.cn>
      Reviewed-by: NMichael Lyle <mlyle@lyle.org>
      Signed-off-by: NMichael Lyle <mlyle@lyle.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      1dbe32ad
  9. 08 9月, 2017 1 次提交
    • T
      bcache: initialize dirty stripes in flash_dev_run() · 175206cf
      Tang Junhui 提交于
      bcache uses a Proportion-Differentiation Controller algorithm to control
      writeback rate to cached devices. In the PD controller algorithm, dirty
      stripes of thin flash device should not be counted in, because flash only
      volumes never write back dirty data.
      
      Currently dirty stripe counter for thin flash device is not initialized
      when the thin flash device starts. Which means the following calculation
      in PD controller will reference an undefined dirty stripes number, and
      all cached devices attached to the same cache set where the thin flash
      device lies on may have an inaccurate writeback rate.
      
      This patch calles bch_sectors_dirty_init() in flash_dev_run(), to
      correctly initialize dirty stripe counter when the thin flash device
      starts to run. This patch also does following parameter data type change,
       -void bch_sectors_dirty_init(struct cached_dev *dc);
       +void bch_sectors_dirty_init(struct bcache_device *);
      to call this function conveniently in flash_dev_run().
      
      (Commit log is composed by Coly Li)
      Signed-off-by: NTang Junhui <tang.junhui@zte.com.cn>
      Reviewed-by: NColy Li <colyli@suse.de>
      Cc: stable@vger.kernel.org
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      175206cf
  10. 06 9月, 2017 3 次提交
    • D
      bcache: silence static checker warning · da22f0ee
      Dan Carpenter 提交于
      In olden times, closure_return() used to have a hidden return built in.
      We removed the hidden return but forgot to add a new return here.  If
      "c" were NULL we would oops on the next line, but fortunately "c" is
      never NULL.  Let's just remove the if statement.
      Signed-off-by: NDan Carpenter <dan.carpenter@oracle.com>
      Reviewed-by: NColy Li <colyli@suse.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      da22f0ee
    • T
      bcache: fix for gc and write-back race · 9baf3097
      Tang Junhui 提交于
      gc and write-back get raced (see the email "bcache get stucked" I sended
      before):
      gc thread                               write-back thread
      |                                       |bch_writeback_thread()
      |bch_gc_thread()                        |
      |                                       |==>read_dirty()
      |==>bch_btree_gc()                      |
      |==>btree_root() //get btree root       |
      |                //node write locker    |
      |==>bch_btree_gc_root()                 |
      |                                       |==>read_dirty_submit()
      |                                       |==>write_dirty()
      |                                       |==>continue_at(cl,
      |                                       |               write_dirty_finish,
      |                                       |               system_wq);
      |                                       |==>write_dirty_finish()//excute
      |                                       |               //in system_wq
      |                                       |==>bch_btree_insert()
      |                                       |==>bch_btree_map_leaf_nodes()
      |                                       |==>__bch_btree_map_nodes()
      |                                       |==>btree_root //try to get btree
      |                                       |              //root node read
      |                                       |              //lock
      |                                       |-----stuck here
      |==>bch_btree_set_root()
      |==>bch_journal_meta()
      |==>bch_journal()
      |==>journal_try_write()
      |==>journal_write_unlocked() //journal_full(&c->journal)
      |                            //condition satisfied
      |==>continue_at(cl, journal_write, system_wq); //try to excute
      |                               //journal_write in system_wq
      |                               //but work queue is excuting
      |                               //write_dirty_finish()
      |==>closure_sync(); //wait journal_write execute
      |                   //over and wake up gc,
      |-------------stuck here
      |==>release root node write locker
      
      This patch alloc a separate work-queue for write-back thread to avoid such
      race.
      
      (Commit log re-organized by Coly Li to pass checkpatch.pl checking)
      Signed-off-by: NTang Junhui <tang.junhui@zte.com.cn>
      Acked-by: NColy Li <colyli@suse.de>
      Cc: stable@vger.kernel.org
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      9baf3097
    • J
      bcache: Fix leak of bdev reference · 4b758df2
      Jan Kara 提交于
      If blkdev_get_by_path() in register_bcache() fails, we try to lookup the
      block device using lookup_bdev() to detect which situation we are in to
      properly report error. However we never drop the reference returned to
      us from lookup_bdev(). Fix that.
      Signed-off-by: NJan Kara <jack@suse.cz>
      Acked-by: NColy Li <colyli@suse.de>
      Cc: stable@vger.kernel.org
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      4b758df2
  11. 24 8月, 2017 1 次提交
    • C
      block: replace bi_bdev with a gendisk pointer and partitions index · 74d46992
      Christoph Hellwig 提交于
      This way we don't need a block_device structure to submit I/O.  The
      block_device has different life time rules from the gendisk and
      request_queue and is usually only available when the block device node
      is open.  Other callers need to explicitly create one (e.g. the lightnvm
      passthrough code, or the new nvme multipathing code).
      
      For the actual I/O path all that we need is the gendisk, which exists
      once per block device.  But given that the block layer also does
      partition remapping we additionally need a partition index, which is
      used for said remapping in generic_make_request.
      
      Note that all the block drivers generally want request_queue or
      sometimes the gendisk, so this removes a layer of indirection all
      over the stack.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      74d46992
  12. 19 6月, 2017 2 次提交
    • N
      blk: make the bioset rescue_workqueue optional. · 47e0fb46
      NeilBrown 提交于
      This patch converts bioset_create() to not create a workqueue by
      default, so alloctions will never trigger punt_bios_to_rescuer().  It
      also introduces a new flag BIOSET_NEED_RESCUER which tells
      bioset_create() to preserve the old behavior.
      
      All callers of bioset_create() that are inside block device drivers,
      are given the BIOSET_NEED_RESCUER flag.
      
      biosets used by filesystems or other top-level users do not
      need rescuing as the bio can never be queued behind other
      bios.  This includes fs_bio_set, blkdev_dio_pool,
      btrfs_bioset, xfs_ioend_bioset, and one allocated by
      target_core_iblock.c.
      
      biosets used by md/raid do not need rescuing as
      their usage was recently audited and revised to never
      risk deadlock.
      
      It is hoped that most, if not all, of the remaining biosets
      can end up being the non-rescued version.
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Credit-to: Ming Lei <ming.lei@redhat.com> (minor fixes)
      Reviewed-by: NMing Lei <ming.lei@redhat.com>
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      47e0fb46
    • N
      blk: replace bioset_create_nobvec() with a flags arg to bioset_create() · 011067b0
      NeilBrown 提交于
      "flags" arguments are often seen as good API design as they allow
      easy extensibility.
      bioset_create_nobvec() is implemented internally as a variation in
      flags passed to __bioset_create().
      
      To support future extension, make the internal structure part of the
      API.
      i.e. add a 'flags' argument to bioset_create() and discard
      bioset_create_nobvec().
      
      Note that the bio_split allocations in drivers/md/raid* do not need
      the bvec mempool - they should have used bioset_create_nobvec().
      Suggested-by: NChristoph Hellwig <hch@infradead.org>
      Reviewed-by: NChristoph Hellwig <hch@infradead.org>
      Reviewed-by: NMing Lei <ming.lei@redhat.com>
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      011067b0
  13. 09 6月, 2017 1 次提交
  14. 09 5月, 2017 1 次提交
  15. 02 2月, 2017 1 次提交
  16. 18 12月, 2016 2 次提交
  17. 22 11月, 2016 1 次提交
  18. 01 11月, 2016 1 次提交
  19. 19 8月, 2016 3 次提交
    • E
      bcache: pr_err: more meaningful error message when nr_stripes is invalid · 90706094
      Eric Wheeler 提交于
      The original error was thought to be corruption, but was actually caused by:
      	make-bcache --data-offset N
      where N was in bytes and should have been in sectors.  While userspace
      tools should be updated to check --data-offset beyond end of volume,
      hopefully this will help others that might not have noticed the units.
      Signed-off-by: NEric Wheeler <bcache@linux.ewheeler.net>
      Cc: Kent Overstreet <kent.overstreet@gmail.com>
      90706094
    • K
      bcache: RESERVE_PRIO is too small by one when prio_buckets() is a power of two. · acc9cf8c
      Kent Overstreet 提交于
      This patch fixes a cachedev registration-time allocation deadlock.
      This can deadlock on boot if your initrd auto-registeres bcache devices:
      
      Allocator thread:
      [  720.727614] INFO: task bcache_allocato:3833 blocked for more than 120 seconds.
      [  720.732361]  [<ffffffff816eeac7>] schedule+0x37/0x90
      [  720.732963]  [<ffffffffa05192b8>] bch_bucket_alloc+0x188/0x360 [bcache]
      [  720.733538]  [<ffffffff810e6950>] ? prepare_to_wait_event+0xf0/0xf0
      [  720.734137]  [<ffffffffa05302bd>] bch_prio_write+0x19d/0x340 [bcache]
      [  720.734715]  [<ffffffffa05190bf>] bch_allocator_thread+0x3ff/0x470 [bcache]
      [  720.735311]  [<ffffffff816ee41c>] ? __schedule+0x2dc/0x950
      [  720.735884]  [<ffffffffa0518cc0>] ? invalidate_buckets+0x980/0x980 [bcache]
      
      Registration thread:
      [  720.710403] INFO: task bash:3531 blocked for more than 120 seconds.
      [  720.715226]  [<ffffffff816eeac7>] schedule+0x37/0x90
      [  720.715805]  [<ffffffffa05235cd>] __bch_btree_map_nodes+0x12d/0x150 [bcache]
      [  720.716409]  [<ffffffffa0522d30>] ? bch_btree_insert_check_key+0x1c0/0x1c0 [bcache]
      [  720.717008]  [<ffffffffa05236e4>] bch_btree_insert+0xf4/0x170 [bcache]
      [  720.717586]  [<ffffffff810e6950>] ? prepare_to_wait_event+0xf0/0xf0
      [  720.718191]  [<ffffffffa0527d9a>] bch_journal_replay+0x14a/0x290 [bcache]
      [  720.718766]  [<ffffffff810cc90d>] ? ttwu_do_activate.constprop.94+0x5d/0x70
      [  720.719369]  [<ffffffff810cf684>] ? try_to_wake_up+0x1d4/0x350
      [  720.719968]  [<ffffffffa05317d0>] run_cache_set+0x580/0x8e0 [bcache]
      [  720.720553]  [<ffffffffa053302e>] register_bcache+0xe2e/0x13b0 [bcache]
      [  720.721153]  [<ffffffff81354cef>] kobj_attr_store+0xf/0x20
      [  720.721730]  [<ffffffff812a2dad>] sysfs_kf_write+0x3d/0x50
      [  720.722327]  [<ffffffff812a225a>] kernfs_fop_write+0x12a/0x180
      [  720.722904]  [<ffffffff81225177>] __vfs_write+0x37/0x110
      [  720.723503]  [<ffffffff81228048>] ? __sb_start_write+0x58/0x110
      [  720.724100]  [<ffffffff812cedb3>] ? security_file_permission+0x23/0xa0
      [  720.724675]  [<ffffffff812258a9>] vfs_write+0xa9/0x1b0
      [  720.725275]  [<ffffffff8102479c>] ? do_audit_syscall_entry+0x6c/0x70
      [  720.725849]  [<ffffffff81226755>] SyS_write+0x55/0xd0
      [  720.726451]  [<ffffffff8106a390>] ? do_page_fault+0x30/0x80
      [  720.727045]  [<ffffffff816f2cae>] system_call_fastpath+0x12/0x71
      
      The fifo code in upstream bcache can't use the last element in the buffer,
      which was the cause of the bug: if you asked for a power of two size,
      it'd give you a fifo that could hold one less than what you asked for
      rather than allocating a buffer twice as big.
      Signed-off-by: NKent Overstreet <kent.overstreet@gmail.com>
      Tested-by: NEric Wheeler <bcache@linux.ewheeler.net>
      Cc: stable@vger.kernel.org
      acc9cf8c
    • E
      bcache: register_bcache(): call blkdev_put() when cache_alloc() fails · d9dc1702
      Eric Wheeler 提交于
      register_cache() is supposed to return an error string on error so that
      register_bcache() will will blkdev_put and cleanup other user counters,
      but it does not set 'char *err' when cache_alloc() fails (eg, due to
      memory pressure) and thus register_bcache() performs no cleanup.
      
      register_bcache() <----------\  <- no jump to err_close, no blkdev_put()
         |                         |
         +->register_cache()       |  <- fails to set char *err
               |                   |
               +->cache_alloc() ---/  <- returns error
      
      This patch sets `char *err` for this failure case so that register_cache()
      will cause register_bcache() to correctly jump to err_close and do
      cleanup.  This was tested under OOM conditions that triggered the bug.
      Signed-off-by: NEric Wheeler <bcache@linux.ewheeler.net>
      Cc: Kent Overstreet <kent.overstreet@gmail.com>
      Cc: stable@vger.kernel.org
      d9dc1702
  20. 08 8月, 2016 1 次提交
    • J
      block: rename bio bi_rw to bi_opf · 1eff9d32
      Jens Axboe 提交于
      Since commit 63a4cc24, bio->bi_rw contains flags in the lower
      portion and the op code in the higher portions. This means that
      old code that relies on manually setting bi_rw is most likely
      going to be broken. Instead of letting that brokeness linger,
      rename the member, to force old and out-of-tree code to break
      at compile time instead of at runtime.
      
      No intended functional changes in this commit.
      Signed-off-by: NJens Axboe <axboe@fb.com>
      1eff9d32
  21. 06 7月, 2016 2 次提交
  22. 12 6月, 2016 1 次提交
  23. 08 6月, 2016 2 次提交
  24. 13 4月, 2016 1 次提交
  25. 09 3月, 2016 2 次提交
    • E
      bcache: fix cache_set_flush() NULL pointer dereference on OOM · f8b11260
      Eric Wheeler 提交于
      When bch_cache_set_alloc() fails to kzalloc the cache_set, the
      asyncronous closure handling tries to dereference a cache_set that
      hadn't yet been allocated inside of cache_set_flush() which is called
      by __cache_set_unregister() during cleanup.  This appears to happen only
      during an OOM condition on bcache_register.
      Signed-off-by: NEric Wheeler <bcache@linux.ewheeler.net>
      Cc: stable@vger.kernel.org
      f8b11260
    • E
      bcache: cleaned up error handling around register_cache() · 9b299728
      Eric Wheeler 提交于
      Fix null pointer dereference by changing register_cache() to return an int
      instead of being void.  This allows it to return -ENOMEM or -ENODEV and
      enables upper layers to handle the OOM case without NULL pointer issues.
      
      See this thread:
        http://thread.gmane.org/gmane.linux.kernel.bcache.devel/3521
      
      Fixes this error:
        gargamel:/sys/block/md5/bcache# echo /dev/sdh2 > /sys/fs/bcache/register
      
        bcache: register_cache() error opening sdh2: cannot allocate memory
        BUG: unable to handle kernel NULL pointer dereference at 00000000000009b8
        IP: [<ffffffffc05a7e8d>] cache_set_flush+0x102/0x15c [bcache]
        PGD 120dff067 PUD 1119a3067 PMD 0
        Oops: 0000 [#1] SMP
        Modules linked in: veth ip6table_filter ip6_tables
        (...)
        CPU: 4 PID: 3371 Comm: kworker/4:3 Not tainted 4.4.2-amd64-i915-volpreempt-20160213bc1 #3
        Hardware name: System manufacturer System Product Name/P8H67-M PRO, BIOS 3904 04/27/2013
        Workqueue: events cache_set_flush [bcache]
        task: ffff88020d5dc280 ti: ffff88020b6f8000 task.ti: ffff88020b6f8000
        RIP: 0010:[<ffffffffc05a7e8d>]  [<ffffffffc05a7e8d>] cache_set_flush+0x102/0x15c [bcache]
      Signed-off-by: NEric Wheeler <bcache@linux.ewheeler.net>
      Tested-by: NMarc MERLIN <marc@merlins.org>
      Cc: <stable@vger.kernel.org>
      9b299728