1. 01 11月, 2017 6 次提交
  2. 31 10月, 2017 6 次提交
    • L
      bcache: explicitly destroy mutex while exiting · 330a4db8
      Liang Chen 提交于
      mutex_destroy does nothing most of time, but it's better to call
      it to make the code future proof and it also has some meaning
      for like mutex debug.
      
      As Coly pointed out in a previous review, bcache_exit() may not be
      able to handle all the references properly if userspace registers
      cache and backing devices right before bch_debug_init runs and
      bch_debug_init failes later. So not exposing userspace interface
      until everything is ready to avoid that issue.
      Signed-off-by: NLiang Chen <liangchen.linux@gmail.com>
      Reviewed-by: NMichael Lyle <mlyle@lyle.org>
      Reviewed-by: NColy Li <colyli@suse.de>
      Reviewed-by: NEric Wheeler <bcache@linux.ewheeler.net>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      330a4db8
    • T
      bcache: fix wrong cache_misses statistics · c1573137
      tang.junhui 提交于
      Currently, Cache missed IOs are identified by s->cache_miss, but actually,
      there are many situations that missed IOs are not assigned a value for
      s->cache_miss in cached_dev_cache_miss(), for example, a bypassed IO
      (s->iop.bypass = 1), or the cache_bio allocate failed. In these situations,
      it will go to out_put or out_submit, and s->cache_miss is null, which leads
      bch_mark_cache_accounting() to treat this IO as a hit IO.
      
      [ML: applied by 3-way merge]
      Signed-off-by: Ntang.junhui <tang.junhui@zte.com.cn>
      Reviewed-by: NMichael Lyle <mlyle@lyle.org>
      Reviewed-by: NColy Li <colyli@suse.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      c1573137
    • T
      bcache: update bucket_in_use in real time · d44c2f9e
      Tang Junhui 提交于
      bucket_in_use is updated in gc thread which triggered by invalidating or
      writing sectors_to_gc dirty data, It's a long interval. Therefore, when we
      use it to compare with the threshold, it is often not timely, which leads
      to inaccurate judgment and often results in bucket depletion.
      
      We have send a patch before, by the means of updating bucket_in_use
      periodically In gc thread, which Coly thought that would lead high
      latency, In this patch, we add avail_nbuckets to record the count of
      available buckets, and we calculate bucket_in_use when alloc or free
      bucket in real time.
      
      [edited by ML: eliminated some whitespace errors]
      Signed-off-by: NTang Junhui <tang.junhui@zte.com.cn>
      Signed-off-by: NMichael Lyle <mlyle@lyle.org>
      Reviewed-by: NMichael Lyle <mlyle@lyle.org>
      Reviewed-by: NColy Li <colyli@suse.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      d44c2f9e
    • E
      bcache: convert cached_dev.count from atomic_t to refcount_t · 3b304d24
      Elena Reshetova 提交于
      atomic_t variables are currently used to implement reference
      counters with the following properties:
       - counter is initialized to 1 using atomic_set()
       - a resource is freed upon counter reaching zero
       - once counter reaches zero, its further
         increments aren't allowed
       - counter schema uses basic atomic operations
         (set, inc, inc_not_zero, dec_and_test, etc.)
      
      Such atomic variables should be converted to a newly provided
      refcount_t type and API that prevents accidental counter overflows
      and underflows. This is important since overflows and underflows
      can lead to use-after-free situation and be exploitable.
      
      The variable cached_dev.count is used as pure reference counter.
      Convert it to refcount_t and fix up the operations.
      Suggested-by: NKees Cook <keescook@chromium.org>
      Reviewed-by: NDavid Windsor <dwindsor@gmail.com>
      Reviewed-by: NHans Liljestrand <ishkamiel@gmail.com>
      Reviewed-by: NMichael Lyle <mlyle@lyle.org>
      Signed-off-by: NElena Reshetova <elena.reshetova@intel.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      3b304d24
    • C
      bcache: only permit to recovery read error when cache device is clean · d59b2379
      Coly Li 提交于
      When bcache does read I/Os, for example in writeback or writethrough mode,
      if a read request on cache device is failed, bcache will try to recovery
      the request by reading from cached device. If the data on cached device is
      not synced with cache device, then requester will get a stale data.
      
      For critical storage system like database, providing stale data from
      recovery may result an application level data corruption, which is
      unacceptible.
      
      With this patch, for a failed read request in writeback or writethrough
      mode, recovery a recoverable read request only happens when cache device
      is clean. That is to say, all data on cached device is up to update.
      
      For other cache modes in bcache, read request will never hit
      cached_dev_read_error(), they don't need this patch.
      
      Please note, because cache mode can be switched arbitrarily in run time, a
      writethrough mode might be switched from a writeback mode. Therefore
      checking dc->has_data in writethrough mode still makes sense.
      
      Changelog:
      V4: Fix parens error pointed by Michael Lyle.
      v3: By response from Kent Oversteet, he thinks recovering stale data is a
          bug to fix, and option to permit it is unnecessary. So this version
          the sysfs file is removed.
      v2: rename sysfs entry from allow_stale_data_on_failure  to
          allow_stale_data_on_failure, and fix the confusing commit log.
      v1: initial patch posted.
      
      [small change to patch comment spelling by mlyle]
      Signed-off-by: NColy Li <colyli@suse.de>
      Signed-off-by: NMichael Lyle <mlyle@lyle.org>
      Reported-by: NArne Wolf <awolf@lenovo.com>
      Reviewed-by: NMichael Lyle <mlyle@lyle.org>
      Cc: Kent Overstreet <kent.overstreet@gmail.com>
      Cc: Nix <nix@esperi.org.uk>
      Cc: Kai Krakow <hurikhan77@gmail.com>
      Cc: Eric Wheeler <bcache@lists.ewheeler.net>
      Cc: Junhui Tang <tang.junhui@zte.com.cn>
      Cc: stable@vger.kernel.org
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      d59b2379
    • B
      block: Fix a race between blk_cleanup_queue() and timeout handling · 4e9b6f20
      Bart Van Assche 提交于
      Make sure that if the timeout timer fires after a queue has been
      marked "dying" that the affected requests are finished.
      Reported-by: Nchenxiang (M) <chenxiang66@hisilicon.com>
      Fixes: commit 287922eb ("block: defer timeouts to a workqueue")
      Signed-off-by: NBart Van Assche <bart.vanassche@wdc.com>
      Tested-by: Nchenxiang (M) <chenxiang66@hisilicon.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Keith Busch <keith.busch@intel.com>
      Cc: Hannes Reinecke <hare@suse.com>
      Cc: Ming Lei <ming.lei@redhat.com>
      Cc: Johannes Thumshirn <jthumshirn@suse.de>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      4e9b6f20
  3. 26 10月, 2017 6 次提交
  4. 25 10月, 2017 1 次提交
  5. 24 10月, 2017 1 次提交
  6. 18 10月, 2017 1 次提交
    • O
      kyber: fix hang on domain token wait queue · 8cf46660
      Omar Sandoval 提交于
      When we're getting a domain token, if we fail to get a token on our
      first attempt, we put the current hardware queue on a wait queue and
      then try again just in case a token was freed after our initial attempt
      but before we got on the wait queue. If this second attempt succeeds, we
      currently leave the hardware queue on the wait queue. Usually this is
      okay; we'll just run the hardware queue one extra time when another
      token is freed. However, if the hardware queue doesn't have any other
      requests waiting, then when it it gets the extra wakeup, it won't have
      anything to free and therefore won't wake up any other hardware queues.
      If tokens are limited, then we won't make forward progress and the
      device will hang.
      Reported-by: NBin Zha <zhabin.zb@alibaba-inc.com>
      Signed-off-by: NOmar Sandoval <osandov@fb.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      8cf46660
  7. 17 10月, 2017 3 次提交
  8. 16 10月, 2017 15 次提交
    • M
      bcache: MAINTAINERS: set bcache to MAINTAINED · 52b69ff5
      Michael Lyle 提交于
      Also add URL for IRC channel.
      Signed-off-by: NMichael Lyle <mlyle@lyle.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      52b69ff5
    • K
      bcache: Add Michael Lyle to MAINTAINERS · 77c77a98
      Kent Overstreet 提交于
      Signed-off-by: NKent Overstreet <kent.overstreet@gmail.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      77c77a98
    • L
      bcache: safeguard a dangerous addressing in closure_queue · 6446c684
      Liang Chen 提交于
      The use of the union reduces the size of closure struct by taking advantage
      of the current size of its members. The offset of func in work_struct
      equals the size of the first three members, so that work.work_func will
      just reference the forth member - fn.
      
      This is smart but dangerous. It can be broken if work_struct or the other
      structs get changed, and can be a bit difficult to debug.
      Signed-off-by: NLiang Chen <liangchen.linux@gmail.com>
      Reviewed-by: NMichael Lyle <mlyle@lyle.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      6446c684
    • M
      bcache: rearrange writeback main thread ratelimit · a8500fc8
      Michael Lyle 提交于
      The time spent searching for things to write back "counts" for the
      actual rate achieved, so don't flush the accumulated rate with each
      chunk.
      
      This will maintain better fidelity to user-commanded rates, but it
      may slightly increase the burstiness of writeback.  The writeback
      lock needs improvement to help mitigate this.
      Signed-off-by: NMichael Lyle <mlyle@lyle.org>
      Reviewed-by: NKent Overstreet <kent.overstreet@gmail.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      a8500fc8
    • M
      bcache: writeback rate shouldn't artifically clamp · e41166c5
      Michael Lyle 提交于
      The previous code artificially limited writeback rate to 1000000
      blocks/second (NSEC_PER_MSEC), which is a rate that can be met on fast
      hardware.  The rate limiting code works fine (though with decreased
      precision) up to 3 orders of magnitude faster, so use NSEC_PER_SEC.
      
      Additionally, ensure that uint32_t is used as a type for rate throughout
      the rate management so that type checking/clamp_t can work properly.
      
      bch_next_delay should be rewritten for increased precision and better
      handling of high rates and long sleep periods, but this is adequate for
      now.
      Signed-off-by: NMichael Lyle <mlyle@lyle.org>
      Reported-by: NColy Li <colyli@suse.de>
      Reviewed-by: NColy Li <colyli@suse.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      e41166c5
    • M
      bcache: smooth writeback rate control · ae82ddbf
      Michael Lyle 提交于
      This works in conjunction with the new PI controller.  Currently, in
      real-world workloads, the rate controller attempts to write back 1
      sector per second.  In practice, these minimum-rate writebacks are
      between 4k and 60k in test scenarios, since bcache aggregates and
      attempts to do contiguous writes and because filesystems on top of
      bcachefs typically write 4k or more.
      
      Previously, bcache used to guarantee to write at least once per second.
      This means that the actual writeback rate would exceed the configured
      amount by a factor of 8-120 or more.
      
      This patch adjusts to be willing to sleep up to 2.5 seconds, and to
      target writing 4k/second.  On the smallest writes, it will sleep 1
      second like before, but many times it will sleep longer and load the
      backing device less.  This keeps the loading on the cache and backing
      device related to writeback more consistent when writing back at low
      rates.
      Signed-off-by: NMichael Lyle <mlyle@lyle.org>
      Reviewed-by: NColy Li <colyli@suse.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      ae82ddbf
    • M
      bcache: implement PI controller for writeback rate · 1d316e65
      Michael Lyle 提交于
      bcache uses a control system to attempt to keep the amount of dirty data
      in cache at a user-configured level, while not responding excessively to
      transients and variations in write rate.  Previously, the system was a
      PD controller; but the output from it was integrated, turning the
      Proportional term into an Integral term, and turning the Derivative term
      into a crude Proportional term.  Performance of the controller has been
      uneven in production, and it has tended to respond slowly, oscillate,
      and overshoot.
      
      This patch set replaces the current control system with an explicit PI
      controller and tuning that should be correct for most hardware.  By
      default, it attempts to write at a rate that would retire 1/40th of the
      current excess blocks per second.  An integral term in turn works to
      remove steady state errors.
      
      IMO, this yields benefits in simplicity (removing weighted average
      filtering, etc) and system performance.
      
      Another small change is a tunable parameter is introduced to allow the
      user to specify a minimum rate at which dirty blocks are retired.
      
      There is a slight difference from earlier versions of the patch in
      integral handling to prevent excessive negative integral windup.
      Signed-off-by: NMichael Lyle <mlyle@lyle.org>
      Reviewed-by: NColy Li <colyli@suse.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      1d316e65
    • M
      bcache: don't write back data if reading it failed · 5fa89fb9
      Michael Lyle 提交于
      If an IO operation fails, and we didn't successfully read data from the
      cache, don't writeback invalid/partial data to the backing disk.
      Signed-off-by: NMichael Lyle <mlyle@lyle.org>
      Reviewed-by: NColy Li <colyli@suse.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      5fa89fb9
    • Y
      bcache: remove unused parameter · 23850102
      Yijing Wang 提交于
      Parameter bio is no longer used, clean it.
      Signed-off-by: NYijing Wang <wangyijing@huawei.com>
      Reviewed-by: NColy Li <colyli@suse.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      23850102
    • E
      bcache: update bio->bi_opf bypass/writeback REQ_ flag hints · b41c9b02
      Eric Wheeler 提交于
      Flag for bypass if the IO is for read-ahead or background, unless the
      read-ahead request is for metadata (eg, from gfs2).
              Bypass if:
                      bio->bi_opf & (REQ_RAHEAD|REQ_BACKGROUND) &&
      			!(bio->bi_opf & REQ_META))
      
              Writeback if:
                      op_is_sync(bio->bi_opf) ||
      			bio->bi_opf & (REQ_META|REQ_PRIO)
      Signed-off-by: NEric Wheeler <bcache@linux.ewheeler.net>
      Reviewed-by: NColy Li <colyli@suse.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      b41c9b02
    • Y
      bcache: Remove redundant set_capacity · e89d6759
      Yijing Wang 提交于
      set_capacity() has been called in bcache_device_init(),
      remove the redundant one.
      Signed-off-by: NYijing Wang <wangyijing@huawei.com>
      Reviewed-by: NEric Wheeler <bcache@linux.ewheeler.net>
      Acked-by: NColy Li <colyli@suse.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      e89d6759
    • C
      bcache: rewrite multiple partitions support · 1dbe32ad
      Coly Li 提交于
      Current partition support of bcache is confusing and buggy. It tries to
      trace non-continuous device minor numbers by an ida bit string, and
      mistakenly mixed bcache device index with minor numbers. This design
      generates several negative results,
      - Index of bcache device name is not consecutive under /dev/. If there are
        3 bcache devices, they name will be,
        /dev/bcache0, /dev/bcache16, /dev/bcache32
        Only bcache code indexes bcache device name is such an interesting way.
      - First minor number of each bcache device is traced by ida bit string.
        One bcache device will occupy 16 bits, this is not a good idea. Indeed
        only one bit is enough.
      - Because minor number and bcache device index are mixed, a device index
        is allocated by ida_simple_get(), but an first minor number is sent into
        ida_simple_remove() to release the device. It confused original author
        too.
      
      Root cause of the above errors is, bcache code should not handle device
      minor numbers at all! A standard process to support multiple partitions in
      Linux kernel is,
      - Device driver provides major device number, and indexes multiple device
        instances.
      - Device driver does not allocat nor trace device minor number, only
        provides a first minor number of a given device instance, and sets how
        many minor numbers (paritions) the device instance may have.
      All rested stuffs are handled by block layer code, most of the details can
      be found from block/{genhd, partition-generic}.c files.
      
      This patch re-writes multiple partitions support for bcache. It makes
      whole things to be more clear, and uses ida bit string in a more efficeint
      way.
      - Ida bit string only traces bcache device index, not minor number. For a
        bcache device with 128 partitions, only one bit in ida bit string is
        enough.
      - Device minor number and device index are separated in concept. Device
        index is used for /dev node naming, and ida bit string trace. Minor
        number is calculated from device index and only used to initialize
        first_minor of a bcache device.
      - It does not follow any standard for 16 partitions on a bcache device.
        This patch sets 128 partitions on single bcache device at max, this is
        the limitation from GPT (GUID Partition Table) and supported by fdisk.
      
      Considering a typical device minor number is 20 bits width, each bcache
      device may have 128 partitions (7 bits), there can be 8192 bcache devices
      existing on system. For most common deployment for a single server in
      now days, it should be enough.
      
      [minor spelling fixes in commit message by Michael Lyle]
      Signed-off-by: NColy Li <colyli@suse.de>
      Cc: Eric Wheeler <bcache@lists.ewheeler.net>
      Cc: Junhui Tang <tang.junhui@zte.com.cn>
      Reviewed-by: NMichael Lyle <mlyle@lyle.org>
      Signed-off-by: NMichael Lyle <mlyle@lyle.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      1dbe32ad
    • C
      bcache: fix a comments typo in bch_alloc_sectors() · b1e8139e
      Coly Li 提交于
      Code comments in alloc.c:bch_alloc_sectors() mentions a function
      name find_data_bucket(), the correct function name should be
      pick_data_bucket() indeed. bch_alloc_sectors() is a quite important
      function in bcache allocation code, fixing the typo may help
      other people to have less confusion.
      Signed-off-by: NColy Li <colyli@suse.de>
      Reviewed-by: NTang Junhui <tang.junhui@zte.com.cn>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      b1e8139e
    • C
      bcache: check ca->alloc_thread initialized before wake up it · 91af8300
      Coly Li 提交于
      In bcache code, sysfs entries are created before all resources get
      allocated, e.g. allocation thread of a cache set.
      
      There is posibility for NULL pointer deference if a resource is accessed
      but which is not initialized yet. Indeed Jorg Bornschein catches one on
      cache set allocation thread and gets a kernel oops.
      
      The reason for this bug is, when bch_bucket_alloc() is called during
      cache set registration and attaching, ca->alloc_thread is not properly
      allocated and initialized yet, call wake_up_process() on ca->alloc_thread
      triggers NULL pointer deference failure. A simple and fast fix is, before
      waking up ca->alloc_thread, checking whether it is allocated, and only
      wake up ca->alloc_thread when it is not NULL.
      Signed-off-by: NColy Li <colyli@suse.de>
      Reported-by: NJorg Bornschein <jb@capsec.org>
      Cc: Kent Overstreet <kent.overstreet@gmail.com>
      Cc: stable@vger.kernel.org
      Reviewed-by: NMichael Lyle <mlyle@lyle.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      91af8300
    • P
      bcache: Avoid nested function definition · 58f913dc
      Peter Foley 提交于
      Fixes below error with clang:
      ../drivers/md/bcache/sysfs.c:759:3: error: function definition is not allowed here
                      {       return *((uint16_t *) r) - *((uint16_t *) l); }
                      ^
      ../drivers/md/bcache/sysfs.c:789:32: error: use of undeclared identifier 'cmp'
                      sort(p, n, sizeof(uint16_t), cmp, NULL);
                                                   ^
      2 errors generated.
      
      v2:
      rename function to __bch_cache_cmp
      Signed-off-by: NPeter Foley <pefoley2@pefoley.com>
      Reviewed-by: NColy Li <colyli@suse.de>
      Reviewed-by: NMichael Lyle <mlyle@lyle.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      58f913dc
  9. 14 10月, 2017 1 次提交