1. 31 1月, 2018 1 次提交
    • M
      blk-mq: introduce BLK_STS_DEV_RESOURCE · 86ff7c2a
      Ming Lei 提交于
      This status is returned from driver to block layer if device related
      resource is unavailable, but driver can guarantee that IO dispatch
      will be triggered in future when the resource is available.
      
      Convert some drivers to return BLK_STS_DEV_RESOURCE.  Also, if driver
      returns BLK_STS_RESOURCE and SCHED_RESTART is set, rerun queue after
      a delay (BLK_MQ_DELAY_QUEUE) to avoid IO stalls.  BLK_MQ_DELAY_QUEUE is
      3 ms because both scsi-mq and nvmefc are using that magic value.
      
      If a driver can make sure there is in-flight IO, it is safe to return
      BLK_STS_DEV_RESOURCE because:
      
      1) If all in-flight IOs complete before examining SCHED_RESTART in
      blk_mq_dispatch_rq_list(), SCHED_RESTART must be cleared, so queue
      is run immediately in this case by blk_mq_dispatch_rq_list();
      
      2) if there is any in-flight IO after/when examining SCHED_RESTART
      in blk_mq_dispatch_rq_list():
      - if SCHED_RESTART isn't set, queue is run immediately as handled in 1)
      - otherwise, this request will be dispatched after any in-flight IO is
        completed via blk_mq_sched_restart()
      
      3) if SCHED_RESTART is set concurently in context because of
      BLK_STS_RESOURCE, blk_mq_delay_run_hw_queue() will cover the above two
      cases and make sure IO hang can be avoided.
      
      One invariant is that queue will be rerun if SCHED_RESTART is set.
      Suggested-by: NJens Axboe <axboe@kernel.dk>
      Tested-by: NLaurence Oberman <loberman@redhat.com>
      Signed-off-by: NMing Lei <ming.lei@redhat.com>
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      86ff7c2a
  2. 11 1月, 2018 2 次提交
    • A
      null_blk: remove explicit 'select FAULT_INJECTION' · 33f782c4
      Arnd Bergmann 提交于
      Selecting FAULT_INJECTION causes a Kconfig warning when CONFIG_DEBUG_KERNEL
      is not set:
      
      warning: (BLK_DEV_NULL_BLK && DRM_I915_SELFTEST) selects FAULT_INJECTION which has unmet direct dependencies (DEBUG_KERNEL)
      
      The other drivers that use FAULT_INJECTION tend to have a separate
      Kconfig symbol for turning on that feature, so let's do the same
      thing here. This may add a bit more complexity than we like, but
      it avoids the warning and is more consistent with the rest of the
      kernel.
      
      Fixes: 93b57046 ("null_blk: add option for managing IO timeouts")
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      33f782c4
    • J
      null_blk: add option for managing IO timeouts · 93b57046
      Jens Axboe 提交于
      Use the fault injection framework to provide a way for null_blk
      to configure timeouts. This only works for queue_mode 1 and 2,
      since the bio mode doesn't have code for tracking timeouts.
      
      Let's say you want to have a 10% chance of timing out every
      100,000 requests, and for 5 total timeouts, you could do:
      
      modprobe null_blk timeout="100000,10,0,5"
      
      This is useful for adding blktests to test that IO timeouts
      are handled appropriately.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      93b57046
  3. 10 1月, 2018 1 次提交
    • J
      null_blk: wire up timeouts · 5448aca4
      Jens Axboe 提交于
      This is needed to ensure that we actually handle timeouts.
      Without it, the queue_mode=1 path will never call blk_add_timer(),
      and the queue_mode=2 path will continually just return
      EH_RESET_TIMER and we never actually complete the offending request.
      
      This was used to test the new timeout code, and the changes around
      killing off REQ_ATOM_COMPLETE.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      5448aca4
  4. 05 1月, 2018 1 次提交
  5. 21 12月, 2017 1 次提交
    • J
      null_blk: unalign call_single_data · 0864fe09
      Jens Axboe 提交于
      Commit 966a9671 randomly added alignment to this structure, but
      it's actually detrimental to performance of null_blk. Test case:
      
      Running on both the home and remote node shows a ~5% degradation
      in performance.
      
      While in there, move blk_status_t to the hole after the integer tag
      in the nullb_cmd structure. After this patch, we shrink the size
      from 192 to 152 bytes.
      
      Fixes: 966a9671 ("smp: Avoid using two cache lines for struct call_single_data")
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      0864fe09
  6. 22 11月, 2017 1 次提交
    • D
      null_blk: fix dev->badblocks leak · 1addb798
      David Disseldorp 提交于
      null_alloc_dev() allocates memory for dev->badblocks, but cleanup
      currently only occurs in the configfs release codepath, missing a number
      of other places.
      
      This bug was found running the blktests block/010 test, alongside
      kmemleak:
      rapido1:/blktests# ./check block/010
      ...
      rapido1:/blktests# echo scan > /sys/kernel/debug/kmemleak
      [  306.966708] kmemleak: 32 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
      rapido1:/blktests# cat /sys/kernel/debug/kmemleak
      unreferenced object 0xffff88001f86d000 (size 4096):
        comm "modprobe", pid 231, jiffies 4294892415 (age 318.252s)
        hex dump (first 32 bytes):
          00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
          00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        backtrace:
          [<ffffffff814b0379>] kmemleak_alloc+0x49/0xa0
          [<ffffffff810f180f>] kmem_cache_alloc+0x9f/0xe0
          [<ffffffff8124e45f>] badblocks_init+0x2f/0x60
          [<ffffffffa0019fae>] 0xffffffffa0019fae
          [<ffffffffa0021273>] nullb_device_badblocks_store+0x63/0x130 [null_blk]
          [<ffffffff810004cd>] do_one_initcall+0x3d/0x170
          [<ffffffff8109fe0d>] do_init_module+0x56/0x1e9
          [<ffffffff8109ebd7>] load_module+0x1c47/0x26a0
          [<ffffffff8109f819>] SyS_finit_module+0xa9/0xd0
          [<ffffffff814b4f60>] entry_SYSCALL_64_fastpath+0x13/0x94
      
      Fixes: 2f54a613 ("nullb: badbblocks support")
      Reviewed-by: NShaohua Li <shli@fb.com>
      Signed-off-by: NDavid Disseldorp <ddiss@suse.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      1addb798
  7. 19 10月, 2017 1 次提交
  8. 17 10月, 2017 1 次提交
  9. 30 9月, 2017 1 次提交
  10. 29 8月, 2017 3 次提交
    • Y
      smp: Avoid using two cache lines for struct call_single_data · 966a9671
      Ying Huang 提交于
      struct call_single_data is used in IPIs to transfer information between
      CPUs.  Its size is bigger than sizeof(unsigned long) and less than
      cache line size.  Currently it is not allocated with any explicit alignment
      requirements.  This makes it possible for allocated call_single_data to
      cross two cache lines, which results in double the number of the cache lines
      that need to be transferred among CPUs.
      
      This can be fixed by requiring call_single_data to be aligned with the
      size of call_single_data. Currently the size of call_single_data is the
      power of 2.  If we add new fields to call_single_data, we may need to
      add padding to make sure the size of new definition is the power of 2
      as well.
      
      Fortunately, this is enforced by GCC, which will report bad sizes.
      
      To set alignment requirements of call_single_data to the size of
      call_single_data, a struct definition and a typedef is used.
      
      To test the effect of the patch, I used the vm-scalability multiple
      thread swap test case (swap-w-seq-mt).  The test will create multiple
      threads and each thread will eat memory until all RAM and part of swap
      is used, so that huge number of IPIs are triggered when unmapping
      memory.  In the test, the throughput of memory writing improves ~5%
      compared with misaligned call_single_data, because of faster IPIs.
      Suggested-by: NPeter Zijlstra <peterz@infradead.org>
      Signed-off-by: NHuang, Ying <ying.huang@intel.com>
      [ Add call_single_data_t and align with size of call_single_data. ]
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Aaron Lu <aaron.lu@intel.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/87bmnqd6lz.fsf@yhuang-mobile.sh.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      966a9671
    • J
      null_blk: use available 'dev' in nullb_device_power_store() · b3c30512
      Jens Axboe 提交于
      We already have this pointer, no need to use to_nullb_device()
      again.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      b3c30512
    • S
      block/nullb: delete unnecessary memory free · 060fd198
      Shaohua Li 提交于
      Commit 2984c868(nullb: factor disk parameters) has a typo. The
      nullb_device allocation/free is done outside of null_add_dev. The commit
      accidentally frees the nullb_device in error code path.
      Reported-by: NDan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      060fd198
  11. 26 8月, 2017 2 次提交
  12. 23 8月, 2017 9 次提交
    • S
      nullb: badbblocks support · 2f54a613
      Shaohua Li 提交于
      Sometime disk could have tracks broken and data there is inaccessable,
      but data in other parts can be accessed in normal way. MD RAID supports
      such disks. But we don't have a good way to test it, because we can't
      control which part of a physical disk is bad. For a virtual disk, this
      can be easily controlled.
      
      This patch adds a new 'badblock' attribute. Configure it in this way:
      echo "+1-100" > xxx/badblock, this will make sector [1-100] as bad
      blocks.
      echo "-20-30" > xxx/badblock, this will make sector [20-30] good
      
      If badblocks are accessed, the nullb disk will return IO error. Other
      parts of the disk can accessed in normal way.
      Signed-off-by: NShaohua Li <shli@fb.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      2f54a613
    • S
      nullb: emulate cache · deb78b41
      Shaohua Li 提交于
      Software must flush disk cache to guarantee data safety. To check if
      software correctly does disk cache flush, we must know the behavior of
      disk. But physical disk behavior is uncontrollable. Even software
      doesn't do the flush, the disk probably does the flush. This patch tries
      to emulate a cache in the test disk.
      
      All write will go to a cache first, when the cache is full, we then
      flush some data to disk storage. A flush request will flush all data of
      the cache to disk storage. A FUA write will write to memory store
      directly and revalidate data in cache. If there is a power failure (by
      writing to power attribute, 'echo 0 > disk_name/power'), we discard all
      data in the cache, but preserve the data in disk storage. Later we can
      power on the disk again as usual (write 1 to 'power' attribute), then we
      can check data integrity and very if software does everything correctly.
      
      A new attribute 'cache_size' (in MB) is added to configure cache size.
      
      Based on original patch from Kyungchan Koh
      Signed-off-by: NKyungchan Koh <kkc6196@fb.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      deb78b41
    • S
      nullb: bandwidth control · eff2c4f1
      Shaohua Li 提交于
      In test, we usually expect controllable disk speed. For example, in a
      raid array, we'd like some disks are fast and some are slow. MD RAID
      actually has a feature for this. To test the feature, we'd like to make
      the disk run in specific speed.
      
      block throttling probably can be used for this purpose, but it requires
      cgroup setup. Here we just implement a simple throttling mechanism in
      the driver. There is slight fluctuation in the mechanism, but it's good
      enough for test.
      
      To configure the bandwidth cap, user sets the 'mbps' attribute. mbps is
      MB/s.
      
      Based on original patch from Kyungchan Koh
      Signed-off-by: NKyungchan Koh <kkc6196@fb.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      eff2c4f1
    • S
      nullb: support discard · 306eb6b4
      Shaohua Li 提交于
      discard makes sense for memory backed disk. And also it's useful to test
      if upper layer supports dicard correctly.
      
      User configures 'discard' attribute to enable/disable dicard support.
      
      Based on original patch from Kyungchan Koh
      Signed-off-by: NKyungchan Koh <kkc6196@fb.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      306eb6b4
    • S
      nullb: support memory backed store · 5bcd0e0c
      Shaohua Li 提交于
      This adds memory backed store in nullb.
      
      User configure 'memory_backed' attribute for this. By default, nullb
      disk doesn't use memory backed store.
      
      Based on original patch from Kyungchan Koh
      Signed-off-by: NKyungchan Koh <kkc6196@fb.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      5bcd0e0c
    • S
      nullb: use ida to manage index · 94bc02e3
      Shaohua Li 提交于
      We now dynamically create disks. Managing the disk index with ida to
      avoid bump up the index too much.
      Signed-off-by: NShaohua Li <shli@fb.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      94bc02e3
    • S
      nullb: add interface to power on disk · cedcafad
      Shaohua Li 提交于
      The device created in nullb configfs interface isn't power on by
      default. After user configures the device, user can do 'echo 1 >
      xxx/nullb/device_name/power' to power on the device, which will create a
      disk. the xxx/nullb/device_name/index is the disk index, so if the index
      is 2, the new created disk should be named as /dev/nullb2. Note, the
      'index' is only valid after disk is power on.
      
      'echo 0 > xxx/nullb/device_name/power' will remove the disk. Note, this
      doesn't remove the device. To remove the device, user should do 'rmdir
      xxx/nullb/device_name'. Removing the device will remove the disk too.
      Signed-off-by: NShaohua Li <shli@fb.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      cedcafad
    • S
      nullb: add configfs interface · 3bf2bd20
      Shaohua Li 提交于
      Add configfs interface for nullb. configfs interface is more flexible
      and easy to configure in a per-disk basis.
      
      Configuration is something like this:
      mount -t configfs none /mnt
      
      Checking which features the driver supports:
      cat /mnt/nullb/features
      
      The 'features' attribute is for future extension. We probably will add
      new features into the driver, userspace can check this attribute to find
      the supported features.
      
      Create/remove a device:
      mkdir/rmdir /mnt/nullb/a
      
      Then configure the device by setting attributes under /mnt/nullb/a, most
      of nullb supported module parameters are converted to attributes:
      size; /* device size in MB */
      completion_nsec; /* time in ns to complete a request */
      submit_queues; /* number of submission queues */
      home_node; /* home node for the device */
      queue_mode; /* block interface */
      blocksize; /* block size */
      irqmode; /* IRQ completion handler */
      hw_queue_depth; /* queue depth */
      use_lightnvm; /* register as a LightNVM device */
      blocking; /* blocking blk-mq device */
      use_per_node_hctx; /* use per-node allocation for hardware context */
      
      Note, creating a device doesn't create a disk immediately. Creating a
      disk is done in two phases: create a device and then power on the
      device. Next patch will introduce device power on.
      
      Based on original patch from Kyungchan Koh
      Signed-off-by: NKyungchan Koh <kkc6196@fb.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      3bf2bd20
    • S
      nullb: factor disk parameters · 2984c868
      Shaohua Li 提交于
      When we switch to configfs interface, each disk could have different
      configuration. To prepare for the change, we move most disk setting to a
      separate data structure. The existing module parameter interface is
      kept. The 'nr_devices' and 'shared_tags' don't make sense for per-disk
      setting, so they are remained as global settings.
      Signed-off-by: NShaohua Li <shli@fb.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      2984c868
  13. 08 8月, 2017 2 次提交
  14. 06 7月, 2017 1 次提交
  15. 21 6月, 2017 1 次提交
    • J
      null_blk: add support for shared tags · 82f402fe
      Jens Axboe 提交于
      Some storage drivers need to share tag sets between devices. It's
      useful to be able to model that with null_blk, to find hangs or
      performance issues.
      
      Add a 'shared_tags' bool module parameter that. If that is set to
      true and nr_devices is bigger than 1, all devices allocated will
      share the same tag set.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      82f402fe
  16. 09 6月, 2017 2 次提交
    • C
      blk-mq: switch ->queue_rq return value to blk_status_t · fc17b653
      Christoph Hellwig 提交于
      Use the same values for use for request completion errors as the return
      value from ->queue_rq.  BLK_STS_RESOURCE is special cased to cause
      a requeue, and all the others are completed as-is.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      fc17b653
    • C
      block: introduce new block status code type · 2a842aca
      Christoph Hellwig 提交于
      Currently we use nornal Linux errno values in the block layer, and while
      we accept any error a few have overloaded magic meanings.  This patch
      instead introduces a new  blk_status_t value that holds block layer specific
      status codes and explicitly explains their meaning.  Helpers to convert from
      and to the previous special meanings are provided for now, but I suspect
      we want to get rid of them in the long run - those drivers that have a
      errno input (e.g. networking) usually get errnos that don't know about
      the special block layer overloads, and similarly returning them to userspace
      will usually return somethings that strictly speaking isn't correct
      for file system operations, but that's left as an exercise for later.
      
      For now the set of errors is a very limited set that closely corresponds
      to the previous overloaded errno values, but there is some low hanging
      fruite to improve it.
      
      blk_status_t (ab)uses the sparse __bitwise annotations to allow for sparse
      typechecking, so that we can easily catch places passing the wrong values.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      2a842aca
  17. 21 4月, 2017 2 次提交
  18. 20 4月, 2017 1 次提交
  19. 31 3月, 2017 2 次提交
  20. 01 2月, 2017 1 次提交
    • C
      block: fold cmd_type into the REQ_OP_ space · aebf526b
      Christoph Hellwig 提交于
      Instead of keeping two levels of indirection for requests types, fold it
      all into the operations.  The little caveat here is that previously
      cmd_type only applied to struct request, while the request and bio op
      fields were set to plain REQ_OP_READ/WRITE even for passthrough
      operations.
      
      Instead this patch adds new REQ_OP_* for SCSI passthrough and driver
      private requests, althought it has to add two for each so that we
      can communicate the data in/out nature of the request.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      aebf526b
  21. 31 1月, 2017 2 次提交
  22. 26 12月, 2016 1 次提交
    • T
      ktime: Cleanup ktime_set() usage · 8b0e1953
      Thomas Gleixner 提交于
      ktime_set(S,N) was required for the timespec storage type and is still
      useful for situations where a Seconds and Nanoseconds part of a time value
      needs to be converted. For anything where the Seconds argument is 0, this
      is pointless and can be replaced with a simple assignment.
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      8b0e1953
  23. 16 11月, 2016 1 次提交