1. 01 5月, 2019 1 次提交
  2. 12 4月, 2019 1 次提交
    • J
      block: fix the return errno for direct IO · a89afe58
      Jason Yan 提交于
      If the last bio returned is not dio->bio, the status of the bio will
      not assigned to dio->bio if it is error. This will cause the whole IO
      status wrong.
      
          ksoftirqd/21-117   [021] ..s.  4017.966090:   8,0    C   N 4883648 [0]
                <idle>-0     [018] ..s.  4017.970888:   8,0    C  WS 4924800 + 1024 [0]
                <idle>-0     [018] ..s.  4017.970909:   8,0    D  WS 4935424 + 1024 [<idle>]
                <idle>-0     [018] ..s.  4017.970924:   8,0    D  WS 4936448 + 321 [<idle>]
          ksoftirqd/21-117   [021] ..s.  4017.995033:   8,0    C   R 4883648 + 336 [65475]
          ksoftirqd/21-117   [021] d.s.  4018.001988: myprobe1: (blkdev_bio_end_io+0x0/0x168) bi_status=7
          ksoftirqd/21-117   [021] d.s.  4018.001992: myprobe: (aio_complete_rw+0x0/0x148) x0=0xffff802f2595ad80 res=0x12a000 res2=0x0
      
      We always have to assign bio->bi_status to dio->bio.bi_status because we
      will only check dio->bio.bi_status when we return the whole IO to
      the upper layer.
      
      Fixes: 542ff7bf ("block: new direct I/O implementation")
      Cc: stable@vger.kernel.org
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Jens Axboe <axboe@kernel.dk>
      Reviewed-by: NMing Lei <ming.lei@redhat.com>
      Signed-off-by: NJason Yan <yanaijie@huawei.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      a89afe58
  3. 19 3月, 2019 1 次提交
    • J
      block: add BIO_NO_PAGE_REF flag · 399254aa
      Jens Axboe 提交于
      If bio_iov_iter_get_pages() is called on an iov_iter that is flagged
      with NO_REF, then we don't need to add a page reference for the pages
      that we add.
      
      Add BIO_NO_PAGE_REF to track this in the bio, so IO completion knows
      not to drop a reference to these pages.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      399254aa
  4. 24 2月, 2019 2 次提交
  5. 15 2月, 2019 1 次提交
  6. 15 1月, 2019 1 次提交
    • J
      blockdev: Fix livelocks on loop device · 04906b2f
      Jan Kara 提交于
      bd_set_size() updates also block device's block size. This is somewhat
      unexpected from its name and at this point, only blkdev_open() uses this
      functionality. Furthermore, this can result in changing block size under
      a filesystem mounted on a loop device which leads to livelocks inside
      __getblk_gfp() like:
      
      Sending NMI from CPU 0 to CPUs 1:
      NMI backtrace for cpu 1
      CPU: 1 PID: 10863 Comm: syz-executor0 Not tainted 4.18.0-rc5+ #151
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google
      01/01/2011
      RIP: 0010:__sanitizer_cov_trace_pc+0x3f/0x50 kernel/kcov.c:106
      ...
      Call Trace:
       init_page_buffers+0x3e2/0x530 fs/buffer.c:904
       grow_dev_page fs/buffer.c:947 [inline]
       grow_buffers fs/buffer.c:1009 [inline]
       __getblk_slow fs/buffer.c:1036 [inline]
       __getblk_gfp+0x906/0xb10 fs/buffer.c:1313
       __bread_gfp+0x2d/0x310 fs/buffer.c:1347
       sb_bread include/linux/buffer_head.h:307 [inline]
       fat12_ent_bread+0x14e/0x3d0 fs/fat/fatent.c:75
       fat_ent_read_block fs/fat/fatent.c:441 [inline]
       fat_alloc_clusters+0x8ce/0x16e0 fs/fat/fatent.c:489
       fat_add_cluster+0x7a/0x150 fs/fat/inode.c:101
       __fat_get_block fs/fat/inode.c:148 [inline]
      ...
      
      Trivial reproducer for the problem looks like:
      
      truncate -s 1G /tmp/image
      losetup /dev/loop0 /tmp/image
      mkfs.ext4 -b 1024 /dev/loop0
      mount -t ext4 /dev/loop0 /mnt
      losetup -c /dev/loop0
      l /mnt
      
      Fix the problem by moving initialization of a block device block size
      into a separate function and call it when needed.
      
      Thanks to Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> for help with
      debugging the problem.
      
      Reported-by: syzbot+9933e4476f365f5d5a1b@syzkaller.appspotmail.com
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      04906b2f
  7. 03 1月, 2019 1 次提交
    • L
      block: don't use un-ordered __set_current_state(TASK_UNINTERRUPTIBLE) · 1ac5cd49
      Linus Torvalds 提交于
      This mostly reverts commit 849a3700 ("block: avoid ordered task
      state change for polled IO").  It was wrongly claiming that the ordering
      wasn't necessary.  The memory barrier _is_ necessary.
      
      If something is truly polling and not going to sleep, it's the whole
      state setting that is unnecessary, not the memory barrier.  Whenever you
      set your state to a sleeping state, you absolutely need the memory
      barrier.
      
      Note that sometimes the memory barrier can be elsewhere.  For example,
      the ordering might be provided by an external lock, or by setting the
      process state to sleeping before adding yourself to the wait queue list
      that is used for waking up (where the wait queue lock itself will
      guarantee that any wakeup will correctly see the sleeping state).
      
      But none of those cases were true here.
      
      NOTE! Some of the polling paths may indeed be able to drop the state
      setting entirely, at which point the memory barrier also goes away.
      
      (Also note that this doesn't revert the TASK_RUNNING cases: there is no
      race between a wakeup and setting the process state to TASK_RUNNING,
      since the end result doesn't depend on ordering).
      
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1ac5cd49
  8. 29 12月, 2018 1 次提交
  9. 30 11月, 2018 1 次提交
    • C
      block: avoid extra bio reference for async O_DIRECT · 531724ab
      Christoph Hellwig 提交于
      The bio referencing has a trick that doesn't do any actual atomic
      inc/dec on the reference count until we have to elevator to > 1. For the
      async IO O_DIRECT case, we can't use the simple DIO variants, so we use
      __blkdev_direct_IO(). It always grabs an extra reference to the bio
      after allocation, which means we then enter the slower path of actually
      having to do atomic_inc/dec on the count.
      
      We don't need to do that for the async case, unless we end up going
      multi-bio, in which case we're already doing huge amounts of IO. For the
      smaller IO case (< BIO_MAX_PAGES), we can do without the extra ref.
      
      Based on an earlier patch (and commit log) from Jens Axboe.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      531724ab
  10. 26 11月, 2018 1 次提交
    • J
      block: make blk_poll() take a parameter on whether to spin or not · 0a1b8b87
      Jens Axboe 提交于
      blk_poll() has always kept spinning until it found an IO. This is
      fine for SYNC polling, since we need to find one request we have
      pending, but in preparation for ASYNC polling it can be beneficial
      to just check if we have any entries available or not.
      
      Existing callers are converted to pass in 'spin == true', to retain
      the old behavior.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      0a1b8b87
  11. 19 11月, 2018 1 次提交
  12. 16 11月, 2018 3 次提交
  13. 08 11月, 2018 1 次提交
  14. 24 10月, 2018 1 次提交
  15. 27 7月, 2018 1 次提交
  16. 18 7月, 2018 1 次提交
    • T
      block: make bdev_ops->rw_page() take a REQ_OP instead of bool · 3f289dcb
      Tejun Heo 提交于
      c11f0c0b ("block/mm: make bdev_ops->rw_page() take a bool for
      read/write") replaced @OP with boolean @is_write, which limited the
      amount of information going into ->rw_page() and more importantly
      page_endio(), which removed the need to expose block internals to mm.
      
      Unfortunately, we want to track discards separately and @is_write
      isn't enough information.  This patch updates bdev_ops->rw_page() to
      take REQ_OP instead but leaves page_endio() to take bool @is_write.
      This allows the block part of operations to have enough information
      while not leaking it to mm.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Mike Christie <mchristi@redhat.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      3f289dcb
  17. 13 6月, 2018 1 次提交
    • K
      treewide: kmalloc() -> kmalloc_array() · 6da2ec56
      Kees Cook 提交于
      The kmalloc() function has a 2-factor argument form, kmalloc_array(). This
      patch replaces cases of:
      
              kmalloc(a * b, gfp)
      
      with:
              kmalloc_array(a * b, gfp)
      
      as well as handling cases of:
      
              kmalloc(a * b * c, gfp)
      
      with:
      
              kmalloc(array3_size(a, b, c), gfp)
      
      as it's slightly less ugly than:
      
              kmalloc_array(array_size(a, b), c, gfp)
      
      This does, however, attempt to ignore constant size factors like:
      
              kmalloc(4 * 1024, gfp)
      
      though any constants defined via macros get caught up in the conversion.
      
      Any factors with a sizeof() of "unsigned char", "char", and "u8" were
      dropped, since they're redundant.
      
      The tools/ directory was manually excluded, since it has its own
      implementation of kmalloc().
      
      The Coccinelle script used for this was:
      
      // Fix redundant parens around sizeof().
      @@
      type TYPE;
      expression THING, E;
      @@
      
      (
        kmalloc(
      -	(sizeof(TYPE)) * E
      +	sizeof(TYPE) * E
        , ...)
      |
        kmalloc(
      -	(sizeof(THING)) * E
      +	sizeof(THING) * E
        , ...)
      )
      
      // Drop single-byte sizes and redundant parens.
      @@
      expression COUNT;
      typedef u8;
      typedef __u8;
      @@
      
      (
        kmalloc(
      -	sizeof(u8) * (COUNT)
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(__u8) * (COUNT)
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(char) * (COUNT)
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(unsigned char) * (COUNT)
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(u8) * COUNT
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(__u8) * COUNT
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(char) * COUNT
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(unsigned char) * COUNT
      +	COUNT
        , ...)
      )
      
      // 2-factor product with sizeof(type/expression) and identifier or constant.
      @@
      type TYPE;
      expression THING;
      identifier COUNT_ID;
      constant COUNT_CONST;
      @@
      
      (
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(TYPE) * (COUNT_ID)
      +	COUNT_ID, sizeof(TYPE)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(TYPE) * COUNT_ID
      +	COUNT_ID, sizeof(TYPE)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(TYPE) * (COUNT_CONST)
      +	COUNT_CONST, sizeof(TYPE)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(TYPE) * COUNT_CONST
      +	COUNT_CONST, sizeof(TYPE)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(THING) * (COUNT_ID)
      +	COUNT_ID, sizeof(THING)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(THING) * COUNT_ID
      +	COUNT_ID, sizeof(THING)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(THING) * (COUNT_CONST)
      +	COUNT_CONST, sizeof(THING)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(THING) * COUNT_CONST
      +	COUNT_CONST, sizeof(THING)
        , ...)
      )
      
      // 2-factor product, only identifiers.
      @@
      identifier SIZE, COUNT;
      @@
      
      - kmalloc
      + kmalloc_array
        (
      -	SIZE * COUNT
      +	COUNT, SIZE
        , ...)
      
      // 3-factor product with 1 sizeof(type) or sizeof(expression), with
      // redundant parens removed.
      @@
      expression THING;
      identifier STRIDE, COUNT;
      type TYPE;
      @@
      
      (
        kmalloc(
      -	sizeof(TYPE) * (COUNT) * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kmalloc(
      -	sizeof(TYPE) * (COUNT) * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kmalloc(
      -	sizeof(TYPE) * COUNT * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kmalloc(
      -	sizeof(TYPE) * COUNT * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kmalloc(
      -	sizeof(THING) * (COUNT) * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        kmalloc(
      -	sizeof(THING) * (COUNT) * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        kmalloc(
      -	sizeof(THING) * COUNT * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        kmalloc(
      -	sizeof(THING) * COUNT * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      )
      
      // 3-factor product with 2 sizeof(variable), with redundant parens removed.
      @@
      expression THING1, THING2;
      identifier COUNT;
      type TYPE1, TYPE2;
      @@
      
      (
        kmalloc(
      -	sizeof(TYPE1) * sizeof(TYPE2) * COUNT
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
        , ...)
      |
        kmalloc(
      -	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
        , ...)
      |
        kmalloc(
      -	sizeof(THING1) * sizeof(THING2) * COUNT
      +	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
        , ...)
      |
        kmalloc(
      -	sizeof(THING1) * sizeof(THING2) * (COUNT)
      +	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
        , ...)
      |
        kmalloc(
      -	sizeof(TYPE1) * sizeof(THING2) * COUNT
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
        , ...)
      |
        kmalloc(
      -	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
        , ...)
      )
      
      // 3-factor product, only identifiers, with redundant parens removed.
      @@
      identifier STRIDE, SIZE, COUNT;
      @@
      
      (
        kmalloc(
      -	(COUNT) * STRIDE * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	COUNT * (STRIDE) * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	COUNT * STRIDE * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	(COUNT) * (STRIDE) * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	COUNT * (STRIDE) * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	(COUNT) * STRIDE * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	(COUNT) * (STRIDE) * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	COUNT * STRIDE * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      )
      
      // Any remaining multi-factor products, first at least 3-factor products,
      // when they're not all constants...
      @@
      expression E1, E2, E3;
      constant C1, C2, C3;
      @@
      
      (
        kmalloc(C1 * C2 * C3, ...)
      |
        kmalloc(
      -	(E1) * E2 * E3
      +	array3_size(E1, E2, E3)
        , ...)
      |
        kmalloc(
      -	(E1) * (E2) * E3
      +	array3_size(E1, E2, E3)
        , ...)
      |
        kmalloc(
      -	(E1) * (E2) * (E3)
      +	array3_size(E1, E2, E3)
        , ...)
      |
        kmalloc(
      -	E1 * E2 * E3
      +	array3_size(E1, E2, E3)
        , ...)
      )
      
      // And then all remaining 2 factors products when they're not all constants,
      // keeping sizeof() as the second factor argument.
      @@
      expression THING, E1, E2;
      type TYPE;
      constant C1, C2, C3;
      @@
      
      (
        kmalloc(sizeof(THING) * C2, ...)
      |
        kmalloc(sizeof(TYPE) * C2, ...)
      |
        kmalloc(C1 * C2 * C3, ...)
      |
        kmalloc(C1 * C2, ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(TYPE) * (E2)
      +	E2, sizeof(TYPE)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(TYPE) * E2
      +	E2, sizeof(TYPE)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(THING) * (E2)
      +	E2, sizeof(THING)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(THING) * E2
      +	E2, sizeof(THING)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	(E1) * E2
      +	E1, E2
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	(E1) * (E2)
      +	E1, E2
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	E1 * E2
      +	E1, E2
        , ...)
      )
      Signed-off-by: NKees Cook <keescook@chromium.org>
      6da2ec56
  18. 31 5月, 2018 2 次提交
  19. 29 5月, 2018 2 次提交
  20. 06 4月, 2018 1 次提交
  21. 31 3月, 2018 1 次提交
  22. 27 2月, 2018 3 次提交
    • J
      blockdev: Avoid two active bdev inodes for one device · 560e7cb2
      Jan Kara 提交于
      When blkdev_open() races with device removal and creation it can happen
      that unhashed bdev inode gets associated with newly created gendisk
      like:
      
      CPU0					CPU1
      blkdev_open()
        bdev = bd_acquire()
      					del_gendisk()
      					  bdev_unhash_inode(bdev);
      					remove device
      					create new device with the same number
        __blkdev_get()
          disk = get_gendisk()
            - gets reference to gendisk of the new device
      
      Now another blkdev_open() will not find original 'bdev' as it got
      unhashed, create a new one and associate it with the same 'disk' at
      which point problems start as we have two independent page caches for
      one device.
      
      Fix the problem by verifying that the bdev inode didn't get unhashed
      before we acquired gendisk reference. That way we make sure gendisk can
      get associated only with visible bdev inodes.
      Tested-by: NHou Tao <houtao1@huawei.com>
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      560e7cb2
    • J
      genhd: Fix use after free in __blkdev_get() · 89736653
      Jan Kara 提交于
      When two blkdev_open() calls race with device removal and recreation,
      __blkdev_get() can use looked up gendisk after it is freed:
      
      CPU0				CPU1			CPU2
      							del_gendisk(disk);
      							  bdev_unhash_inode(inode);
      blkdev_open()			blkdev_open()
        bdev = bd_acquire(inode);
          - creates and returns new inode
      				  bdev = bd_acquire(inode);
      				    - returns the same inode
        __blkdev_get(devt)		  __blkdev_get(devt)
          disk = get_gendisk(devt);
            - got structure of device going away
      							<finish device removal>
      							<new device gets
      							 created under the same
      							 device number>
      				  disk = get_gendisk(devt);
      				    - got new device structure
      				  if (!bdev->bd_openers) {
      				    does the first open
      				  }
          if (!bdev->bd_openers)
            - false
          } else {
            put_disk_and_module(disk)
              - remember this was old device - this was last ref and disk is
                now freed
          }
          disk_unblock_events(disk); -> oops
      
      Fix the problem by making sure we drop reference to disk in
      __blkdev_get() only after we are really done with it.
      Reported-by: NHou Tao <houtao1@huawei.com>
      Tested-by: NHou Tao <houtao1@huawei.com>
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      89736653
    • J
      genhd: Add helper put_disk_and_module() · 9df6c299
      Jan Kara 提交于
      Add a proper counterpart to get_disk_and_module() -
      put_disk_and_module(). Currently it is opencoded in several places.
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      9df6c299
  23. 11 11月, 2017 1 次提交
    • B
      block, scsi: Make SCSI quiesce and resume work reliably · 3a0a5299
      Bart Van Assche 提交于
      The contexts from which a SCSI device can be quiesced or resumed are:
      * Writing into /sys/class/scsi_device/*/device/state.
      * SCSI parallel (SPI) domain validation.
      * The SCSI device power management methods. See also scsi_bus_pm_ops.
      
      It is essential during suspend and resume that neither the filesystem
      state nor the filesystem metadata in RAM changes. This is why while
      the hibernation image is being written or restored that SCSI devices
      are quiesced. The SCSI core quiesces devices through scsi_device_quiesce()
      and scsi_device_resume(). In the SDEV_QUIESCE state execution of
      non-preempt requests is deferred. This is realized by returning
      BLKPREP_DEFER from inside scsi_prep_state_check() for quiesced SCSI
      devices. Avoid that a full queue prevents power management requests
      to be submitted by deferring allocation of non-preempt requests for
      devices in the quiesced state. This patch has been tested by running
      the following commands and by verifying that after each resume the
      fio job was still running:
      
      for ((i=0; i<10; i++)); do
        (
          cd /sys/block/md0/md &&
          while true; do
            [ "$(<sync_action)" = "idle" ] && echo check > sync_action
            sleep 1
          done
        ) &
        pids=($!)
        for d in /sys/class/block/sd*[a-z]; do
          bdev=${d#/sys/class/block/}
          hcil=$(readlink "$d/device")
          hcil=${hcil#../../../}
          echo 4 > "$d/queue/nr_requests"
          echo 1 > "/sys/class/scsi_device/$hcil/device/queue_depth"
          fio --name="$bdev" --filename="/dev/$bdev" --buffered=0 --bs=512 \
            --rw=randread --ioengine=libaio --numjobs=4 --iodepth=16       \
            --iodepth_batch=1 --thread --loops=$((2**31)) &
          pids+=($!)
        done
        sleep 1
        echo "$(date) Hibernating ..." >>hibernate-test-log.txt
        systemctl hibernate
        sleep 10
        kill "${pids[@]}"
        echo idle > /sys/block/md0/md/sync_action
        wait
        echo "$(date) Done." >>hibernate-test-log.txt
      done
      Reported-by: NOleksandr Natalenko <oleksandr@natalenko.name>
      References: "I/O hangs after resuming from suspend-to-ram" (https://marc.info/?l=linux-block&m=150340235201348).
      Signed-off-by: NBart Van Assche <bart.vanassche@wdc.com>
      Reviewed-by: NHannes Reinecke <hare@suse.com>
      Tested-by: NMartin Steigerwald <martin@lichtvoll.de>
      Tested-by: NOleksandr Natalenko <oleksandr@natalenko.name>
      Cc: Martin K. Petersen <martin.petersen@oracle.com>
      Cc: Ming Lei <ming.lei@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Johannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      3a0a5299
  24. 04 11月, 2017 1 次提交
  25. 14 10月, 2017 1 次提交
  26. 13 10月, 2017 1 次提交
  27. 05 9月, 2017 1 次提交
  28. 24 8月, 2017 2 次提交
  29. 06 7月, 2017 2 次提交
    • J
      block: convert to errseq_t based writeback error tracking · 372cf243
      Jeff Layton 提交于
      This is a very minimal conversion to errseq_t based error tracking
      for raw block device access. Just have it use the standard
      file_write_and_wait_range call.
      
      Note that there are internal callers that call sync_blockdev
      and the like that are not affected by this. They'll continue
      to use the AS_EIO/AS_ENOSPC flags for error reporting like
      they always have for now.
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJeff Layton <jlayton@redhat.com>
      372cf243
    • J
      fs: new infrastructure for writeback error handling and reporting · 5660e13d
      Jeff Layton 提交于
      Most filesystems currently use mapping_set_error and
      filemap_check_errors for setting and reporting/clearing writeback errors
      at the mapping level. filemap_check_errors is indirectly called from
      most of the filemap_fdatawait_* functions and from
      filemap_write_and_wait*. These functions are called from all sorts of
      contexts to wait on writeback to finish -- e.g. mostly in fsync, but
      also in truncate calls, getattr, etc.
      
      The non-fsync callers are problematic. We should be reporting writeback
      errors during fsync, but many places spread over the tree clear out
      errors before they can be properly reported, or report errors at
      nonsensical times.
      
      If I get -EIO on a stat() call, there is no reason for me to assume that
      it is because some previous writeback failed. The fact that it also
      clears out the error such that a subsequent fsync returns 0 is a bug,
      and a nasty one since that's potentially silent data corruption.
      
      This patch adds a small bit of new infrastructure for setting and
      reporting errors during address_space writeback. While the above was my
      original impetus for adding this, I think it's also the case that
      current fsync semantics are just problematic for userland. Most
      applications that call fsync do so to ensure that the data they wrote
      has hit the backing store.
      
      In the case where there are multiple writers to the file at the same
      time, this is really hard to determine. The first one to call fsync will
      see any stored error, and the rest get back 0. The processes with open
      fds may not be associated with one another in any way. They could even
      be in different containers, so ensuring coordination between all fsync
      callers is not really an option.
      
      One way to remedy this would be to track what file descriptor was used
      to dirty the file, but that's rather cumbersome and would likely be
      slow. However, there is a simpler way to improve the semantics here
      without incurring too much overhead.
      
      This set adds an errseq_t to struct address_space, and a corresponding
      one is added to struct file. Writeback errors are recorded in the
      mapping's errseq_t, and the one in struct file is used as the "since"
      value.
      
      This changes the semantics of the Linux fsync implementation such that
      applications can now use it to determine whether there were any
      writeback errors since fsync(fd) was last called (or since the file was
      opened in the case of fsync having never been called).
      
      Note that those writeback errors may have occurred when writing data
      that was dirtied via an entirely different fd, but that's the case now
      with the current mapping_set_error/filemap_check_error infrastructure.
      This will at least prevent you from getting a false report of success.
      
      The new behavior is still consistent with the POSIX spec, and is more
      reliable for application developers. This patch just adds some basic
      infrastructure for doing this, and ensures that the f_wb_err "cursor"
      is properly set when a file is opened. Later patches will change the
      existing code to use this new infrastructure for reporting errors at
      fsync time.
      Signed-off-by: NJeff Layton <jlayton@redhat.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      5660e13d
  30. 29 6月, 2017 1 次提交
    • J
      block: provide bio_uninit() free freeing integrity/task associations · 9ae3b3f5
      Jens Axboe 提交于
      Wen reports significant memory leaks with DIF and O_DIRECT:
      
      "With nvme devive + T10 enabled, On a system it has 256GB and started
      logging /proc/meminfo & /proc/slabinfo for every minute and in an hour
      it increased by 15968128 kB or ~15+GB.. Approximately 256 MB / minute
      leaking.
      
      /proc/meminfo | grep SUnreclaim...
      
      SUnreclaim:      6752128 kB
      SUnreclaim:      6874880 kB
      SUnreclaim:      7238080 kB
      ....
      SUnreclaim:     22307264 kB
      SUnreclaim:     22485888 kB
      SUnreclaim:     22720256 kB
      
      When testcases with T10 enabled call into __blkdev_direct_IO_simple,
      code doesn't free memory allocated by bio_integrity_alloc. The patch
      fixes the issue. HTX has been run with +60 hours without failure."
      
      Since __blkdev_direct_IO_simple() allocates the bio on the stack, it
      doesn't go through the regular bio free. This means that any ancillary
      data allocated with the bio through the stack is not freed. Hence, we
      can leak the integrity data associated with the bio, if the device is
      using DIF/DIX.
      
      Fix this by providing a bio_uninit() and export it, so that we can use
      it to free this data. Note that this is a minimal fix for this issue.
      Any current user of bio's that are allocated outside of
      bio_alloc_bioset() suffers from this issue, most notably some drivers.
      We will fix those in a more comprehensive patch for 4.13. This also
      means that the commit marked as being fixed by this isn't the real
      culprit, it's just the most obvious one out there.
      
      Fixes: 542ff7bf ("block: new direct I/O implementation")
      Reported-by: NWen Xiong <wenxiong@linux.vnet.ibm.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      9ae3b3f5
  31. 28 6月, 2017 1 次提交