1. 26 August 2017, 1 commit
  2. 18 August 2017, 1 commit
  3. 15 August 2017, 1 commit
  4. 10 August 2017, 3 commits
    • block: Make blk_mq_delay_kick_requeue_list() rerun the queue at a quiet time · d4acf365
      Bart Van Assche committed
      The blk_mq_delay_kick_requeue_list() function is used by the device
      mapper, and only by the device mapper, to rerun the queue and requeue
      list after a delay. This function is called once per request that
      gets requeued. Modify this function such that the queue is run once
      per path-change event instead of once per requeued request.
      
      Fixes: commit 2849450a ("blk-mq: introduce blk_mq_delay_kick_requeue_list()")
      Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Cc: Laurence Oberman <loberman@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
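
      A minimal sketch of the coalescing behaviour described above, written
      against the generic workqueue API rather than the exact blk-mq helpers
      (the example_* names are illustrative, not kernel symbols):

      #include <linux/workqueue.h>
      #include <linux/jiffies.h>

      struct example_queue {
              struct delayed_work requeue_work;       /* runs the requeue list */
      };

      static void example_delay_kick(struct example_queue *q, unsigned long msecs)
      {
              /*
               * mod_delayed_work() replaces any pending timer, so a burst of
               * calls (one per requeued request) results in a single run of
               * the requeue work after the last call, i.e. at a quiet time.
               */
              mod_delayed_work(system_wq, &q->requeue_work,
                               msecs_to_jiffies(msecs));
      }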
    • bio-integrity: only verify integrity on the lowest stacked driver · f86e28c4
      Christoph Hellwig committed
      This gets us back to the behavior in 4.12 and earlier.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Fixes: 7c20f116 ("bio-integrity: stop abusing bi_end_io")
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bio-integrity: Fix regression if profile verify_fn is NULL · c775d209
      Milan Broz committed
      In the dm-integrity target we register an integrity profile that has
      both the generate_fn and verify_fn callbacks set to NULL.
      
      This is used if dm-integrity is stacked under a dm-crypt device
      for authenticated encryption (the integrity payload contains the
      authentication tag and IV seed).
      
      In this case the verification is done through dm-crypt's own crypto
      API processing; the integrity profile is only a holder of this data.
      (And the memory is owned by dm-crypt as well.)
      
      After the commit (and previous changes)
        Commit 7c20f116
        Author: Christoph Hellwig <hch@lst.de>
        Date:   Mon Jul 3 16:58:43 2017 -0600
      
          bio-integrity: stop abusing bi_end_io
      
      we get this crash:
      
      : BUG: unable to handle kernel NULL pointer dereference at   (null)
      : IP:   (null)
      : *pde = 00000000
      ...
      :
      : Workqueue: kintegrityd bio_integrity_verify_fn
      : task: f48ae180 task.stack: f4b5c000
      : EIP:   (null)
      : EFLAGS: 00210286 CPU: 0
      : EAX: f4b5debc EBX: 00001000 ECX: 00000001 EDX: 00000000
      : ESI: 00001000 EDI: ed25f000 EBP: f4b5dee8 ESP: f4b5dea4
      :  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
      : CR0: 80050033 CR2: 00000000 CR3: 32823000 CR4: 001406d0
      : Call Trace:
      :  ? bio_integrity_process+0xe3/0x1e0
      :  bio_integrity_verify_fn+0xea/0x150
      :  process_one_work+0x1c7/0x5c0
      :  worker_thread+0x39/0x380
      :  kthread+0xd6/0x110
      :  ? process_one_work+0x5c0/0x5c0
      :  ? kthread_worker_fn+0x100/0x100
      :  ? kthread_worker_fn+0x100/0x100
      :  ret_from_fork+0x19/0x24
      : Code:  Bad EIP value.
      : EIP:   (null) SS:ESP: 0068:f4b5dea4
      : CR2: 0000000000000000
      
      The patch just skips the whole verify workqueue if verify_fn is set to NULL.
      
      Fixes: 7c20f116 ("bio-integrity: stop abusing bi_end_io")
      Signed-off-by: Milan Broz <gmazyland@gmail.com>
      [hch: trivial whitespace fix]
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
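
      A sketch of the kind of check the fix adds (simplified, hypothetical
      names, not the actual kernel function): the read-completion path should
      only bounce a bio to the verify workqueue when the registered profile
      actually provides a verify_fn.

      #include <linux/bio.h>
      #include <linux/blkdev.h>

      static bool example_should_queue_verify(struct bio *bio,
                                              struct blk_integrity *bi)
      {
              /*
               * dm-integrity stacked under dm-crypt registers a profile with
               * verify_fn == NULL; queueing verify work in that case ends in
               * a NULL function-pointer call, as in the oops above.
               */
              return bio_op(bio) == REQ_OP_READ && !bio->bi_status &&
                     bi && bi->profile && bi->profile->verify_fn;
      }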
  5. 02 August 2017, 1 commit
  6. 30 July 2017, 2 commits
    • block, bfq: consider also in_service_entity to state whether an entity is active · 46d556e6
      Paolo Valente committed
      Groups of BFQ queues are represented by generic entities in BFQ. When
      a queue belonging to a parent entity is deactivated, the parent entity
      may need to be deactivated too, in case the deactivated queue was the
      only active queue for the parent entity. This deactivation may need to
      be propagated upwards if the entity belongs, in its turn, to a further
      higher-level entity, and so on. In particular, the upward propagation
      of deactivation stops at the first parent entity that remains active
      even if one of its child entities has been deactivated.
      
      To decide whether the last non-deactivation condition holds for a
      parent entity, BFQ checks whether the field next_in_service is still
      not NULL for the parent entity after the deactivation of one of its
      child entities. If it is not NULL, then there are certainly other
      active entities in the parent entity, and the deactivation can stop.
      
      Unfortunately, this check misses a corner case: if in_service_entity
      is not NULL, then next_in_service may happen to be NULL, although the
      parent entity is evidently active. This happens if: 1) the entity
      pointed to by in_service_entity is the only active entity in the parent
      entity, and 2) according to the definition of next_in_service, the
      in_service_entity cannot be considered as next_in_service. See the
      comments on the definition of next_in_service for details on this
      second point.
      
      Hitting the above corner case causes crashes.
      
      To address this issue, this commit:
      1) Extends the above check, which considered only next_in_service, to
      check both next_in_service and in_service_entity (if either of them is
      not NULL, then no further deactivation is performed);
      2) Improves the (important) comments on how next_in_service is defined
      and updated; in particular, it fixes a few rather obscure paragraphs.
      Reported-by: Eric Wheeler <bfq-sched@lists.ewheeler.net>
      Reported-by: Rick Yiu <rick_yiu@htc.com>
      Reported-by: Tom X Nguyen <tom81094@gmail.com>
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Tested-by: Eric Wheeler <bfq-sched@lists.ewheeler.net>
      Tested-by: Rick Yiu <rick_yiu@htc.com>
      Tested-by: Laurentiu Nicola <lnicola@dend.ro>
      Tested-by: Tom X Nguyen <tom81094@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
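
      A toy illustration of the extended "is this entity still active?" test
      (simplified structures, not the actual BFQ code):

      #include <stdbool.h>
      #include <stddef.h>

      struct example_entity;

      struct example_sched_data {
              struct example_entity *next_in_service;
              struct example_entity *in_service_entity;
      };

      static bool example_entity_is_active(const struct example_sched_data *sd)
      {
              /*
               * Checking only next_in_service misses the corner case where
               * the in-service child is the sole active child but cannot be
               * chosen as next_in_service; in_service_entity covers it.
               */
              return sd->next_in_service != NULL || sd->in_service_entity != NULL;
      }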
    • block, bfq: reset in_service_entity if it becomes idle · 6ab1d8da
      Paolo Valente committed
      BFQ implements hierarchical scheduling by representing each group of
      queues with a generic parent entity. For each parent entity, BFQ
      maintains an in_service_entity pointer: if one of the child entities
      happens to be in service, in_service_entity points to it.  The
      resetting of these pointers happens only on queue expirations: when
      the in-service queue is expired, i.e., stops being the queue in
      service, BFQ resets all in_service_entity pointers along the
      parent-entity path from this queue to the root entity.
      
      Functions handling the scheduling of entities assume, naturally, that
      in-service entities are active, i.e., have pending I/O requests (or,
      as a special case, even if they have no pending requests, they are
      expected to receive a new request very soon, with the scheduler idling
      the storage device while waiting for such an event). Unfortunately,
      the above resetting scheme of the in_service_entity pointers may cause
      this assumption to be violated.  For example, the in-service queue may
      happen to remain without requests because of a request merge. In this
      case the queue does become idle, and all related data structures are
      updated accordingly. But in_service_entity still points to the queue
      in the parent entity. This inconsistency may even propagate to
      higher-level parent entities, if they happen to become idle as well,
      as a consequence of the leaf queue becoming idle. For this queue and
      these parent entities, scheduling functions have undefined behaviour
      and, as reported, may easily lead to kernel crashes or hangs.
      
      This commit addresses this issue by simply resetting the
      in_service_entity field also when it is detected to point to an entity
      becoming idle (regardless of why the entity becomes idle).
      Reported-by: Laurentiu Nicola <lnicola@dend.ro>
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Tested-by: Laurentiu Nicola <lnicola@dend.ro>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
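
      A minimal sketch of the reset described above (again with made-up,
      simplified types rather than the actual BFQ structures): whenever an
      entity is detected to have become idle, the parent's in_service_entity
      pointer is cleared if it still refers to that entity.

      #include <stddef.h>

      struct example_entity;

      struct example_parent {
              struct example_entity *in_service_entity;
      };

      static void example_entity_became_idle(struct example_parent *parent,
                                             struct example_entity *entity)
      {
              /* Never leave a dangling "in service" pointer to an idle entity. */
              if (parent->in_service_entity == entity)
                      parent->in_service_entity = NULL;
      }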
  7. 25 July 2017, 1 commit
  8. 24 July 2017, 1 commit
  9. 12 July 2017, 2 commits
    • bfq: dispatch request to prevent queue stalling after the request completion · 3f7cb4f4
      Hou Tao committed
      There are mq devices (e.g., virtio-blk, nbd and loopback) which don't
      invoke blk_mq_run_hw_queues() after the completion of a request.
      If bfq is enabled on these devices and the slice_idle attribute or
      strict_guarantees attribute is set to zero, it is possible that,
      after a request completion, the remaining requests of a busy bfq queue
      will stall in the bfq scheduler until a new request arrives.
      
      To fix this scheduler latency problem, we need to check whether or not
      all issued requests have completed, and dispatch more requests to the
      driver if there is no request in the driver.
      
      The problem can be reproduced by running the following script
      on a virtio-blk device with nr_hw_queues set to 1:
      
      #!/bin/sh
      
      dev=vdb
      # mount point for dev
      mp=/tmp/mnt
      cd $mp
      
      job=strict.job
      cat <<EOF > $job
      [global]
      direct=1
      bs=4k
      size=256M
      rw=write
      ioengine=libaio
      iodepth=128
      runtime=5
      time_based
      
      [1]
      filename=1.data
      
      [2]
      new_group
      filename=2.data
      EOF
      
      echo bfq > /sys/block/$dev/queue/scheduler
      echo 1 > /sys/block/$dev/queue/iosched/strict_guarantees
      fio $job
      Signed-off-by: Hou Tao <houtao1@huawei.com>
      Reviewed-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
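
      A rough sketch of the completion-side check described above (simplified
      fields and a hypothetical example_bfq_data type, not the actual patch):

      #include <linux/blk-mq.h>

      struct example_bfq_data {
              int rq_in_driver;       /* requests currently owned by the driver */
              int queued;             /* requests still held by the scheduler */
      };

      static void example_completed_request(struct request_queue *q,
                                            struct example_bfq_data *bfqd)
      {
              /*
               * If the driver has nothing in flight but the scheduler still
               * holds queued requests, kick the hardware queues so dispatch
               * does not wait for the next request arrival.
               */
              if (bfqd->rq_in_driver == 0 && bfqd->queued > 0)
                      blk_mq_run_hw_queues(q, true);
      }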
    • bfq: fix typos in comments about B-WF2Q+ algorithm · 38c91407
      Hou Tao committed
      The start time of an eligible entity should be less than or equal to
      the current virtual time, and an entity in the idle tree has a finish
      time greater than the current virtual time.
      Signed-off-by: Hou Tao <houtao1@huawei.com>
      Reviewed-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
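
      Expressed as code, the corrected eligibility rule reads roughly as
      follows (an illustrative predicate, not a BFQ function):

      #include <stdbool.h>

      /* An entity is eligible when its virtual start time has been reached. */
      static bool example_entity_is_eligible(unsigned long long start_vtime,
                                             unsigned long long cur_vtime)
      {
              return start_vtime <= cur_vtime;
      }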
  10. 11 July 2017, 1 commit
    • block: call bio_uninit in bio_endio · b222dd2f
      Shaohua Li committed
      bio_free isn't a good place to free cgroup info. There are a lot of
      cases where a bio is allocated in a special way (for example, on the
      stack) and bio_put, and hence bio_free, never gets called for it, so
      we are leaking memory. This patch moves the free to bio_endio, which
      should be called anyway. The bio_uninit call in bio_free is kept, in
      case the bio never reaches bio_endio.
      
      This assumes ->bi_end_io() doesn't access cgroup info, which seems true
      in my audit.
      
      This along with Christoph's integrity patch should fix the memory leak
      issue.
      
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
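
      A simplified sketch of the completion-time cleanup this describes (not
      the kernel's bio_endio, just an illustration of the ordering):

      #include <linux/bio.h>

      static void example_bio_endio(struct bio *bio)
      {
              /*
               * Release per-bio ancillary data (cgroup/integrity/task
               * associations) here, because stack-allocated bios never go
               * through bio_put()/bio_free().
               */
              bio_uninit(bio);
              if (bio->bi_end_io)
                      bio->bi_end_io(bio);    /* caller's completion callback */
      }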
  11. 06 July 2017, 1 commit
    • block: Fix __blkdev_issue_zeroout loop · 615d22a5
      Damien Le Moal committed
      The BIO issuing loop in __blkdev_issue_zeroout() is allocating BIOs
      with a maximum number of bvecs (pages) equal to
      
      min(nr_sects, (sector_t)BIO_MAX_PAGES)
      
      This works since the requested number of bvecs will always be limited
      to the absolute maximum number supported (BIO_MAX_PAGES), but it is
      inefficient because too many bvec entries may be requested due to the
      different units being used in the min() operation (number of sectors
      vs. number of pages).
      To fix this, introduce the helper __blkdev_sectors_to_bio_pages() to
      correctly calculate the number of bvecs for zeroout BIOs as the issuing
      loop progresses. The calculation is done using consistent units and
      makes sure that the number of pages returned is at least 1 (for cases
      where the number of sectors is less than the number of sectors in
      a page).
      
      Also remove a trailing space after the bit shift in the internal loop
      min() call.
      Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
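
      A sketch of what such a helper can look like (an approximation assuming
      512-byte sectors; not necessarily the exact kernel implementation):

      #include <linux/kernel.h>
      #include <linux/bio.h>

      static unsigned int example_sectors_to_bio_pages(sector_t nr_sects)
      {
              /* Round up so that fewer than PAGE_SIZE / 512 remaining
               * sectors still yields one page; cap at BIO_MAX_PAGES. */
              sector_t pages = DIV_ROUND_UP_SECTOR_T(nr_sects, PAGE_SIZE / 512);

              return min(pages, (sector_t)BIO_MAX_PAGES);
      }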
  12. 05 July 2017, 1 commit
  13. 04 July 2017, 9 commits
  14. 30 June 2017, 2 commits
  15. 29 June 2017, 4 commits
    • blk-mq: map all HWQ also in hyperthreaded system · fe631457
      Max Gurtovoy committed
      This patch performs sequential mapping between CPUs and queues.
      If the system has more CPUs than HWQs, there are still CPUs left
      to map to HWQs. On a hyperthreaded system, map the unmapped CPUs
      and their siblings to the same HWQ.
      This actually fixes a bug where unmapped HWQs were found on a system
      with 2 sockets, 18 cores per socket and 2 threads per core (72 CPUs
      in total) running NVMe over Fabrics (which opens up to a maximum of
      64 HWQs).
      
      Performance results from running fio (72 jobs, iodepth 128)
      using null_blk (with/without the patch):
      
      bs      IOPS(read submit_queues=72)   IOPS(write submit_queues=72)   IOPS(read submit_queues=24)  IOPS(write submit_queues=24)
      -----  ----------------------------  ------------------------------ ---------------------------- -----------------------------
      512    4890.4K/4723.5K                 4524.7K/4324.2K                   4280.2K/4264.3K               3902.4K/3909.5K
      1k     4910.1K/4715.2K                 4535.8K/4309.6K                   4296.7K/4269.1K               3906.8K/3914.9K
      2k     4906.3K/4739.7K                 4526.7K/4330.6K                   4301.1K/4262.4K               3890.8K/3900.1K
      4k     4918.6K/4730.7K                 4556.1K/4343.6K                   4297.6K/4264.5K               3886.9K/3893.9K
      8k     4906.4K/4748.9K                 4550.9K/4346.7K                   4283.2K/4268.8K               3863.4K/3858.2K
      16k    4903.8K/4782.6K                 4501.5K/4233.9K                   4292.3K/4282.3K               3773.1K/3773.5K
      32k    4885.8K/4782.4K                 4365.9K/4184.2K                   4307.5K/4289.4K               3780.3K/3687.3K
      64k    4822.5K/4762.7K                 2752.8K/2675.1K                   4308.8K/4312.3K               2651.5K/2655.7K
      128k   2388.5K/2313.8K                 1391.9K/1375.7K                   2142.8K/2152.2K               1395.5K/1374.2K
      Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
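
      A toy, user-space model of the mapping idea (a sketch with a made-up
      sibling table, not the kernel's blk_mq_map_queues() implementation):

      #include <stdio.h>

      #define NR_CPUS 8
      #define NR_HWQS 4

      int main(void)
      {
              /* sibling[c] = the other hyperthread on the same core (toy topology) */
              int sibling[NR_CPUS] = { 4, 5, 6, 7, 0, 1, 2, 3 };
              int map[NR_CPUS];
              int c, q = 0;

              for (c = 0; c < NR_CPUS; c++)
                      map[c] = -1;

              /* Sequentially assign queues; a CPU's sibling shares its HWQ. */
              for (c = 0; c < NR_CPUS; c++) {
                      if (map[c] >= 0)
                              continue;
                      map[c] = q;
                      if (sibling[c] != c && map[sibling[c]] < 0)
                              map[sibling[c]] = q;
                      q = (q + 1) % NR_HWQS;
              }

              for (c = 0; c < NR_CPUS; c++)
                      printf("cpu %d -> hwq %d\n", c, map[c]);
              return 0;
      }

      With this toy topology every HWQ ends up used and each sibling pair
      shares a queue, which is the property the patch restores for the
      72-CPU / 64-HWQ NVMe-over-Fabrics case above.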
    • block: provide bio_uninit() for freeing integrity/task associations · 9ae3b3f5
      Jens Axboe committed
      Wen reports significant memory leaks with DIF and O_DIRECT:
      
      "With nvme devive + T10 enabled, On a system it has 256GB and started
      logging /proc/meminfo & /proc/slabinfo for every minute and in an hour
      it increased by 15968128 kB or ~15+GB.. Approximately 256 MB / minute
      leaking.
      
      /proc/meminfo | grep SUnreclaim...
      
      SUnreclaim:      6752128 kB
      SUnreclaim:      6874880 kB
      SUnreclaim:      7238080 kB
      ....
      SUnreclaim:     22307264 kB
      SUnreclaim:     22485888 kB
      SUnreclaim:     22720256 kB
      
      When testcases with T10 enabled call into __blkdev_direct_IO_simple,
      the code doesn't free the memory allocated by bio_integrity_alloc. The
      patch fixes the issue. HTX has been run for 60+ hours without failure."
      
      Since __blkdev_direct_IO_simple() allocates the bio on the stack, it
      doesn't go through the regular bio free. This means that any ancillary
      data allocated with the bio through the stack is not freed. Hence, we
      can leak the integrity data associated with the bio, if the device is
      using DIF/DIX.
      
      Fix this by providing a bio_uninit() and export it, so that we can use
      it to free this data. Note that this is a minimal fix for this issue.
      Any current user of bios that are allocated outside of
      bio_alloc_bioset() suffers from this issue, most notably some drivers.
      We will fix those in a more comprehensive patch for 4.13. This also
      means that the commit marked as being fixed by this isn't the real
      culprit, it's just the most obvious one out there.
      
      Fixes: 542ff7bf ("block: new direct I/O implementation")
      Reported-by: Wen Xiong <wenxiong@linux.vnet.ibm.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
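
      A sketch of the on-stack usage pattern this fixes (a hypothetical
      caller modelled on __blkdev_direct_IO_simple(), not the actual code):

      #include <linux/bio.h>

      static void example_use_stack_bio(struct bio_vec *vecs, unsigned short nr)
      {
              struct bio bio;

              bio_init(&bio, vecs, nr);
              /* ... fill in the bio, submit it and wait for completion ... */

              /*
               * A stack bio never goes through bio_put()/bio_free(), so its
               * ancillary data (e.g. the integrity payload when DIF/DIX is
               * in use) must be released explicitly.
               */
              bio_uninit(&bio);
      }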
    • blk-mq: Create hctx for each present CPU · 4b855ad3
      Christoph Hellwig committed
      Currently we only create hctx for online CPUs, which can lead to a lot
      of churn due to frequent soft offline / online operations.  Instead
      allocate one for each present CPU to avoid this and dramatically simplify
      the code.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Jens Axboe <axboe@kernel.dk>
      Cc: Keith Busch <keith.busch@intel.com>
      Cc: linux-block@vger.kernel.org
      Cc: linux-nvme@lists.infradead.org
      Link: http://lkml.kernel.org/r/20170626102058.10200-3-hch@lst.de
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • blk-mq: Include all present CPUs in the default queue mapping · 5f042e7c
      Christoph Hellwig committed
      This way we get a nice distribution independent of the current cpu
      online / offline state.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Jens Axboe <axboe@kernel.dk>
      Cc: Keith Busch <keith.busch@intel.com>
      Cc: linux-block@vger.kernel.org
      Cc: linux-nvme@lists.infradead.org
      Link: http://lkml.kernel.org/r/20170626102058.10200-2-hch@lst.de
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
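
      The gist, as a sketch (an illustrative loop over present CPUs, not the
      actual default-mapping function):

      #include <linux/cpumask.h>

      static void example_build_default_map(unsigned int *map,
                                            unsigned int nr_queues)
      {
              unsigned int cpu, q = 0;

              /* Iterate present CPUs (not just online ones) so the mapping
               * stays stable across CPU hotplug. */
              for_each_present_cpu(cpu) {
                      map[cpu] = q;
                      q = (q + 1) % nr_queues;
              }
      }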
  16. 28 June 2017, 9 commits