1. 27 Apr, 2017 1 commit
  2. 22 Apr, 2017 1 commit
    • I
      block: get rid of blk_integrity_revalidate() · 19b7ccf8
      Ilya Dryomov committed
      Commit 25520d55 ("block: Inline blk_integrity in struct gendisk")
      introduced blk_integrity_revalidate(), which seems to assume ownership
      of the stable pages flag and unilaterally clears it if no blk_integrity
      profile is registered:
      
          if (bi->profile)
                  disk->queue->backing_dev_info->capabilities |=
                          BDI_CAP_STABLE_WRITES;
          else
                  disk->queue->backing_dev_info->capabilities &=
                          ~BDI_CAP_STABLE_WRITES;
      
      It's called from revalidate_disk() and rescan_partitions(), making it
      impossible to enable stable pages for drivers that support partitions
      and don't use blk_integrity: while the call in revalidate_disk() can be
      trivially worked around (see zram, which doesn't support partitions and
      hence gets away with zram_revalidate_disk()), rescan_partitions() can
      be triggered from userspace at any time.  This breaks rbd, where the
      ceph messenger is responsible for generating/verifying CRCs.
      
      Since blk_integrity_{un,}register() "must" be used for (un)registering
      the integrity profile with the block layer, move BDI_CAP_STABLE_WRITES
      setting there.  This way drivers that call blk_integrity_register() and
      use integrity infrastructure won't interfere with drivers that don't
      but still want stable pages.
      
      Fixes: 25520d55 ("block: Inline blk_integrity in struct gendisk")
      Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org # 4.4+, needs backporting
      Tested-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
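The ownership change described in the blk_integrity commit can be sketched in user space. This is a minimal model, not kernel code: `struct model_disk`, `CAP_STABLE_WRITES` and the `model_*` functions are hypothetical stand-ins for the gendisk, `BDI_CAP_STABLE_WRITES` and the block-layer calls named in the message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for BDI_CAP_STABLE_WRITES. */
#define CAP_STABLE_WRITES 0x1u

/* Hypothetical stand-in for a disk and its backing_dev_info capabilities. */
struct model_disk {
    unsigned int capabilities;
    const void *profile;   /* non-NULL while an integrity profile is registered */
};

/* Registering a profile is the point where stable writes get switched on. */
static void model_integrity_register(struct model_disk *d, const void *profile)
{
    assert(d != NULL && profile != NULL);
    d->profile = profile;
    d->capabilities |= CAP_STABLE_WRITES;
}

/* Unregistering the profile switches stable writes off again. */
static void model_integrity_unregister(struct model_disk *d)
{
    assert(d != NULL);
    d->profile = NULL;
    d->capabilities &= ~CAP_STABLE_WRITES;
}

/* After the change, a partition rescan leaves the flag alone. */
static void model_rescan_partitions(struct model_disk *d)
{
    (void)d;
}

static bool model_stable_writes(const struct model_disk *d)
{
    return (d->capabilities & CAP_STABLE_WRITES) != 0;
}
```

In this model a driver that sets the flag itself and never registers a profile (the rbd case in the message) keeps stable writes across any number of rescans.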
  3. 21 Apr, 2017 18 commits
    • J
      nvmet_fc: Rework target side abort handling · a97ec51b
      James Smart committed
      target transport:
      ----------------------
      There are cases when there is a need to abort in-progress target
      operations (writedata) so that controller termination or errors can
      clean up. That can't happen currently as the abort is another target
      op type, so it can't be used till the running one finishes (and it may
      not).  Solve by removing the abort op type and creating a separate
      downcall from the transport to the lldd to request an io to be aborted.
      
      The transport will abort ios on queue teardown or io errors. In general
      the transport tries to call the lldd abort only when the io state is
      idle. Meaning: ops that transmit data (readdata or rsp) will always
      finish their transmit (or the lldd will see a state on the
      link or initiator port that fails the transmit) and the done call for
      the operation will occur. The transport will wait for the op done
      upcall before calling the abort function, and as the io is idle, the
      io can be cleaned up immediately after the abort call. Similarly, ios
      that are not waiting for data or transmitting data must be in the nvmet
      layer being processed. The transport will wait for the nvmet layer
      completion before calling the abort function, and as the io is idle,
      the io can be cleaned up immediately after the abort call. As for ops
      that are waiting for data (writedata), they may be outstanding
      indefinitely if the lldd doesn't see a condition where the initiator
      port or link is bad. In those cases, the transport will call the abort
      function and wait for the lldd's op done upcall for the operation, where
      it will then clean up the io.
      
      Additionally, a new transport upcall was created: if an lldd receives an
      ABTS and matches it to an outstanding request in the transport, it uses
      the upcall to abort that request. The transport expects that any
      outstanding op call (readdata or writedata) will be completed by the lldd and
      the operation upcall made. The transport doesn't act on the reported
      abort (e.g. clean up the io) until an op done upcall occurs, a new op is
      attempted, or the nvmet layer completes the io processing.
      
      fcloop:
      ----------------------
      Updated to support the new target apis.
      On fcp io aborts from the initiator, the loopback context is updated to
      NULL out the half that has completed. The initiator side is immediately
      called after the abort request with an io completion (abort status).
      On fcp io aborts from the target, the io is stopped and the initiator side
      sees it as an aborted io. Target side ops, perhaps in progress while the
      initiator side is done, continue but noop the data movement as there's no
      structure on the initiator side to reference.
      
      patch also contains:
      ----------------------
      Revised lpfc to support the new abort api
      
      commonized rsp buffer syncing and nulling of private data based on
      calling paths.
      
      errors in op done calls no longer trigger any action on the fod. They
      are bad operations, which implies the fod itself may be bad.
      Signed-off-by: James Smart <james.smart@broadcom.com>
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
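The ordering rules in the abort-handling message can be summarized as a small decision function. This is only a sketch of the prose: the enum names and both functions are hypothetical and do not appear in the nvmet_fc transport.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical io states taken from the three cases in the message. */
enum io_state {
    IO_TRANSMITTING,      /* readdata or rsp op in flight */
    IO_IN_NVMET,          /* being processed by the nvmet layer */
    IO_WAITING_FOR_DATA   /* writedata op outstanding */
};

/* Hypothetical description of what the transport does for each state. */
enum abort_plan {
    WAIT_OPDONE_THEN_ABORT,  /* wait for the op done upcall, then call abort */
    WAIT_NVMET_THEN_ABORT,   /* wait for nvmet completion, then call abort */
    ABORT_THEN_WAIT_OPDONE   /* call abort first, clean up on op done */
};

static enum abort_plan plan_abort(enum io_state s)
{
    switch (s) {
    case IO_TRANSMITTING:
        return WAIT_OPDONE_THEN_ABORT;
    case IO_IN_NVMET:
        return WAIT_NVMET_THEN_ABORT;
    case IO_WAITING_FOR_DATA:
        return ABORT_THEN_WAIT_OPDONE;
    }
    assert(!"unknown io state");
    return ABORT_THEN_WAIT_OPDONE;
}

/* The io is idle when abort is called in the first two plans only,
 * so only those allow cleanup right after the abort call. */
static bool cleanup_right_after_abort(enum abort_plan p)
{
    return p != ABORT_THEN_WAIT_OPDONE;
}
```

Only the writedata case calls abort on a non-idle io, which is why it has to wait for the lldd's op done upcall before cleaning up.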
    • J
      nvmet_fc: add req_release to lldd api · 19b58d94
      James Smart committed
      With the advent of the opdone calls changing context, the lldd can no
      longer assume that once the op->done call returns for RSP operations
      that the request struct is no longer being accessed.
      
      As such, revise the lldd api for a req_release callback that the
      transport will call when the job is complete. This will also be used
      with abort cases.
      
      Fixed text in api header for change in io complete semantics.
      
      Revised lpfc to support the new req_release api.
      Signed-off-by: James Smart <james.smart@broadcom.com>
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
    • J
      nvmet_fc: add target feature flags for upcall isr contexts · 39498fae
      James Smart committed
      Two new feature flags were added to control whether upcalls to the
      transport result in context switches or stay in the calling context.
      
      NVMET_FCTGTFEAT_CMD_IN_ISR:
        By default, if the flag is not set, the transport assumes the
        lldd is in a non-isr context and in the cpu context it should be
        for the io queue. As such, the cmd handler is called directly in the
        calling context.
        If the flag is set, indicating the upcall is an isr context, the
        transport mandates a transition to a workqueue. The workqueue assigned
        to the queue is used for the context.
      NVMET_FCTGTFEAT_OPDONE_IN_ISR:
        By default, if the flag is not set, the transport assumes the
        lldd is in a non-isr context and in the cpu context it should be
        for the io queue. As such, the fcp operation done callback is called
        directly in the calling context.
        If the flag is set, indicating the upcall is an isr context, the
        transport mandates a transition to a workqueue. The workqueue assigned
        to the queue is used for the context.
      
      Updated lpfc for flags
      Signed-off-by: James Smart <james.smart@broadcom.com>
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
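The effect of the two flags can be illustrated with a tiny dispatch check. This is a sketch under assumptions: the bit values and the names `FEAT_CMD_IN_ISR`, `FEAT_OPDONE_IN_ISR` and `must_defer_to_workqueue` are hypothetical; the real flag definitions live in the driver header.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit values standing in for the NVMET_FCTGTFEAT_* flags. */
#define FEAT_CMD_IN_ISR     (1u << 0)
#define FEAT_OPDONE_IN_ISR  (1u << 1)

/* The two kinds of upcall the flags control. */
enum upcall {
    UPCALL_CMD,     /* new command received */
    UPCALL_OPDONE   /* fcp operation done */
};

/*
 * Returns true when the transport has to move the work to the queue's
 * workqueue, false when it may run the handler in the calling context.
 */
static bool must_defer_to_workqueue(unsigned int features, enum upcall kind)
{
    unsigned int flag;

    assert(kind == UPCALL_CMD || kind == UPCALL_OPDONE);
    flag = (kind == UPCALL_CMD) ? FEAT_CMD_IN_ISR : FEAT_OPDONE_IN_ISR;
    return (features & flag) != 0;
}
```

With no flags set, both upcalls run directly in the lldd's context, which matches the default described above.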
    • H
      nvme: improve performance for virtual NVMe devices · f9f38e33
      Helen Koike committed
      This change provides a mechanism to reduce the number of MMIO doorbell
      writes for the NVMe driver. When running in a virtualized environment
      like QEMU, the cost of an MMIO is quite hefty. The main idea for
      the patch is to provide the device with two memory locations:
       1) to store the doorbell values so they can be looked up without the
          doorbell MMIO write
       2) to store an event index.
      I believe the doorbell value is obvious, the event index not so much.
      Similar to the virtio specification, the virtual device can tell the
      driver (guest OS) not to write MMIO unless it is writing past this
      value.
      
      FYI: doorbell values are written by the nvme driver (guest OS) and the
      event index is written by the virtual device (host OS).
      
      The patch implements a new admin command that will communicate where
      these two memory locations reside. If the command fails, the nvme
      driver will work as before without any optimizations.
      
      Contributions:
        Eric Northup <digitaleric@google.com>
        Frank Swiderski <fes@google.com>
        Ted Tso <tytso@mit.edu>
        Keith Busch <keith.busch@intel.com>
      
      Just to give an idea of the performance boost with the vendor
      extension: running fio [1] with a stock NVMe driver I get about 200K
      read IOPS; with my vendor patch I get about 1000K read IOPS. This was
      running with a null device i.e. the backing device simply returned
      success on every read IO request.
      
      [1] Running on a 4 core machine:
        fio --time_based --name=benchmark --runtime=30
        --filename=/dev/nvme0n1 --nrfiles=1 --ioengine=libaio --iodepth=32
        --direct=1 --invalidate=1 --verify=0 --verify_fatal=0 --numjobs=4
        --rw=randread --blocksize=4k --randrepeat=false
      Signed-off-by: Rob Nelson <rlnelson@google.com>
      [mlin: port for upstream]
      Signed-off-by: Ming Lin <mlin@kernel.org>
      [koike: updated for upstream]
      Signed-off-by: Helen Koike <helen.koike@collabora.co.uk>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Keith Busch <keith.busch@intel.com>
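The "only write MMIO when moving past the event index" rule can be sketched with the wrap-safe comparison that virtio uses for its event index. The struct and function names here are hypothetical, and the MMIO write is modelled as a counter; this is an illustration of the idea, not the driver's code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * True when the doorbell moved from old_idx to new_idx and passed
 * event_idx on the way. Arithmetic is done modulo 2^16 so that
 * wrap-around of the indices is handled.
 */
static bool need_event(uint16_t event_idx, uint16_t new_idx, uint16_t old_idx)
{
    return (uint16_t)(new_idx - event_idx - 1) < (uint16_t)(new_idx - old_idx);
}

/* Hypothetical shadow state for one queue. */
struct shadow_db {
    uint16_t db;               /* doorbell value, written by the driver */
    uint16_t event_idx;        /* event index, written by the device */
    unsigned int mmio_writes;  /* counts the MMIO writes that were needed */
};

/* Always update the shadow doorbell; touch MMIO only when required. */
static void ring_doorbell(struct shadow_db *s, uint16_t new_value)
{
    uint16_t old;

    assert(s != NULL);
    old = s->db;
    s->db = new_value;
    if (need_event(s->event_idx, new_value, old))
        s->mmio_writes++;      /* stand-in for the real doorbell write */
}
```

As long as the device keeps the event index ahead of the driver's doorbell value, every update stays in memory and no MMIO exit is taken.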
    • S
      blk-mq: Fix poll_stat for new size-based bucketing. · 0206319f
      Stephen Bates committed
      Fixes an issue where the size of the poll_stat array in request_queue
      does not match the size expected by the new size based bucketing for
      IO completion polling.
      
      Fixes: 720b8ccc ("blk-mq: Add a polling specific stats function")
      Signed-off-by: Stephen Bates <sbates@raithlin.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • C
    • C
      block: add a error_count field to struct request · e26738e0
      Christoph Hellwig committed
      This is for the legacy floppy and ataflop drivers that currently abuse
      ->errors for this purpose.  It's stashed away in a union to not grow
      the struct size; the other fields are either used by modern drivers
      for different purposes or by the I/O scheduler before queueing the I/O
      to drivers.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
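Stashing a field in a union so the struct does not grow can be shown with a toy layout. The structs below are hypothetical and do not match the kernel's `struct request`; `sched_data` merely stands in for "a field used at a different stage". The size check assumes a common ABI where a pointer is at least as large and as strictly aligned as an `int`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical request with the legacy counter sharing storage. */
struct model_request {
    int tag;
    union {
        void *sched_data;   /* stand-in: used before the I/O reaches a driver */
        int error_count;    /* legacy drivers: per-request error counter */
    };
};

/* The same struct without the extra field, for size comparison. */
struct model_request_plain {
    int tag;
    void *sched_data;
};

/* On the assumed ABI the union member adds no bytes. */
static_assert(sizeof(struct model_request) == sizeof(struct model_request_plain),
              "error_count must not grow the struct");

/* A legacy driver bumps the counter on each failed attempt. */
static int model_record_error(struct model_request *rq)
{
    assert(rq != NULL);
    return ++rq->error_count;
}
```

The trade-off is the usual one for unions: the two members must never be live at the same time.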
    • C
      blk-mq: remove the error argument to blk_mq_complete_request · 08e0029a
      Christoph Hellwig committed
      Now that all drivers that call blk_mq_complete_request have a
      ->complete callback we can remove the direct call to blk_mq_end_request,
      as well as the error argument to blk_mq_complete_request.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • C
      scsi: introduce a result field in struct scsi_request · 17d5363b
      Christoph Hellwig committed
      This passes on the scsi_cmnd result field to users of passthrough
      requests.  Currently we abuse req->errors for this purpose, but that
      field will go away in its current form.
      
      Note that the old IDE code abuses the errors field in very creative
      ways and stores all kinds of different values in it.  I didn't dare
      to touch this magic, so the abuses are brought forward 1:1.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
      Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • C
      block: remove the blk_execute_rq return value · b7819b92
      Christoph Hellwig committed
      The function only returns -EIO if rq->errors is non-zero, which is not
      very useful and lets a large number of callers ignore the return value.
      
      Just let the callers figure out their error themselves.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • J
      bdi: Drop 'parent' argument from bdi_register[_va]() · 7c4cc300
      Jan Kara committed
      Drop 'parent' argument of bdi_register() and bdi_register_va().  It is
      always NULL.
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • J
      block: Remove unused functions · 2e82b84c
      Jan Kara committed
      Now that all backing_dev_info structures are allocated separately, we can
      drop some unused functions.
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • J
      fs: Remove SB_I_DYNBDI flag · c1844d53
      Jan Kara committed
      Now that all bdi structures filesystems use are properly refcounted, we
      can remove the SB_I_DYNBDI flag.
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • J
      nfs: Convert to separately allocated bdi · 0db10944
      Jan Kara committed
      Allocate struct backing_dev_info separately instead of embedding it
      inside the superblock. This unifies handling of bdi among users.
      
      CC: Anna Schumaker <anna.schumaker@netapp.com>
      CC: linux-nfs@vger.kernel.org
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Acked-by: Trond Myklebust <trond.myklebust@primarydata.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • J
      coda: Convert to separately allocated bdi · a5695a79
      Jan Kara committed
      Allocate struct backing_dev_info separately instead of embedding it
      inside the superblock. This unifies handling of bdi among users.
      
      CC: Jan Harkes <jaharkes@cs.cmu.edu>
      CC: coda@cs.cmu.edu
      CC: codalist@coda.cs.cmu.edu
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • J
      mtd: Convert to dynamically allocated bdi infrastructure · fa06052d
      Jan Kara committed
      MTD already allocates backing_dev_info dynamically. Convert it to use
      generic infrastructure for this including proper refcounting. We drop
      mtd->backing_dev_info as its only use was to pass mtd_bdi pointer from
      one file into another and if we wanted to keep that in a clean way, we'd
      have to make mtd hold and drop bdi reference as needed which seems
      pointless for passing one global pointer...
      
      CC: David Woodhouse <dwmw2@infradead.org>
      CC: Brian Norris <computersforpeace@gmail.com>
      CC: linux-mtd@lists.infradead.org
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • J
      fs: Provide infrastructure for dynamic BDIs in filesystems · fca39346
      Jan Kara committed
      Provide helper functions for setting up dynamically allocated
      backing_dev_info structures for filesystems and cleaning them up on
      superblock destruction.
      
      CC: linux-mtd@lists.infradead.org
      CC: linux-nfs@vger.kernel.org
      CC: Petr Vandrovec <petr@vandrovec.name>
      CC: linux-nilfs@vger.kernel.org
      CC: cluster-devel@redhat.com
      CC: osd-dev@open-osd.org
      CC: codalist@coda.cs.cmu.edu
      CC: linux-afs@lists.infradead.org
      CC: ecryptfs@vger.kernel.org
      CC: linux-cifs@vger.kernel.org
      CC: ceph-devel@vger.kernel.org
      CC: linux-btrfs@vger.kernel.org
      CC: v9fs-developer@lists.sourceforge.net
      CC: lustre-devel@lists.lustre.org
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • J
      bdi: Provide bdi_register_va() and bdi_alloc() · baf7a616
      Jan Kara committed
      Add a function that registers a bdi and takes a va_list instead of a
      variable number of arguments.
      
      Add bdi_alloc() as a simple wrapper for NUMA-unaware users allocating a BDI.
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
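The usual reason to pair a variadic function with a `va_list` flavour is that only the latter can be called from another variadic wrapper. A minimal user-space sketch of that pattern follows; `struct model_bdi`, `model_register_va` and `model_register` are hypothetical names, and the "registration" is reduced to formatting a name.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for a bdi that only carries its formatted name. */
struct model_bdi {
    char name[32];
};

/* va_list flavour: does the work, and can be reused by any variadic caller. */
static int model_register_va(struct model_bdi *bdi, const char *fmt, va_list args)
{
    int n;

    assert(bdi != NULL && fmt != NULL);
    n = vsnprintf(bdi->name, sizeof(bdi->name), fmt, args);
    /* Treat formatting errors and truncation as failure. */
    return (n < 0 || (size_t)n >= sizeof(bdi->name)) ? -1 : 0;
}

/* Variadic flavour: a thin wrapper that forwards its arguments. */
static int model_register(struct model_bdi *bdi, const char *fmt, ...)
{
    va_list args;
    int ret;

    va_start(args, fmt);
    ret = model_register_va(bdi, fmt, args);
    va_end(args);
    return ret;
}
```

A filesystem-level helper that itself takes `fmt, ...` could forward to the `va_list` flavour in the same way, which a purely variadic function would not allow.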
  4. 20 Apr, 2017 4 commits
  5. 19 Apr, 2017 1 commit
    • A
      block, bfq: add full hierarchical scheduling and cgroups support · e21b7a0b
      Arianna Avanzini committed
      Add complete support for full hierarchical scheduling, with a cgroups
      interface. Full hierarchical scheduling is implemented through the
      'entity' abstraction: both bfq_queues, i.e., the internal BFQ queues
      associated with processes, and groups are represented in general by
      entities. Given the bfq_queues associated with the processes belonging
      to a given group, the entities representing these queues are children of
      the entity representing the group. At higher levels, if a group, say
      G, contains other groups, then the entity representing G is the parent
      entity of the entities representing the groups in G.
      
      Hierarchical scheduling is performed as follows: if the timestamps of
      a leaf entity (i.e., of a bfq_queue) change, and such a change lets
      the entity become the next-to-serve entity for its parent entity, then
      the timestamps of the parent entity are recomputed as a function of
      the budget of its new next-to-serve leaf entity. If the parent entity
      belongs, in its turn, to a group, and its new timestamps let it become
      the next-to-serve for its parent entity, then the timestamps of the
      latter parent entity are recomputed as well, and so on. When a new
      bfq_queue must be set in service, the reverse path is followed: the
      next-to-serve highest-level entity is chosen, then its next-to-serve
      child entity, and so on, until the next-to-serve leaf entity is
      reached, and the bfq_queue that this entity represents is set in
      service.
      
      Writeback is accounted for on a per-group basis, i.e., for each group,
      the async I/O requests of the processes of the group are enqueued in a
      distinct bfq_queue, and the entity associated with this queue is a
      child of the entity associated with the group.
      
      Weights can be assigned explicitly to groups and processes through the
      cgroups interface, differently from what happens, for single
      processes, if the cgroups interface is not used (as explained in the
      description of the previous patch). In particular, since each node has
      a full scheduler, each group can be assigned its own weight.
      Signed-off-by: Fabio Checconi <fchecconi@gmail.com>
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
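The top-down half of the hierarchical scheduling description (pick the next-to-serve entity at the highest level, then its next-to-serve child, down to a leaf) can be sketched as a pointer walk. `struct model_entity` and `pick_queue` are hypothetical; the timestamp recomputation that maintains `next_to_serve` on the way up is deliberately left out.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical entity: a group points at its current next-to-serve child,
 * a leaf (standing for a bfq_queue) has no child.
 */
struct model_entity {
    const char *name;
    struct model_entity *next_to_serve;   /* NULL for a leaf */
};

/* Follow next-to-serve links from the root down to the leaf to serve. */
static const struct model_entity *pick_queue(const struct model_entity *root)
{
    const struct model_entity *e = root;

    assert(root != NULL);
    while (e->next_to_serve != NULL)
        e = e->next_to_serve;
    return e;
}
```

The cost of the walk is the depth of the group hierarchy, because each level has already cached its own next-to-serve choice.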
  6. 17 Apr, 2017 3 commits
  7. 15 Apr, 2017 4 commits
  8. 09 Apr, 2017 6 commits
  9. 08 Apr, 2017 2 commits