1. 23 Mar 2015, 1 commit
  2. 24 Feb 2015, 1 commit
  3. 20 Feb 2015, 6 commits
    • NVMe: Fix potential corruption on sync commands · 0c0f9b95
      By Keith Busch
      This makes all sync commands uninterruptible and schedules without timeout
      so the controller either has to post a completion or the timeout recovery
      fails the command. This fixes potential memory or data corruption from
      a command timing out too early or being woken by a signal. Previously any DMA
      buffers mapped for that command would have been released even though we
      don't know what the controller is planning to do with those addresses.
      Signed-off-by: Keith Busch <keith.busch@intel.com>
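The fix above boils down to waiting for the completion with no timeout and no interruptibility, so the buffers tied to the command are released only after the controller (or timeout recovery) has actually responded. A minimal userspace sketch of that wait pattern, with pthreads standing in for the kernel's wait primitives and all names invented:

```c
#include <pthread.h>
#include <stdbool.h>
#include <string.h>
#include <unistd.h>

/* Illustrative stand-in for a sync command: the submitter sleeps on a
 * condition variable with no timeout, so the buffer below can only be
 * reused after a completion has really been posted. */
struct sync_cmd {
    pthread_mutex_t lock;
    pthread_cond_t  done_cv;
    bool            done;
    int             status;
    char            buf[512];   /* stands in for the mapped DMA buffer */
};

static void *controller_side(void *arg)
{
    struct sync_cmd *cmd = arg;

    usleep(1000);               /* simulated command latency */
    pthread_mutex_lock(&cmd->lock);
    memset(cmd->buf, 0xab, sizeof(cmd->buf));  /* "device" writes data */
    cmd->status = 0;
    cmd->done = true;
    pthread_cond_signal(&cmd->done_cv);
    pthread_mutex_unlock(&cmd->lock);
    return NULL;
}

int submit_sync_cmd(void)
{
    struct sync_cmd cmd = {
        .lock    = PTHREAD_MUTEX_INITIALIZER,
        .done_cv = PTHREAD_COND_INITIALIZER,
        .done    = false,
        .status  = -1,
    };
    pthread_t t;

    pthread_create(&t, NULL, controller_side, &cmd);

    pthread_mutex_lock(&cmd.lock);
    while (!cmd.done)           /* uninterruptible wait, no timeout */
        pthread_cond_wait(&cmd.done_cv, &cmd.lock);
    pthread_mutex_unlock(&cmd.lock);

    pthread_join(t, NULL);
    return cmd.status;          /* cmd.buf may be released only now */
}
```

The point is the shape of the loop: there is no timed or interruptible exit path, so the on-stack buffer cannot be recycled while the other side may still write to it.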
    • NVMe: Remove unused variables · 48328518
      By Keith Busch
      We don't track queues in a llist, subscribe to hot-cpu notifications,
      or internally retry commands. Delete the unused artifacts.
      Signed-off-by: Keith Busch <keith.busch@intel.com>
    • NVMe: Fix potential corruption during shutdown · 07836e65
      By Keith Busch
      The driver has to end unreturned commands at some point even if the
      controller has not provided a completion. The driver tried to be safe by
      deleting IO queues prior to ending all unreturned commands. That should
      cause the controller to internally abort in-flight commands, but the IO
      queue deletion request is not guaranteed to succeed, so all bets are off. We
      still have to make progress, so to be extra safe, this patch doesn't
      clear a queue (which releases the DMA mappings for its commands) until
      after the PCI device has been disabled.
      
      This patch also removes the special handling during device initialization
      so controller recovery can be done at any time. This is possible since
      initialization is no longer inlined with PCI probe.
      Reported-by: Nilish Choudhury <nilesh.choudhury@oracle.com>
      Signed-off-by: Keith Busch <keith.busch@intel.com>
    • NVMe: Asynchronous controller probe · 2e1d8448
      By Keith Busch
      This performs the longest parts of nvme device probe in scheduled work.
      This speeds up probe significantly when multiple devices are in use.
      Signed-off-by: Keith Busch <keith.busch@intel.com>
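Moving the slow half of probe into scheduled work means probe() itself returns almost immediately, so several controllers can initialize concurrently. A hedged userspace sketch of the idea, with worker threads standing in for the kernel workqueue and all names invented:

```c
#include <pthread.h>

struct dev_state {
    pthread_t worker;
    int       initialized;
};

/* The long part of probe, pushed off the probe path itself. */
static void *probe_work(void *arg)
{
    struct dev_state *dev = arg;
    /* ...reset controller, set up queues, scan namespaces... */
    dev->initialized = 1;
    return NULL;
}

/* probe() only schedules the work and returns right away. */
static void probe(struct dev_state *dev)
{
    dev->initialized = 0;
    pthread_create(&dev->worker, NULL, probe_work, dev);
}

int probe_all(struct dev_state devs[], int n)
{
    int i, ready = 0;

    for (i = 0; i < n; i++)     /* all probes kicked off back to back */
        probe(&devs[i]);
    for (i = 0; i < n; i++) {   /* initialization proceeds in parallel */
        pthread_join(devs[i].worker, NULL);
        ready += devs[i].initialized;
    }
    return ready;
}
```

With N devices, total probe time approaches the slowest single device instead of the sum of all of them.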
    • NVMe: Register management handle under nvme class · b3fffdef
      By Keith Busch
      This creates a new class type for nvme devices to register their
      management character devices with. This is so we do not rely on miscdev
      to provide enough minors for as many nvme devices as some people plan to
      use. The previous limit was approximately 60 NVMe controllers, depending
      on the platform and kernel. Now the limit is 1M, which ought to be enough
      for anybody.
      
      Since we have a new device class, it makes sense to attach the block
      devices under it as well, so part of this patch moves the management
      handle initialization prior to namespace discovery.
      Signed-off-by: Keith Busch <keith.busch@intel.com>
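With a dedicated class the driver hands out instance numbers itself rather than competing for miscdev minors. A sketch of smallest-free-slot allocation, the behavior an IDA provides in the kernel (names and the bitmap representation are invented for illustration):

```c
#define NVME_MINORS (1 << 20)   /* "1M ought to be enough for anybody" */

static unsigned char minor_used[NVME_MINORS];

/* Allocate the smallest free instance number, IDA-style. */
int nvme_alloc_instance(void)
{
    int i;

    for (i = 0; i < NVME_MINORS; i++) {
        if (!minor_used[i]) {
            minor_used[i] = 1;
            return i;
        }
    }
    return -1;                  /* all 1M instances in use */
}

void nvme_release_instance(int i)
{
    if (i >= 0 && i < NVME_MINORS)
        minor_used[i] = 0;
}
```

Released numbers are reused lowest-first, so device names stay dense even as controllers come and go.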
    • NVMe: Metadata format support · e1e5e564
      By Keith Busch
      Adds support for NVMe metadata formats and exposes block devices for
      all namespaces regardless of their format. Namespace formats that are
      unusable will have disk capacity set to 0, but a handle to the block
      device is created to simplify device management. A namespace is not
      usable when the format requires the host to interleave block data and
      metadata in a single buffer, has no provisioned storage, or has
      metadata but failed to register with blk integrity.
      
      The namespace has to be scanned in two phases to support separate
      metadata formats. The first establishes the sector size and capacity
      prior to invoking add_disk. If metadata is required, the capacity will
      be temporarily set to 0 until it can be revalidated and registered with
      the integrity extensions after add_disk completes.
      
      The driver relies on the integrity extensions to provide the metadata
      buffer. NVMe requires this be a single physically contiguous region,
      so only one integrity segment is allowed per command. If the metadata
      is used for T10 PI, the driver provides mappings to save and restore
      the reftag physical block translation. The driver provides no-op
      functions for generate and verify if metadata is not used for protection
      information. This way the setup is always provided by the block layer.
      
      If a request does not supply a required metadata buffer, the command
      is failed with bad address. This could only happen if a user manually
      disables verify/generate on such a disk. The only exception to where
      this is okay is if the controller is capable of stripping/generating
      the metadata, which is possible on some types of formats.
      
      The metadata scatter gather list now occupies the spot in the nvme_iod
      that used to be used to link retryable IOD's, but we don't do that
      anymore, so the field was unused.
      Signed-off-by: Keith Busch <keith.busch@intel.com>
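The two-phase scan described above can be sketched as a pair of steps around add_disk: phase 1 pins the sector size but holds capacity at 0 when metadata is involved, and phase 2 reveals the real capacity only once integrity registration succeeds. All structure and function names here are invented for illustration:

```c
#include <stdbool.h>

struct ns {
    unsigned            sector_size;
    unsigned long long  capacity;   /* in sectors; 0 = not yet usable */
    bool                needs_metadata;
    bool                integrity_registered;
};

/* Phase 1: runs before add_disk(). Establish the sector size, but if
 * the format carries metadata, keep capacity at 0 for now. */
void ns_scan_phase1(struct ns *ns, unsigned sector_size,
                    unsigned long long sectors, bool needs_metadata)
{
    ns->sector_size = sector_size;
    ns->needs_metadata = needs_metadata;
    ns->integrity_registered = false;
    ns->capacity = needs_metadata ? 0 : sectors;
}

/* Phase 2: runs after add_disk(). Register with the integrity
 * extensions and only then expose the real capacity. */
void ns_scan_phase2(struct ns *ns, unsigned long long sectors,
                    bool integrity_ok)
{
    if (!ns->needs_metadata)
        return;
    ns->integrity_registered = integrity_ok;
    ns->capacity = integrity_ok ? sectors : 0;
}
```

An unusable format simply keeps capacity 0 forever, which matches the commit's "block device exists but has zero capacity" behavior.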
  4. 30 Jan 2015, 1 commit
    • NVMe: avoid kmalloc/kfree for smaller IO · ac3dd5bd
      By Jens Axboe
      Currently we allocate an nvme_iod for each IO, which holds the
      sg list, prps, and other IO related info. Set a threshold of
      2 pages and/or 8KB of data, below which we can just embed this
      in the per-command pdu in blk-mq. For any IO at or below
      NVME_INT_PAGES and NVME_INT_BYTES, we save a kmalloc and kfree.
      
      For higher IOPS, this saves up to 1% of CPU time.
      Signed-off-by: Jens Axboe <axboe@fb.com>
      Reviewed-by: Keith Busch <keith.busch@intel.com>
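The decision is a simple threshold test: if the IO needs at most NVME_INT_PAGES segments and NVME_INT_BYTES of data, its iod can live in the per-command pdu blk-mq already allocated. The constants come from the message above; the predicate's exact shape is an assumption for illustration:

```c
#include <stdbool.h>

#define NVME_INT_PAGES 2            /* from the commit message */
#define NVME_INT_BYTES (8 * 1024)   /* 8KB, assuming 4K pages */

/* True if the iod (sg list, prps, and related state) fits in the
 * per-command pdu, so no kmalloc on submit and no kfree on complete. */
bool iod_fits_inline(unsigned int nseg, unsigned int nbytes)
{
    return nseg <= NVME_INT_PAGES && nbytes <= NVME_INT_BYTES;
}
```

Small IOs dominate high-IOPS workloads, which is why skipping one allocator round trip per IO is worth about 1% of CPU.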
  5. 22 Jan 2015, 1 commit
  6. 16 Jan 2015, 1 commit
  7. 09 Jan 2015, 6 commits
  8. 03 Jan 2015, 1 commit
  9. 23 Dec 2014, 1 commit
  10. 12 Dec 2014, 2 commits
    • NVMe: fix race condition in nvme_submit_sync_cmd() · 849c6e77
      By Jens Axboe
      If we have a race between the schedule timing out and the command
      completing, we could have the task issuing the command exit
      nvme_submit_sync_cmd() while the irq is running sync_completion().
      If that happens, we could be corrupting memory, since the stack
      that held 'cmdinfo' is no longer valid.
      
      Fix this by always calling nvme_abort_cmd_info(). Once that call
      completes, we know that we have either run sync_completion() if
      the completion came in, or that we will never run it since we now
      have special_completion() as the command callback handler.
      Acked-by: Keith Busch <keith.busch@intel.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
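The essence of the fix is that exactly one side, the completing IRQ or the returning submitter, may touch the on-stack cmdinfo, and an atomic swap of the callback decides the winner. A sketch with C11 atomics (the claim protocol and all names are invented to mirror the sync_completion/special_completion handoff described above):

```c
#include <stdatomic.h>
#include <stdbool.h>

enum cb { CB_SYNC_COMPLETION, CB_SPECIAL_COMPLETION };

struct cmd_info {
    _Atomic int cb;      /* which completion handler will run */
    int         result;  /* lives on the submitter's stack */
};

/* IRQ path: only touch cmdinfo if sync_completion is still armed. */
bool irq_complete(struct cmd_info *info, int status)
{
    int expected = CB_SYNC_COMPLETION;

    if (atomic_compare_exchange_strong(&info->cb, &expected,
                                       CB_SPECIAL_COMPLETION)) {
        info->result = status;   /* safe: submitter is still waiting */
        return true;
    }
    return false;                /* lost the race; don't touch the stack */
}

/* Timeout path: before returning (and invalidating the stack), swap in
 * special_completion so a late IRQ can no longer write through info. */
bool abort_cmd_info(struct cmd_info *info)
{
    int expected = CB_SYNC_COMPLETION;

    return atomic_compare_exchange_strong(&info->cb, &expected,
                                          CB_SPECIAL_COMPLETION);
}
```

Whichever path loses the compare-exchange knows the other side owns (or owned) the cmdinfo and backs off, which is exactly why memory behind the vanished stack frame is never written.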
    • NVMe: fix retry/error logic in nvme_queue_rq() · fe54303e
      By Jens Axboe
      The logic around retrying and erroring IO in nvme_queue_rq() is broken
      in a few ways:
      
      - If we fail allocating dma memory for a discard, we return retry. We
        have the 'iod' stored in ->special, but we free the 'iod'.
      
      - For a normal request, if we fail dma mapping or setting up the prps, we
        have the same iod situation. Additionally, we haven't set the callback
        for the request yet, so we also potentially leak IOMMU resources.
      
      Get rid of the ->special 'iod' store. The retry is uncommon enough that
      it's not worth optimizing for or holding on to resources to attempt to
      speed it up. Additionally, it's usually best practice to free any
      request related resources when doing retries.
      Acked-by: Keith Busch <keith.busch@intel.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  11. 11 Dec 2014, 3 commits
  12. 04 Dec 2014, 1 commit
  13. 22 Nov 2014, 1 commit
  14. 21 Nov 2014, 2 commits
  15. 20 Nov 2014, 1 commit
  16. 18 Nov 2014, 3 commits
  17. 11 Nov 2014, 1 commit
  18. 06 Nov 2014, 1 commit
  19. 05 Nov 2014, 6 commits
    • NVMe: Convert to blk-mq · a4aea562
      By Matias Bjørling
      This converts the NVMe driver to a blk-mq request-based driver.
      
      The NVMe driver is currently bio-based and implements queue logic within
      itself.  By using blk-mq, a lot of these responsibilities can be moved
      and simplified.
      
      The patch is divided into the following blocks:
      
       * Per-command data and cmdid have been moved into the struct request
         field. The cmdid_data can be retrieved using blk_mq_rq_to_pdu() and id
         maintenance is now handled by blk-mq through the rq->tag field.
      
       * The logic for splitting bios has been moved into the blk-mq layer.
         The driver instead notifies the block layer about limited gap support in
         SG lists.
      
       * Timeout handling is moved to blk-mq and reimplemented within
         nvme_timeout(). This includes both abort handling and command
         cancellation.
      
       * Assignment of nvme queues to CPUs is replaced with the blk-mq
         version. The current blk-mq strategy is to assign the number of
         mapped queues and CPUs to provide synergy, while the nvme driver
         assigns as many nvme hw queues as possible. This can be implemented in
         blk-mq if needed.
      
       * NVMe queues are merged with the tags structure of blk-mq.
      
       * blk-mq takes care of setup/teardown of nvme queues and guards invalid
         accesses. Therefore, RCU-usage for nvme queues can be removed.
      
       * IO tracing and accounting are handled by blk-mq and therefore removed.
      
       * Queue suspension logic is replaced with the logic from the block
         layer.
      
      Contributions in this patch from:
      
        Sam Bradshaw <sbradshaw@micron.com>
        Jens Axboe <axboe@fb.com>
        Keith Busch <keith.busch@intel.com>
        Robert Nelson <rlnelson@google.com>
      Acked-by: Keith Busch <keith.busch@intel.com>
      Acked-by: Jens Axboe <axboe@fb.com>
      
      Updated for new ->queue_rq() prototype.
      Signed-off-by: Jens Axboe <axboe@fb.com>
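The per-command pdu mentioned in the first bullet is the key layout trick: blk-mq allocates the driver's per-command data directly behind each struct request, so blk_mq_rq_to_pdu() is nothing more than pointer arithmetic past the request. A userspace sketch with dummy structs (the real struct request and pdu contents are of course much larger):

```c
#include <stdlib.h>

/* Dummy stand-in for struct request; blk-mq ids live in rq->tag. */
struct request {
    int tag;
};

/* Invented per-command driver data for this sketch. */
struct nvme_cmd_pdu {
    int  opcode;
    char payload[64];
};

/* The pdu sits immediately after the request in one allocation,
 * which is all blk_mq_rq_to_pdu() relies on. */
static inline void *rq_to_pdu(struct request *rq)
{
    return rq + 1;
}

struct request *alloc_request(int tag, size_t pdu_size)
{
    struct request *rq = calloc(1, sizeof(*rq) + pdu_size);

    if (rq)
        rq->tag = tag;
    return rq;
}
```

One allocation per tag, made once at queue setup, replaces the driver's own cmdid bookkeeping and any per-IO allocation for command state.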
    • NVMe: Do not over allocate for discard requests · 9dbbfab7
      By Keith Busch
      Discard requests are often for very large ranges. The discard size is not
      representative of the data transfer size, so we don't need to allocate
      such a large prp list. This patch allocates only the memory needed for
      the data transfer, saving a little over 8k of memory per max discard
      request.
      Signed-off-by: Keith Busch <keith.busch@intel.com>
      Reported-by: Paul Grabinar <paul.grabinar@ranbarg.com>
      Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
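The arithmetic behind the saving: a discard may cover gigabytes of LBAs, but the command only transfers a small range descriptor, so the prp allocation should follow the bytes moved, not the range covered. A simplified sketch of the sizing (real NVMe carries the first PRP in the command itself, which this illustration ignores):

```c
#define PAGE_SIZE 4096u

/* Number of PRP entries needed for a transfer of the given length;
 * size the allocation by the data actually moved across the bus. */
unsigned prp_entries(unsigned xfer_bytes)
{
    return (xfer_bytes + PAGE_SIZE - 1) / PAGE_SIZE;
}
```

Sizing by the 16-byte descriptor instead of the worst-case data transfer is what recovers the roughly 8k per maximum-sized discard.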
    • NVMe: Do not open disks that are being deleted · 9e60352c
      By Keith Busch
      It is possible for the block layer to request opening a block device after
      the driver has deleted it. Subsequent releases would cause a double free,
      or leave the disk's private_data pointing to freed memory. This patch
      protects the driver's freed disks from being opened and accessed: the
      nvme namespaces are freed only when the device's refcount reaches 0, so at
      that moment there are no active openers and no more should be allowed,
      and it is safe to clear the disk's private_data that is about to be freed.
      Signed-off-by: Keith Busch <keith.busch@intel.com>
      Reported-by: Henry Chow <henry.chow@oracle.com>
      Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
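The guard can be pictured as an open path that checks private_data before taking a reference: once deletion has cleared it, every later open is refused instead of handing out a pointer into freed memory. A sketch with invented names:

```c
#include <errno.h>
#include <stddef.h>

struct nvme_ns {
    int   refcount;
    void *private_data;   /* cleared when the disk is being freed */
};

/* Refuse opens once private_data has been cleared: the namespace is
 * already on its way to being freed. */
int ns_open(struct nvme_ns *ns)
{
    if (ns == NULL || ns->private_data == NULL)
        return -ENXIO;
    ns->refcount++;
    return 0;
}

void ns_delete(struct nvme_ns *ns)
{
    ns->private_data = NULL;   /* no new openers from here on */
}
```

Because deletion only runs when the refcount is already 0, clearing the pointer and rejecting new opens closes the window without racing an existing opener.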
    • NVMe: Clear QUEUE_FLAG_STACKABLE · 5940c857
      By Keith Busch
      The nvme namespace request_queue's flags are initialized to
      QUEUE_FLAG_DEFAULT, which currently sets QUEUE_FLAG_STACKABLE. To the
      device-mapper, this flag means the block driver is request based, but
      this driver is bio-based and problems will occur if an nvme namespace is
      used with a request-based dm device. This patch clears the stackable
      flag.
      Signed-off-by: Keith Busch <keith.busch@intel.com>
      Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • NVMe: Fix device probe waiting on kthread · 387caa5a
      By Keith Busch
      If we ever do parallel device probing, we need to wake up all processes
      waiting for the nvme kthread to start, not just one. This is currently
      serialized so the bug is not reachable today, but fix it anyway in the
      hope of implementing parallel or asynchronous probe in the future.
      Signed-off-by: Keith Busch <keith.busch@intel.com>
      Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
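The distinction is wake-one versus wake-all: the kernel's wake_up() rouses a single waiter, while parallel probing needs every waiting process released, i.e. wake_up_all(). A pthread sketch of the same pattern, where pthread_cond_broadcast plays the role of wake_up_all (names invented):

```c
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  started_cv = PTHREAD_COND_INITIALIZER;
static bool kthread_started;
static int  woken;

/* Each "probe" blocks until the shared kthread is up. */
static void *prober(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);
    while (!kthread_started)
        pthread_cond_wait(&started_cv, &lock);
    woken++;
    pthread_mutex_unlock(&lock);
    return NULL;
}

int wake_all_probers(int n)
{
    pthread_t t[16];
    int i;

    kthread_started = false;
    woken = 0;
    for (i = 0; i < n && i < 16; i++)
        pthread_create(&t[i], NULL, prober, NULL);

    pthread_mutex_lock(&lock);
    kthread_started = true;
    pthread_cond_broadcast(&started_cv);  /* wake_up_all, not wake_up */
    pthread_mutex_unlock(&lock);

    for (i = 0; i < n && i < 16; i++)
        pthread_join(t[i], NULL);
    return woken;
}
```

Had this used pthread_cond_signal (the wake_up analogue), only one prober would be guaranteed to run; the rest could sleep forever since the condition is signaled exactly once.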
    • NVMe: Passthrough IOCTL for IO commands · 7963e521
      By Keith Busch
      The NVME_IOCTL_SUBMIT_IO only works for IO commands with block data
      transfers and isn't usable for other NVMe commands like flush,
      data set management, or any sort of vendor unique command. The
      NVME_IOCTL_ADMIN_CMD, however, can easily be modified to accept arbitrary
      IO commands in addition to arbitrary admin commands without breaking
      backward compatibility. This patch just adds a new IOCTL to distinguish
      if the driver should submit the command on an IO or Admin queue.
      Signed-off-by: Keith Busch <keith.busch@intel.com>
      Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>