1. 20 4月, 2015 1 次提交
  2. 23 3月, 2015 1 次提交
  3. 05 3月, 2015 1 次提交
  4. 01 3月, 2015 1 次提交
  5. 24 2月, 2015 1 次提交
  6. 20 2月, 2015 8 次提交
    • K
      NVMe: Fix potential corruption on sync commands · 0c0f9b95
      Keith Busch 提交于
      This makes all sync commands uninterruptible and schedules without timeout
      so the controller either has to post a completion or the timeout recovery
      fails the command. This fixes potential memory or data corruption from
      a command timing out too early or woken by a signal. Previously any DMA
      buffers mapped for that command would have been released even though we
      don't know what the controller is planning to do with those addresses.
      Signed-off-by: NKeith Busch <keith.busch@intel.com>
      0c0f9b95
    • K
      NVMe: Remove unused variables · 48328518
      Keith Busch 提交于
      We don't track queues in a llist, subscribe to hot-cpu notifications,
      or internally retry commands. Delete the unused artifacts.
      Signed-off-by: NKeith Busch <keith.busch@intel.com>
      48328518
    • K
      NVMe: Fix scsi mode select llbaa setting · 9ac16938
      Keith Busch 提交于
      It should be a logical bitwise AND, not conditional.
      Signed-off-by: NKeith Busch <keith.busch@intel.com>
      9ac16938
    • K
      NVMe: Fix potential corruption during shutdown · 07836e65
      Keith Busch 提交于
      The driver has to end unreturned commands at some point even if the
      controller has not provided a completion. The driver tried to be safe by
      deleting IO queues prior to ending all unreturned commands. That should
      cause the controller to internally abort inflight commands, but IO queue
      deletion request does not have to be successful, so all bets are off. We
      still have to make progress, so to be extra safe, this patch doesn't
      clear a queue to release the dma mapping for a command until after the
      pci device has been disabled.
      
      This patch removes the special handling during device initialization
      so controller recovery can be done all the time. This is possible since
      initialization is not inlined with pci probe anymore.
      Reported-by: NNilish Choudhury <nilesh.choudhury@oracle.com>
      Signed-off-by: NKeith Busch <keith.busch@intel.com>
      07836e65
    • K
      NVMe: Asynchronous controller probe · 2e1d8448
      Keith Busch 提交于
      This performs the longest parts of nvme device probe in scheduled work.
      This speeds up probe significantly when multiple devices are in use.
      Signed-off-by: NKeith Busch <keith.busch@intel.com>
      2e1d8448
    • K
      NVMe: Register management handle under nvme class · b3fffdef
      Keith Busch 提交于
      This creates a new class type for nvme devices to register their
      management character devices with. This is so we do not rely on miscdev
      to provide enough minors for as many nvme devices some people plan to
      use. The previous limit was approximately 60 NVMe controllers, depending
      on the platform and kernel. Now the limit is 1M, which ought to be enough
      for anybody.
      
      Since we have a new device class, it makes sense to attach the block
      devices under this as well, so part of this patch moves the management
      handle initialization prior to the namespaces discovery.
      Signed-off-by: NKeith Busch <keith.busch@intel.com>
      b3fffdef
    • K
      NVMe: Update SCSI Inquiry VPD 83h translation · 4f1982b4
      Keith Busch 提交于
      The original translation created collisions on Inquiry VPD 83 for many
      existing devices. Newer specifications provide other ways to translate
      based on the device's version can be used to create unique identifiers.
      
      Version 1.1 provides an EUI64 field that uniquely identifies each
      namespace, and 1.2 added the longer NGUID field for the same reason.
      Both follow the IEEE EUI format and readily translate to the SCSI device
      identification EUI designator type 2h. For devices implementing either,
      the translation will use this type, defaulting to the EUI64 8-byte type if
      implemented then NGUID's 16 byte version if not. If neither are provided,
      the 1.0 translation is used, and is updated to use the SCSI String format
      to guarantee a unique identifier.
      
      Knowing when to use the new fields depends on the nvme controller's
      revision. The NVME_VS macro was not decoding this correctly, so that is
      fixed in this patch and moved to a more appropriate place.
      
      Since the Identify Namespace structure required an update for the NGUID
      field, this patch adds the remaining new 1.2 fields to the structure.
      Signed-off-by: NKeith Busch <keith.busch@intel.com>
      4f1982b4
    • K
      NVMe: Metadata format support · e1e5e564
      Keith Busch 提交于
      Adds support for NVMe metadata formats and exposes block devices for
      all namespaces regardless of their format. Namespace formats that are
      unusable will have disk capacity set to 0, but a handle to the block
      device is created to simplify device management. A namespace is not
      usable when the format requires host interleave block and metadata in
      single buffer, has no provisioned storage, or has better data but failed
      to register with blk integrity.
      
      The namespace has to be scanned in two phases to support separate
      metadata formats. The first establishes the sector size and capacity
      prior to invoking add_disk. If metadata is required, the capacity will
      be temporarilly set to 0 until it can be revalidated and registered with
      the integrity extenstions after add_disk completes.
      
      The driver relies on the integrity extensions to provide the metadata
      buffer. NVMe requires this be a single physically contiguous region,
      so only one integrity segment is allowed per command. If the metadata
      is used for T10 PI, the driver provides mappings to save and restore
      the reftag physical block translation. The driver provides no-op
      functions for generate and verify if metadata is not used for protection
      information. This way the setup is always provided by the block layer.
      
      If a request does not supply a required metadata buffer, the command
      is failed with bad address. This could only happen if a user manually
      disables verify/generate on such a disk. The only exception to where
      this is okay is if the controller is capable of stripping/generating
      the metadata, which is possible on some types of formats.
      
      The metadata scatter gather list now occupies the spot in the nvme_iod
      that used to be used to link retryable IOD's, but we don't do that
      anymore, so the field was unused.
      Signed-off-by: NKeith Busch <keith.busch@intel.com>
      e1e5e564
  7. 19 2月, 2015 4 次提交
  8. 17 2月, 2015 1 次提交
    • M
      brd: rename XIP to DAX · a7a97fc9
      Matthew Wilcox 提交于
      Since this is relating to FS_XIP, not KERNEL_XIP, it should be called
      DAX instead of XIP.
      Signed-off-by: NMatthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: Andreas Dilger <andreas.dilger@intel.com>
      Cc: Boaz Harrosh <boaz@plexistor.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a7a97fc9
  9. 13 2月, 2015 9 次提交
    • R
      kernel.h: remove ancient __FUNCTION__ hack · 02f1f217
      Rasmus Villemoes 提交于
      __FUNCTION__ hasn't been treated as a string literal since gcc 3.4, so
      this only helps people who only test-compile using 3.3 (compiler-gcc3.h
      barks at anything older than that).  Besides, there are almost no
      occurrences of __FUNCTION__ left in the tree.
      
      [akpm@linux-foundation.org: convert remaining __FUNCTION__ references]
      Signed-off-by: NRasmus Villemoes <linux@rasmusvillemoes.dk>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Joe Perches <joe@perches.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      02f1f217
    • G
      mm/zpool: add name argument to create zpool · 3eba0c6a
      Ganesh Mahendran 提交于
      Currently the underlay of zpool: zsmalloc/zbud, do not know who creates
      them.  There is not a method to let zsmalloc/zbud find which caller they
      belong to.
      
      Now we want to add statistics collection in zsmalloc.  We need to name the
      debugfs dir for each pool created.  The way suggested by Minchan Kim is to
      use a name passed by caller(such as zram) to create the zsmalloc pool.
      
          /sys/kernel/debug/zsmalloc/zram0
      
      This patch adds an argument `name' to zs_create_pool() and other related
      functions.
      Signed-off-by: NGanesh Mahendran <opensource.ganesh@gmail.com>
      Acked-by: NMinchan Kim <minchan@kernel.org>
      Cc: Seth Jennings <sjennings@variantweb.net>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3eba0c6a
    • S
      zram: remove request_queue from struct zram · ee980160
      Sergey Senozhatsky 提交于
      `struct zram' contains both `struct gendisk' and `struct request_queue'.
      the latter can be deleted, because zram->disk carries ->queue pointer, and
      ->queue carries zram pointer:
      
      create_device()
      	zram->queue->queuedata = zram
      	zram->disk->queue = zram->queue
      	zram->disk->private_data = zram
      
      so zram->queue is not needed, we can access all necessary data anyway.
      Signed-off-by: NSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ee980160
    • M
      zram: remove init_lock in zram_make_request · 08eee69f
      Minchan Kim 提交于
      Admin could reset zram during I/O operation going on so we have used
      zram->init_lock as read-side lock in I/O path to prevent sudden zram
      meta freeing.
      
      However, the init_lock is really troublesome.  We can't do call
      zram_meta_alloc under init_lock due to lockdep splat because
      zram_rw_page is one of the function under reclaim path and hold it as
      read_lock while other places in process context hold it as write_lock.
      So, we have used allocation out of the lock to avoid lockdep warn but
      it's not good for readability and fainally, I met another lockdep splat
      between init_lock and cpu_hotplug from kmem_cache_destroy during working
      zsmalloc compaction.  :(
      
      Yes, the ideal is to remove horrible init_lock of zram in rw path.  This
      patch removes it in rw path and instead, add atomic refcount for meta
      lifetime management and completion to free meta in process context.
      It's important to free meta in process context because some of resource
      destruction needs mutex lock, which could be held if we releases the
      resource in reclaim context so it's deadlock, again.
      
      As a bonus, we could remove init_done check in rw path because
      zram_meta_get will do a role for it, instead.
      Signed-off-by: NSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: NMinchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Ganesh Mahendran <opensource.ganesh@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      08eee69f
    • M
      zram: check bd_openers instead of bd_holders · 2b269ce6
      Minchan Kim 提交于
      bd_holders is increased only when user open the device file as FMODE_EXCL
      so if something opens zram0 as !FMODE_EXCL and request I/O while another
      user reset zram0, we can see following warning.
      
        zram0: detected capacity change from 0 to 64424509440
        Buffer I/O error on dev zram0, logical block 180823, lost async page write
        Buffer I/O error on dev zram0, logical block 180824, lost async page write
        Buffer I/O error on dev zram0, logical block 180825, lost async page write
        Buffer I/O error on dev zram0, logical block 180826, lost async page write
        Buffer I/O error on dev zram0, logical block 180827, lost async page write
        Buffer I/O error on dev zram0, logical block 180828, lost async page write
        Buffer I/O error on dev zram0, logical block 180829, lost async page write
        Buffer I/O error on dev zram0, logical block 180830, lost async page write
        Buffer I/O error on dev zram0, logical block 180831, lost async page write
        Buffer I/O error on dev zram0, logical block 180832, lost async page write
        ------------[ cut here ]------------
        WARNING: CPU: 11 PID: 1996 at fs/block_dev.c:57 __blkdev_put+0x1d7/0x210()
        Modules linked in:
        CPU: 11 PID: 1996 Comm: dd Not tainted 3.19.0-rc6-next-20150202+ #1125
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
        Call Trace:
          dump_stack+0x45/0x57
          warn_slowpath_common+0x8a/0xc0
          warn_slowpath_null+0x1a/0x20
          __blkdev_put+0x1d7/0x210
          blkdev_put+0x50/0x130
          blkdev_close+0x25/0x30
          __fput+0xdf/0x1e0
          ____fput+0xe/0x10
          task_work_run+0xa7/0xe0
          do_notify_resume+0x49/0x60
          int_signal+0x12/0x17
        ---[ end trace 274fbbc5664827d2 ]---
      
      The warning comes from bdev_write_node in blkdev_put path.
      
         static void bdev_write_inode(struct inode *inode)
         {
              spin_lock(&inode->i_lock);
              while (inode->i_state & I_DIRTY) {
                      spin_unlock(&inode->i_lock);
                      WARN_ON_ONCE(write_inode_now(inode, true)); <========= here.
                      spin_lock(&inode->i_lock);
              }
              spin_unlock(&inode->i_lock);
         }
      
      The reason is dd process encounters I/O fails due to sudden block device
      disappear so in filemap_check_errors in __writeback_single_inode returns
      -EIO.
      
      If we check bd_openers instead of bd_holders, we could address the
      problem.  When I see the brd, it already have used it rather than
      bd_holders so although I'm not a expert of block layer, it seems to be
      better.
      
      I can make following warning with below simple script.  In addition, I
      added msleep(2000) below set_capacity(zram->disk, 0) after applying your
      patch to make window huge(Kudos to Ganesh!)
      
      script:
      
         echo $((60<<30)) > /sys/block/zram0/disksize
         setsid dd if=/dev/zero of=/dev/zram0 &
         sleep 1
         setsid echo 1 > /sys/block/zram0/reset
      Signed-off-by: NMinchan Kim <minchan@kernel.org>
      Acked-by: NSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Ganesh Mahendran <opensource.ganesh@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2b269ce6
    • S
      zram: rework reset and destroy path · a096cafc
      Sergey Senozhatsky 提交于
      We need to return set_capacity(disk, 0) from reset_store() back to
      zram_reset_device(), a catch by Ganesh Mahendran.  Potentially, we can
      race set_capacity() calls from init and reset paths.
      
      The problem is that zram_reset_device() is also getting called from
      zram_exit(), which performs operations in misleading reversed order -- we
      first create_device() and then init it, while zram_exit() perform
      destroy_device() first and then does zram_reset_device().  This is done to
      remove sysfs group before we reset device, so we can continue with device
      reset/destruction not being raced by sysfs attr write (f.e.  disksize).
      
      Apart from that, destroy_device() releases zram->disk (but we still have
      ->disk pointer), so we cannot acces zram->disk in later
      zram_reset_device() call, which may cause additional errors in the future.
      
      So, this patch rework and cleanup destroy path.
      
      1) remove several unneeded goto labels in zram_init()
      
      2) factor out zram_init() error path and zram_exit() into
         destroy_devices() function, which takes the number of devices to
         destroy as its argument.
      
      3) remove sysfs group in destroy_devices() first, so we can reorder
         operations -- reset device (as expected) goes before disk destroy and
         queue cleanup.  So we can always access ->disk in zram_reset_device().
      
      4) and, finally, return set_capacity() back under ->init_lock.
      
      [akpm@linux-foundation.org: tweak comment]
      Signed-off-by: NSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Reported-by: NGanesh Mahendran <opensource.ganesh@gmail.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a096cafc
    • S
      zram: fix umount-reset_store-mount race condition · ba6b17d6
      Sergey Senozhatsky 提交于
      Ganesh Mahendran was the first one who proposed to use bdev->bd_mutex to
      avoid ->bd_holders race condition:
      
              CPU0                            CPU1
      umount /* zram->init_done is true */
      reset_store()
      bdev->bd_holders == 0                   mount
      ...                                     zram_make_request()
      zram_reset_device()
      
      However, his solution required some considerable amount of code movement,
      which we can avoid.
      
      Apart from using bdev->bd_mutex in reset_store(), this patch also
      simplifies zram_reset_device().
      
      zram_reset_device() has a bool parameter reset_capacity which tells it
      whether disk capacity and itself disk should be reset.  There are two
      zram_reset_device() callers:
      
      -- zram_exit() passes reset_capacity=false
      -- reset_store() passes reset_capacity=true
      
      So we can move reset_capacity-sensitive work out of zram_reset_device()
      and perform it unconditionally in reset_store().  This also lets us drop
      reset_capacity parameter from zram_reset_device() and pass zram pointer
      only.
      Signed-off-by: NSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Reported-by: NGanesh Mahendran <opensource.ganesh@gmail.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ba6b17d6
    • G
      zram: free meta table in zram_meta_free · 1fec1172
      Ganesh Mahendran 提交于
      zram_meta_alloc() and zram_meta_free() are a pair.  In
      zram_meta_alloc(), meta table is allocated.  So it it better to free it
      in zram_meta_free().
      Signed-off-by: NGanesh Mahendran <opensource.ganesh@gmail.com>
      Acked-by: NMinchan Kim <minchan@kernel.org>
      Acked-by: NSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1fec1172
    • S
      zram: clean up zram_meta_alloc() · b8179958
      Sergey Senozhatsky 提交于
      A trivial cleanup of zram_meta_alloc() error handling.
      Signed-off-by: NSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: NMinchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b8179958
  10. 11 2月, 2015 2 次提交
  11. 03 2月, 2015 1 次提交
  12. 30 1月, 2015 1 次提交
    • J
      NVMe: avoid kmalloc/kfree for smaller IO · ac3dd5bd
      Jens Axboe 提交于
      Currently we allocate an nvme_iod for each IO, which holds the
      sg list, prps, and other IO related info. Set a threshold of
      2 pages and/or 8KB of data, below which we can just embed this
      in the per-command pdu in blk-mq. For any IO at or below
      NVME_INT_PAGES and NVME_INT_BYTES, we save a kmalloc and kfree.
      
      For higher IOPS, this saves up to 1% of CPU time.
      Signed-off-by: NJens Axboe <axboe@fb.com>
      Reviewed-by: NKeith Busch <keith.busch@intel.com>
      ac3dd5bd
  13. 28 1月, 2015 4 次提交
  14. 24 1月, 2015 1 次提交
    • S
      block: support different tag allocation policy · ee1b6f7a
      Shaohua Li 提交于
      The libata tag allocation is using a round-robin policy. Next patch will
      make libata use block generic tag allocation, so let's add a policy to
      tag allocation.
      
      Currently two policies: FIFO (default) and round-robin.
      
      Cc: Jens Axboe <axboe@fb.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: NShaohua Li <shli@fb.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      ee1b6f7a
  15. 22 1月, 2015 2 次提交
  16. 21 1月, 2015 2 次提交