1. 31 March 2015 (2 commits)
    • MIPS: Add CDMM bus support · 8286ae03
      James Hogan authored
      Add MIPS Common Device Memory Map (CDMM) support in the form of a bus in
      the standard Linux device model. Each device attached via CDMM is
      discoverable via an 8-bit type identifier and may contain a number of
      blocks of memory mapped registers in the CDMM region. IRQs are expected
      to be handled separately.
      
      Due to the per-cpu (per-VPE for MT cores) nature of the CDMM devices,
      all the driver callbacks take place from workqueues which are run on the
      right CPU for the device in question, so that the driver doesn't need to
      be as concerned about which CPU it is running on. Callbacks also exist
      for when CPUs are taken offline, so that any per-CPU resources used by
      the driver can be disabled so they don't get forcefully migrated. CDMM
      devices are created as children of the CPU device they are attached to.
      
      Any existing CDMM configuration by the bootloader will be inherited,
      however platforms wishing to enable CDMM should implement the weak
      mips_cdmm_phys_base() function (see asm/cdmm.h) so that the bus driver
      knows where it should put the CDMM region in the physical address space
      if the bootloader hasn't already enabled it.
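      
      A rough sketch of that platform hook (the address below is a hypothetical
      choice, not taken from any real board):
      
      	#include <linux/types.h>
      	#include <asm/cdmm.h>
      
      	/* Tell the CDMM bus driver where to map the region when the
      	 * bootloader has not already enabled it. */
      	phys_addr_t mips_cdmm_phys_base(void)
      	{
      		return 0x1fc10000;	/* any unoccupied physical window */
      	}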
      
      A mips_cdmm_early_probe() function is also provided to allow early boot
      or particularly low level code to set up the CDMM region and probe for a
      specific device type, for example early console or KGDB IO drivers for
      the EJTAG Fast Debug Channel (FDC) CDMM device.
      Signed-off-by: James Hogan <james.hogan@imgtec.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: linux-mips@linux-mips.org
      Cc: linux-kernel@vger.kernel.org
      Patchwork: https://patchwork.linux-mips.org/patch/9599/
      Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
      8286ae03
    • IRQCHIP: mips-gic: Add missing definitions for FDC IRQ · ea3c023e
      James Hogan authored
      Add missing VPE_PEND, VPE_RMASK and VPE_SMASK definitions for the local
      FDC interrupt.
      
      These local interrupt definitions aren't directly used, but if they
      exist they should be complete.
      Signed-off-by: James Hogan <james.hogan@imgtec.com>
      Cc: Andrew Bresticker <abrestic@chromium.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Jason Cooper <jason@lakedaemon.net>
      Cc: linux-mips@linux-mips.org
      Reviewed-by: Andrew Bresticker <abrestic@chromium.org>
      Cc: linux-kernel@vger.kernel.org
      Patchwork: https://patchwork.linux-mips.org/patch/9127/
      Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
      ea3c023e
  2. 18 March 2015 (1 commit)
  3. 17 March 2015 (1 commit)
    • livepatch: Fix subtle race with coming and going modules · 8cb2c2dc
      Petr Mladek authored
      There is a notifier that handles live patches for coming and going modules.
      It takes klp_mutex lock to avoid races with coming and going patches but
      it does not keep the lock all the time. Therefore the following races are
      possible:
      
        1. The notifier is called sometime in MODULE_STATE_COMING. The module
           is visible via find_module() during this whole state, which means a
           new patch can be registered and enabled even before the notifier is
           called. That can create a wrong order of stacked patches, see below
           for an example.
      
        2. A new patch could still see the module in the GOING state even after
           the notifier has been called. It will try to initialize the related
           object structures, but the module could disappear at any time, leaving
           the structures in a mess. It might even cause an invalid memory
           access.
      
      This patch solves the problem by adding a boolean variable to struct module.
      It is set to true by the coming handler and back to false by the going
      handler. New patches must be applied to a module only while the value is
      true and must ignore the module while it is false.
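      
      A minimal sketch of the idea (the flag name klp_alive is an assumption used
      for illustration):
      
      	/* one new flag in struct module */
      	bool klp_alive;
      
      	/* livepatch module notifier, under klp_mutex */
      	if (action == MODULE_STATE_COMING)
      		mod->klp_alive = true;
      	else if (action == MODULE_STATE_GOING)
      		mod->klp_alive = false;
      
      	/* a new patch attaches to a module only when mod->klp_alive is true */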
      
      Note that we need to know the state of all modules on the system. The races
      affect newly registered patches, so we cannot know in advance which modules
      will get patched.
      
      Also note that we could not simply ignore going modules. The code from the
      module could be called even in the GOING state until mod->exit() finishes.
      If we start supporting patches with semantic changes between function
      calls, we need to apply new patches to any still usable code.
      See below for an example.
      
      Finally, note that the patch solves only the situation when a new patch is
      registered. There are no such problems when a patch is being removed.
      It does not matter who disables the patch first, whether the normal
      disable_patch() or the module notifier. There is nothing to do
      once the patch is disabled.
      
      Alternative solutions:
      ======================
      
      + reject new patches when a patched module is coming or going; this is ugly
      
      + wait with adding the new patch until the module leaves the COMING and GOING
        states; this might be dangerous and complicated; we would need to release
        klp_mutex in the middle of the patch registration to avoid a deadlock
        with the coming and going handlers; also we might need a waitqueue for
        each module, which seems to be even bigger overhead than the boolean
      
      + stop modules from entering COMING and GOING states; wait until modules
        leave these states when they are already there; looks complicated; we would
        need to ignore the module that asked to stop the others to avoid a deadlock;
        also it is unclear what to do when two modules asked to stop others and
        both are in COMING state (situation when two new patches are applied)
      
      + always register/enable new patches and fix up the potential mess (registered
        patches order) in klp_module_init(); this is nasty and prone to regressions
        in the future development
      
      + add another MODULE_STATE where the kallsyms are visible but the module is not
        used yet; this looks too complex; the module states are checked in "many"
        locations
      
      Example of patch stacking breakage:
      ===================================
      
      The notifier could _not_ _simply_ ignore already initialized module objects.
      For example, let's have three patches (P1, P2, P3) for functions a() and b(),
      where a() is from vmlinux and b() is from a module M. Something like:
      
      	a()	b()
      P1	a1()	b1()
      P2	a2()	b2()
      P3	a3()	b3()
      
      If you load the module M after all patches are registered and enabled,
      the ftrace ops for functions a() and b() have the functions listed in this
      order:
      
      	ops_a->func_stack -> list(a3,a2,a1)
      	ops_b->func_stack -> list(b3,b2,b1)
      
      so the pointer to b3() comes first and will be used.
      
      Then you might have the following scenario. Let's start from the state where
      patches P1 and P2 are registered and enabled but the module M is not loaded,
      so the ftrace ops for b() do not exist yet. Then we get into the following race:
      
      CPU0					CPU1
      
      load_module(M)
      
        complete_formation()
      
        mod->state = MODULE_STATE_COMING;
        mutex_unlock(&module_mutex);
      
      					klp_register_patch(P3);
      					klp_enable_patch(P3);
      
      					# STATE 1
      
        klp_module_notify(M)
          klp_module_notify_coming(P1);
          klp_module_notify_coming(P2);
          klp_module_notify_coming(P3);
      
      					# STATE 2
      
      The ftrace ops for a() and b() then look like this:
      
        STATE1:
      
      	ops_a->func_stack -> list(a3,a2,a1);
      	ops_b->func_stack -> list(b3);
      
        STATE2:
      	ops_a->func_stack -> list(a3,a2,a1);
      	ops_b->func_stack -> list(b2,b1,b3);
      
      therefore, b2() is used for the module but a3() is used for vmlinux,
      because each was the last one added.
      
      Example of the race with going modules:
      =======================================
      
      CPU0					CPU1
      
      delete_module()  #SYSCALL
      
         try_stop_module()
           mod->state = MODULE_STATE_GOING;
      
         mutex_unlock(&module_mutex);
      
      					klp_register_patch()
      					klp_enable_patch()
      
      					#save place to switch universe
      
      					b()     # from module that is going
      					  a()   # from core (patched)
      
         mod->exit();
      
      Note that the function b() can be called until we call mod->exit().
      
      If we do not apply the patch to b() because the module is in MODULE_STATE_GOING,
      b() will call the patched a() with modified semantics and things might go wrong.
      
      [jpoimboe@redhat.com: use one boolean instead of two]
      Signed-off-by: Petr Mladek <pmladek@suse.cz>
      Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Acked-by: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Jiri Kosina <jkosina@suse.cz>
      8cb2c2dc
  4. 13 March 2015 (3 commits)
  5. 12 March 2015 (2 commits)
    • xps: must clear sender_cpu before forwarding · c29390c6
      Eric Dumazet authored
      John reported that my previous commit added a regression
      on his router.
      
      This is because sender_cpu & napi_id share a common location,
      so get_xps_queue() can see garbage and perform an out-of-bounds access.
      
      We need to make sure sender_cpu is cleared before doing the transmit,
      otherwise any NIC with busy polling enabled (skb_mark_napi_id()) can
      trigger this bug.
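      
      The fix boils down to a small helper called from the forwarding paths,
      along the lines of this sketch:
      
      	static inline void skb_sender_cpu_clear(struct sk_buff *skb)
      	{
      	#ifdef CONFIG_XPS
      		skb->sender_cpu = 0;	/* may still hold a napi_id from RX */
      	#endif
      	}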
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: John <jw@nuclearfallout.net>
      Bisected-by: John <jw@nuclearfallout.net>
      Fixes: 2bd82484 ("xps: fix xps for stacked devices")
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c29390c6
    • clk: introduce clk_is_match · 3d3801ef
      Michael Turquette authored
      Some drivers compare struct clk pointers as a means of knowing
      if the two pointers reference the same clock hardware. This behavior is
      dubious (drivers must not dereference struct clk), but did not cause any
      regressions until the per-user struct clk patch was merged. Now the test
      for matching clks will always fail with per-user struct clks.
      
      clk_is_match is introduced to fix the regression and prevent drivers
      from comparing the pointers manually.
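      
      A driver that used to compare the pointers directly would now do something
      like this (names are illustrative):
      
      	/* before: dubious, breaks with per-user struct clk */
      	if (clk == timer_clk)
      		setup_timer_clock();
      
      	/* after: ask the clk core whether both handles reference
      	 * the same hardware clock */
      	if (clk_is_match(clk, timer_clk))
      		setup_timer_clock();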
      
      Fixes: 035a61c3 ("clk: Make clk API return per-user struct clk instances")
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Shawn Guo <shawn.guo@linaro.org>
      Cc: Tomeu Vizoso <tomeu.vizoso@collabora.com>
      Signed-off-by: Michael Turquette <mturquette@linaro.org>
      [arnd@arndb.de: Fix COMMON_CLK=N && HAS_CLK=Y config]
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      [sboyd@codeaurora.org: const arguments to clk_is_match() and
      remove unnecessary ternary operation]
      Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
      3d3801ef
  6. 08 March 2015 (2 commits)
  7. 07 March 2015 (2 commits)
  8. 06 March 2015 (1 commit)
  9. 05 March 2015 (3 commits)
    • workqueue: fix hang involving racing cancel[_delayed]_work_sync()'s for PREEMPT_NONE · 8603e1b3
      Tejun Heo authored
      cancel[_delayed]_work_sync() are implemented using
      __cancel_work_timer() which grabs the PENDING bit using
      try_to_grab_pending() and then flushes the work item with PENDING set
      to prevent the on-going execution of the work item from requeueing
      itself.
      
      try_to_grab_pending() can always grab the PENDING bit without blocking
      except when someone else is doing the above flushing during
      cancellation.  In that case, try_to_grab_pending() returns -ENOENT and
      __cancel_work_timer() currently invokes flush_work().  The
      assumption is that the completion of the work item is what the other
      canceling task would be waiting for too, and thus waiting for the same
      condition and retrying should allow forward progress without excessive
      busy looping.
      
      Unfortunately, this doesn't work if preemption is disabled or the
      latter task has real time priority.  Let's say task A just got woken
      up from flush_work() by the completion of the target work item.  If,
      before task A starts executing, task B gets scheduled and invokes
      __cancel_work_timer() on the same work item, its try_to_grab_pending()
      will return -ENOENT as the work item is still being canceled by task A
      and flush_work() will also immediately return false as the work item
      is no longer executing.  This puts task B in a busy loop possibly
      preventing task A from executing and clearing the canceling state on
      the work item leading to a hang.
      
      task A			task B			worker
      
      						executing work
      __cancel_work_timer()
        try_to_grab_pending()
        set work CANCELING
        flush_work()
          block for work completion
      						completion, wakes up A
      			__cancel_work_timer()
      			while (forever) {
      			  try_to_grab_pending()
      			    -ENOENT as work is being canceled
      			  flush_work()
      			    false as work is no longer executing
      			}
      
      This patch removes the possible hang by updating __cancel_work_timer()
      to explicitly wait for clearing of CANCELING rather than invoking
      flush_work() after try_to_grab_pending() fails with -ENOENT.
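      
      Simplified, the retry loop in __cancel_work_timer() becomes something like
      the following sketch (the definition of the wait queue and its matching
      wake function is omitted):
      
      	do {
      		ret = try_to_grab_pending(work, is_dwork, &flags);
      		if (unlikely(ret == -ENOENT)) {
      			/* someone else owns CANCELING: sleep until it is
      			 * cleared instead of busy-looping in flush_work() */
      			prepare_to_wait(&cancel_waitq, &wait,
      					TASK_UNINTERRUPTIBLE);
      			if (work_is_canceling(work))
      				schedule();
      			finish_wait(&cancel_waitq, &wait);
      		}
      	} while (unlikely(ret < 0));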
      
      Link: http://lkml.kernel.org/g/20150206171156.GA8942@axis.com
      
      v3: bit_waitqueue() can't be used for work items defined in the vmalloc
          area.  Switched to a custom wake function which matches the target
          work item, and to exclusive wait and wakeup.
      
      v2: v1 used wake_up() on bit_waitqueue() which leads to NULL deref if
          the target bit waitqueue has wait_bit_queue's on it.  Use
          DEFINE_WAIT_BIT() and __wake_up_bit() instead.  Reported by Tomeu
          Vizoso.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Rabin Vincent <rabin.vincent@axis.com>
      Cc: Tomeu Vizoso <tomeu.vizoso@gmail.com>
      Cc: stable@vger.kernel.org
      Tested-by: Jesper Nilsson <jesper.nilsson@axis.com>
      Tested-by: Rabin Vincent <rabin.vincent@axis.com>
      8603e1b3
    • Revert "pinctrl: consumer: use correct retval for placeholder functions" · 40eeb111
      Linus Walleij authored
      This reverts commit 5a7d2efd.
      
      As per discussion on the mailing list, this is not the right
      thing to do. NULL cookies are valid in the stubs.
      Reported-by: Wolfram Sang <wsa@the-dreams.de>
      Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
      40eeb111
    • genirq / PM: Add flag for shared NO_SUSPEND interrupt lines · 17f48034
      Rafael J. Wysocki authored
      It currently is required that all users of NO_SUSPEND interrupt
      lines pass the IRQF_NO_SUSPEND flag when requesting the IRQ or the
      WARN_ON_ONCE() in irq_pm_install_action() will trigger.  That is
      done to warn about situations in which unprepared interrupt handlers
      may be run unnecessarily for suspended devices and may attempt to
      access those devices by mistake.  However, it may cause drivers
      that have no technical reasons for using IRQF_NO_SUSPEND to set
      that flag just because they happen to share the interrupt line
      with something like a timer.
      
      Moreover, the generic handling of wakeup interrupts introduced by
      commit 9ce7a258 (genirq: Simplify wakeup mechanism) only works
      for IRQs without any NO_SUSPEND users, so the drivers of wakeup
      devices needing to use shared NO_SUSPEND interrupt lines for
      signaling system wakeup generally have to detect wakeup in their
      interrupt handlers.  Thus if they happen to share an interrupt line
      with a NO_SUSPEND user, they also need to request that their
      interrupt handlers be run after suspend_device_irqs().
      
      In both cases the reason for using IRQF_NO_SUSPEND is not because
      the driver in question has a genuine need to run its interrupt
      handler after suspend_device_irqs(), but because it happens to
      share the line with some other NO_SUSPEND user.  Otherwise, the
      driver would do without IRQF_NO_SUSPEND just fine.
      
      To make it possible to specify that condition explicitly, introduce
      a new IRQ action handler flag for shared IRQs, IRQF_COND_SUSPEND,
      that, when set, will indicate to the IRQ core that the interrupt
      user is generally fine with suspending the IRQ, but it also can
      tolerate handler invocations after suspend_device_irqs() and, in
      particular, it is capable of detecting system wakeup and triggering
      it as appropriate from its interrupt handler.
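      
      For a driver that shares its line with a NO_SUSPEND user and can detect
      wakeup from its own handler, the request would look roughly like this
      (names are illustrative):
      
      	ret = request_irq(irq, my_shared_handler,
      			  IRQF_SHARED | IRQF_COND_SUSPEND,
      			  "my-device", dev);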
      
      That will allow us to work around a problem with a shared timer
      interrupt line on at91 platforms.
      
      Link: http://marc.info/?l=linux-kernel&m=142252777602084&w=2
      Link: http://marc.info/?t=142252775300011&r=1&w=2
      Link: https://lkml.org/lkml/2014/12/15/552
      Reported-by: Boris Brezillon <boris.brezillon@free-electrons.com>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Mark Rutland <mark.rutland@arm.com>
      17f48034
  10. 04 March 2015 (1 commit)
    • NFS: Fix a regression in the read() syscall · 874f9463
      Trond Myklebust authored
      When invalidating the page cache for a regular file, we want to first
      sync all dirty data to disk and then call invalidate_inode_pages2().
      The latter relies on nfs_launder_page() and nfs_release_page() to deal
      with dirty pages and unstable written pages, respectively.
      
      When commit 95905446 ("NFS: avoid deadlocks with loop-back mounted
      NFS filesystems.") changed the behaviour of nfs_release_page(), it became
      possible for invalidate_inode_pages2() to fail with EBUSY.
      Unfortunately, that error is then propagated back to read().
      
      Let's therefore work around the problem for now by protecting the call
      to sync the data and invalidate_inode_pages2() so that they are atomic
      w.r.t. the addition of new writes.
      Later on, we can revisit whether or not we still need nfs_launder_page()
      and nfs_release_page().
      Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
      874f9463
  11. 03 March 2015 (2 commits)
  12. 02 March 2015 (3 commits)
  13. 28 February 2015 (2 commits)
    • rhashtable: remove indirection for grow/shrink decision functions · 4c4b52d9
      Daniel Borkmann authored
      Currently, all real users of rhashtable default their grow and shrink
      decision functions to rht_grow_above_75() and rht_shrink_below_30(),
      so that there's currently no need to have this explicitly selectable.
      
      It can and should stay generic and private inside rhashtable until a real
      use case pops up. Making it private also saves us this additional
      indirection layer and improves insertion/deletion time as well.
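      
      The now-private decision helpers are simple load-factor checks, roughly
      like this sketch (min/max table-size handling omitted):
      
      	/* expand when utilisation goes above 75% of the new size */
      	static bool rht_grow_above_75(const struct rhashtable *ht,
      				      size_t new_size)
      	{
      		return atomic_read(&ht->nelems) > (new_size / 4 * 3);
      	}
      
      	/* shrink when utilisation drops below 30% */
      	static bool rht_shrink_below_30(const struct rhashtable *ht,
      					size_t new_size)
      	{
      		return atomic_read(&ht->nelems) < (new_size * 3 / 10);
      	}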
      
      Reference: http://patchwork.ozlabs.org/patch/443040/
      Suggested-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Thomas Graf <tgraf@suug.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4c4b52d9
    • dm snapshot: suspend merging snapshot when doing exception handover · 09ee96b2
      Mikulas Patocka authored
      The "dm snapshot: suspend origin when doing exception handover" commit
      fixed a exception store handover bug associated with pending exceptions
      to the "snapshot-origin" target.
      
      However, a similar problem exists in snapshot merging.  When snapshot
      merging is in progress, we use the target "snapshot-merge" instead of
      "snapshot-origin".  Consequently, during exception store handover, we
      must find the snapshot-merge target and suspend its associated
      mapped_device.
      
      To avoid lockdep warnings, the target must be suspended and resumed
      without holding _origins_lock.
      
      Introduce a dm_hold() function that grabs a reference on a
      mapped_device but, unlike dm_get(), doesn't crash if the device has
      the DMF_FREEING flag set; it returns an error in that case instead.
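      
      Conceptually, dm_hold() is dm_get() guarded by a check under the minor
      lock, along these lines (a sketch, not the exact code):
      
      	int dm_hold(struct mapped_device *md)
      	{
      		spin_lock(&_minor_lock);
      		if (test_bit(DMF_FREEING, &md->flags)) {
      			spin_unlock(&_minor_lock);
      			return -EBUSY;
      		}
      		dm_get(md);
      		spin_unlock(&_minor_lock);
      		return 0;
      	}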
      
      In snapshot_resume() we grab the reference to the origin device using
      dm_hold() while holding _origins_lock (_origins_lock guarantees that the
      device won't disappear).  Then we release _origins_lock, suspend the
      device and grab _origins_lock again.
      
      NOTE to stable@ people:
      When backporting to kernels 3.18 and older, use dm_internal_suspend and
      dm_internal_resume instead of dm_internal_suspend_fast and
      dm_internal_resume_fast.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org
      09ee96b2
  14. 27 February 2015 (1 commit)
  15. 26 February 2015 (1 commit)
    • genirq / PM: better describe IRQF_NO_SUSPEND semantics · 737eb030
      Mark Rutland authored
      The IRQF_NO_SUSPEND flag is intended to be used for interrupts required
      to be enabled during the suspend-resume cycle. This mostly consists of
      IPIs and timer interrupts, potentially including chained irqchip
      interrupts if these are necessary to handle timers or IPIs. If an
      interrupt does not fall into one of the aforementioned categories,
      requesting it with IRQF_NO_SUSPEND is likely incorrect.
      
      Using IRQF_NO_SUSPEND does not guarantee that the interrupt can wake the
      system from a suspended state. For an interrupt to be able to trigger a
      wakeup, it may be necessary to program various components of the system.
      In these cases it is necessary to use {enable,disable}_irq_wake.
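      
      In other words, a wakeup-capable driver typically requests a normal
      interrupt and arms it for wakeup explicitly, for example (illustrative
      names):
      
      	ret = request_irq(irq, my_handler, 0, "my-device", dev);
      
      	/* later, when configuring the device as a wakeup source */
      	enable_irq_wake(irq);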
      
      Unfortunately, several drivers assume that IRQF_NO_SUSPEND ensures that
      an IRQ can wake up the system, and the documentation can be read
      ambiguously w.r.t. this property.
      
      This patch updates the documentation regarding IRQF_NO_SUSPEND to make
      this caveat explicit, hopefully making future misuse rarer. Cleanup of
      existing misuse will occur as part of later patch series.
      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      737eb030
  16. 25 February 2015 (1 commit)
  17. 23 February 2015 (4 commits)
  18. 22 February 2015 (2 commits)
  19. 21 February 2015 (1 commit)
    • net: Initialize all members in skb_gro_remcsum_init() · 846cd667
      Geert Uytterhoeven authored
      skb_gro_remcsum_init() initializes the gro_remcsum.delta member only,
      leading to compiler warnings about a possibly uninitialized
      gro_remcsum.offset member:
      
      drivers/net/vxlan.c: In function ‘vxlan_gro_receive’:
      drivers/net/vxlan.c:602: warning: ‘grc.offset’ may be used uninitialized in this function
      net/ipv4/fou.c: In function ‘gue_gro_receive’:
      net/ipv4/fou.c:262: warning: ‘grc.offset’ may be used uninitialized in this function
      
      While these are harmless for now:
        - skb_gro_remcsum_process() sets offset before changing delta,
        - skb_gro_remcsum_cleanup() checks if delta is non-zero before
          accessing offset,
      it's safer to let the initialization function initialize all members.
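      
      With the fix, the initializer clears both members:
      
      	static inline void skb_gro_remcsum_init(struct gro_remcsum *grc)
      	{
      		grc->offset = 0;
      		grc->delta = 0;
      	}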
      Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Acked-by: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      846cd667
  20. 20 February 2015 (5 commits)
    • NVMe: Fix potential corruption during shutdown · 07836e65
      Keith Busch authored
      The driver has to end unreturned commands at some point even if the
      controller has not provided a completion. The driver tried to be safe by
      deleting IO queues prior to ending all unreturned commands. That should
      cause the controller to internally abort inflight commands, but the IO queue
      deletion request is not guaranteed to succeed, so all bets are off. We
      still have to make progress, so to be extra safe, this patch doesn't
      clear a queue to release the dma mapping for a command until after the
      pci device has been disabled.
      
      This patch removes the special handling during device initialization
      so controller recovery can be done all the time. This is possible since
      initialization is not inlined with pci probe anymore.
      Reported-by: Nilish Choudhury <nilesh.choudhury@oracle.com>
      Signed-off-by: Keith Busch <keith.busch@intel.com>
      07836e65
    • NVMe: Asynchronous controller probe · 2e1d8448
      Keith Busch authored
      This performs the longest parts of nvme device probe in scheduled work.
      This speeds up probe significantly when multiple devices are in use.
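      
      The shape of the change is to defer the slow part of probe to a work item,
      roughly like this sketch (names assumed for illustration):
      
      	/* in nvme_probe(), after the cheap setup */
      	INIT_WORK(&dev->probe_work, nvme_async_probe);
      	schedule_work(&dev->probe_work);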
      Signed-off-by: Keith Busch <keith.busch@intel.com>
      2e1d8448
    • NVMe: Register management handle under nvme class · b3fffdef
      Keith Busch authored
      This creates a new class type for nvme devices to register their
      management character devices with. This is so we do not rely on miscdev
      to provide enough minors for as many nvme devices some people plan to
      use. The previous limit was approximately 60 NVMe controllers, depending
      on the platform and kernel. Now the limit is 1M, which ought to be enough
      for anybody.
      
      Since we have a new device class, it makes sense to attach the block
      devices under this as well, so part of this patch moves the management
      handle initialization prior to namespace discovery.
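      
      In outline, the driver registers its own device class and hangs each
      controller's management handle off it (a sketch using the class API of
      that era; field names are assumptions):
      
      	nvme_class = class_create(THIS_MODULE, "nvme");
      
      	dev->device = device_create(nvme_class, &pdev->dev,
      				    MKDEV(nvme_char_major, dev->instance),
      				    dev, "nvme%d", dev->instance);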
      Signed-off-by: Keith Busch <keith.busch@intel.com>
      b3fffdef
    • NVMe: Update SCSI Inquiry VPD 83h translation · 4f1982b4
      Keith Busch authored
      The original translation created collisions on Inquiry VPD 83 for many
      existing devices. Newer specifications provide other ways to translate,
      based on the device's version, that can be used to create unique identifiers.
      
      Version 1.1 provides an EUI64 field that uniquely identifies each
      namespace, and 1.2 added the longer NGUID field for the same reason.
      Both follow the IEEE EUI format and readily translate to the SCSI device
      identification EUI designator type 2h. For devices implementing either,
      the translation will use this type, preferring the 8-byte EUI64 if
      implemented and the 16-byte NGUID otherwise. If neither is provided,
      the 1.0 translation is used, and is updated to use the SCSI String format
      to guarantee a unique identifier.
      
      Knowing when to use the new fields depends on the nvme controller's
      revision. The NVME_VS macro was not decoding this correctly, so that is
      fixed in this patch and moved to a more appropriate place.
      
      Since the Identify Namespace structure required an update for the NGUID
      field, this patch adds the remaining new 1.2 fields to the structure.
      Signed-off-by: Keith Busch <keith.busch@intel.com>
      4f1982b4
    • NVMe: Metadata format support · e1e5e564
      Keith Busch authored
      Adds support for NVMe metadata formats and exposes block devices for
      all namespaces regardless of their format. Namespace formats that are
      unusable will have disk capacity set to 0, but a handle to the block
      device is created to simplify device management. A namespace is not
      usable when the format requires the host to interleave block data and
      metadata in a single buffer, has no provisioned storage, or has metadata
      but failed to register with blk integrity.
      
      The namespace has to be scanned in two phases to support separate
      metadata formats. The first establishes the sector size and capacity
      prior to invoking add_disk. If metadata is required, the capacity will
      be temporarily set to 0 until it can be revalidated and registered with
      the integrity extensions after add_disk completes.
      
      The driver relies on the integrity extensions to provide the metadata
      buffer. NVMe requires this be a single physically contiguous region,
      so only one integrity segment is allowed per command. If the metadata
      is used for T10 PI, the driver provides mappings to save and restore
      the reftag physical block translation. The driver provides no-op
      functions for generate and verify if metadata is not used for protection
      information. This way the setup is always provided by the block layer.
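      
      The no-op generate/verify hooks look roughly like this sketch (assuming the
      blk_integrity_iter interface of that era):
      
      	static int nvme_noop_verify(struct blk_integrity_iter *iter)
      	{
      		return 0;
      	}
      
      	static int nvme_noop_generate(struct blk_integrity_iter *iter)
      	{
      		return 0;
      	}
      
      	static struct blk_integrity nvme_meta_noop = {
      		.name		= "NVME_META_NOOP",
      		.generate_fn	= nvme_noop_generate,
      		.verify_fn	= nvme_noop_verify,
      	};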
      
      If a request does not supply a required metadata buffer, the command
      is failed with bad address. This could only happen if a user manually
      disables verify/generate on such a disk. The only exception where
      this is okay is if the controller is capable of stripping/generating
      the metadata, which is possible on some types of formats.
      
      The metadata scatter-gather list now occupies the spot in the nvme_iod
      that was previously used to link retryable IODs; we don't do that
      anymore, so the field was unused.
      Signed-off-by: Keith Busch <keith.busch@intel.com>
      e1e5e564