1. 26 6月, 2020 18 次提交
  2. 25 6月, 2020 22 次提交
    • R
      cpuidle: Rearrange s2idle-specific idle state entry code · 10e8b11e
      Rafael J. Wysocki 提交于
      Implement call_cpuidle_s2idle() in analogy with call_cpuidle()
      for the s2idle-specific idle state entry and invoke it from
      cpuidle_idle_call() to make the s2idle-specific idle entry code
      path look more similar to the "regular" idle entry one.
      
      No intentional functional impact.
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Acked-by: NChen Yu <yu.c.chen@intel.com>
      10e8b11e
    • D
      net: bcmgenet: use hardware padding of runt frames · 20d1f2d1
      Doug Berger 提交于
      When commit 474ea9ca ("net: bcmgenet: correctly pad short
      packets") added the call to skb_padto() it should have been
      located before the nr_frags parameter was read since that value
      could be changed when padding packets with lengths between 55
      and 59 bytes (inclusive).
      
      The use of a stale nr_frags value can cause corruption of the
      pad data when tx-scatter-gather is enabled. This corruption of
      the pad can cause invalid checksum computation when hardware
      offload of tx-checksum is also enabled.
      
      Since the original reason for the padding was corrected by
      commit 7dd39913 ("net: bcmgenet: fix skb_len in
      bcmgenet_xmit_single()") we can remove the software padding all
      together and make use of hardware padding of short frames as
      long as the hardware also always appends the FCS value to the
      frame.
      
      Fixes: 474ea9ca ("net: bcmgenet: correctly pad short packets")
      Signed-off-by: NDoug Berger <opendmb@gmail.com>
      Acked-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      20d1f2d1
    • D
      net: bcmgenet: use __be16 for htons(ETH_P_IP) · d966d2ef
      Doug Berger 提交于
      The 16-bit value that holds a short in network byte order should
      be declared as a restricted big endian type to allow type checks
      to succeed during assignment.
      
      Fixes: 3e370952 ("net: bcmgenet: add support for ethtool rxnfc flows")
      Reported-by: Nkbuild test robot <lkp@intel.com>
      Signed-off-by: NDoug Berger <opendmb@gmail.com>
      Acked-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d966d2ef
    • D
      net: bcmgenet: re-remove bcmgenet_hfb_add_filter · 673bafd5
      Doug Berger 提交于
      This function was originally removed by Baoyou Xie in
      commit e2072600 ("net: bcmgenet: remove unused function in
      bcmgenet.c") to prevent a build warning.
      
      Some of the functions removed by Baoyou Xie are now used for
      WAKE_FILTER support so his commit was reverted, but this function
      is still unused and the kbuild test robot dutifully reported the
      warning.
      
      This commit once again removes the remaining unused hfb functions.
      
      Fixes: 14da1510 ("Revert "net: bcmgenet: remove unused function in bcmgenet.c"")
      Reported-by: Nkbuild test robot <lkp@intel.com>
      Signed-off-by: NDoug Berger <opendmb@gmail.com>
      Acked-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      673bafd5
    • B
      drm/amd: fix potential memleak in err branch · b5b78a6c
      Bernard Zhao 提交于
      The function kobject_init_and_add alloc memory like:
      kobject_init_and_add->kobject_add_varg->kobject_set_name_vargs
      ->kvasprintf_const->kstrdup_const->kstrdup->kmalloc_track_caller
      ->kmalloc_slab, in err branch this memory not free. If use
      kmemleak, this path maybe catched.
      These changes are to add kobject_put in kobject_init_and_add
      failed branch, fix potential memleak.
      Signed-off-by: NBernard Zhao <bernard@vivo.com>
      Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com>
      Signed-off-by: NFelix Kuehling <Felix.Kuehling@amd.com>
      Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
      Cc: stable@vger.kernel.org
      b5b78a6c
    • S
      drm/amd/display: Fix ineffective setting of max bpc property · fa7041d9
      Stylon Wang 提交于
      [Why]
      Regression was introduced where setting max bpc property has no effect
      on the atomic check and final commit. It has the same effect as max bpc
      being stuck at 8.
      
      [How]
      Correctly propagate max bpc with the new connector state.
      Signed-off-by: NStylon Wang <stylon.wang@amd.com>
      Reviewed-by: NNicholas Kazlauskas <Nicholas.Kazlauskas@amd.com>
      Acked-by: NRodrigo Siqueira <Rodrigo.Siqueira@amd.com>
      Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
      Cc: stable@vger.kernel.org
      fa7041d9
    • S
      drm/amd/display: Enable output_bpc property on all outputs · 5ae9c378
      Stylon Wang 提交于
      [Why]
      Connector property output_bpc is available on DP/eDP only. New IGT tests
      would benifit if this property works on HDMI.
      
      [How]
      Enable this read-only property on all types of connectors.
      Signed-off-by: NStylon Wang <stylon.wang@amd.com>
      Reviewed-by: NNicholas Kazlauskas <Nicholas.Kazlauskas@amd.com>
      Acked-by: NRodrigo Siqueira <Rodrigo.Siqueira@amd.com>
      Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
      Cc: stable@vger.kernel.org
      5ae9c378
    • W
      drm/amdgpu: add fw release for sdma v5_0 · edfaf6fa
      Wenhui Sheng 提交于
      sdma fw isn't released when module exit
      Reviewed-by: NHawking Zhang <Hawking.Zhang@amd.com>
      Signed-off-by: NWenhui Sheng <Wenhui.Sheng@amd.com>
      Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
      Cc: stable@vger.kernel.org
      edfaf6fa
    • C
      qed: add missing error test for DBG_STATUS_NO_MATCHING_FRAMING_MODE · a5124386
      Colin Ian King 提交于
      The error DBG_STATUS_NO_MATCHING_FRAMING_MODE was added to the enum
      enum dbg_status however there is a missing corresponding entry for
      this in the array s_status_str. This causes an out-of-bounds read when
      indexing into the last entry of s_status_str.  Fix this by adding in
      the missing entry.
      
      Addresses-Coverity: ("Out-of-bounds read").
      Fixes: 2d22bc83 ("qed: FW 8.42.2.0 debug features")
      Signed-off-by: NColin Ian King <colin.king@canonical.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a5124386
    • J
      net: phy: call phy_disable_interrupts() in phy_init_hw() · 9886a4db
      Jisheng Zhang 提交于
      Call phy_disable_interrupts() in phy_init_hw() to "have a defined init
      state as we don't know in which state the PHY is if the PHY driver is
      loaded. We shouldn't assume that it's the chip power-on defaults, BIOS
      or boot loader could have changed this. Or in case of dual-boot
      systems the other OS could leave the PHY in whatever state." as pointed
      out by Heiner.
      Suggested-by: NHeiner Kallweit <hkallweit1@gmail.com>
      Signed-off-by: NJisheng Zhang <Jisheng.Zhang@synaptics.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9886a4db
    • J
      net: phy: make phy_disable_interrupts() non-static · 3dd4ef1b
      Jisheng Zhang 提交于
      We face an issue with rtl8211f, a pin is shared between INTB and PMEB,
      and the PHY Register Accessible Interrupt is enabled by default, so
      the INTB/PMEB pin is always active in polling mode case.
      
      As Heiner pointed out "I was thinking about calling
      phy_disable_interrupts() in phy_init_hw(), to have a defined init
      state as we don't know in which state the PHY is if the PHY driver is
      loaded. We shouldn't assume that it's the chip power-on defaults, BIOS
      or boot loader could have changed this. Or in case of dual-boot
      systems the other OS could leave the PHY in whatever state."
      
      Make phy_disable_interrupts() non-static so that it could be used in
      phy_init_hw() to have a defined init state.
      Suggested-by: NHeiner Kallweit <hkallweit1@gmail.com>
      Signed-off-by: NJisheng Zhang <Jisheng.Zhang@synaptics.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3dd4ef1b
    • S
      net: ethernet: mvneta: Add back interface mode validation · 41c2b6b4
      Sascha Hauer 提交于
      When writing the serdes configuration register was moved to
      mvneta_config_interface() the whole code block was removed from
      mvneta_port_power_up() in the assumption that its only purpose was to
      write the serdes configuration register. As mentioned by Russell King
      its purpose was also to check for valid interface modes early so that
      later in the driver we do not have to care for unexpected interface
      modes.
      Add back the test to let the driver bail out early on unhandled
      interface modes.
      
      Fixes: b4748553 ("net: ethernet: mvneta: Fix Serdes configuration for SoCs without comphy")
      Signed-off-by: NSascha Hauer <s.hauer@pengutronix.de>
      Reviewed-by: NRussell King <rmk+kernel@armlinux.org.uk>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      41c2b6b4
    • S
      net: ethernet: mvneta: Do not error out in non serdes modes · d3d239dc
      Sascha Hauer 提交于
      In mvneta_config_interface() the RGMII modes are catched by the default
      case which is an error return. The RGMII modes are valid modes for the
      driver, so instead of returning an error add a break statement to return
      successfully.
      
      This avoids this warning for non comphy SoCs which use RGMII, like
      SolidRun Clearfog:
      
      WARNING: CPU: 0 PID: 268 at drivers/net/ethernet/marvell/mvneta.c:3512 mvneta_start_dev+0x220/0x23c
      
      Fixes: b4748553 ("net: ethernet: mvneta: Fix Serdes configuration for SoCs without comphy")
      Signed-off-by: NSascha Hauer <s.hauer@pengutronix.de>
      Reviewed-by: NRussell King <rmk+kernel@armlinux.org.uk>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d3d239dc
    • D
      drm/fb-helper: Fix vt restore · dc5bdb68
      Daniel Vetter 提交于
      In the past we had a pile of hacks to orchestrate access between fbdev
      emulation and native kms clients. We've tried to streamline this, by
      always preferring the kms side above fbdev calls when a drm master
      exists, because drm master controls access to the display resources.
      
      Unfortunately this breaks existing userspace, specifically Xorg. When
      exiting Xorg first restores the console to text mode using the KDSET
      ioctl on the vt. This does nothing, because a drm master is still
      around. Then it drops the drm master status, which again does nothing,
      because logind is keeping additional drm fd open to be able to
      orchestrate vt switches. In the past this is the point where fbdev was
      restored, as part of the ->lastclose hook on the drm side.
      
      Now to fix this regression we don't want to go back to letting fbdev
      restore things whenever it feels like, or to the pile of hacks we've
      had before. Instead try and go with a minimal exception to make the
      KDSET case work again, and nothing else.
      
      This means that if userspace does a KDSET call when switching between
      graphical compositors, there will be some flickering with fbcon
      showing up for a bit. But a) that's not a regression and b) userspace
      can fix it by improving the vt switching dance - logind should have
      all the information it needs.
      
      While pondering all this I'm also wondering wheter we should have a
      SWITCH_MASTER ioctl to allow race-free master status handover. But
      that's for another day.
      
      v2: Somehow forgot to cc all the fbdev people.
      
      v3: Fix typo Alex spotted.
      Reviewed-by: NAlex Deucher <alexander.deucher@amd.com>
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=208179
      Cc: shlomo@fastmail.com
      Reported-and-Tested-by: shlomo@fastmail.com
      Cc: Michel Dänzer <michel@daenzer.net>
      Fixes: 64914da2 ("drm/fbdev-helper: don't force restores")
      Cc: Noralf Trønnes <noralf@tronnes.org>
      Cc: Thomas Zimmermann <tzimmermann@suse.de>
      Cc: Daniel Vetter <daniel.vetter@intel.com>
      Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
      Cc: Maxime Ripard <mripard@kernel.org>
      Cc: David Airlie <airlied@linux.ie>
      Cc: Daniel Vetter <daniel@ffwll.ch>
      Cc: dri-devel@lists.freedesktop.org
      Cc: <stable@vger.kernel.org> # v5.7+
      Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Nathan Chancellor <natechancellor@gmail.com>
      Cc: Qiujun Huang <hqjagain@gmail.com>
      Cc: Peter Rosin <peda@axentia.se>
      Cc: linux-fbdev@vger.kernel.org
      Signed-off-by: NDaniel Vetter <daniel.vetter@intel.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20200624092910.3280448-1-daniel.vetter@ffwll.ch
      dc5bdb68
    • M
      IB/hfi1: Add atomic triggered sleep/wakeup · 38fd98af
      Mike Marciniszyn 提交于
      When running iperf in a two host configuration the following trace can
      occur:
      
      [  319.728730] NETDEV WATCHDOG: ib0 (hfi1): transmit queue 0 timed out
      
      The issue happens because the current implementation relies on the netif
      txq being stopped to control the flushing of the tx list.
      
      There are two resources that the transmit logic can wait on and stop the
      txq:
      - SDMA descriptors
      - Ring space to hold completions
      
      The ring space is tested on the sending side and relieved when the ring is
      consumed in the napi tx reaping.
      
      Unfortunately, that reaping can run conncurrently with the workqueue
      flushing of the txlist.  If the txq is started just before the workitem
      executes, the txlist will never be flushed, leading to the txq being
      stuck.
      
      Fix by:
      - Adding sleep/wakeup wrappers
        * Use an atomic to control the call to the netif routines inside the
          wrappers
      
      - Use another atomic to record ring space exhaustion
        * Only wakeup when the a ring space exhaustion has happened and it
          relieved
      
      Add additional wrappers to clarify the ring space resource handling.
      
      Fixes: d99dc602 ("IB/hfi1: Add functions to transmit datagram ipoib packets")
      Link: https://lore.kernel.org/r/20200623204327.108092.4024.stgit@awfm-01.aw.intel.comReviewed-by: NKaike Wan <kaike.wan@intel.com>
      Signed-off-by: NMike Marciniszyn <mike.marciniszyn@intel.com>
      Signed-off-by: NDennis Dalessandro <dennis.dalessandro@intel.com>
      Signed-off-by: NJason Gunthorpe <jgg@nvidia.com>
      38fd98af
    • M
      IB/hfi1: Correct -EBUSY handling in tx code · 82172b76
      Mike Marciniszyn 提交于
      The current code mishandles -EBUSY in two ways:
      - The flow change doesn't test the return from the flush and runs on to
        process the current packet racing with the wakeup processing
      - The -EBUSY handling for a single packet inserts the tx into the txlist
        after the submit call, racing with the same wakeup processing
      
      Fix the first by dropping the skb and returning NETDEV_TX_OK.
      
      Fix the second by insuring the the list entry within the txreq is inited
      when allocated.  This enables the sleep routine to detect that the txreq
      has used the non-list api and queue the packet to the txlist.
      
      Both flaws can lead to having the flushing thread executing in causing two
      threads to manipulate the txlist.
      
      Fixes: d99dc602 ("IB/hfi1: Add functions to transmit datagram ipoib packets")
      Link: https://lore.kernel.org/r/20200623204321.108092.83898.stgit@awfm-01.aw.intel.comReviewed-by: NKaike Wan <kaike.wan@intel.com>
      Signed-off-by: NMike Marciniszyn <mike.marciniszyn@intel.com>
      Signed-off-by: NDennis Dalessandro <dennis.dalessandro@intel.com>
      Signed-off-by: NJason Gunthorpe <jgg@nvidia.com>
      82172b76
    • D
      IB/hfi1: Fix module use count flaw due to leftover module put calls · 822fbd37
      Dennis Dalessandro 提交于
      When the try_module_get calls were removed from opening and closing of the
      i2c debugfs file, the corresponding module_put calls were missed.  This
      results in an inaccurate module use count that requires a power cycle to
      fix.
      
      Fixes: 09fbca8e ("IB/hfi1: No need to use try_module_get for debugfs")
      Link: https://lore.kernel.org/r/20200623203230.106975.76240.stgit@awfm-01.aw.intel.com
      Cc: <stable@vger.kernel.org>
      Reviewed-by: NKaike Wan <kaike.wan@intel.com>
      Reviewed-by: NMike Marciniszyn <mike.marciniszyn@intel.com>
      Signed-off-by: NDennis Dalessandro <dennis.dalessandro@intel.com>
      Signed-off-by: NJason Gunthorpe <jgg@nvidia.com>
      822fbd37
    • D
      IB/hfi1: Restore kfree in dummy_netdev cleanup · b46925a2
      Dennis Dalessandro 提交于
      We need to do some rework on the dummy netdev. Calling the free_netdev()
      would normally make sense, and that will be addressed in an upcoming
      patch. For now just revert the behavior to what it was before keeping the
      unused variable removal part of the patch.
      
      The dd->dumm_netdev is mainly used for packet receiving through
      alloc_netdev_mqs() for typical net devices. A a result, it should be freed
      with kfree instead of free_netdev() that leads to a crash when unloading
      the hfi1 module:
      
        BUG: kernel NULL pointer dereference, address: 0000000000000000
        #PF: supervisor read access in kernel mode
        #PF: error_code(0x0000) - not-present page
        PGD 8000000855b54067 P4D 8000000855b54067 PUD 84a4f5067 PMD 0
        Oops: 0000 [#1] SMP PTI
        CPU: 73 PID: 10299 Comm: modprobe Not tainted 5.6.0-rc5+ #1
        Hardware name: Intel Corporation S2600WT2R/S2600WT2R, BIOS SE5C610.86B.01.01.0016.033120161139 03/31/2016
        RIP: 0010:__hw_addr_flush+0x12/0x80
        Code: 40 00 48 83 c4 08 4c 89 e7 5b 5d 41 5c e9 76 77 18 00 66 0f 1f 44 00 00 0f 1f 44 00 00 41 54 49 89 fc 55 53 48 8b 1f 48 39 df <48> 8b 2b 75 08 eb 4a 48 89 eb 48 89 c5 48 89 df e8 99 bf d0 ff 84
        RSP: 0018:ffffb40e08783db8 EFLAGS: 00010282
        RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000002
        RDX: ffffb40e00000000 RSI: 0000000000000246 RDI: ffff88ab13662298
        RBP: ffff88ab13662000 R08: 0000000000001549 R09: 0000000000001549
        R10: 0000000000000001 R11: 0000000000aaaaaa R12: ffff88ab13662298
        R13: ffff88ab1b259e20 R14: ffff88ab1b259e42 R15: 0000000000000000
        FS:  00007fb39b534740(0000) GS:ffff88b31f940000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000000000000000 CR3: 000000084d3ea004 CR4: 00000000003606e0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        Call Trace:
         dev_addr_flush+0x15/0x30
         free_netdev+0x7e/0x130
         hfi1_netdev_free+0x59/0x70 [hfi1]
         remove_one+0x65/0x110 [hfi1]
         pci_device_remove+0x3b/0xc0
         device_release_driver_internal+0xec/0x1b0
         driver_detach+0x46/0x90
         bus_remove_driver+0x58/0xd0
         pci_unregister_driver+0x26/0xa0
         hfi1_mod_cleanup+0xc/0xd54 [hfi1]
         __x64_sys_delete_module+0x16c/0x260
         ? exit_to_usermode_loop+0xa4/0xc0
         do_syscall_64+0x5b/0x200
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Fixes: 193ba031 ("IB/hfi1: Use free_netdev() in hfi1_netdev_free()")
      Link: https://lore.kernel.org/r/20200623203224.106975.16926.stgit@awfm-01.aw.intel.comReviewed-by: NMike Marciniszyn <mike.marciniszyn@intel.com>
      Signed-off-by: NKaike Wan <kaike.wan@intel.com>
      Signed-off-by: NDennis Dalessandro <dennis.dalessandro@intel.com>
      Signed-off-by: NJason Gunthorpe <jgg@nvidia.com>
      b46925a2
    • S
      nvme-multipath: fix bogus request queue reference put · c3124466
      Sagi Grimberg 提交于
      The mpath disk node takes a reference on the request mpath
      request queue when adding live path to the mpath gendisk.
      However if we connected to an inaccessible path device_add_disk
      is not called, so if we disconnect and remove the mpath gendisk
      we endup putting an reference on the request queue that was
      never taken [1].
      
      Fix that to check if we ever added a live path (using
      NVME_NS_HEAD_HAS_DISK flag) and if not, clear the disk->queue
      reference.
      
      [1]:
      ------------[ cut here ]------------
      refcount_t: underflow; use-after-free.
      WARNING: CPU: 1 PID: 1372 at lib/refcount.c:28 refcount_warn_saturate+0xa6/0xf0
      CPU: 1 PID: 1372 Comm: nvme Tainted: G           O      5.7.0-rc2+ #3
      Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-1ubuntu1 04/01/2014
      RIP: 0010:refcount_warn_saturate+0xa6/0xf0
      RSP: 0018:ffffb29e8053bdc0 EFLAGS: 00010282
      RAX: 0000000000000000 RBX: ffff8b7a2f4fc060 RCX: 0000000000000007
      RDX: 0000000000000007 RSI: 0000000000000092 RDI: ffff8b7a3ec99980
      RBP: ffff8b7a2f4fc000 R08: 00000000000002e1 R09: 0000000000000004
      R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000
      R13: fffffffffffffff2 R14: ffffb29e8053bf08 R15: ffff8b7a320e2da0
      FS:  00007f135d4ca800(0000) GS:ffff8b7a3ec80000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00005651178c0c30 CR3: 000000003b650005 CR4: 0000000000360ee0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       disk_release+0xa2/0xc0
       device_release+0x28/0x80
       kobject_put+0xa5/0x1b0
       nvme_put_ns_head+0x26/0x70 [nvme_core]
       nvme_put_ns+0x30/0x60 [nvme_core]
       nvme_remove_namespaces+0x9b/0xe0 [nvme_core]
       nvme_do_delete_ctrl+0x43/0x5c [nvme_core]
       nvme_sysfs_delete.cold+0x8/0xd [nvme_core]
       kernfs_fop_write+0xc1/0x1a0
       vfs_write+0xb6/0x1a0
       ksys_write+0x5f/0xe0
       do_syscall_64+0x52/0x1a0
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      Reported-by: NAnton Eidelman <anton@lightbitslabs.com>
      Tested-by: NAnton Eidelman <anton@lightbitslabs.com>
      Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      c3124466
    • A
      nvme-multipath: fix deadlock due to head->lock · d8a22f85
      Anton Eidelman 提交于
      In the following scenario scan_work and ana_work will deadlock:
      
      When scan_work calls nvme_mpath_add_disk() this holds ana_lock
      and invokes nvme_parse_ana_log(), which may issue IO
      in device_add_disk() and hang waiting for an accessible path.
      
      While nvme_mpath_set_live() only called when nvme_state_is_live(),
      a transition may cause NVME_SC_ANA_TRANSITION and requeue the IO.
      
      Since nvme_mpath_set_live() holds ns->head->lock, an ana_work on
      ANY ctrl will not be able to complete nvme_mpath_set_live()
      on the same ns->head, which is required in order to update
      the new accessible path and remove NVME_NS_ANA_PENDING..
      Therefore IO never completes: deadlock [1].
      
      Fix:
      Move device_add_disk out of the head->lock and protect it with an
      atomic test_and_set for a new NVME_NS_HEAD_HAS_DISK bit.
      
      [1]:
      kernel: INFO: task kworker/u8:2:160 blocked for more than 120 seconds.
      kernel:       Tainted: G           OE     5.3.5-050305-generic #201910071830
      kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      kernel: kworker/u8:2    D    0   160      2 0x80004000
      kernel: Workqueue: nvme-wq nvme_ana_work [nvme_core]
      kernel: Call Trace:
      kernel:  __schedule+0x2b9/0x6c0
      kernel:  schedule+0x42/0xb0
      kernel:  schedule_preempt_disabled+0xe/0x10
      kernel:  __mutex_lock.isra.0+0x182/0x4f0
      kernel:  __mutex_lock_slowpath+0x13/0x20
      kernel:  mutex_lock+0x2e/0x40
      kernel:  nvme_update_ns_ana_state+0x22/0x60 [nvme_core]
      kernel:  nvme_update_ana_state+0xca/0xe0 [nvme_core]
      kernel:  nvme_parse_ana_log+0xa1/0x180 [nvme_core]
      kernel:  nvme_read_ana_log+0x76/0x100 [nvme_core]
      kernel:  nvme_ana_work+0x15/0x20 [nvme_core]
      kernel:  process_one_work+0x1db/0x380
      kernel:  worker_thread+0x4d/0x400
      kernel:  kthread+0x104/0x140
      kernel:  ret_from_fork+0x35/0x40
      kernel: INFO: task kworker/u8:4:439 blocked for more than 120 seconds.
      kernel:       Tainted: G           OE     5.3.5-050305-generic #201910071830
      kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      kernel: kworker/u8:4    D    0   439      2 0x80004000
      kernel: Workqueue: nvme-wq nvme_scan_work [nvme_core]
      kernel: Call Trace:
      kernel:  __schedule+0x2b9/0x6c0
      kernel:  schedule+0x42/0xb0
      kernel:  io_schedule+0x16/0x40
      kernel:  do_read_cache_page+0x438/0x830
      kernel:  read_cache_page+0x12/0x20
      kernel:  read_dev_sector+0x27/0xc0
      kernel:  read_lba+0xc1/0x220
      kernel:  efi_partition+0x1e6/0x708
      kernel:  check_partition+0x154/0x244
      kernel:  rescan_partitions+0xae/0x280
      kernel:  __blkdev_get+0x40f/0x560
      kernel:  blkdev_get+0x3d/0x140
      kernel:  __device_add_disk+0x388/0x480
      kernel:  device_add_disk+0x13/0x20
      kernel:  nvme_mpath_set_live+0x119/0x140 [nvme_core]
      kernel:  nvme_update_ns_ana_state+0x5c/0x60 [nvme_core]
      kernel:  nvme_mpath_add_disk+0xbe/0x100 [nvme_core]
      kernel:  nvme_validate_ns+0x396/0x940 [nvme_core]
      kernel:  nvme_scan_work+0x256/0x390 [nvme_core]
      kernel:  process_one_work+0x1db/0x380
      kernel:  worker_thread+0x4d/0x400
      kernel:  kthread+0x104/0x140
      kernel:  ret_from_fork+0x35/0x40
      
      Fixes: 0d0b660f ("nvme: add ANA support")
      Signed-off-by: NAnton Eidelman <anton@lightbitslabs.com>
      Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      d8a22f85
    • S
      nvme: don't protect ns mutation with ns->head->lock · e164471d
      Sagi Grimberg 提交于
      Right now ns->head->lock is protecting namespace mutation
      which is wrong and unneeded. Move it to only protect
      against head mutations. While we're at it, remove unnecessary
      ns->head reference as we already have head pointer.
      
      The problem with this is that the head->lock spans
      mpath disk node I/O that may block under some conditions (if
      for example the controller is disconnecting or the path
      became inaccessible), The locking scheme does not allow any
      other path to enable itself, preventing blocked I/O to complete
      and forward-progress from there.
      
      This is a preparation patch for the fix in a subsequent patch
      where the disk I/O will also be done outside the head->lock.
      
      Fixes: 0d0b660f ("nvme: add ANA support")
      Signed-off-by: NAnton Eidelman <anton@lightbitslabs.com>
      Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      e164471d
    • A
      nvme-multipath: fix deadlock between ana_work and scan_work · 489dd102
      Anton Eidelman 提交于
      When scan_work calls nvme_mpath_add_disk() this holds ana_lock
      and invokes nvme_parse_ana_log(), which may issue IO
      in device_add_disk() and hang waiting for an accessible path.
      While nvme_mpath_set_live() only called when nvme_state_is_live(),
      a transition may cause NVME_SC_ANA_TRANSITION and requeue the IO.
      
      In order to recover and complete the IO ana_work on the same ctrl
      should be able to update the path state and remove NVME_NS_ANA_PENDING.
      
      The deadlock occurs because scan_work keeps holding ana_lock,
      so ana_work hangs [1].
      
      Fix:
      Now nvme_mpath_add_disk() uses nvme_parse_ana_log() to obtain a copy
      of the ANA group desc, and then calls nvme_update_ns_ana_state() without
      holding ana_lock.
      
      [1]:
      kernel: Workqueue: nvme-wq nvme_scan_work [nvme_core]
      kernel: Call Trace:
      kernel:  __schedule+0x2b9/0x6c0
      kernel:  schedule+0x42/0xb0
      kernel:  io_schedule+0x16/0x40
      kernel:  do_read_cache_page+0x438/0x830
      kernel:  read_cache_page+0x12/0x20
      kernel:  read_dev_sector+0x27/0xc0
      kernel:  read_lba+0xc1/0x220
      kernel:  efi_partition+0x1e6/0x708
      kernel:  check_partition+0x154/0x244
      kernel:  rescan_partitions+0xae/0x280
      kernel:  __blkdev_get+0x40f/0x560
      kernel:  blkdev_get+0x3d/0x140
      kernel:  __device_add_disk+0x388/0x480
      kernel:  device_add_disk+0x13/0x20
      kernel:  nvme_mpath_set_live+0x119/0x140 [nvme_core]
      kernel:  nvme_update_ns_ana_state+0x5c/0x60 [nvme_core]
      kernel:  nvme_set_ns_ana_state+0x1e/0x30 [nvme_core]
      kernel:  nvme_parse_ana_log+0xa1/0x180 [nvme_core]
      kernel:  nvme_mpath_add_disk+0x47/0x90 [nvme_core]
      kernel:  nvme_validate_ns+0x396/0x940 [nvme_core]
      kernel:  nvme_scan_work+0x24f/0x380 [nvme_core]
      kernel:  process_one_work+0x1db/0x380
      kernel:  worker_thread+0x249/0x400
      kernel:  kthread+0x104/0x140
      
      kernel: Workqueue: nvme-wq nvme_ana_work [nvme_core]
      kernel: Call Trace:
      kernel:  __schedule+0x2b9/0x6c0
      kernel:  schedule+0x42/0xb0
      kernel:  schedule_preempt_disabled+0xe/0x10
      kernel:  __mutex_lock.isra.0+0x182/0x4f0
      kernel:  ? __switch_to_asm+0x34/0x70
      kernel:  ? select_task_rq_fair+0x1aa/0x5c0
      kernel:  ? kvm_sched_clock_read+0x11/0x20
      kernel:  ? sched_clock+0x9/0x10
      kernel:  __mutex_lock_slowpath+0x13/0x20
      kernel:  mutex_lock+0x2e/0x40
      kernel:  nvme_read_ana_log+0x3a/0x100 [nvme_core]
      kernel:  nvme_ana_work+0x15/0x20 [nvme_core]
      kernel:  process_one_work+0x1db/0x380
      kernel:  worker_thread+0x4d/0x400
      kernel:  kthread+0x104/0x140
      kernel:  ? process_one_work+0x380/0x380
      kernel:  ? kthread_park+0x80/0x80
      kernel:  ret_from_fork+0x35/0x40
      
      Fixes: 0d0b660f ("nvme: add ANA support")
      Signed-off-by: NAnton Eidelman <anton@lightbitslabs.com>
      Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      489dd102