1. 05 Sep, 2017 8 commits
  2. 04 Sep, 2017 8 commits
    • net: dsa: loop: Do not unregister invalid fixed PHY · 6d9c153a
      Florian Fainelli committed
      During error injection it was possible to crash in dsa_loop_exit() because of
      an attempt to unregister an invalid PHY. We actually want the driver probing
      in dsa_loop_init() to continue even though fixed_phy_register() may return an
      error, to exercise how DSA deals with such cases, but we should not be crashing
      during driver removal.
      
      Fixes: 98cd1552 ("net: dsa: Mock-up driver")
      Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6d9c153a
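The guard described in this commit can be sketched as a small userspace model. This is not the driver code: `ERR_PTR()` and `IS_ERR()` below are simplified stand-ins for the kernel helpers, and `mock_fixed_phy_register()`, `mock_fixed_phy_unregister()`, `loop_init()` and `loop_exit()` are hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-ins for the kernel's ERR_PTR()/IS_ERR() helpers. */
#define MAX_ERRNO 4095
static inline void *ERR_PTR(long err) { return (void *)err; }
static inline bool IS_ERR(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-MAX_ERRNO;
}

struct phy { int registered; };

static struct phy the_phy;
static struct phy *saved_phy;
static int unregister_calls;

/* Hypothetical registration helper; fails when 'inject_error' is set. */
static struct phy *mock_fixed_phy_register(bool inject_error)
{
	if (inject_error)
		return ERR_PTR(-ENOMEM);
	the_phy.registered = 1;
	return &the_phy;
}

static void mock_fixed_phy_unregister(struct phy *p)
{
	assert(!IS_ERR(p));	/* models the crash on an invalid pointer */
	p->registered = 0;
	unregister_calls++;
}

/* Init keeps going on failure, but remembers the possibly invalid result. */
static void loop_init(bool inject_error)
{
	saved_phy = mock_fixed_phy_register(inject_error);
}

/* Exit only unregisters a PHY that was actually registered. */
static void loop_exit(void)
{
	if (!IS_ERR(saved_phy))
		mock_fixed_phy_unregister(saved_phy);
}
```

Calling `loop_init(true)` followed by `loop_exit()` leaves `unregister_calls` at zero, which is the behaviour the commit message asks for on the removal path.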
    • Merge branch 'l2tp-session-creation-fixes' · 443cb3a3
      David S. Miller committed
      Guillaume Nault says:
      
      ====================
      l2tp: session creation fixes
      
      The session creation process has a few issues wrt. concurrent tunnel
      deletion.
      
      Patch #1 avoids creating sessions in tunnels that are getting removed.
      This prevents races where sessions could try to take tunnel resources
      that were already released.
      
      Patch #2 removes some racy l2tp_tunnel_find() calls in session creation
      callbacks. Together with patch #1 it ensures that sessions can only
      access tunnel resources that are guaranteed to remain valid during the
      session creation process.
      
      There are other problems with how sessions are created: pseudo-wire
      specific data are set after the session is added to the tunnel. So
      the session can be used, or deleted, before it has been completely
      initialised. Separating session allocation from session registration
      would be necessary, but we'd still have circular dependencies
      preventing race-free registration. I'll consider this issue in future
      series.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      443cb3a3
    • l2tp: pass tunnel pointer to ->session_create() · f026bc29
      Guillaume Nault committed
      Using l2tp_tunnel_find() in pppol2tp_session_create() and
      l2tp_eth_create() is racy, because no reference is held on the
      returned tunnel. These functions are only used to implement the
      ->session_create callback which is run by l2tp_nl_cmd_session_create().
      Therefore searching for the parent tunnel isn't necessary because
      l2tp_nl_cmd_session_create() already has a pointer to it and holds a
      reference.
      
      This patch modifies ->session_create()'s prototype to directly pass
      the parent tunnel as parameter, thus avoiding searching for it in
      pppol2tp_session_create() and l2tp_eth_create().
      
      Since we have to touch the ->session_create() call in
      l2tp_nl_cmd_session_create(), let's also remove the useless conditional:
      we know that ->session_create isn't NULL at this point because it's
      already been checked earlier in this same function.
      
      Finally, one might be tempted to think that the removed
      l2tp_tunnel_find() calls were harmless because they would return the
      same tunnel as the one held by l2tp_nl_cmd_session_create() anyway.
      But that tunnel might be removed and a new one created with the same
      tunnel ID before the l2tp_tunnel_find() call. In this case l2tp_tunnel_find()
      would return the new tunnel which wouldn't be protected by the
      reference held by l2tp_nl_cmd_session_create().
      
      Fixes: 309795f4 ("l2tp: Add netlink control API for L2TP")
      Fixes: d9e31d17 ("l2tp: Add L2TP ethernet pseudowire support")
      Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f026bc29
    • l2tp: prevent creation of sessions on terminated tunnels · f3c66d4e
      Guillaume Nault committed
      l2tp_tunnel_destruct() sets tunnel->sock to NULL, then removes the
      tunnel from the pernet list and finally closes all its sessions.
      Therefore, it's possible to add a session to a tunnel that is still
      reachable, but for which tunnel->sock has already been reset. This can
      make l2tp_session_create() dereference a NULL pointer when calling
      sock_hold(tunnel->sock).
      
      This patch adds the .acpt_newsess field to struct l2tp_tunnel, which is
      used by l2tp_tunnel_closeall() to prevent addition of new sessions to
      tunnels. Resetting tunnel->sock is done after l2tp_tunnel_closeall()
      returned, so that l2tp_session_add_to_tunnel() can safely take a
      reference on it when .acpt_newsess is true.
      
      The .acpt_newsess field is modified in l2tp_tunnel_closeall(), rather
      than in l2tp_tunnel_destruct(), so that it benefits all tunnel removal
      mechanisms. E.g. on UDP tunnels, a session could be added to a tunnel
      after l2tp_udp_encap_destroy() proceeded. This would prevent the tunnel
      from being removed because of the references held by this new session
      on the tunnel and its socket. Even though the session could be removed
      manually later on, this defeats the purpose of
      commit 9980d001 ("l2tp: add udp encap socket destroy handler").
      
      Fixes: fd558d18 ("l2tp: Split pppol2tp patch into separate l2tp and ppp parts")
      Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f3c66d4e
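The gating behaviour of the .acpt_newsess field can be modelled with a minimal single-threaded sketch. Everything except the field name is an assumption: `struct tunnel`, `session_add_to_tunnel()`, `tunnel_closeall()` and the `-ENODEV` return value are illustrative, and the locking that the real code needs around the flag check is only mentioned in comments.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

struct tunnel {
	bool acpt_newsess;	/* false once teardown has started */
	int sock_refs;		/* stands in for references on tunnel->sock */
	int nr_sessions;
};

static void tunnel_init(struct tunnel *t)
{
	t->acpt_newsess = true;
	t->sock_refs = 1;	/* the tunnel's own reference */
	t->nr_sessions = 0;
}

/*
 * Returns 0 on success, or -ENODEV (an assumed error code) when the
 * tunnel no longer accepts sessions. Real code would perform the check
 * and the insertion under the same lock that tunnel_closeall() takes.
 */
static int session_add_to_tunnel(struct tunnel *t)
{
	if (!t->acpt_newsess)
		return -ENODEV;
	t->sock_refs++;		/* safe: the socket is still valid here */
	t->nr_sessions++;
	return 0;
}

/* Refuse new sessions first, then drop the existing ones. */
static void tunnel_closeall(struct tunnel *t)
{
	t->acpt_newsess = false;
	t->sock_refs -= t->nr_sessions;
	t->nr_sessions = 0;
}
```

Because the flag is cleared before the sessions are dropped, a late `session_add_to_tunnel()` call cannot take a new socket reference and so cannot keep the tunnel alive after teardown has begun.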
    • Merge branch 'net-revert-lib-percpu_counter-API-for-fragmentation-mem-accounting' · 4113f36b
      David S. Miller committed
      Jesper Dangaard Brouer says:
      
      ====================
      net: revert lib/percpu_counter API for fragmentation mem accounting
      
      There is a bug in the fragmentation code's use of the percpu_counter API
      that can cause issues on systems with many CPUs (above 24 CPUs).
      
      After much consideration and several attempts at fixing the API usage,
      the conclusion is to revert to the simple atomic_t API instead.
      
      The ratio between batch size and threshold size makes this a bad use case
      for the lib/percpu_counter API: using the correct API calls would
      unfortunately cause systems with many CPUs to always execute an
      expensive sum across all CPUs. The added complexity is also not worth it.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4113f36b
    • Revert "net: fix percpu memory leaks" · 5a63643e
      Jesper Dangaard Brouer committed
      This reverts commit 1d6119ba.
      
      After reverting commit 6d7b857d ("net: use lib/percpu_counter API
      for fragmentation mem accounting") there is no need for this
      fix-up patch.  As percpu_counter is no longer used, it can no
      longer leak memory.
      
      Fixes: 6d7b857d ("net: use lib/percpu_counter API for fragmentation mem accounting")
      Fixes: 1d6119ba ("net: fix percpu memory leaks")
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5a63643e
    • Revert "net: use lib/percpu_counter API for fragmentation mem accounting" · fb452a1a
      Jesper Dangaard Brouer committed
      This reverts commit 6d7b857d.
      
      There is a bug in the fragmentation code's use of the percpu_counter API
      that can cause issues on systems with many CPUs.
      
      The frag_mem_limit() just reads the global counter (fbc->count),
      without considering that other CPUs can each have up to batch size (130K)
      that hasn't been subtracted yet.  Due to the 3MBytes lower thresh limit,
      this becomes dangerous at >=24 CPUs (3*1024*1024/130000=24).
      
      The correct API usage would be to use __percpu_counter_compare() which
      does the right thing, and takes into account the number of (online)
      CPUs and batch size, to account for this and call __percpu_counter_sum()
      when needed.
      
      We choose to revert the use of the lib/percpu_counter API for frag
      memory accounting for several reasons:
      
      1) On systems with more than 24 CPUs, the heavier fully locked
         __percpu_counter_sum() is always invoked, which will be more
         expensive than the atomic_t that is reverted to.
      
      Given systems with more than 24 CPUs are becoming common this doesn't
      seem like a good option.  To mitigate this, the batch size could be
      decreased and thresh be increased.
      
      2) The add_frag_mem_limit+sub_frag_mem_limit pairs happen on the RX
         CPU, before SKBs are pushed into sockets on remote CPUs.  Given
         NICs can only hash on L2 part of the IP-header, the NIC-RXq's will
         likely be limited.  Thus, there is a fair chance that the atomic
         add+dec happen on the same CPU.
      
      Revert note: commit 1d6119ba ("net: fix percpu memory leaks")
      removed init_frag_mem_limit() and used inet_frags_init_net() instead.
      After this revert, inet_frags_uninit_net() becomes empty.
      
      Fixes: 6d7b857d ("net: use lib/percpu_counter API for fragmentation mem accounting")
      Fixes: 1d6119ba ("net: fix percpu memory leaks")
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Acked-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      fb452a1a
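The arithmetic quoted in this commit message can be checked with a few lines of C. The two constants come from the message itself (3 MBytes lower threshold, batch size of 130000); the macro and function names are hypothetical, and the model assumes the worst case where every CPU holds a full batch that has not yet been folded into the global count.

```c
#include <assert.h>

/* Values quoted in the commit message. */
#define FRAG_LOW_THRESH	(3L * 1024 * 1024)	/* 3 MBytes lower threshold */
#define PERCPU_BATCH	130000L			/* per-CPU batch size ("130K") */

/* Worst-case amount that the global counter may not reflect yet. */
static long max_unaccounted(long ncpus)
{
	return ncpus * PERCPU_BATCH;
}

/* Integer quotient used in the commit message: 3*1024*1024/130000. */
static long cpus_quotient(void)
{
	return FRAG_LOW_THRESH / PERCPU_BATCH;
}
```

Under these assumptions the quotient is 24. With 24 CPUs the worst-case drift is 3,120,000 bytes, just under the 3,145,728-byte threshold, and with 25 CPUs it exceeds it, so reading only the global counter stops being meaningful at roughly that CPU count.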
    • ipv4: Don't override return code from ip_route_input_noref() · 64327fc8
      Stefano Brivio committed
      After ip_route_input() calls ip_route_input_noref(), another
      check on skb_dst() is done, but if this fails, we shouldn't
      override the return code from ip_route_input_noref(), as it
      could have been more specific (e.g. -EHOSTUNREACH).
      
      This also saves one call to skb_dst_force_safe() and one to
      skb_dst() in case the ip_route_input_noref() check fails.
      
      Reported-by: Sabrina Dubroca <sdubroca@redhat.com>
      Fixes: 9df16efa ("ipv4: call dst_hold_safe() properly")
      Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
      Acked-by: Wei Wang <weiwan@google.com>
      Acked-by: Sabrina Dubroca <sd@queasysnail.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      64327fc8
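The control flow this commit describes can be illustrated with a small model. The types and the `mock_*` helpers below are hypothetical stand-ins, not the kernel's `sk_buff` API; the sketch only shows how the second check is skipped, and the original error preserved, when the lookup itself fails.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct dst { int refs; };
struct skb { struct dst *dst; };

/* Hypothetical lookup: attaches 'found' and returns 0, or returns 'err'. */
static int mock_route_input_noref(struct skb *skb, struct dst *found, int err)
{
	if (err)
		return err;
	skb->dst = found;
	return 0;
}

/* Hypothetical reference grab; returns 0 when no dst is attached. */
static int mock_dst_force(struct skb *skb)
{
	if (!skb->dst)
		return 0;
	skb->dst->refs++;
	return 1;
}

static int route_input(struct skb *skb, struct dst *found, int lookup_err)
{
	int err = mock_route_input_noref(skb, found, lookup_err);

	if (!err) {
		/* Only reached when the lookup succeeded. */
		if (!mock_dst_force(skb))
			err = -EINVAL;
	}
	/* A lookup error such as -EHOSTUNREACH is passed through as-is. */
	return err;
}
```

In this model a failed lookup never reaches the reference-taking step, which corresponds to the saved calls mentioned in the commit message.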
  3. 02 Sep, 2017 11 commits
    • epoll: fix race between ep_poll_callback(POLLFREE) and ep_free()/ep_remove() · 138e4ad6
      Oleg Nesterov committed
      The race was introduced by me in commit 971316f0 ("epoll:
      ep_unregister_pollwait() can use the freed pwq->whead").  I did not
      realize that nothing can protect eventpoll after ep_poll_callback() sets
      ->whead = NULL, only whead->lock can save us from the race with
      ep_free() or ep_remove().
      
      Move ->whead = NULL to the end of ep_poll_callback() and add the
      necessary barriers.
      
      TODO: cleanup the ewake/EPOLLEXCLUSIVE logic, it was confusing even
      before this patch.
      
      Hopefully this explains the use-after-free reported by syzkaller:
      
      	BUG: KASAN: use-after-free in debug_spin_lock_before
      	...
      	 _raw_spin_lock_irqsave+0x4a/0x60 kernel/locking/spinlock.c:159
      	 ep_poll_callback+0x29f/0xff0 fs/eventpoll.c:1148
      
      this is spin_lock(eventpoll->lock),
      
      	...
      	Freed by task 17774:
      	...
      	 kfree+0xe8/0x2c0 mm/slub.c:3883
      	 ep_free+0x22c/0x2a0 fs/eventpoll.c:865
      
      Fixes: 971316f0 ("epoll: ep_unregister_pollwait() can use the freed pwq->whead")
      Reported-by: 范龙飞 <long7573@126.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      138e4ad6
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · 8cf9f2a2
      Linus Torvalds committed
      Pull networking fixes from David Miller:
      
       1) Fix handling of pinned BPF map nodes in hash of maps, from Daniel
          Borkmann.
      
       2) IPSEC ESP error paths leak memory, from Steffen Klassert.
      
       3) We need an RCU grace period before freeing fib6_node objects, from
          Wei Wang.
      
       4) Must check skb_put_padto() return value in HSR driver, from Florian
          Fainelli.
      
       5) Fix oops on PHY probe failure in ftgmac100 driver, from Andrew
          Jeffery.
      
       6) Fix infinite loop in UDP queue when using SO_PEEK_OFF, from Eric
          Dumazet.
      
       7) Use after free when tcf_chain_destroy() called multiple times, from
          Jiri Pirko.
      
       8) Fix KSZ DSA tag layer multiple free of SKBs, from Florian Fainelli.
      
       9) Fix leak of uninitialized memory in sctp_get_sctp_info(),
          inet_diag_msg_sctpladdrs_fill() and inet_diag_msg_sctpaddrs_fill().
          From Stefano Brivio.
      
      10) L2TP tunnel refcount fixes from Guillaume Nault.
      
      11) Don't leak UDP secpath in udp_set_dev_scratch(), from Yossi
          Kuperman.
      
      12) Revert a PHY layer change wrt. handling of PHY_HALTED state in
          phy_stop_machine(), it causes regressions for multiple people. From
          Florian Fainelli.
      
      13) When packets are sent out of br0 we have to clear the
          offload_fwd_mark value.
      
      14) Several NULL pointer deref fixes in packet schedulers when their
          ->init() routine fails. From Nikolay Aleksandrov.
      
      15) Aquantia devices cannot checksum offload correctly when the packet
          is <= 60 bytes. From Pavel Belous.
      
      16) Fix vnet header access past end of buffer in AF_PACKET, from
          Benjamin Poirier.
      
      17) Double free in probe error paths of nfp driver, from Dan Carpenter.
      
      18) QOS capability not checked properly in DCB init paths of mlx5
          driver, from Huy Nguyen.
      
      19) Fix conflicts between firmware load failure and health_care timer in
          mlx5, also from Huy Nguyen.
      
      20) Fix dangling page pointer when DMA mapping errors occur in mlx5,
          from Eran Ben Elisha.
      
      21) ->ndo_setup_tc() in bnxt_en driver doesn't count rings properly,
          from Michael Chan.
      
      22) Missing MSIX vector free in bnxt_en, also from Michael Chan.
      
      23) Refcount leak in xfrm layer when using sk_policy, from Lorenzo
          Colitti.
      
      24) Fix copy of uninitialized data in qlge driver, from Arnd Bergmann.
      
      25) bpf_setsockopts() erroneously always returns -EINVAL even on
          success. Fix from Yuchung Cheng.
      
      26) tipc_rcv() needs to linearize the SKB before parsing the inner
          headers, from Parthasarathy Bhuvaragan.
      
      27) Fix deadlock between link status updates and link removal in netvsc
          driver, from Stephen Hemminger.
      
      28) Missed locking of page fragment handling in ESP output, from Steffen
          Klassert.
      
      29) Fix refcnt leak in ebpf congestion control code, from Sabrina
          Dubroca.
      
      30) sxgbe_probe_config_dt() doesn't check devm_kzalloc()'s return value,
          from Christophe Jaillet.
      
      31) Fix missing ipv6 rx_dst_cookie update when rx_dst is updated during
          early demux, from Paolo Abeni.
      
      32) Several info leaks in xfrm_user layer, from Mathias Krause.
      
      33) Fix out of bounds read in cxgb4 driver, from Stefano Brivio.
      
      34) Properly propagate obsolete state of route upwards in ipv6 so that
          upper holders like xfrm can see it. From Xin Long.
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (118 commits)
        udp: fix secpath leak
        bridge: switchdev: Clear forward mark when transmitting packet
        mlxsw: spectrum: Forbid linking to devices that have uppers
        wl1251: add a missing spin_lock_init()
        Revert "net: phy: Correctly process PHY_HALTED in phy_stop_machine()"
        net: dsa: bcm_sf2: Fix number of CFP entries for BCM7278
        kcm: do not attach PF_KCM sockets to avoid deadlock
        sch_tbf: fix two null pointer dereferences on init failure
        sch_sfq: fix null pointer dereference on init failure
        sch_netem: avoid null pointer deref on init failure
        sch_fq_codel: avoid double free on init failure
        sch_cbq: fix null pointer dereferences on init failure
        sch_hfsc: fix null pointer deref and double free on init failure
        sch_hhf: fix null pointer dereference on init failure
        sch_multiq: fix double free on init failure
        sch_htb: fix crash on init failure
        net/mlx5e: Fix CQ moderation mode not set properly
        net/mlx5e: Fix inline header size for small packets
        net/mlx5: E-Switch, Unload the representors in the correct order
        net/mlx5e: Properly resolve TC offloaded ipv6 vxlan tunnel source address
        ...
      8cf9f2a2
    • Merge tag 'ceph-for-4.13-rc8' of git://github.com/ceph/ceph-client · b8a78bb4
      Linus Torvalds committed
      Pull ceph fix from Ilya Dryomov:
       "ceph fscache page locking fix from Zheng, marked for stable"
      
      * tag 'ceph-for-4.13-rc8' of git://github.com/ceph/ceph-client:
        ceph: fix readpage from fscache
      b8a78bb4
    • Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input · 3e1d79c8
      Linus Torvalds committed
      Pull input fixes from Dmitry Torokhov:
       "Just a couple of driver fixes (Synaptics PS/2, Xpad)"
      
      * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
        Input: xpad - fix PowerA init quirk for some gamepad models
        Input: synaptics - fix device info appearing different on reconnect
      3e1d79c8
    • Merge tag 'mmc-v4.13-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc · d7e44b86
      Linus Torvalds committed
      Pull two more MMC fixes from Ulf Hansson:
       "MMC core:
         - Fix block status codes
      
        MMC host:
         - sdhci-xenon: Fix SD bus voltage select"
      
      * tag 'mmc-v4.13-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc:
        mmc: sdhci-xenon: add set_power callback
        mmc: block: Fix block status codes
      d7e44b86
    • Merge tag 'sound-4.13-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound · 381cce59
      Linus Torvalds committed
      Pull sound fixes from Takashi Iwai:
       "Three regression fixes that should be addressed before the final
        release: a missing mutex call in OSS PCM emulation ioctl, ASoC rt5670
        headset detection breakage, and a regression in simple-card parser
        code"
      
      * tag 'sound-4.13-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
        ASoC: simple_card_utils: fix fallback when "label" property isn't present
        ALSA: pcm: Fix power lock unbalance via OSS emulation
        ASoC: rt5670: Fix GPIO headset detection regression
      381cce59
    • Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux · bba2a5b8
      Linus Torvalds committed
      Pull s390 fixes from Martin Schwidefsky:
       "Three more bug fixes for v4.13.
      
        The two memory management related fixes are quite new, they fix kernel
        crashes that can be triggered by user space.
      
        The third commit fixes a bug in the vfio ccw translation code"
      
      * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
        s390/mm: fix BUG_ON in crst_table_upgrade
        s390/mm: fork vs. 5 level page tabel
        vfio: ccw: fix bad ptr math for TIC cda translation
      bba2a5b8
    • Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 · a1c516a6
      Linus Torvalds committed
      Pull crypto fixes from Herbert Xu:
       "This fixes the following issues:
      
         - Regression in chacha20 handling of chunked input
      
         - Crash in algif_skcipher when used with async io
      
         - Potential bogus pointer dereference in lib/mpi"
      
      * 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
        crypto: algif_skcipher - only call put_page on referenced and used pages
        crypto: testmgr - add chunked test cases for chacha20
        crypto: chacha20 - fix handling of chunked input
        lib/mpi: kunmap after finishing accessing buffer
      a1c516a6
    • udp: fix secpath leak · e8a732d1
      Yossi Kuperman committed
      After commit dce4551c ("udp: preserve head state for IP_CMSG_PASSSEC")
      we preserve the secpath for the whole skb lifecycle, but we also
      end up leaking a reference to it.
      
      We must clear the head state on skb reception, if secpath is
      present.
      
      Fixes: dce4551c ("udp: preserve head state for IP_CMSG_PASSSEC")
      Signed-off-by: Yossi Kuperman <yossiku@mellanox.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e8a732d1
    • bridge: switchdev: Clear forward mark when transmitting packet · 79e99bdd
      Ido Schimmel committed
      Commit 6bc506b4 ("bridge: switchdev: Add forward mark support for
      stacked devices") added the 'offload_fwd_mark' bit to the skb in order
      to allow drivers to indicate to the bridge driver that they already
      forwarded the packet in L2.
      
      In case the bit is set, before transmitting the packet from each port,
      the port's mark is compared with the mark stored in the skb's control
      block. If both marks are equal, we know the packet arrived from a switch
      device that already forwarded the packet and it's not re-transmitted.
      
      However, if the packet is transmitted from the bridge device itself
      (e.g., br0), we should clear the 'offload_fwd_mark' bit as the mark
      stored in the skb's control block isn't valid.
      
      This scenario can happen in rare cases where a packet was trapped during
      L3 forwarding and forwarded by the kernel to a bridge device.
      
      Fixes: 6bc506b4 ("bridge: switchdev: Add forward mark support for stacked devices")
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Reported-by: Yotam Gigi <yotamg@mellanox.com>
      Tested-by: Yotam Gigi <yotamg@mellanox.com>
      Reviewed-by: Jiri Pirko <jiri@mellanox.com>
      Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      79e99bdd
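The egress decision this commit describes can be sketched as follows. Only the 'offload_fwd_mark' name is taken from the message; `struct pkt`, `cb_mark`, `port_may_transmit()` and `bridge_dev_xmit_prepare()` are hypothetical and merely model the comparison between the port's mark and the mark stored in the control block.

```c
#include <assert.h>
#include <stdbool.h>

struct pkt {
	bool offload_fwd_mark;	/* set by a driver that already forwarded in L2 */
	int cb_mark;		/* mark stored in the control block at ingress */
};

/*
 * A port may transmit unless the bit is set and the packet's mark equals
 * the port's mark, i.e. the switch device already forwarded the packet.
 */
static bool port_may_transmit(const struct pkt *p, int port_mark)
{
	return !p->offload_fwd_mark || p->cb_mark != port_mark;
}

/*
 * Packets sent from the bridge device itself carry no valid ingress mark,
 * so the bit is cleared before the per-port checks run.
 */
static void bridge_dev_xmit_prepare(struct pkt *p)
{
	p->offload_fwd_mark = false;
}
```

Without the clearing step, a stale `cb_mark` that happens to equal a port's mark would wrongly suppress transmission on that port.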
    • mlxsw: spectrum: Forbid linking to devices that have uppers · 25cc72a3
      Ido Schimmel committed
      The mlxsw driver relies on NETDEV_CHANGEUPPER events to configure the
      device in case a port is enslaved to a master netdev such as bridge or
      bond.
      
      Since the driver ignores events unrelated to its ports and their
      uppers, it's possible to engineer situations in which the device's data
      path differs from the kernel's.
      
      One example of such a situation is when a port is enslaved to a bond
      that is already enslaved to a bridge. When the bond was enslaved the
      driver ignored the event - as the bond wasn't one of its uppers - and
      therefore a bridge port instance isn't created in the device.
      
      Until such configurations are supported, forbid them by checking that the
      upper device doesn't have uppers of its own.
      
      Fixes: 0d65fc13 ("mlxsw: spectrum: Implement LAG port join/leave")
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Reported-by: Nogah Frankel <nogahf@mellanox.com>
      Tested-by: Nogah Frankel <nogahf@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      25cc72a3
  4. 01 Sep, 2017 13 commits
    • Merge tag 'cifs-fixes-for-4.13-rc7-and-stable' of git://git.samba.org/sfrench/cifs-2.6 · e89ce1f8
      Linus Torvalds committed
      Pull cifs fixes from Steve French:
       "Two cifs bug fixes for stable"
      
      * tag 'cifs-fixes-for-4.13-rc7-and-stable' of git://git.samba.org/sfrench/cifs-2.6:
        CIFS: remove endian related sparse warning
        CIFS: Fix maximum SMB2 header size
      e89ce1f8
    • Merge branch 'for-linus' of git://git.kernel.dk/linux-block · 501d9f79
      Linus Torvalds committed
      Pull block fixes from Jens Axboe:
       "Unfortunately a few issues that warrant sending another pull request,
        even if I had hoped to avoid it. This contains:
      
         - A fix for multiqueue xen-blkback, on tear down / disconnect.
      
         - A few fixups for NVMe, including a wrong bit definition, fix for
           host memory buffers, and an nvme rdma page size fix"
      
      * 'for-linus' of git://git.kernel.dk/linux-block:
        nvme: fix the definition of the doorbell buffer config support bit
        nvme-pci: use dma memory for the host memory buffer descriptors
        nvme-rdma: default MR page size to 4k
        xen-blkback: stop blkback thread of every queue in xen_blkif_disconnect
      501d9f79
    • Merge tag 'for-4.13/dm-fixes-2' of... · 73adb8c5
      Linus Torvalds committed
      Merge tag 'for-4.13/dm-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
      
      Pull device mapper fixes from Mike Snitzer:
      
       - A couple fixes for bugs introduced as part of the blk_status_t block
         layer changes during the 4.13 merge window
      
       - A printk throttling fix to use discrete rate limiting state for each
         DM log level
      
       - A stable@ fix for DM multipath that delays request requeueing to
         avoid CPU lockup if/when the request queue is "dying"
      
      * tag 'for-4.13/dm-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
        dm mpath: do not lock up a CPU with requeuing activity
        dm: fix printk() rate limiting code
        dm mpath: retry BLK_STS_RESOURCE errors
        dm: fix the second dec_pending() argument in __split_and_process_bio()
      73adb8c5
    • Merge branch 'akpm' (patches from Andrew) · 1b2614f1
      Linus Torvalds committed
      Merge more fixes from Andrew Morton:
       "6 fixes"
      
      * emailed patches from Andrew Morton <akpm@linux-foundation.org>:
        scripts/dtc: fix '%zx' warning
        include/linux/compiler.h: don't perform compiletime_assert with -O0
        mm, madvise: ensure poisoned pages are removed from per-cpu lists
        mm, uprobes: fix multiple free of ->uprobes_state.xol_area
        kernel/kthread.c: kthread_worker: don't hog the cpu
        mm,page_alloc: don't call __node_reclaim() with oom_lock held.
      1b2614f1
    • Merge branch 'mmu_notifier_fixes' · ea25c431
      Linus Torvalds committed
      Merge mmu_notifier fixes from Jérôme Glisse:
       "The invalidate_page callback suffered from two pitfalls. First, it used
        to happen after the page table lock was released, and thus a new page
        might have been set up for the virtual address before the call to
        invalidate_page().
      
        This was fixed, in a roundabout way, by commit c7ab0d2f ("mm: convert
        try_to_unmap_one() to use page_vma_mapped_walk()"), which moved the
        callback under the page table lock. That also broke several existing
        users of the mmu_notifier API that assumed they could sleep inside this
        callback.
      
        The second pitfall was that invalidate_page was the only callback that
        did not take an address range for invalidation, but was given an
        address and a page instead. Many callback implementers assumed this
        could never be a THP and thus failed to invalidate the appropriate
        range for THP pages.
      
        By killing this callback we unify the mmu_notifier callback API to
        always take a virtual address range as input.
      
        There are now two clear APIs (I am not mentioning the youngness API,
        which is seldom used):
      
         - invalidate_range_start()/end() callbacks (which allow you to sleep)
      
         - invalidate_range(), where you cannot sleep but which happens right
           after the page table update, under the page table lock
      
        Note that a lot of existing users look broken with respect to
        range_start/range_end. Many users only have a range_start() callback,
        but nothing prevents them from undoing what was invalidated in their
        range_start() callback after it returns but before any CPU page table
        update takes place.
      
        The code pattern used in kvm or umem odp is an example of how to
        properly avoid such a race. In a nutshell, use some kind of sequence
        number and an active range invalidation counter to block anything that
        might undo what the range_start() callback did.
      
        If you do not care about keeping fully in sync with the CPU page table
        (i.e. you can live with the CPU page table pointing to a new, different
        page for a given virtual address), then you can take a reference on the
        pages inside the range_start callback and drop it in range_end or when
        your driver is done with those pages.
      
        The last alternative is to use invalidate_range() if you can do the
        invalidation without sleeping, as the invalidate_range() callback
        happens under the CPU page table spinlock right after the page table is
        updated.
      
        The first two patches convert existing mmu_notifier_invalidate_page()
        calls to mmu_notifier_invalidate_range() and bracket those calls with
        calls to mmu_notifier_invalidate_range_start()/end().
      
        The next ten patches remove the existing invalidate_page() callbacks,
        as they can no longer be invoked.
      
        Finally, the last patch removes the invalidate_page() callback
        completely so it can RIP.
      
        Changes since v1:
         - remove more dead code in kvm (no testing impact)
         - more accurate end address computation (patch 2) in page_mkclean_one
           and try_to_unmap_one
         - added tested-by/reviewed-by gotten so far"
      
      * emailed patches from Jérôme Glisse <jglisse@redhat.com>:
        mm/mmu_notifier: kill invalidate_page
        KVM: update to new mmu_notifier semantic v2
        xen/gntdev: update to new mmu_notifier semantic
        sgi-gru: update to new mmu_notifier semantic
        misc/mic/scif: update to new mmu_notifier semantic
        iommu/intel: update to new mmu_notifier semantic
        iommu/amd: update to new mmu_notifier semantic
        IB/hfi1: update to new mmu_notifier semantic
        IB/umem: update to new mmu_notifier semantic
        drm/amdgpu: update to new mmu_notifier semantic
        powerpc/powernv: update to new mmu_notifier semantic
        mm/rmap: update to new mmu_notifier semantic v2
        dax: update to new mmu_notifier semantic
      ea25c431
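The "sequence number plus active invalidation counter" pattern mentioned in the pull message can be sketched in a single-threaded model. All names here (`struct inval_state`, `range_start()`, `range_end()`, `mapping_begin()`, `mapping_must_retry()`) are hypothetical, and the locking and memory barriers that a real implementation needs are deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bookkeeping that a secondary-MMU user might keep. */
struct inval_state {
	unsigned long seq;	/* bumped each time an invalidation completes */
	int active;		/* invalidations currently in flight */
};

/* Model of the range_start() callback. */
static void range_start(struct inval_state *s)
{
	s->active++;
}

/* Model of the range_end() callback. */
static void range_end(struct inval_state *s)
{
	s->seq++;
	s->active--;
}

/* Taken before the slow work of building a new secondary mapping. */
static unsigned long mapping_begin(const struct inval_state *s)
{
	return s->seq;
}

/*
 * Checked just before installing the mapping: retry if an invalidation
 * is in flight or one completed since the snapshot was taken.
 */
static bool mapping_must_retry(const struct inval_state *s, unsigned long snap)
{
	return s->active > 0 || s->seq != snap;
}
```

The point of the two fields is that a mapping built from stale data is discarded both while an invalidation is running and after one has finished, so nothing can undo what the range_start() callback did.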
    • jfs should use MAX_LFS_FILESIZE when calculating s_maxbytes · c227390c
      Dave Kleikamp committed
      jfs had previously avoided the use of MAX_LFS_FILESIZE because it hadn't
      accounted for the whole 32-bit index range on 32-bit systems.  That has
      been fixed by commit 0cc3b0ec ("Clarify (and fix) MAX_LFS_FILESIZE
      macros"), so we can simplify the code now.
      
      Suggested by Andreas Dilger.
      Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
      Reviewed-by: Andreas Dilger <adilger@dilger.ca>
      Cc: jfs-discussion@lists.sourceforge.net
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c227390c
    • scripts/dtc: fix '%zx' warning · e6618692
      Russell King committed
      dtc uses an incorrect format specifier for printing a uint64_t value.
      uint64_t may be either 'unsigned long' or 'unsigned long long' depending
      on the host architecture.
      
      Fix this by using %llx and casting to unsigned long long, which ensures
      that we always have a wide enough variable to print 64 bits of hex.
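
      The fix described above can be sketched in isolation as follows.
      This is only an illustration: the helper name format_unit_addr()
      is hypothetical and not taken from dtc; only the "%llx" plus cast
      idiom comes from the description.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper (not from dtc): format a 64-bit value as hex.
 * uint64_t may be 'unsigned long' or 'unsigned long long' depending on
 * the host, so cast explicitly to match the "%llx" conversion.
 * Returns the number of characters snprintf() would have written.
 */
static int format_unit_addr(char *buf, size_t len, uint64_t reg)
{
    assert(buf != NULL);
    return snprintf(buf, len, "%llx", (unsigned long long)reg);
}
```

      The PRIx64 macro from <inttypes.h> is another portable way to
      print a uint64_t.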
      
          HOSTCC  scripts/dtc/checks.o
        scripts/dtc/checks.c: In function 'check_simple_bus_reg':
        scripts/dtc/checks.c:876:2: warning: format '%zx' expects argument of type 'size_t', but argument 4 has type 'uint64_t' [-Wformat=]
          snprintf(unit_addr, sizeof(unit_addr), "%zx", reg);
          ^
        scripts/dtc/checks.c:876:2: warning: format '%zx' expects argument of type 'size_t', but argument 4 has type 'uint64_t' [-Wformat=]
      
      Link: http://lkml.kernel.org/r/20170829222034.GJ20805@n2100.armlinux.org.uk
      Fixes: 828d4cdd ("dtc: check.c fix compile error")
      Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
      Cc: Rob Herring <robh+dt@kernel.org>
      Cc: Frank Rowand <frowand.list@gmail.com>
      Cc: Shuah Khan <shuahkh@osg.samsung.com>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Cc: Michal Marek <mmarek@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e6618692
    • J
      include/linux/compiler.h: don't perform compiletime_assert with -O0 · c03567a8
      Joe Stringer committed
      Commit c7acec71 ("kernel.h: handle pointers to arrays better in
      container_of()") made use of __compiletime_assert() from container_of()
      thus increasing the usage of this macro, allowing developers to notice
      type conflicts in usage of container_of() at compile time.
      
      However, the implementation of __compiletime_assert relies on compiler
      optimizations to report an error.  This means that if a developer uses
      "-O0" with any code that performs container_of(), the compiler will always
      report an error regardless of whether there is an actual problem in the
      code.
      
      This patch disables compiletime_assert when optimizations are disabled
      to allow such code to compile with CFLAGS="-O0".
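
      A rough userspace sketch of the idea, assuming GCC (or a compiler
      that supports the 'error' function attribute and defines
      __OPTIMIZE__ when optimizing).  The toy_* names are hypothetical
      and this is not the kernel's actual macro.

```c
#include <stddef.h>

/*
 * Toy version of an optimization-dependent compile-time assert.
 * With optimization on, a false condition leaves a call to a function
 * declared with the 'error' attribute, which fails the build.  With
 * -O0 (__OPTIMIZE__ undefined) the check compiles away to nothing.
 */
#ifdef __OPTIMIZE__
#define toy_compiletime_assert(cond, msg)                              \
    do {                                                               \
        extern void toy_compiletime_assert_failed(void)                \
            __attribute__((error(msg)));                               \
        if (!(cond))                                                   \
            toy_compiletime_assert_failed();                           \
    } while (0)
#else
#define toy_compiletime_assert(cond, msg) do { } while (0)
#endif

/* Example user: the condition is a compile-time constant. */
static size_t toy_checked_int_size(void)
{
    toy_compiletime_assert(sizeof(int) >= 2, "int is too small");
    return sizeof(int);
}
```

      The empty do { } while (0) branch keeps the macro usable as a
      single statement in both configurations.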
      
      Example compilation failure:
      
      ./include/linux/compiler.h:547:38: error: call to `__compiletime_assert_94' declared with attribute error: pointer type mismatch in container_of()
        _compiletime_assert(condition, msg, __compiletime_assert_, __LINE__)
                                            ^
      ./include/linux/compiler.h:530:4: note: in definition of macro `__compiletime_assert'
          prefix ## suffix();    \
          ^~~~~~
      ./include/linux/compiler.h:547:2: note: in expansion of macro `_compiletime_assert'
        _compiletime_assert(condition, msg, __compiletime_assert_, __LINE__)
        ^~~~~~~~~~~~~~~~~~~
      ./include/linux/build_bug.h:46:37: note: in expansion of macro `compiletime_assert'
       #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
                                           ^~~~~~~~~~~~~~~~~~
      ./include/linux/kernel.h:860:2: note: in expansion of macro `BUILD_BUG_ON_MSG'
        BUILD_BUG_ON_MSG(!__same_type(*(ptr), ((type *)0)->member) && \
        ^~~~~~~~~~~~~~~~
      
      [akpm@linux-foundation.org: use do{}while(0), per Michal]
      Link: http://lkml.kernel.org/r/20170829230114.11662-1-joe@ovn.org
      Fixes: c7acec71 ("kernel.h: handle pointers to arrays better in container_of()")
      Signed-off-by: Joe Stringer <joe@ovn.org>
      Cc: Ian Abbott <abbotti@mev.co.uk>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Kees Cook <keescook@chromium.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c03567a8
    • M
      mm, madvise: ensure poisoned pages are removed from per-cpu lists · c461ad6a
      Mel Gorman committed
      Wendy Wang reported off-list that a RAS HWPOISON-SOFT test case failed
      and bisected it to the commit 479f854a ("mm, page_alloc: defer
      debugging checks of pages allocated from the PCP").
      
      The problem is that a page that was poisoned with madvise() is reused.
      The commit removed a check that would trigger if DEBUG_VM was enabled
      but re-enabling the check only fixes the problem as a side-effect by
      printing a bad_page warning and recovering.
      
      The root of the problem is that an madvise() can leave a poisoned page
      on the per-cpu list.  This patch drains all per-cpu lists after pages
      are poisoned so that they will not be reused.  Wendy reports that the
      test case in question passes with this patch applied.  While this could
      be done in a targeted fashion, it is over-complicated for such a rare
      operation.
      
      Link: http://lkml.kernel.org/r/20170828133414.7qro57jbepdcyz5x@techsingularity.net
      Fixes: 479f854a ("mm, page_alloc: defer debugging checks of pages allocated from the PCP")
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Reported-by: Wang, Wendy <wendy.wang@intel.com>
      Tested-by: Wang, Wendy <wendy.wang@intel.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: "Hansen, Dave" <dave.hansen@intel.com>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c461ad6a
    • E
      mm, uprobes: fix multiple free of ->uprobes_state.xol_area · 355627f5
      Eric Biggers committed
      Commit 7c051267 ("mm, fork: make dup_mmap wait for mmap_sem for
      write killable") made it possible to kill a forking task while it is
      waiting to acquire its ->mmap_sem for write, in dup_mmap().
      
      However, it was overlooked that this introduced a new error path before
      the new mm_struct's ->uprobes_state.xol_area has been set to NULL after
      being copied from the old mm_struct by the memcpy in dup_mm().  For a
      task that has previously hit a uprobe tracepoint, this resulted in the
      'struct xol_area' being freed multiple times if the task was killed at
      just the right time while forking.
      
      Fix it by setting ->uprobes_state.xol_area to NULL in mm_init() rather
      than in uprobe_dup_mmap().
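
      The hazard and the fix can be modelled in userspace roughly as
      below.  All toy_* names are hypothetical stand-ins, not kernel
      code: the point is only that a pointer inherited via memcpy must
      be cleared before the first error path that can free it.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct toy_area { int slots; };

struct toy_mm {
    struct toy_area *xol_area;   /* owned pointer, must not be shared */
    int other_state;
};

/* Stand-in for mm_init(): clear owned pointers before anything can fail. */
static void toy_mm_init(struct toy_mm *mm)
{
    mm->xol_area = NULL;
}

/* Stand-in for mmput(): release whatever the mm owns. */
static void toy_mm_put(struct toy_mm *mm)
{
    free(mm->xol_area);
    mm->xol_area = NULL;
}

/*
 * Stand-in for dup_mm(): copy the parent, then initialise the child.
 * fail_early simulates the task being killed before duplication ends.
 */
static int toy_dup_mm(struct toy_mm *child, const struct toy_mm *parent,
                      int fail_early)
{
    assert(child != NULL && parent != NULL);
    memcpy(child, parent, sizeof(*child));  /* child aliases parent's area */
    toy_mm_init(child);                     /* alias dropped before errors */
    if (fail_early) {
        toy_mm_put(child);                  /* frees nothing of the parent */
        return -1;
    }
    return 0;
}
```

      If the pointer were cleared only after the early-failure check,
      toy_mm_put(child) would free the parent's area, and the parent
      would later free it again.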
      
      With CONFIG_UPROBE_EVENTS=y, the bug can be reproduced by the same C
      program given by commit 2b7e8665 ("fork: fix incorrect fput of
      ->exe_file causing use-after-free"), provided that a uprobe tracepoint
      has been set on the fork_thread() function.  For example:
      
          $ gcc reproducer.c -o reproducer -lpthread
          $ nm reproducer | grep fork_thread
          0000000000400719 t fork_thread
          $ echo "p $PWD/reproducer:0x719" > /sys/kernel/debug/tracing/uprobe_events
          $ echo 1 > /sys/kernel/debug/tracing/events/uprobes/enable
          $ ./reproducer
      
      Here is the use-after-free reported by KASAN:
      
          BUG: KASAN: use-after-free in uprobe_clear_state+0x1c4/0x200
          Read of size 8 at addr ffff8800320a8b88 by task reproducer/198
      
          CPU: 1 PID: 198 Comm: reproducer Not tainted 4.13.0-rc7-00015-g36fde05f #255
          Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-20170228_101828-anatol 04/01/2014
          Call Trace:
           dump_stack+0xdb/0x185
           print_address_description+0x7e/0x290
           kasan_report+0x23b/0x350
           __asan_report_load8_noabort+0x19/0x20
           uprobe_clear_state+0x1c4/0x200
           mmput+0xd6/0x360
           do_exit+0x740/0x1670
           do_group_exit+0x13f/0x380
           get_signal+0x597/0x17d0
           do_signal+0x99/0x1df0
           exit_to_usermode_loop+0x166/0x1e0
           syscall_return_slowpath+0x258/0x2c0
           entry_SYSCALL_64_fastpath+0xbc/0xbe
      
          ...
      
          Allocated by task 199:
           save_stack_trace+0x1b/0x20
           kasan_kmalloc+0xfc/0x180
           kmem_cache_alloc_trace+0xf3/0x330
           __create_xol_area+0x10f/0x780
           uprobe_notify_resume+0x1674/0x2210
           exit_to_usermode_loop+0x150/0x1e0
           prepare_exit_to_usermode+0x14b/0x180
           retint_user+0x8/0x20
      
          Freed by task 199:
           save_stack_trace+0x1b/0x20
           kasan_slab_free+0xa8/0x1a0
           kfree+0xba/0x210
           uprobe_clear_state+0x151/0x200
           mmput+0xd6/0x360
           copy_process.part.8+0x605f/0x65d0
           _do_fork+0x1a5/0xbd0
           SyS_clone+0x19/0x20
           do_syscall_64+0x22f/0x660
           return_from_SYSCALL_64+0x0/0x7a
      
      Note: without KASAN, you may instead see a "Bad page state" message, or
      simply a general protection fault.
      
      Link: http://lkml.kernel.org/r/20170830033303.17927-1-ebiggers3@gmail.com
      Fixes: 7c051267 ("mm, fork: make dup_mmap wait for mmap_sem for write killable")
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Reported-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <stable@vger.kernel.org>    [4.7+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      355627f5
    • S
      kernel/kthread.c: kthread_worker: don't hog the cpu · 22cf8bc6
      Shaohua Li committed
      If the worker thread keeps getting work, it hogs the CPU and the RCU
      stall detector complains.  Make it a good citizen.  This was triggered
      by a loop block device test.
      
      Link: http://lkml.kernel.org/r/5de0a179b3184e1a2183fc503448b0269f24d75b.1503697127.git.shli@fb.com
      Signed-off-by: Shaohua Li <shli@fb.com>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      22cf8bc6
    • T
      mm,page_alloc: don't call __node_reclaim() with oom_lock held. · e746bf73
      Tetsuo Handa committed
      We make a last-second memory allocation attempt before calling
      out_of_memory().  But slab shrinker functions might indirectly wait,
      via sleeping locks, for another thread's __GFP_DIRECT_RECLAIM &&
      !__GFP_NORETRY memory allocations, so calling them from node_reclaim()
      via get_page_from_freelist() with oom_lock held can deadlock.
      Therefore, make sure that the last-second allocation attempt does not
      call slab shrinker functions.
      
      Link: http://lkml.kernel.org/r/1503577106-9196-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
      Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e746bf73
    • J
      mm/mmu_notifier: kill invalidate_page · 5f32b265
      Jérôme Glisse committed
      The invalidate_page callback suffered from two pitfalls.  First, it
      used to happen after the page table lock was released, and thus a new
      page might have been set up before the call to invalidate_page()
      happened.
      
      This is in a weird way fixed by commit c7ab0d2f ("mm: convert
      try_to_unmap_one() to use page_vma_mapped_walk()") that moved the
      callback under the page table lock but this also broke several existing
      users of the mmu_notifier API that assumed they could sleep inside this
      callback.
      
      The second pitfall was that invalidate_page() was the only callback
      that did not take an address range for invalidation; instead it was
      given an address and a page.  Many callback implementers assumed this
      could never be a THP and thus failed to invalidate the appropriate
      range for THP.
      
      By killing this callback we unify the mmu_notifier callback API to
      always take a virtual address range as input.
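
      To illustrate why a range-based callback copes with huge pages, a
      toy model follows.  The toy_* names and the one-byte-per-page
      table are hypothetical; this is not the mmu_notifier API, only a
      sketch of invalidating [start, end) instead of a single page.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PAGE_SIZE 4096UL

/* Hypothetical secondary-MMU state: one valid flag per 4 KiB page. */
struct toy_secondary_mmu {
    unsigned char valid[1024];
};

/*
 * Range-style callback: drop every cached mapping in [start, end).
 * Returns how many mappings were actually dropped.
 */
static size_t toy_invalidate_range(struct toy_secondary_mmu *mmu,
                                   unsigned long start, unsigned long end)
{
    size_t dropped = 0;
    unsigned long addr;

    assert(mmu != NULL && start <= end);
    for (addr = start & ~(TOY_PAGE_SIZE - 1); addr < end;
         addr += TOY_PAGE_SIZE) {
        size_t idx = addr / TOY_PAGE_SIZE;

        if (idx < sizeof(mmu->valid) && mmu->valid[idx]) {
            mmu->valid[idx] = 0;
            dropped++;
        }
    }
    return dropped;
}
```

      A 2 MiB huge page is simply the range [start, start + 512 *
      TOY_PAGE_SIZE) here, so the callee never has to guess the page
      size from a single address.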
      
      Finally, this also simplifies the end user's life, as there are now
      two clear choices:
        - invalidate_range_start()/end() callbacks (which allow you to sleep)
        - invalidate_range(), where you cannot sleep but which happens right
          after the page table update, under the page table lock
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Cc: Bernhard Held <berny156@gmx.de>
      Cc: Adam Borowski <kilobyte@angband.pl>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Wanpeng Li <kernellwp@gmail.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Takashi Iwai <tiwai@suse.de>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: axie <axie@amd.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5f32b265