1. 07 2月, 2019 8 次提交
    • L
      net: ip_gre: always reports o_key to userspace · 897ea28b
      Lorenzo Bianconi 提交于
      [ Upstream commit feaf5c796b3f0240f10d0d6d0b686715fd58a05b ]
      
      Erspan protocol (version 1 and 2) relies on o_key to configure
      session id header field. However TUNNEL_KEY bit is cleared in
      erspan_xmit since ERSPAN protocol does not set the key field
      of the external GRE header and so the configured o_key is not reported
      to userspace. The issue can be triggered with the following reproducer:
      
      $ip link add erspan1 type erspan local 192.168.0.1 remote 192.168.0.2 \
          key 1 seq erspan_ver 1
      $ip link set erspan1 up
      $ip -d link sh erspan1
      
      erspan1@NONE: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc pfifo_fast state UNKNOWN mode DEFAULT
        link/ether 52:aa:99:95:9a:b5 brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 1500
        erspan remote 192.168.0.2 local 192.168.0.1 ttl inherit ikey 0.0.0.1 iseq oseq erspan_index 0
      
      Fix the issue adding TUNNEL_KEY bit to the o_flags parameter in
      ipgre_fill_info
      
      Fixes: 84e54fe0 ("gre: introduce native tunnel support for ERSPAN")
      Signed-off-by: NLorenzo Bianconi <lorenzo.bianconi@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      897ea28b
    • J
      l2tp: fix reading optional fields of L2TPv3 · 8de67666
      Jacob Wen 提交于
      [ Upstream commit 4522a70db7aa5e77526a4079628578599821b193 ]
      
      Use pskb_may_pull() to make sure the optional fields are in skb linear
      parts, so we can safely read them later.
      
      It's easy to reproduce the issue with a net driver that supports paged
      skb data. Just create a L2TPv3 over IP tunnel and then generates some
      network traffic.
      Once reproduced, rx err in /sys/kernel/debug/l2tp/tunnels will increase.
      
      Changes in v4:
      1. s/l2tp_v3_pull_opt/l2tp_v3_ensure_opt_in_linear/
      2. s/tunnel->version != L2TP_HDR_VER_2/tunnel->version == L2TP_HDR_VER_3/
      3. Add 'Fixes' in commit messages.
      
      Changes in v3:
      1. To keep consistency, move the code out of l2tp_recv_common.
      2. Use "net" instead of "net-next", since this is a bug fix.
      
      Changes in v2:
      1. Only fix L2TPv3 to make code simple.
         To fix both L2TPv3 and L2TPv2, we'd better refactor l2tp_recv_common.
         It's complicated to do so.
      2. Reloading pointers after pskb_may_pull
      
      Fixes: f7faffa3 ("l2tp: Add L2TPv3 protocol support")
      Fixes: 0d76751f ("l2tp: Add L2TPv3 IP encapsulation (no UDP) support")
      Fixes: a32e0eec ("l2tp: introduce L2TPv3 IP encapsulation support for IPv6")
      Signed-off-by: NJacob Wen <jian.w.wen@oracle.com>
      Acked-by: NGuillaume Nault <gnault@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      8de67666
    • J
      l2tp: copy 4 more bytes to linear part if necessary · 3d418a25
      Jacob Wen 提交于
      [ Upstream commit 91c524708de6207f59dd3512518d8a1c7b434ee3 ]
      
      The size of L2TPv2 header with all optional fields is 14 bytes.
      l2tp_udp_recv_core only moves 10 bytes to the linear part of a
      skb. This may lead to l2tp_recv_common read data outside of a skb.
      
      This patch make sure that there is at least 14 bytes in the linear
      part of a skb to meet the maximum need of l2tp_udp_recv_core and
      l2tp_recv_common. The minimum size of both PPP HDLC-like frame and
      Ethernet frame is larger than 14 bytes, so we are safe to do so.
      
      Also remove L2TP_HDR_SIZE_NOSEQ, it is unused now.
      
      Fixes: fd558d18 ("l2tp: Split pppol2tp patch into separate l2tp and ppp parts")
      Suggested-by: NGuillaume Nault <gnault@redhat.com>
      Signed-off-by: NJacob Wen <jian.w.wen@oracle.com>
      Acked-by: NGuillaume Nault <gnault@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      3d418a25
    • D
      ipvlan, l3mdev: fix broken l3s mode wrt local routes · fcc9c69a
      Daniel Borkmann 提交于
      [ Upstream commit d5256083f62e2720f75bb3c5a928a0afe47d6bc3 ]
      
      While implementing ipvlan l3 and l3s mode for kubernetes CNI plugin,
      I ran into the issue that while l3 mode is working fine, l3s mode
      does not have any connectivity to kube-apiserver and hence all pods
      end up in Error state as well. The ipvlan master device sits on
      top of a bond device and hostns traffic to kube-apiserver (also running
      in hostns) is DNATed from 10.152.183.1:443 to 139.178.29.207:37573
      where the latter is the address of the bond0. While in l3 mode, a
      curl to https://10.152.183.1:443 or to https://139.178.29.207:37573
      works fine from hostns, neither of them do in case of l3s. In the
      latter only a curl to https://127.0.0.1:37573 appeared to work where
      for local addresses of bond0 I saw kernel suddenly starting to emit
      ARP requests to query HW address of bond0 which remained unanswered
      and neighbor entries in INCOMPLETE state. These ARP requests only
      happen while in l3s.
      
      Debugging this further, I found the issue is that l3s mode is piggy-
      backing on l3 master device, and in this case local routes are using
      l3mdev_master_dev_rcu(dev) instead of net->loopback_dev as per commit
      f5a0aab8 ("net: ipv4: dst for local input routes should use l3mdev
      if relevant") and 5f02ce24 ("net: l3mdev: Allow the l3mdev to be
      a loopback"). I found that reverting them back into using the
      net->loopback_dev fixed ipvlan l3s connectivity and got everything
      working for the CNI.
      
      Now judging from 4fbae7d8 ("ipvlan: Introduce l3s mode") and the
      l3mdev paper in [0] the only sole reason why ipvlan l3s is relying
      on l3 master device is to get the l3mdev_ip_rcv() receive hook for
      setting the dst entry of the input route without adding its own
      ipvlan specific hacks into the receive path, however, any l3 domain
      semantics beyond just that are breaking l3s operation. Note that
      ipvlan also has the ability to dynamically switch its internal
      operation from l3 to l3s for all ports via ipvlan_set_port_mode()
      at runtime. In any case, l3 vs l3s soley distinguishes itself by
      'de-confusing' netfilter through switching skb->dev to ipvlan slave
      device late in NF_INET_LOCAL_IN before handing the skb to L4.
      
      Minimal fix taken here is to add a IFF_L3MDEV_RX_HANDLER flag which,
      if set from ipvlan setup, gets us only the wanted l3mdev_l3_rcv() hook
      without any additional l3mdev semantics on top. This should also have
      minimal impact since dev->priv_flags is already hot in cache. With
      this set, l3s mode is working fine and I also get things like
      masquerading pod traffic on the ipvlan master properly working.
      
        [0] https://netdevconf.org/1.2/papers/ahern-what-is-l3mdev-paper.pdf
      
      Fixes: f5a0aab8 ("net: ipv4: dst for local input routes should use l3mdev if relevant")
      Fixes: 5f02ce24 ("net: l3mdev: Allow the l3mdev to be a loopback")
      Fixes: 4fbae7d8 ("ipvlan: Introduce l3s mode")
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Cc: Mahesh Bandewar <maheshb@google.com>
      Cc: David Ahern <dsa@cumulusnetworks.com>
      Cc: Florian Westphal <fw@strlen.de>
      Cc: Martynas Pumputis <m@lambda.lt>
      Acked-by: NDavid Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      fcc9c69a
    • Y
      ipv6: sr: clear IP6CB(skb) on SRH ip4ip6 encapsulation · 2f704348
      Yohei Kanemaru 提交于
      [ Upstream commit ef489749aae508e6f17886775c075f12ff919fb1 ]
      
      skb->cb may contain data from previous layers (in an observed case
      IPv4 with L3 Master Device). In the observed scenario, the data in
      IPCB(skb)->frags was misinterpreted as IP6CB(skb)->frag_max_size,
      eventually caused an unexpected IPv6 fragmentation in ip6_fragment()
      through ip6_finish_output().
      
      This patch clears IP6CB(skb), which potentially contains garbage data,
      on the SRH ip4ip6 encapsulation.
      
      Fixes: 32d99d0b ("ipv6: sr: add support for ip4ip6 encapsulation")
      Signed-off-by: NYohei Kanemaru <yohei.kanemaru@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      2f704348
    • D
      ipv6: Consider sk_bound_dev_if when binding a socket to an address · 7e9a6476
      David Ahern 提交于
      [ Upstream commit c5ee066333ebc322a24a00a743ed941a0c68617e ]
      
      IPv6 does not consider if the socket is bound to a device when binding
      to an address. The result is that a socket can be bound to eth0 and then
      bound to the address of eth1. If the device is a VRF, the result is that
      a socket can only be bound to an address in the default VRF.
      
      Resolve by considering the device if sk_bound_dev_if is set.
      
      This problem exists from the beginning of git history.
      Signed-off-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      7e9a6476
    • A
      drm/msm/gpu: fix building without debugfs · 8877843b
      Arnd Bergmann 提交于
      commit c878a628e0c483ec36fa70f4590e4a58e34a6e49 upstream.
      
      When debugfs is disabled, but coredump is turned on, the adreno driver fails to build:
      
      drivers/gpu/drm/msm/adreno/a3xx_gpu.c:460:4: error: 'struct msm_gpu_funcs' has no member named 'show'
         .show = adreno_show,
          ^~~~
      drivers/gpu/drm/msm/adreno/a3xx_gpu.c:460:11: note: (near initialization for 'funcs.base')
      drivers/gpu/drm/msm/adreno/a3xx_gpu.c:460:11: error: initialization of 'void (*)(struct msm_gpu *, struct msm_gem_submit *, struct msm_file_private *)' from incompatible pointer type 'void (*)(struct msm_gpu *, struct msm_gpu_state *, struct drm_printer *)' [-Werror=incompatible-pointer-types]
      drivers/gpu/drm/msm/adreno/a3xx_gpu.c:460:11: note: (near initialization for 'funcs.base.submit')
      drivers/gpu/drm/msm/adreno/a4xx_gpu.c:546:4: error: 'struct msm_gpu_funcs' has no member named 'show'
      drivers/gpu/drm/msm/adreno/a5xx_gpu.c:1460:4: error: 'struct msm_gpu_funcs' has no member named 'show'
      drivers/gpu/drm/msm/adreno/a6xx_gpu.c:769:4: error: 'struct msm_gpu_funcs' has no member named 'show'
      drivers/gpu/drm/msm/msm_gpu.c: In function 'msm_gpu_devcoredump_read':
      drivers/gpu/drm/msm/msm_gpu.c:289:12: error: 'const struct msm_gpu_funcs' has no member named 'show'
      
      Adjust the #ifdef to make it build again.
      
      Fixes: c0fec7f5 ("drm/msm/gpu: Capture the GPU state on a GPU hang")
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      Signed-off-by: NRob Clark <robdclark@gmail.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      8877843b
    • G
      Fix "net: ipv4: do not handle duplicate fragments as overlapping" · 8c763a3c
      Greg Kroah-Hartman 提交于
      ade446403bfb ("net: ipv4: do not handle duplicate fragments as
      overlapping") was backported to many stable trees, but it had a problem
      that was "accidentally" fixed by the upstream commit 0ff89efb5246 ("ip:
      fail fast on IP defrag errors")
      
      This is the fixup for that problem as we do not want the larger patch in
      the older stable trees.
      
      Fixes: ade446403bfb ("net: ipv4: do not handle duplicate fragments as overlapping")
      Reported-by: NIvan Babrou <ivan@cloudflare.com>
      Reported-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      8c763a3c
  2. 31 1月, 2019 32 次提交
    • G
      Linux 4.19.19 · dffbba43
      Greg Kroah-Hartman 提交于
      dffbba43
    • D
      Input: input_event - fix the CONFIG_SPARC64 mixup · 3a3b6a6b
      Deepa Dinamani 提交于
      commit 141e5dcaa7356077028b4cd48ec351a38c70e5e5 upstream.
      
      Arnd Bergmann pointed out that CONFIG_* cannot be used in a uapi header.
      Override with an equivalent conditional.
      
      Fixes: 2e746942ebac ("Input: input_event - provide override for sparc64")
      Fixes: 152194fe ("Input: extend usable life of event timestamps to 2106 on 32 bit systems")
      Signed-off-by: NDeepa Dinamani <deepa.kernel@gmail.com>
      Signed-off-by: NDmitry Torokhov <dmitry.torokhov@gmail.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      3a3b6a6b
    • C
      ide: fix a typo in the settings proc file name · d4a6ac28
      Christoph Hellwig 提交于
      commit f8ff6c732d35904d773043f979b844ef330c701b upstream.
      
      Fixes: ec7d9c9c ("ide: replace ->proc_fops with ->proc_show")
      Reported-by: Nkernel test robot <lkp@intel.com>
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      d4a6ac28
    • J
      usb: dwc3: gadget: Clear req->needs_extra_trb flag on cleanup · 25ad17d6
      Jack Pham 提交于
      commit bd6742249b9ca918565e4e3abaa06665e587f4b5 upstream.
      
      OUT endpoint requests may somtimes have this flag set when
      preparing to be submitted to HW indicating that there is an
      additional TRB chained to the request for alignment purposes.
      If that request is removed before the controller can execute the
      transfer (e.g. ep_dequeue/ep_disable), the request will not go
      through the dwc3_gadget_ep_cleanup_completed_request() handler
      and will not have its needs_extra_trb flag cleared when
      dwc3_gadget_giveback() is called.  This same request could be
      later requeued for a new transfer that does not require an
      extra TRB and if it is successfully completed, the cleanup
      and TRB reclamation will incorrectly process the additional TRB
      which belongs to the next request, and incorrectly advances the
      TRB dequeue pointer, thereby messing up calculation of the next
      requeust's actual/remaining count when it completes.
      
      The right thing to do here is to ensure that the flag is cleared
      before it is given back to the function driver.  A good place
      to do that is in dwc3_gadget_del_and_unmap_request().
      
      Fixes: c6267a51 ("usb: dwc3: gadget: align transfers to wMaxPacketSize")
      Cc: stable@vger.kernel.org
      Signed-off-by: NJack Pham <jackp@codeaurora.org>
      Signed-off-by: NFelipe Balbi <felipe.balbi@linux.intel.com>
      [jackp: backport to <= 4.20: replaced 'needs_extra_trb' with 'unaligned'
              and 'zero' members in patch and reworded commit text]
      Signed-off-by: NJack Pham <jackp@codeaurora.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      25ad17d6
    • M
      Revert "mm, memory_hotplug: initialize struct pages for the full memory section" · 6bab9573
      Michal Hocko 提交于
      commit 4aa9fc2a435abe95a1e8d7f8c7b3d6356514b37a upstream.
      
      This reverts commit 2830bf6f05fb3e05bc4743274b806c821807a684.
      
      The underlying assumption that one sparse section belongs into a single
      numa node doesn't hold really. Robert Shteynfeld has reported a boot
      failure. The boot log was not captured but his memory layout is as
      follows:
      
        Early memory node ranges
          node   1: [mem 0x0000000000001000-0x0000000000090fff]
          node   1: [mem 0x0000000000100000-0x00000000dbdf8fff]
          node   1: [mem 0x0000000100000000-0x0000001423ffffff]
          node   0: [mem 0x0000001424000000-0x0000002023ffffff]
      
      This means that node0 starts in the middle of a memory section which is
      also in node1.  memmap_init_zone tries to initialize padding of a
      section even when it is outside of the given pfn range because there are
      code paths (e.g.  memory hotplug) which assume that the full worth of
      memory section is always initialized.
      
      In this particular case, though, such a range is already intialized and
      most likely already managed by the page allocator.  Scribbling over
      those pages corrupts the internal state and likely blows up when any of
      those pages gets used.
      Reported-by: NRobert Shteynfeld <robert.shteynfeld@gmail.com>
      Fixes: 2830bf6f05fb ("mm, memory_hotplug: initialize struct pages for the full memory section")
      Cc: stable@kernel.org
      Signed-off-by: NMichal Hocko <mhocko@suse.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      6bab9573
    • R
      nvmet-rdma: fix null dereference under heavy load · 7dbf1297
      Raju Rangoju 提交于
      commit 5cbab6303b4791a3e6713dfe2c5fda6a867f9adc upstream.
      
      Under heavy load if we don't have any pre-allocated rsps left, we
      dynamically allocate a rsp, but we are not actually allocating memory
      for nvme_completion (rsp->req.rsp). In such a case, accessing pointer
      fields (req->rsp->status) in nvmet_req_init() will result in crash.
      
      To fix this, allocate the memory for nvme_completion by calling
      nvmet_rdma_alloc_rsp()
      
      Fixes: 8407879c("nvmet-rdma:fix possible bogus dereference under heavy load")
      
      Cc: <stable@vger.kernel.org>
      Reviewed-by: NMax Gurtovoy <maxg@mellanox.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NRaju Rangoju <rajur@chelsio.com>
      Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      7dbf1297
    • I
      nvmet-rdma: Add unlikely for response allocated check · fa9184be
      Israel Rukshin 提交于
      commit ad1f824948e4ed886529219cf7cd717d078c630d upstream.
      Signed-off-by: NIsrael Rukshin <israelr@mellanox.com>
      Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
      Reviewed-by: NMax Gurtovoy <maxg@mellanox.com>
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Cc: Raju  Rangoju <rajur@chelsio.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      fa9184be
    • D
      s390/smp: Fix calling smp_call_ipl_cpu() from ipl CPU · 48046a01
      David Hildenbrand 提交于
      commit 60f1bf29c0b2519989927cae640cd1f50f59dc7f upstream.
      
      When calling smp_call_ipl_cpu() from the IPL CPU, we will try to read
      from pcpu_devices->lowcore. However, due to prefixing, that will result
      in reading from absolute address 0 on that CPU. We have to go via the
      actual lowcore instead.
      
      This means that right now, we will read lc->nodat_stack == 0 and
      therfore work on a very wrong stack.
      
      This BUG essentially broke rebooting under QEMU TCG (which will report
      a low address protection exception). And checking under KVM, it is
      also broken under KVM. With 1 VCPU it can be easily triggered.
      
      :/# echo 1 > /proc/sys/kernel/sysrq
      :/# echo b > /proc/sysrq-trigger
      [   28.476745] sysrq: SysRq : Resetting
      [   28.476793] Kernel stack overflow.
      [   28.476817] CPU: 0 PID: 424 Comm: sh Not tainted 5.0.0-rc1+ #13
      [   28.476820] Hardware name: IBM 2964 NE1 716 (KVM/Linux)
      [   28.476826] Krnl PSW : 0400c00180000000 0000000000115c0c (pcpu_delegate+0x12c/0x140)
      [   28.476861]            R:0 T:1 IO:0 EX:0 Key:0 M:0 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3
      [   28.476863] Krnl GPRS: ffffffffffffffff 0000000000000000 000000000010dff8 0000000000000000
      [   28.476864]            0000000000000000 0000000000000000 0000000000ab7090 000003e0006efbf0
      [   28.476864]            000000000010dff8 0000000000000000 0000000000000000 0000000000000000
      [   28.476865]            000000007fffc000 0000000000730408 000003e0006efc58 0000000000000000
      [   28.476887] Krnl Code: 0000000000115bfe: 4170f000            la      %r7,0(%r15)
      [   28.476887]            0000000000115c02: 41f0a000            la      %r15,0(%r10)
      [   28.476887]           #0000000000115c06: e370f0980024        stg     %r7,152(%r15)
      [   28.476887]           >0000000000115c0c: c0e5fffff86e        brasl   %r14,114ce8
      [   28.476887]            0000000000115c12: 41f07000            la      %r15,0(%r7)
      [   28.476887]            0000000000115c16: a7f4ffa8            brc     15,115b66
      [   28.476887]            0000000000115c1a: 0707                bcr     0,%r7
      [   28.476887]            0000000000115c1c: 0707                bcr     0,%r7
      [   28.476901] Call Trace:
      [   28.476902] Last Breaking-Event-Address:
      [   28.476920]  [<0000000000a01c4a>] arch_call_rest_init+0x22/0x80
      [   28.476927] Kernel panic - not syncing: Corrupt kernel stack, can't continue.
      [   28.476930] CPU: 0 PID: 424 Comm: sh Not tainted 5.0.0-rc1+ #13
      [   28.476932] Hardware name: IBM 2964 NE1 716 (KVM/Linux)
      [   28.476932] Call Trace:
      
      Fixes: 2f859d0d ("s390/smp: reduce size of struct pcpu")
      Cc: stable@vger.kernel.org # 4.0+
      Reported-by: NCornelia Huck <cohuck@redhat.com>
      Signed-off-by: NDavid Hildenbrand <david@redhat.com>
      Signed-off-by: NMartin Schwidefsky <schwidefsky@de.ibm.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      
      48046a01
    • D
      bpf: fix inner map masking to prevent oob under speculation · 37c9e3ee
      Daniel Borkmann 提交于
      [ commit 9d5564ddcf2a0f5ba3fa1c3a1f8a1b59ad309553 upstream ]
      
      During review I noticed that inner meta map setup for map in
      map is buggy in that it does not propagate all needed data
      from the reference map which the verifier is later accessing.
      
      In particular one such case is index masking to prevent out of
      bounds access under speculative execution due to missing the
      map's unpriv_array/index_mask field propagation. Fix this such
      that the verifier is generating the correct code for inlined
      lookups in case of unpriviledged use.
      
      Before patch (test_verifier's 'map in map access' dump):
      
        # bpftool prog dump xla id 3
           0: (62) *(u32 *)(r10 -4) = 0
           1: (bf) r2 = r10
           2: (07) r2 += -4
           3: (18) r1 = map[id:4]
           5: (07) r1 += 272                |
           6: (61) r0 = *(u32 *)(r2 +0)     |
           7: (35) if r0 >= 0x1 goto pc+6   | Inlined map in map lookup
           8: (54) (u32) r0 &= (u32) 0      | with index masking for
           9: (67) r0 <<= 3                 | map->unpriv_array.
          10: (0f) r0 += r1                 |
          11: (79) r0 = *(u64 *)(r0 +0)     |
          12: (15) if r0 == 0x0 goto pc+1   |
          13: (05) goto pc+1                |
          14: (b7) r0 = 0                   |
          15: (15) if r0 == 0x0 goto pc+11
          16: (62) *(u32 *)(r10 -4) = 0
          17: (bf) r2 = r10
          18: (07) r2 += -4
          19: (bf) r1 = r0
          20: (07) r1 += 272                |
          21: (61) r0 = *(u32 *)(r2 +0)     | Index masking missing (!)
          22: (35) if r0 >= 0x1 goto pc+3   | for inner map despite
          23: (67) r0 <<= 3                 | map->unpriv_array set.
          24: (0f) r0 += r1                 |
          25: (05) goto pc+1                |
          26: (b7) r0 = 0                   |
          27: (b7) r0 = 0
          28: (95) exit
      
      After patch:
      
        # bpftool prog dump xla id 1
           0: (62) *(u32 *)(r10 -4) = 0
           1: (bf) r2 = r10
           2: (07) r2 += -4
           3: (18) r1 = map[id:2]
           5: (07) r1 += 272                |
           6: (61) r0 = *(u32 *)(r2 +0)     |
           7: (35) if r0 >= 0x1 goto pc+6   | Same inlined map in map lookup
           8: (54) (u32) r0 &= (u32) 0      | with index masking due to
           9: (67) r0 <<= 3                 | map->unpriv_array.
          10: (0f) r0 += r1                 |
          11: (79) r0 = *(u64 *)(r0 +0)     |
          12: (15) if r0 == 0x0 goto pc+1   |
          13: (05) goto pc+1                |
          14: (b7) r0 = 0                   |
          15: (15) if r0 == 0x0 goto pc+12
          16: (62) *(u32 *)(r10 -4) = 0
          17: (bf) r2 = r10
          18: (07) r2 += -4
          19: (bf) r1 = r0
          20: (07) r1 += 272                |
          21: (61) r0 = *(u32 *)(r2 +0)     |
          22: (35) if r0 >= 0x1 goto pc+4   | Now fixed inlined inner map
          23: (54) (u32) r0 &= (u32) 0      | lookup with proper index masking
          24: (67) r0 <<= 3                 | for map->unpriv_array.
          25: (0f) r0 += r1                 |
          26: (05) goto pc+1                |
          27: (b7) r0 = 0                   |
          28: (b7) r0 = 0
          29: (95) exit
      
      Fixes: b2157399 ("bpf: prevent out-of-bounds speculation")
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      37c9e3ee
    • D
      bpf: fix sanitation of alu op with pointer / scalar type from different paths · eed84f94
      Daniel Borkmann 提交于
      [ commit d3bd7413e0ca40b60cf60d4003246d067cafdeda upstream ]
      
      While 979d63d50c0c ("bpf: prevent out of bounds speculation on pointer
      arithmetic") took care of rejecting alu op on pointer when e.g. pointer
      came from two different map values with different map properties such as
      value size, Jann reported that a case was not covered yet when a given
      alu op is used in both "ptr_reg += reg" and "numeric_reg += reg" from
      different branches where we would incorrectly try to sanitize based
      on the pointer's limit. Catch this corner case and reject the program
      instead.
      
      Fixes: 979d63d50c0c ("bpf: prevent out of bounds speculation on pointer arithmetic")
      Reported-by: NJann Horn <jannh@google.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      eed84f94
    • D
      bpf: prevent out of bounds speculation on pointer arithmetic · f92a819b
      Daniel Borkmann 提交于
      [ commit 979d63d50c0c0f7bc537bf821e056cc9fe5abd38 upstream ]
      
      Jann reported that the original commit back in b2157399
      ("bpf: prevent out-of-bounds speculation") was not sufficient
      to stop CPU from speculating out of bounds memory access:
      While b2157399 only focussed on masking array map access
      for unprivileged users for tail calls and data access such
      that the user provided index gets sanitized from BPF program
      and syscall side, there is still a more generic form affected
      from BPF programs that applies to most maps that hold user
      data in relation to dynamic map access when dealing with
      unknown scalars or "slow" known scalars as access offset, for
      example:
      
        - Load a map value pointer into R6
        - Load an index into R7
        - Do a slow computation (e.g. with a memory dependency) that
          loads a limit into R8 (e.g. load the limit from a map for
          high latency, then mask it to make the verifier happy)
        - Exit if R7 >= R8 (mispredicted branch)
        - Load R0 = R6[R7]
        - Load R0 = R6[R0]
      
      For unknown scalars there are two options in the BPF verifier
      where we could derive knowledge from in order to guarantee
      safe access to the memory: i) While </>/<=/>= variants won't
      allow to derive any lower or upper bounds from the unknown
      scalar where it would be safe to add it to the map value
      pointer, it is possible through ==/!= test however. ii) another
      option is to transform the unknown scalar into a known scalar,
      for example, through ALU ops combination such as R &= <imm>
      followed by R |= <imm> or any similar combination where the
      original information from the unknown scalar would be destroyed
      entirely leaving R with a constant. The initial slow load still
      precedes the latter ALU ops on that register, so the CPU
      executes speculatively from that point. Once we have the known
      scalar, any compare operation would work then. A third option
      only involving registers with known scalars could be crafted
      as described in [0] where a CPU port (e.g. Slow Int unit)
      would be filled with many dependent computations such that
      the subsequent condition depending on its outcome has to wait
      for evaluation on its execution port and thereby executing
      speculatively if the speculated code can be scheduled on a
      different execution port, or any other form of mistraining
      as described in [1], for example. Given this is not limited
      to only unknown scalars, not only map but also stack access
      is affected since both is accessible for unprivileged users
      and could potentially be used for out of bounds access under
      speculation.
      
      In order to prevent any of these cases, the verifier is now
      sanitizing pointer arithmetic on the offset such that any
      out of bounds speculation would be masked in a way where the
      pointer arithmetic result in the destination register will
      stay unchanged, meaning offset masked into zero similar as
      in array_index_nospec() case. With regards to implementation,
      there are three options that were considered: i) new insn
      for sanitation, ii) push/pop insn and sanitation as inlined
      BPF, iii) reuse of ax register and sanitation as inlined BPF.
      
      Option i) has the downside that we end up using from reserved
      bits in the opcode space, but also that we would require
      each JIT to emit masking as native arch opcodes meaning
      mitigation would have slow adoption till everyone implements
      it eventually which is counter-productive. Option ii) and iii)
      have both in common that a temporary register is needed in
      order to implement the sanitation as inlined BPF since we
      are not allowed to modify the source register. While a push /
      pop insn in ii) would be useful to have in any case, it
      requires once again that every JIT needs to implement it
      first. While possible, amount of changes needed would also
      be unsuitable for a -stable patch. Therefore, the path which
      has fewer changes, less BPF instructions for the mitigation
      and does not require anything to be changed in the JITs is
      option iii) which this work is pursuing. The ax register is
      already mapped to a register in all JITs (modulo arm32 where
      it's mapped to stack as various other BPF registers there)
      and used in constant blinding for JITs-only so far. It can
      be reused for verifier rewrites under certain constraints.
      The interpreter's tmp "register" has therefore been remapped
      into extending the register set with hidden ax register and
      reusing that for a number of instructions that needed the
      prior temporary variable internally (e.g. div, mod). This
      allows for zero increase in stack space usage in the interpreter,
      and enables (restricted) generic use in rewrites otherwise as
      long as such a patchlet does not make use of these instructions.
      The sanitation mask is dynamic and relative to the offset the
      map value or stack pointer currently holds.
      
      There are various cases that need to be taken under consideration
      for the masking, e.g. such operation could look as follows:
      ptr += val or val += ptr or ptr -= val. Thus, the value to be
      sanitized could reside either in source or in destination
      register, and the limit is different depending on whether
      the ALU op is addition or subtraction and depending on the
      current known and bounded offset. The limit is derived as
      follows: limit := max_value_size - (smin_value + off). For
      subtraction: limit := umax_value + off. This holds because
      we do not allow any pointer arithmetic that would
      temporarily go out of bounds or would have an unknown
      value with mixed signed bounds where it is unclear at
      verification time whether the actual runtime value would
      be either negative or positive. For example, we have a
      derived map pointer value with constant offset and bounded
      one, so limit based on smin_value works because the verifier
      requires that statically analyzed arithmetic on the pointer
      must be in bounds, and thus it checks if resulting
      smin_value + off and umax_value + off is still within map
      value bounds at time of arithmetic in addition to time of
      access. Similarly, for the case of stack access we derive
      the limit as follows: MAX_BPF_STACK + off for subtraction
      and -off for the case of addition where off := ptr_reg->off +
      ptr_reg->var_off.value. Subtraction is a special case for
      the masking which can be in form of ptr += -val, ptr -= -val,
      or ptr -= val. In the first two cases where we know that
      the value is negative, we need to temporarily negate the
      value in order to do the sanitation on a positive value
      where we later swap the ALU op, and restore original source
      register if the value was in source.
      
      The sanitation of pointer arithmetic alone is still not fully
      sufficient as is, since a scenario like the following could
      happen ...
      
        PTR += 0x1000 (e.g. K-based imm)
        PTR -= BIG_NUMBER_WITH_SLOW_COMPARISON
        PTR += 0x1000
        PTR -= BIG_NUMBER_WITH_SLOW_COMPARISON
        [...]
      
      ... which under speculation could end up as ...
      
        PTR += 0x1000
        PTR -= 0 [ truncated by mitigation ]
        PTR += 0x1000
        PTR -= 0 [ truncated by mitigation ]
        [...]
      
      ... and therefore still access out of bounds. To prevent such
      case, the verifier is also analyzing safety for potential out
      of bounds access under speculative execution. Meaning, it is
      also simulating pointer access under truncation. We therefore
      "branch off" and push the current verification state after the
      ALU operation with known 0 to the verification stack for later
      analysis. Given the current path analysis succeeded it is
      likely that the one under speculation can be pruned. In any
      case, it is also subject to existing complexity limits and
      therefore anything beyond this point will be rejected. In
      terms of pruning, it needs to be ensured that the verification
      state from speculative execution simulation must never prune
      a non-speculative execution path, therefore, we mark verifier
      state accordingly at the time of push_stack(). If verifier
      detects out of bounds access under speculative execution from
      one of the possible paths that includes a truncation, it will
      reject such program.
      
      Given we mask every reg-based pointer arithmetic for
      unprivileged programs, we've been looking into how it could
      affect real-world programs in terms of size increase. As the
      majority of programs are targeted for privileged-only use
      case, we've unconditionally enabled masking (with its alu
      restrictions on top of it) for privileged programs for the
      sake of testing in order to check i) whether they get rejected
      in its current form, and ii) by how much the number of
      instructions and size will increase. We've tested this by
      using Katran, Cilium and test_l4lb from the kernel selftests.
      For Katran we've evaluated balancer_kern.o, Cilium bpf_lxc.o
      and an older test object bpf_lxc_opt_-DUNKNOWN.o and l4lb
      we've used test_l4lb.o as well as test_l4lb_noinline.o. We
      found that none of the programs got rejected by the verifier
      with this change, and that impact is rather minimal to none.
      balancer_kern.o had 13,904 bytes (1,738 insns) xlated and
      7,797 bytes JITed before and after the change. Most complex
      program in bpf_lxc.o had 30,544 bytes (3,817 insns) xlated
      and 18,538 bytes JITed before and after and none of the other
      tail call programs in bpf_lxc.o had any changes either. For
      the older bpf_lxc_opt_-DUNKNOWN.o object we found a small
      increase from 20,616 bytes (2,576 insns) and 12,536 bytes JITed
      before to 20,664 bytes (2,582 insns) and 12,558 bytes JITed
      after the change. Other programs from that object file had
      similar small increase. Both test_l4lb.o had no change and
      remained at 6,544 bytes (817 insns) xlated and 3,401 bytes
      JITed and for test_l4lb_noinline.o constant at 5,080 bytes
      (634 insns) xlated and 3,313 bytes JITed. This can be explained
      in that LLVM typically optimizes stack based pointer arithmetic
      by using K-based operations and that use of dynamic map access
      is not overly frequent. However, in future we may decide to
      optimize the algorithm further under known guarantees from
      branch and value speculation. Latter seems also unclear in
      terms of prediction heuristics that today's CPUs apply as well
      as whether there could be collisions in e.g. the predictor's
      Value History/Pattern Table for triggering out of bounds access,
      thus masking is performed unconditionally at this point but could
      be subject to relaxation later on. We were generally also
      brainstorming various other approaches for mitigation, but the
      blocker was always lack of available registers at runtime and/or
      overhead for runtime tracking of limits belonging to a specific
      pointer. Thus, we found this to be minimally intrusive under
      given constraints.
      
      With that in place, a simple example with sanitized access on
      unprivileged load at post-verification time looks as follows:
      
        # bpftool prog dump xlated id 282
        [...]
        28: (79) r1 = *(u64 *)(r7 +0)
        29: (79) r2 = *(u64 *)(r7 +8)
        30: (57) r1 &= 15
        31: (79) r3 = *(u64 *)(r0 +4608)
        32: (57) r3 &= 1
        33: (47) r3 |= 1
        34: (2d) if r2 > r3 goto pc+19
        35: (b4) (u32) r11 = (u32) 20479  |
        36: (1f) r11 -= r2                | Dynamic sanitation for pointer
        37: (4f) r11 |= r2                | arithmetic with registers
        38: (87) r11 = -r11               | containing bounded or known
        39: (c7) r11 s>>= 63              | scalars in order to prevent
        40: (5f) r11 &= r2                | out of bounds speculation.
        41: (0f) r4 += r11                |
        42: (71) r4 = *(u8 *)(r4 +0)
        43: (6f) r4 <<= r1
        [...]
      
      For the case where the scalar sits in the destination register
      as opposed to the source register, the following code is emitted
      for the above example:
      
        [...]
        16: (b4) (u32) r11 = (u32) 20479
        17: (1f) r11 -= r2
        18: (4f) r11 |= r2
        19: (87) r11 = -r11
        20: (c7) r11 s>>= 63
        21: (5f) r2 &= r11
        22: (0f) r2 += r0
        23: (61) r0 = *(u32 *)(r2 +0)
        [...]
      
      JIT blinding example with non-conflicting use of r10:
      
        [...]
         d5:	je     0x0000000000000106    _
         d7:	mov    0x0(%rax),%edi       |
         da:	mov    $0xf153246,%r10d     | Index load from map value and
         e0:	xor    $0xf153259,%r10      | (const blinded) mask with 0x1f.
         e7:	and    %r10,%rdi            |_
         ea:	mov    $0x2f,%r10d          |
         f0:	sub    %rdi,%r10            | Sanitized addition. Both use r10
         f3:	or     %rdi,%r10            | but do not interfere with each
         f6:	neg    %r10                 | other. (Neither do these instructions
         f9:	sar    $0x3f,%r10           | interfere with the use of ax as temp
         fd:	and    %r10,%rdi            | in interpreter.)
        100:	add    %rax,%rdi            |_
        103:	mov    0x0(%rdi),%eax
       [...]
      
      Tested that it fixes Jann's reproducer, and also checked that test_verifier
      and test_progs suite with interpreter, JIT and JIT with hardening enabled
      on x86-64 and arm64 runs successfully.
      
        [0] Speculose: Analyzing the Security Implications of Speculative
            Execution in CPUs, Giorgi Maisuradze and Christian Rossow,
            https://arxiv.org/pdf/1801.04084.pdf
      
        [1] A Systematic Evaluation of Transient Execution Attacks and
            Defenses, Claudio Canella, Jo Van Bulck, Michael Schwarz,
            Moritz Lipp, Benjamin von Berg, Philipp Ortner, Frank Piessens,
            Dmitry Evtyushkin, Daniel Gruss,
            https://arxiv.org/pdf/1811.05441.pdf
      
      Fixes: b2157399 ("bpf: prevent out-of-bounds speculation")
      Reported-by: NJann Horn <jannh@google.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      f92a819b
    • D
      bpf: fix check_map_access smin_value test when pointer contains offset · 4f7f708d
      Daniel Borkmann 提交于
      [ commit b7137c4eab85c1cf3d46acdde90ce1163b28c873 upstream ]
      
      In check_map_access() we probe actual bounds through __check_map_access()
      with offset of reg->smin_value + off for lower bound and offset of
      reg->umax_value + off for the upper bound. However, even though the
      reg->smin_value could have a negative value, the final result of the
      sum with off could be positive when pointer arithmetic with known and
      unknown scalars is combined. In this case we reject the program with
      an error such as "R<x> min value is negative, either use unsigned index
      or do a if (index >=0) check." even though the access itself would be
      fine. Therefore extend the check to probe whether the actual resulting
      reg->smin_value + off is less than zero.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      4f7f708d
    • D
      bpf: restrict unknown scalars of mixed signed bounds for unprivileged · 44f8fc64
      Daniel Borkmann 提交于
      [ commit 9d7eceede769f90b66cfa06ad5b357140d5141ed upstream ]
      
      For unknown scalars of mixed signed bounds, meaning their smin_value is
      negative and their smax_value is positive, we need to reject arithmetic
      with pointer to map value. For unprivileged the goal is to mask every
      map pointer arithmetic and this cannot reliably be done when it is
      unknown at verification time whether the scalar value is negative or
      positive. Given this is a corner case, the likelihood of breaking should
      be very small.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      44f8fc64
    • D
      bpf: restrict stack pointer arithmetic for unprivileged · 5332dda9
      Daniel Borkmann 提交于
      [ commit e4298d25830a866cc0f427d4bccb858e76715859 upstream ]
      
      Restrict stack pointer arithmetic for unprivileged users in that
      arithmetic itself must not go out of bounds as opposed to the actual
      access later on. Therefore after each adjust_ptr_min_max_vals() with
      a stack pointer as a destination we simulate a check_stack_access()
      of 1 byte on the destination and once that fails the program is
      rejected for unprivileged program loads. This is analog to map
      value pointer arithmetic and needed for masking later on.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      5332dda9
    • D
      bpf: restrict map value pointer arithmetic for unprivileged · 9e57b296
      Daniel Borkmann 提交于
      [ commit 0d6303db7970e6f56ae700fa07e11eb510cda125 upstream ]
      
      Restrict map value pointer arithmetic for unprivileged users in that
      arithmetic itself must not go out of bounds as opposed to the actual
      access later on. Therefore after each adjust_ptr_min_max_vals() with a
      map value pointer as a destination it will simulate a check_map_access()
      of 1 byte on the destination and once that fails the program is rejected
      for unprivileged program loads. We use this later on for masking any
      pointer arithmetic with the remainder of the map value space. The
      likelihood of breaking any existing real-world unprivileged eBPF
      program is very small for this corner case.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      9e57b296
    • D
      bpf: enable access to ax register also from verifier rewrite · 232ac70d
      Daniel Borkmann 提交于
      [ commit 9b73bfdd08e73231d6a90ae6db4b46b3fbf56c30 upstream ]
      
      Right now we are using BPF ax register in JIT for constant blinding as
      well as in interpreter as temporary variable. Verifier will not be able
      to use it simply because its use will get overridden from the former in
      bpf_jit_blind_insn(). However, it can be made to work in that blinding
      will be skipped if there is prior use in either source or destination
      register on the instruction. Taking constraints of ax into account, the
      verifier is then open to use it in rewrites under some constraints. Note,
      ax register already has mappings in every eBPF JIT.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      232ac70d
    • D
      bpf: move tmp variable into ax register in interpreter · b855e310
      Daniel Borkmann 提交于
      [ commit 144cd91c4c2bced6eb8a7e25e590f6618a11e854 upstream ]
      
      This change moves the on-stack 64 bit tmp variable in ___bpf_prog_run()
      into the hidden ax register. The latter is currently only used in JITs
      for constant blinding as a temporary scratch register, meaning the BPF
      interpreter will never see the use of ax. Therefore it is safe to use
      it for the cases where tmp has been used earlier. This is needed to later
      on allow restricted hidden use of ax in both interpreter and JITs.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      b855e310
    • D
      bpf: move {prev_,}insn_idx into verifier env · 333a31c8
      Daniel Borkmann 提交于
      [ commit c08435ec7f2bc8f4109401f696fd55159b4b40cb upstream ]
      
      Move prev_insn_idx and insn_idx from the do_check() function into
      the verifier environment, so they can be read inside the various
      helper functions for handling the instructions. It's easier to put
      this into the environment rather than changing all call-sites only
      to pass it along. insn_idx is useful in particular since this later
      on allows to hold state in env->insn_aux_data[env->insn_idx].
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      333a31c8
    • A
      bpf: add per-insn complexity limit · 43711294
      Alexei Starovoitov 提交于
      [ commit ceefbc96fa5c5b975d87bf8e89ba8416f6b764d9 upstream ]
      
      malicious bpf program may try to force the verifier to remember
      a lot of distinct verifier states.
      Put a limit to number of per-insn 'struct bpf_verifier_state'.
      Note that hitting the limit doesn't reject the program.
      It potentially makes the verifier do more steps to analyze the program.
      It means that malicious programs will hit BPF_COMPLEXITY_LIMIT_INSNS sooner
      instead of spending cpu time walking long link list.
      
      The limit of BPF_COMPLEXITY_LIMIT_STATES==64 affects cilium progs
      with slight increase in number of "steps" it takes to successfully verify
      the programs:
                             before    after
      bpf_lb-DLB_L3.o         1940      1940
      bpf_lb-DLB_L4.o         3089      3089
      bpf_lb-DUNKNOWN.o       1065      1065
      bpf_lxc-DDROP_ALL.o     28052  |  28162
      bpf_lxc-DUNKNOWN.o      35487  |  35541
      bpf_netdev.o            10864     10864
      bpf_overlay.o           6643      6643
      bpf_lcx_jit.o           38437     38437
      
      But it also makes malicious program to be rejected in 0.4 seconds vs 6.5
      Hence apply this limit to unprivileged programs only.
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NEdward Cree <ecree@solarflare.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      43711294
    • A
      bpf: improve verifier branch analysis · 7da6cd69
      Alexei Starovoitov 提交于
      [ commit 4f7b3e82589e0de723780198ec7983e427144c0a upstream ]
      
      pathological bpf programs may try to force verifier to explode in
      the number of branch states:
        20: (d5) if r1 s<= 0x24000028 goto pc+0
        21: (b5) if r0 <= 0xe1fa20 goto pc+2
        22: (d5) if r1 s<= 0x7e goto pc+0
        23: (b5) if r0 <= 0xe880e000 goto pc+0
        24: (c5) if r0 s< 0x2100ecf4 goto pc+0
        25: (d5) if r1 s<= 0xe880e000 goto pc+1
        26: (c5) if r0 s< 0xf4041810 goto pc+0
        27: (d5) if r1 s<= 0x1e007e goto pc+0
        28: (b5) if r0 <= 0xe86be000 goto pc+0
        29: (07) r0 += 16614
        30: (c5) if r0 s< 0x6d0020da goto pc+0
        31: (35) if r0 >= 0x2100ecf4 goto pc+0
      
      Teach verifier to recognize always taken and always not taken branches.
      This analysis is already done for == and != comparison.
      Expand it to all other branches.
      
      It also helps real bpf programs to be verified faster:
                             before  after
      bpf_lb-DLB_L3.o         2003    1940
      bpf_lb-DLB_L4.o         3173    3089
      bpf_lb-DUNKNOWN.o       1080    1065
      bpf_lxc-DDROP_ALL.o     29584   28052
      bpf_lxc-DUNKNOWN.o      36916   35487
      bpf_netdev.o            11188   10864
      bpf_overlay.o           6679    6643
      bpf_lcx_jit.o           39555   38437
      Reported-by: NAnatoly Trosinenko <anatoly.trosinenko@gmail.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NEdward Cree <ecree@solarflare.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      7da6cd69
    • N
      drm/meson: Fix atomic mode switching regression · ce8d0581
      Neil Armstrong 提交于
      commit ce0210c12433031aba3bbacd75f4c02ab77f2004 upstream.
      
      Since commit 2bcd3ecab773 when switching mode from X11 (ubuntu mate for
      example) the display gets blurry, looking like an invalid framebuffer width.
      
      This commit fixed atomic crtc modesetting in a totally wrong way and
      introduced a local unnecessary ->enabled crtc state.
      
      This commit reverts the crctc _begin() and _enable() changes and simply
      adds drm_atomic_helper_commit_tail_rpm as helper.
      Reported-by: NTony McKahan <tonymckahan@gmail.com>
      Suggested-by: NDaniel Vetter <daniel@ffwll.ch>
      Fixes: 2bcd3ecab773 ("drm/meson: Fixes for drm_crtc_vblank_on/off support")
      Signed-off-by: NNeil Armstrong <narmstrong@baylibre.com>
      Acked-by: NDaniel Vetter <daniel.vetter@ffwll.ch>
      [narmstrong: fixed blank line issue from checkpatch]
      Link: https://patchwork.freedesktop.org/patch/msgid/20190114153118.8024-1-narmstrong@baylibre.comSigned-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      ce8d0581
    • N
      vt: invoke notifier on screen size change · 8b4dffe8
      Nicolas Pitre 提交于
      commit 0c9b1965faddad7534b6974b5b36c4ad37998f8e upstream.
      
      User space using poll() on /dev/vcs devices are not awaken when a
      screen size change occurs. Let's fix that.
      Signed-off-by: NNicolas Pitre <nico@linaro.org>
      Cc: stable <stable@vger.kernel.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      8b4dffe8
    • N
      vt: always call notifier with the console lock held · 18ef43de
      Nicolas Pitre 提交于
      commit 7e1d226345f89ad5d0216a9092c81386c89b4983 upstream.
      
      Every invocation of notify_write() and notify_update() is performed
      under the console lock, except for one case. Let's fix that.
      Signed-off-by: NNicolas Pitre <nico@linaro.org>
      Cc: stable@vger.kernel.org
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      18ef43de
    • N
      vt: make vt_console_print() compatible with the unicode screen buffer · 855f7e64
      Nicolas Pitre 提交于
      commit 6609cff65c5b184ab889880ef5d41189611ea05f upstream.
      
      When kernel messages are printed to the console, they appear blank on
      the unicode screen. This is because vt_console_print() is lacking a call
      to vc_uniscr_putc(). However the later function assumes vc->vc_x is
      always up to date when called, which is not the case here as
      vt_console_print() uses it to mark the beginning of the display update.
      
      This patch reworks (and simplifies) vt_console_print() so that vc->vc_x
      is always valid and keeps the start of display update in a local variable
      instead, which finally allows for adding the missing vc_uniscr_putc()
      call.
      Signed-off-by: NNicolas Pitre <nico@linaro.org>
      Cc: stable@vger.kernel.org # v4.19+
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      855f7e64
    • U
      can: flexcan: fix NULL pointer exception during bringup · 6f4f2a44
      Uwe Kleine-König 提交于
      commit a55234dabe1f72cf22f9197980751d37e38ba020 upstream.
      
      Commit cbffaf7aa09e ("can: flexcan: Always use last mailbox for TX")
      introduced a loop letting i run up to (including) ARRAY_SIZE(regs->mb)
      and in the body accessed regs->mb[i] which is an out-of-bounds array
      access that then resulted in an access to an reserved register area.
      
      Later this was changed by commit 0517961ccdf1 ("can: flexcan: Add
      provision for variable payload size") to iterate a bit differently but
      still runs one iteration too much resulting to call
      
      	flexcan_get_mb(priv, priv->mb_count)
      
      which results in a WARN_ON and then a NULL pointer exception. This
      only affects devices compatible with "fsl,p1010-flexcan",
      "fsl,imx53-flexcan", "fsl,imx35-flexcan", "fsl,imx25-flexcan",
      "fsl,imx28-flexcan", so newer i.MX SoCs are not affected.
      
      Fixes: cbffaf7aa09e ("can: flexcan: Always use last mailbox for TX")
      Signed-off-by: NUwe Kleine-König <u.kleine-koenig@pengutronix.de>
      Cc: linux-stable <stable@vger.kernel.org> # >= 4.20
      Signed-off-by: NMarc Kleine-Budde <mkl@pengutronix.de>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      
      6f4f2a44
    • O
      can: bcm: check timer values before ktime conversion · 576f474f
      Oliver Hartkopp 提交于
      commit 93171ba6f1deffd82f381d36cb13177872d023f6 upstream.
      
      Kyungtae Kim detected a potential integer overflow in bcm_[rx|tx]_setup()
      when the conversion into ktime multiplies the given value with NSEC_PER_USEC
      (1000).
      
      Reference: https://marc.info/?l=linux-can&m=154732118819828&w=2
      
      Add a check for the given tv_usec, so that the value stays below one second.
      Additionally limit the tv_sec value to a reasonable value for CAN related
      use-cases of 400 days and ensure all values to be positive.
      Reported-by: NKyungtae Kim <kt0755@gmail.com>
      Tested-by: NOliver Hartkopp <socketcan@hartkopp.net>
      Signed-off-by: NOliver Hartkopp <socketcan@hartkopp.net>
      Cc: linux-stable <stable@vger.kernel.org> # >= 2.6.26
      Tested-by: NKyungtae Kim <kt0755@gmail.com>
      Acked-by: NAndre Naujoks <nautsch2@gmail.com>
      Signed-off-by: NMarc Kleine-Budde <mkl@pengutronix.de>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      576f474f
    • M
      can: dev: __can_get_echo_skb(): fix bogous check for non-existing skb by removing it · 8d85aa96
      Manfred Schlaegl 提交于
      commit 7b12c8189a3dc50638e7d53714c88007268d47ef upstream.
      
      This patch revert commit 7da11ba5c506
      ("can: dev: __can_get_echo_skb(): print error message, if trying to echo non existing skb")
      
      After introduction of this change we encountered following new error
      message on various i.MX plattforms (flexcan):
      
      | flexcan 53fc8000.can can0: __can_get_echo_skb: BUG! Trying to echo non
      | existing skb: can_priv::echo_skb[0]
      
      The introduction of the message was a mistake because
      priv->echo_skb[idx] = NULL is a perfectly valid in following case: If
      CAN_RAW_LOOPBACK is disabled (setsockopt) in applications, the pkt_type
      of the tx skb's given to can_put_echo_skb is set to PACKET_LOOPBACK. In
      this case can_put_echo_skb will not set priv->echo_skb[idx]. It is
      therefore kept NULL.
      
      As additional argument for revert: The order of check and usage of idx
      was changed. idx is used to access an array element before checking it's
      boundaries.
      Signed-off-by: NManfred Schlaegl <manfred.schlaegl@ginzinger.com>
      Fixes: 7da11ba5c506 ("can: dev: __can_get_echo_skb(): print error message, if trying to echo non existing skb")
      Cc: linux-stable <stable@vger.kernel.org>
      Signed-off-by: NMarc Kleine-Budde <mkl@pengutronix.de>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      8d85aa96
    • M
      irqchip/gic-v3-its: Align PCI Multi-MSI allocation on their size · bdcf74e7
      Marc Zyngier 提交于
      commit 8208d1708b88b412ca97f50a6d951242c88cbbac upstream.
      
      The way we allocate events works fine in most cases, except
      when multiple PCI devices share an ITS-visible DevID, and that
      one of them is trying to use MultiMSI allocation.
      
      In that case, our allocation is not guaranteed to be zero-based
      anymore, and we have to make sure we allocate it on a boundary
      that is compatible with the PCI Multi-MSI constraints.
      
      Fix this by allocating the full region upfront instead of iterating
      over the number of MSIs. MSI-X are always allocated one by one,
      so this shouldn't change anything on that front.
      
      Fixes: b48ac83d ("irqchip: GICv3: ITS: MSI support")
      Cc: stable@vger.kernel.org
      Reported-by: NArd Biesheuvel <ard.biesheuvel@linaro.org>
      Tested-by: NArd Biesheuvel <ard.biesheuvel@linaro.org>
      Signed-off-by: NMarc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      bdcf74e7
    • T
      net: sun: cassini: Cleanup license conflict · 6f4db68a
      Thomas Gleixner 提交于
      commit 56cb4e5034998b5522a657957321ca64ca2ea0a0 upstream.
      
      The recent addition of SPDX license identifiers to the files in
      drivers/net/ethernet/sun created a licensing conflict.
      
      The cassini driver files contain a proper license notice:
      
        * This program is free software; you can redistribute it and/or
        * modify it under the terms of the GNU General Public License as
        * published by the Free Software Foundation; either version 2 of the
        * License, or (at your option) any later version.
      
      but the SPDX change added:
      
         SPDX-License-Identifier: GPL-2.0
      
      So the file got tagged GPL v2 only while in fact it is licensed under GPL
      v2 or later.
      
      It's nice that people care about the SPDX tags, but they need to be more
      careful about it. Not everything under (the) sun belongs to ...
      
      Fix up the SPDX identifier and remove the boiler plate text as it is
      redundant.
      
      Fixes: c861ef83 ("sun: Add SPDX license tags to Sun network drivers")
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Shannon Nelson <shannon.nelson@oracle.com>
      Cc: Zhu Yanjun <yanjun.zhu@oracle.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: netdev@vger.kernel.org
      Cc: stable@vger.kernel.org
      Acked-by: NShannon Nelson <shannon.lee.nelson@gmail.com>
      Reviewed-by: NZhu Yanjun <yanjun.zhu@oracle.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      6f4db68a
    • T
      posix-cpu-timers: Unbreak timer rearming · 21c0d162
      Thomas Gleixner 提交于
      commit 93ad0fc088c5b4631f796c995bdd27a082ef33a6 upstream.
      
      The recent commit which prevented a division by 0 issue in the alarm timer
      code broke posix CPU timers as an unwanted side effect.
      
      The reason is that the common rearm code checks for timer->it_interval
      being 0 now. What went unnoticed is that the posix cpu timer setup does not
      initialize timer->it_interval as it stores the interval in CPU timer
      specific storage. The reason for the separate storage is historical as the
      posix CPU timers always had a 64bit nanoseconds representation internally
      while timer->it_interval is type ktime_t which used to be a modified
      timespec representation on 32bit machines.
      
      Instead of reverting the offending commit and fixing the alarmtimer issue
      in the alarmtimer code, store the interval in timer->it_interval at CPU
      timer setup time so the common code check works. This also repairs the
      existing inconistency of the posix CPU timer code which kept a single shot
      timer armed despite of the interval being 0.
      
      The separate storage can be removed in mainline, but that needs to be a
      separate commit as the current one has to be backported to stable kernels.
      
      Fixes: 0e334db6bb4b ("posix-timers: Fix division by zero bug")
      Reported-by: NH.J. Lu <hjl.tools@gmail.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/20190111133500.840117406@linutronix.deSigned-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      21c0d162
    • J
      x86/entry/64/compat: Fix stack switching for XEN PV · dd085f9b
      Jan Beulich 提交于
      commit fc24d75a7f91837d7918e40719575951820b2b8f upstream.
      
      While in the native case entry into the kernel happens on the trampoline
      stack, PV Xen kernels get entered with the current thread stack right
      away. Hence source and destination stacks are identical in that case,
      and special care is needed.
      
      Other than in sync_regs() the copying done on the INT80 path isn't
      NMI / #MC safe, as either of these events occurring in the middle of the
      stack copying would clobber data on the (source) stack.
      
      There is similar code in interrupt_entry() and nmi(), but there is no fixup
      required because those code paths are unreachable in XEN PV guests.
      
      [ tglx: Sanitized subject, changelog, Fixes tag and stable mail address. Sigh ]
      
      Fixes: 7f2590a1 ("x86/entry/64: Use a per-CPU trampoline stack for IDT entries")
      Signed-off-by: NJan Beulich <jbeulich@suse.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: NJuergen Gross <jgross@suse.com>
      Acked-by: NAndy Lutomirski <luto@kernel.org>
      Cc: Peter Anvin <hpa@zytor.com>
      Cc: xen-devel@lists.xenproject.org>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/5C3E1128020000780020DFAD@prv1-mh.provo.novell.comSigned-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      dd085f9b
    • D
      x86/kaslr: Fix incorrect i8254 outb() parameters · ed334be9
      Daniel Drake 提交于
      commit 7e6fc2f50a3197d0e82d1c0e86282976c9e6c8a4 upstream.
      
      The outb() function takes parameters value and port, in that order.  Fix
      the parameters used in the kalsr i8254 fallback code.
      
      Fixes: 5bfce5ef ("x86, kaslr: Provide randomness functions")
      Signed-off-by: NDaniel Drake <drake@endlessm.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: bp@alien8.de
      Cc: hpa@zytor.com
      Cc: linux@endlessm.com
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/20190107034024.15005-1-drake@endlessm.comSigned-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      ed334be9