1. 14 Mar 2019 (1 commit)
  2. 10 Mar 2019 (6 commits)
    • Bluetooth: Fix locking in bt_accept_enqueue() for BH context · 8d368fc5
      Matthias Kaehlcke authored
      commit c4f5627f7eeecde1bb6b646d8c0907b96dc2b2a6 upstream.
      
      With commit e1633762 ("Bluetooth: Handle bt_accept_enqueue() socket
      atomically") lock_sock[_nested]() is used to acquire the socket lock
      before manipulating the socket. lock_sock[_nested]() may block, which
      is problematic since bt_accept_enqueue() can be called in bottom half
      context (e.g. from rfcomm_connect_ind()):
      
      [<ffffff80080d81ec>] __might_sleep+0x4c/0x80
      [<ffffff800876c7b0>] lock_sock_nested+0x24/0x58
      [<ffffff8000d7c27c>] bt_accept_enqueue+0x48/0xd4 [bluetooth]
      [<ffffff8000e67d8c>] rfcomm_connect_ind+0x190/0x218 [rfcomm]
      
      Add a parameter to bt_accept_enqueue() to indicate whether the
      function is called from BH context, and acquire the socket lock
      with bh_lock_sock_nested() if that's the case.
      
      Also adapt all callers of bt_accept_enqueue() to pass the new
      parameter:
      
      - l2cap_sock_new_connection_cb()
        - uses lock_sock() to lock the parent socket => process context
      
      - rfcomm_connect_ind()
        - acquires the parent socket lock with bh_lock_sock() => BH
          context
      
      - __sco_chan_add()
        - called from sco_chan_add(), which is called from sco_connect().
          parent is NULL, hence bt_accept_enqueue() isn't called in this
          code path and we can ignore it
        - also called from sco_conn_ready(). uses bh_lock_sock() to acquire
          the parent lock => BH context
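
      A sketch of the resulting locking logic, reconstructed from the
      description above (the exact list bookkeeping in the real function may
      differ):

        /* Callers pass bh = true when running in bottom-half context,
         * where lock_sock() must not be used because it may sleep. */
        void bt_accept_enqueue(struct sock *parent, struct sock *sk, bool bh)
        {
                sock_hold(sk);

                if (bh)
                        bh_lock_sock_nested(sk);
                else
                        lock_sock_nested(sk, SINGLE_DEPTH_NESTING);

                list_add_tail(&bt_sk(sk)->accept_q, &bt_sk(parent)->accept_q);
                bt_sk(sk)->parent = parent;

                if (bh)
                        bh_unlock_sock(sk);
                else
                        release_sock(sk);

                parent->sk_ack_backlog++;
        }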
      
      Fixes: e1633762 ("Bluetooth: Handle bt_accept_enqueue() socket atomically")
Signed-off-by: Matthias Kaehlcke <mka@chromium.org>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Cc: stable@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      8d368fc5
    • net: avoid use IPCB in cipso_v4_error · 125bc1e6
      Nazarov Sergey authored
      [ Upstream commit 3da1ed7ac398f34fff1694017a07054d69c5f5c5 ]
      
      Extract IP options in cipso_v4_error and use __icmp_send.
Signed-off-by: Sergey Nazarov <s-nazarov@yandex.ru>
Acked-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      125bc1e6
    • net: Add __icmp_send helper. · f2397468
      Nazarov Sergey authored
      [ Upstream commit 9ef6b42ad6fd7929dd1b6092cb02014e382c6a91 ]
      
Add an __icmp_send() function that takes an ip_options struct parameter.
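
      A sketch of the helper's shape as described (prototype reconstructed;
      the icmp_send() wrapper keeps existing callers unchanged):

        /* Variant that takes the IP options explicitly, for callers where
         * IPCB(skb) cannot be trusted to still hold them. */
        void __icmp_send(struct sk_buff *skb_in, int type, int code, __be32 info,
                         const struct ip_options *opt);

        /* Existing callers keep their behavior via a wrapper. */
        #define icmp_send(skb_in, type, code, info) \
                __icmp_send(skb_in, type, code, info, &IPCB(skb_in)->opt)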
Signed-off-by: Sergey Nazarov <s-nazarov@yandex.ru>
Reviewed-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      f2397468
    • ipv4: Add ICMPv6 support when parse route ipproto · 99ed9458
      Hangbin Liu authored
      [ Upstream commit 5e1a99eae84999a2536f50a0beaf5d5262337f40 ]
      
      For ip rules, we need to use 'ipproto ipv6-icmp' to match ICMPv6 headers.
      But for ip -6 route, currently we only support tcp, udp and icmp.
      
      Add ICMPv6 support so we can match ipv6-icmp rules for route lookup.
      
v2: As David Ahern and Sabrina Dubroca suggested, add an argument to
rtm_getroute_parse_ip_proto() to handle ICMP/ICMPv6 with different families.
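
      A sketch of the parser with the new family argument (reconstructed
      from the description; details may differ from the actual patch):

        int rtm_getroute_parse_ip_proto(struct nlattr *attr, u8 *ip_proto,
                                        u8 family, struct netlink_ext_ack *extack)
        {
                *ip_proto = nla_get_u8(attr);

                switch (*ip_proto) {
                case IPPROTO_TCP:
                case IPPROTO_UDP:
                        return 0;
                case IPPROTO_ICMP:
                        if (family == AF_INET)
                                return 0;
                        break;
                case IPPROTO_ICMPV6:
                        if (family == AF_INET6)
                                return 0;
                        break;
                }
                NL_SET_ERR_MSG(extack, "Unsupported ip proto");
                return -EOPNOTSUPP;
        }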
Reported-by: Jianlin Shi <jishi@redhat.com>
Fixes: eacb9384 ("ipv6: support sport, dport and ip_proto in RTM_GETROUTE")
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      99ed9458
    • net: sched: put back q.qlen into a single location · 3043bfe0
      Eric Dumazet authored
      [ Upstream commit 46b1c18f9deb326a7e18348e668e4c7ab7c7458b ]
      
      In the series fc8b81a5 ("Merge branch 'lockless-qdisc-series'")
      John made the assumption that the data path had no need to read
      the qdisc qlen (number of packets in the qdisc).
      
      It is true when pfifo_fast is used as the root qdisc, or as direct MQ/MQPRIO
      children.
      
But pfifo_fast can be used as a leaf in classful qdiscs, and existing
logic needs to access the child qlen in an efficient way.

HTB breaks badly, since it uses cl->leaf.q->q.qlen in:
        htb_activate() -> WARN_ON()
        htb_dequeue_tree() to decide if a class can be htb_deactivated
        when it has no more packets.
      
      HFSC, DRR, CBQ, QFQ have similar issues, and some calls to
      qdisc_tree_reduce_backlog() also read q.qlen directly.
      
Using qdisc_qlen_sum() (which iterates over all possible CPUs)
in the data path is a non-starter.
      
      It seems we have to put back qlen in a central location,
      at least for stable kernels.
      
For all qdiscs but pfifo_fast, qlen is guarded by the qdisc lock,
so the existing q.qlen{++|--} are correct.

For 'lockless' qdiscs (pfifo_fast so far), we need to use atomic_{inc|dec}()
because the spinlock might not be held (for example in
pfifo_fast_enqueue() and pfifo_fast_dequeue()).
      
This patch adds atomic_qlen (in the same location as qlen)
and renames the following helpers, since we want to express that
they can be used without the qdisc lock, and that qlen is no longer per-CPU.
      
      - qdisc_qstats_cpu_qlen_dec -> qdisc_qstats_atomic_qlen_dec()
      - qdisc_qstats_cpu_qlen_inc -> qdisc_qstats_atomic_qlen_inc()
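
      The renamed helpers then reduce to plain atomic operations on the
      shared counter (sketch; assumes an atomic_t atomic_qlen field next to
      qlen in struct qdisc_skb_head):

        /* qlen is shared again; for lockless qdiscs it must be updated
         * atomically because the qdisc spinlock may not be held. */
        static inline void qdisc_qstats_atomic_qlen_inc(struct Qdisc *sch)
        {
                atomic_inc(&sch->q.atomic_qlen);
        }

        static inline void qdisc_qstats_atomic_qlen_dec(struct Qdisc *sch)
        {
                atomic_dec(&sch->q.atomic_qlen);
        }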
      
      Later (net-next) we might revert this patch by tracking all these
      qlen uses and replace them by a more efficient method (not having
      to access a precise qlen, but an empty/non_empty status that might
      be less expensive to maintain/track).
      
Another possibility is to have a legacy pfifo_fast version that would
be used when acting as a child qdisc, since the parent qdisc needs
a spinlock anyway. But then, future lockless qdiscs would also
have the same problem.
      
      Fixes: 7e66016f ("net: sched: helpers to sum qlen and qlen for per cpu logic")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      3043bfe0
    • cpufreq: Use struct kobj_attribute instead of struct global_attr · 464b4279
      Viresh Kumar authored
      commit 625c85a62cb7d3c79f6e16de3cfa972033658250 upstream.
      
      The cpufreq_global_kobject is created using kobject_create_and_add()
      helper, which assigns the kobj_type as dynamic_kobj_ktype and show/store
      routines are set to kobj_attr_show() and kobj_attr_store().
      
      These routines pass struct kobj_attribute as an argument to the
      show/store callbacks. But all the cpufreq files created using the
      cpufreq_global_kobject expect the argument to be of type struct
      attribute. Things work fine currently as no one accesses the "attr"
      argument. We may not see issues even if the argument is used, as struct
kobj_attribute has struct attribute as its first element and so both
will get the same address.
      
But this is logically incorrect, and we should rather use struct
kobj_attribute instead of struct global_attr in the cpufreq core and
drivers, and the show/store callbacks should take struct kobj_attribute
as their argument instead.
      
This bug was caught by clang CFI builds of the Android kernel, which
catch mismatched function prototypes for such callbacks.
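
      Illustrative prototypes after the change (callback names are examples,
      not necessarily the exact set touched by the patch); kobj_attr_show()
      and kobj_attr_store() pass a struct kobj_attribute *, so the callbacks
      now match what they actually receive:

        static ssize_t show_boost(struct kobject *kobj,
                                  struct kobj_attribute *attr, char *buf);
        static ssize_t store_boost(struct kobject *kobj,
                                   struct kobj_attribute *attr,
                                   const char *buf, size_t count);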
Reported-by: Donghee Han <dh.han@samsung.com>
Reported-by: Sangkyu Kim <skwith.kim@samsung.com>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      464b4279
  3. 06 Mar 2019 (3 commits)
  4. 27 Feb 2019 (8 commits)
    • udlfb: handle unplug properly · c014cae8
      Mikulas Patocka authored
      commit 68a958a915ca912b8ce71b9eea7445996f6e681e upstream.
      
      The udlfb driver maintained an open count and cleaned up itself when the
      count reached zero. But the console is also counted in the reference count
      - so, if the user unplugged the device, the open count would not drop to
      zero and the driver stayed loaded with console attached. If the user
re-plugged the adapter, it would create a device /dev/fb1, show a green
screen, and access to the console would be lost.
      
The framebuffer subsystem has reference counting of its own - in order to
fix the unplug bug, we rely on the framebuffer reference counting. When the
user unplugs the adapter, we call unregister_framebuffer unconditionally.
unregister_framebuffer will unbind the console, wait until all users stop
using the framebuffer and then call the fb_destroy method. The fb_destroy
method cleans up the USB driver.
      
      This patch makes the following changes:
      * Drop dlfb->kref and rely on implicit framebuffer reference counting
        instead.
      * dlfb_usb_disconnect calls unregister_framebuffer, the rest of driver
        cleanup is done in the function dlfb_ops_destroy. dlfb_ops_destroy will
        be called by the framebuffer subsystem when no processes have the
        framebuffer open or mapped.
      * We don't use workqueue during initialization, but initialize directly
        from dlfb_usb_probe. The workqueue could race with dlfb_usb_disconnect
        and this racing would produce various kinds of memory corruption.
      * We use usb_get_dev and usb_put_dev to make sure that the USB subsystem
        doesn't free the device under us.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: Dave Airlie <airlied@redhat.com>
Cc: Bernie Thompson <bernie@plugable.com>
Cc: Ladislav Michl <ladis@linux-mips.org>
Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      c014cae8
    • net: avoid false positives in untrusted gso validation · c375152b
      Willem de Bruijn authored
      commit 9e8db5913264d3967b93c765a6a9e464d9c473db upstream.
      
      GSO packets with vnet_hdr must conform to a small set of gso_types.
      The below commit uses flow dissection to drop packets that do not.
      
      But it has false positives when the skb is not fully initialized.
      Dissection needs skb->protocol and skb->network_header.
      
      Infer skb->protocol from gso_type as the two must agree.
      SKB_GSO_UDP can use both ipv4 and ipv6, so try both.
      
Exclude callers for which the network header offset is not known.
      
      Fixes: d5be7f632bad ("net: validate untrusted gso packets without csum offload")
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      c375152b
    • net: validate untrusted gso packets without csum offload · e93384b1
      Willem de Bruijn authored
      commit d5be7f632bad0f489879eed0ff4b99bd7fe0b74c upstream.
      
Syzkaller again found a path to a kernel crash through bad gso input:
building an excessively large packet to cause an skb field to wrap.
      
      If VIRTIO_NET_HDR_F_NEEDS_CSUM was set this would have been dropped in
      skb_partial_csum_set.
      
      GSO packets that do not set checksum offload are suspicious and rare.
      Most callers of virtio_net_hdr_to_skb already pass them to
      skb_probe_transport_header.
      
Move that test forward, change it to detect parse failure and drop
packets on failure, as those clearly are not one of the legitimate
VIRTIO_NET_HDR_GSO types.
      
      Fixes: bfd5f4a3 ("packet: Add GSO/csum offload support.")
      Fixes: f43798c2 ("tun: Allow GSO using virtio_net_hdr")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      e93384b1
    • KEYS: user: Align the payload buffer · 390c7653
      Eric Biggers authored
      commit cc1780fc42c76c705dd07ea123f1143dc5057630 upstream.
      
      Align the payload of "user" and "logon" keys so that users of the
      keyrings service can access it as a struct that requires more than
      2-byte alignment.  fscrypt currently does this which results in the read
      of fscrypt_key::size being misaligned as it needs 4-byte alignment.
      
      Align to __alignof__(u64) rather than __alignof__(long) since in the
      future it's conceivable that people would use structs beginning with
      u64, which on some platforms would require more than 'long' alignment.
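
      The gist of the fix as a layout sketch (reconstructed):

        struct user_key_payload {
                struct rcu_head rcu;            /* RCU destructor */
                unsigned short  datalen;        /* length of this data */
                /* align so readers may overlay structs needing u64 alignment */
                char            data[] __aligned(__alignof__(u64));
        };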
Reported-by: Aaro Koskinen <aaro.koskinen@iki.fi>
Fixes: 2aa349f6 ("[PATCH] Keys: Export user-defined keyring operations")
Fixes: 88bd6ccd ("ext4 crypto: add encryption key management facilities")
Cc: stable@vger.kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Tested-by: Aaro Koskinen <aaro.koskinen@iki.fi>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: James Morris <james.morris@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      390c7653
    • inet_diag: fix reporting cgroup classid and fallback to priority · 589503cb
      Konstantin Khlebnikov authored
      [ Upstream commit 1ec17dbd90f8b638f41ee650558609c1af63dfa0 ]
      
Field idiag_ext in struct inet_diag_req_v2, used as a bitmap of requested
extensions, has only 8 bits. Thus extensions starting from DCTCPINFO
cannot be requested directly. Some of them are included in the response
unconditionally or hook into one of the lower 8 bits.

Extension INET_DIAG_CLASS_ID has had no way to be requested from the beginning.

This patch bundles it with INET_DIAG_TCLASS (IPv6 TOS), fixes the space
reservation, and documents the behavior for other extensions.

This patch also adds a fallback to reporting the socket priority. This field
is more widely used for traffic classification, because IPv4 sockets
automatically map TOS to priority and the default qdisc pfifo_fast knows
about that. But priority could be changed via setsockopt SO_PRIORITY, so
INET_DIAG_TOS isn't enough for predicting the class.

Also, cgroup2 obsoletes the net_cls classid (it is always zero), but we cannot
reuse this field for reporting the cgroup2 id because it is 64-bit (ino+gen).

So, after this patch INET_DIAG_CLASS_ID will report the socket priority
for the most common setup, when net_cls isn't set and/or cgroup2 is in use.
      
      Fixes: 0888e372 ("net: inet: diag: expose sockets cgroup classid")
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      589503cb
    • netfilter: nft_flow_offload: fix interaction with vrf slave device · 6d26c375
      wenxu authored
      [ Upstream commit 10f4e765879e514e1ce7f52ed26603047af196e2 ]
      
      In the forward chain, the iif is changed from slave device to master vrf
      device. Thus, flow offload does not find a match on the lower slave
      device.
      
This patch uses the cached route, i.e. dst->dev, to update the iif and
oif fields in the flow entry.
      
      After this patch, the following example works fine:
      
       # ip addr add dev eth0 1.1.1.1/24
       # ip addr add dev eth1 10.0.0.1/24
       # ip link add user1 type vrf table 1
       # ip l set user1 up
       # ip l set dev eth0 master user1
       # ip l set dev eth1 master user1
      
       # nft add table firewall
       # nft add flowtable f fb1 { hook ingress priority 0 \; devices = { eth0, eth1 } \; }
       # nft add chain f ftb-all {type filter hook forward priority 0 \; policy accept \; }
       # nft add rule f ftb-all ct zone 1 ip protocol tcp flow offload @fb1
       # nft add rule f ftb-all ct zone 1 ip protocol udp flow offload @fb1
Signed-off-by: wenxu <wenxu@ucloud.cn>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
      6d26c375
    • include/linux/compiler*.h: fix OPTIMIZER_HIDE_VAR · 4047a7ad
      Michael S. Tsirkin authored
      [ Upstream commit 3e2ffd655cc6a694608d997738989ff5572a8266 ]
      
      Since commit 815f0ddb ("include/linux/compiler*.h: make compiler-*.h
      mutually exclusive") clang no longer reuses the OPTIMIZER_HIDE_VAR macro
      from compiler-gcc - instead it gets the version in
include/linux/compiler.h.  Unfortunately that version doesn't actually
prevent the compiler from optimizing out the variable.
      
Fix up by moving the macro out of compiler-gcc.h into compiler.h.
Compilers without inline asm support will keep working
since it's protected by an ifdef.
      
      Also fix up comments to match reality since we are no longer overriding
      any macros.
      
      Build-tested with gcc and clang.
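
      The definition that now lives in compiler.h looks like this (per the
      commit; the fallback for compilers without inline asm sits behind a
      separate guard):

        /* Make the optimizer believe the variable can be manipulated
         * arbitrarily: the empty asm takes 'var' as both input and output. */
        #define OPTIMIZER_HIDE_VAR(var) \
                __asm__ ("" : "=r" (var) : "0" (var))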
      
      Fixes: 815f0ddb ("include/linux/compiler*.h: make compiler-*.h mutually exclusive")
      Cc: Eli Friedman <efriedma@codeaurora.org>
      Cc: Joe Perches <joe@perches.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
      4047a7ad
    • qed: Fix qed_chain_set_prod() for PBL chains with non power of 2 page count · 1fa0cf45
      Denis Bolotin authored
      [ Upstream commit 2d533a9287f2011632977e87ce2783f4c689c984 ]
      
      In PBL chains with non power of 2 page count, the producer is not at the
      beginning of the chain when index is 0 after a wrap. Therefore, after the
      producer index wrap around, page index should be calculated more carefully.
Signed-off-by: Denis Bolotin <dbolotin@marvell.com>
Signed-off-by: Ariel Elior <aelior@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
      1fa0cf45
  5. 23 Feb 2019 (4 commits)
    • ax25: fix possible use-after-free · eed11c69
      Eric Dumazet authored
      commit 63530aba7826a0f8e129874df9c4d264f9db3f9e upstream.
      
syzbot found that ax25 routes were not properly protected
against concurrent use [1].
      
      In this particular report the bug happened while
      copying ax25->digipeat.
      
      Fix this problem by making sure we call ax25_get_route()
      while ax25_route_lock is held, so that no modification
      could happen while using the route.
      
      The current two ax25_get_route() callers do not sleep,
      so this change should be fine.
      
      Once we do that, ax25_get_route() no longer needs to
      grab a reference on the found route.
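
      A sketch of the resulting calling pattern (the function and helper
      names here are illustrative, not necessarily those in the patch):
      the lookup and every use of the route happen under ax25_route_lock,
      so no reference on the route is needed.

        static int ax25_bind_route_example(ax25_address *addr)
        {
                ax25_route *ax25_rt;

                ax25_route_lock_use();          /* read_lock(&ax25_route_lock) */
                ax25_rt = ax25_get_route(addr, NULL);
                if (!ax25_rt) {
                        ax25_route_lock_unuse();
                        return -EHOSTUNREACH;
                }
                /* ... copy ax25_rt->digipeat etc. while the lock is held ... */
                ax25_route_lock_unuse();        /* read_unlock(&ax25_route_lock) */
                return 0;
        }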
      
      [1]
      ax25_connect(): syz-executor0 uses autobind, please contact jreuter@yaina.de
      BUG: KASAN: use-after-free in memcpy include/linux/string.h:352 [inline]
      BUG: KASAN: use-after-free in kmemdup+0x42/0x60 mm/util.c:113
      Read of size 66 at addr ffff888066641a80 by task syz-executor2/531
      
      ax25_connect(): syz-executor0 uses autobind, please contact jreuter@yaina.de
      CPU: 1 PID: 531 Comm: syz-executor2 Not tainted 5.0.0-rc2+ #10
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x1db/0x2d0 lib/dump_stack.c:113
       print_address_description.cold+0x7c/0x20d mm/kasan/report.c:187
       kasan_report.cold+0x1b/0x40 mm/kasan/report.c:317
       check_memory_region_inline mm/kasan/generic.c:185 [inline]
       check_memory_region+0x123/0x190 mm/kasan/generic.c:191
       memcpy+0x24/0x50 mm/kasan/common.c:130
       memcpy include/linux/string.h:352 [inline]
       kmemdup+0x42/0x60 mm/util.c:113
       kmemdup include/linux/string.h:425 [inline]
       ax25_rt_autobind+0x25d/0x750 net/ax25/ax25_route.c:424
       ax25_connect.cold+0x30/0xa4 net/ax25/af_ax25.c:1224
       __sys_connect+0x357/0x490 net/socket.c:1664
       __do_sys_connect net/socket.c:1675 [inline]
       __se_sys_connect net/socket.c:1672 [inline]
       __x64_sys_connect+0x73/0xb0 net/socket.c:1672
       do_syscall_64+0x1a3/0x800 arch/x86/entry/common.c:290
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      RIP: 0033:0x458099
      Code: 6d b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 3b b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00
      RSP: 002b:00007f870ee22c78 EFLAGS: 00000246 ORIG_RAX: 000000000000002a
      RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 0000000000458099
      RDX: 0000000000000048 RSI: 0000000020000080 RDI: 0000000000000005
      RBP: 000000000073bf00 R08: 0000000000000000 R09: 0000000000000000
      ax25_connect(): syz-executor4 uses autobind, please contact jreuter@yaina.de
      R10: 0000000000000000 R11: 0000000000000246 R12: 00007f870ee236d4
      R13: 00000000004be48e R14: 00000000004ce9a8 R15: 00000000ffffffff
      
      Allocated by task 526:
       save_stack+0x45/0xd0 mm/kasan/common.c:73
       set_track mm/kasan/common.c:85 [inline]
       __kasan_kmalloc mm/kasan/common.c:496 [inline]
       __kasan_kmalloc.constprop.0+0xcf/0xe0 mm/kasan/common.c:469
       kasan_kmalloc+0x9/0x10 mm/kasan/common.c:504
      ax25_connect(): syz-executor5 uses autobind, please contact jreuter@yaina.de
       kmem_cache_alloc_trace+0x151/0x760 mm/slab.c:3609
       kmalloc include/linux/slab.h:545 [inline]
       ax25_rt_add net/ax25/ax25_route.c:95 [inline]
       ax25_rt_ioctl+0x3b9/0x1270 net/ax25/ax25_route.c:233
       ax25_ioctl+0x322/0x10b0 net/ax25/af_ax25.c:1763
       sock_do_ioctl+0xe2/0x400 net/socket.c:950
       sock_ioctl+0x32f/0x6c0 net/socket.c:1074
       vfs_ioctl fs/ioctl.c:46 [inline]
       file_ioctl fs/ioctl.c:509 [inline]
       do_vfs_ioctl+0x107b/0x17d0 fs/ioctl.c:696
       ksys_ioctl+0xab/0xd0 fs/ioctl.c:713
       __do_sys_ioctl fs/ioctl.c:720 [inline]
       __se_sys_ioctl fs/ioctl.c:718 [inline]
       __x64_sys_ioctl+0x73/0xb0 fs/ioctl.c:718
       do_syscall_64+0x1a3/0x800 arch/x86/entry/common.c:290
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      ax25_connect(): syz-executor5 uses autobind, please contact jreuter@yaina.de
      Freed by task 550:
       save_stack+0x45/0xd0 mm/kasan/common.c:73
       set_track mm/kasan/common.c:85 [inline]
       __kasan_slab_free+0x102/0x150 mm/kasan/common.c:458
       kasan_slab_free+0xe/0x10 mm/kasan/common.c:466
       __cache_free mm/slab.c:3487 [inline]
       kfree+0xcf/0x230 mm/slab.c:3806
       ax25_rt_add net/ax25/ax25_route.c:92 [inline]
       ax25_rt_ioctl+0x304/0x1270 net/ax25/ax25_route.c:233
       ax25_ioctl+0x322/0x10b0 net/ax25/af_ax25.c:1763
       sock_do_ioctl+0xe2/0x400 net/socket.c:950
       sock_ioctl+0x32f/0x6c0 net/socket.c:1074
       vfs_ioctl fs/ioctl.c:46 [inline]
       file_ioctl fs/ioctl.c:509 [inline]
       do_vfs_ioctl+0x107b/0x17d0 fs/ioctl.c:696
       ksys_ioctl+0xab/0xd0 fs/ioctl.c:713
       __do_sys_ioctl fs/ioctl.c:720 [inline]
       __se_sys_ioctl fs/ioctl.c:718 [inline]
       __x64_sys_ioctl+0x73/0xb0 fs/ioctl.c:718
       do_syscall_64+0x1a3/0x800 arch/x86/entry/common.c:290
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      The buggy address belongs to the object at ffff888066641a80
       which belongs to the cache kmalloc-96 of size 96
      The buggy address is located 0 bytes inside of
       96-byte region [ffff888066641a80, ffff888066641ae0)
      The buggy address belongs to the page:
      page:ffffea0001999040 count:1 mapcount:0 mapping:ffff88812c3f04c0 index:0x0
      flags: 0x1fffc0000000200(slab)
      ax25_connect(): syz-executor4 uses autobind, please contact jreuter@yaina.de
      raw: 01fffc0000000200 ffffea0001817948 ffffea0002341dc8 ffff88812c3f04c0
      raw: 0000000000000000 ffff888066641000 0000000100000020 0000000000000000
      page dumped because: kasan: bad access detected
      
      Memory state around the buggy address:
       ffff888066641980: fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
       ffff888066641a00: 00 00 00 00 00 00 00 00 02 fc fc fc fc fc fc fc
      >ffff888066641a80: fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
                         ^
       ffff888066641b00: fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
       ffff888066641b80: 00 00 00 00 00 00 00 00 00 00 00 00 fc fc fc fc
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      eed11c69
    • net: Add header for usage of fls64() · 56e97e70
      David S. Miller authored
      [ Upstream commit 8681ef1f3d295bd3600315325f3b3396d76d02f6 ]
      
      Fixes: 3b89ea9c5902 ("net: Fix for_each_netdev_feature on Big endian")
Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
      56e97e70
    • net: ipv4: use a dedicated counter for icmp_v4 redirect packets · 1764111c
      Lorenzo Bianconi authored
      [ Upstream commit c09551c6ff7fe16a79a42133bcecba5fc2fc3291 ]
      
      According to the algorithm described in the comment block at the
      beginning of ip_rt_send_redirect, the host should try to send
      'ip_rt_redirect_number' ICMP redirect packets with an exponential
backoff and then stop sending them altogether, assuming that the destination
ignores redirects.
      If the device has previously sent some ICMP error packets that are
rate-limited (e.g. TTL expired) and continues to receive traffic,
      the redirect packets will never be transmitted. This happens since
      peer->rate_tokens will be typically greater than 'ip_rt_redirect_number'
      and so it will never be reset even if the redirect silence timeout
      (ip_rt_redirect_silence) has elapsed without receiving any packet
      requiring redirects.
      
Fix it by using a dedicated counter for the number of ICMP redirect
packets that have been sent by the host.

I have not been able to identify a given commit that introduced the
issue, since ip_rt_send_redirect implements the same rate-limiting
algorithm from commit 1da177e4 ("Linux-2.6.12-rc2").
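
      A sketch of the fixed logic in ip_rt_send_redirect(), with the
      dedicated peer->n_redirects counter (surrounding code simplified;
      details may differ from the actual patch):

        /* Too many unanswered redirects: assume the peer ignores them. */
        if (peer->n_redirects >= ip_rt_redirect_number) {
                peer->rate_last = jiffies;
                goto out_put_peer;
        }

        if (peer->rate_tokens == 0 ||
            time_after(jiffies,
                       peer->rate_last +
                       (ip_rt_redirect_load << peer->n_redirects))) {
                icmp_send(skb, ICMP_REDIRECT, ICMP_REDIR_HOST, gw);
                peer->rate_last = jiffies;
                ++peer->n_redirects;    /* only redirects bump this counter */
        }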
Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
      1764111c
    • net: Fix for_each_netdev_feature on Big endian · b2975c2e
      Hauke Mehrtens authored
      [ Upstream commit 3b89ea9c5902acccdbbdec307c85edd1bf52515e ]
      
The features attribute is of type u64 and stored in the native endianness of
the system. The for_each_set_bit() macro takes a pointer to a 32-bit array
and goes over the bits in this area. On little-endian systems this also
works with a u64, as the most significant bit is on the highest address,
but on big-endian the words are swapped. When we expect bit 15 here we get
bit 47 (15 + 32).

This patch converts it more or less to its own for_each_set_bit()
implementation which works on 64-bit integers directly. This is then
completely in host endianness and should work as expected.
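
      The 64-bit iteration, reconstructed from the description (it walks set
      bits from MSB to LSB via fls64(), independent of host endianness):

        /* Find the highest set feature bit strictly below 'start';
         * returns -1 when no bit remains. */
        static inline int find_next_netdev_feature(u64 feature, unsigned long start)
        {
                /* Clear all bits at position 'start' and above. */
                feature &= ~0ULL >> (-start & ((sizeof(feature) * 8) - 1));

                return fls64(feature) - 1;
        }

        #define for_each_netdev_feature(mask_addr, bit) \
                for ((bit) = find_next_netdev_feature(mask_addr, \
                                                      NETDEV_FEATURE_COUNT); \
                     (bit) >= 0; \
                     (bit) = find_next_netdev_feature(mask_addr, (bit)))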
      
      Fixes: fd867d51 ("net/core: generic support for disabling netdev features down stack")
Signed-off-by: Hauke Mehrtens <hauke.mehrtens@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
      b2975c2e
  6. 20 Feb 2019 (2 commits)
    • mmc: block: handle complete_work on separate workqueue · c4609e81
      Zachary Hays authored
      commit dcf6e2e38a1c7ccbc535de5e1d9b14998847499d upstream.
      
      The kblockd workqueue is created with the WQ_MEM_RECLAIM flag set.
      This generates a rescuer thread for that queue that will trigger when
      the CPU is under heavy load and collect the uncompleted work.
      
      In the case of mmc, this creates the possibility of a deadlock when
      there are multiple partitions on the device as other blk-mq work is
      also run on the same queue. For example:
      
      - worker 0 claims the mmc host to work on partition 1
      - worker 1 attempts to claim the host for partition 2 but has to wait
        for worker 0 to finish
      - worker 0 schedules complete_work to release the host
      - rescuer thread is triggered after time-out and collects the dangling
        work
      - rescuer thread attempts to complete the work in order starting with
        claim host
      - the task to release host is now blocked by a task to claim it and
        will never be called
      
      The above results in multiple hung tasks that lead to failures to
      mount partitions.
      
      Handling complete_work on a separate workqueue avoids this by keeping
      the work completion tasks separate from the other blk-mq work. This
      allows the host to be released without getting blocked by other tasks
      attempting to claim the host.
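
      A sketch of the approach (the workqueue flags and field placement are
      assumptions): completion work is queued on a dedicated workqueue
      instead of kblockd, so the kblockd rescuer can never serialize the
      host release behind a pending host claim.

        /* At probe time: dedicated queue for block request completion work. */
        card->complete_wq = alloc_workqueue("mmc_complete",
                                            WQ_MEM_RECLAIM | WQ_HIGHPRI, 0);

        /* On request completion, instead of kblockd_schedule_work(): */
        queue_work(mq->card->complete_wq, &mq->complete_work);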
Signed-off-by: Zachary Hays <zhays@lexmark.com>
Fixes: 81196976 ("mmc: block: Add blk-mq support")
Cc: <stable@vger.kernel.org>
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      c4609e81
    • perf/x86: Add check_period PMU callback · 74cbb754
      Jiri Olsa authored
      commit 81ec3f3c4c4d78f2d3b6689c9816bfbdf7417dbb upstream.
      
      Vince (and later on Ravi) reported crashes in the BTS code during
      fuzzing with the following backtrace:
      
        general protection fault: 0000 [#1] SMP PTI
        ...
        RIP: 0010:perf_prepare_sample+0x8f/0x510
        ...
        Call Trace:
         <IRQ>
         ? intel_pmu_drain_bts_buffer+0x194/0x230
         intel_pmu_drain_bts_buffer+0x160/0x230
         ? tick_nohz_irq_exit+0x31/0x40
         ? smp_call_function_single_interrupt+0x48/0xe0
         ? call_function_single_interrupt+0xf/0x20
         ? call_function_single_interrupt+0xa/0x20
         ? x86_schedule_events+0x1a0/0x2f0
         ? x86_pmu_commit_txn+0xb4/0x100
         ? find_busiest_group+0x47/0x5d0
         ? perf_event_set_state.part.42+0x12/0x50
         ? perf_mux_hrtimer_restart+0x40/0xb0
         intel_pmu_disable_event+0xae/0x100
         ? intel_pmu_disable_event+0xae/0x100
         x86_pmu_stop+0x7a/0xb0
         x86_pmu_del+0x57/0x120
         event_sched_out.isra.101+0x83/0x180
         group_sched_out.part.103+0x57/0xe0
         ctx_sched_out+0x188/0x240
         ctx_resched+0xa8/0xd0
         __perf_event_enable+0x193/0x1e0
         event_function+0x8e/0xc0
         remote_function+0x41/0x50
         flush_smp_call_function_queue+0x68/0x100
         generic_smp_call_function_single_interrupt+0x13/0x30
         smp_call_function_single_interrupt+0x3e/0xe0
         call_function_single_interrupt+0xf/0x20
         </IRQ>
      
The reason is that while the event init code does several checks
for BTS events and prevents several unwanted config bits for
BTS events (like precise_ip), the PERF_EVENT_IOC_PERIOD ioctl allows
creating a BTS event without those checks being done.

The following sequence will cause the crash:

If we create an 'almost' BTS event with precise_ip and callchains,
and then change it into a BTS event, it will crash the
perf_prepare_sample() function, because precise_ip events are
expected to come in with callchain data initialized, but that's not
the case for the intel_pmu_drain_bts_buffer() caller.

Add a check_period callback to be called before the period
is changed via PERF_EVENT_IOC_PERIOD. It will deny the change
if the event would become BTS. Also add the limit_period
check.
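
      A sketch of the x86 wiring (reconstructed; the core calls the new
      pmu::check_period method from the PERF_EVENT_IOC_PERIOD path before
      the period is updated):

        static int x86_pmu_check_period(struct perf_event *event, u64 value)
        {
                /* Reject changes that would turn the event into a BTS event. */
                if (x86_pmu.check_period && x86_pmu.check_period(event, value))
                        return -EINVAL;

                /* Also honor the PMU's limit_period constraint. */
                if (value && x86_pmu.limit_period) {
                        if (x86_pmu.limit_period(event, value) > value)
                                return -EINVAL;
                }

                return 0;
        }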
Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: <stable@vger.kernel.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
      Cc: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20190204123532.GA4794@krava
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      74cbb754
  7. 15 Feb 2019 (1 commit)
    • SUNRPC: Always drop the XPRT_LOCK on XPRT_CLOSE_WAIT · 2440f3ce
      Benjamin Coddington authored
      This patch is only appropriate for stable kernels v4.16 - v4.19
      
      Since commit 9b30889c ("SUNRPC: Ensure we always close the socket after
      a connection shuts down"), and until commit c544577daddb ("SUNRPC: Clean up
      transport write space handling"), it is possible for the NFS client to spin
      in the following tight loop:
      
      269.964083: rpc_task_run_action: task:43@0 flags=5a81 state=0005 status=0 action=call_bind [sunrpc]
      269.964083: rpc_task_run_action: task:43@0 flags=5a81 state=0005 status=0 action=call_connect [sunrpc]
      269.964083: rpc_task_run_action: task:43@0 flags=5a81 state=0005 status=0 action=call_transmit [sunrpc]
      269.964085: xprt_transmit: peer=[10.0.1.82]:2049 xid=0x761d3f77 status=-32
      269.964085: rpc_task_run_action: task:43@0 flags=5a81 state=0005 status=-32 action=call_transmit_status [sunrpc]
      269.964085: rpc_task_run_action: task:43@0 flags=5a81 state=0005 status=-32 action=call_status [sunrpc]
      269.964085: rpc_call_status: task:43@0 status=-32
      
      The issue is that the path through call_transmit_status does not release
      the XPRT_LOCK when the transmit result is -EPIPE, so the socket cannot be
      properly shut down.
      
      The below commit fixed things up in mainline by unconditionally calling
      xprt_end_transmit() and releasing the XPRT_LOCK after every pass through
      call_transmit.  However, the entirety of this commit is not appropriate for
      stable kernels because its original inclusion was part of a series that
      modifies the sunrpc code to use a different queueing model.  As a result,
      there are machinations within this patch that are not needed for a stable
      fix and will not make sense without a larger backport of the mainline
      series.
      
      In this patch, we take the slightly modified bit of the mainline patch
      below, which is to release the XPRT_LOCK on transmission error should we
      detect that the transport is waiting to close.
      
      commit c544577daddb618c7dd5fa7fb98d6a41782f020e upstream
      Author: Trond Myklebust <trond.myklebust@hammerspace.com>
      Date:   Mon Sep 3 23:39:27 2018 -0400
      
          SUNRPC: Clean up transport write space handling
      
          Treat socket write space handling in the same way we now treat transport
          congestion: by denying the XPRT_LOCK until the transport signals that it
          has free buffer space.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
      
      The original discussion of the problem is here:
      
          https://lore.kernel.org/linux-nfs/20181212135157.4489-1-dwysocha@redhat.com/T/#t
      
      This passes my usual cthon and xfstests on NFS as applied on v4.19 mainline.
Reported-by: Dave Wysochanski <dwysocha@redhat.com>
Suggested-by: Trond Myklebust <trondmy@hammerspace.com>
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      2440f3ce
  8. 13 Feb 2019 (7 commits)
  9. 07 Feb 2019 (3 commits)
    • of: overlay: add tests to validate kfrees from overlay removal · 5006496f
      Frank Rowand authored
      commit 144552c786925314c1e7cb8f91a71dae1aca8798 upstream.
      
      Add checks:
        - attempted kfree due to refcount reaching zero before overlay
          is removed
        - properties linked to an overlay node when the node is removed
        - node refcount > one during node removal in a changeset destroy,
          if the node was created by the changeset
      
      After applying this patch, several validation warnings will be
      reported from the devicetree unittest during boot due to
      pre-existing devicetree bugs. The warnings will be similar to:
      
        OF: ERROR: of_node_release(), unexpected properties in /testcase-data/overlay-node/test-bus/test-unittest11
OF: ERROR: memory leak, expected refcount 1 instead of 2, of_node_get()/of_node_put() unbalanced - destroy cset entry: attach overlay node /testcase-data-2/substation@100/hvac-medium-2
Tested-by: Alan Tull <atull@kernel.org>
Signed-off-by: Frank Rowand <frank.rowand@sony.com>
Cc: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      5006496f
    • oom, oom_reaper: do not enqueue same task twice · 7e70ddc3
      Tetsuo Handa authored
      commit 9bcdeb51bd7d2ae9fe65ea4d60643d2aeef5bfe3 upstream.
      
Arkadiusz reported that enabling memcg's group oom killing causes
strange memcg statistics where there is no task in a memcg despite the
number of tasks in that memcg not being 0.  It turned out that there is a
bug in wake_oom_reaper() which allows enqueuing the same task twice, which
makes it impossible to decrease the number of tasks in that memcg due to a
refcount leak.

This bug has existed since the OOM reaper became invokable from the
task_will_free_mem(current) path in out_of_memory() in Linux 4.7:
      
        T1@P1     |T2@P1     |T3@P1     |OOM reaper
        ----------+----------+----------+------------
                                         # Processing an OOM victim in a different memcg domain.
                              try_charge()
                                mem_cgroup_out_of_memory()
                                  mutex_lock(&oom_lock)
                   try_charge()
                     mem_cgroup_out_of_memory()
                       mutex_lock(&oom_lock)
        try_charge()
          mem_cgroup_out_of_memory()
            mutex_lock(&oom_lock)
                                  out_of_memory()
                                    oom_kill_process(P1)
                                      do_send_sig_info(SIGKILL, @P1)
                                      mark_oom_victim(T1@P1)
                                      wake_oom_reaper(T1@P1) # T1@P1 is enqueued.
                                  mutex_unlock(&oom_lock)
                       out_of_memory()
                         mark_oom_victim(T2@P1)
                         wake_oom_reaper(T2@P1) # T2@P1 is enqueued.
                       mutex_unlock(&oom_lock)
            out_of_memory()
              mark_oom_victim(T1@P1)
              wake_oom_reaper(T1@P1) # T1@P1 is enqueued again due to oom_reaper_list == T2@P1 && T1@P1->oom_reaper_list == NULL.
            mutex_unlock(&oom_lock)
                                         # Completed processing an OOM victim in a different memcg domain.
                                         spin_lock(&oom_reaper_lock)
                                         # T1P1 is dequeued.
                                         spin_unlock(&oom_reaper_lock)
      
      but memcg's group oom killing made it easier to trigger this bug by
      calling wake_oom_reaper() on the same task from one out_of_memory()
      request.
      
Fix this bug using the approach of commit 855b0183 ("oom,
oom_reaper: disable oom_reaper for oom_kill_allocating_task").  As a
side effect, this patch also avoids enqueuing multiple
threads sharing memory via the task_will_free_mem(current) path.
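
      A sketch of the fix (close to the upstream patch; the per-mm flag
      makes the enqueue idempotent):

        static void wake_oom_reaper(struct task_struct *tsk)
        {
                /* mm already queued? A second enqueue would corrupt
                 * oom_reaper_list and leak a task reference. */
                if (test_and_set_bit(MMF_OOM_REAP_QUEUED,
                                     &tsk->signal->oom_mm->flags))
                        return;

                get_task_struct(tsk);

                spin_lock(&oom_reaper_lock);
                tsk->oom_reaper_list = oom_reaper_list;
                oom_reaper_list = tsk;
                spin_unlock(&oom_reaper_lock);
                wake_up(&oom_reaper_wait);
        }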
      
      Link: http://lkml.kernel.org/r/e865a044-2c10-9858-f4ef-254bc71d6cc2@i-love.sakura.ne.jp
      Link: http://lkml.kernel.org/r/5ee34fc6-1485-34f8-8790-903ddabaa809@i-love.sakura.ne.jp
      Fixes: af8e15cc ("oom, oom_reaper: do not enqueue task if it is on the oom_reaper_list head")
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reported-by: Arkadiusz Miskiewicz <arekm@maven.pl>
Tested-by: Arkadiusz Miskiewicz <arekm@maven.pl>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Roman Gushchin <guro@fb.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Aleksa Sarai <asarai@suse.de>
      Cc: Jay Kamat <jgkamat@fb.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      7e70ddc3
    • ipvlan, l3mdev: fix broken l3s mode wrt local routes · fcc9c69a
      Daniel Borkmann authored
      [ Upstream commit d5256083f62e2720f75bb3c5a928a0afe47d6bc3 ]
      
      While implementing ipvlan l3 and l3s mode for kubernetes CNI plugin,
      I ran into the issue that while l3 mode is working fine, l3s mode
      does not have any connectivity to kube-apiserver and hence all pods
      end up in Error state as well. The ipvlan master device sits on
      top of a bond device and hostns traffic to kube-apiserver (also running
      in hostns) is DNATed from 10.152.183.1:443 to 139.178.29.207:37573
      where the latter is the address of the bond0. While in l3 mode, a
      curl to https://10.152.183.1:443 or to https://139.178.29.207:37573
      works fine from hostns, neither of them do in case of l3s. In the
      latter only a curl to https://127.0.0.1:37573 appeared to work where
      for local addresses of bond0 I saw kernel suddenly starting to emit
      ARP requests to query HW address of bond0 which remained unanswered
      and neighbor entries in INCOMPLETE state. These ARP requests only
      happen while in l3s.
      
      Debugging this further, I found the issue is that l3s mode is piggy-
      backing on l3 master device, and in this case local routes are using
      l3mdev_master_dev_rcu(dev) instead of net->loopback_dev as per commit
      f5a0aab8 ("net: ipv4: dst for local input routes should use l3mdev
      if relevant") and 5f02ce24 ("net: l3mdev: Allow the l3mdev to be
      a loopback"). I found that reverting them back into using the
      net->loopback_dev fixed ipvlan l3s connectivity and got everything
      working for the CNI.
      
      Now judging from 4fbae7d8 ("ipvlan: Introduce l3s mode") and the
      l3mdev paper in [0] the only sole reason why ipvlan l3s is relying
      on l3 master device is to get the l3mdev_ip_rcv() receive hook for
      setting the dst entry of the input route without adding its own
      ipvlan specific hacks into the receive path, however, any l3 domain
      semantics beyond just that are breaking l3s operation. Note that
      ipvlan also has the ability to dynamically switch its internal
      operation from l3 to l3s for all ports via ipvlan_set_port_mode()
at runtime. In any case, l3 vs l3s solely distinguishes itself by
'de-confusing' netfilter through switching skb->dev to the ipvlan slave
device late in NF_INET_LOCAL_IN before handing the skb to L4.
      
      Minimal fix taken here is to add a IFF_L3MDEV_RX_HANDLER flag which,
      if set from ipvlan setup, gets us only the wanted l3mdev_l3_rcv() hook
      without any additional l3mdev semantics on top. This should also have
      minimal impact since dev->priv_flags is already hot in cache. With
      this set, l3s mode is working fine and I also get things like
      masquerading pod traffic on the ipvlan master properly working.
      
        [0] https://netdevconf.org/1.2/papers/ahern-what-is-l3mdev-paper.pdf
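
      A sketch of the minimal fix (the flag value and helper name are
      reconstructed from the description): a new priv_flag opts a device
      into only the l3mdev receive hook, with no other l3 master semantics.

        /* Device wants l3mdev_l3_rcv() but no other l3 master device
         * semantics (set by ipvlan in l3s mode). */
        #define IFF_L3MDEV_RX_HANDLER   (1 << 29)

        static inline bool netif_has_l3_rx_handler(const struct net_device *dev)
        {
                return dev->priv_flags & IFF_L3MDEV_RX_HANDLER;
        }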
      
      Fixes: f5a0aab8 ("net: ipv4: dst for local input routes should use l3mdev if relevant")
      Fixes: 5f02ce24 ("net: l3mdev: Allow the l3mdev to be a loopback")
      Fixes: 4fbae7d8 ("ipvlan: Introduce l3s mode")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Mahesh Bandewar <maheshb@google.com>
Cc: David Ahern <dsa@cumulusnetworks.com>
Cc: Florian Westphal <fw@strlen.de>
Cc: Martynas Pumputis <m@lambda.lt>
Acked-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      fcc9c69a
  10. 31 Jan 2019 (5 commits)
    • Input: input_event - fix the CONFIG_SPARC64 mixup · 3a3b6a6b
      Deepa Dinamani authored
      commit 141e5dcaa7356077028b4cd48ec351a38c70e5e5 upstream.
      
      Arnd Bergmann pointed out that CONFIG_* cannot be used in a uapi header.
      Override with an equivalent conditional.
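
      The equivalent conditional, in sketch form: CONFIG_SPARC64 is
      invisible to userspace, so the uapi header keys off compiler-provided
      macros instead (assumed to be the test used):

        /* Not usable in a uapi header:  #ifdef CONFIG_SPARC64 */
        #if defined(__sparc__) && defined(__arch64__)
        /* sparc64-specific input_event timestamp handling goes here */
        #endif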
      
      Fixes: 2e746942ebac ("Input: input_event - provide override for sparc64")
      Fixes: 152194fe ("Input: extend usable life of event timestamps to 2106 on 32 bit systems")
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      3a3b6a6b
    • bpf: fix sanitation of alu op with pointer / scalar type from different paths · eed84f94
      Daniel Borkmann authored
      [ commit d3bd7413e0ca40b60cf60d4003246d067cafdeda upstream ]
      
      While 979d63d50c0c ("bpf: prevent out of bounds speculation on pointer
      arithmetic") took care of rejecting alu op on pointer when e.g. pointer
      came from two different map values with different map properties such as
      value size, Jann reported that a case was not covered yet when a given
      alu op is used in both "ptr_reg += reg" and "numeric_reg += reg" from
      different branches where we would incorrectly try to sanitize based
      on the pointer's limit. Catch this corner case and reject the program
      instead.
      
      Fixes: 979d63d50c0c ("bpf: prevent out of bounds speculation on pointer arithmetic")
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
      eed84f94
    • bpf: prevent out of bounds speculation on pointer arithmetic · f92a819b
      Daniel Borkmann authored
      [ commit 979d63d50c0c0f7bc537bf821e056cc9fe5abd38 upstream ]
      
      Jann reported that the original commit back in b2157399
      ("bpf: prevent out-of-bounds speculation") was not sufficient
to stop the CPU from speculating out of bounds memory access:
      While b2157399 only focussed on masking array map access
      for unprivileged users for tail calls and data access such
      that the user provided index gets sanitized from BPF program
      and syscall side, there is still a more generic form affected
      from BPF programs that applies to most maps that hold user
      data in relation to dynamic map access when dealing with
      unknown scalars or "slow" known scalars as access offset, for
      example:
      
        - Load a map value pointer into R6
        - Load an index into R7
        - Do a slow computation (e.g. with a memory dependency) that
          loads a limit into R8 (e.g. load the limit from a map for
          high latency, then mask it to make the verifier happy)
        - Exit if R7 >= R8 (mispredicted branch)
        - Load R0 = R6[R7]
        - Load R0 = R6[R0]
      
      For unknown scalars there are two options in the BPF verifier
      where we could derive knowledge from in order to guarantee
      safe access to the memory: i) While </>/<=/>= variants won't
      allow to derive any lower or upper bounds from the unknown
      scalar where it would be safe to add it to the map value
      pointer, it is possible through ==/!= test however. ii) another
      option is to transform the unknown scalar into a known scalar,
      for example, through ALU ops combination such as R &= <imm>
      followed by R |= <imm> or any similar combination where the
      original information from the unknown scalar would be destroyed
      entirely leaving R with a constant. The initial slow load still
      precedes the latter ALU ops on that register, so the CPU
      executes speculatively from that point. Once we have the known
      scalar, any compare operation would work then. A third option
      only involving registers with known scalars could be crafted
      as described in [0] where a CPU port (e.g. Slow Int unit)
      would be filled with many dependent computations such that
      the subsequent condition depending on its outcome has to wait
      for evaluation on its execution port and thereby executing
      speculatively if the speculated code can be scheduled on a
      different execution port, or any other form of mistraining
      as described in [1], for example. Given this is not limited
      to only unknown scalars, not only map but also stack access
      is affected since both is accessible for unprivileged users
      and could potentially be used for out of bounds access under
      speculation.
      
      In order to prevent any of these cases, the verifier is now
      sanitizing pointer arithmetic on the offset such that any
      out of bounds speculation would be masked in a way where the
      pointer arithmetic result in the destination register will
      stay unchanged, meaning offset masked into zero similar as
      in array_index_nospec() case. With regards to implementation,
      there are three options that were considered: i) new insn
      for sanitation, ii) push/pop insn and sanitation as inlined
      BPF, iii) reuse of ax register and sanitation as inlined BPF.
      
      Option i) has the downside that we end up using from reserved
      bits in the opcode space, but also that we would require
      each JIT to emit masking as native arch opcodes meaning
      mitigation would have slow adoption till everyone implements
      it eventually which is counter-productive. Option ii) and iii)
      have both in common that a temporary register is needed in
      order to implement the sanitation as inlined BPF since we
      are not allowed to modify the source register. While a push /
      pop insn in ii) would be useful to have in any case, it
      requires once again that every JIT needs to implement it
      first. While possible, amount of changes needed would also
      be unsuitable for a -stable patch. Therefore, the path which
      has fewer changes, less BPF instructions for the mitigation
      and does not require anything to be changed in the JITs is
      option iii) which this work is pursuing. The ax register is
      already mapped to a register in all JITs (modulo arm32 where
      it's mapped to stack as various other BPF registers there)
      and used in constant blinding for JITs-only so far. It can
      be reused for verifier rewrites under certain constraints.
      The interpreter's tmp "register" has therefore been remapped
      into extending the register set with hidden ax register and
      reusing that for a number of instructions that needed the
      prior temporary variable internally (e.g. div, mod). This
      allows for zero increase in stack space usage in the interpreter,
      and enables (restricted) generic use in rewrites otherwise as
      long as such a patchlet does not make use of these instructions.
      The sanitation mask is dynamic and relative to the offset the
      map value or stack pointer currently holds.
      
      There are various cases that need to be taken under consideration
      for the masking, e.g. such operation could look as follows:
      ptr += val or val += ptr or ptr -= val. Thus, the value to be
      sanitized could reside either in source or in destination
      register, and the limit is different depending on whether
      the ALU op is addition or subtraction and depending on the
      current known and bounded offset. The limit is derived as
      follows: limit := max_value_size - (smin_value + off). For
      subtraction: limit := umax_value + off. This holds because
      we do not allow any pointer arithmetic that would
      temporarily go out of bounds or would have an unknown
      value with mixed signed bounds where it is unclear at
      verification time whether the actual runtime value would
      be either negative or positive. For example, we have a
      derived map pointer value with constant offset and bounded
      one, so limit based on smin_value works because the verifier
      requires that statically analyzed arithmetic on the pointer
      must be in bounds, and thus it checks if resulting
      smin_value + off and umax_value + off is still within map
      value bounds at time of arithmetic in addition to time of
      access. Similarly, for the case of stack access we derive
      the limit as follows: MAX_BPF_STACK + off for subtraction
      and -off for the case of addition where off := ptr_reg->off +
      ptr_reg->var_off.value. Subtraction is a special case for
      the masking which can be in form of ptr += -val, ptr -= -val,
      or ptr -= val. In the first two cases where we know that
      the value is negative, we need to temporarily negate the
      value in order to do the sanitation on a positive value
      where we later swap the ALU op, and restore original source
      register if the value was in source.
      
      The sanitation of pointer arithmetic alone is still not fully
      sufficient as is, since a scenario like the following could
      happen ...
      
        PTR += 0x1000 (e.g. K-based imm)
        PTR -= BIG_NUMBER_WITH_SLOW_COMPARISON
        PTR += 0x1000
        PTR -= BIG_NUMBER_WITH_SLOW_COMPARISON
        [...]
      
      ... which under speculation could end up as ...
      
        PTR += 0x1000
        PTR -= 0 [ truncated by mitigation ]
        PTR += 0x1000
        PTR -= 0 [ truncated by mitigation ]
        [...]
      
      ... and therefore still access out of bounds. To prevent such
      a case, the verifier also analyzes safety for potential out
      of bounds access under speculative execution, meaning it
      also simulates pointer access under truncation. We therefore
      "branch off" and push the current verification state after the
      ALU operation with a known value of 0 onto the verification
      stack for later analysis. Given that the current path analysis
      succeeded, it is likely that the one under speculation can be
      pruned. In any case, it is also subject to the existing
      complexity limits, and anything beyond this point will be
      rejected. In terms of pruning, it needs to be ensured that a
      verification state from the speculative execution simulation
      never prunes a non-speculative execution path; therefore, we
      mark the verifier state accordingly at the time of push_stack().
      If the verifier detects an out of bounds access under
      speculative execution on one of the possible paths that
      includes a truncation, it will reject such a program.
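
      A condensed sketch of how that speculative path gets queued is
      shown below. It is modeled on the flow described above rather
      than copied verbatim from the patch; push_stack() is the
      verifier's existing helper for queueing a state, while tmp,
      ptr_is_dst_reg and the surrounding context are simplified:

        /* Hedged sketch: besides patching in the masking sequence,
         * additionally verify the path on which the masked value is
         * 0, i.e. what a misspeculated truncation would execute. If
         * the scalar sits in dst, temporarily move ptr there so the
         * simulated operation is dst (== 0) +/-= ptr. The final
         * argument marks the pushed state as speculative, so it can
         * never be used to prune a non-speculative path. */
        if (!ptr_is_dst_reg) {
                tmp = *dst_reg;
                *dst_reg = *ptr_reg;
        }
        ret = push_stack(env, env->insn_idx + 1, env->insn_idx,
                         true /* speculative */);
        if (!ptr_is_dst_reg)
                *dst_reg = tmp;
        return !ret ? -EFAULT : 0;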
      
      Given that we mask every reg-based pointer arithmetic for
      unprivileged programs, we've been looking into how it could
      affect real-world programs in terms of size increase. As the
      majority of programs are targeted at privileged-only use
      cases, we've unconditionally enabled masking (with its alu
      restrictions on top of it) for privileged programs for the
      sake of testing, in order to check i) whether they get rejected
      in their current form, and ii) by how much the number of
      instructions and the size will increase. We've tested this
      with Katran, Cilium and test_l4lb from the kernel selftests.
      For Katran we've evaluated balancer_kern.o, for Cilium
      bpf_lxc.o and an older test object bpf_lxc_opt_-DUNKNOWN.o,
      and for l4lb we've used test_l4lb.o as well as
      test_l4lb_noinline.o. We found that none of the programs got
      rejected by the verifier with this change, and that the impact
      is rather minimal to none.
      balancer_kern.o had 13,904 bytes (1,738 insns) xlated and
      7,797 bytes JITed before and after the change. The most complex
      program in bpf_lxc.o had 30,544 bytes (3,817 insns) xlated
      and 18,538 bytes JITed before and after and none of the other
      tail call programs in bpf_lxc.o had any changes either. For
      the older bpf_lxc_opt_-DUNKNOWN.o object we found a small
      increase from 20,616 bytes (2,576 insns) and 12,536 bytes JITed
      before to 20,664 bytes (2,582 insns) and 12,558 bytes JITed
      after the change. Other programs from that object file had a
      similarly small increase. test_l4lb.o had no change and
      remained at 6,544 bytes (817 insns) xlated and 3,401 bytes
      JITed, and test_l4lb_noinline.o stayed constant at 5,080 bytes
      (634 insns) xlated and 3,313 bytes JITed. This can be explained
      by the fact that LLVM typically optimizes stack based pointer
      arithmetic into K-based operations, and that dynamic map access
      is not overly frequent. However, in the future we may decide to
      optimize the algorithm further under known guarantees from
      branch and value speculation. The latter also seems unclear in
      terms of the prediction heuristics that today's CPUs apply, as
      well as whether there could be collisions in e.g. the predictor's
      Value History/Pattern Table that trigger out of bounds access;
      thus masking is performed unconditionally at this point, but could
      be subject to relaxation later on. We were also generally
      brainstorming various other approaches for mitigation, but the
      blocker was always the lack of available registers at runtime
      and/or the overhead of runtime tracking of limits belonging to a
      specific pointer. Thus, we found this approach to be minimally
      intrusive under the given constraints.
      
      With that in place, a simple example of a sanitized access on an
      unprivileged load at post-verification time looks as follows:
      
        # bpftool prog dump xlated id 282
        [...]
        28: (79) r1 = *(u64 *)(r7 +0)
        29: (79) r2 = *(u64 *)(r7 +8)
        30: (57) r1 &= 15
        31: (79) r3 = *(u64 *)(r0 +4608)
        32: (57) r3 &= 1
        33: (47) r3 |= 1
        34: (2d) if r2 > r3 goto pc+19
        35: (b4) (u32) r11 = (u32) 20479  |
        36: (1f) r11 -= r2                | Dynamic sanitation for pointer
        37: (4f) r11 |= r2                | arithmetic with registers
        38: (87) r11 = -r11               | containing bounded or known
        39: (c7) r11 s>>= 63              | scalars in order to prevent
        40: (5f) r11 &= r2                | out of bounds speculation.
        41: (0f) r4 += r11                |
        42: (71) r4 = *(u8 *)(r4 +0)
        43: (6f) r4 <<= r1
        [...]
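
      Expressed as plain C, instructions 35-40 above compute a
      branch-free mask: the offset in r2 survives only if it lies
      within the verified limit, and collapses to 0 otherwise. The
      following is an equivalent illustration of that semantics, not
      code taken from the patch:

        #include <stdint.h>

        static uint64_t sanitize(uint64_t val, uint64_t limit)
        {
                uint64_t m = limit - val; /* sign bit set if val > limit   */

                m |= val;                 /* or if val itself has bit 63   */
                m = -m;                   /* <= 0 iff val was within range */
                m = (uint64_t)((int64_t)m >> 63); /* all-ones iff m < 0    */
                return val & m;           /* out-of-range val becomes 0    */
        }

      Thus the subsequent r4 += r11 at instruction 41 can never advance
      the pointer beyond the verified bounds, even if the program's own
      bounds check at instruction 34 is bypassed speculatively.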
      
      For the case where the scalar sits in the destination register
      as opposed to the source register, the following code is emitted
      for the above example:
      
        [...]
        16: (b4) (u32) r11 = (u32) 20479
        17: (1f) r11 -= r2
        18: (4f) r11 |= r2
        19: (87) r11 = -r11
        20: (c7) r11 s>>= 63
        21: (5f) r2 &= r11
        22: (0f) r2 += r0
        23: (61) r0 = *(u32 *)(r2 +0)
        [...]
      
      JIT blinding example with non-conflicting use of r10:
      
        [...]
         d5:	je     0x0000000000000106    _
         d7:	mov    0x0(%rax),%edi       |
         da:	mov    $0xf153246,%r10d     | Index load from map value and
         e0:	xor    $0xf153259,%r10      | (const blinded) mask with 0x1f.
         e7:	and    %r10,%rdi            |_
         ea:	mov    $0x2f,%r10d          |
         f0:	sub    %rdi,%r10            | Sanitized addition. Both use r10
         f3:	or     %rdi,%r10            | but do not interfere with each
         f6:	neg    %r10                 | other. (Neither do these instructions
         f9:	sar    $0x3f,%r10           | interfere with the use of ax as temp
         fd:	and    %r10,%rdi            | in interpreter.)
        100:	add    %rax,%rdi            |_
        103:	mov    0x0(%rdi),%eax
       [...]
      
      Tested that it fixes Jann's reproducer, and also checked that the
      test_verifier and test_progs suites run successfully with the
      interpreter, the JIT, and the JIT with hardening enabled on
      x86-64 and arm64.
      
        [0] Speculose: Analyzing the Security Implications of Speculative
            Execution in CPUs, Giorgi Maisuradze and Christian Rossow,
            https://arxiv.org/pdf/1801.04084.pdf
      
        [1] A Systematic Evaluation of Transient Execution Attacks and
            Defenses, Claudio Canella, Jo Van Bulck, Michael Schwarz,
            Moritz Lipp, Benjamin von Berg, Philipp Ortner, Frank Piessens,
            Dmitry Evtyushkin, Daniel Gruss,
            https://arxiv.org/pdf/1811.05441.pdf
      
      Fixes: b2157399 ("bpf: prevent out-of-bounds speculation")
      Reported-by: Jann Horn <jannh@google.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      f92a819b
    •
      bpf: enable access to ax register also from verifier rewrite · 232ac70d
      Daniel Borkmann committed
      [ commit 9b73bfdd08e73231d6a90ae6db4b46b3fbf56c30 upstream ]
      
      Right now we are using the BPF ax register in the JIT for constant
      blinding as well as in the interpreter as a temporary variable. The
      verifier will not be able to use it, simply because its use would get
      overridden by the former in bpf_jit_blind_insn(). However, it can be
      made to work in that blinding will be skipped if there is a prior use
      of ax in either the source or destination register of the instruction.
      Taking these constraints of ax into account, the verifier is then
      free to use it in rewrites. Note, the ax register already has
      mappings in every eBPF JIT.
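
      A minimal sketch of that skip condition, condensed from the
      behavior described above (BPF_REG_AX is the kernel's identifier
      for the hidden register; the surrounding bpf_jit_blind_insn()
      body is elided):

        /* If the instruction already reads or writes the hidden ax
         * register, blinding must back off, because the blinding
         * sequence itself needs ax as scratch space. */
        if (from->dst_reg == BPF_REG_AX || from->src_reg == BPF_REG_AX)
                goto out; /* emit the instruction without blinding */
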
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      232ac70d
    •
      bpf: move tmp variable into ax register in interpreter · b855e310
      Daniel Borkmann committed
      [ commit 144cd91c4c2bced6eb8a7e25e590f6618a11e854 upstream ]
      
      This change moves the on-stack 64 bit tmp variable in ___bpf_prog_run()
      into the hidden ax register. The latter is currently only used in JITs
      for constant blinding as a temporary scratch register, meaning the BPF
      interpreter will never see the use of ax. Therefore it is safe to use
      it for the cases where tmp has been used earlier. This is needed in
      order to later allow restricted, hidden use of ax in both the
      interpreter and the JITs.
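
      A condensed sketch of the idea, modeled on ___bpf_prog_run() in
      kernel/bpf/core.c (the ALU64_MOD_X case is illustrative, not the
      complete patch):

        /* The scratch value now lives in the hidden AX slot of the
         * interpreter's register file instead of an on-stack u64 tmp;
         * eBPF programs can never address BPF_REG_AX directly. */
        #define AX      regs[BPF_REG_AX]

                ALU64_MOD_X:
                        div64_u64_rem(DST, SRC, &AX); /* AX replaces tmp */
                        DST = AX;
                        CONT;
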
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      b855e310