1. 20 May 2020, 1 commit
  2. 15 May 2020, 4 commits
    • x86: Fix early boot crash on gcc-10, third try · a9a3ed1e
      Borislav Petkov authored
      ... or the odyssey of trying to disable the stack protector for the
      function which generates the stack canary value.
      
      The whole story started with Sergei reporting a boot crash with a kernel
      built with gcc-10:
      
        Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: start_secondary
        CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.6.0-rc5-00235-gfffb08b3 #139
        Hardware name: Gigabyte Technology Co., Ltd. To be filled by O.E.M./H77M-D3H, BIOS F12 11/14/2013
        Call Trace:
          dump_stack
          panic
          ? start_secondary
          __stack_chk_fail
          start_secondary
          secondary_startup_64
        ---[ end Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: start_secondary
      
      This happens because gcc-10 tail-call optimizes the last function call
      in start_secondary() - cpu_startup_entry() - and thus emits a stack
      canary check which fails because the canary value changes after the
      boot_init_stack_canary() call.
      
      To fix that, the initial attempt was to mark the one function which
      generates the stack canary with:
      
        __attribute__((optimize("-fno-stack-protector"))) ... start_secondary(void *unused)
      
      However, the optimize attribute is not cumulative: it does not add to
      the previously supplied optimization options but replaces them -
      roughly all -fxxx options.
      
      The key one among them is -fno-omit-frame-pointer; losing it leaves the
      function without a frame pointer, which the kernel needs.
      
      The next attempt to prevent compilers from tail-call optimizing
      the last function call cpu_startup_entry(), shy of carving out
      start_secondary() into a separate compilation unit and building it with
      -fno-stack-protector, was to add an empty asm("").
      
      That solution was short and sweet and is reportedly supported by both
      compilers, but it did not get very far either: future (LTO?)
      optimization passes could potentially eliminate the empty asm, which
      leads to the third attempt: an actual memory barrier there, which the
      compiler cannot ignore or move around.
      
      That should hold for a long time, but hey we said that about the other
      two solutions too so...
      Reported-by: Sergei Trofimovich <slyfox@gentoo.org>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Tested-by: Kalle Valo <kvalo@codeaurora.org>
      Cc: <stable@vger.kernel.org>
      Link: https://lkml.kernel.org/r/20200314164451.346497-1-slyfox@gentoo.org
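
A minimal userspace sketch of the third approach described in the commit message above, not the kernel patch itself: with a real memory barrier placed after the final call, that call is no longer the last operation in the function, so the compiler cannot turn it into a tail call. The names `run_then_barrier`, `last_call` and `full_memory_barrier` are hypothetical, and `__sync_synchronize()` (a GCC/Clang builtin) merely stands in for whatever barrier the kernel uses.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a full memory barrier (GCC/Clang builtin). */
#define full_memory_barrier() __sync_synchronize()

static int calls;

/* Stand-in for the last function called by the caller. */
static void last_call(void)
{
    calls++;
}

/*
 * Calls fn() and then issues a barrier. Because the barrier must execute
 * after fn() returns, fn() is not in tail position and has to be emitted
 * as a real call rather than a jump.
 */
void run_then_barrier(void (*fn)(void))
{
    assert(fn != NULL);
    fn();
    full_memory_barrier();
}
```

Inspecting the output of `gcc -O2 -S` for `run_then_barrier` is one way to check on a given compiler that the call is followed by the barrier instead of being replaced by a jump.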
    • net: phy: broadcom: fix BCM54XX_SHD_SCR3_TRDDAPD value for BCM54810 · cc8a677a
      Kevin Lo authored
      Set the correct bit when checking for PHY_BRCM_DIS_TXCRXC_NOENRGY on the
      BCM54810 PHY.
      
      Fixes: 0ececcfc ("net: phy: broadcom: Allow BCM54810 to use bcm54xx_adjust_rxrefclk()")
      Signed-off-by: Kevin Lo <kevlo@kevlo.org>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • security: Fix the default value of secid_to_secctx hook · 625236ba
      Anders Roxell authored
      security_secid_to_secctx is called by the bpf_lsm hook, and a successful
      return value (i.e. 0) implies that the parameter will be consumed by the
      LSM framework. With CONFIG_BPF_LSM enabled, the default return from
      kernel/bpf/bpf_lsm.c currently reports success even though the pointer
      has not been initialized.
      
      This is the internal error:
      
      [ 1229.341488][ T2659] usercopy: Kernel memory exposure attempt detected from null address (offset 0, size 280)!
      [ 1229.374977][ T2659] ------------[ cut here ]------------
      [ 1229.376813][ T2659] kernel BUG at mm/usercopy.c:99!
      [ 1229.378398][ T2659] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
      [ 1229.380348][ T2659] Modules linked in:
      [ 1229.381654][ T2659] CPU: 0 PID: 2659 Comm: systemd-journal Tainted: G    B   W         5.7.0-rc5-next-20200511-00019-g864e0c6319b8-dirty #13
      [ 1229.385429][ T2659] Hardware name: linux,dummy-virt (DT)
      [ 1229.387143][ T2659] pstate: 80400005 (Nzcv daif +PAN -UAO BTYPE=--)
      [ 1229.389165][ T2659] pc : usercopy_abort+0xc8/0xcc
      [ 1229.390705][ T2659] lr : usercopy_abort+0xc8/0xcc
      [ 1229.392225][ T2659] sp : ffff000064247450
      [ 1229.393533][ T2659] x29: ffff000064247460 x28: 0000000000000000
      [ 1229.395449][ T2659] x27: 0000000000000118 x26: 0000000000000000
      [ 1229.397384][ T2659] x25: ffffa000127049e0 x24: ffffa000127049e0
      [ 1229.399306][ T2659] x23: ffffa000127048e0 x22: ffffa000127048a0
      [ 1229.401241][ T2659] x21: ffffa00012704b80 x20: ffffa000127049e0
      [ 1229.403163][ T2659] x19: ffffa00012704820 x18: 0000000000000000
      [ 1229.405094][ T2659] x17: 0000000000000000 x16: 0000000000000000
      [ 1229.407008][ T2659] x15: 0000000000000000 x14: 003d090000000000
      [ 1229.408942][ T2659] x13: ffff80000d5b25b2 x12: 1fffe0000d5b25b1
      [ 1229.410859][ T2659] x11: 1fffe0000d5b25b1 x10: ffff80000d5b25b1
      [ 1229.412791][ T2659] x9 : ffffa0001034bee0 x8 : ffff00006ad92d8f
      [ 1229.414707][ T2659] x7 : 0000000000000000 x6 : ffffa00015eacb20
      [ 1229.416642][ T2659] x5 : ffff0000693c8040 x4 : 0000000000000000
      [ 1229.418558][ T2659] x3 : ffffa0001034befc x2 : d57a7483a01c6300
      [ 1229.420610][ T2659] x1 : 0000000000000000 x0 : 0000000000000059
      [ 1229.422526][ T2659] Call trace:
      [ 1229.423631][ T2659]  usercopy_abort+0xc8/0xcc
      [ 1229.425091][ T2659]  __check_object_size+0xdc/0x7d4
      [ 1229.426729][ T2659]  put_cmsg+0xa30/0xa90
      [ 1229.428132][ T2659]  unix_dgram_recvmsg+0x80c/0x930
      [ 1229.429731][ T2659]  sock_recvmsg+0x9c/0xc0
      [ 1229.431123][ T2659]  ____sys_recvmsg+0x1cc/0x5f8
      [ 1229.432663][ T2659]  ___sys_recvmsg+0x100/0x160
      [ 1229.434151][ T2659]  __sys_recvmsg+0x110/0x1a8
      [ 1229.435623][ T2659]  __arm64_sys_recvmsg+0x58/0x70
      [ 1229.437218][ T2659]  el0_svc_common.constprop.1+0x29c/0x340
      [ 1229.438994][ T2659]  do_el0_svc+0xe8/0x108
      [ 1229.440587][ T2659]  el0_svc+0x74/0x88
      [ 1229.441917][ T2659]  el0_sync_handler+0xe4/0x8b4
      [ 1229.443464][ T2659]  el0_sync+0x17c/0x180
      [ 1229.444920][ T2659] Code: aa1703e2 aa1603e1 910a8260 97ecc860 (d4210000)
      [ 1229.447070][ T2659] ---[ end trace 400497d91baeaf51 ]---
      [ 1229.448791][ T2659] Kernel panic - not syncing: Fatal exception
      [ 1229.450692][ T2659] Kernel Offset: disabled
      [ 1229.452061][ T2659] CPU features: 0x240002,20002004
      [ 1229.453647][ T2659] Memory Limit: none
      [ 1229.455015][ T2659] ---[ end Kernel panic - not syncing: Fatal exception ]---
      
      Rework the hook so that the default return value is -EOPNOTSUPP.
      
      There are likely other callbacks, such as security_inode_getsecctx(), that
      have the same problem; someone who understands the code better needs to
      audit them.
      
      Thank you Arnd for helping me figure out what went wrong.
      
      Fixes: 98e828a0 ("security: Refactor declaration of LSM hooks")
      Signed-off-by: Anders Roxell <anders.roxell@linaro.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: James Morris <jamorris@linux.microsoft.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Link: https://lore.kernel.org/bpf/20200512174607.9630-1-anders.roxell@linaro.org
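
The failure mode described in this commit can be sketched in plain C, under the assumption that the caller only checks the return code before using the output pointer. The hook type and the functions `default_hook_returns_success`, `default_hook_returns_eopnotsupp` and `caller_would_copy_null` are hypothetical stand-ins, not kernel APIs.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical hook signature modelled on a secid-to-context lookup. */
typedef int (*secctx_hook_t)(unsigned int secid, const char **secdata,
                             unsigned int *seclen);

/* Old default: reports success without filling in the outputs. */
static int default_hook_returns_success(unsigned int secid,
                                        const char **secdata,
                                        unsigned int *seclen)
{
    (void)secid; (void)secdata; (void)seclen;
    return 0;
}

/* New default: tells the caller that nothing was produced. */
static int default_hook_returns_eopnotsupp(unsigned int secid,
                                           const char **secdata,
                                           unsigned int *seclen)
{
    (void)secid; (void)secdata; (void)seclen;
    return -EOPNOTSUPP;
}

/* Returns 1 if a caller trusting the return code would copy from NULL. */
int caller_would_copy_null(secctx_hook_t hook)
{
    const char *secdata = NULL;
    unsigned int seclen = 0;
    int err;

    assert(hook != NULL);
    err = hook(42, &secdata, &seclen);
    if (err)
        return 0;           /* error path: the copy is skipped */
    return secdata == NULL; /* "success" with no data: the copy would fault */
}
```

With the success-returning default the caller proceeds with a NULL buffer, which matches the "null address" usercopy report above; with -EOPNOTSUPP it bails out early.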
    • mm, memcg: fix inconsistent oom event behavior · 04fd61a4
      Yafang Shao authored
      A recent commit 9852ae3f ("mm, memcg: consider subtrees in
      memory.events") changed the behavior of memcg events, which now
      consider subtrees in memory.events.
      
      But the oom_kill event is a special one, as it is used in both cgroup1 and
      cgroup2.  In cgroup1, it is displayed in memory.oom_control.  The file
      memory.oom_control exists in both the root memcg and non-root memcgs,
      unlike memory.events, which exists only in non-root memcgs.  That commit
      is okay for cgroup2, but it is not okay for cgroup1, as it causes
      inconsistent behavior between the root memcg and non-root memcgs.
      
      Here's an example of why this behavior is inconsistent in cgroup1.
      
             root memcg
             /
          memcg foo
           /
        memcg bar
      
      Suppose there's an oom_kill in memcg bar; then the oom_kill counts will be:
      
             root memcg : memory.oom_control(oom_kill)  0
             /
          memcg foo : memory.oom_control(oom_kill)  1
           /
        memcg bar : memory.oom_control(oom_kill)  1
      
      For the non-root memcg, its memory.oom_control(oom_kill) includes its
      descendants' oom_kill, but for root memcg, it doesn't include its
      descendants' oom_kill.  That means, memory.oom_control(oom_kill) has
      different meanings in different memcgs.  That is inconsistent.  Then the
      user has to know whether the memcg is root or not.
      
      If we can't fully support it in cgroup1, for example by adding
      memory.events.local into cgroup1 as well, then let's not touch its
      original behavior.
      
      Fixes: 9852ae3f ("mm, memcg: consider subtrees in memory.events")
      Reported-by: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Chris Down <chris@chrisdown.name>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/20200502141055.7378-1-laoar.shao@gmail.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
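
The inconsistency in the root/foo/bar example can be reproduced with a small model. This is only a sketch under the assumption that hierarchical counting walks up the tree and stops before the root; `struct memcg` here is a toy type and `record_oom_kill` a hypothetical helper, not the kernel's implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy memcg: only a parent link and an oom_kill counter. */
struct memcg {
    struct memcg *parent; /* NULL for the root memcg */
    long oom_kill;
};

static bool is_root(const struct memcg *m)
{
    return m->parent == NULL;
}

/*
 * Record one oom_kill in memcg m.
 * hierarchical == true:  also count it in every non-root ancestor.
 * hierarchical == false: count it only in m itself (local behaviour).
 */
void record_oom_kill(struct memcg *m, bool hierarchical)
{
    assert(m != NULL);
    do {
        m->oom_kill++;
        if (!hierarchical)
            break;
        m = m->parent;
    } while (m != NULL && !is_root(m));
}
```

Recording a kill in bar hierarchically yields root 0, foo 1, bar 1, which is the inconsistent picture from the commit message; recording it locally yields a count in bar only, so every memcg's counter means the same thing.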
  3. 14 May 2020, 3 commits
    • usb: raw-gadget: support stalling/halting/wedging endpoints · c61769bd
      Andrey Konovalov authored
      Raw Gadget is currently unable to stall/halt/wedge gadget endpoints,
      which is required for proper emulation of certain USB classes.
      
      This patch adds a few more ioctls:
      
      - USB_RAW_IOCTL_EP0_STALL allows stalling control endpoint #0 when
        there's a pending setup request for it.
      - USB_RAW_IOCTL_SET/CLEAR_HALT/WEDGE allow setting/clearing the halt/wedge
        status on non-control non-isochronous endpoints.
      
      Fixes: f2c2e717 ("usb: gadget: add raw-gadget interface")
      Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
      Signed-off-by: Felipe Balbi <balbi@kernel.org>
    • usb: raw-gadget: fix gadget endpoint selection · 97df5e57
      Andrey Konovalov authored
      Currently automatic gadget endpoint selection based on required features
      doesn't work. Raw Gadget tries iterating over the list of available
      endpoints and finding one that has the right direction and transfer type.
      Unfortunately selecting arbitrary gadget endpoints (even if they satisfy
      feature requirements) doesn't work, as (depending on the UDC driver) they
      might have fixed addresses, and one also needs to provide matching
      endpoint addresses in the descriptors sent to the host.
      
      The composite framework deals with this by assigning endpoint addresses
      in usb_ep_autoconfig() before enumeration starts. This approach won't work
      with Raw Gadget, as the endpoints are supposed to be enabled after a
      set_configuration/set_interface request from the host, so it's too late to
      patch the endpoint descriptors that have already been sent to the host.
      
      For Raw Gadget we take another approach. Similarly to GadgetFS, we allow
      the user to make the decision as to which gadget endpoints to use.
      
      This patch adds another Raw Gadget ioctl USB_RAW_IOCTL_EPS_INFO that
      exposes information about all non-control endpoints that a currently
      connected UDC has. This information includes endpoint addresses, as well
      as their capabilities and limits to allow the user to choose the most
      fitting gadget endpoint.
      
      The USB_RAW_IOCTL_EP_ENABLE ioctl is updated to use the proper endpoint
      validation routine usb_gadget_ep_match_desc().
      
      These changes affect the portability of the gadgets that use Raw Gadget
      when running on different UDCs. Nevertheless, as long as the user relies
      on the information provided by USB_RAW_IOCTL_EPS_INFO to dynamically
      choose endpoint addresses, UDC-agnostic gadgets can still be written with
      Raw Gadget.
      
      Fixes: f2c2e717 ("usb: gadget: add raw-gadget interface")
      Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
      Signed-off-by: Felipe Balbi <balbi@kernel.org>
    • usb: raw-gadget: improve uapi headers comments · 17ff3b72
      Andrey Konovalov authored
      Fix typo "trasferred" => "transferred".
      
      Don't call USB requests URBs.
      
      Fix comment style.
      Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
      Signed-off-by: Felipe Balbi <balbi@kernel.org>
  4. 13 May 2020, 3 commits
    • x86/ftrace: Have ftrace trampolines turn read-only at the end of system boot up · 59566b0b
      Steven Rostedt (VMware) authored
      Booting one of my machines triggered the following crash:
      
       Kernel/User page tables isolation: enabled
       ftrace: allocating 36577 entries in 143 pages
       Starting tracer 'function'
       BUG: unable to handle page fault for address: ffffffffa000005c
       #PF: supervisor write access in kernel mode
       #PF: error_code(0x0003) - permissions violation
       PGD 2014067 P4D 2014067 PUD 2015063 PMD 7b253067 PTE 7b252061
       Oops: 0003 [#1] PREEMPT SMP PTI
       CPU: 0 PID: 0 Comm: swapper Not tainted 5.4.0-test+ #24
       Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./To be filled by O.E.M., BIOS SDBLI944.86P 05/08/2007
       RIP: 0010:text_poke_early+0x4a/0x58
       Code: 34 24 48 89 54 24 08 e8 bf 72 0b 00 48 8b 34 24 48 8b 4c 24 08 84 c0 74 0b 48 89 df f3 a4 48 83 c4 10 5b c3 9c 58 fa 48 89 df <f3> a4 50 9d 48 83 c4 10 5b e9 d6 f9 ff ff 0 41 57 49
       RSP: 0000:ffffffff82003d38 EFLAGS: 00010046
       RAX: 0000000000000046 RBX: ffffffffa000005c RCX: 0000000000000005
       RDX: 0000000000000005 RSI: ffffffff825b9a90 RDI: ffffffffa000005c
       RBP: ffffffffa000005c R08: 0000000000000000 R09: ffffffff8206e6e0
       R10: ffff88807b01f4c0 R11: ffffffff8176c106 R12: ffffffff8206e6e0
       R13: ffffffff824f2440 R14: 0000000000000000 R15: ffffffff8206eac0
       FS:  0000000000000000(0000) GS:ffff88807d400000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: ffffffffa000005c CR3: 0000000002012000 CR4: 00000000000006b0
       Call Trace:
        text_poke_bp+0x27/0x64
        ? mutex_lock+0x36/0x5d
        arch_ftrace_update_trampoline+0x287/0x2d5
        ? ftrace_replace_code+0x14b/0x160
        ? ftrace_update_ftrace_func+0x65/0x6c
        __register_ftrace_function+0x6d/0x81
        ftrace_startup+0x23/0xc1
        register_ftrace_function+0x20/0x37
        func_set_flag+0x59/0x77
        __set_tracer_option.isra.19+0x20/0x3e
        trace_set_options+0xd6/0x13e
        apply_trace_boot_options+0x44/0x6d
        register_tracer+0x19e/0x1ac
        early_trace_init+0x21b/0x2c9
        start_kernel+0x241/0x518
        ? load_ucode_intel_bsp+0x21/0x52
        secondary_startup_64+0xa4/0xb0
      
      I was able to trigger it on other machines when I added both
      "ftrace=function" and "trace_options=func_stack_trace" to the kernel
      command line.
      
      The cause is that "ftrace=function" registers the function tracer and
      creates a trampoline, which is set executable and read-only. Then
      "trace_options=func_stack_trace" updates the same trampoline to include
      the stack tracer version of the function tracer. Since the trampoline
      already exists, it is updated with text_poke_bp(). The problem is that
      when text_poke_bp() is called while system_state == SYSTEM_BOOTING, it
      simply does a memcpy() and skips the page mapping, as it assumes that the
      text is still read-write. In this case it is not, so we take a fault and
      crash.
      
      Instead, let's keep the ftrace trampolines read-write during boot up,
      and then when the kernel executable text is set to read-only, the
      ftrace trampolines get set to read-only as well.
      
      Link: https://lkml.kernel.org/r/20200430202147.4dc6e2de@oasis.local.home
      
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: stable@vger.kernel.org
      Fixes: 768ae440 ("x86/ftrace: Use text_poke()")
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
    • tcp: fix SO_RCVLOWAT hangs with fat skbs · 24adbc16
      Eric Dumazet authored
      We autotune rcvbuf whenever SO_RCVLOWAT is set, to account for 100%
      overhead in tcp_set_rcvlowat().
      
      This works well when the skb->len/skb->truesize ratio is bigger than 0.5.
      
      But if we receive packets with small MSS, we can end up in a situation
      where not enough bytes are available in the receive queue to satisfy
      the RCVLOWAT setting.
      As our sk_rcvbuf limit is hit, we send zero windows in ACK packets,
      preventing the remote peer from sending more data.
      
      Even autotuning does not help, because it only triggers at the time
      the user process drains the queue. If no EPOLLIN is generated, this
      cannot happen.
      
      Note poll() has a similar issue, after commit
      c7004482 ("tcp: Respect SO_RCVLOWAT in tcp_poll().")
      
      Fixes: 03f45c88 ("tcp: avoid extra wakeups for SO_RCVLOWAT users")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
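
The arithmetic behind the hang can be illustrated with a simplified model. It assumes that rcvbuf is sized at twice the low-water mark (the "100% overhead" mentioned above), that all skbs are uniform, and that each skb consumes `truesize` bytes of rcvbuf while contributing only `len` payload bytes. The functions `payload_capacity` and `lowat_reachable` and the sample sizes are illustrative assumptions, not kernel code or measured values.

```c
#include <assert.h>
#include <stdbool.h>

/* Payload bytes that fit before rcvbuf is exhausted, for uniform skbs. */
long payload_capacity(long rcvbuf, long skb_len, long skb_truesize)
{
    assert(skb_truesize > 0);
    return (rcvbuf / skb_truesize) * skb_len;
}

/* Can the queue ever hold `lowat` payload bytes with rcvbuf = 2 * lowat? */
bool lowat_reachable(long lowat, long skb_len, long skb_truesize)
{
    long rcvbuf = 2 * lowat; /* 100% overhead allowance */

    return payload_capacity(rcvbuf, skb_len, skb_truesize) >= lowat;
}
```

For a low-water mark of 65536 bytes, skbs with len 1448 and truesize 2304 (ratio above 0.5) can satisfy it, whereas skbs with len 100 and truesize 768 fill rcvbuf after only 17000 payload bytes, so the reader is never woken and the sender sees a zero window.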
    • ptp: fix struct member comment for do_aux_work · 2c864c78
      Jacob Keller authored
      The do_aux_work callback had documentation in the structure comment
      which referred to it as "do_work".
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Cc: Richard Cochran <richardcochran@gmail.com>
      Acked-by: Richard Cochran <richardcochran@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  5. 12 May 2020, 1 commit
  6. 11 May 2020, 2 commits
    • netfilter: flowtable: Add pending bit for offload work · 2c889795
      Paul Blakey authored
      The gc step can queue offloaded flow del work or stats work.
      Those work items can race with each other, and a flow could be freed
      before the stats work has executed and queried it.
      To avoid that, add a pending bit: if a work item already exists for a
      flow, don't queue another one for it.
      This also avoids adding multiple stats work items in case the stats work
      didn't complete before the gc step started again.
      Signed-off-by: Paul Blakey <paulb@mellanox.com>
      Reviewed-by: Roi Dayan <roid@mellanox.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
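
The pending-bit idea can be sketched with C11 atomics: an atomic test-and-set decides whether a new work item may be queued, and the bit is cleared when the work finishes. `struct flow`, `FLOW_HW_PENDING`, `flow_try_queue_work` and `flow_work_done` are hypothetical names for illustration; the commit message does not show the actual code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define FLOW_HW_PENDING (1UL << 0) /* hypothetical "work outstanding" flag */

struct flow {
    atomic_ulong flags;
};

/*
 * Atomically set the pending bit. Returns true if the caller won the race
 * and may queue a work item, false if one is already outstanding.
 */
bool flow_try_queue_work(struct flow *f)
{
    unsigned long old;

    assert(f != NULL);
    old = atomic_fetch_or(&f->flags, FLOW_HW_PENDING);
    return !(old & FLOW_HW_PENDING);
}

/* Called when the work item has finished: allow a new one to be queued. */
void flow_work_done(struct flow *f)
{
    assert(f != NULL);
    atomic_fetch_and(&f->flags, ~FLOW_HW_PENDING);
}
```

Because setting and testing the bit happen in a single atomic operation, two concurrent gc passes cannot both queue work for the same flow.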
    • netfilter: conntrack: avoid gcc-10 zero-length-bounds warning · 2c407aca
      Arnd Bergmann authored
      gcc-10 warns about a suspicious access to an empty struct member:
      
      net/netfilter/nf_conntrack_core.c: In function '__nf_conntrack_alloc':
      net/netfilter/nf_conntrack_core.c:1522:9: warning: array subscript 0 is outside the bounds of an interior zero-length array 'u8[0]' {aka 'unsigned char[0]'} [-Wzero-length-bounds]
       1522 |  memset(&ct->__nfct_init_offset[0], 0,
            |         ^~~~~~~~~~~~~~~~~~~~~~~~~~
      In file included from net/netfilter/nf_conntrack_core.c:37:
      include/net/netfilter/nf_conntrack.h:90:5: note: while referencing '__nfct_init_offset'
         90 |  u8 __nfct_init_offset[0];
            |     ^~~~~~~~~~~~~~~~~~
      
      The code is correct but a bit unusual. Rework it slightly in a way that
      does not trigger the warning, using an empty struct instead of an empty
      array. There are probably more elegant ways to do this, but this is the
      smallest change.
      
      Fixes: c41884ce ("netfilter: conntrack: avoid zeroing timer")
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
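
A sketch of the marker technique the commit message describes, using an empty struct instead of a zero-length array to mark where partial re-initialisation starts. `struct conn`, its fields and `conn_reset` are hypothetical; empty structs are a GNU C extension, and the pointer arithmetic via `offsetof` is one possible way to write the memset, not necessarily the one used in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct conn {
    int refcount;            /* must survive re-initialisation */
    struct { } init_offset;  /* zero-size marker (GNU C extension) */
    int status;
    int mark;
    int proto;               /* zeroing stops before this member */
};

/* Zero every member from the marker up to, but not including, proto. */
void conn_reset(struct conn *c)
{
    size_t start = offsetof(struct conn, init_offset);
    size_t end = offsetof(struct conn, proto);

    assert(c != NULL);
    memset((char *)c + start, 0, end - start);
}
```

Since no array subscript is involved, there is nothing for a zero-length-array bounds check to complain about, while the set of zeroed members stays the same.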
  7. 10 May 2020, 2 commits
  8. 08 May 2020, 4 commits
  9. 07 May 2020, 5 commits
  10. 06 May 2020, 1 commit
    • bpf, sockmap: bpf_tcp_ingress needs to subtract bytes from sg.size · 81aabbb9
      John Fastabend authored
      In bpf_tcp_ingress we used apply_bytes to subtract bytes from sg.size
      which is used to track total bytes in a message. But this is not
      correct because apply_bytes is itself modified in the main loop doing
      the mem_charge.
      
      Then at the end of this we have sg.size incorrectly set and out of
      sync with actual sk values. Then we can get a splat if we try to
      cork the data later and again try to redirect the msg to ingress. To
      fix this, instead of trying to track msg.size, do the easy thing and
      include it as part of the sk_msg_xfer logic so that when the msg is
      moved the sg.size is always correct.
      
      To reproduce the splat below, users will need ingress + cork and must
      hit an error path that will then try to 'free' the skmsg.
      
      [  173.699981] BUG: KASAN: null-ptr-deref in sk_msg_free_elem+0xdd/0x120
      [  173.699987] Read of size 8 at addr 0000000000000008 by task test_sockmap/5317
      
      [  173.700000] CPU: 2 PID: 5317 Comm: test_sockmap Tainted: G          I       5.7.0-rc1+ #43
      [  173.700005] Hardware name: Dell Inc. Precision 5820 Tower/002KVM, BIOS 1.9.2 01/24/2019
      [  173.700009] Call Trace:
      [  173.700021]  dump_stack+0x8e/0xcb
      [  173.700029]  ? sk_msg_free_elem+0xdd/0x120
      [  173.700034]  ? sk_msg_free_elem+0xdd/0x120
      [  173.700042]  __kasan_report+0x102/0x15f
      [  173.700052]  ? sk_msg_free_elem+0xdd/0x120
      [  173.700060]  kasan_report+0x32/0x50
      [  173.700070]  sk_msg_free_elem+0xdd/0x120
      [  173.700080]  __sk_msg_free+0x87/0x150
      [  173.700094]  tcp_bpf_send_verdict+0x179/0x4f0
      [  173.700109]  tcp_bpf_sendpage+0x3ce/0x5d0
      
      Fixes: 604326b4 ("bpf, sockmap: convert to generic sk_msg interface")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Link: https://lore.kernel.org/bpf/158861290407.14306.5327773422227552482.stgit@john-Precision-5820-Tower
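
The accounting bug described in this commit can be modelled in a few lines: a byte budget that is decremented inside the transfer loop is no longer a valid measure of how much was moved once the loop ends. `struct msg`, `ingress_buggy` and `ingress_fixed` are toy stand-ins, assumed for illustration, for the real sk_msg structures and functions.

```c
#include <assert.h>
#include <stddef.h>

#define NELEMS 2

/* Toy message: a total size plus per-element lengths. */
struct msg {
    unsigned int size;         /* total bytes, playing the role of sg.size */
    unsigned int elem[NELEMS]; /* bytes remaining in each element */
};

static unsigned int min_u(unsigned int a, unsigned int b)
{
    return a < b ? a : b;
}

/* Buggy pattern: the total is adjusted by the budget after the loop ate it. */
void ingress_buggy(struct msg *m, unsigned int apply_bytes)
{
    assert(m != NULL);
    for (size_t i = 0; i < NELEMS && apply_bytes; i++) {
        unsigned int n = min_u(apply_bytes, m->elem[i]);

        m->elem[i] -= n;  /* bytes leave this element */
        apply_bytes -= n; /* the budget shrinks as well */
    }
    m->size -= apply_bytes; /* budget is spent here: moved bytes are not subtracted */
}

/* Fixed pattern: the total is adjusted at the moment bytes are transferred. */
void ingress_fixed(struct msg *m, unsigned int apply_bytes)
{
    assert(m != NULL);
    for (size_t i = 0; i < NELEMS && apply_bytes; i++) {
        unsigned int n = min_u(apply_bytes, m->elem[i]);

        m->elem[i] -= n;
        m->size -= n; /* the total always matches the elements */
        apply_bytes -= n;
    }
}
```

Moving 150 bytes out of a 300-byte message leaves the buggy version claiming 300 bytes while only 150 remain in the elements; the fixed version reports 150.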
  11. 05 May 2020, 7 commits
  12. 02 May 2020, 1 commit
    • ipv6: Use global sernum for dst validation with nexthop objects · 8f34e53b
      David Ahern authored
      Nik reported a bug with the pcpu dst cache when nexthop objects are
      used, illustrated by the following:
          $ ip netns add foo
          $ ip -netns foo li set lo up
          $ ip -netns foo addr add 2001:db8:11::1/128 dev lo
          $ ip netns exec foo sysctl net.ipv6.conf.all.forwarding=1
          $ ip li add veth1 type veth peer name veth2
          $ ip li set veth1 up
          $ ip addr add 2001:db8:10::1/64 dev veth1
          $ ip li set dev veth2 netns foo
          $ ip -netns foo li set veth2 up
          $ ip -netns foo addr add 2001:db8:10::2/64 dev veth2
          $ ip -6 nexthop add id 100 via 2001:db8:10::2 dev veth1
          $ ip -6 route add 2001:db8:11::1/128 nhid 100
      
          Create a pcpu entry on cpu 0:
          $ taskset -a -c 0 ip -6 route get 2001:db8:11::1
      
          Re-add the route entry:
          $ ip -6 ro del 2001:db8:11::1
          $ ip -6 route add 2001:db8:11::1/128 nhid 100
      
          Route get on cpu 0 returns the stale pcpu:
          $ taskset -a -c 0 ip -6 route get 2001:db8:11::1
          RTNETLINK answers: Network is unreachable
      
          While cpu 1 works:
          $ taskset -a -c 1 ip -6 route get 2001:db8:11::1
          2001:db8:11::1 from :: via 2001:db8:10::2 dev veth1 src 2001:db8:10::1 metric 1024 pref medium
      
      Conversion of FIB entries to work with external nexthop objects
      missed an important difference between IPv4 and IPv6 - how dst
      entries are invalidated when the FIB changes. IPv4 has a per-network
      namespace generation id (rt_genid) that is bumped on changes to the FIB.
      Checking if a dst_entry is still valid means comparing rt_genid in the
      rtable to the current value of rt_genid for the namespace.
      
      IPv6 also has a per network namespace counter, fib6_sernum, but the
      count is saved per fib6_node. With the per-node counter only dst_entries
      based on fib entries under the node are invalidated when changes are
      made to the routes - limiting the scope of invalidations. IPv6 uses a
      reference in the rt6_info, 'from', to track the corresponding fib entry
      used to create the dst_entry. When validating a dst_entry, the 'from'
      is used to backtrack to the fib6_node and check its sernum against the
      cookie passed to the dst_check operation.
      
      With the inline format (nexthop definition inline with the fib6_info),
      dst_entries cached in the fib6_nh have a 1:1 correlation between fib
      entries, nexthop data and dst_entries. With external nexthops, IPv6
      looks more like IPv4 which means multiple fib entries across disparate
      fib6_nodes can all reference the same fib6_nh. That means validation
      of dst_entries based on external nexthops needs to use the IPv4 format
      - the per-network namespace counter.
      
      Add sernum to rt6_info and set it when creating a pcpu dst entry. Update
      rt6_get_cookie to return sernum if it is set and update dst_check for
      IPv6 to look for sernum set and base the check on it if so. Finally,
      rt6_get_pcpu_route needs to validate the cached entry before returning
      a pcpu entry (similar to the rt_cache_valid calls in __mkroute_input and
      __mkroute_output for IPv4).
      
      This problem only affects routes using the new, external nexthops.
      
      Thanks to the kbuild test robot for catching the IS_ENABLED needed
      around rt_genid_ipv6 before I sent this out.
      
      Fixes: 5b98324e ("ipv6: Allow routes to use nexthop objects")
      Reported-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: David Ahern <dsahern@kernel.org>
      Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Tested-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  13. 01 May 2020, 4 commits
    • tunnel: Propagate ECT(1) when decapsulating as recommended by RFC6040 · b7237487
      Toke Høiland-Jørgensen authored
      RFC 6040 recommends propagating an ECT(1) mark from an outer tunnel header
      to the inner header if that inner header is already marked as ECT(0). When
      RFC 6040 decapsulation was implemented, this case of propagation was not
      added. This simply appears to be an oversight, so let's fix that.
      
      Fixes: eccc1bb8 ("tunnel: drop packet if ECN present with not-ECT")
      Reported-by: Bob Briscoe <ietf@bobbriscoe.net>
      Reported-by: Olivier Tilmans <olivier.tilmans@nokia-bell-labs.com>
      Cc: Dave Taht <dave.taht@gmail.com>
      Cc: Stephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
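
The decapsulation rule can be written as a small function over the four ECN codepoints. This is a sketch of the normal-mode decapsulation table as I read RFC 6040, with the codepoint values from RFC 3168; the enum, `ECN_DROP` and `ecn_decapsulate` are hypothetical names and not the kernel's helpers.

```c
#include <assert.h>

/* ECN codepoints (two-bit field values from RFC 3168). */
enum ecn {
    ECN_NOT_ECT = 0,
    ECN_ECT_1   = 1,
    ECN_ECT_0   = 2,
    ECN_CE      = 3
};

#define ECN_DROP (-1) /* hypothetical marker: packet should be dropped */

/* Returns the codepoint for the forwarded inner header, or ECN_DROP. */
int ecn_decapsulate(enum ecn outer, enum ecn inner)
{
    assert(outer <= ECN_CE && inner <= ECN_CE);

    if (inner == ECN_NOT_ECT)
        return outer == ECN_CE ? ECN_DROP : ECN_NOT_ECT;
    if (outer == ECN_CE || inner == ECN_CE)
        return ECN_CE;
    if (outer == ECN_ECT_1)
        return ECN_ECT_1; /* includes inner ECT(0): the case this commit adds */
    return inner;
}
```

The branch for an outer ECT(1) is the propagation the commit message refers to: an inner ECT(0) header is upgraded to ECT(1), while every other combination keeps its previous outcome.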
    • security: Fix the default value of fs_context_parse_param hook · 54261af4
      KP Singh authored
      security_fs_context_parse_param is called by vfs_parse_fs_param and
      a successful return value (i.e. 0) implies that a parameter will be
      consumed by the LSM framework. This stops all further parsing of the
      parameter by VFS. Furthermore, if an LSM hook returns a success, the
      remaining LSM hooks are not invoked for the parameter.
      
      The current default behavior of returning success means that all the
      parameters are expected to be parsed by the LSM hook and none of them
      end up being populated by vfs in fs_context.
      
      This was noticed when lsm=bpf is supplied on the command line before any
      other LSM. As the bpf lsm uses this default value to implement a default
      hook, this resulted in a failure to parse any fs_context parameters and
      a failure to mount the root filesystem.
      
      Fixes: 98e828a0 ("security: Refactor declaration of LSM hooks")
      Reported-by: Mikko Ylinen <mikko.ylinen@linux.intel.com>
      Signed-off-by: KP Singh <kpsingh@google.com>
      Signed-off-by: James Morris <jmorris@namei.org>
    • mptcp: move option parsing into mptcp_incoming_options() · cfde141e
      Paolo Abeni authored
      The mptcp_options_received structure carries several per-packet
      flags (mp_capable, mp_join, etc.). Such fields must
      be cleared on each packet, even on dropped ones or packets
      not carrying any MPTCP options, but the current mptcp
      code clears them only on TCP option reset.
      
      In several races/corner cases we end up with stray bits in
      incoming options, leading to WARN_ON splats, e.g.:
      
      [  171.164906] Bad mapping: ssn=32714 map_seq=1 map_data_len=32713
      [  171.165006] WARNING: CPU: 1 PID: 5026 at net/mptcp/subflow.c:533 warn_bad_map (linux-mptcp/net/mptcp/subflow.c:533 linux-mptcp/net/mptcp/subflow.c:531)
      [  171.167632] Modules linked in: ip6_vti ip_vti ip_gre ipip sit tunnel4 ip_tunnel geneve ip6_udp_tunnel udp_tunnel macsec macvtap tap ipvlan macvlan 8021q garp mrp xfrm_interface veth netdevsim nlmon dummy team bonding vcan bridge stp llc ip6_gre gre ip6_tunnel tunnel6 tun binfmt_misc intel_rapl_msr intel_rapl_common rfkill kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel joydev virtio_balloon pcspkr i2c_piix4 sunrpc ip_tables xfs libcrc32c crc32c_intel serio_raw virtio_console ata_generic virtio_blk virtio_net net_failover failover ata_piix libata
      [  171.199464] CPU: 1 PID: 5026 Comm: repro Not tainted 5.7.0-rc1.mptcp_f227fdf5d388+ #95
      [  171.200886] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-2.fc30 04/01/2014
      [  171.202546] RIP: 0010:warn_bad_map (linux-mptcp/net/mptcp/subflow.c:533 linux-mptcp/net/mptcp/subflow.c:531)
      [  171.206537] Code: c1 ea 03 0f b6 14 02 48 89 f8 83 e0 07 83 c0 03 38 d0 7c 04 84 d2 75 1d 8b 55 3c 44 89 e6 48 c7 c7 20 51 13 95 e8 37 8b 22 fe <0f> 0b 48 83 c4 08 5b 5d 41 5c c3 89 4c 24 04 e8 db d6 94 fe 8b 4c
      [  171.220473] RSP: 0018:ffffc90000150560 EFLAGS: 00010282
      [  171.221639] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
      [  171.223108] RDX: 0000000000000000 RSI: 0000000000000008 RDI: fffff5200002a09e
      [  171.224388] RBP: ffff8880aa6e3c00 R08: 0000000000000001 R09: fffffbfff2ec9955
      [  171.225706] R10: ffffffff9764caa7 R11: fffffbfff2ec9954 R12: 0000000000007fca
      [  171.227211] R13: ffff8881066f4a7f R14: ffff8880aa6e3c00 R15: 0000000000000020
      [  171.228460] FS:  00007f8623719740(0000) GS:ffff88810be00000(0000) knlGS:0000000000000000
      [  171.230065] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  171.231303] CR2: 00007ffdab190a50 CR3: 00000001038ea006 CR4: 0000000000160ee0
      [  171.232586] Call Trace:
      [  171.233109]  <IRQ>
      [  171.233531] get_mapping_status (linux-mptcp/net/mptcp/subflow.c:691)
      [  171.234371] mptcp_subflow_data_available (linux-mptcp/net/mptcp/subflow.c:736 linux-mptcp/net/mptcp/subflow.c:832)
      [  171.238181] subflow_state_change (linux-mptcp/net/mptcp/subflow.c:1085 (discriminator 1))
      [  171.239066] tcp_fin (linux-mptcp/net/ipv4/tcp_input.c:4217)
      [  171.240123] tcp_data_queue (linux-mptcp/./include/linux/compiler.h:199 linux-mptcp/net/ipv4/tcp_input.c:4822)
      [  171.245083] tcp_rcv_established (linux-mptcp/./include/linux/skbuff.h:1785 linux-mptcp/./include/net/tcp.h:1774 linux-mptcp/./include/net/tcp.h:1847 linux-mptcp/net/ipv4/tcp_input.c:5238 linux-mptcp/net/ipv4/tcp_input.c:5730)
      [  171.254089] tcp_v4_rcv (linux-mptcp/./include/linux/spinlock.h:393 linux-mptcp/net/ipv4/tcp_ipv4.c:2009)
      [  171.258969] ip_protocol_deliver_rcu (linux-mptcp/net/ipv4/ip_input.c:204 (discriminator 1))
      [  171.260214] ip_local_deliver_finish (linux-mptcp/./include/linux/rcupdate.h:651 linux-mptcp/net/ipv4/ip_input.c:232)
      [  171.261389] ip_local_deliver (linux-mptcp/./include/linux/netfilter.h:307 linux-mptcp/./include/linux/netfilter.h:301 linux-mptcp/net/ipv4/ip_input.c:252)
      [  171.265884] ip_rcv (linux-mptcp/./include/linux/netfilter.h:307 linux-mptcp/./include/linux/netfilter.h:301 linux-mptcp/net/ipv4/ip_input.c:539)
      [  171.273666] process_backlog (linux-mptcp/./include/linux/rcupdate.h:651 linux-mptcp/net/core/dev.c:6135)
      [  171.275328] net_rx_action (linux-mptcp/net/core/dev.c:6572 linux-mptcp/net/core/dev.c:6640)
      [  171.280472] __do_softirq (linux-mptcp/./arch/x86/include/asm/jump_label.h:25 linux-mptcp/./include/linux/jump_label.h:200 linux-mptcp/./include/trace/events/irq.h:142 linux-mptcp/kernel/softirq.c:293)
      [  171.281379] do_softirq_own_stack (linux-mptcp/arch/x86/entry/entry_64.S:1083)
      [  171.282358]  </IRQ>
      
      We could address the issue by explicitly clearing the relevant
      fields in several places - tcp_parse_options(),
      tcp_fast_parse_options(), possibly others.
      
      Instead we move the MPTCP option parsing into the already existing
      mptcp ingress hook, so that we need to clear the fields in a single
      place.
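The single-place clearing described above can be illustrated with a minimal, self-contained sketch. It is not the kernel code: `struct demo_mp_opts`, `demo_parse()`, `demo_incoming_options()` and the 4-byte option layout are hypothetical names and simplifications made up for this example. Only the option kind value 30 is the real MPTCP TCP option kind.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified stand-in for the per-packet MPTCP option state. */
struct demo_mp_opts {
	bool    has_mapping;	/* set only when a mapping option was seen */
	uint8_t map_seq;
	uint8_t map_len;
};

#define DEMO_KIND_MPTCP 30	/* TCP option kind assigned to MPTCP */

/*
 * Walk a kind/length/value option buffer. Fields are written only for
 * options that are present, so whatever was in *out before the call
 * survives when the packet carries no MPTCP option.
 */
static void demo_parse(const uint8_t *opt, size_t len, struct demo_mp_opts *out)
{
	size_t i = 0;

	while (i + 2 <= len) {
		uint8_t kind = opt[i];
		uint8_t olen = opt[i + 1];

		if (olen < 2 || i + olen > len)
			break;		/* malformed option: stop parsing */
		if (kind == DEMO_KIND_MPTCP && olen == 4) {
			out->has_mapping = true;
			out->map_seq = opt[i + 2];
			out->map_len = opt[i + 3];
		}
		i += olen;
	}
}

/* Single ingress entry point: reset the state once, then parse. */
static void demo_incoming_options(const uint8_t *opt, size_t len,
				  struct demo_mp_opts *out)
{
	memset(out, 0, sizeof(*out));
	demo_parse(opt, len, out);
}
```

Calling `demo_parse()` directly on a packet without the option leaves `has_mapping` set from the previous packet, which is the kind of stray bit the commit message describes. Going through `demo_incoming_options()` for every packet avoids that without touching any other parser.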
      
      This allows us to drop an MPTCP hook from the TCP code and to
      remove the quite large mptcp_options_received from the tcp_sock
      struct. On the flip side, MPTCP sockets will traverse the
      option space twice (in tcp_parse_options() and in
      mptcp_incoming_options()). That looks acceptable: we already
      do that for syn and 3rd ack packets, plain TCP sockets will
      benefit from it, and even MPTCP sockets will experience better
      code locality, reducing the jumps between TCP and MPTCP code.
      
      v1 -> v2:
       - rebased on current '-net' tree
      
      Fixes: 648ef4b8 ("mptcp: Implement MPTCP receive path")
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      cfde141e
    • P
      mptcp: consolidate synack processing. · 263e1201
      Paolo Abeni committed
      Currently the MPTCP code uses 2 hooks to process syn-ack
      packets, mptcp_rcv_synsent() and the sk_rx_dst_set()
      callback.
      
      We can drop the first, moving the relevant code into the
      latter, reducing the hooking into the TCP code. This is
      also needed by the next patch.
      
      v1 -> v2:
       - use local tcp sock ptr instead of casting the sk variable
         several times - DaveM
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      263e1201
  14. 30 Apr, 2020 2 commits