1. 20 Jan, 2019 1 commit
    • bpf: in __bpf_redirect_no_mac pull mac only if present · e7c87bd6
      Willem de Bruijn committed
      Syzkaller was able to construct a packet of negative length by
      redirecting from bpf_prog_test_run_skb with BPF_PROG_TYPE_LWT_XMIT:
      
          BUG: KASAN: slab-out-of-bounds in memcpy include/linux/string.h:345 [inline]
          BUG: KASAN: slab-out-of-bounds in skb_copy_from_linear_data include/linux/skbuff.h:3421 [inline]
          BUG: KASAN: slab-out-of-bounds in __pskb_copy_fclone+0x2dd/0xeb0 net/core/skbuff.c:1395
          Read of size 4294967282 at addr ffff8801d798009c by task syz-executor2/12942
      
          kasan_report.cold.9+0x242/0x309 mm/kasan/report.c:412
          check_memory_region_inline mm/kasan/kasan.c:260 [inline]
          check_memory_region+0x13e/0x1b0 mm/kasan/kasan.c:267
          memcpy+0x23/0x50 mm/kasan/kasan.c:302
          memcpy include/linux/string.h:345 [inline]
          skb_copy_from_linear_data include/linux/skbuff.h:3421 [inline]
          __pskb_copy_fclone+0x2dd/0xeb0 net/core/skbuff.c:1395
          __pskb_copy include/linux/skbuff.h:1053 [inline]
          pskb_copy include/linux/skbuff.h:2904 [inline]
          skb_realloc_headroom+0xe7/0x120 net/core/skbuff.c:1539
          ipip6_tunnel_xmit net/ipv6/sit.c:965 [inline]
          sit_tunnel_xmit+0xe1b/0x30d0 net/ipv6/sit.c:1029
          __netdev_start_xmit include/linux/netdevice.h:4325 [inline]
          netdev_start_xmit include/linux/netdevice.h:4334 [inline]
          xmit_one net/core/dev.c:3219 [inline]
          dev_hard_start_xmit+0x295/0xc90 net/core/dev.c:3235
          __dev_queue_xmit+0x2f0d/0x3950 net/core/dev.c:3805
          dev_queue_xmit+0x17/0x20 net/core/dev.c:3838
          __bpf_tx_skb net/core/filter.c:2016 [inline]
          __bpf_redirect_common net/core/filter.c:2054 [inline]
          __bpf_redirect+0x5cf/0xb20 net/core/filter.c:2061
          ____bpf_clone_redirect net/core/filter.c:2094 [inline]
          bpf_clone_redirect+0x2f6/0x490 net/core/filter.c:2066
          bpf_prog_41f2bcae09cd4ac3+0xb25/0x1000
      
      The generated test constructs a packet with mac header, network
      header, skb->data pointing to network header and skb->len 0.
      
      Redirecting to a sit0 through __bpf_redirect_no_mac pulls the
      mac length, even though skb->data already is at skb->network_header.
      bpf_prog_test_run_skb has already pulled it as LWT_XMIT !is_l2.
      
      Update the offset calculation to pull only if skb->data differs
      from skb->network_header, which is not true in this case.
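
The offset change described above can be sketched in user space. This is a minimal model, not the kernel code: `struct toy_skb`, `toy_pull` and both `toy_redirect_no_mac_*` helpers are hypothetical names, and the 14-byte mac length used in the usage note is an assumption (an Ethernet-sized header).

```c
#include <assert.h>
#include <stdint.h>

/* Toy stand-in for the few sk_buff fields that matter here (hypothetical). */
struct toy_skb {
    uint32_t len;         /* bytes from data to the end of the packet */
    uint32_t data_off;    /* offset of skb->data within the buffer */
    uint32_t network_off; /* offset of the network header */
    uint32_t mac_len;     /* length of the link-layer header */
};

/* Advance data by n bytes without any bounds check. */
static void toy_pull(struct toy_skb *skb, uint32_t n)
{
    skb->len -= n;
    skb->data_off += n;
}

/* Old behaviour: always pull mac_len, even if data is already past it. */
static void toy_redirect_no_mac_old(struct toy_skb *skb)
{
    toy_pull(skb, skb->mac_len);
}

/* Fixed behaviour: pull only the distance from data to the network header. */
static void toy_redirect_no_mac_new(struct toy_skb *skb)
{
    assert(skb->network_off >= skb->data_off);
    uint32_t mlen = skb->network_off - skb->data_off;

    if (mlen)
        toy_pull(skb, mlen);
}
```

With `len` 0, `data_off` equal to `network_off`, and a `mac_len` of 14, the old helper wraps `len` around to 4294967282, the same value as the read size in the KASAN report, while the new helper leaves the packet untouched.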
      
      The test itself can be run only from commit 1cf1cae9 ("bpf:
      introduce BPF_PROG_TEST_RUN command"), but the same type of packets
      with skb at network header could already be built from lwt xmit hooks,
      so this fix is more relevant to that commit.
      
      Also set the mac header on redirect from LWT_XMIT, as even after this
      change to __bpf_redirect_no_mac that field is expected to be set, but
      is not yet in ip_finish_output2.
      
      Fixes: 3a0af8fd ("bpf: BPF for lightweight tunnel infrastructure")
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  2. 18 Jan, 2019 2 commits
  3. 17 Jan, 2019 1 commit
  4. 10 Jan, 2019 2 commits
    • net/core/neighbour: tell kmemleak about hash tables · 85704cb8
      Konstantin Khlebnikov committed
      This fixes false-positive kmemleak reports about leaked neighbour entries:
      
      unreferenced object 0xffff8885c6e4d0a8 (size 1024):
        comm "softirq", pid 0, jiffies 4294922664 (age 167640.804s)
        hex dump (first 32 bytes):
          00 00 00 00 00 00 00 00 20 2c f3 83 ff ff ff ff  ........ ,......
          08 c0 ef 5f 84 88 ff ff 01 8c 7d 02 01 00 00 00  ..._......}.....
        backtrace:
          [<00000000748509fe>] ip6_finish_output2+0x887/0x1e40
          [<0000000036d7a0d8>] ip6_output+0x1ba/0x600
          [<0000000027ea7dba>] ip6_send_skb+0x92/0x2f0
          [<00000000d6e2111d>] udp_v6_send_skb.isra.24+0x680/0x15e0
          [<000000000668a8be>] udpv6_sendmsg+0x18c9/0x27a0
          [<000000004bd5fa90>] sock_sendmsg+0xb3/0xf0
          [<000000008227b29f>] ___sys_sendmsg+0x745/0x8f0
          [<000000008698009d>] __sys_sendmsg+0xde/0x170
          [<00000000889dacf1>] do_syscall_64+0x9b/0x400
          [<0000000081cdb353>] entry_SYSCALL_64_after_hwframe+0x49/0xbe
          [<000000005767ed39>] 0xffffffffffffffff
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: correctly set initial window on active Fast Open sender · 31aa6503
      Yuchung Cheng committed
      The existing BPF TCP initial congestion window (TCP_BPF_IW) does not
      work on an (active) Fast Open sender. This is because it changes the
      (initial) window only if data_segs_out is zero -- but data_segs_out
      is also incremented on SYN-data.  This patch fixes the issue by
      properly accounting for SYN-data additionally.
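
The accounting change can be illustrated with a small user-space model. This is a sketch under assumptions: `struct toy_tcp_sock` and the two `toy_set_iw_*` helpers are hypothetical, and the exact comparison used here (`data_segs_out > syn_data`) is one plausible way to discount the SYN-data segment rather than a quotation of the patch.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical subset of the TCP socket state relevant to TCP_BPF_IW. */
struct toy_tcp_sock {
    uint32_t snd_cwnd;      /* congestion window, in segments */
    uint32_t data_segs_out; /* data segments sent so far, SYN-data included */
    uint32_t syn_data;      /* 1 if the SYN carried data (active Fast Open) */
};

/* Old check: any sent data segment, SYN-data included, blocks the update. */
static int toy_set_iw_old(struct toy_tcp_sock *tp, int val)
{
    if (val <= 0 || tp->data_segs_out > 0)
        return -EINVAL;
    tp->snd_cwnd = (uint32_t)val;
    return 0;
}

/* New check (assumed form): the SYN-data segment is not counted against us. */
static int toy_set_iw_new(struct toy_tcp_sock *tp, int val)
{
    if (val <= 0 || tp->data_segs_out > tp->syn_data)
        return -EINVAL;
    tp->snd_cwnd = (uint32_t)val;
    return 0;
}
```

On a socket whose only data segment so far was the SYN-data, the old helper rejects the request while the new one applies it; once a further data segment has gone out, both reject it.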
      
      Fixes: fc747810 ("bpf: Adds support for setting initial cwnd")
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Reviewed-by: Neal Cardwell <ncardwell@google.com>
      Acked-by: Lawrence Brakmo <brakmo@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  5. 06 Jan, 2019 1 commit
    • jump_label: move 'asm goto' support test to Kconfig · e9666d10
      Masahiro Yamada committed
      Currently, CONFIG_JUMP_LABEL just means "I _want_ to use jump label".
      
      The jump label is controlled by HAVE_JUMP_LABEL, which is defined
      like this:
      
        #if defined(CC_HAVE_ASM_GOTO) && defined(CONFIG_JUMP_LABEL)
        # define HAVE_JUMP_LABEL
        #endif
      
      We can improve this by testing 'asm goto' support in Kconfig, then
      make JUMP_LABEL depend on CC_HAS_ASM_GOTO.
      
      The ugly #ifdef HAVE_JUMP_LABEL will go away, and CONFIG_JUMP_LABEL will
      match the real kernel capability.
      Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
      Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
      Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
  6. 05 Jan, 2019 1 commit
    • net, skbuff: do not prefer skb allocation fails early · f8c468e8
      David Rientjes committed
      Commit dcda9b04 ("mm, tree wide: replace __GFP_REPEAT by
      __GFP_RETRY_MAYFAIL with more useful semantic") replaced __GFP_REPEAT in
      alloc_skb_with_frags() with __GFP_RETRY_MAYFAIL when the allocation may
      directly reclaim.
      
      The previous behavior would require reclaim up to 1 << order pages for
      skb aligned header_len of order > PAGE_ALLOC_COSTLY_ORDER before failing,
      otherwise the allocations in alloc_skb() would loop in the page allocator
      looking for memory.  __GFP_RETRY_MAYFAIL makes both allocations failable
      under memory pressure, including for the HEAD allocation.
      
      This can cause, among many other things, write() to fail with ENOTCONN
      during RPC when under memory pressure.
      
      These allocations should succeed as they did previous to dcda9b04
      even if it requires calling the oom killer and additional looping in the
      page allocator to find memory.  There is no way to specify the previous
      behavior of __GFP_REPEAT, but it's unlikely to be necessary since the
      previous behavior only guaranteed that 1 << order pages would be reclaimed
      before failing for order > PAGE_ALLOC_COSTLY_ORDER.  That reclaim is not
      guaranteed to be contiguous memory, so repeating for such large orders is
      usually not beneficial.
      
      Remove the setting of __GFP_RETRY_MAYFAIL to restore the previous
      behavior, specifically not allowing alloc_skb() to fail for small orders
      and oom killing if necessary rather than allowing RPCs to fail.
      
      Fixes: dcda9b04 ("mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_MAYFAIL with more useful semantic")
      Signed-off-by: David Rientjes <rientjes@google.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  7. 02 Jan, 2019 1 commit
    • sock: Make sock->sk_stamp thread-safe · 3a0ed3e9
      Deepa Dinamani committed
      Al Viro mentioned (Message-ID
      <20170626041334.GZ10672@ZenIV.linux.org.uk>)
      that there is probably a race condition
      lurking in accesses of sk_stamp on 32-bit machines.
      
      sock->sk_stamp is of type ktime_t which is always an s64.
      On a 32 bit architecture, we might run into situations of
      unsafe access as the access to the field becomes non atomic.
      
      Use seqlocks for synchronization.
      This allows us to avoid using spinlocks for readers as
      readers do not need mutual exclusion.
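
The idea of a sequence lock protecting a 64-bit timestamp that is stored as two 32-bit halves can be sketched in portable C11. This is a user-space illustration only: `struct toy_stamp` and its helpers are hypothetical, and the sketch assumes a single writer at a time, whereas the kernel's seqlock also serializes writers.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical 64-bit stamp split into halves, as on a 32-bit machine. */
struct toy_stamp {
    atomic_uint seq;     /* odd while a write is in progress */
    _Atomic uint32_t lo; /* low 32 bits of the stamp */
    _Atomic uint32_t hi; /* high 32 bits of the stamp */
};

static void toy_stamp_init(struct toy_stamp *s)
{
    atomic_init(&s->seq, 0);
    atomic_init(&s->lo, 0);
    atomic_init(&s->hi, 0);
}

/* Writer: make seq odd, update both halves, make seq even again. */
static void toy_stamp_write(struct toy_stamp *s, int64_t v)
{
    unsigned int seq = atomic_load_explicit(&s->seq, memory_order_relaxed);

    atomic_store_explicit(&s->seq, seq + 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&s->lo, (uint32_t)v, memory_order_relaxed);
    atomic_store_explicit(&s->hi, (uint32_t)((uint64_t)v >> 32),
                          memory_order_relaxed);
    atomic_store_explicit(&s->seq, seq + 2, memory_order_release);
}

/* Reader: takes no lock, retries if a write overlapped the two loads. */
static int64_t toy_stamp_read(struct toy_stamp *s)
{
    unsigned int start;
    uint32_t lo, hi;

    do {
        start = atomic_load_explicit(&s->seq, memory_order_acquire);
        lo = atomic_load_explicit(&s->lo, memory_order_relaxed);
        hi = atomic_load_explicit(&s->hi, memory_order_relaxed);
        atomic_thread_fence(memory_order_acquire);
    } while ((start & 1u) ||
             atomic_load_explicit(&s->seq, memory_order_relaxed) != start);

    return (int64_t)(((uint64_t)hi << 32) | lo);
}
```

Readers never block the writer; they only repeat the two loads when the sequence counter shows that a write was in flight, which is why no mutual exclusion is needed on the read side.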
      
      Another approach to solve this is to require sk_lock for all
      modifications of the timestamps. The current approach allows
      for timestamps to have their own lock: sk_stamp_lock.
      This allows for the patch to not compete with already
      existing critical sections, and side effects are limited
      to the paths in the patch.
      
      The addition of the new field maintains the data locality
      optimizations from
      commit 9115e8cd ("net: reorganize struct sock for better data
      locality")
      
      Note that all the instances of the sk_stamp accesses
      are either through the ioctl or the syscall recvmsg.
      Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  8. 31 Dec, 2018 1 commit
  9. 29 Dec, 2018 1 commit
  10. 25 Dec, 2018 1 commit
  11. 24 Dec, 2018 1 commit
  12. 23 Dec, 2018 1 commit
  13. 22 Dec, 2018 3 commits
    • net: minor cleanup in skb_ext_add() · 682ec859
      Paolo Abeni committed
      When the extension to be added is already present, the only
      skb field we may need to update is 'extensions': we can reorder
      the code and avoid a branch.
      
      v1 -> v2:
       - be sure to flag the newly added extension as active
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Acked-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: fix possible user-after-free in skb_ext_add() · e94e50bd
      Paolo Abeni committed
      On cow we can free the old extension: we must avoid dereferencing
      that extension after skb_ext_maybe_cow(). Since the contents of 'new'
      are always equal to 'old' after the copy, we can fix this by
      accessing the relevant data through 'new'.
      
      Fixes: df5042f4 ("sk_buff: add skb extension infrastructure")
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Acked-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Prevent overflow of sk_msg in sk_msg_clone() · 5c1e7e94
      Vakul Garg committed
      Fixed function sk_msg_clone() to prevent overflow of 'dst' while adding
      pages in scatterlist entries. The overflow of 'dst' causes crash in kernel
      tls module while doing record encryption.
      
      Crash fixed by this patch.
      
      [   78.796119] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000008
      [   78.804900] Mem abort info:
      [   78.807683]   ESR = 0x96000004
      [   78.810744]   Exception class = DABT (current EL), IL = 32 bits
      [   78.816677]   SET = 0, FnV = 0
      [   78.819727]   EA = 0, S1PTW = 0
      [   78.822873] Data abort info:
      [   78.825759]   ISV = 0, ISS = 0x00000004
      [   78.829600]   CM = 0, WnR = 0
      [   78.832576] user pgtable: 4k pages, 48-bit VAs, pgdp = 00000000bf8ee311
      [   78.839195] [0000000000000008] pgd=0000000000000000
      [   78.844081] Internal error: Oops: 96000004 [#1] PREEMPT SMP
      [   78.849642] Modules linked in: tls xt_conntrack ipt_REJECT nf_reject_ipv4 ip6table_filter ip6_tables xt_CHECKSUM cpve cpufreq_conservative lm90 ina2xx crct10dif_ce
      [   78.865377] CPU: 0 PID: 6007 Comm: openssl Not tainted 4.20.0-rc6-01647-g754d5da6-dirty #107
      [   78.874149] Hardware name: LS1043A RDB Board (DT)
      [   78.878844] pstate: 60000005 (nZCv daif -PAN -UAO)
      [   78.883632] pc : scatterwalk_copychunks+0x164/0x1c8
      [   78.888500] lr : scatterwalk_copychunks+0x160/0x1c8
      [   78.893366] sp : ffff00001d04b600
      [   78.896668] x29: ffff00001d04b600 x28: ffff80006814c680
      [   78.901970] x27: 0000000000000000 x26: ffff80006c8de786
      [   78.907272] x25: ffff00001d04b760 x24: 000000000000001a
      [   78.912573] x23: 0000000000000006 x22: ffff80006814e440
      [   78.917874] x21: 0000000000000100 x20: 0000000000000000
      [   78.923175] x19: 000081ffffffffff x18: 0000000000000400
      [   78.928476] x17: 0000000000000008 x16: 0000000000000000
      [   78.933778] x15: 0000000000000100 x14: 0000000000000001
      [   78.939079] x13: 0000000000001080 x12: 0000000000000020
      [   78.944381] x11: 0000000000001080 x10: 00000000ffff0002
      [   78.949683] x9 : ffff80006814c248 x8 : 00000000ffff0000
      [   78.954985] x7 : ffff80006814c318 x6 : ffff80006c8de786
      [   78.960286] x5 : 0000000000000f80 x4 : ffff80006c8de000
      [   78.965588] x3 : 0000000000000000 x2 : 0000000000001086
      [   78.970889] x1 : ffff7e0001b74e02 x0 : 0000000000000000
      [   78.976192] Process openssl (pid: 6007, stack limit = 0x00000000291367f9)
      [   78.982968] Call trace:
      [   78.985406]  scatterwalk_copychunks+0x164/0x1c8
      [   78.989927]  skcipher_walk_next+0x28c/0x448
      [   78.994099]  skcipher_walk_done+0xfc/0x258
      [   78.998187]  gcm_encrypt+0x434/0x4c0
      [   79.001758]  tls_push_record+0x354/0xa58 [tls]
      [   79.006194]  bpf_exec_tx_verdict+0x1e4/0x3e8 [tls]
      [   79.010978]  tls_sw_sendmsg+0x650/0x780 [tls]
      [   79.015326]  inet_sendmsg+0x2c/0xf8
      [   79.018806]  sock_sendmsg+0x18/0x30
      [   79.022284]  __sys_sendto+0x104/0x138
      [   79.025935]  __arm64_sys_sendto+0x24/0x30
      [   79.029936]  el0_svc_common+0x60/0xe8
      [   79.033588]  el0_svc_handler+0x2c/0x80
      [   79.037327]  el0_svc+0x8/0xc
      [   79.040200] Code: 6b01005f 54fff788 940169b1 f9000320 (b9400801)
      [   79.046283] ---[ end trace 74db007d069c1cf7 ]---
      
      Fixes: d829e9c4 ("tls: convert to generic sk_msg interface")
      Signed-off-by: Vakul Garg <vakul.garg@nxp.com>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  14. 21 Dec, 2018 6 commits
  15. 20 Dec, 2018 8 commits
    • neighbor: Use nda_policy for validating attributes in adds and dump requests · a9cd3439
      David Ahern committed
      Add NDA_PROTOCOL to nda_policy and use the policy for attribute parsing and
      validation for adding neighbors and in dump requests. Remove the now duplicate
      checks on nla_len.
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • neighbor: NTF_PROXY is a valid ndm_flag for a dump request · c0fde870
      David Ahern committed
      When dumping proxy entries the dump request has NTF_PROXY set in
      ndm_flags. strict mode checking needs to be updated to allow this
      flag.
      
      Fixes: 51183d23 ("net/neighbor: Update neigh_dump_info for strict data checking")
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • neighbor: Initialize protocol when new pneigh_entry are created · 754d5da6
      David Ahern committed
      pneigh_lookup uses kmalloc rather than kzalloc when new entries are
      allocated. Given that, the newly added protocol field needs to be
      initialized explicitly.
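
The difference between a non-zeroing and a zeroing allocator can be shown with a user-space analogue, where `malloc` plays the role of kmalloc. This is an illustrative sketch: `struct toy_pneigh_entry`, its field layout and `toy_pneigh_create` are hypothetical and do not reproduce the kernel structure.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, reduced proxy-neighbour entry. */
struct toy_pneigh_entry {
    uint8_t flags;
    uint8_t protocol; /* the newly added field */
    uint8_t key[16];
};

/*
 * malloc(), like kmalloc, returns memory with indeterminate contents,
 * so every field that is read later must be set explicitly.
 */
static struct toy_pneigh_entry *toy_pneigh_create(const void *key,
                                                  size_t key_len)
{
    struct toy_pneigh_entry *n;

    if (key_len > sizeof(n->key))
        return NULL;

    n = malloc(sizeof(*n));
    if (!n)
        return NULL;

    n->flags = 0;
    n->protocol = 0; /* without this line the field holds garbage */
    memset(n->key, 0, sizeof(n->key));
    memcpy(n->key, key, key_len);
    return n;
}
```

Switching the allocation to a zeroing call (`calloc` here) would be the alternative; initializing the one new field keeps the change minimal.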
      
      Fixes: df9b0e30 ("neighbor: Add protocol attribute")
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • gro_cell: add napi_disable in gro_cells_destroy · 8e1da73a
      Lorenzo Bianconi committed
      Add napi_disable routine in gro_cells_destroy since starting from
      commit c42858ea ("gro_cells: remove spinlock protecting receive
      queues") gro_cell_poll and gro_cells_destroy can run concurrently on
      napi_skbs list producing a kernel Oops if the tunnel interface is
      removed while gro_cell_poll is running. The following Oops has been
      triggered removing a vxlan device while the interface is receiving
      traffic
      
      [ 5628.948853] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
      [ 5628.949981] PGD 0 P4D 0
      [ 5628.950308] Oops: 0002 [#1] SMP PTI
      [ 5628.950748] CPU: 0 PID: 9 Comm: ksoftirqd/0 Not tainted 4.20.0-rc6+ #41
      [ 5628.952940] RIP: 0010:gro_cell_poll+0x49/0x80
      [ 5628.955615] RSP: 0018:ffffc9000004fdd8 EFLAGS: 00010202
      [ 5628.956250] RAX: 0000000000000000 RBX: ffffe8ffffc08150 RCX: 0000000000000000
      [ 5628.957102] RDX: 0000000000000000 RSI: ffff88802356bf00 RDI: ffffe8ffffc08150
      [ 5628.957940] RBP: 0000000000000026 R08: 0000000000000000 R09: 0000000000000000
      [ 5628.958803] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000040
      [ 5628.959661] R13: ffffe8ffffc08100 R14: 0000000000000000 R15: 0000000000000040
      [ 5628.960682] FS:  0000000000000000(0000) GS:ffff88803ea00000(0000) knlGS:0000000000000000
      [ 5628.961616] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 5628.962359] CR2: 0000000000000008 CR3: 000000000221c000 CR4: 00000000000006b0
      [ 5628.963188] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [ 5628.964034] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [ 5628.964871] Call Trace:
      [ 5628.965179]  net_rx_action+0xf0/0x380
      [ 5628.965637]  __do_softirq+0xc7/0x431
      [ 5628.966510]  run_ksoftirqd+0x24/0x30
      [ 5628.966957]  smpboot_thread_fn+0xc5/0x160
      [ 5628.967436]  kthread+0x113/0x130
      [ 5628.968283]  ret_from_fork+0x3a/0x50
      [ 5628.968721] Modules linked in:
      [ 5628.969099] CR2: 0000000000000008
      [ 5628.969510] ---[ end trace 9d9dedc7181661fe ]---
      [ 5628.970073] RIP: 0010:gro_cell_poll+0x49/0x80
      [ 5628.972965] RSP: 0018:ffffc9000004fdd8 EFLAGS: 00010202
      [ 5628.973611] RAX: 0000000000000000 RBX: ffffe8ffffc08150 RCX: 0000000000000000
      [ 5628.974504] RDX: 0000000000000000 RSI: ffff88802356bf00 RDI: ffffe8ffffc08150
      [ 5628.975462] RBP: 0000000000000026 R08: 0000000000000000 R09: 0000000000000000
      [ 5628.976413] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000040
      [ 5628.977375] R13: ffffe8ffffc08100 R14: 0000000000000000 R15: 0000000000000040
      [ 5628.978296] FS:  0000000000000000(0000) GS:ffff88803ea00000(0000) knlGS:0000000000000000
      [ 5628.979327] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 5628.980044] CR2: 0000000000000008 CR3: 000000000221c000 CR4: 00000000000006b0
      [ 5628.980929] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [ 5628.981736] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [ 5628.982409] Kernel panic - not syncing: Fatal exception in interrupt
      [ 5628.983307] Kernel Offset: disabled
      
      Fixes: c42858ea ("gro_cells: remove spinlock protecting receive queues")
      Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • neighbour: register rtnl doit handler · 82cbb5c6
      Roopa Prabhu committed
      This patch registers the neigh doit handler. The doit handler
      returns a neigh entry given dst and dev. This is similar
      to the route and fdb doit (get) handlers. It also moves the nda_policy
      declaration from rtnetlink.c to neighbour.c.
      Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
      Reviewed-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: switch secpath to use skb extension infrastructure · 4165079b
      Florian Westphal committed
      Remove skb->sp and allocate secpath storage via extension
      infrastructure.  This also reduces sk_buff by 8 bytes on x86_64.
      
      Total size of allyesconfig kernel is reduced slightly, as there is
      less inlined code (one conditional atomic op instead of two on
      skb_clone).
      
      No differences in throughput in following ipsec performance tests:
      - transport mode with aes on 10GB link
      - tunnel mode between two network namespaces with aes and null cipher
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: convert bridge_nf to use skb extension infrastructure · de8bda1d
      Florian Westphal committed
      This converts the bridge netfilter (calling iptables hooks from bridge)
      facility to use the extension infrastructure.
      
      The bridge_nf specific hooks in skb clone and free paths are removed, they
      have been replaced by the skb_ext hooks that do the same as the bridge nf
      allocations hooks did.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • sk_buff: add skb extension infrastructure · df5042f4
      Florian Westphal committed
      This adds an optional extension infrastructure, with ipsec (xfrm) and
      bridge netfilter as first users.
      objdiff shows no changes if kernel is built without xfrm and br_netfilter
      support.
      
      The third (planned future) user is Multipath TCP which is still
      out-of-tree.
      MPTCP needs to map logical mptcp sequence numbers to the tcp sequence
      numbers used by individual subflows.
      
      This DSS mapping is read/written from tcp option space on receive and
      written to tcp option space on transmitted tcp packets that are part of
      an MPTCP connection.
      
      Extending skb_shared_info or adding a private data field to skb fclones
      doesn't work for incoming skb, so a different DSS propagation method would
      be required for the receive side.
      
      mptcp has same requirements as secpath/bridge netfilter:
      
      1. extension memory is released when the sk_buff is free'd.
      2. data is shared after cloning an skb (clone inherits extension)
      3. adding extension to an skb will COW the extension buffer if needed.
      
      The "MPTCP upstreaming" effort adds SKB_EXT_MPTCP extension to store the
      mapping for tx and rx processing.
      
      Two new members are added to sk_buff:
      1. 'active_extensions' byte (filling a hole), telling which extensions
         are available for this skb.
         This has two purposes.
         a) avoids the need to initialize the pointer.
         b) allows to "delete" an extension by clearing its bit
         value in ->active_extensions.
      
         While it would be possible to store the active_extensions byte
         in the extension struct instead of sk_buff, there is one problem
         with this:
          When an extension has to be disabled, we can always clear the
          bit in skb->active_extensions.  But in case it would be stored in the
          extension buffer itself, we might have to COW it first, if
          we are dealing with a cloned skb.  On kmalloc failure we would
          be unable to turn an extension off.
      
      2. extension pointer, located at the end of the sk_buff.
         If the active_extensions byte is 0, the pointer is undefined,
         it is not initialized on skb allocation.
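
The role of the `active_extensions` byte can be sketched with a toy structure. This is a minimal model, not the kernel API: `struct toy_skb`, `enum toy_ext_id` and the `toy_ext_*` helpers are hypothetical names chosen for illustration, and no extension storage is modelled.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical extension ids, one bit each in active_extensions. */
enum toy_ext_id {
    TOY_EXT_SEC_PATH,
    TOY_EXT_BRIDGE_NF,
    TOY_EXT_NUM
};

struct toy_skb {
    uint8_t active_extensions; /* bitmask of extensions present */
    void *extensions;          /* undefined while active_extensions == 0 */
};

/* Allocation-time setup: only the byte is cleared, not the pointer. */
static void toy_skb_init(struct toy_skb *skb)
{
    skb->active_extensions = 0;
}

static bool toy_ext_exist(const struct toy_skb *skb, enum toy_ext_id id)
{
    return (skb->active_extensions & (1u << id)) != 0;
}

static void toy_ext_mark_active(struct toy_skb *skb, enum toy_ext_id id)
{
    skb->active_extensions |= (uint8_t)(1u << id);
}

/* "Deleting" an extension needs no allocation, so it cannot fail. */
static void toy_ext_del(struct toy_skb *skb, enum toy_ext_id id)
{
    skb->active_extensions &= (uint8_t)~(1u << id);
}
```

Because the bit lives in the skb itself, turning an extension off never has to copy a shared extension buffer first, which is the failure case the message describes for the alternative layout.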
      
      This adds extra code to skb clone and free paths (to deal with
      refcount/free of extension area) but this replaces similar code that
      manages skb->nf_bridge and skb->sp structs in the followup patches of
      the series.
      
      It is possible to add support for extensions that are not preserved on
      clones/copies.
      
      To do this, it would be needed to define a bitmask of all extensions that
      need copy/cow semantics, and change __skb_ext_copy() to check
      ->active_extensions & SKB_EXT_PRESERVE_ON_CLONE, then just set
      ->active_extensions to 0 on the new clone.
      
      This isn't done here because all extensions that get added here
      need the copy/cow semantics.
      
      v2:
      Allocate entire extension space using kmem_cache.
      Upside is that this allows better tracking of used memory,
      downside is that we will allocate more space than strictly needed in
      most cases (it's unlikely that all extensions are active/needed at same
      time for same skb).
      The allocated memory (except the small extension header) is not cleared,
      so no additional overhead aside from memory usage.
      
      Avoid atomic_dec_and_test operation on skb_ext_put()
      by using similar trick as kfree_skbmem() does with fclone_ref:
      If refcount is 1, there is no concurrent user and we can free right away.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  16. 19 Dec, 2018 1 commit
    • bpf: sockmap, metadata support for reporting size of msg · 3bdbd022
      John Fastabend committed
      This adds metadata to sk_msg_md for BPF programs to read the sk_msg
      size.
      
      When the SK_MSG program is running under an application that is using
      sendfile the data is not copied into sk_msg buffers by default. Rather
      the BPF program uses sk_msg_pull_data to read the bytes in. This
      avoids doing the costly memcopy instructions when they are not in
      fact needed. However, if we don't know the size of the sk_msg we
      have to guess if needed bytes are available by doing a pull request
      which may fail. By including the size of the sk_msg BPF programs can
      check the size before issuing sk_msg_pull_data requests.
      
      Additionally, the same applies for sendmsg calls when the application
      provides multiple iovs. Here the BPF program needs to pull in data
      to update data pointers but it's not clear where the data ends without
      a size parameter. In many cases "guessing" is not easy to do
      and results in multiple calls to pull and without bounded loops
      everything gets fairly tricky.
      
      Clean this up by including a u32 size field. Note, all writes into
      sk_msg_md are rejected already from sk_msg_is_valid_access so nothing
      additional is needed there.
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  17. 17 Dec, 2018 2 commits
  18. 16 Dec, 2018 1 commit
  19. 15 Dec, 2018 5 commits
    • neighbor: Remove externally learned entries from gc_list · e997f8a2
      David Ahern committed
      Externally learned entries are similar to PERMANENT entries in the
      sense they are managed by userspace and can not be garbage collected.
      As such remove them from the gc_list, remove the flags check from
      neigh_forced_gc and skip threshold checks in neigh_alloc. As with
      PERMANENT entries, this allows unlimited number of NTF_EXT_LEARNED
      entries.
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • neighbor: Move neigh_update_ext_learned to core file · 526f1b58
      David Ahern committed
      neigh_update_ext_learned has one caller in neighbour.c so does not need
      to be defined in the header. Move it and in the process remove the
      initialization of ndm_flags and just set it based on the flags check.
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • neighbor: Remove state and flags arguments to neigh_del · 7e6f182b
      David Ahern committed
      neigh_del now only has 1 caller, and the state and flags arguments
      are both 0. Remove them and simplify neigh_del.
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • neighbor: Fix state check in neigh_forced_gc · 758a7f0b
      David Ahern committed
      PERMANENT entries are not on the gc_list so the state check is now
      redundant. Also, the move to not purge entries until after 5 seconds
      should not apply to FAILED entries; those can be removed immediately
      to make way for newer ones. This restores the previous logic prior to
      the gc_list.
      
      Fixes: 58956317 ("neighbor: Improve garbage collection")
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • neighbor: Fix locking order for gc_list changes · 9c29a2f5
      David Ahern committed
      Lock checker noted an inverted lock order between neigh_change_state
      (neighbor lock then table lock) and neigh_periodic_work (table lock and
      then neighbor lock) resulting in:
      
      [  121.057652] ======================================================
      [  121.058740] WARNING: possible circular locking dependency detected
      [  121.059861] 4.20.0-rc6+ #43 Not tainted
      [  121.060546] ------------------------------------------------------
      [  121.061630] kworker/0:2/65 is trying to acquire lock:
      [  121.062519] (____ptrval____) (&n->lock){++--}, at: neigh_periodic_work+0x237/0x324
      [  121.063894]
      [  121.063894] but task is already holding lock:
      [  121.064920] (____ptrval____) (&tbl->lock){+.-.}, at: neigh_periodic_work+0x194/0x324
      [  121.066274]
      [  121.066274] which lock already depends on the new lock.
      [  121.066274]
      [  121.067693]
      [  121.067693] the existing dependency chain (in reverse order) is:
      ...
      
      Fix by renaming neigh_change_state to neigh_update_gc_list, changing
      it to only manage whether an entry should be on the gc_list and taking
      locks in the same order as neigh_periodic_work. Invoke at the end of
      neigh_update only if diff between old or new states has the PERMANENT
      flag set.
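
The condition for calling the gc_list update, "the diff between old and new states has the PERMANENT flag set", amounts to an XOR test on the state bits. The sketch below is illustrative: the helper name is hypothetical, and the `TOY_NUD_*` values are assumed to mirror the corresponding `NUD_*` constants in the kernel's uapi header.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed to match NUD_REACHABLE, NUD_STALE and NUD_PERMANENT. */
#define TOY_NUD_REACHABLE 0x02
#define TOY_NUD_STALE     0x04
#define TOY_NUD_PERMANENT 0x80

/*
 * An entry only needs to move on or off the gc_list when it gains or
 * loses the PERMANENT bit, i.e. when that bit differs between states.
 */
static bool toy_needs_gc_list_update(uint8_t old_state, uint8_t new_state)
{
    return ((old_state ^ new_state) & TOY_NUD_PERMANENT) != 0;
}
```

Transitions between two non-permanent states, or from PERMANENT to PERMANENT, leave the bit unchanged and so skip the list update and its locking entirely.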
      
      Fixes: 8cc196d6 ("neighbor: gc_list changes should be protected by table lock")
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>