1. 19 1月, 2019 5 次提交
  2. 18 1月, 2019 4 次提交
    • P
      net: add a route cache full diagnostic message · 22c2ad61
      Peter Oskolkov 提交于
      In some testing scenarios, dst/route cache can fill up so quickly
      that even an explicit GC call occasionally fails to clean it up. This leads
      to sporadically failing calls to dst_alloc and "network unreachable" errors
      to the user, which is confusing.
      
      This patch adds a diagnostic message to make the cause of the failure
      easier to determine.
      Signed-off-by: NPeter Oskolkov <posk@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      22c2ad61
    • P
      net: Add extack argument to ndo_fdb_add() · 87b0984e
      Petr Machata 提交于
      Drivers may not be able to support certain FDB entries, and an error
      code is insufficient to give clear hints as to the reasons of rejection.
      
      In order to make it possible to communicate the rejection reason, extend
      ndo_fdb_add() with an extack argument. Adapt the existing
      implementations of ndo_fdb_add() to take the parameter (and ignore it).
      Pass the extack parameter when invoking ndo_fdb_add() from rtnl_fdb_add().
      Signed-off-by: NPetr Machata <petrm@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      87b0984e
    • D
      net: introduce SO_BINDTOIFINDEX sockopt · f5dd3d0c
      David Herrmann 提交于
      This introduces a new generic SOL_SOCKET-level socket option called
      SO_BINDTOIFINDEX. It behaves similar to SO_BINDTODEVICE, but takes a
      network interface index as argument, rather than the network interface
      name.
      
      User-space often refers to network-interfaces via their index, but has
      to temporarily resolve it to a name for a call into SO_BINDTODEVICE.
      This might pose problems when the network-device is renamed
      asynchronously by other parts of the system. When this happens, the
      SO_BINDTODEVICE might either fail, or worse, it might bind to the wrong
      device.
      
      In most cases user-space only ever operates on devices which they
      either manage themselves, or otherwise have a guarantee that the device
      name will not change (e.g., devices that are UP cannot be renamed).
      However, particularly in libraries this guarantee is non-obvious and it
      would be nice if that race-condition would simply not exist. It would
      make it easier for those libraries to operate even in situations where
      the device-name might change under the hood.
      
      A real use-case that we recently hit is trying to start the network
      stack early in the initrd but make it survive into the real system.
      Existing distributions rename network-interfaces during the transition
      from initrd into the real system. This, obviously, cannot affect
      devices that are up and running (unless you also consider moving them
      between network-namespaces). However, the network manager now has to
      make sure its management engine for dormant devices will not run in
      parallel to these renames. Particularly, when you offload operations
      like DHCP into separate processes, these might setup their sockets
      early, and thus have to resolve the device-name possibly running into
      this race-condition.
      
      By avoiding a call to resolve the device-name, we no longer depend on
      the name and can run network setup of dormant devices in parallel to
      the transition off the initrd. The SO_BINDTOIFINDEX ioctl plugs this
      race.
      Reviewed-by: NTom Gundersen <teg@jklm.no>
      Signed-off-by: NDavid Herrmann <dh.herrmann@gmail.com>
      Acked-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f5dd3d0c
    • V
      Optimize sk_msg_clone() by data merge to end dst sg entry · fda497e5
      Vakul Garg 提交于
      Function sk_msg_clone has been modified to merge the data from source sg
      entry to destination sg entry if the cloned data resides in same page
      and is contiguous to the end entry of destination sk_msg. This improves
      kernel tls throughput to the tune of 10%.
      
      When the user space tls application calls sendmsg() with MSG_MORE, it leads
      to calling sk_msg_clone() with new data being cloned placed continuous to
      previously cloned data. Without this optimization, a new SG entry in
      the destination sk_msg i.e. rec->msg_plaintext in tls_clone_plaintext_msg()
      gets used. This leads to exhaustion of sg entries in rec->msg_plaintext
      even before a full 16K of allowable record data is accumulated. Hence we
      lose oppurtunity to encrypt and send a full 16K record.
      
      With this patch, the kernel tls can accumulate full 16K of record data
      irrespective of the size of data passed in sendmsg() with MSG_MORE.
      Signed-off-by: NVakul Garg <vakul.garg@nxp.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      fda497e5
  3. 10 1月, 2019 2 次提交
    • K
      net/core/neighbour: tell kmemleak about hash tables · 85704cb8
      Konstantin Khlebnikov 提交于
      This fixes false-positive kmemleak reports about leaked neighbour entries:
      
      unreferenced object 0xffff8885c6e4d0a8 (size 1024):
        comm "softirq", pid 0, jiffies 4294922664 (age 167640.804s)
        hex dump (first 32 bytes):
          00 00 00 00 00 00 00 00 20 2c f3 83 ff ff ff ff  ........ ,......
          08 c0 ef 5f 84 88 ff ff 01 8c 7d 02 01 00 00 00  ..._......}.....
        backtrace:
          [<00000000748509fe>] ip6_finish_output2+0x887/0x1e40
          [<0000000036d7a0d8>] ip6_output+0x1ba/0x600
          [<0000000027ea7dba>] ip6_send_skb+0x92/0x2f0
          [<00000000d6e2111d>] udp_v6_send_skb.isra.24+0x680/0x15e0
          [<000000000668a8be>] udpv6_sendmsg+0x18c9/0x27a0
          [<000000004bd5fa90>] sock_sendmsg+0xb3/0xf0
          [<000000008227b29f>] ___sys_sendmsg+0x745/0x8f0
          [<000000008698009d>] __sys_sendmsg+0xde/0x170
          [<00000000889dacf1>] do_syscall_64+0x9b/0x400
          [<0000000081cdb353>] entry_SYSCALL_64_after_hwframe+0x49/0xbe
          [<000000005767ed39>] 0xffffffffffffffff
      Signed-off-by: NKonstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      85704cb8
    • Y
      bpf: correctly set initial window on active Fast Open sender · 31aa6503
      Yuchung Cheng 提交于
      The existing BPF TCP initial congestion window (TCP_BPF_IW) does not
      to work on (active) Fast Open sender. This is because it changes the
      (initial) window only if data_segs_out is zero -- but data_segs_out
      is also incremented on SYN-data.  This patch fixes the issue by
      proerly accounting for SYN-data additionally.
      
      Fixes: fc747810 ("bpf: Adds support for setting initial cwnd")
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Reviewed-by: NNeal Cardwell <ncardwell@google.com>
      Acked-by: NLawrence Brakmo <brakmo@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      31aa6503
  4. 06 1月, 2019 1 次提交
    • M
      jump_label: move 'asm goto' support test to Kconfig · e9666d10
      Masahiro Yamada 提交于
      Currently, CONFIG_JUMP_LABEL just means "I _want_ to use jump label".
      
      The jump label is controlled by HAVE_JUMP_LABEL, which is defined
      like this:
      
        #if defined(CC_HAVE_ASM_GOTO) && defined(CONFIG_JUMP_LABEL)
        # define HAVE_JUMP_LABEL
        #endif
      
      We can improve this by testing 'asm goto' support in Kconfig, then
      make JUMP_LABEL depend on CC_HAS_ASM_GOTO.
      
      Ugly #ifdef HAVE_JUMP_LABEL will go away, and CONFIG_JUMP_LABEL will
      match to the real kernel capability.
      Signed-off-by: NMasahiro Yamada <yamada.masahiro@socionext.com>
      Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
      Tested-by: NSedat Dilek <sedat.dilek@gmail.com>
      e9666d10
  5. 05 1月, 2019 1 次提交
    • D
      net, skbuff: do not prefer skb allocation fails early · f8c468e8
      David Rientjes 提交于
      Commit dcda9b04 ("mm, tree wide: replace __GFP_REPEAT by
      __GFP_RETRY_MAYFAIL with more useful semantic") replaced __GFP_REPEAT in
      alloc_skb_with_frags() with __GFP_RETRY_MAYFAIL when the allocation may
      directly reclaim.
      
      The previous behavior would require reclaim up to 1 << order pages for
      skb aligned header_len of order > PAGE_ALLOC_COSTLY_ORDER before failing,
      otherwise the allocations in alloc_skb() would loop in the page allocator
      looking for memory.  __GFP_RETRY_MAYFAIL makes both allocations failable
      under memory pressure, including for the HEAD allocation.
      
      This can cause, among many other things, write() to fail with ENOTCONN
      during RPC when under memory pressure.
      
      These allocations should succeed as they did previous to dcda9b04
      even if it requires calling the oom killer and additional looping in the
      page allocator to find memory.  There is no way to specify the previous
      behavior of __GFP_REPEAT, but it's unlikely to be necessary since the
      previous behavior only guaranteed that 1 << order pages would be reclaimed
      before failing for order > PAGE_ALLOC_COSTLY_ORDER.  That reclaim is not
      guaranteed to be contiguous memory, so repeating for such large orders is
      usually not beneficial.
      
      Removing the setting of __GFP_RETRY_MAYFAIL to restore the previous
      behavior, specifically not allowing alloc_skb() to fail for small orders
      and oom kill if necessary rather than allowing RPCs to fail.
      
      Fixes: dcda9b04 ("mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_MAYFAIL with more useful semantic")
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Reviewed-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f8c468e8
  6. 02 1月, 2019 1 次提交
    • D
      sock: Make sock->sk_stamp thread-safe · 3a0ed3e9
      Deepa Dinamani 提交于
      Al Viro mentioned (Message-ID
      <20170626041334.GZ10672@ZenIV.linux.org.uk>)
      that there is probably a race condition
      lurking in accesses of sk_stamp on 32-bit machines.
      
      sock->sk_stamp is of type ktime_t which is always an s64.
      On a 32 bit architecture, we might run into situations of
      unsafe access as the access to the field becomes non atomic.
      
      Use seqlocks for synchronization.
      This allows us to avoid using spinlocks for readers as
      readers do not need mutual exclusion.
      
      Another approach to solve this is to require sk_lock for all
      modifications of the timestamps. The current approach allows
      for timestamps to have their own lock: sk_stamp_lock.
      This allows for the patch to not compete with already
      existing critical sections, and side effects are limited
      to the paths in the patch.
      
      The addition of the new field maintains the data locality
      optimizations from
      commit 9115e8cd ("net: reorganize struct sock for better data
      locality")
      
      Note that all the instances of the sk_stamp accesses
      are either through the ioctl or the syscall recvmsg.
      Signed-off-by: NDeepa Dinamani <deepa.kernel@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3a0ed3e9
  7. 31 12月, 2018 1 次提交
  8. 29 12月, 2018 1 次提交
  9. 25 12月, 2018 1 次提交
  10. 24 12月, 2018 1 次提交
  11. 23 12月, 2018 1 次提交
  12. 22 12月, 2018 3 次提交
    • P
      net: minor cleanup in skb_ext_add() · 682ec859
      Paolo Abeni 提交于
      When the extension to be added is already present, the only
      skb field we may need to update is 'extensions': we can reorder
      the code and avoid a branch.
      
      v1 -> v2:
       - be sure to flag the newly added extension as active
      Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
      Acked-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      682ec859
    • P
      net: fix possible user-after-free in skb_ext_add() · e94e50bd
      Paolo Abeni 提交于
      On cow we can free the old extension: we must avoid dereferencing
      such extension after skb_ext_maybe_cow(). Since 'new' contents
      are always equal to 'old' after the copy, we can fix the above
      accessing the relevant data using 'new'.
      
      Fixes: df5042f4 ("sk_buff: add skb extension infrastructure")
      Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
      Acked-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e94e50bd
    • V
      Prevent overflow of sk_msg in sk_msg_clone() · 5c1e7e94
      Vakul Garg 提交于
      Fixed function sk_msg_clone() to prevent overflow of 'dst' while adding
      pages in scatterlist entries. The overflow of 'dst' causes crash in kernel
      tls module while doing record encryption.
      
      Crash fixed by this patch.
      
      [   78.796119] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000008
      [   78.804900] Mem abort info:
      [   78.807683]   ESR = 0x96000004
      [   78.810744]   Exception class = DABT (current EL), IL = 32 bits
      [   78.816677]   SET = 0, FnV = 0
      [   78.819727]   EA = 0, S1PTW = 0
      [   78.822873] Data abort info:
      [   78.825759]   ISV = 0, ISS = 0x00000004
      [   78.829600]   CM = 0, WnR = 0
      [   78.832576] user pgtable: 4k pages, 48-bit VAs, pgdp = 00000000bf8ee311
      [   78.839195] [0000000000000008] pgd=0000000000000000
      [   78.844081] Internal error: Oops: 96000004 [#1] PREEMPT SMP
      [   78.849642] Modules linked in: tls xt_conntrack ipt_REJECT nf_reject_ipv4 ip6table_filter ip6_tables xt_CHECKSUM cpve cpufreq_conservative lm90 ina2xx crct10dif_ce
      [   78.865377] CPU: 0 PID: 6007 Comm: openssl Not tainted 4.20.0-rc6-01647-g754d5da6-dirty #107
      [   78.874149] Hardware name: LS1043A RDB Board (DT)
      [   78.878844] pstate: 60000005 (nZCv daif -PAN -UAO)
      [   78.883632] pc : scatterwalk_copychunks+0x164/0x1c8
      [   78.888500] lr : scatterwalk_copychunks+0x160/0x1c8
      [   78.893366] sp : ffff00001d04b600
      [   78.896668] x29: ffff00001d04b600 x28: ffff80006814c680
      [   78.901970] x27: 0000000000000000 x26: ffff80006c8de786
      [   78.907272] x25: ffff00001d04b760 x24: 000000000000001a
      [   78.912573] x23: 0000000000000006 x22: ffff80006814e440
      [   78.917874] x21: 0000000000000100 x20: 0000000000000000
      [   78.923175] x19: 000081ffffffffff x18: 0000000000000400
      [   78.928476] x17: 0000000000000008 x16: 0000000000000000
      [   78.933778] x15: 0000000000000100 x14: 0000000000000001
      [   78.939079] x13: 0000000000001080 x12: 0000000000000020
      [   78.944381] x11: 0000000000001080 x10: 00000000ffff0002
      [   78.949683] x9 : ffff80006814c248 x8 : 00000000ffff0000
      [   78.954985] x7 : ffff80006814c318 x6 : ffff80006c8de786
      [   78.960286] x5 : 0000000000000f80 x4 : ffff80006c8de000
      [   78.965588] x3 : 0000000000000000 x2 : 0000000000001086
      [   78.970889] x1 : ffff7e0001b74e02 x0 : 0000000000000000
      [   78.976192] Process openssl (pid: 6007, stack limit = 0x00000000291367f9)
      [   78.982968] Call trace:
      [   78.985406]  scatterwalk_copychunks+0x164/0x1c8
      [   78.989927]  skcipher_walk_next+0x28c/0x448
      [   78.994099]  skcipher_walk_done+0xfc/0x258
      [   78.998187]  gcm_encrypt+0x434/0x4c0
      [   79.001758]  tls_push_record+0x354/0xa58 [tls]
      [   79.006194]  bpf_exec_tx_verdict+0x1e4/0x3e8 [tls]
      [   79.010978]  tls_sw_sendmsg+0x650/0x780 [tls]
      [   79.015326]  inet_sendmsg+0x2c/0xf8
      [   79.018806]  sock_sendmsg+0x18/0x30
      [   79.022284]  __sys_sendto+0x104/0x138
      [   79.025935]  __arm64_sys_sendto+0x24/0x30
      [   79.029936]  el0_svc_common+0x60/0xe8
      [   79.033588]  el0_svc_handler+0x2c/0x80
      [   79.037327]  el0_svc+0x8/0xc
      [   79.040200] Code: 6b01005f 54fff788 940169b1 f9000320 (b9400801)
      [   79.046283] ---[ end trace 74db007d069c1cf7 ]---
      
      Fixes: d829e9c4 ("tls: convert to generic sk_msg interface")
      Signed-off-by: NVakul Garg <vakul.garg@nxp.com>
      Acked-by: NJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5c1e7e94
  13. 21 12月, 2018 6 次提交
  14. 20 12月, 2018 8 次提交
    • D
      neighbor: Use nda_policy for validating attributes in adds and dump requests · a9cd3439
      David Ahern 提交于
      Add NDA_PROTOCOL to nda_policy and use the policy for attribute parsing and
      validation for adding neighbors and in dump requests. Remove the now duplicate
      checks on nla_len.
      Signed-off-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a9cd3439
    • D
      neighbor: NTF_PROXY is a valid ndm_flag for a dump request · c0fde870
      David Ahern 提交于
      When dumping proxy entries the dump request has NTF_PROXY set in
      ndm_flags. strict mode checking needs to be updated to allow this
      flag.
      
      Fixes: 51183d23 ("net/neighbor: Update neigh_dump_info for strict data checking")
      Signed-off-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c0fde870
    • D
      neighbor: Initialize protocol when new pneigh_entry are created · 754d5da6
      David Ahern 提交于
      pneigh_lookup uses kmalloc versus kzalloc when new entries are allocated.
      Given that the newly added protocol field needs to be initialized.
      
      Fixes: df9b0e30 ("neighbor: Add protocol attribute")
      Signed-off-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      754d5da6
    • L
      gro_cell: add napi_disable in gro_cells_destroy · 8e1da73a
      Lorenzo Bianconi 提交于
      Add napi_disable routine in gro_cells_destroy since starting from
      commit c42858ea ("gro_cells: remove spinlock protecting receive
      queues") gro_cell_poll and gro_cells_destroy can run concurrently on
      napi_skbs list producing a kernel Oops if the tunnel interface is
      removed while gro_cell_poll is running. The following Oops has been
      triggered removing a vxlan device while the interface is receiving
      traffic
      
      [ 5628.948853] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
      [ 5628.949981] PGD 0 P4D 0
      [ 5628.950308] Oops: 0002 [#1] SMP PTI
      [ 5628.950748] CPU: 0 PID: 9 Comm: ksoftirqd/0 Not tainted 4.20.0-rc6+ #41
      [ 5628.952940] RIP: 0010:gro_cell_poll+0x49/0x80
      [ 5628.955615] RSP: 0018:ffffc9000004fdd8 EFLAGS: 00010202
      [ 5628.956250] RAX: 0000000000000000 RBX: ffffe8ffffc08150 RCX: 0000000000000000
      [ 5628.957102] RDX: 0000000000000000 RSI: ffff88802356bf00 RDI: ffffe8ffffc08150
      [ 5628.957940] RBP: 0000000000000026 R08: 0000000000000000 R09: 0000000000000000
      [ 5628.958803] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000040
      [ 5628.959661] R13: ffffe8ffffc08100 R14: 0000000000000000 R15: 0000000000000040
      [ 5628.960682] FS:  0000000000000000(0000) GS:ffff88803ea00000(0000) knlGS:0000000000000000
      [ 5628.961616] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 5628.962359] CR2: 0000000000000008 CR3: 000000000221c000 CR4: 00000000000006b0
      [ 5628.963188] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [ 5628.964034] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [ 5628.964871] Call Trace:
      [ 5628.965179]  net_rx_action+0xf0/0x380
      [ 5628.965637]  __do_softirq+0xc7/0x431
      [ 5628.966510]  run_ksoftirqd+0x24/0x30
      [ 5628.966957]  smpboot_thread_fn+0xc5/0x160
      [ 5628.967436]  kthread+0x113/0x130
      [ 5628.968283]  ret_from_fork+0x3a/0x50
      [ 5628.968721] Modules linked in:
      [ 5628.969099] CR2: 0000000000000008
      [ 5628.969510] ---[ end trace 9d9dedc7181661fe ]---
      [ 5628.970073] RIP: 0010:gro_cell_poll+0x49/0x80
      [ 5628.972965] RSP: 0018:ffffc9000004fdd8 EFLAGS: 00010202
      [ 5628.973611] RAX: 0000000000000000 RBX: ffffe8ffffc08150 RCX: 0000000000000000
      [ 5628.974504] RDX: 0000000000000000 RSI: ffff88802356bf00 RDI: ffffe8ffffc08150
      [ 5628.975462] RBP: 0000000000000026 R08: 0000000000000000 R09: 0000000000000000
      [ 5628.976413] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000040
      [ 5628.977375] R13: ffffe8ffffc08100 R14: 0000000000000000 R15: 0000000000000040
      [ 5628.978296] FS:  0000000000000000(0000) GS:ffff88803ea00000(0000) knlGS:0000000000000000
      [ 5628.979327] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 5628.980044] CR2: 0000000000000008 CR3: 000000000221c000 CR4: 00000000000006b0
      [ 5628.980929] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [ 5628.981736] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [ 5628.982409] Kernel panic - not syncing: Fatal exception in interrupt
      [ 5628.983307] Kernel Offset: disabled
      
      Fixes: c42858ea ("gro_cells: remove spinlock protecting receive queues")
      Signed-off-by: NLorenzo Bianconi <lorenzo.bianconi@redhat.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8e1da73a
    • R
      neighbour: register rtnl doit handler · 82cbb5c6
      Roopa Prabhu 提交于
      this patch registers neigh doit handler. The doit handler
      returns a neigh entry given dst and dev. This is similar
      to route and fdb doit (get) handlers. Also moves nda_policy
      declaration from rtnetlink.c to neighbour.c
      Signed-off-by: NRoopa Prabhu <roopa@cumulusnetworks.com>
      Reviewed-by: NDavid Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      82cbb5c6
    • F
      net: switch secpath to use skb extension infrastructure · 4165079b
      Florian Westphal 提交于
      Remove skb->sp and allocate secpath storage via extension
      infrastructure.  This also reduces sk_buff by 8 bytes on x86_64.
      
      Total size of allyesconfig kernel is reduced slightly, as there is
      less inlined code (one conditional atomic op instead of two on
      skb_clone).
      
      No differences in throughput in following ipsec performance tests:
      - transport mode with aes on 10GB link
      - tunnel mode between two network namespaces with aes and null cipher
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4165079b
    • F
      net: convert bridge_nf to use skb extension infrastructure · de8bda1d
      Florian Westphal 提交于
      This converts the bridge netfilter (calling iptables hooks from bridge)
      facility to use the extension infrastructure.
      
      The bridge_nf specific hooks in skb clone and free paths are removed, they
      have been replaced by the skb_ext hooks that do the same as the bridge nf
      allocations hooks did.
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      de8bda1d
    • F
      sk_buff: add skb extension infrastructure · df5042f4
      Florian Westphal 提交于
      This adds an optional extension infrastructure, with ispec (xfrm) and
      bridge netfilter as first users.
      objdiff shows no changes if kernel is built without xfrm and br_netfilter
      support.
      
      The third (planned future) user is Multipath TCP which is still
      out-of-tree.
      MPTCP needs to map logical mptcp sequence numbers to the tcp sequence
      numbers used by individual subflows.
      
      This DSS mapping is read/written from tcp option space on receive and
      written to tcp option space on transmitted tcp packets that are part of
      and MPTCP connection.
      
      Extending skb_shared_info or adding a private data field to skb fclones
      doesn't work for incoming skb, so a different DSS propagation method would
      be required for the receive side.
      
      mptcp has same requirements as secpath/bridge netfilter:
      
      1. extension memory is released when the sk_buff is free'd.
      2. data is shared after cloning an skb (clone inherits extension)
      3. adding extension to an skb will COW the extension buffer if needed.
      
      The "MPTCP upstreaming" effort adds SKB_EXT_MPTCP extension to store the
      mapping for tx and rx processing.
      
      Two new members are added to sk_buff:
      1. 'active_extensions' byte (filling a hole), telling which extensions
         are available for this skb.
         This has two purposes.
         a) avoids the need to initialize the pointer.
         b) allows to "delete" an extension by clearing its bit
         value in ->active_extensions.
      
         While it would be possible to store the active_extensions byte
         in the extension struct instead of sk_buff, there is one problem
         with this:
          When an extension has to be disabled, we can always clear the
          bit in skb->active_extensions.  But in case it would be stored in the
          extension buffer itself, we might have to COW it first, if
          we are dealing with a cloned skb.  On kmalloc failure we would
          be unable to turn an extension off.
      
      2. extension pointer, located at the end of the sk_buff.
         If the active_extensions byte is 0, the pointer is undefined,
         it is not initialized on skb allocation.
      
      This adds extra code to skb clone and free paths (to deal with
      refcount/free of extension area) but this replaces similar code that
      manages skb->nf_bridge and skb->sp structs in the followup patches of
      the series.
      
      It is possible to add support for extensions that are not preseved on
      clones/copies.
      
      To do this, it would be needed to define a bitmask of all extensions that
      need copy/cow semantics, and change __skb_ext_copy() to check
      ->active_extensions & SKB_EXT_PRESERVE_ON_CLONE, then just set
      ->active_extensions to 0 on the new clone.
      
      This isn't done here because all extensions that get added here
      need the copy/cow semantics.
      
      v2:
      Allocate entire extension space using kmem_cache.
      Upside is that this allows better tracking of used memory,
      downside is that we will allocate more space than strictly needed in
      most cases (its unlikely that all extensions are active/needed at same
      time for same skb).
      The allocated memory (except the small extension header) is not cleared,
      so no additonal overhead aside from memory usage.
      
      Avoid atomic_dec_and_test operation on skb_ext_put()
      by using similar trick as kfree_skbmem() does with fclone_ref:
      If recount is 1, there is no concurrent user and we can free right away.
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      df5042f4
  15. 19 12月, 2018 1 次提交
    • J
      bpf: sockmap, metadata support for reporting size of msg · 3bdbd022
      John Fastabend 提交于
      This adds metadata to sk_msg_md for BPF programs to read the sk_msg
      size.
      
      When the SK_MSG program is running under an application that is using
      sendfile the data is not copied into sk_msg buffers by default. Rather
      the BPF program uses sk_msg_pull_data to read the bytes in. This
      avoids doing the costly memcopy instructions when they are not in
      fact needed. However, if we don't know the size of the sk_msg we
      have to guess if needed bytes are available by doing a pull request
      which may fail. By including the size of the sk_msg BPF programs can
      check the size before issuing sk_msg_pull_data requests.
      
      Additionally, the same applies for sendmsg calls when the application
      provides multiple iovs. Here the BPF program needs to pull in data
      to update data pointers but its not clear where the data ends without
      a size parameter. In many cases "guessing" is not easy to do
      and results in multiple calls to pull and without bounded loops
      everything gets fairly tricky.
      
      Clean this up by including a u32 size field. Note, all writes into
      sk_msg_md are rejected already from sk_msg_is_valid_access so nothing
      additional is needed there.
      Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      3bdbd022
  16. 17 12月, 2018 2 次提交
  17. 16 12月, 2018 1 次提交