1. 14 Jul 2018 (1 commit)
  2. 13 Jul 2018 (1 commit)
    • net: Don't copy pfmemalloc flag in __copy_skb_header() · 8b700862
      Committed by Stefano Brivio
      The pfmemalloc flag indicates that the skb was allocated from
      the PFMEMALLOC reserves, and the flag is currently copied on skb
      copy and clone.
      
      However, an skb copied from an skb flagged with pfmemalloc
      wasn't necessarily allocated from PFMEMALLOC reserves, and on
      the other hand an skb allocated that way might be copied from an
      skb that wasn't.
      
      So we should not copy the flag on skb copy; instead, decide
      whether to allow an skb to be associated with sockets unrelated
      to page reclaim based only on how it was allocated.
      
      Move the pfmemalloc flag before headers_start[0] using an
      existing 1-bit hole, so that __copy_skb_header() doesn't copy
      it.
      
      When cloning, we'll now take care of this flag explicitly,
      contrary to the warning comment of __skb_clone().
      
      While at it, restore the newline usage introduced by commit
      b1937227 ("net: reorganize sk_buff for faster
      __copy_skb_header()") to visually separate bytes used in
      bitfields after headers_start[0], which was lost after commit
      a9e419dc ("netfilter: merge ctinfo into nfct pointer storage
      area"), and describe the pfmemalloc flag in the kernel-doc
      structure comment.
      
      This doesn't change the size of sk_buff or cacheline boundaries,
      but consolidates the 15-bit hole before tc_index into a 2-byte
      hole before csum, which can now be filled more easily.
      Reported-by: Patrick Talbert <ptalbert@redhat.com>
      Fixes: c93bdd0e ("netvm: allow skb allocation to use PFMEMALLOC reserves")
      Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8b700862
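      A minimal sketch of the mechanism this commit relies on (kernel-style
      C; the copy-range idiom is __copy_skb_header()'s own, the comments are
      ours): fields placed before headers_start fall outside the bulk
      memcpy, so the clone path has to carry pfmemalloc over by hand.

          /* __copy_skb_header() bulk-copies only [headers_start, headers_end) */
          memcpy(&new->headers_start, &old->headers_start,
                 offsetof(struct sk_buff, headers_end) -
                 offsetof(struct sk_buff, headers_start));
          /* pfmemalloc now sits before headers_start: not copied above */

          /* __skb_clone() (sketch): propagate the flag explicitly */
          n->pfmemalloc = skb->pfmemalloc;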
  3. 30 Jun 2018 (1 commit)
    • net: cleanup gfp mask in alloc_skb_with_frags · d14b56f5
      Committed by Michal Hocko
      alloc_skb_with_frags uses __GFP_NORETRY for non-sleeping allocations,
      where it is a no-op and merely confusing.

      __GFP_NORETRY was added by ed98df33 ("net: use __GFP_NORETRY for
      high order allocations") to avoid invoking the OOM killer. Yet this
      was not enough, because fb05e7a8 ("net: don't wait for order-3 page
      allocation") wanted to avoid excessive reclaim for non-costly orders,
      so it made the allocation fully NOWAIT while leaving __GFP_NORETRY in
      place, which is now redundant.

      Drop the pointless __GFP_NORETRY, because this function is used as a
      copy-and-paste source for other places.
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d14b56f5
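      A before/after sketch of the high-order attempt in
      alloc_skb_with_frags (flags as in the kernel of that era; not the
      verbatim diff). Clearing __GFP_DIRECT_RECLAIM already makes the
      attempt non-sleeping, so __GFP_NORETRY added nothing:

          /* before: __GFP_NORETRY is a no-op on a NOWAIT allocation */
          page = alloc_pages((gfp_mask & ~__GFP_DIRECT_RECLAIM) |
                             __GFP_COMP | __GFP_NOWARN | __GFP_NORETRY,
                             order);

          /* after: drop the redundant flag */
          page = alloc_pages((gfp_mask & ~__GFP_DIRECT_RECLAIM) |
                             __GFP_COMP | __GFP_NOWARN,
                             order);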
  4. 01 May 2018 (1 commit)
  5. 27 Apr 2018 (1 commit)
    • udp: add udp gso · ee80d1eb
      Committed by Willem de Bruijn
      Implement generic segmentation offload support for udp datagrams. A
      follow-up patch adds support to the protocol stack to generate such
      packets.
      
      UDP GSO is not UFO. UFO fragments a single large datagram. GSO splits
      a large payload into a number of discrete UDP datagrams.
      
      The implementation adds a GSO type, SKB_GSO_UDP_L4, to differentiate
      it from UFO (SKB_GSO_UDP).

      IPPROTO_UDPLITE is excluded, as that protocol has no GSO handler
      registered.
      
      [ Export __udp_gso_segment for ipv6. -DaveM ]
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ee80d1eb
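      A simplified sketch of the segmentation idea (the helper name and the
      omitted error handling are ours; skb_segment(), udp_hdr() and
      skb_transport_offset() are real kernel APIs): each output skb is a
      self-contained datagram with its own UDP header and length, unlike
      UFO's IP fragments of a single datagram.

          static struct sk_buff *udp_gso_segment_sketch(struct sk_buff *gso_skb,
                                                        netdev_features_t features)
          {
                  struct sk_buff *segs, *seg;

                  /* skb_segment() cuts the payload at gso_size boundaries */
                  segs = skb_segment(gso_skb, features);

                  for (seg = segs; seg; seg = seg->next) {
                          struct udphdr *uh = udp_hdr(seg);

                          /* every segment carries a full UDP header */
                          uh->len = htons(seg->len - skb_transport_offset(seg));
                          /* checksum fixup elided */
                  }
                  return segs;
          }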
  6. 20 Apr 2018 (1 commit)
    • net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are friends · 88078d98
      Committed by Eric Dumazet
      After working on IP defragmentation lately, I found that some large
      packets defeat the CHECKSUM_COMPLETE optimization because the NIC adds
      zero padding to the last (small) fragment.
      
      While removing the padding with pskb_trim_rcsum(), we set skb->ip_summed
      to CHECKSUM_NONE, forcing a full csum validation, even if all prior
      fragments had CHECKSUM_COMPLETE set.
      
      We can instead compute the checksum of the part we are trimming,
      usually smaller than the part we keep.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      88078d98
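      A sketch of the resulting logic (close to the slow path this commit
      introduces; comments ours):

          if (skb->ip_summed == CHECKSUM_COMPLETE) {
                  int delta = skb->len - len;

                  /* subtract the checksum of the trimmed tail instead of
                   * falling back to CHECKSUM_NONE */
                  skb->csum = csum_sub(skb->csum,
                                       skb_checksum(skb, len, delta, 0));
          }
          return __pskb_trim(skb, len);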
  7. 08 Apr 2018 (1 commit)
  8. 31 Mar 2018 (1 commit)
    • net: Fix untag for vlan packets without ethernet header · ae474573
      Committed by Toshiaki Makita
      In some situations vlan packets do not have ethernet headers. One
      example is packets from tun devices. Users can specify a vlan protocol
      in the tun_pi field instead of an IP protocol, and skb_vlan_untag()
      attempts to untag such packets.

      skb_vlan_untag() (more precisely, skb_reorder_vlan_header() called by
      it) however did not expect packets without ethernet headers, so in
      such a case the size argument for memmove() underflowed and triggered
      a crash.
      
      ====
      BUG: unable to handle kernel paging request at ffff8801cccb8000
      IP: __memmove+0x24/0x1a0 arch/x86/lib/memmove_64.S:43
      PGD 9cee067 P4D 9cee067 PUD 1d9401063 PMD 1cccb7063 PTE 2810100028101
      Oops: 000b [#1] SMP KASAN
      Dumping ftrace buffer:
         (ftrace buffer empty)
      Modules linked in:
      CPU: 1 PID: 17663 Comm: syz-executor2 Not tainted 4.16.0-rc7+ #368
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      RIP: 0010:__memmove+0x24/0x1a0 arch/x86/lib/memmove_64.S:43
      RSP: 0018:ffff8801cc046e28 EFLAGS: 00010287
      RAX: ffff8801ccc244c4 RBX: fffffffffffffffe RCX: fffffffffff6c4c2
      RDX: fffffffffffffffe RSI: ffff8801cccb7ffc RDI: ffff8801cccb8000
      RBP: ffff8801cc046e48 R08: ffff8801ccc244be R09: ffffed0039984899
      R10: 0000000000000001 R11: ffffed0039984898 R12: ffff8801ccc244c4
      R13: ffff8801ccc244c0 R14: ffff8801d96b7c06 R15: ffff8801d96b7b40
      FS:  00007febd562d700(0000) GS:ffff8801db300000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: ffff8801cccb8000 CR3: 00000001ccb2f006 CR4: 00000000001606e0
      DR0: 0000000020000000 DR1: 0000000020000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
      Call Trace:
       memmove include/linux/string.h:360 [inline]
       skb_reorder_vlan_header net/core/skbuff.c:5031 [inline]
       skb_vlan_untag+0x470/0xc40 net/core/skbuff.c:5061
       __netif_receive_skb_core+0x119c/0x3460 net/core/dev.c:4460
       __netif_receive_skb+0x2c/0x1b0 net/core/dev.c:4627
       netif_receive_skb_internal+0x10b/0x670 net/core/dev.c:4701
       netif_receive_skb+0xae/0x390 net/core/dev.c:4725
       tun_rx_batched.isra.50+0x5ee/0x870 drivers/net/tun.c:1555
       tun_get_user+0x299e/0x3c20 drivers/net/tun.c:1962
       tun_chr_write_iter+0xb9/0x160 drivers/net/tun.c:1990
       call_write_iter include/linux/fs.h:1782 [inline]
       new_sync_write fs/read_write.c:469 [inline]
       __vfs_write+0x684/0x970 fs/read_write.c:482
       vfs_write+0x189/0x510 fs/read_write.c:544
       SYSC_write fs/read_write.c:589 [inline]
       SyS_write+0xef/0x220 fs/read_write.c:581
       do_syscall_64+0x281/0x940 arch/x86/entry/common.c:287
       entry_SYSCALL_64_after_hwframe+0x42/0xb7
      RIP: 0033:0x454879
      RSP: 002b:00007febd562cc68 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
      RAX: ffffffffffffffda RBX: 00007febd562d6d4 RCX: 0000000000454879
      RDX: 0000000000000157 RSI: 0000000020000180 RDI: 0000000000000014
      RBP: 000000000072bea0 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000246 R12: 00000000ffffffff
      R13: 00000000000006b0 R14: 00000000006fc120 R15: 0000000000000000
      Code: 90 90 90 90 90 90 90 48 89 f8 48 83 fa 20 0f 82 03 01 00 00 48 39 fe 7d 0f 49 89 f0 49 01 d0 49 39 f8 0f 8f 9f 00 00 00 48 89 d1 <f3> a4 c3 48 81 fa a8 02 00 00 72 05 40 38 fe 74 3b 48 83 ea 20
      RIP: __memmove+0x24/0x1a0 arch/x86/lib/memmove_64.S:43 RSP: ffff8801cc046e28
      CR2: ffff8801cccb8000
      ====
      
      We don't need to move headers for packets which have no headers
      preceding the vlan header, so skip the memmove() in that case.
      
      Fixes: 4bbb3e0e ("net: Fix vlan untag for bridge and vlan_dev with reorder_hdr off")
      Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ae474573
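      An illustrative sketch of the added guard (not the verbatim diff;
      ETH_TLEN, VLAN_HLEN and VLAN_ETH_HLEN are the usual if_ether.h /
      if_vlan.h constants):

          int mac_len = skb->data - skb_mac_header(skb);

          /* skip the shift when no ethernet header precedes the vlan tag,
           * e.g. tun packets whose tun_pi proto is a vlan protocol */
          if (likely(mac_len > VLAN_HLEN + ETH_TLEN))
                  memmove(skb->data - ETH_HLEN, skb->data - VLAN_ETH_HLEN,
                          mac_len - VLAN_HLEN - ETH_TLEN);
          skb->mac_header += VLAN_HLEN;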
  9. 26 Mar 2018 (1 commit)
    • net: permit skb_segment on head_frag frag_list skb · 13acc94e
      Committed by Yonghong Song
      One of our in-house projects, bpf-based NAT, hits a kernel BUG_ON at
      function skb_segment(), line 3667. The bpf program attaches to
      clsact ingress, calls bpf_skb_change_proto to change protocol
      from ipv4 to ipv6 or from ipv6 to ipv4, and then calls bpf_redirect
      to send the changed packet out.
      
      3472 struct sk_buff *skb_segment(struct sk_buff *head_skb,
      3473                             netdev_features_t features)
      3474 {
      3475         struct sk_buff *segs = NULL;
      3476         struct sk_buff *tail = NULL;
      ...
      3665                 while (pos < offset + len) {
      3666                         if (i >= nfrags) {
      3667                                 BUG_ON(skb_headlen(list_skb));
      3668
      3669                                 i = 0;
      3670                                 nfrags = skb_shinfo(list_skb)->nr_frags;
      3671                                 frag = skb_shinfo(list_skb)->frags;
      3672                                 frag_skb = list_skb;
      ...
      
      call stack:
      ...
       #1 [ffff883ffef03558] __crash_kexec at ffffffff8110c525
       #2 [ffff883ffef03620] crash_kexec at ffffffff8110d5cc
       #3 [ffff883ffef03640] oops_end at ffffffff8101d7e7
       #4 [ffff883ffef03668] die at ffffffff8101deb2
       #5 [ffff883ffef03698] do_trap at ffffffff8101a700
       #6 [ffff883ffef036e8] do_error_trap at ffffffff8101abfe
       #7 [ffff883ffef037a0] do_invalid_op at ffffffff8101acd0
       #8 [ffff883ffef037b0] invalid_op at ffffffff81a00bab
          [exception RIP: skb_segment+3044]
          RIP: ffffffff817e4dd4  RSP: ffff883ffef03860  RFLAGS: 00010216
          RAX: 0000000000002bf6  RBX: ffff883feb7aaa00  RCX: 0000000000000011
          RDX: ffff883fb87910c0  RSI: 0000000000000011  RDI: ffff883feb7ab500
          RBP: ffff883ffef03928   R8: 0000000000002ce2   R9: 00000000000027da
          R10: 000001ea00000000  R11: 0000000000002d82  R12: ffff883f90a1ee80
          R13: ffff883fb8791120  R14: ffff883feb7abc00  R15: 0000000000002ce2
          ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
       #9 [ffff883ffef03930] tcp_gso_segment at ffffffff818713e7
      Signed-off-by: David S. Miller <davem@davemloft.net>
      13acc94e
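      A sketch of the relaxed check (close to the actual change, but
      hedged; comments ours): a frag_list skb whose linear head is itself a
      page fragment (head_frag) is folded in as one more frag instead of
      tripping the BUG_ON.

          BUG_ON(!list_skb->head_frag && skb_headlen(list_skb));

          i = 0;
          nfrags = skb_shinfo(list_skb)->nr_frags;
          frag = skb_shinfo(list_skb)->frags;
          frag_skb = list_skb;
          if (skb_headlen(list_skb)) {
                  /* make room to treat the head as frags[-1] */
                  i--;
                  frag--;
          }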
  10. 17 Mar 2018 (1 commit)
  11. 16 Mar 2018 (1 commit)
    • net: Fix vlan untag for bridge and vlan_dev with reorder_hdr off · 4bbb3e0e
      Committed by Toshiaki Makita
      When we have a bridge with vlan_filtering on and a vlan device on top of
      it, packets would be corrupted in skb_vlan_untag() called from
      br_dev_xmit().
      
      The problem sits in skb_reorder_vlan_header(), used by skb_vlan_untag(),
      which makes use of skb->mac_len. In this function mac_len is meant for
      handling the rx path with vlan devices with reorder_header disabled,
      but in the tx path mac_len is typically 0 and cannot be used, which is
      the problem in this case.

      The current code actually does not even handle the rx path
      (skb_vlan_untag() called from __netif_receive_skb_core()) properly
      with reorder_header off.
      
      In rx path single tag case, it works as follows:
      
      - Before skb_reorder_vlan_header()
      
       mac_header                                data
         v                                        v
         +-------------------+-------------+------+----
         |        ETH        |    VLAN     | ETH  |
         |       ADDRS       | TPID | TCI  | TYPE |
         +-------------------+-------------+------+----
         <-------- mac_len --------->
                             <------------->
                              to be removed
      
      - After skb_reorder_vlan_header()
      
                  mac_header                     data
                       v                          v
                       +-------------------+------+----
                       |        ETH        | ETH  |
                       |       ADDRS       | TYPE |
                       +-------------------+------+----
                       <-------- mac_len --------->
      
      This is ok, but in rx double tag case, it corrupts packets:
      
      - Before skb_reorder_vlan_header()
      
       mac_header                                              data
         v                                                      v
         +-------------------+-------------+-------------+------+----
         |        ETH        |    VLAN     |    VLAN     | ETH  |
         |       ADDRS       | TPID | TCI  | TPID | TCI  | TYPE |
         +-------------------+-------------+-------------+------+----
         <--------------- mac_len ---------------->
                                           <------------->
                                          should be removed
                             <--------------------------->
                               actually will be removed
      
      - After skb_reorder_vlan_header()
      
                  mac_header                                   data
                       v                                        v
                                     +-------------------+------+----
                                     |        ETH        | ETH  |
                                     |       ADDRS       | TYPE |
                                     +-------------------+------+----
                       <--------------- mac_len ---------------->
      
      So both vlan tags are removed while only the inner one should be, and
      mac_header (and mac_len) are broken.
      
      skb_vlan_untag() is meant for removing the vlan header at (skb->data - 2),
      so use skb->data and skb->mac_header to calculate the right offset.
      Reported-by: Brandon Carpenter <brandon.carpenter@cypherpath.com>
      Fixes: a6e18ff1 ("vlan: Fix untag operations of stacked vlans with REORDER_HEADER off")
      Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4bbb3e0e
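      A before/after sketch of the offset fix in skb_reorder_vlan_header()
      (illustrative, not the verbatim diff):

          /* before: source offset depends on mac_len, which is 0 on tx */
          memmove(skb->data - ETH_HLEN, skb->data - skb->mac_len - VLAN_HLEN,
                  2 * ETH_ALEN);

          /* after: the tag to strip always sits just below skb->data */
          memmove(skb->data - ETH_HLEN, skb->data - VLAN_ETH_HLEN,
                  2 * ETH_ALEN);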
  12. 10 Mar 2018 (1 commit)
  13. 05 Mar 2018 (2 commits)
  14. 27 Feb 2018 (1 commit)
  15. 17 Feb 2018 (1 commit)
  16. 09 Feb 2018 (1 commit)
    • net: Whitelist the skbuff_head_cache "cb" field · 79a8a642
      Committed by Kees Cook
      Most callers of put_cmsg() use a "sizeof(foo)" for the length argument.
      Within put_cmsg(), a copy_to_user() call is made with a dynamic size, as a
      result of the cmsg header calculations. This means that hardened usercopy
      will examine the copy, even though it was technically a fixed size and
      should be implicitly whitelisted. All the put_cmsg() calls being built
      from values in skbuff_head_cache come out of the protocol-defined
      "cb" field, so whitelist this field entirely instead of creating per-use
      bounce buffers, which raise performance concerns.
      
      Original report was:
      
      Bad or missing usercopy whitelist? Kernel memory exposure attempt detected from SLAB object 'skbuff_head_cache' (offset 64, size 16)!
      WARNING: CPU: 0 PID: 3663 at mm/usercopy.c:81 usercopy_warn+0xdb/0x100 mm/usercopy.c:76
      ...
       __check_heap_object+0x89/0xc0 mm/slab.c:4426
       check_heap_object mm/usercopy.c:236 [inline]
       __check_object_size+0x272/0x530 mm/usercopy.c:259
       check_object_size include/linux/thread_info.h:112 [inline]
       check_copy_size include/linux/thread_info.h:143 [inline]
       copy_to_user include/linux/uaccess.h:154 [inline]
       put_cmsg+0x233/0x3f0 net/core/scm.c:242
       sock_recv_errqueue+0x200/0x3e0 net/core/sock.c:2913
       packet_recvmsg+0xb2e/0x17a0 net/packet/af_packet.c:3296
       sock_recvmsg_nosec net/socket.c:803 [inline]
       sock_recvmsg+0xc9/0x110 net/socket.c:810
       ___sys_recvmsg+0x2a4/0x640 net/socket.c:2179
       __sys_recvmmsg+0x2a9/0xaf0 net/socket.c:2287
       SYSC_recvmmsg net/socket.c:2368 [inline]
       SyS_recvmmsg+0xc4/0x160 net/socket.c:2352
       entry_SYSCALL_64_fastpath+0x29/0xa0
      
      Reported-by: syzbot+e2d6cfb305e9f3911dea@syzkaller.appspotmail.com
      Fixes: 6d07d1cd ("usercopy: Restrict non-usercopy caches to size 0")
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      79a8a642
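      A sketch of the whitelisting call (kmem_cache_create_usercopy() is the
      real slab API taking a useroffset/usersize pair; the surrounding flags
      follow the existing cache setup):

          skbuff_head_cache = kmem_cache_create_usercopy("skbuff_head_cache",
                                          sizeof(struct sk_buff),
                                          0,
                                          SLAB_HWCACHE_ALIGN|SLAB_PANIC,
                                          offsetof(struct sk_buff, cb),
                                          sizeof_field(struct sk_buff, cb),
                                          NULL);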
  17. 01 Feb 2018 (1 commit)
  18. 29 Dec 2017 (1 commit)
    • skbuff: in skb_copy_ubufs unclone before releasing zerocopy · f72c4ac6
      Committed by Willem de Bruijn
      skb_copy_ubufs must unclone before it is safe to modify its
      skb_shared_info with skb_zcopy_clear.
      
      Commit b90ddd56 ("skbuff: skb_copy_ubufs must release uarg even
      without user frags") ensures that all skbs release their zerocopy
      state, even those without frags.
      
      But I forgot an edge case where such an skb arrives that is cloned.
      
      The stack does not build such packets. Vhost/tun skbs have their
      frags orphaned before cloning. TCP skbs only attach zerocopy state
      when a frag is added.
      
      But if TCP packets can be trimmed or linearized, this might occur.
      Tracing the code I found no instance so far (e.g., skb_linearize
      ends up calling skb_zcopy_clear if !skb->data_len).
      
      Still, it is non-obvious that no path exists. And it is fragile to
      rely on this.
      
      Fixes: b90ddd56 ("skbuff: skb_copy_ubufs must release uarg even without user frags")
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f72c4ac6
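      A sketch of the ordering this enforces in skb_copy_ubufs() (close to
      the actual change; comments ours):

          /* get a private skb_shared_info before touching zerocopy state */
          if (skb_unclone(skb, gfp_mask))
                  return -EINVAL;

          /* ... replace user frags with private copies ... */

          skb_zcopy_clear(skb, false);    /* now safe to release the uarg */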
  19. 28 Dec 2017 (1 commit)
    • skbuff: in skb_segment, call zerocopy functions once per nskb · bf5c25d6
      Committed by Willem de Bruijn
      This is a net-next follow-up to commit 268b7906 ("skbuff: orphan
      frags before zerocopy clone"), which fixed a bug in net, but added a
      call to skb_zerocopy_clone at each frag to do so.
      
      When segmenting skbs with user frags, either the user frags must be
      replaced with private copies and uarg released, or the uarg must have
      its refcount increased for each new skb.
      
      skb_orphan_frags does the first, except for cases that can handle
      reference counting. skb_zerocopy_clone then does the second.
      
      Call these once per nskb, instead of once per frag.
      
      That is, in the common case. With a frag list, also refresh when the
      origin skb (frag_skb) changes.
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      bf5c25d6
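      The per-nskb call pattern, sketched (skb_orphan_frags() and
      skb_zerocopy_clone() are the real helpers):

          /* once per output segment (nskb), not once per frag */
          if (skb_orphan_frags(frag_skb, GFP_ATOMIC) ||
              skb_zerocopy_clone(nskb, frag_skb, GFP_ATOMIC))
                  goto err;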
  20. 22 Dec 2017 (2 commits)
  21. 16 Dec 2017 (1 commit)
  22. 16 Nov 2017 (2 commits)
    • mm: remove __GFP_COLD · 453f85d4
      Committed by Mel Gorman
      As the page free path makes no distinction between cache hot and cold
      pages, there is no real useful ordering of pages in the free list that
      allocation requests can take advantage of.  Judging from the users of
      __GFP_COLD, it is likely that a number of them are the result of copying
      other sites instead of actually measuring the impact.  Remove the
      __GFP_COLD parameter which simplifies a number of paths in the page
      allocator.
      
      This is potentially controversial but bear in mind that the size of the
      per-cpu pagelists versus modern cache sizes means that the whole per-cpu
      list can often fit in the L3 cache.  Hence, there is only a potential
      benefit for microbenchmarks that alloc/free pages in a tight loop.  It's
      even worse when THP is taken into account which has little or no chance
      of getting a cache-hot page as the per-cpu list is bypassed and the
      zeroing of multiple pages will thrash the cache anyway.
      
      The truncate microbenchmarks are not shown as this patch affects the
      allocation path and not the free path.  A page fault microbenchmark was
      tested but it showed no significant difference, which is not surprising
      given that the __GFP_COLD branches are a minuscule percentage of the
      fault path.
      
      Link: http://lkml.kernel.org/r/20171018075952.10627-9-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      453f85d4
    • kmemcheck: remove annotations · 49502766
      Committed by Levin, Alexander (Sasha Levin)
      Patch series "kmemcheck: kill kmemcheck", v2.
      
      As discussed at LSF/MM, kill kmemcheck.
      
      KASan is a replacement that is able to work without the limitations of
      kmemcheck (single CPU, slow).  KASan is already upstream.
      
      We are also not aware of any users of kmemcheck (or users who don't
      consider KASan as a suitable replacement).
      
      The only objection was that, since KASAN wasn't supported by all GCC
      versions provided by distros at that time, we should hold off for 2
      years and try again.
      
      Now that 2 years have passed, and all distros provide gcc that supports
      KASAN, kill kmemcheck again for the very same reasons.
      
      This patch (of 4):
      
      Remove kmemcheck annotations, and calls to kmemcheck from the kernel.
      
      [alexander.levin@verizon.com: correctly remove kmemcheck call from dma_map_sg_attrs]
        Link: http://lkml.kernel.org/r/20171012192151.26531-1-alexander.levin@verizon.com
      Link: http://lkml.kernel.org/r/20171007030159.22241-2-alexander.levin@verizon.com
      Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Tim Hansen <devtimhansen@gmail.com>
      Cc: Vegard Nossum <vegardno@ifi.uio.no>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      49502766
  23. 04 Nov 2017 (1 commit)
  24. 22 Oct 2017 (1 commit)
    • sock: correct sk_wmem_queued accounting on efault in tcp zerocopy · 54d43117
      Committed by Willem de Bruijn
      Syzkaller hits WARN_ON(sk->sk_wmem_queued) in sk_stream_kill_queues
      after triggering an EFAULT in __zerocopy_sg_from_iter.
      
      On this error, skb_zerocopy_stream_iter resets the skb to its state
      before the operation with __pskb_trim. It cannot kfree_skb like
      datagram callers, as the skb may have data from a previous send call.
      
      __pskb_trim calls skb_condense for unowned skbs, which adjusts their
      truesize. These tcp skbuffs are owned and their truesize must add up
      to sk_wmem_queued. But they take the unowned path, because their
      skb->sk is NULL until tcp_transmit_skb.
      
      Temporarily set skb->sk when calling __pskb_trim to signal that the
      skbuffs are owned and avoid the skb_condense path.
      
      Fixes: 52267790 ("sock: add MSG_ZEROCOPY")
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      54d43117
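      A sketch of the workaround (close to the actual change; comments
      ours):

          struct sock *save_sk = skb->sk;

          /* Streams do not free the skb on error; reset to prior state. */
          skb->sk = sk;           /* mark owned: avoid skb_condense() */
          ___pskb_trim(skb, orig_len);
          skb->sk = save_sk;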
  25. 15 Oct 2017 (1 commit)
  26. 11 Oct 2017 (1 commit)
  27. 05 Oct 2017 (1 commit)
  28. 27 Sep 2017 (1 commit)
    • bpf: add meta pointer for direct access · de8f3a83
      Committed by Daniel Borkmann
      This work enables generic transfer of metadata from XDP into skb. The
      basic idea is that we can make use of the fact that the resulting skb
      must be linear and already comes with a larger headroom for supporting
      bpf_xdp_adjust_head(), which mangles xdp->data. Here, we base our work
      on a similar principle and introduce a small helper bpf_xdp_adjust_meta()
      for adjusting a new pointer called xdp->data_meta. Thus, the packet has
      a flexible and programmable room for meta data, followed by the actual
      packet data. struct xdp_buff is therefore laid out such that we first
      point to data_hard_start, then data_meta directly prepended to data,
      followed by data_end marking the end of the packet. bpf_xdp_adjust_head()
      takes into
      account whether we have meta data already prepended and if so, memmove()s
      this along with the given offset provided there's enough room.
      
      xdp->data_meta is optional and programs are not required to use it. The
      rationale is that when we process the packet in XDP (e.g. as DoS filter),
      we can push further meta data along with it for the XDP_PASS case, and
      give the guarantee that a clsact ingress BPF program on the same device
      can pick this up for further post-processing. Since we work with skb
      there, we can also set skb->mark, skb->priority or other skb meta data
      out of BPF, thus having this scratch space generic and programmable
      allows for more flexibility than defining a direct 1:1 transfer of
      potentially new XDP members into skb (it's also more efficient as we
      don't need to initialize/handle each of such new members). The facility
      also works together with GRO aggregation. The scratch space at the head
      of the packet can be a multiple of 4 bytes, up to 32 bytes. Drivers not
      yet supporting xdp->data_meta can simply be set up with xdp->data_meta
      equal to xdp->data + 1, as bpf_xdp_adjust_meta() will detect this and bail out,
      such that the subsequent match against xdp->data for later access is
      guaranteed to fail.
      
      The verifier treats xdp->data_meta/xdp->data the same way as we treat
      xdp->data/xdp->data_end pointer comparisons. The requirement for doing
      the compare against xdp->data is that it hasn't been modified from its
      original address we got from ctx access. It may have a range marking
      already from prior successful xdp->data/xdp->data_end pointer comparisons
      though.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      de8f3a83
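      A sketch of a consumer of this facility (BPF C, assuming libbpf's
      bpf_helpers.h; the meta struct and mark value are illustrative):

          #include <linux/bpf.h>
          #include <bpf/bpf_helpers.h>

          struct meta { __u32 mark; };

          SEC("xdp")
          int xdp_set_meta(struct xdp_md *ctx)
          {
                  /* grow the meta area in front of the packet data */
                  if (bpf_xdp_adjust_meta(ctx, -(int)sizeof(struct meta)))
                          return XDP_PASS;

                  void *data = (void *)(long)ctx->data;
                  struct meta *m = (void *)(long)ctx->data_meta;

                  if ((void *)(m + 1) > data) /* verifier bounds check */
                          return XDP_PASS;

                  m->mark = 42;   /* a clsact BPF program can read this */
                  return XDP_PASS;
          }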
  29. 26 Sep 2017 (1 commit)
    • net: speed up skb_rbtree_purge() · 7c90584c
      Committed by Eric Dumazet
      As measured in my prior patch ("sch_netem: faster rb tree removal"),
      rbtree_postorder_for_each_entry_safe() is nice looking but much slower
      than using rb_next() directly, except when the tree is small enough
      to fit in CPU caches (then the cost is the same).

      Also note that there is no increase in text size:
      $ size net/core/skbuff.o.before net/core/skbuff.o
         text	   data	    bss	    dec	    hex	filename
        40711	   1298	      0	  42009	   a419	net/core/skbuff.o.before
        40711	   1298	      0	  42009	   a419	net/core/skbuff.o
      
      From: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7c90584c
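      The resulting loop, sketched (matches the shape of the new
      skb_rbtree_purge(); comments ours):

          struct rb_node *p = rb_first(root);

          while (p) {
                  struct sk_buff *skb = rb_entry(p, struct sk_buff, rbnode);

                  p = rb_next(p);         /* advance before erasing */
                  rb_erase(&skb->rbnode, root);
                  kfree_skb(skb);
          }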
  30. 08 Sep 2017 (1 commit)
  31. 02 Sep 2017 (2 commits)
  32. 24 Aug 2017 (1 commit)
  33. 17 Aug 2017 (1 commit)
  34. 10 Aug 2017 (1 commit)
  35. 04 Aug 2017 (2 commits)
    • sock: ulimit on MSG_ZEROCOPY pages · a91dbff5
      Committed by Willem de Bruijn
      Bound the number of pages that a user may pin.
      
      Follow the lead of perf tools, which maintain a per-user bound on
      locked memory pages; see commit 789f90fc ("perf_counter: per user
      mlock gift").
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a91dbff5
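      A sketch of the per-user bound (field and helper names follow the
      mmpin/user_struct scheme this commit adds, but treat them as
      illustrative):

          num_pg = (size >> PAGE_SHIFT) + 2;              /* worst case */
          max_pg = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
          user   = mmp->user ? : current_user();

          do {
                  old_pg = atomic_long_read(&user->locked_vm);
                  new_pg = old_pg + num_pg;
                  if (new_pg > max_pg)
                          return -ENOBUFS;                /* over the ulimit */
          } while (atomic_long_cmpxchg(&user->locked_vm, old_pg, new_pg) !=
                   old_pg);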
    • sock: MSG_ZEROCOPY notification coalescing · 4ab6c99d
      Committed by Willem de Bruijn
      In the simple case, each sendmsg() call generates data and eventually
      a zerocopy ready notification N, where N indicates the Nth successful
      invocation of sendmsg() with the MSG_ZEROCOPY flag on this socket.
      
      TCP and corked sockets can cause send() calls to append new data to an
      existing sk_buff and, thus, ubuf_info. In that case the notification
      must hold a range. Modify ubuf_info to store an inclusive range
      [N..N+m] and add skb_zerocopy_realloc() to optionally extend an
      existing range.
      
      Also coalesce notifications in this common case: if a notification
      [1, 1] is about to be queued while [0, 0] is the queue tail, just modify
      that tail entry to read [0, 1].
      
      Coalescing is limited to a few TSO frames worth of data to bound
      notification latency.
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4ab6c99d
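      A sketch of the tail-coalescing test (ee_info/ee_data holding the
      [lo, hi] range of a queued error report is the real encoding; the
      surrounding code, and the extra bound the real helper places on the
      merged range, are elided):

          u32 old_hi = serr->ee.ee_data;  /* queue tail holds [old_lo, old_hi] */

          if (lo == old_hi + 1) {         /* contiguous: extend in place */
                  serr->ee.ee_data = hi;
                  return true;            /* no new notification queued */
          }
          return false;                   /* gap: queue a fresh [lo, hi] */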