1. 16 Jan 2016, 1 commit
  2. 11 Jan 2016, 1 commit
    • bpf: add skb_postpush_rcsum and fix dev_forward_skb occasions · f8ffad69
      Committed by Daniel Borkmann
      Add a small helper skb_postpush_rcsum() and fix up redirect locations
      that need CHECKSUM_COMPLETE fixups on ingress. dev_forward_skb() expects
      a proper csum that also covers the Ethernet header; for example, since
      2c26d34b ("net/core: Handle csum for CHECKSUM_COMPLETE VXLAN forwarding"),
      we also do skb_postpull_rcsum() after pulling the Ethernet header off via
      eth_type_trans().
      
      When using eBPF in a netns setup, for example with vxlan in collect-metadata
      mode, I can trigger the following csum issue with an IPv6 setup:
      
        [  505.144065] dummy1: hw csum failure
        [...]
        [  505.144108] Call Trace:
        [  505.144112]  <IRQ>  [<ffffffff81372f08>] dump_stack+0x44/0x5c
        [  505.144134]  [<ffffffff81607cea>] netdev_rx_csum_fault+0x3a/0x40
        [  505.144142]  [<ffffffff815fee3f>] __skb_checksum_complete+0xcf/0xe0
        [  505.144149]  [<ffffffff816f0902>] nf_ip6_checksum+0xb2/0x120
        [  505.144161]  [<ffffffffa08c0e0e>] icmpv6_error+0x17e/0x328 [nf_conntrack_ipv6]
        [  505.144170]  [<ffffffffa0898eca>] ? ip6t_do_table+0x2fa/0x645 [ip6_tables]
        [  505.144177]  [<ffffffffa08c0725>] ? ipv6_get_l4proto+0x65/0xd0 [nf_conntrack_ipv6]
        [  505.144189]  [<ffffffffa06c9a12>] nf_conntrack_in+0xc2/0x5a0 [nf_conntrack]
        [  505.144196]  [<ffffffffa08c039c>] ipv6_conntrack_in+0x1c/0x20 [nf_conntrack_ipv6]
        [  505.144204]  [<ffffffff8164385d>] nf_iterate+0x5d/0x70
        [  505.144210]  [<ffffffff816438d6>] nf_hook_slow+0x66/0xc0
        [  505.144218]  [<ffffffff816bd302>] ipv6_rcv+0x3f2/0x4f0
        [  505.144225]  [<ffffffff816bca40>] ? ip6_make_skb+0x1b0/0x1b0
        [  505.144232]  [<ffffffff8160b77b>] __netif_receive_skb_core+0x36b/0x9a0
        [  505.144239]  [<ffffffff8160bdc8>] ? __netif_receive_skb+0x18/0x60
        [  505.144245]  [<ffffffff8160bdc8>] __netif_receive_skb+0x18/0x60
        [  505.144252]  [<ffffffff8160ccff>] process_backlog+0x9f/0x140
        [  505.144259]  [<ffffffff8160c4a5>] net_rx_action+0x145/0x320
        [...]
      
      What happens is that on ingress we push the Ethernet header back in, either
      from cls_bpf or right before skb_do_redirect(), but without updating the
      csum. The "hw csum failure" can be fixed by using the new skb_postpush_rcsum()
      helper for the dev_forward_skb() case to correct the csum diff again.
      
      Thanks to Hannes Frederic Sowa for the csum_partial() idea!
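      
      As a rough illustration of that idea, here is a minimal sketch of such a
      push-side counterpart to skb_postpull_rcsum(), based on the description
      above rather than on the exact hunk that was merged: fold the re-pushed
      bytes back into the CHECKSUM_COMPLETE value with csum_partial().
      
        #include <linux/skbuff.h>
        #include <net/checksum.h>
        
        /* Sketch: after pushing 'len' bytes starting at 'start' back onto the
         * packet (e.g. the Ethernet header before dev_forward_skb()), add their
         * contribution back into skb->csum so CHECKSUM_COMPLETE stays valid. */
        static inline void skb_postpush_rcsum(struct sk_buff *skb,
                                              const void *start, unsigned int len)
        {
                if (skb->ip_summed == CHECKSUM_COMPLETE)
                        skb->csum = csum_partial(start, len, skb->csum);
        }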
      
      Fixes: 3896d655 ("bpf: introduce bpf_clone_redirect() helper")
      Fixes: 27b29f63 ("bpf: add bpf_redirect() helper")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 16 Dec 2015, 2 commits
  4. 15 Dec 2015, 1 commit
  5. 07 Dec 2015, 1 commit
    • core: enable more fine-grained datagram reception control · ea3793ee
      Committed by Rainer Weikusat
      The __skb_recv_datagram routine in core/datagram.c provides a general
      skb reception facility intended to be used by protocol modules
      providing datagram sockets. It encompasses both the actual recvmsg code
      and a surrounding 'sleep until data is available' loop. This is
      inconvenient if a protocol module has to use additional locking in order
      to maintain some per-socket state the generic datagram socket code is
      unaware of (as the af_unix code does). The patch below moves the recvmsg
      code proper into a new __skb_try_recv_datagram routine which doesn't
      sleep, and renames wait_for_more_packets to
      __skb_wait_for_more_packets, both routines being exported interfaces. The
      original __skb_recv_datagram routine is reimplemented on top of these
      two functions such that its user-visible behaviour remains unchanged.
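      
      A hedged sketch of how the wrapper can be composed from the two new
      exports follows; the parameter lists are illustrative and may not match
      the merged code exactly.
      
        #include <linux/errno.h>
        #include <linux/skbuff.h>
        #include <net/sock.h>
        
        /* Illustrative composition: try a non-sleeping receive, and if nothing
         * is queued, wait for more packets and retry until data arrives, an
         * error occurs, or the receive timeout expires. */
        struct sk_buff *__skb_recv_datagram(struct sock *sk, unsigned int flags,
                                            int *peeked, int *off, int *err)
        {
                struct sk_buff *skb, *last;
                long timeo = sock_rcvtimeo(sk, flags & MSG_DONTWAIT);
        
                do {
                        skb = __skb_try_recv_datagram(sk, flags, peeked, off,
                                                      err, &last);
                        if (skb)
                                return skb;
                        if (*err != -EAGAIN)
                                break;
                } while (timeo &&
                         !__skb_wait_for_more_packets(sk, err, &timeo, last));
        
                return NULL;
        }
      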
      Signed-off-by: Rainer Weikusat <rweikusat@mobileactivedefense.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  6. 19 Nov 2015, 1 commit
  7. 07 Nov 2015, 1 commit
    • mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd · d0164adc
      Committed by Mel Gorman
      
      __GFP_WAIT has been used to identify atomic context in callers that hold
      spinlocks or are in interrupts.  They are expected to be high priority and
      have access to one of two watermarks lower than "min", which can be referred
      to as the "atomic reserve".  __GFP_HIGH users get access to the first
      lower watermark and can be called the "high priority reserve".
      
      Over time, callers had a requirement to not block when fallback options
      were available.  Some have abused __GFP_WAIT, leading to a situation where
      an optimistic allocation with a fallback option can access atomic
      reserves.
      
      This patch uses __GFP_ATOMIC to identify callers that are truly atomic,
      cannot sleep and have no alternative.  High priority users continue to use
      __GFP_HIGH.  __GFP_DIRECT_RECLAIM identifies callers that can sleep and
      are willing to enter direct reclaim.  __GFP_KSWAPD_RECLAIM identifies
      callers that want to wake kswapd for background reclaim.  __GFP_WAIT is
      redefined as a caller that is willing to enter direct reclaim and wake
      kswapd for background reclaim.
      
      This patch then converts a number of sites:
      
      o __GFP_ATOMIC is used by callers that are high priority and have memory
        pools for those requests. GFP_ATOMIC uses this flag.
      
      o Callers that have a limited mempool to guarantee forward progress clear
        __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall
        into this category where kswapd will still be woken but atomic reserves
        are not used as there is a one-entry mempool to guarantee progress.
      
      o Callers that are checking if they are non-blocking should use the
        helper gfpflags_allow_blocking() where possible. This is because
        checking for __GFP_WAIT as was done historically now can trigger false
        positives. Some exceptions like dm-crypt.c exist where the code intent
        is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to
        flag manipulations.
      
      o Callers that built their own GFP flags instead of starting with GFP_KERNEL
        and friends now also need to specify __GFP_KSWAPD_RECLAIM.
      
      The first key hazard to watch out for is callers that removed __GFP_WAIT
      and were depending on access to atomic reserves for inconspicuous reasons.
      In some cases it may be appropriate for them to use __GFP_HIGH.
      
      The second key hazard is callers that assembled their own combination of
      GFP flags instead of starting with something like GFP_KERNEL.  They may
      now wish to specify __GFP_KSWAPD_RECLAIM.  It's almost certainly harmless
      if it's missed in most cases as other activity will wake kswapd.
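      
      As a minimal sketch of the helper and the redefinition described above
      (only the flag semantics are taken from this description; the sketch_
      prefix merely avoids clashing with the in-tree declaration):
      
        #include <linux/gfp.h>
        
        /* Sketch: blocking is allowed iff the caller opted into direct reclaim. */
        static inline bool sketch_gfpflags_allow_blocking(const gfp_t gfp_flags)
        {
                return !!(gfp_flags & __GFP_DIRECT_RECLAIM);
        }
        
        /* Per the description above, the legacy __GFP_WAIT semantics amount to
         * "direct reclaim allowed and kswapd may be woken", i.e.
         *     __GFP_WAIT == (__GFP_DIRECT_RECLAIM | __GFP_KSWAPD_RECLAIM)
         */
      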
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vitaly Wool <vitalywool@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  8. 21 Oct 2015, 1 commit
  9. 30 Sep 2015, 1 commit
  10. 25 Sep 2015, 1 commit
    • skbuff: Fix skb checksum flag on skb pull · 6ae459bd
      Committed by Pravin B Shelar
      A VXLAN device can receive an skb with CHECKSUM_PARTIAL set. But the
      checksum offset could be in the outer header, which is pulled on receive.
      This results in a negative checksum offset for the skb. Such an skb can
      cause the assert failure in skb_checksum_help(). The following patch fixes
      the bug by setting the checksum to CHECKSUM_NONE while pulling the outer
      header.
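      
      A hedged sketch of the kind of check this describes (not necessarily the
      exact hunk that was merged): after the pull, a checksum start that now
      lies before skb->data is meaningless, so fall back to CHECKSUM_NONE.
      
        #include <linux/skbuff.h>
        
        /* Sketch: after pulling the outer header, a CHECKSUM_PARTIAL offset
         * that pointed into the pulled region is no longer valid; drop to
         * CHECKSUM_NONE instead of letting skb_checksum_help() hit its assert. */
        static inline void sketch_fix_csum_after_pull(struct sk_buff *skb)
        {
                if (skb->ip_summed == CHECKSUM_PARTIAL &&
                    skb_checksum_start_offset(skb) < 0)
                        skb->ip_summed = CHECKSUM_NONE;
        }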
      
      Following is the kernel panic message from an old kernel hitting the bug.
      
      ------------[ cut here ]------------
      kernel BUG at net/core/dev.c:1906!
      RIP: 0010:[<ffffffff81518034>] skb_checksum_help+0x144/0x150
      Call Trace:
      <IRQ>
      [<ffffffffa0164c28>] queue_userspace_packet+0x408/0x470 [openvswitch]
      [<ffffffffa016614d>] ovs_dp_upcall+0x5d/0x60 [openvswitch]
      [<ffffffffa0166236>] ovs_dp_process_packet_with_key+0xe6/0x100 [openvswitch]
      [<ffffffffa016629b>] ovs_dp_process_received_packet+0x4b/0x80 [openvswitch]
      [<ffffffffa016c51a>] ovs_vport_receive+0x2a/0x30 [openvswitch]
      [<ffffffffa0171383>] vxlan_rcv+0x53/0x60 [openvswitch]
      [<ffffffffa01734cb>] vxlan_udp_encap_recv+0x8b/0xf0 [openvswitch]
      [<ffffffff8157addc>] udp_queue_rcv_skb+0x2dc/0x3b0
      [<ffffffff8157b56f>] __udp4_lib_rcv+0x1cf/0x6c0
      [<ffffffff8157ba7a>] udp_rcv+0x1a/0x20
      [<ffffffff8154fdbd>] ip_local_deliver_finish+0xdd/0x280
      [<ffffffff81550128>] ip_local_deliver+0x88/0x90
      [<ffffffff8154fa7d>] ip_rcv_finish+0x10d/0x370
      [<ffffffff81550365>] ip_rcv+0x235/0x300
      [<ffffffff8151ba1d>] __netif_receive_skb+0x55d/0x620
      [<ffffffff8151c360>] netif_receive_skb+0x80/0x90
      [<ffffffff81459935>] virtnet_poll+0x555/0x6f0
      [<ffffffff8151cd04>] net_rx_action+0x134/0x290
      [<ffffffff810683d8>] __do_softirq+0xa8/0x210
      [<ffffffff8162fe6c>] call_softirq+0x1c/0x30
      [<ffffffff810161a5>] do_softirq+0x65/0xa0
      [<ffffffff810687be>] irq_exit+0x8e/0xb0
      [<ffffffff81630733>] do_IRQ+0x63/0xe0
      [<ffffffff81625f2e>] common_interrupt+0x6e/0x6e
      Reported-by: Anupam Chanda <achanda@vmware.com>
      Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
      Acked-by: Tom Herbert <tom@herbertland.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  11. 15 Sep 2015, 1 commit
  12. 02 Sep 2015, 6 commits
  13. 22 Aug 2015, 1 commit
    • mm: make page pfmemalloc check more robust · 2f064f34
      Committed by Michal Hocko
      Commit c48a11c7 ("netvm: propagate page->pfmemalloc to skb") added
      checks for page->pfmemalloc to __skb_fill_page_desc():
      
              if (page->pfmemalloc && !page->mapping)
                      skb->pfmemalloc = true;
      
      It assumes page->mapping == NULL implies that page->pfmemalloc can be
      trusted.  However, __delete_from_page_cache() can set page->mapping
      to NULL and leave page->index alone.  Because the two fields share a
      union, a non-zero page->index will be interpreted as page->pfmemalloc
      being true.
      
      So the assumption is invalid if the networking code can see such a page.
      And it seems it can.  We have encountered this with an NFS-over-loopback
      setup when such a page is attached to a new skb.  There is no copying
      going on in this case, so the page confuses __skb_fill_page_desc, which
      interprets the index as the pfmemalloc flag, and the network stack drops
      packets that have been allocated using the reserves unless they are to
      be queued on sockets handling the swapping.  That is what happens here,
      and it leads to hangs when the NFS client waits for a response from the
      server which has been dropped and thus never arrives.
      
      The struct page is already heavily packed, so rather than finding another
      hole to put it in, let's do a trick instead.  We can reuse the index
      field but define an impossible value for it (-1UL).  This is the page
      index, so it should never see a value that large.  Replace all direct
      users of page->pfmemalloc with page_is_pfmemalloc, which will hide this
      nastiness from unspoiled eyes.
      
      The information will obviously get lost if somebody wants to use
      page->index, but that was the case before as well, and the original code
      expected that the information be persisted somewhere else if it is
      really needed (e.g.  what SLAB and SLUB do).
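      
      The shape of the accessors described above, as a sketch (the merged
      helpers may differ in detail):
      
        #include <linux/mm_types.h>
        
        /* Sketch: encode pfmemalloc by storing an impossible page index
         * (-1UL) and hide the union trickery behind helpers. */
        static inline void sketch_set_page_pfmemalloc(struct page *page)
        {
                page->index = -1UL;
        }
        
        static inline void sketch_clear_page_pfmemalloc(struct page *page)
        {
                page->index = 0;
        }
        
        static inline bool sketch_page_is_pfmemalloc(struct page *page)
        {
                /* page->index shares storage with the pfmemalloc marker, so
                 * only the magic value means "allocated from the reserves". */
                return page->index == -1UL;
        }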
      
      [akpm@linux-foundation.org: fix blooper in slub]
      Fixes: c48a11c7 ("netvm: propagate page->pfmemalloc to skb")
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Debugged-by: Vlastimil Babka <vbabka@suse.com>
      Debugged-by: Jiri Bohac <jbohac@suse.com>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: David Miller <davem@davemloft.net>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: <stable@vger.kernel.org>	[3.6+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  14. 11 Aug 2015, 1 commit
  15. 01 Aug 2015, 1 commit
  16. 30 Jul 2015, 1 commit
  17. 22 Jul 2015, 1 commit
    • vxlan: Flow based tunneling · ee122c79
      Committed by Thomas Graf
      Allows putting a VXLAN device into a new flow-based mode in which
      skbs with an ip_tunnel_info dst metadata attached will be encapsulated
      according to the instructions stored therein, with the VXLAN device
      defaults taken into consideration.
      
      Similarly, on the receive side, if the VXLAN_F_COLLECT_METADATA flag is
      set, the packet processing will populate an ip_tunnel_info struct for
      each packet received and attach it to the skb using the new metadata
      dst.  The metadata structure will contain the outer header and tunnel
      header fields which have been stripped off. Layers further up in the
      stack such as routing, tc or netfilter can later match on these fields
      and perform forwarding. It is the responsibility of upper layers to
      ensure that the flag is set if the metadata is needed. The flag limits
      the additional cost of metadata collection to where it is demanded.
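      
      To illustrate what matching on these fields can look like for a layer
      above the device, here is a hedged sketch; it assumes the
      skb_tunnel_info() accessor from <net/dst_metadata.h>, and the helper
      name is made up.
      
        #include <net/dst_metadata.h>
        
        /* Sketch: a layer above a metadata-mode VXLAN device can look at the
         * tunnel key that the receive path attached to the skb. */
        static __be64 sketch_rx_tunnel_id(struct sk_buff *skb)
        {
                struct ip_tunnel_info *info = skb_tunnel_info(skb);
        
                /* No metadata dst attached (e.g. the device is not in
                 * collect-metadata mode): report tunnel id 0. */
                if (!info)
                        return 0;
        
                return info->key.tun_id;
        }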
      
      This prepares the VXLAN device to be steered by the routing and other
      subsystems, which makes it possible to support encapsulation for a large
      number of tunnel endpoints and tunnel ids through a single net_device
      and improves scalability.
      
      It also allows OVS to leverage this mode, which in turn allows for
      the removal of the OVS-specific VXLAN code.
      
      Because the skb is currently scrubbed in vxlan_rcv(), the attachment of
      the new dst metadata is postponed until after scrubbing, which requires
      the temporary addition of a new member to vxlan_metadata. This member
      is removed again in a later commit after the indirect VXLAN receive API
      has been removed.
      Signed-off-by: Thomas Graf <tgraf@suug.ch>
      Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  18. 21 Jul 2015, 2 commits
  19. 13 Jun 2015, 1 commit
  20. 12 Jun 2015, 2 commits
    • netfilter: bridge: refactor frag_max_size · 411ffb4f
      Committed by Bernhard Thaler
      Currently frag_max_size is a member of br_input_skb_cb and is copied back
      and forth using IPCB(skb) and BR_INPUT_SKB_CB(skb) each time it is changed
      or used.
      
      Attach frag_max_size to nf_bridge_info and set its value in the pre_routing
      and forward functions. Use its value in the forward and xmit functions.
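      
      A minimal illustration of the resulting access pattern, as a sketch that
      assumes the frag_max_size member this patch adds to nf_bridge_info:
      
        #include <linux/skbuff.h>
        
        /* Sketch: with frag_max_size carried in nf_bridge_info, the forward
         * and xmit paths can read the value directly instead of copying it
         * between IPCB(skb) and BR_INPUT_SKB_CB(skb). */
        static inline u16 sketch_bridge_frag_max_size(const struct sk_buff *skb)
        {
                return skb->nf_bridge ? skb->nf_bridge->frag_max_size : 0;
        }
      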
      Signed-off-by: Bernhard Thaler <bernhard.thaler@wvnet.at>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • netfilter: bridge: detect NAT66 correctly and change MAC address · 72b31f72
      Committed by Bernhard Thaler
      IPv4 iptables allows one to REDIRECT/DNAT/SNAT any traffic over a bridge.
      
      e.g. REDIRECT
      $ sysctl -w net.bridge.bridge-nf-call-iptables=1
      $ iptables -t nat -A PREROUTING -p tcp -m tcp --dport 8080 \
        -j REDIRECT --to-ports 81
      
      This does not work with ip6tables on a bridge in a NAT66 scenario
      because the REDIRECT/DNAT/SNAT is not correctly detected.
      
      The bridge pre-routing (finish) netfilter hook has to check for a possible
      redirect and then fix the destination MAC address. This makes it possible
      to use ip6tables rules for local REDIRECT/DNAT/SNAT, similar to the IPv4
      iptables version.
      
      e.g. REDIRECT
      $ sysctl -w net.bridge.bridge-nf-call-ip6tables=1
      $ ip6tables -t nat -A PREROUTING -p tcp -m tcp --dport 8080 \
        -j REDIRECT --to-ports 81
      
      This patch makes it possible to use IPv6 NAT66 on a bridge. It was tested
      on a bridge with two interfaces using SNAT/DNAT NAT66 rules.
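      
      For illustration only, a heavily hedged sketch of the MAC fix-up step
      described above; names and structure are illustrative rather than the
      actual br_netfilter code, and it assumes the original destination
      address was saved before the hook ran.
      
        #include <linux/etherdevice.h>
        #include <net/dst.h>
        #include <net/ip6_route.h>
        #include <net/ipv6.h>
        #include <net/neighbour.h>
        
        /* Sketch: if ip6tables PREROUTING rewrote the destination address
         * (REDIRECT/DNAT), re-route the packet and rewrite the destination
         * MAC so the bridged frame reaches the new target. */
        static void sketch_nat66_fixup(struct sk_buff *skb,
                                       const struct in6_addr *orig_daddr)
        {
                struct ipv6hdr *ip6h = ipv6_hdr(skb);
                struct neighbour *neigh;
        
                if (ipv6_addr_equal(&ip6h->daddr, orig_daddr))
                        return;                 /* no DNAT/REDIRECT took place */
        
                skb_dst_drop(skb);
                ip6_route_input(skb);           /* route towards the new daddr */
        
                neigh = dst_neigh_lookup(skb_dst(skb), &ip6h->daddr);
                if (neigh) {
                        memcpy(eth_hdr(skb)->h_dest, neigh->ha, ETH_ALEN);
                        neigh_release(neigh);
                }
        }
      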
      Reported-by: Artie Hamilton <artiemhamilton@yahoo.com>
      Signed-off-by: Sven Eckelmann <sven@open-mesh.com>
      [bernhard.thaler@wvnet.at: rebased, add indirect call to ip6_route_input()]
      [bernhard.thaler@wvnet.at: rebased, split into separate patches]
      Signed-off-by: Bernhard Thaler <bernhard.thaler@wvnet.at>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
  21. 05 Jun 2015, 1 commit
  22. 25 May 2015, 2 commits
  23. 20 May 2015, 1 commit
  24. 18 May 2015, 1 commit
    • net: fix two sparse errors · c91d4606
      Committed by Eric Dumazet
      The first one, in __skb_checksum_validate_complete(), fixes the following
      (and other callers):
      
      make C=2 CF=-D__CHECK_ENDIAN__ net/ipv4/tcp_ipv4.o
        CHECK   net/ipv4/tcp_ipv4.c
      include/linux/skbuff.h:3052:24: warning: incorrect type in return expression (different base types)
      include/linux/skbuff.h:3052:24:    expected restricted __sum16
      include/linux/skbuff.h:3052:24:    got int
      
      The second fixes gso_make_checksum():
      
        CHECK   net/ipv4/gre_offload.c
      include/linux/skbuff.h:3360:14: warning: incorrect type in assignment (different base types)
      include/linux/skbuff.h:3360:14:    expected unsigned short [unsigned] [usertype] csum
      include/linux/skbuff.h:3360:14:    got restricted __sum16
      include/linux/skbuff.h:3365:16: warning: incorrect type in return expression (different base types)
      include/linux/skbuff.h:3365:16:    expected restricted __sum16
      include/linux/skbuff.h:3365:16:    got unsigned short [unsigned] [usertype] csum
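      
      The underlying issue is sparse's __bitwise annotations on the checksum
      types; a minimal sketch of the pattern that keeps them apart follows
      (illustrative, not the actual hunks from this commit).
      
        #include <net/checksum.h>
        
        /* Sketch: keep in-progress sums as __wsum and fold explicitly to the
         * final 16-bit __sum16 instead of mixing them with plain integers,
         * which is what triggers the "different base types" warnings. */
        static __sum16 sketch_finish_csum(__wsum partial)
        {
                return csum_fold(partial);      /* __wsum -> __sum16 */
        }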
      
      Fixes: 5a212329 ("net: Support for csum_bad in skbuff")
      Fixes: 7e2b10c1 ("net: Support for multiple checksums with gso")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      CC: Tom Herbert <tom@herbertland.com>
      Acked-by: Tom Herbert <tom@herbertland.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  25. 14 May 2015, 6 commits
  26. 12 May 2015, 1 commit