1. 06 11月, 2014 9 次提交
    • D
      ipv6: mld: fix add_grhead skb_over_panic for devs with large MTUs · 4c672e4b
      Daniel Borkmann 提交于
      It has been reported that generating an MLD listener report on
      devices with large MTUs (e.g. 9000) and a high number of IPv6
      addresses can trigger a skb_over_panic():
      
      skbuff: skb_over_panic: text:ffffffff80612a5d len:3776 put:20
      head:ffff88046d751000 data:ffff88046d751010 tail:0xed0 end:0xec0
      dev:port1
       ------------[ cut here ]------------
      kernel BUG at net/core/skbuff.c:100!
      invalid opcode: 0000 [#1] SMP
      Modules linked in: ixgbe(O)
      CPU: 3 PID: 0 Comm: swapper/3 Tainted: G O 3.14.23+ #4
      [...]
      Call Trace:
       <IRQ>
       [<ffffffff80578226>] ? skb_put+0x3a/0x3b
       [<ffffffff80612a5d>] ? add_grhead+0x45/0x8e
       [<ffffffff80612e3a>] ? add_grec+0x394/0x3d4
       [<ffffffff80613222>] ? mld_ifc_timer_expire+0x195/0x20d
       [<ffffffff8061308d>] ? mld_dad_timer_expire+0x45/0x45
       [<ffffffff80255b5d>] ? call_timer_fn.isra.29+0x12/0x68
       [<ffffffff80255d16>] ? run_timer_softirq+0x163/0x182
       [<ffffffff80250e6f>] ? __do_softirq+0xe0/0x21d
       [<ffffffff8025112b>] ? irq_exit+0x4e/0xd3
       [<ffffffff802214bb>] ? smp_apic_timer_interrupt+0x3b/0x46
       [<ffffffff8063f10a>] ? apic_timer_interrupt+0x6a/0x70
      
      mld_newpack() skb allocations are usually requested with dev->mtu
      in size, since commit 72e09ad1 ("ipv6: avoid high order allocations")
      we have changed the limit in order to be less likely to fail.
      
      However, in MLD/IGMP code, we have some rather ugly AVAILABLE(skb)
      macros, which determine if we may end up doing an skb_put() for
      adding another record. To avoid possible fragmentation, we check
      the skb's tailroom as skb->dev->mtu - skb->len, which is a wrong
      assumption as the actual max allocation size can be much smaller.
      
      The IGMP case doesn't have this issue as commit 57e1ab6e
      ("igmp: refine skb allocations") stores the allocation size in
      the cb[].
      
      Set a reserved_tailroom to make it fit into the MTU and use
      skb_availroom() helper instead. This also allows to get rid of
      igmp_skb_size().
      Reported-by: NWei Liu <lw1a2.jing@gmail.com>
      Fixes: 72e09ad1 ("ipv6: avoid high order allocations")
      Signed-off-by: NDaniel Borkmann <dborkman@redhat.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Cc: David L Stevens <david.stevens@oracle.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4c672e4b
    • J
      net: Convert SEQ_START_TOKEN/seq_printf to seq_puts · 1744bea1
      Joe Perches 提交于
      Using a single fixed string is smaller code size than using
      a format and many string arguments.
      
      Reduces overall code size a little.
      
      $ size net/ipv4/igmp.o* net/ipv6/mcast.o* net/ipv6/ip6_flowlabel.o*
         text	   data	    bss	    dec	    hex	filename
        34269	   7012	  14824	  56105	   db29	net/ipv4/igmp.o.new
        34315	   7012	  14824	  56151	   db57	net/ipv4/igmp.o.old
        30078	   7869	  13200	  51147	   c7cb	net/ipv6/mcast.o.new
        30105	   7869	  13200	  51174	   c7e6	net/ipv6/mcast.o.old
        11434	   3748	   8580	  23762	   5cd2	net/ipv6/ip6_flowlabel.o.new
        11491	   3748	   8580	  23819	   5d0b	net/ipv6/ip6_flowlabel.o.old
      Signed-off-by: NJoe Perches <joe@perches.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1744bea1
    • D
      net: Add and use skb_copy_datagram_msg() helper. · 51f3d02b
      David S. Miller 提交于
      This encapsulates all of the skb_copy_datagram_iovec() callers
      with call argument signature "skb, offset, msghdr->msg_iov, length".
      
      When we move to iov_iters in the networking, the iov_iter object will
      sit in the msghdr.
      
      Having a helper like this means there will be less places to touch
      during that transformation.
      
      Based upon descriptions and patch from Al Viro.
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      51f3d02b
    • T
      gue: Receive side of remote checksum offload · a8d31c12
      Tom Herbert 提交于
      Add processing of the remote checksum offload option in both the normal
      path as well as the GRO path. The implements patching the affected
      checksum to derive the offloaded checksum.
      Signed-off-by: NTom Herbert <therbert@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a8d31c12
    • T
      gue: TX support for using remote checksum offload option · b17f709a
      Tom Herbert 提交于
      Add if_tunnel flag TUNNEL_ENCAP_FLAG_REMCSUM to configure
      remote checksum offload on an IP tunnel. Add logic in gue_build_header
      to insert remote checksum offload option.
      Signed-off-by: NTom Herbert <therbert@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b17f709a
    • T
      udp: Changes to udp_offload to support remote checksum offload · e585f236
      Tom Herbert 提交于
      Add a new GSO type, SKB_GSO_TUNNEL_REMCSUM, which indicates remote
      checksum offload being done (in this case inner checksum must not
      be offloaded to the NIC).
      
      Added logic in __skb_udp_tunnel_segment to handle remote checksum
      offload case.
      Signed-off-by: NTom Herbert <therbert@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e585f236
    • T
      gue: Add infrastructure for flags and options · 5024c33a
      Tom Herbert 提交于
      Add functions and basic definitions for processing standard flags,
      private flags, and control messages. This includes definitions
      to compute length of optional fields corresponding to a set of flags.
      Flag validation is in validate_gue_flags function. This checks for
      unknown flags, and that length of optional fields is <= length
      in guehdr hlen.
      Signed-off-by: NTom Herbert <therbert@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5024c33a
    • T
      udp: Offload outer UDP tunnel csum if available · 4bcb877d
      Tom Herbert 提交于
      In __skb_udp_tunnel_segment if outer UDP checksums are enabled and
      ip_summed is not already CHECKSUM_PARTIAL, set up checksum offload
      if device features allow it.
      Signed-off-by: NTom Herbert <therbert@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4bcb877d
    • T
      net: Move fou_build_header into fou.c and refactor · 63487bab
      Tom Herbert 提交于
      Move fou_build_header out of ip_tunnel.c and into fou.c splitting
      it up into fou_build_header, gue_build_header, and fou_build_udp.
      This allows for other users for TX of FOU or GUE. Change ip_tunnel_encap
      to call fou_build_header or gue_build_header based on the tunnel
      encapsulation type. Similarly, added fou_encap_hlen and gue_encap_hlen
      functions which are called by ip_encap_hlen. New net/fou.h has
      prototypes and defines for this.
      
      Added NET_FOU_IP_TUNNELS configuration. When this is set, IP tunnels
      can use FOU/GUE and fou module is also selected.
      Signed-off-by: NTom Herbert <therbert@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      63487bab
  2. 05 11月, 2014 14 次提交
  3. 31 10月, 2014 7 次提交
  4. 30 10月, 2014 3 次提交
    • N
      inet: frags: remove the WARN_ON from inet_evict_bucket · d70127e8
      Nikolay Aleksandrov 提交于
      The WARN_ON in inet_evict_bucket can be triggered by a valid case:
      inet_frag_kill and inet_evict_bucket can be running in parallel on the
      same queue which means that there has been at least one more ref added
      by a previous inet_frag_find call, but inet_frag_kill can delete the
      timer before inet_evict_bucket which will cause the WARN_ON() there to
      trigger since we'll have refcnt!=1. Now, this case is valid because the
      queue is being "killed" for some reason (removed from the chain list and
      its timer deleted) so it will get destroyed in the end by one of the
      inet_frag_put() calls which reaches 0 i.e. refcnt is still valid.
      
      CC: Florian Westphal <fw@strlen.de>
      CC: Eric Dumazet <eric.dumazet@gmail.com>
      CC: Patrick McLean <chutzpah@gentoo.org>
      
      Fixes: b13d3cbf ("inet: frag: move eviction of queues to work queue")
      Reported-by: NPatrick McLean <chutzpah@gentoo.org>
      Signed-off-by: NNikolay Aleksandrov <nikolay@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d70127e8
    • N
      inet: frags: fix a race between inet_evict_bucket and inet_frag_kill · 65ba1f1e
      Nikolay Aleksandrov 提交于
      When the evictor is running it adds some chosen frags to a local list to
      be evicted once the chain lock has been released but at the same time
      the *frag_queue can be running for some of the same queues and it
      may call inet_frag_kill which will wait on the chain lock and
      will then delete the queue from the wrong list since it was added in the
      eviction one. The fix is simple - check if the queue has the evict flag
      set under the chain lock before deleting it, this is safe because the
      evict flag is set only under that lock and having the flag set also means
      that the queue has been detached from the chain list, so no need to delete
      it again.
      An important note to make is that we're safe w.r.t refcnt because
      inet_frag_kill and inet_evict_bucket will sync on the del_timer operation
      where only one of the two can succeed (or if the timer is executing -
      none of them), the cases are:
      1. inet_frag_kill succeeds in del_timer
       - then the timer ref is removed, but inet_evict_bucket will not add
         this queue to its expire list but will restart eviction in that chain
      2. inet_evict_bucket succeeds in del_timer
       - then the timer ref is kept until the evictor "expires" the queue, but
         inet_frag_kill will remove the initial ref and will set
         INET_FRAG_COMPLETE which will make the frag_expire fn just to remove
         its ref.
      In the end all of the queue users will do an inet_frag_put and the one
      that reaches 0 will free it. The refcount balance should be okay.
      
      CC: Florian Westphal <fw@strlen.de>
      CC: Eric Dumazet <eric.dumazet@gmail.com>
      CC: Patrick McLean <chutzpah@gentoo.org>
      
      Fixes: b13d3cbf ("inet: frag: move eviction of queues to work queue")
      Suggested-by: NEric Dumazet <eric.dumazet@gmail.com>
      Reported-by: NPatrick McLean <chutzpah@gentoo.org>
      Tested-by: NPatrick McLean <chutzpah@gentoo.org>
      Signed-off-by: NNikolay Aleksandrov <nikolay@redhat.com>
      Reviewed-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      65ba1f1e
    • E
      tcp: allow for bigger reordering level · dca145ff
      Eric Dumazet 提交于
      While testing upcoming Yaogong patch (converting out of order queue
      into an RB tree), I hit the max reordering level of linux TCP stack.
      
      Reordering level was limited to 127 for no good reason, and some
      network setups [1] can easily reach this limit and get limited
      throughput.
      
      Allow a new max limit of 300, and add a sysctl to allow admins to even
      allow bigger (or lower) values if needed.
      
      [1] Aggregation of links, per packet load balancing, fabrics not doing
       deep packet inspections, alternative TCP congestion modules...
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Yaogong Wang <wygivan@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      dca145ff
  5. 28 10月, 2014 1 次提交
  6. 26 10月, 2014 1 次提交
    • E
      tcp: md5: do not use alloc_percpu() · 349ce993
      Eric Dumazet 提交于
      percpu tcp_md5sig_pool contains memory blobs that ultimately
      go through sg_set_buf().
      
      -> sg_set_page(sg, virt_to_page(buf), buflen, offset_in_page(buf));
      
      This requires that whole area is in a physically contiguous portion
      of memory. And that @buf is not backed by vmalloc().
      
      Given that alloc_percpu() can use vmalloc() areas, this does not
      fit the requirements.
      
      Replace alloc_percpu() by a static DEFINE_PER_CPU() as tcp_md5sig_pool
      is small anyway, there is no gain to dynamically allocate it.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Fixes: 765cf997 ("tcp: md5: remove one indirection level in tcp_md5sig_pool")
      Reported-by: NCrestez Dan Leonard <cdleonard@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      349ce993
  7. 23 10月, 2014 1 次提交
    • S
      net: fix saving TX flow hash in sock for outgoing connections · 9e7ceb06
      Sathya Perla 提交于
      The commit "net: Save TX flow hash in sock and set in skbuf on xmit"
      introduced the inet_set_txhash() and ip6_set_txhash() routines to calculate
      and record flow hash(sk_txhash) in the socket structure. sk_txhash is used
      to set skb->hash which is used to spread flows across multiple TXQs.
      
      But, the above routines are invoked before the source port of the connection
      is created. Because of this all outgoing connections that just differ in the
      source port get hashed into the same TXQ.
      
      This patch fixes this problem for IPv4/6 by invoking the the above routines
      after the source port is available for the socket.
      
      Fixes: b73c3d0e("net: Save TX flow hash in sock and set in skbuf on xmit")
      Signed-off-by: NSathya Perla <sathya.perla@emulex.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9e7ceb06
  8. 21 10月, 2014 2 次提交
    • F
      net: make skb_gso_segment error handling more robust · 330966e5
      Florian Westphal 提交于
      skb_gso_segment has three possible return values:
      1. a pointer to the first segmented skb
      2. an errno value (IS_ERR())
      3. NULL.  This can happen when GSO is used for header verification.
      
      However, several callers currently test IS_ERR instead of IS_ERR_OR_NULL
      and would oops when NULL is returned.
      
      Note that these call sites should never actually see such a NULL return
      value; all callers mask out the GSO bits in the feature argument.
      
      However, there have been issues with some protocol handlers erronously not
      respecting the specified feature mask in some cases.
      
      It is preferable to get 'have to turn off hw offloading, else slow' reports
      rather than 'kernel crashes'.
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      330966e5
    • F
      net: gso: use feature flag argument in all protocol gso handlers · 1e16aa3d
      Florian Westphal 提交于
      skb_gso_segment() has a 'features' argument representing offload features
      available to the output path.
      
      A few handlers, e.g. GRE, instead re-fetch the features of skb->dev and use
      those instead of the provided ones when handing encapsulation/tunnels.
      
      Depending on dev->hw_enc_features of the output device skb_gso_segment() can
      then return NULL even when the caller has disabled all GSO feature bits,
      as segmentation of inner header thinks device will take care of segmentation.
      
      This e.g. affects the tbf scheduler, which will silently drop GRE-encap GSO skbs
      that did not fit the remaining token quota as the segmentation does not work
      when device supports corresponding hw offload capabilities.
      
      Cc: Pravin B Shelar <pshelar@nicira.com>
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1e16aa3d
  9. 19 10月, 2014 1 次提交
  10. 18 10月, 2014 1 次提交