1. 13 11月, 2014 2 次提交
  2. 12 11月, 2014 4 次提交
    • W
      neigh: remove dynamic neigh table registration support · d7480fd3
      WANG Cong 提交于
      Currently there are only three neigh tables in the whole kernel:
      arp table, ndisc table and decnet neigh table. What's more,
      we don't support registering multiple tables per family.
      Therefore we can just make these tables statically built-in.
      
      Cc: David S. Miller <davem@davemloft.net>
      Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d7480fd3
    • J
      net: Convert LIMIT_NETDEBUG to net_dbg_ratelimited · ba7a46f1
      Joe Perches 提交于
      Use the more common dynamic_debug capable net_dbg_ratelimited
      and remove the LIMIT_NETDEBUG macro.
      
      All messages are still ratelimited.
      
      Some KERN_<LEVEL> uses are changed to KERN_DEBUG.
      
      This may have some negative impact on messages that were
      emitted at KERN_INFO that are not not enabled at all unless
      DEBUG is defined or dynamic_debug is enabled.  Even so,
      these messages are now _not_ emitted by default.
      
      This also eliminates the use of the net_msg_warn sysctl
      "/proc/sys/net/core/warnings".  For backward compatibility,
      the sysctl is not removed, but it has no function.  The extern
      declaration of net_msg_warn is removed from sock.h and made
      static in net/core/sysctl_net_core.c
      
      Miscellanea:
      
      o Update the sysctl documentation
      o Remove the embedded uses of pr_fmt
      o Coalesce format fragments
      o Realign arguments
      Signed-off-by: NJoe Perches <joe@perches.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ba7a46f1
    • E
      net: introduce SO_INCOMING_CPU · 2c8c56e1
      Eric Dumazet 提交于
      Alternative to RPS/RFS is to use hardware support for multiple
      queues.
      
      Then split a set of million of sockets into worker threads, each
      one using epoll() to manage events on its own socket pool.
      
      Ideally, we want one thread per RX/TX queue/cpu, but we have no way to
      know after accept() or connect() on which queue/cpu a socket is managed.
      
      We normally use one cpu per RX queue (IRQ smp_affinity being properly
      set), so remembering on socket structure which cpu delivered last packet
      is enough to solve the problem.
      
      After accept(), connect(), or even file descriptor passing around
      processes, applications can use :
      
       int cpu;
       socklen_t len = sizeof(cpu);
      
       getsockopt(fd, SOL_SOCKET, SO_INCOMING_CPU, &cpu, &len);
      
      And use this information to put the socket into the right silo
      for optimal performance, as all networking stack should run
      on the appropriate cpu, without need to send IPI (RPS/RFS).
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2c8c56e1
    • E
      tcp: move sk_mark_napi_id() at the right place · 3d97379a
      Eric Dumazet 提交于
      sk_mark_napi_id() is used to record for a flow napi id of incoming
      packets for busypoll sake.
      We should do this only on established flows, not on listeners.
      
      This was 'working' by virtue of the socket cloning, but doing
      this on SYN packets in unecessary cache line dirtying.
      
      Even if we move sk_napi_id in the same cache line than sk_lock,
      we are working to make SYN processing lockless, so it is desirable
      to set sk_napi_id only for established flows.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3d97379a
  3. 11 11月, 2014 2 次提交
  4. 08 11月, 2014 1 次提交
  5. 07 11月, 2014 2 次提交
  6. 06 11月, 2014 14 次提交
    • P
      net: Remove MPLS GSO feature. · 59b93b41
      Pravin B Shelar 提交于
      Device can export MPLS GSO support in dev->mpls_features same way
      it export vlan features in dev->vlan_features. So it is safe to
      remove NETIF_F_GSO_MPLS redundant flag.
      Signed-off-by: NPravin B Shelar <pshelar@nicira.com>
      59b93b41
    • T
      fou: Fix typo in returning flags in netlink · e1b2cb65
      Tom Herbert 提交于
      When filling netlink info, dport is being returned as flags. Fix
      instances to return correct value.
      Signed-off-by: NTom Herbert <therbert@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e1b2cb65
    • D
      ipv6: mld: fix add_grhead skb_over_panic for devs with large MTUs · 4c672e4b
      Daniel Borkmann 提交于
      It has been reported that generating an MLD listener report on
      devices with large MTUs (e.g. 9000) and a high number of IPv6
      addresses can trigger a skb_over_panic():
      
      skbuff: skb_over_panic: text:ffffffff80612a5d len:3776 put:20
      head:ffff88046d751000 data:ffff88046d751010 tail:0xed0 end:0xec0
      dev:port1
       ------------[ cut here ]------------
      kernel BUG at net/core/skbuff.c:100!
      invalid opcode: 0000 [#1] SMP
      Modules linked in: ixgbe(O)
      CPU: 3 PID: 0 Comm: swapper/3 Tainted: G O 3.14.23+ #4
      [...]
      Call Trace:
       <IRQ>
       [<ffffffff80578226>] ? skb_put+0x3a/0x3b
       [<ffffffff80612a5d>] ? add_grhead+0x45/0x8e
       [<ffffffff80612e3a>] ? add_grec+0x394/0x3d4
       [<ffffffff80613222>] ? mld_ifc_timer_expire+0x195/0x20d
       [<ffffffff8061308d>] ? mld_dad_timer_expire+0x45/0x45
       [<ffffffff80255b5d>] ? call_timer_fn.isra.29+0x12/0x68
       [<ffffffff80255d16>] ? run_timer_softirq+0x163/0x182
       [<ffffffff80250e6f>] ? __do_softirq+0xe0/0x21d
       [<ffffffff8025112b>] ? irq_exit+0x4e/0xd3
       [<ffffffff802214bb>] ? smp_apic_timer_interrupt+0x3b/0x46
       [<ffffffff8063f10a>] ? apic_timer_interrupt+0x6a/0x70
      
      mld_newpack() skb allocations are usually requested with dev->mtu
      in size, since commit 72e09ad1 ("ipv6: avoid high order allocations")
      we have changed the limit in order to be less likely to fail.
      
      However, in MLD/IGMP code, we have some rather ugly AVAILABLE(skb)
      macros, which determine if we may end up doing an skb_put() for
      adding another record. To avoid possible fragmentation, we check
      the skb's tailroom as skb->dev->mtu - skb->len, which is a wrong
      assumption as the actual max allocation size can be much smaller.
      
      The IGMP case doesn't have this issue as commit 57e1ab6e
      ("igmp: refine skb allocations") stores the allocation size in
      the cb[].
      
      Set a reserved_tailroom to make it fit into the MTU and use
      skb_availroom() helper instead. This also allows to get rid of
      igmp_skb_size().
      Reported-by: NWei Liu <lw1a2.jing@gmail.com>
      Fixes: 72e09ad1 ("ipv6: avoid high order allocations")
      Signed-off-by: NDaniel Borkmann <dborkman@redhat.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Cc: David L Stevens <david.stevens@oracle.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4c672e4b
    • J
      net: Convert SEQ_START_TOKEN/seq_printf to seq_puts · 1744bea1
      Joe Perches 提交于
      Using a single fixed string is smaller code size than using
      a format and many string arguments.
      
      Reduces overall code size a little.
      
      $ size net/ipv4/igmp.o* net/ipv6/mcast.o* net/ipv6/ip6_flowlabel.o*
         text	   data	    bss	    dec	    hex	filename
        34269	   7012	  14824	  56105	   db29	net/ipv4/igmp.o.new
        34315	   7012	  14824	  56151	   db57	net/ipv4/igmp.o.old
        30078	   7869	  13200	  51147	   c7cb	net/ipv6/mcast.o.new
        30105	   7869	  13200	  51174	   c7e6	net/ipv6/mcast.o.old
        11434	   3748	   8580	  23762	   5cd2	net/ipv6/ip6_flowlabel.o.new
        11491	   3748	   8580	  23819	   5d0b	net/ipv6/ip6_flowlabel.o.old
      Signed-off-by: NJoe Perches <joe@perches.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1744bea1
    • M
      tcp: zero retrans_stamp if all retrans were acked · 1f37bf87
      Marcelo Leitner 提交于
      Ueki Kohei reported that when we are using NewReno with connections that
      have a very low traffic, we may timeout the connection too early if a
      second loss occurs after the first one was successfully acked but no
      data was transfered later. Below is his description of it:
      
      When SACK is disabled, and a socket suffers multiple separate TCP
      retransmissions, that socket's ETIMEDOUT value is calculated from the
      time of the *first* retransmission instead of the *latest*
      retransmission.
      
      This happens because the tcp_sock's retrans_stamp is set once then never
      cleared.
      
      Take the following connection:
      
                            Linux                    remote-machine
                              |                           |
               send#1---->(*1)|--------> data#1 --------->|
                        |     |                           |
                       RTO    :                           :
                        |     |                           |
                       ---(*2)|----> data#1(retrans) ---->|
                        | (*3)|<---------- ACK <----------|
                        |     |                           |
                        |     :                           :
                        |     :                           :
                        |     :                           :
                      16 minutes (or more)                :
                        |     :                           :
                        |     :                           :
                        |     :                           :
                        |     |                           |
               send#2---->(*4)|--------> data#2 --------->|
                        |     |                           |
                       RTO    :                           :
                        |     |                           |
                       ---(*5)|----> data#2(retrans) ---->|
                        |     |                           |
                        |     |                           |
                      RTO*2   :                           :
                        |     |                           |
                        |     |                           |
            ETIMEDOUT<----(*6)|                           |
      
      (*1) One data packet sent.
      (*2) Because no ACK packet is received, the packet is retransmitted.
      (*3) The ACK packet is received. The transmitted packet is acknowledged.
      
      At this point the first "retransmission event" has passed and been
      recovered from. Any future retransmission is a completely new "event".
      
      (*4) After 16 minutes (to correspond with retries2=15), a new data
      packet is sent. Note: No data is transmitted between (*3) and (*4).
      
      The socket's timeout SHOULD be calculated from this point in time, but
      instead it's calculated from the prior "event" 16 minutes ago.
      
      (*5) Because no ACK packet is received, the packet is retransmitted.
      (*6) At the time of the 2nd retransmission, the socket returns
      ETIMEDOUT.
      
      Therefore, now we clear retrans_stamp as soon as all data during the
      loss window is fully acked.
      
      Reported-by: Ueki Kohei
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: NMarcelo Ricardo Leitner <mleitner@redhat.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Tested-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1f37bf87
    • D
      net: Add and use skb_copy_datagram_msg() helper. · 51f3d02b
      David S. Miller 提交于
      This encapsulates all of the skb_copy_datagram_iovec() callers
      with call argument signature "skb, offset, msghdr->msg_iov, length".
      
      When we move to iov_iters in the networking, the iov_iter object will
      sit in the msghdr.
      
      Having a helper like this means there will be less places to touch
      during that transformation.
      
      Based upon descriptions and patch from Al Viro.
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      51f3d02b
    • T
      gue: Receive side of remote checksum offload · a8d31c12
      Tom Herbert 提交于
      Add processing of the remote checksum offload option in both the normal
      path as well as the GRO path. The implements patching the affected
      checksum to derive the offloaded checksum.
      Signed-off-by: NTom Herbert <therbert@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a8d31c12
    • T
      gue: TX support for using remote checksum offload option · b17f709a
      Tom Herbert 提交于
      Add if_tunnel flag TUNNEL_ENCAP_FLAG_REMCSUM to configure
      remote checksum offload on an IP tunnel. Add logic in gue_build_header
      to insert remote checksum offload option.
      Signed-off-by: NTom Herbert <therbert@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b17f709a
    • T
      udp: Changes to udp_offload to support remote checksum offload · e585f236
      Tom Herbert 提交于
      Add a new GSO type, SKB_GSO_TUNNEL_REMCSUM, which indicates remote
      checksum offload being done (in this case inner checksum must not
      be offloaded to the NIC).
      
      Added logic in __skb_udp_tunnel_segment to handle remote checksum
      offload case.
      Signed-off-by: NTom Herbert <therbert@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e585f236
    • T
      gue: Add infrastructure for flags and options · 5024c33a
      Tom Herbert 提交于
      Add functions and basic definitions for processing standard flags,
      private flags, and control messages. This includes definitions
      to compute length of optional fields corresponding to a set of flags.
      Flag validation is in validate_gue_flags function. This checks for
      unknown flags, and that length of optional fields is <= length
      in guehdr hlen.
      Signed-off-by: NTom Herbert <therbert@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5024c33a
    • T
      udp: Offload outer UDP tunnel csum if available · 4bcb877d
      Tom Herbert 提交于
      In __skb_udp_tunnel_segment if outer UDP checksums are enabled and
      ip_summed is not already CHECKSUM_PARTIAL, set up checksum offload
      if device features allow it.
      Signed-off-by: NTom Herbert <therbert@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4bcb877d
    • T
      net: Move fou_build_header into fou.c and refactor · 63487bab
      Tom Herbert 提交于
      Move fou_build_header out of ip_tunnel.c and into fou.c splitting
      it up into fou_build_header, gue_build_header, and fou_build_udp.
      This allows for other users for TX of FOU or GUE. Change ip_tunnel_encap
      to call fou_build_header or gue_build_header based on the tunnel
      encapsulation type. Similarly, added fou_encap_hlen and gue_encap_hlen
      functions which are called by ip_encap_hlen. New net/fou.h has
      prototypes and defines for this.
      
      Added NET_FOU_IP_TUNNELS configuration. When this is set, IP tunnels
      can use FOU/GUE and fou module is also selected.
      Signed-off-by: NTom Herbert <therbert@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      63487bab
    • J
      geneve: Unregister pernet subsys on module unload. · d3ca9eaf
      Jesse Gross 提交于
      The pernet ops aren't ever unregistered, which causes a memory
      leak and an OOPs if the module is ever reinserted.
      
      Fixes: 0b5e8b8e ("net: Add Geneve tunneling protocol driver")
      CC: Andy Zhou <azhou@nicira.com>
      Signed-off-by: NJesse Gross <jesse@nicira.com>
      Acked-by: NNicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d3ca9eaf
    • J
      geneve: Set GSO type on transmit. · 45cac46e
      Jesse Gross 提交于
      Geneve does not currently set the inner protocol type when
      transmitting packets. This causes GSO segmentation to fail on NICs
      that do not support Geneve offloading.
      
      CC: Andy Zhou <azhou@nicira.com>
      Signed-off-by: NJesse Gross <jesse@nicira.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      45cac46e
  7. 05 11月, 2014 14 次提交
  8. 31 10月, 2014 1 次提交