1. 04 Nov, 2012 4 commits
    • ipv6: introduce ip6_rt_put() · 94e187c0
      Amerigo Wang committed
      As suggested by Eric, we could introduce a helper function
      for ipv6 too, to avoid checking if rt is NULL before
      dst_release().
      
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: David S. Miller <davem@davemloft.net>
      Signed-off-by: Cong Wang <amwang@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv4: avoid a test in ip_rt_put() · 6da025fa
      Eric Dumazet committed
      We can save a test in ip_rt_put(), considering that dst_release() accepts
      a NULL parameter and that dst is the first element of rtable.
      
      Add a BUILD_BUG_ON() to catch any change that could break this
      assertion.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Cong Wang <amwang@redhat.com>
      Acked-by: Cong Wang <amwang@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
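The idea in this commit can be sketched in user-space C. This is a minimal illustration, not the kernel code: the structures are simplified stand-ins, the reference count is a plain int, and C11 `_Static_assert` stands in for BUILD_BUG_ON().

```c
#include <stddef.h>

/* Simplified stand-ins for the kernel structures; only the layout matters. */
struct dst_entry {
    int refcnt;
};

struct rtable {
    struct dst_entry dst;   /* must stay the first member */
    unsigned int rt_flags;
};

/* Like the kernel helper, this tolerates a NULL argument. */
static void dst_release(struct dst_entry *dst)
{
    if (dst)
        dst->refcnt--;
}

static void ip_rt_put(struct rtable *rt)
{
    /* Compile-time guard: if dst ever moves away from offset 0, a NULL rt
     * would no longer correspond to a NULL dst and the shortcut would break. */
    _Static_assert(offsetof(struct rtable, dst) == 0,
                   "dst must be the first member of struct rtable");

    /* No "if (rt)" test needed: a pointer to a struct converts to a pointer
     * to its first member, and NULL stays NULL. */
    dst_release((struct dst_entry *)rt);
}
```

With this layout guarantee, `ip_rt_put(NULL)` is a harmless no-op and callers need no NULL check of their own.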
    • sctp: Clean up type-punning in sctp_cmd_t union · b26ddd81
      Neil Horman committed
      Lots of points in the sctp_cmd_interpreter function treat the sctp_cmd_t arg as
      a void pointer, even though they are written as various other types.  There's no
      need for this, as doing so just leads to possible type-punning issues that could
      cause crashes, and if we remain type-consistent we can actually just remove the
      void * member of the union entirely.
      
      Change Notes:
      
      v2)
        * Dropped the chunk that modified SCTP_NULL to create a marker pattern
          should anyone try to use a SCTP_NULL()-assigned sctp_arg_t. Assigning
          to .zero provides the same effect and should be faster, per Vlad Y.
      
      v3)
        * Reverted part of v2, opting to use memset instead of .zero, so that
          the entire union is initialized, thus avoiding the ia64 speculative load
          problems previously encountered, per Dave M. Also rewrote
          SCTP_[NO]FORCE to use common infrastructure a little more.
      
      Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
      CC: Vlad Yasevich <vyasevich@gmail.com>
      CC: "David S. Miller" <davem@davemloft.net>
      CC: linux-sctp@vger.kernel.org
      Signed-off-by: David S. Miller <davem@davemloft.net>
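A rough sketch of the pattern this commit describes: a command-argument union with only typed members, built through small constructors that zero the whole union with memset first. All names here (`cmd_arg_t`, `cmd_arg_null`, the payload structs) are hypothetical and much smaller than the real SCTP definitions.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical payload types standing in for SCTP chunks and transports. */
struct chunk { int id; };
struct transport { int id; };

/* Every member is typed; there is deliberately no void * member, so each
 * reader has to name the same member the writer stored. */
typedef union {
    int32_t i32;
    uint32_t u32;
    struct chunk *chunk;
    struct transport *transport;
} cmd_arg_t;

/* Zero the entire union, not just one member, so no bytes stay
 * uninitialized regardless of which member is the widest. */
static cmd_arg_t cmd_arg_null(void)
{
    cmd_arg_t arg;

    memset(&arg, 0, sizeof(arg));
    return arg;
}

static cmd_arg_t cmd_arg_i32(int32_t v)
{
    cmd_arg_t arg = cmd_arg_null();

    arg.i32 = v;
    return arg;
}

static cmd_arg_t cmd_arg_chunk(struct chunk *c)
{
    cmd_arg_t arg = cmd_arg_null();

    arg.chunk = c;
    return arg;
}
```

A consumer then reads `arg.chunk` or `arg.i32` explicitly, which keeps the write and the read type-consistent.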
    • tcp: better retrans tracking for defer-accept · e6c022a4
      Eric Dumazet committed
      For passive TCP connections using TCP_DEFER_ACCEPT facility,
      we incorrectly increment req->retrans each time timeout triggers
      while no SYNACK is sent.
      
      SYNACKs are not sent for TCP_DEFER_ACCEPT requests that were established (for
      which we received the ACK from the client). Only the last SYNACK is sent
      so that we can again receive an ACK from the client, to move the req into
      the accept queue. We plan to change this later to avoid the useless
      retransmit (and a potential problem, as this SYNACK could be lost).
      
      TCP_INFO later gives wrong information to user, claiming imaginary
      retransmits.
      
      Split the req->retrans field into two independent fields:
      
      num_retrans : number of retransmits
      num_timeout : number of timeouts
      
      num_timeout is the counter that is incremented at each timeout,
      regardless of actual SYNACK being sent or not, and used to
      compute the exponential timeout.
      
      Introduce inet_rtx_syn_ack() helper to increment num_retrans
      only if ->rtx_syn_ack() succeeded.
      
      Use inet_rtx_syn_ack() from tcp_check_req() to increment num_retrans
      when we re-send a SYNACK in answer to a (retransmitted) SYN.
      Prior to this patch, we were not counting these retransmits.
      
      Change tcp_v[46]_rtx_synack() to increment TCP_MIB_RETRANSSEGS
      only if a synack packet was successfully queued.
      Reported-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Julian Anastasov <ja@ssi.bg>
      Cc: Vijay Subramanian <subramanian.vijay@gmail.com>
      Cc: Elliott Hughes <enh@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
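The split between the two counters might look roughly like the following user-space sketch. The field names num_retrans and num_timeout come from the commit message; everything else (`rtx_hook`, `send_ok`, `send_fail`, `on_timeout`, `next_timeout`) is a hypothetical stand-in, not the kernel's code.

```c
#include <stdbool.h>

struct request_sock {
    unsigned int num_retrans; /* SYNACKs actually retransmitted */
    unsigned int num_timeout; /* timer expirations, sent or not */
};

/* Hypothetical transmit hook: returns 0 on success, nonzero on failure. */
typedef int (*rtx_hook)(struct request_sock *req);

static int send_ok(struct request_sock *req)   { (void)req; return 0; }
static int send_fail(struct request_sock *req) { (void)req; return -1; }

/* Count a retransmit only if the SYNACK was really sent. */
static int rtx_syn_ack(struct request_sock *req, rtx_hook send)
{
    int err = send(req);

    if (!err)
        req->num_retrans++;
    return err;
}

/* Timer expiry: the timeout counter always advances, while the SYNACK may
 * be skipped (as for an established TCP_DEFER_ACCEPT request). */
static void on_timeout(struct request_sock *req, rtx_hook send, bool skip_synack)
{
    if (!skip_synack)
        rtx_syn_ack(req, send);
    req->num_timeout++;
}

/* Exponential backoff driven by num_timeout, capped at max. */
static unsigned long next_timeout(const struct request_sock *req,
                                  unsigned long base, unsigned long max)
{
    unsigned long t = base;
    unsigned int i;

    for (i = 0; i < req->num_timeout; i++) {
        if (t > max / 2)
            return max;
        t *= 2;
    }
    return t < max ? t : max;
}
```

Because the backoff depends only on num_timeout, skipped or failed SYNACKs still lengthen the timer, while num_retrans reports only packets that were actually sent.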
  2. 26 Oct, 2012 4 commits
  3. 23 Oct, 2012 2 commits
  4. 22 Oct, 2012 1 commit
  5. 09 Oct, 2012 2 commits
    • ipv4: Add FLOWI_FLAG_KNOWN_NH · c92b9655
      Julian Anastasov committed
      Add a flag to request that the output route be
      returned with a known rt_gateway, in case we want to use
      it as the nexthop for neighbour resolution.
      
      The returned route can be cached as follows:
      
      - in an NH exception, because the cached routes are not shared
        with other destinations
      - in the FIB NH, when using a gateway, because all destinations for
        the NH share the same gateway
      
      As a last option, to return rt_gateway != 0 we have to
      set DST_NOCACHE.
      Signed-off-by: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv4: introduce rt_uses_gateway · 155e8336
      Julian Anastasov committed
      Add a new flag to remember when a route is via a gateway.
      We will use it to allow rt_gateway to contain the address of a
      directly connected host for the cases when DST_NOCACHE is
      used or when the NH exception caches a per-destination route
      without the DST_NOCACHE flag, i.e. when routes are not used for
      other destinations. This way we force neighbour
      resolution to work with the routed destination while
      using a different address in the packet, a feature needed
      for IPVS-DR, where the original packet for the virtual IP is routed
      via the route to the real IP.
      Signed-off-by: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  6. 06 Oct, 2012 1 commit
  7. 05 Oct, 2012 2 commits
    • sctp: check src addr when processing SACK to update transport state · edfee033
      Nicolas Dichtel committed
      Suppose we have an SCTP connection with two paths. After the connection is
      established, path1 becomes unavailable, so it is marked as inactive. Traffic
      then goes through path2, but for some reason packets are delayed (beyond
      rto.max). Because packets are delayed, the retransmit mechanism switches
      back to path1. At this point, we receive a delayed SACK from path2. When we
      update the state of the path in sctp_check_transmitted(), we do not take into
      account the source address of the SACK, hence we update the wrong path.
      Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Acked-by: Vlad Yasevich <vyasevich@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv4: add a fib_type to fib_info · f4ef85bb
      Eric Dumazet committed
      commit d2d68ba9 (ipv4: Cache input routes in fib_info nexthops.)
      introduced a regression for forwarding.
      
      This was hard to reproduce but the symptom was that packets were
      delivered to local host instead of being forwarded.
      
      David suggested adding fib_type to fib_info so that we don't
      inadvertently share the same fib_info for different purposes.
      
      With help from Julian Anastasov, who provided very helpful
      hints, reproduced here:
      
      <quote>
              Can it be a problem related to fib_info reuse
      from different routes. For example, when local IP address
      is created for subnet we have:
      
      broadcast 192.168.0.255 dev DEV  proto kernel  scope link  src 192.168.0.1
      192.168.0.0/24 dev DEV  proto kernel  scope link  src 192.168.0.1
      local 192.168.0.1 dev DEV  proto kernel  scope host  src 192.168.0.1
      
              The "dev DEV  proto kernel  scope link  src 192.168.0.1" is
      a reused fib_info structure where we put cached routes.
      The result can be same fib_info for 192.168.0.255 and
      192.168.0.0/24. RTN_BROADCAST is cached only for input
      routes. Incoming broadcast to 192.168.0.255 can be cached
      and can cause problems for traffic forwarded to 192.168.0.0/24.
      So, this patch should solve the problem because it
      separates the broadcast from unicast traffic.
      
              And the ip_route_input_slow caching will work for
      local and broadcast input routes (above routes 1 and 3) just
      because they differ in scope and use different fib_info.
      
      </quote>
      
      Many thanks to Chris Clayton for his patience and help.
      Reported-by: Chris Clayton <chris2553@googlemail.com>
      Bisected-by: Chris Clayton <chris2553@googlemail.com>
      Reported-by: Dave Jones <davej@redhat.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Julian Anastasov <ja@ssi.bg>
      Tested-by: Chris Clayton <chris2553@googlemail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  8. 02 Oct, 2012 2 commits
    • ipv4: gre: add GRO capability · 60769a5d
      Eric Dumazet committed
      Add GRO capability to IPv4 GRE tunnels, using the gro_cells
      infrastructure.
      
      Tested using IPv4 and IPv6 TCP traffic inside this tunnel, and
      checking GRO is building large packets.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: add gro_cells infrastructure · c9e6bc64
      Eric Dumazet committed
      This adds a new include file (include/net/gro_cells.h), to bring GRO
      (Generic Receive Offload) capability to tunnels, in a modular way.
      
      Because the tunnel receive path is lockless, and GRO adds a serialization
      using a napi_struct, I chose to add an array of up to
      DEFAULT_MAX_NUM_RSS_QUEUES cells, so that multi-queue devices won't be
      slowed down by the GRO layer.
      
      skb_get_rx_queue() is used as selector.
      
      In the future, we might add optional fanout capabilities, using rxhash
      for example.
      
      With help from Ben Hutchings, who reminded me of the
      netif_get_num_default_rss_queues() function.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Ben Hutchings <bhutchings@solarflare.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
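The per-queue cell selection described in this commit can be illustrated with a small sketch. It assumes a power-of-two cell count so the receive queue index can be masked; `MAX_CELLS`, `struct cells`, `cells_init` and `cells_select` are hypothetical names, and the rounding policy is an assumption rather than the kernel's exact sizing.

```c
#include <stddef.h>

#define MAX_CELLS 8 /* stands in for DEFAULT_MAX_NUM_RSS_QUEUES */

/* One cell per group of receive queues; in the kernel each cell would
 * carry its own napi_struct and packet queue. */
struct cell {
    unsigned long packets;
};

struct cells {
    struct cell cell[MAX_CELLS];
    unsigned int mask; /* cell count - 1; the count is a power of two */
};

/* Pick the largest power of two that fits both the queue count and
 * MAX_CELLS, so that "queue & mask" always yields a valid index. */
static void cells_init(struct cells *c, unsigned int nqueues)
{
    unsigned int n = 1;
    size_t i;

    while (n * 2 <= nqueues && n * 2 <= MAX_CELLS)
        n *= 2;
    c->mask = n - 1;
    for (i = 0; i < MAX_CELLS; i++)
        c->cell[i].packets = 0;
}

/* The receive queue index acts as the selector, so packets arriving on
 * different queues tend to land in different cells and do not all
 * serialize on a single one. */
static struct cell *cells_select(struct cells *c, unsigned int rx_queue)
{
    return &c->cell[rx_queue & c->mask];
}
```

A tunnel receive path would call `cells_select()` with the value of skb_get_rx_queue() and hand the packet to that cell's GRO context.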
  9. 28 Sep, 2012 7 commits
    • ipvs: API change to avoid rescan of IPv6 exthdr · d4383f04
      Jesper Dangaard Brouer committed
      Reduce the number of times we scan/skip the IPv6 exthdrs.
      
      This patch contains a lot of API changes.  This is done, to avoid
      repeating the scan of finding the IPv6 headers, via ipv6_find_hdr(),
      which is called by ip_vs_fill_iph_skb().
      
      Finding the IPv6 headers is done as early as possible, and passed on
      as a pointer "struct ip_vs_iphdr *" to the affected functions.
      
      This patch reduces/removes 19 calls to ip_vs_fill_iph_skb().
      
      Note that I have chosen not to change the API of the function
      pointer "(*schedule)" (in struct ip_vs_scheduler), as it can be
      used by external schedulers via {un,}register_ip_vs_scheduler.
      Only 4 out of 10 schedulers use info from ip_vs_iphdr*, and when
      they do, they are only interested in iph->{s,d}addr.
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Acked-by: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: Simon Horman <horms@verge.net.au>
    • ipvs: Complete IPv6 fragment handling for IPVS · 2f74713d
      Jesper Dangaard Brouer committed
      IPVS now supports fragmented packets, with support from nf_conntrack_reasm.c
      
      Based on patch from: Hans Schillstrom.
      
      IPVS does what conntrack does, i.e. it uses skb->nfct_reasm
      (when all fragments are collected, nf_ct_frag6_output()
      starts a "re-play" of all fragments into the interrupted
      PREROUTING chain at prio -399 (NF_IP6_PRI_CONNTRACK_DEFRAG+1)
      with nfct_reasm pointing to the assembled packet).
      
      Note that the nf_defrag_ipv6 module must be loaded for this to work.
      Unhandled fragments are reported, with a recommendation to load nf_defrag_ipv6.
      
      To handle fw-mark for fragments, add a new IPVS hook into the prerouting
      chain at prio -99 (NF_IP6_PRI_NAT_DST+1) to catch fragments, and copy
      fw-mark info from the first packet with an upper layer header.
      
      IPv6 fragment handling should be the last thing on the IPVS IPv6
      missing support list.
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Hans Schillstrom <hans@schillstrom.com>
      Acked-by: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: Simon Horman <horms@verge.net.au>
    • ipvs: Fix faulty IPv6 extension header handling in IPVS · 63dca2c0
      Jesper Dangaard Brouer committed
      IPv6 packets can contain extension headers, thus it is wrong to assume
      that the transport/upper-layer header starts right after the IPv6
      header (struct ipv6hdr).  IPVS makes this false assumption, and will
      write SNAT & DNAT modifications at a fixed position, which corrupts the
      message.
      
      To fix this, proper header position must be found before modifying
      packets.  Introducing ip_vs_fill_iph_skb(), which uses ipv6_find_hdr()
      to skip the exthdrs. It finds (1) the transport header offset, (2) the
      protocol, and (3) detects if the packet is a fragment.
      
      Note that fragments in IPv6 are represented via an exthdr.  Thus, they
      are detected while skipping through the exthdrs.
      
      This patch depends on commit 84018f55:
       "netfilter: ip6_tables: add flags parameter to ipv6_find_hdr()"
      This also adds a dependency to ip6_tables.
      
      Originally based on patch from: Hans Schillstrom
      
      kABI notes:
      Changing struct ip_vs_iphdr is a potential minor kABI breaker,
      because external modules can be compiled with another version of
      this struct.  This should not matter, as they would most likely
      be using a compiled-in version of ip_vs_fill_iphdr().  When
      recompiled, they will notice that ip_vs_fill_iphdr() no longer exists,
      and they have to use ip_vs_fill_iph_skb() instead.
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Acked-by: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: Simon Horman <horms@verge.net.au>
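To illustrate why a fixed offset is wrong, here is a simplified user-space walk over IPv6 extension headers. It is a sketch only, not ipv6_find_hdr() or ip_vs_fill_iph_skb(): `find_transport` and `struct hdr_info` are hypothetical names, and it handles only the hop-by-hop, routing, destination-options and fragment headers (AH and ESP are treated as the upper layer).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NEXTHDR_HOP      0
#define NEXTHDR_ROUTING  43
#define NEXTHDR_FRAGMENT 44
#define NEXTHDR_DEST     60
#define IPV6_HDR_LEN     40

struct hdr_info {
    size_t offset;    /* where the upper-layer header (or payload) starts */
    uint8_t protocol; /* last "next header" value seen */
    bool fragment;    /* a fragment header was present */
};

/* Returns 0 on success, -1 if the packet is truncated. */
static int find_transport(const uint8_t *pkt, size_t len, struct hdr_info *out)
{
    uint8_t next;
    size_t off = IPV6_HDR_LEN;

    if (len < IPV6_HDR_LEN)
        return -1;
    next = pkt[6]; /* Next Header field of the fixed IPv6 header */
    out->fragment = false;

    for (;;) {
        size_t hlen;

        switch (next) {
        case NEXTHDR_HOP:
        case NEXTHDR_ROUTING:
        case NEXTHDR_DEST:
            if (off + 2 > len)
                return -1;
            /* Length field counts 8-octet units beyond the first 8. */
            hlen = ((size_t)pkt[off + 1] + 1) * 8;
            break;
        case NEXTHDR_FRAGMENT:
            if (off + 8 > len)
                return -1;
            out->fragment = true;
            /* Non-first fragments carry no further headers to parse. */
            if ((((unsigned)pkt[off + 2] << 8) | pkt[off + 3]) & 0xfff8) {
                out->protocol = pkt[off];
                out->offset = off + 8;
                return 0;
            }
            hlen = 8;
            break;
        default:
            out->offset = off;
            out->protocol = next;
            return 0;
        }
        if (off + hlen > len)
            return -1;
        next = pkt[off];
        off += hlen;
    }
}
```

For a packet carrying a hop-by-hop header and a fragment header, the transport header sits at offset 56 rather than 40, which is the case a fixed-position rewrite would corrupt.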
    • ipvs: Use config macro IS_ENABLED() · a638e514
      Jesper Dangaard Brouer committed
      Cleanup patch.
      
      Use the IS_ENABLED macro, instead of having to check
      both the build and the module CONFIG_ option.
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Acked-by: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: Simon Horman <horms@verge.net.au>
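For readers unfamiliar with the macro, the following sketch shows one way such a check can be built with the preprocessor. The macros are prefixed `SK_` to make clear they are an illustration and not the kernel's own definitions; CONFIG_FOO, CONFIG_BAR and CONFIG_BAZ are hypothetical options.

```c
/* If the option expands to 1, SK_PLACEHOLDER_1 turns into "0," and shifts
 * a literal 1 into the second argument slot; otherwise the junk token
 * stays in the first slot and the second argument is 0. */
#define SK_PLACEHOLDER_1 0,
#define SK_SECOND(ignored, val, ...) val
#define SK_IS_DEFINED(x) SK_IS_DEFINED_(x)
#define SK_IS_DEFINED_(v) SK_IS_DEFINED__(SK_PLACEHOLDER_##v)
#define SK_IS_DEFINED__(arg_or_junk) SK_SECOND(arg_or_junk 1, 0, 0)

#define SK_IS_BUILTIN(opt) SK_IS_DEFINED(opt)
#define SK_IS_MODULE(opt) SK_IS_DEFINED(opt##_MODULE)
#define SK_IS_ENABLED(opt) (SK_IS_BUILTIN(opt) || SK_IS_MODULE(opt))

#define CONFIG_FOO 1        /* hypothetical option, built in */
#define CONFIG_BAR_MODULE 1 /* hypothetical option, built as a module */
/* CONFIG_BAZ is deliberately left undefined. */

/* Old style: #if defined(CONFIG_BAR) || defined(CONFIG_BAR_MODULE) */
#if SK_IS_ENABLED(CONFIG_BAR)
static int bar_supported(void) { return 1; }
#else
static int bar_supported(void) { return 0; }
#endif
```

A single `SK_IS_ENABLED(CONFIG_BAR)` replaces the paired built-in and module tests, and the result is also usable inside ordinary C expressions.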
    • ipvs: Trivial changes, use compressed IPv6 address in output · 120b9c14
      Jesper Dangaard Brouer committed
      Have not converted the proc file output to compressed IPv6 addresses.
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Acked-by: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: Simon Horman <horms@verge.net.au>
    • net: remove sk_init() helper · e2bcabec
      Eric Dumazet committed
      It seems sk_init() has no value today and even does strange things :
      
      # grep . /proc/sys/net/core/?mem_*
      /proc/sys/net/core/rmem_default:212992
      /proc/sys/net/core/rmem_max:131071
      /proc/sys/net/core/wmem_default:212992
      /proc/sys/net/core/wmem_max:131071
      
      We can remove it completely.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: Shan Wei <davidshan@tencent.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tunnel: drop packet if ECN present with not-ECT · eccc1bb8
      stephen hemminger committed
      Linux tunnels were written before RFC6040 and therefore never
      implemented the corner case of ECN getting set in the outer header
      and the inner header not being ready for it.
      
      Section 4.2.  Default Tunnel Egress Behaviour.
       o If the inner ECN field is Not-ECT, the decapsulator MUST NOT
            propagate any other ECN codepoint onwards.  This is because the
            inner Not-ECT marking is set by transports that rely on dropped
            packets as an indication of congestion and would not understand or
            respond to any other ECN codepoint [RFC4774].  Specifically:
      
            *  If the inner ECN field is Not-ECT and the outer ECN field is
               CE, the decapsulator MUST drop the packet.
      
            *  If the inner ECN field is Not-ECT and the outer ECN field is
               Not-ECT, ECT(0), or ECT(1), the decapsulator MUST forward the
               outgoing packet with the ECN field cleared to Not-ECT.
      
      This patch moves the ECN decap logic out of the individual tunnels
      into a common place.
      
      It also adds logging to allow detecting broken systems that
      set ECN bits incorrectly when tunneling (or an intermediate
      router might be changing the header).
      
      Overloads rx_frame_error to keep track of ECN related error.
      
      Thanks to Chris Wright who caught this while reviewing the new VXLAN
      tunnel.
      
      This code was tested by injecting faulty logic into the GRE code at the
      other end so that it sent incorrectly encapsulated packets.
      Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
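The egress rule quoted from RFC 6040 can be sketched as a small decision function. This is an illustration with hypothetical names (`ecn_decapsulate`, `enum decap_result`), not the kernel's common helper. It covers the two Not-ECT bullets quoted above plus the rule that an outer CE propagates to an ECN-capable inner header; the remaining ECT(0)/ECT(1) combinations are left untouched here.

```c
#include <stdint.h>

/* ECN codepoints: the two low-order bits of the TOS / traffic class byte. */
enum {
    ECN_NOT_ECT = 0,
    ECN_ECT_1   = 1,
    ECN_ECT_0   = 2,
    ECN_CE      = 3,
    ECN_MASK    = 3
};

enum decap_result { DECAP_OK, DECAP_DROP };

/* Decide what to do with a decapsulated packet, given the outer header's
 * TOS byte and a pointer to the inner header's TOS byte. */
static enum decap_result ecn_decapsulate(uint8_t outer_tos, uint8_t *inner_tos)
{
    uint8_t outer = outer_tos & ECN_MASK;
    uint8_t inner = *inner_tos & ECN_MASK;

    if (inner == ECN_NOT_ECT) {
        /* The transport does not understand ECN: congestion marked on
         * the outer header can only be signalled by dropping. */
        if (outer == ECN_CE)
            return DECAP_DROP;
        /* Any other outer codepoint is discarded; inner stays Not-ECT. */
        return DECAP_OK;
    }

    /* Inner is ECN-capable: carry an outer CE mark onwards. */
    if (outer == ECN_CE)
        *inner_tos |= ECN_CE;
    return DECAP_OK;
}
```

A tunnel receive path would call this once per packet and count or log the `DECAP_DROP` case, which is what allows broken encapsulators to be detected.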
  10. 25 Sep, 2012 13 commits
  11. 23 Sep, 2012 2 commits