1. 20 7月, 2012 2 次提交
    • E
      ipv4: tcp: remove per net tcp_sock · be9f4a44
      Eric Dumazet 提交于
      tcp_v4_send_reset() and tcp_v4_send_ack() use a single socket
      per network namespace.
      
      This leads to bad behavior on multiqueue NICS, because many cpus
      contend for the socket lock and once socket lock is acquired, extra
      false sharing on various socket fields slow down the operations.
      
      To better resist to attacks, we use a percpu socket. Each cpu can
      run without contention, using appropriate memory (local node)
      
      Additional features :
      
      1) We also mirror the queue_mapping of the incoming skb, so that
      answers use the same queue if possible.
      
      2) Setting SOCK_USE_WRITE_QUEUE socket flag speedup sock_wfree()
      
      3) We now limit the number of in-flight RST/ACK [1] packets
      per cpu, instead of per namespace, and we honor the sysctl_wmem_default
      limit dynamically. (Prior to this patch, sysctl_wmem_default value was
      copied at boot time, so any further change would not affect tcp_sock
      limit)
      
      [1] These packets are only generated when no socket was matched for
      the incoming packet.
      Reported-by: NBill Sommerfeld <wsommerfeld@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      be9f4a44
    • J
      ipv4: use seqlock for nh_exceptions · aee06da6
      Julian Anastasov 提交于
      Use global seqlock for the nh_exceptions. Call
      fnhe_oldest with the right hash chain. Correct the diff
      value for dst_set_expires.
      
      v2: after suggestions from Eric Dumazet:
      * get rid of spin lock fnhe_lock, rearrange update_or_create_fnhe
      * continue daddr search in rt_bind_exception
      
      v3:
      * remove the daddr check before seqlock in rt_bind_exception
      * restart lookup in rt_bind_exception on detected seqlock change,
      as suggested by David Miller
      Signed-off-by: NJulian Anastasov <ja@ssi.bg>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      aee06da6
  2. 19 7月, 2012 8 次提交
  3. 18 7月, 2012 4 次提交
  4. 17 7月, 2012 15 次提交
    • D
      ipv4: Add FIB nexthop exceptions. · 4895c771
      David S. Miller 提交于
      In a regime where we have subnetted route entries, we need a way to
      store persistent storage about destination specific learned values
      such as redirects and PMTU values.
      
      This is implemented here via nexthop exceptions.
      
      The initial implementation is a 2048 entry hash table with relaiming
      starting at chain length 5.  A more sophisticated scheme can be
      devised if that proves necessary.
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4895c771
    • E
      tcp: implement RFC 5961 4.2 · 0c24604b
      Eric Dumazet 提交于
      Implement the RFC 5691 mitigation against Blind
      Reset attack using SYN bit.
      
      Section 4.2 of RFC 5961 advises to send a Challenge ACK and drop
      incoming packet, instead of resetting the session.
      
      Add a new SNMP counter to count number of challenge acks sent
      in response to SYN packets.
      (netstat -s | grep TCPSYNChallenge)
      
      Remove obsolete TCPAbortOnSyn, since we no longer abort a TCP session
      because of a SYN flag.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Kiran Kumar Kella <kkiran@broadcom.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0c24604b
    • D
      net: Pass optional SKB and SK arguments to dst_ops->{update_pmtu,redirect}() · 6700c270
      David S. Miller 提交于
      This will be used so that we can compose a full flow key.
      
      Even though we have a route in this context, we need more.  In the
      future the routes will be without destination address, source address,
      etc. keying.  One ipv4 route will cover entire subnets, etc.
      
      In this environment we have to have a way to possess persistent storage
      for redirects and PMTU information.  This persistent storage will exist
      in the FIB tables, and that's why we'll need to be able to rebuild a
      full lookup flow key here.  Using that flow key will do a fib_lookup()
      and create/update the persistent entry.
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6700c270
    • E
      tcp: implement RFC 5961 3.2 · 282f23c6
      Eric Dumazet 提交于
      Implement the RFC 5691 mitigation against Blind
      Reset attack using RST bit.
      
      Idea is to validate incoming RST sequence,
      to match RCV.NXT value, instead of previouly accepted
      window : (RCV.NXT <= SEG.SEQ < RCV.NXT+RCV.WND)
      
      If sequence is in window but not an exact match, send
      a "challenge ACK", so that the other part can resend an
      RST with the appropriate sequence.
      
      Add a new sysctl, tcp_challenge_ack_limit, to limit
      number of challenge ACK sent per second.
      
      Add a new SNMP counter to count number of challenge acks sent.
      (netstat -s | grep TCPChallengeACK)
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Kiran Kumar Kella <kkiran@broadcom.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      282f23c6
    • L
      ipv6: fix unappropriate errno returned for non-multicast address · a858d64b
      Li Wei 提交于
      We need to check the passed in multicast address and return
      appropriate errno(EINVAL) if it is not valid. And it's no need
      to walk through the ipv6_mc_list in this situation.
      Signed-off-by: NLi Wei <lw@cn.fujitsu.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a858d64b
    • M
      irda: Fix typo in irda · ad8c9453
      Masanari Iida 提交于
      Correct spelling typo in irda.
      Signed-off-by: NMasanari Iida <standby24x7@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ad8c9453
    • I
      sctp: fix sparse warning for sctp_init_cause_fixed · db28aafa
      Ioan Orghici 提交于
      Fix the following sparse warning:
      	* symbol 'sctp_init_cause_fixed' was not declared. Should it be
      	  static?
      Signed-off-by: NIoan Orghici <ioanorghici@gmail.com>
      Acked-by: NVlad Yasevich <vyasevich@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      db28aafa
    • E
      netem: refine early skb orphaning · 5a308f40
      Eric Dumazet 提交于
      netem does an early orphaning of skbs. Doing so breaks TCP Small Queue
      or any mechanism relying on socket sk_wmem_alloc feedback.
      
      Ideally, we should perform this orphaning after the rate module and
      before the delay module, to mimic what happens on a real link :
      
      skb orphaning is indeed normally done at TX completion, before the
      transit on the link.
      
      +-------+   +--------+  +---------------+  +-----------------+
      + Qdisc +---> Device +--> TX completion +--> links / hops    +->
      +       +   +  xmit  +  + skb orphaning +  + propagation     +
      +-------+   +--------+  +---------------+  +-----------------+
            < rate limiting >                  < delay, drops, reorders >
      
      If netem is used without delay feature (drops, reorders, rate
      limiting), then we should avoid early skb orphaning, to keep pressure
      on sockets as long as packets are still in qdisc queue.
      
      Ideally, netem should be refactored to implement delay module
      as the last stage. Current algorithm merges the two phases
      (rate limiting + delay) so its not correct.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Hagen Paul Pfeifer <hagen@jauu.net>
      Cc: Mark Gordon <msg@google.com>
      Cc: Andreas Terzis <aterzis@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Acked-by: NStephen Hemminger <shemminger@vyatta.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5a308f40
    • T
      bridge: Fix enforcement of multicast hash_max limit · 036be6db
      Thomas Graf 提交于
      The hash size is doubled when it needs to grow and compared against
      hash_max. The >= comparison will limit the hash table size to half
      of what is expected i.e. the default 512 hash_max will not allow
      the hash table to grow larger than 256.
      
      Also print the hash table limit instead of the desirable size when
      the limit is reached.
      Signed-off-by: NThomas Graf <tgraf@suug.ch>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      036be6db
    • D
      ipv6: fix RTPROT_RA markup of RA routes w/nexthops · f0396f60
      Denis Ovsienko 提交于
      Userspace implementations of network routing protocols sometimes need to
      tell RA-originated IPv6 routes from other kernel routes to make proper
      routing decisions. This makes most sense for RA routes with nexthops,
      namely, default routes and Route Information routes.
      
      The intended mean of preserving RA route origin in a netlink message is
      through indicating RTPROT_RA as protocol code. Function rt6_fill_node()
      tried to do that for default routes, but its test condition was taken
      wrong. This change is modeled after the original mailing list posting
      by Jeff Haran. It fixes the test condition for default route case and
      sets the same behaviour for Route Information case (both types use
      nexthops). Handling of the 3rd RA route type, Prefix Information, is
      left unchanged, as it stands for interface connected routes (without
      nexthops).
      Signed-off-by: NDenis Ovsienko <infrastation@yandex.ru>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f0396f60
    • T
      6lowpan: Change byte order when storing/accessing to len field · 5e96855f
      Tony Cheneau 提交于
      Lenght field should be encoded using big endian byte order, such as intend in the specs.
      As it is currently written, the len field would not be decoded properly on an implementation using the correct byte ordering. Hence, it could lead to interroperability issues.
      
      Also, I rewrote the code so that iphc0 argument of lowpan_alloc_new_frame could be removed.
      Signed-off-by: NTony Cheneau <tony.cheneau@amnesiak.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5e96855f
    • T
      6lowpan: Change byte order when storing/accessing u16 tag · 4576039f
      Tony Cheneau 提交于
      The tag field should be stored and accessed using big endian byte order (as
      intended in the specs). Or else, when displayed with a trafic analyser, such a
      Wireshark, the field not properly displayed (e.g. 0x01 00 instead of 0x00 01,
      and so on).
      Signed-off-by: NTony Cheneau <tony.cheneau@amnesiak.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4576039f
    • T
      6lowpan: Fix null pointer dereference in UDP uncompression function · d4787a15
      Tony Cheneau 提交于
      When a UDP packet gets fragmented, a crash will occur at reassembly time.
      This is because skb->transport_header is not set during earlier period of fragment reassembly.
      As a consequence, call to udp_hdr() return NULL and uh (which is NULL) gets
      dereferenced without much test.
      Signed-off-by: NTony Cheneau <tony.cheneau@amnesiak.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d4787a15
    • A
      net: make sock diag per-namespace · 51d7cccf
      Andrey Vagin 提交于
      Before this patch sock_diag works for init_net only and dumps
      information about sockets from all namespaces.
      
      This patch expands sock_diag for all name-spaces.
      It creates a netlink kernel socket for each netns and filters
      data during dumping.
      
      v2: filter accoding with netns in all places
          remove an unused variable.
      
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
      Cc: James Morris <jmorris@namei.org>
      Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
      Cc: Patrick McHardy <kaber@trash.net>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      CC: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: linux-kernel@vger.kernel.org
      Cc: netdev@vger.kernel.org
      Signed-off-by: NAndrew Vagin <avagin@openvz.org>
      Acked-by: NPavel Emelyanov <xemul@parallels.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      51d7cccf
    • E
      tcp: add OFO snmp counters · a6df1ae9
      Eric Dumazet 提交于
      Add three SNMP TCP counters, to better track TCP behavior
      at global stage (netstat -s), when packets are received
      Out Of Order (OFO)
      
      TCPOFOQueue : Number of packets queued in OFO queue
      
      TCPOFODrop  : Number of packets meant to be queued in OFO
                    but dropped because socket rcvbuf limit hit.
      
      TCPOFOMerge : Number of packets in OFO that were merged with
                    other packets.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a6df1ae9
  5. 16 7月, 2012 3 次提交
  6. 14 7月, 2012 8 次提交