1. 30 October 2008, 3 commits
  2. 29 October 2008, 5 commits
    • udp: RCU handling for Unicast packets. · 271b72c7
      Committed by Eric Dumazet
      Goals are:
      
      1) Optimizing handling of incoming Unicast UDP frames, so that no memory
       writes should happen in the fast path.
      
       Note: multicasts and broadcasts will still need to take a lock,
       because doing a full lockless lookup in this case is difficult.
      
      2) No expensive operations in the socket bind/unhash phases:
        - No expensive synchronize_rcu() calls.
      
        - No added rcu_head in the socket structure, which would increase
        memory needs and, more importantly, force us to use call_rcu(),
        which has the bad property of making socket structures cold.
        (The RCU grace period between a socket being freed and its
         potential reuse leaves the socket cold in the CPU cache.)
        David did a previous patch using call_rcu() and noticed a 20%
        impact on TCP connection rates.
        Quoting Christoph Lameter:
         "Right. That results in cacheline cooldown. You'd want to recycle
          the object as they are cache hot on a per cpu basis. That is screwed
          up by the delayed regular rcu processing. We have seen multiple
          regressions due to cacheline cooldown.
          The only choice in cacheline hot sensitive areas is to deal with the
          complexity that comes with SLAB_DESTROY_BY_RCU or give up on RCU."
      
        - Because UDP sockets are allocated from a dedicated kmem_cache,
        use of SLAB_DESTROY_BY_RCU can help here.
      
      Theory of operation:
      ---------------------
      
      As the lookup is lockfree (using rcu_read_lock()/rcu_read_unlock()),
      special care must be taken by readers and writers.
      
      Use of SLAB_DESTROY_BY_RCU is tricky too, because a socket can be
      freed, reused, and inserted into a different chain or, in the worst
      case, into the same chain while readers are doing lookups at the
      same time.
      
      In order to avoid loops, a reader must check that each socket found
      in a chain really belongs to the chain the reader was traversing. If
      it finds a mismatch, the lookup must start again at the beginning.
      This *restart* loop is the reason we had to use rdlock for the
      multicast case, because we don't want to send the same message
      several times to the same socket.
      
      We use RCU only for the fast path;
      thus, /proc/net/udp still takes spinlocks.
      Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
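      To make the restart rule concrete, here is a minimal sketch of the
      lookup pattern described above. It is illustrative only: the helper
      name is invented, and the real code must also re-check ports and
      addresses, not just the hash, after taking the reference.

      /* Lockless lookup over a SLAB_DESTROY_BY_RCU cache: a socket found
       * in the chain may have been freed and reused concurrently, so
       * re-validate after grabbing a reference and restart on mismatch. */
      static struct sock *udp_lookup_sketch(struct udp_hslot *hslot,
                                            unsigned int hash)
      {
              struct sock *sk;
              struct hlist_node *node;

      begin:
              rcu_read_lock();
              hlist_for_each_entry_rcu(sk, node, &hslot->head, sk_node) {
                      if (sk->sk_hash != hash)
                              continue;
                      /* The object may be freed under us: only take a
                       * reference if the refcount is still non-zero. */
                      if (unlikely(!atomic_inc_not_zero(&sk->sk_refcnt))) {
                              rcu_read_unlock();
                              goto begin;     /* freed: restart lookup */
                      }
                      /* Re-check after getting the reference: the slab may
                       * have recycled this socket into another chain. */
                      if (unlikely(sk->sk_hash != hash)) {
                              sock_put(sk);
                              rcu_read_unlock();
                              goto begin;     /* moved chain: restart */
                      }
                      rcu_read_unlock();
                      return sk;
              }
              rcu_read_unlock();
              return NULL;
      }

      The restart is also why multicast keeps its lock: restarting a
      multicast walk could deliver the same message to a socket twice.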
    • udp: introduce struct udp_table and multiple spinlocks · 645ca708
      Committed by Eric Dumazet
      UDP sockets are hashed into a 128-slot hash table.
      
      This hash table is protected by *one* rwlock.
      
      This rwlock is readlocked each time an incoming UDP message is handled.
      
      This rwlock is writelocked each time a socket must be inserted into
      the hash table (bind time) or deleted from it (close time).
      
      This is not scalable on SMP machines:
      
      1) Even in read mode, lock() and unlock() are atomic operations and
       must dirty a contended cache line, shared by all cpus.
      
      2) A writer might be starved if many readers are 'in flight'. This can
       happen on a machine with a NIC receiving many UDP messages. User
       processes can be delayed a long time at socket creation/dismantle time.
      
      This patch prepares the RCU migration by introducing 'struct udp_table'
      and 'struct udp_hslot', and using one spinlock per chain to reduce
      contention on the central rwlock.
      
      Introducing one spinlock per chain reduces latency for port
      randomization on heavily loaded UDP servers. It also speeds up
      binding to specific ports.
      
      udp_lib_unhash() was uninlined, as it had become too big.
      
      Some cleanups were done to ease review of the following patch
      (RCUification of UDP unicast lookups).
      Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
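      The data layout this patch introduces looks roughly like the sketch
      below. It is based on the names in the message; the insert helper
      and field details are illustrative assumptions.

      struct udp_hslot {                      /* one hash chain */
              struct hlist_head head;         /* sockets in this slot */
              spinlock_t        lock;         /* protects only this chain */
      };

      #define UDP_HTABLE_SIZE 128             /* same 128 slots as before */

      struct udp_table {
              struct udp_hslot hash[UDP_HTABLE_SIZE];
      };

      /* A writer now dirties one per-chain lock instead of the global
       * rwlock, so binds/closes on different chains no longer contend. */
      static void udp_insert_sketch(struct udp_table *table,
                                    struct sock *sk, unsigned int hash)
      {
              struct udp_hslot *hslot = &table->hash[hash & (UDP_HTABLE_SIZE - 1)];

              spin_lock(&hslot->lock);
              sk_add_node(sk, &hslot->head);
              spin_unlock(&hslot->lock);
      }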
    • 0c6ce78a
    • net: replace all current users of NIP6_SEQFMT with %#p6 · b071195d
      Committed by Harvey Harrison
      The define in kernel.h can be done away with at a later time.
      Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
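      A hedged example of the conversion (assuming, per the title, that
      the "%#p6" extension takes a pointer to a struct in6_addr and prints
      the NIP6_SEQFMT-style form; this call site is invented):

              /* before */
              seq_printf(seq, NIP6_SEQFMT, NIP6(np->daddr));
              /* after */
              seq_printf(seq, "%#p6", &np->daddr);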
    • net: reduce structures when XFRM=n · def8b4fa
      Committed by Alexey Dobriyan
      ifdef out
      * struct sk_buff::sp		(pointer)
      * struct dst_entry::xfrm	(pointer)
      * struct sock::sk_policy	(2 pointers)
      Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
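      The technique in miniature (a compile-time sketch; the real fields
      live in struct sk_buff, struct dst_entry and struct sock):

              struct sec_path;                /* opaque here */

              struct example_struct {
                      unsigned int len;
              #ifdef CONFIG_XFRM
                      struct sec_path *sp;    /* compiled out when XFRM=n */
              #endif
              };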
  3. 20 October 2008, 1 commit
  4. 17 October 2008, 1 commit
  5. 16 October 2008, 1 commit
    • IPV6: Fix default gateway criteria wrt. HIGH/LOW preference radv option · 22441cfa
      Committed by Pedro Ribeiro
      Problem observed:
                     In IPv6, in the presence of multiple router candidates
                     for the default gateway on one segment, each sending a
                     different preference value, Linux hosts connected to the
                     segment weren't selecting the right one in all possible
                     combinations of LOW/MEDIUM/HIGH preference.
      
      This patch changes two files:
      include/linux/icmpv6.h
                     Put the "router_pref" bitfield in the right place
                     (as RFC 4191 specifies), and name the bit freed by
                     this fix "home_agent" (RFC 3775 says that is its
                     function)
      
      net/ipv6/ndisc.c
                     Correct the boolean logic behind updating the router
                     preference in the routing table flags
      
      Result:
                     With these two fixes applied, the default route used
                     by the system is consistent with the rules in RFC 4191
                     when the preference value in router advertisements
                     changes.
      Signed-off-by: Pedro Ribeiro <pribeiro@net.ipl.pt>
      Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
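      A sketch of the corrected flags byte, following the usual kernel
      endian-specific bitfield pattern (field placement per RFC 4191 and
      RFC 3775; the exact declaration in include/linux/icmpv6.h may
      differ):

              struct ra_flags_sketch {
              #if defined(__LITTLE_ENDIAN_BITFIELD)
                      __u8    reserved:3,
                              router_pref:2,  /* RFC 4191 Prf: HIGH/MEDIUM/LOW */
                              home_agent:1,   /* RFC 3775 Home Agent (H) bit */
                              other:1,        /* Other configuration (O) */
                              managed:1;      /* Managed address config (M) */
              #elif defined(__BIG_ENDIAN_BITFIELD)
                      __u8    managed:1,
                              other:1,
                              home_agent:1,
                              router_pref:2,
                              reserved:3;
              #endif
              };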
  6. 15 October 2008, 1 commit
  7. 14 October 2008, 1 commit
  8. 10 October 2008, 5 commits
    • tcpv6: fix error with CONFIG_TCP_MD5SIG disabled · fa3e5b4e
      Committed by Guo-Fu Tseng
      This patch fixes an error when CONFIG_TCP_MD5SIG is disabled.
      Signed-off-by: Guo-Fu Tseng <cooldavid@cooldavid.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcpv6: combine tcp_v6_send_(reset|ack) · 626e264d
      Committed by Ilpo Järvinen
      $ codiff tcp_ipv6.o.old tcp_ipv6.o.new
      net/ipv6/tcp_ipv6.c:
        tcp_v6_md5_hash_hdr | -144
        tcp_v6_send_ack     | -585
        tcp_v6_send_reset   | -540
       3 functions changed, 1269 bytes removed, diff: -1269
      
      net/ipv6/tcp_ipv6.c:
        tcp_v6_send_response | +791
       1 function changed, 791 bytes added, diff: +791
      
      tcp_ipv6.o.new:
       4 functions changed, 791 bytes added, 1269 bytes removed, diff: -478
      
      I chose to leave the reset-related netns comment in place (not the
      one that is killed), as I cannot understand its English, so it's a
      bit hard for me to evaluate its usefulness :-).
      Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Signed-off-by: David S. Miller <davem@davemloft.net>
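      The shape of the result, as a sketch (the tcp_v6_send_response
      parameter list is assumed from the diff in the next entry; the
      wrapper body is illustrative):

              static void tcp_v6_send_response(struct sk_buff *skb, u32 seq,
                                               u32 ack, u32 win, u32 ts,
                                               struct tcp_md5sig_key *key,
                                               int rst);

              static void tcp_v6_send_ack(struct sk_buff *skb, u32 seq, u32 ack,
                                          u32 win, u32 ts,
                                          struct tcp_md5sig_key *key)
              {
                      tcp_v6_send_response(skb, seq, ack, win, ts, key, 0);
              }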
    • tcpv6: convert opt[] -> topt in tcp_v6_send_reset · 81ada62d
      Committed by Ilpo Järvinen
      after this I get:
      
      $ diff-funcs tcp_v6_send_reset tcp_ipv6.c tcp_ipv6.c tcp_v6_send_ack
       --- tcp_ipv6.c:tcp_v6_send_reset()
       +++ tcp_ipv6.c:tcp_v6_send_ack()
      @@ -1,4 +1,5 @@
      -static void tcp_v6_send_reset(struct sock *sk, struct sk_buff *skb)
      +static void tcp_v6_send_ack(struct sk_buff *skb, u32 seq, u32 ack, u32 win, u32 ts,
      +                           struct tcp_md5sig_key *key)
       {
              struct tcphdr *th = tcp_hdr(skb), *t1;
              struct sk_buff *buff;
      @@ -7,31 +8,14 @@
              struct sock *ctl_sk = net->ipv6.tcp_sk;
              unsigned int tot_len = sizeof(struct tcphdr);
              __be32 *topt;
      -#ifdef CONFIG_TCP_MD5SIG
      -       struct tcp_md5sig_key *key;
      -#endif
      -
      -       if (th->rst)
      -               return;
      -
      -       if (!ipv6_unicast_destination(skb))
      -               return;
      
      +       if (ts)
      +               tot_len += TCPOLEN_TSTAMP_ALIGNED;
       #ifdef CONFIG_TCP_MD5SIG
      -       if (sk)
      -               key = tcp_v6_md5_do_lookup(sk, &ipv6_hdr(skb)->daddr);
      -       else
      -               key = NULL;
      -
              if (key)
                      tot_len += TCPOLEN_MD5SIG_ALIGNED;
       #endif
      
      -       /*
      -        * We need to grab some memory, and put together an RST,
      -        * and then put it into the queue to be sent.
      -        */
      -
              buff = alloc_skb(MAX_HEADER + sizeof(struct ipv6hdr) + tot_len,
                               GFP_ATOMIC);
              if (buff == NULL)
      @@ -46,18 +30,20 @@
              t1->dest = th->source;
              t1->source = th->dest;
              t1->doff = tot_len / 4;
      -       t1->rst = 1;
      -
      -       if(th->ack) {
      -               t1->seq = th->ack_seq;
      -       } else {
      -               t1->ack = 1;
      -               t1->ack_seq = htonl(ntohl(th->seq) + th->syn + th->fin
      -                                   + skb->len - (th->doff<<2));
      -       }
      +       t1->seq = htonl(seq);
      +       t1->ack_seq = htonl(ack);
      +       t1->ack = 1;
      +       t1->window = htons(win);
      
              topt = (__be32 *)(t1 + 1);
      
      +       if (ts) {
      +               *topt++ = htonl((TCPOPT_NOP << 24) | (TCPOPT_NOP << 16) |
      +                               (TCPOPT_TIMESTAMP << 8) | TCPOLEN_TIMESTAMP);
      +               *topt++ = htonl(tcp_time_stamp);
      +               *topt++ = htonl(ts);
      +       }
      +
       #ifdef CONFIG_TCP_MD5SIG
              if (key) {
                      *topt++ = htonl((TCPOPT_NOP << 24) | (TCPOPT_NOP << 16) |
      @@ -84,15 +70,10 @@
              fl.fl_ip_sport = t1->source;
              security_skb_classify_flow(skb, &fl);
      
      -       /* Pass a socket to ip6_dst_lookup either it is for RST
      -        * Underlying function will use this to retrieve the network
      -        * namespace
      -        */
              if (!ip6_dst_lookup(ctl_sk, &buff->dst, &fl)) {
                      if (xfrm_lookup(&buff->dst, &fl, NULL, 0) >= 0) {
                              ip6_xmit(ctl_sk, buff, &fl, NULL, 0);
                              TCP_INC_STATS_BH(net, TCP_MIB_OUTSEGS);
      -                       TCP_INC_STATS_BH(net, TCP_MIB_OUTRSTS);
                              return;
                      }
              }
      
      
      ...which starts to be trivial to combine.
      Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcpv[46]: fix md5 pseudoheader address field ordering · 78e645cb
      Committed by Ilpo Järvinen
      Maybe it's just me, but I guess those md5 people made a mess of it
      by having *_md5_hash_* use daddr, saddr order instead of the natural
      one (equal to what the csum functions use). For the segment we're
      sending, the original addresses are reversed, so buff's saddr ==
      skb's daddr and vice versa.
      
      Maybe I can finally proceed with unification of some code
      after fixing it first... :-)
      Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Signed-off-by: David S. Miller <davem@davemloft.net>
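      An illustration of the ordering pitfall (hypothetical call sites;
      only the argument order matters): csum helpers take (saddr, daddr),
      the md5 hash helpers take (daddr, saddr), and a reply built from an
      incoming skb swaps the two roles:

              /* checksum helpers: "natural" (saddr, daddr) order */
              csum = csum_ipv6_magic(saddr, daddr, len, IPPROTO_TCP, csum);

              /* md5 hash helpers: reversed (daddr, saddr) order. For the
               * reply, saddr is the incoming skb's daddr and vice versa: */
              tcp_v6_md5_hash_hdr(md5_hash, key,
                                  &ipv6_hdr(skb)->saddr, /* reply's daddr */
                                  &ipv6_hdr(skb)->daddr, /* reply's saddr */
                                  t1);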
  9. 09 October 2008, 14 commits
  10. 08 October 2008, 8 commits