1. 26 Nov, 2008 2 commits
  2. 25 Nov, 2008 2 commits
  3. 24 Nov, 2008 2 commits
    • net: fix tunnels in netns after ndo_ changes · be77e593
      Authored by Alexey Dobriyan
      dev_net_set() should be the very first thing after alloc_netdev().
      
      The "ndo_" changes turned a simple assignment (which is OK to do before the
      netns assignment) into a quite non-trivial operation (which is not OK, because
      init_net was used). This leads to incomplete initialisation of the tunnel
      device in netns.
      
      BUG: unable to handle kernel NULL pointer dereference at 00000004
      IP: [<c02efdb5>] ip6_tnl_exit_net+0x37/0x4f
      *pde = 00000000 
      Oops: 0000 [#1] PREEMPT DEBUG_PAGEALLOC
      last sysfs file: /sys/class/net/lo/operstate
      
      Pid: 10, comm: netns Not tainted (2.6.28-rc6 #1) 
      EIP: 0060:[<c02efdb5>] EFLAGS: 00010246 CPU: 0
      EIP is at ip6_tnl_exit_net+0x37/0x4f
      EAX: 00000000 EBX: 00000020 ECX: 00000000 EDX: 00000003
      ESI: c5caef30 EDI: c782bbe8 EBP: c7909f50 ESP: c7909f48
       DS: 007b ES: 007b FS: 0000 GS: 0000 SS: 0068
      Process netns (pid: 10, ti=c7908000 task=c7905780 task.ti=c7908000)
      Stack:
       c03e75e0 c7390bc8 c7909f60 c0245448 c7390bd8 c7390bf0 c7909fa8 c012577a
       00000000 00000002 00000000 c0125736 c782bbe8 c7909f90 c0308fe3 c782bc04
       c7390bd4 c0245406 c084b718 c04f0770 c03ad785 c782bbe8 c782bc04 c782bc0c
      Call Trace:
       [<c0245448>] ? cleanup_net+0x42/0x82
       [<c012577a>] ? run_workqueue+0xd6/0x1ae
       [<c0125736>] ? run_workqueue+0x92/0x1ae
       [<c0308fe3>] ? schedule+0x275/0x285
       [<c0245406>] ? cleanup_net+0x0/0x82
       [<c0125ae1>] ? worker_thread+0x81/0x8d
       [<c0128344>] ? autoremove_wake_function+0x0/0x33
       [<c0125a60>] ? worker_thread+0x0/0x8d
       [<c012815c>] ? kthread+0x39/0x5e
       [<c0128123>] ? kthread+0x0/0x5e
       [<c0103b9f>] ? kernel_thread_helper+0x7/0x10
      Code: db e8 05 ff ff ff 89 c6 e8 dc 04 f6 ff eb 08 8b 40 04 e8 38 89 f5 ff 8b 44 9e 04 85 c0 75 f0 43 83 fb 20 75 f2 8b 86 84 00 00 00 <8b> 40 04 e8 1c 89 f5 ff e8 98 04 f6 ff 89 f0 e8 f8 63 e6 ff 5b 
      EIP: [<c02efdb5>] ip6_tnl_exit_net+0x37/0x4f SS:ESP 0068:c7909f48
      ---[ end trace 6c2f2328fccd3e0c ]---
      Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: Convert TCP/DCCP listening hash tables to use RCU · c25eb3bf
      Authored by Eric Dumazet
      This is the last step needed to perform full RCU lookups
      in __inet_lookup(): after the established/timewait tables, we
      add RCU lookups to the listening hash table.
      
      The only trick here is that a socket of a given type (TCP ipv4,
      TCP ipv6, ...) can now move between two different tables
      (established and listening) during an RCU grace period, so we
      must use different 'nulls' end-of-chain values for the two tables.
      
      We define a large value:
      
      #define LISTENING_NULLS_BASE (1U << 29)
      
      This guarantees that slots in the listening table have different
      end-of-chain values than slots in the established table. A reader can
      still detect whether it finished its lookup in the right chain.
      Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
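
The end-of-chain idea described in this commit message can be sketched in user-space C as follows. This is a minimal illustration, not the kernel code: the helper names and the two slot counts are hypothetical, and the encoding (low bit set, value in the remaining bits) is an assumption about how a 'nulls' marker is told apart from a real node pointer.

```c
#include <assert.h>
#include <stdint.h>

#define LISTENING_NULLS_BASE (1U << 29)  /* value quoted in the commit message */
#define ESTABLISHED_SLOTS 1024u          /* hypothetical table size */
#define LISTENING_SLOTS 32u              /* hypothetical table size */

/* Assumed encoding: a marker has its low bit set, so it cannot be mistaken
 * for an aligned node pointer; the other bits carry the value. */
static inline uintptr_t make_nulls(unsigned int value)
{
    return ((uintptr_t)value << 1) | 1u;
}

static inline int is_nulls(uintptr_t ptr)
{
    return (int)(ptr & 1u);
}

static inline unsigned int nulls_value(uintptr_t ptr)
{
    return (unsigned int)(ptr >> 1);
}

/* End-of-chain marker for a slot of the established table. */
uintptr_t established_end_marker(unsigned int slot)
{
    assert(slot < ESTABLISHED_SLOTS);
    return make_nulls(slot);
}

/* End-of-chain marker for a slot of the listening table: offset by the
 * base so it can never equal an established-table marker. */
uintptr_t listening_end_marker(unsigned int slot)
{
    assert(slot < LISTENING_SLOTS);
    return make_nulls(LISTENING_NULLS_BASE + slot);
}

/* Reader-side check: did the walk end in the chain it started in? */
int lookup_ended_in_expected_chain(uintptr_t end, unsigned int expected_value)
{
    return is_nulls(end) && nulls_value(end) == expected_value;
}
```

A reader that started in established slot 5 but ended on the listening marker for slot 5 would fail this check and know it has to restart its lookup.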
  4. 22 Nov, 2008 1 commit
  5. 21 Nov, 2008 3 commits
  6. 20 Nov, 2008 5 commits
  7. 17 Nov, 2008 2 commits
    • net: Convert TCP & DCCP hash tables to use RCU / hlist_nulls · 3ab5aee7
      Authored by Eric Dumazet
      RCU was added to UDP lookups, using a fast infrastructure:
      - the sockets kmem_cache uses SLAB_DESTROY_BY_RCU and doesn't pay the
        price of call_rcu() at freeing time.
      - hlist_nulls lets us use fewer memory barriers.
      
      This patch uses the same infrastructure for TCP/DCCP established
      and timewait sockets.
      
      Thanks to SLAB_DESTROY_BY_RCU, there is no slowdown for applications
      using short-lived TCP connections. A follow-up patch converting
      rwlocks to spinlocks will even speed up this case.
      
      __inet_lookup_established() is pretty fast now that we don't have to
      dirty a contended cache line (read_lock/read_unlock).
      
      Only the established and timewait hash tables are converted to RCU
      (the bind table and listen table still use traditional locking).
      Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • udp: Use hlist_nulls in UDP RCU code · 88ab1932
      Authored by Eric Dumazet
      This is a straightforward patch, using the hlist_nulls infrastructure.
      
      UDP was already converted to RCU two weeks ago.
      
      Using hlist_nulls lets us avoid some memory barriers, both
      at lookup time and at delete time.
      
      The patch is large because it adds new macros to include/net/sock.h.
      These macros will be used by TCP & DCCP in the next patch.
      Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  8. 13 Nov, 2008 1 commit
    • ipv6: routing header fixes · 6e093d9d
      Authored by Brian Haley
      This patch fixes two bugs:
      
      1. setsockopt() of anything but a Type 2 routing header should return
      EINVAL instead of EPERM.  Noticed by Shan Wei
      (shanwei@cn.fujitsu.com).
      
      2. setsockopt()/sendmsg() of a Type 2 routing header with invalid
      length or segments should return EINVAL.  These values are statically
      fixed in RFC 3775, unlike those of the variable-length Type 0 header.
      Signed-off-by: Brian Haley <brian.haley@hp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
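
The two checks described in this commit message might look roughly like the user-space sketch below. The struct and function names are hypothetical. The fixed values (a header length of 2 and one segment left) are my reading of RFC 3775 rather than something quoted from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define RTHDR_TYPE_2 2            /* Mobile IPv6 routing header type */
#define RTHDR_TYPE2_HDRLEN 2      /* assumed fixed length, in 8-octet units */
#define RTHDR_TYPE2_SEGMENTS 1    /* assumed fixed "segments left" value */

/* Hypothetical mirror of the first four octets of an IPv6 routing header. */
struct rthdr_sketch {
    unsigned char nexthdr;
    unsigned char hdrlen;
    unsigned char type;
    unsigned char segments_left;
};

/* Returns 0 if the header is acceptable, -EINVAL otherwise. */
int check_type2_rthdr(const struct rthdr_sketch *hdr)
{
    assert(hdr != NULL);

    /* Bug 1 in the message: any other type is "invalid", not "not permitted". */
    if (hdr->type != RTHDR_TYPE_2)
        return -EINVAL;

    /* Bug 2 in the message: length and segments are fixed for Type 2. */
    if (hdr->hdrlen != RTHDR_TYPE2_HDRLEN ||
        hdr->segments_left != RTHDR_TYPE2_SEGMENTS)
        return -EINVAL;

    return 0;
}
```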
  9. 12 Nov, 2008 1 commit
  10. 11 Nov, 2008 1 commit
  11. 06 Nov, 2008 1 commit
    • bonding: send IPv6 neighbor advertisement on failover · 305d552a
      Authored by Brian Haley
      This patch adds better IPv6 failover support for bonding devices,
      especially when in active-backup mode and there are only IPv6 addresses
      configured, as reported by Alex Sidorenko.
      
      - Creates a new file, drivers/net/bonding/bond_ipv6.c, for the
         IPv6-specific routines.  Both regular bonds and VLANs over bonds
         are supported.
      
      - Adds a new tunable, num_unsol_na, to limit the number of unsolicited
         IPv6 Neighbor Advertisements that are sent on a failover event.
         Default is 1.
      
      - Creates two new IPv6 neighbor discovery functions:
      
         ndisc_build_skb()
         ndisc_send_skb()
      
         These were required to support VLANs: we have to be able to
         add the VLAN id to the skb, and ndisc_send_na() and friends
         shouldn't be asked to do this.  These two routines are basically
         __ndisc_send() split into two pieces, in a slightly different order.
      
      - Updates Documentation/networking/bonding.txt and bumps the rev of bond
         support to 3.4.0.
      
      On failover, this new code will generate one packet:
      
      - An unsolicited IPv6 Neighbor Advertisement, which helps the switch
         learn that the address has moved to the new slave.
      
      Testing has shown that sending just the NA results in pretty good
      behavior in active-backup mode; I saw no lost ping packets, for example.
      Signed-off-by: Brian Haley <brian.haley@hp.com>
      Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
      Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
  12. 05 Nov, 2008 2 commits
    • ipv6: fix run pending DAD when interface becomes ready · e3ec6cfc
      Authored by Benjamin Thery
      With some net device types, an IPv6 address configured while the
      interface was down can stay 'tentative' forever, even after the interface
      is set up. In some cases, pending IPv6 DADs are not executed when the
      device becomes ready.
      
      I observed this while doing some tests with kvm. If I assign an IPv6 
      address to my interface eth0 (kvm driver rtl8139) when it is still down
      then the address is flagged tentative (IFA_F_TENTATIVE). Then, I set
      eth0 up, and to my surprise, the address stays 'tentative', no DAD is
      executed and the address can't be pinged.
      
      I also observed the same behaviour, without kvm, with the virtual interface
      types macvlan and veth.
      
      Some easy steps to reproduce the issue with macvlan:
      
      1. ip link add link eth0 type macvlan
      2. ip -6 addr add 2003::ab32/64 dev macvlan0
      3. ip addr show dev macvlan0
         ... 
         inet6 2003::ab32/64 scope global tentative
         ...
      4. ip link set macvlan0 up
      5. ip addr show dev macvlan0
         ...
         inet6 2003::ab32/64 scope global tentative
         ...
         Address is still tentative
      
      I think there's a bug in net/ipv6/addrconf.c, addrconf_notify():
      addrconf_dad_run() is not always run when the interface is flagged IF_READY.
      Currently it is only run when receiving a NETDEV_CHANGE event. It looks like
      some (virtual) devices don't send this event when coming up.
      
      For both NETDEV_UP and NETDEV_CHANGE events, when the interface becomes
      ready, run_pending should be set to 1. Patch below.
      
      'run_pending = 1' could be moved below the if/else block, but that makes
      the code less readable.
      Signed-off-by: Benjamin Thery <benjamin.thery@bull.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
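
The event handling proposed in this message can be sketched as a small user-space function. Everything here is illustrative: the enum, the function name and the `link_ok` parameter are hypothetical, and the "already configured" early return in the CHANGE branch is an assumption about the surrounding logic, not something stated in the message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the NETDEV_UP and NETDEV_CHANGE events. */
enum netdev_event_sketch { EVENT_UP, EVENT_CHANGE };

/* Returns true when pending DAD should be run for the interface.
 * *if_ready models the IF_READY flag and is updated in place. */
bool handle_link_event(enum netdev_event_sketch event, bool link_ok, bool *if_ready)
{
    bool run_pending = false;

    assert(if_ready != NULL);

    if (!link_ok)
        return false;               /* device is not ready yet */

    if (event == EVENT_UP) {
        *if_ready = true;
        run_pending = true;         /* the step the message says was missing */
    } else {
        if (*if_ready)
            return false;           /* assumed: already configured, nothing to do */
        *if_ready = true;
        run_pending = true;
    }

    return run_pending;
}
```

With this shape, a device that only ever reports the UP event still gets its tentative addresses processed.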
    • xfrm: Have af-specific init_tempsel() initialize family field of temporary selector · 79654a76
      Authored by Andreas Steffen
      While adding MIGRATE support to strongSwan, Andreas Steffen noticed that
      the selectors provided in XFRM_MSG_ACQUIRE have their family field
      uninitialized (those in MIGRATE do have their family set).
      
      Looking at the code, this is because the af-specific init_tempsel()
      functions (called via afinfo->init_tempsel() in xfrm_init_tempsel()) do
      not set the value.
      Reported-by: Andreas Steffen <andreas.steffen@strongswan.org>
      Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: Arnaud Ebalard <arno@natisbad.org>
  13. 04 Nov, 2008 1 commit
    • net: '&' redux · 6d9f239a
      Authored by Alexey Dobriyan
      I want to compile out proc_* and sysctl_* handlers totally and
      stub them to NULL depending on config options. However, using &
      prevents this, since taking the address of a NULL pointer breaks
      compilation.
      
      So, drop the & in front of every ->proc_handler and every ->strategy
      handler; it was never needed in fact.
      Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
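
The motivation can be illustrated with a small user-space sketch. The struct, the handler and the CONFIG_DEMO_HANDLERS switch are all hypothetical; only the idea of stubbing a handler to NULL comes from the message.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*handler_fn)(int value);

/* Hypothetical table entry with a handler callback. */
struct ctl_entry_sketch {
    const char *name;
    handler_fn handler;
};

#ifdef CONFIG_DEMO_HANDLERS            /* hypothetical config option */
static int demo_handler(int value)
{
    return value * 2;
}
#else
#define demo_handler NULL              /* handler compiled out */
#endif

/* Writing "&demo_handler" here would fail to compile in the stubbed case,
 * because "&NULL" is not a valid expression. A bare function name already
 * converts to a function pointer, so the & adds nothing. */
struct ctl_entry_sketch demo_entry = {
    .name = "demo",
    .handler = demo_handler,
};

/* Calls the handler if one is present; returns -1 when it is stubbed out. */
int call_handler(const struct ctl_entry_sketch *entry, int value)
{
    assert(entry != NULL);
    return entry->handler ? entry->handler(value) : -1;
}
```

Built without the config switch, `demo_entry.handler` is NULL and `call_handler()` returns -1; the same initializer compiles unchanged when the real function is present.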
  14. 03 Nov, 2008 2 commits
    • udp: Fix the SNMP counter of UDP_MIB_INERRORS · 0856f939
      Authored by Wei Yongjun
      UDP packets received in udpv6_recvmsg() are not only IPv6 UDP packets but
      also IPv4 UDP packets, so when updating the UDP_MIB_INERRORS counter in
      udpv6_recvmsg(), we should check whether the packet is an IPv6 UDP packet
      or an IPv4 UDP packet.
      Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • udp: Fix the SNMP counter of UDP_MIB_INDATAGRAMS · f26ba175
      Authored by Wei Yongjun
      If a UDP echo is sent to xinetd/echo-dgram, the UDP reply will be received
      at the sender. But the UDP_MIB_INDATAGRAMS SNMP counter will not be
      increased; UDP6_MIB_INDATAGRAMS will be increased instead.
      
        Endpoint A                      Endpoint B
        UDP Echo request ----------->
        (IPv4, Dst port=7)
                         <----------    UDP Echo Reply
                                        (IPv4, Src port=7)
      
      This bug was introduced by commit cb75994e.
      
      That commit does not count UDP[6]_MIB_INDATAGRAMS until udp[v6]_recvmsg().
      Because xinetd uses an IPv6 socket to receive UDP messages, when a UDP
      packet is received, UDP6_MIB_INDATAGRAMS is increased in
      udpv6_recvmsg() even if the packet is an IPv4 UDP packet.
      
      This patch fixes the problem.
      Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
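
Both counter fixes above come down to choosing the MIB by the packet's address family rather than by the socket's. A minimal user-space sketch, with hypothetical struct and function names standing in for the kernel's statistics macros:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one set of UDP SNMP counters. */
struct udp_mib_sketch {
    unsigned long in_datagrams;
    unsigned long in_errors;
};

/* Separate counters for IPv4 and IPv6, as UDP_MIB_* and UDP6_MIB_* imply. */
struct udp_stats_sketch {
    struct udp_mib_sketch udp4;
    struct udp_mib_sketch udp6;
};

/* Called from the receive path of an IPv6 socket, which may also be handed
 * IPv4 packets. The counter set follows the packet, not the socket. */
void count_recvmsg_result(struct udp_stats_sketch *stats, bool is_udp4, bool delivered)
{
    struct udp_mib_sketch *mib;

    assert(stats != NULL);

    mib = is_udp4 ? &stats->udp4 : &stats->udp6;
    if (delivered)
        mib->in_datagrams++;
    else
        mib->in_errors++;
}
```

In the xinetd echo scenario from the message, the IPv4 reply arriving on the IPv6 socket would then bump the IPv4 counters.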
  15. 02 Nov, 2008 2 commits
  16. 31 Oct, 2008 1 commit
  17. 30 Oct, 2008 3 commits
  18. 29 Oct, 2008 5 commits
    • udp: RCU handling for Unicast packets. · 271b72c7
      Authored by Eric Dumazet
      Goals are:
      
      1) Optimizing handling of incoming Unicast UDP frames, so that no memory
       writes happen in the fast path.
      
       Note: Multicasts and broadcasts will still need to take a lock,
       because doing a full lockless lookup in this case is difficult.
      
      2) No expensive operations in the socket bind/unhash phases:
        - No expensive synchronize_rcu() calls.
      
        - No added rcu_head in the socket structure, which would increase memory
        needs and, more importantly, force us to use call_rcu() calls,
        which have the bad property of making socket structures cold.
        (The RCU grace period between a socket's freeing and its potential reuse
         makes this socket cold in the CPU cache.)
        David did a previous patch using call_rcu() and noticed a 20%
        impact on TCP connection rates.
        Quoting Christoph Lameter:
         "Right. That results in cacheline cooldown. You'd want to recycle
          the object as they are cache hot on a per cpu basis. That is screwed
          up by the delayed regular rcu processing. We have seen multiple
          regressions due to cacheline cooldown.
          The only choice in cacheline hot sensitive areas is to deal with the
          complexity that comes with SLAB_DESTROY_BY_RCU or give up on RCU."
      
        - Because udp sockets are allocated from a dedicated kmem_cache,
        use of SLAB_DESTROY_BY_RCU can help here.
      
      Theory of operation:
      --------------------
      
      As the lookup is lock-free (using rcu_read_lock()/rcu_read_unlock()),
      special care must be taken by readers and writers.
      
      Use of SLAB_DESTROY_BY_RCU is tricky too, because a socket can be freed,
      reused, and inserted in a different chain, or in the worst case in the same
      chain, while readers are doing lookups at the same time.
      
      In order to avoid loops, a reader must check that each socket found in a chain
      really belongs to the chain the reader was traversing. If it finds a
      mismatch, the lookup must start again at the beginning. This *restart* loop
      is the reason we had to use rdlock for the multicast case, because
      we don't want to send the same message several times to the same socket.
      
      We use RCU only for the fast path.
      Thus, /proc/net/udp still takes spinlocks.
      Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
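
The reader-side check from the theory of operation can be sketched in single-threaded user-space C. This only models the control flow: there is no real RCU or memory ordering here, all names are hypothetical, and MAX_RESTARTS is an artificial bound added so the demo terminates (the message describes no such bound).

```c
#include <assert.h>
#include <stddef.h>

#define TABLE_SLOTS 128u   /* slot count borrowed from the udp_table commit */
#define MAX_RESTARTS 8u    /* hypothetical bound so this demo always terminates */

struct sock_sketch {
    unsigned short port;
    unsigned int slot;                 /* chain the socket is currently hashed in */
    const struct sock_sketch *next;
};

static unsigned int slot_of(unsigned short port)
{
    return port & (TABLE_SLOTS - 1u);
}

/* Walks the chain for `port`. If a socket that belongs to another chain is
 * met (it was freed and reused elsewhere), the walk restarts from the head. */
const struct sock_sketch *lookup_port(const struct sock_sketch *const *table,
                                      unsigned short port, unsigned int *restarts)
{
    unsigned int slot = slot_of(port);
    const struct sock_sketch *sk;

    assert(table != NULL && restarts != NULL);
begin:
    for (sk = table[slot]; sk != NULL; sk = sk->next) {
        if (sk->slot != slot) {
            /* We may have drifted into a foreign chain: start over. */
            if (++*restarts >= MAX_RESTARTS)
                return NULL;
            goto begin;
        }
        if (sk->port == port)
            return sk;
    }
    return NULL;
}
```

The restart is also why the message keeps a lock for multicast: a restarted walk could visit, and deliver to, the same socket twice.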
    • udp: introduce struct udp_table and multiple spinlocks · 645ca708
      Authored by Eric Dumazet
      UDP sockets are hashed in a 128-slot hash table.
      
      This hash table is protected by *one* rwlock.
      
      This rwlock is read-locked each time an incoming UDP message is handled.
      
      This rwlock is write-locked each time a socket must be inserted in the
      hash table (bind time), or deleted from this table (close time).
      
      This is not scalable on SMP machines:
      
      1) Even in read mode, lock() and unlock() are atomic operations and
       must dirty a contended cache line, shared by all cpus.
      
      2) A writer might be starved if many readers are 'in flight'. This can
       happen on a machine with some NIC receiving many UDP messages. A user
       process can be delayed a long time at socket creation/dismantle time.
      
      This patch prepares the RCU migration by introducing 'struct udp_table'
      and 'struct udp_hslot', and using one spinlock per chain, to reduce
      contention on the central rwlock.
      
      Introducing one spinlock per chain reduces latencies for port
      randomization on heavily loaded UDP servers. This also speeds up
      bindings to specific ports.
      
      udp_lib_unhash() was uninlined, as it was becoming too big.
      
      Some cleanups were done to ease review of the following patch
      (RCUification of UDP Unicast lookups).
      Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
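
The per-chain locking layout described in this message can be approximated in user-space C11. This is only a sketch: the struct and function names are hypothetical, and a C11 atomic_flag spin loop stands in for the kernel spinlock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define TABLE_SLOTS 128u    /* slot count quoted in the message */

struct udp_sock_sketch {
    unsigned short port;
    struct udp_sock_sketch *next;
};

/* One chain head plus its own lock, instead of one lock for the whole table. */
struct hslot_sketch {
    struct udp_sock_sketch *head;
    atomic_flag lock;
};

struct table_sketch {
    struct hslot_sketch hash[TABLE_SLOTS];
};

static struct hslot_sketch *slot_for(struct table_sketch *t, unsigned short port)
{
    return &t->hash[port & (TABLE_SLOTS - 1u)];
}

static void slot_lock(struct hslot_sketch *s)
{
    while (atomic_flag_test_and_set_explicit(&s->lock, memory_order_acquire))
        ;   /* spin until the flag was previously clear */
}

static void slot_unlock(struct hslot_sketch *s)
{
    atomic_flag_clear_explicit(&s->lock, memory_order_release);
}

void table_init(struct table_sketch *t)
{
    assert(t != NULL);
    for (size_t i = 0; i < TABLE_SLOTS; i++) {
        t->hash[i].head = NULL;
        atomic_flag_clear(&t->hash[i].lock);
    }
}

/* Bind time: only the target chain is locked. */
void table_insert(struct table_sketch *t, struct udp_sock_sketch *sk)
{
    struct hslot_sketch *s;

    assert(t != NULL && sk != NULL);
    s = slot_for(t, sk->port);
    slot_lock(s);
    sk->next = s->head;
    s->head = sk;
    slot_unlock(s);
}

/* Close time: returns 1 if the socket was found and unlinked, else 0. */
int table_remove(struct table_sketch *t, struct udp_sock_sketch *sk)
{
    struct hslot_sketch *s;
    struct udp_sock_sketch **pp;
    int found = 0;

    assert(t != NULL && sk != NULL);
    s = slot_for(t, sk->port);
    slot_lock(s);
    for (pp = &s->head; *pp != NULL; pp = &(*pp)->next) {
        if (*pp == sk) {
            *pp = sk->next;
            sk->next = NULL;
            found = 1;
            break;
        }
    }
    slot_unlock(s);
    return found;
}
```

Two binds that hash to different slots never touch the same lock, which is the contention reduction the message is after.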
    • 0c6ce78a
    • net: replace all current users of NIP6_SEQFMT with %#p6 · b071195d
      Authored by Harvey Harrison
      The define in kernel.h can be done away with at a later time.
      Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: reduce structures when XFRM=n · def8b4fa
      Authored by Alexey Dobriyan
      ifdef out
      * struct sk_buff::sp		(pointer)
      * struct dst_entry::xfrm	(pointer)
      * struct sock::sk_policy	(2 pointers)
      Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  19. 20 Oct, 2008 1 commit
  20. 17 Oct, 2008 1 commit
  21. 16 Oct, 2008 1 commit
    • IPV6: Fix default gateway criteria wrt. HIGH/LOW preference radv option · 22441cfa
      Authored by Pedro Ribeiro
      Problem observed:
                     In IPv6, when multiple routers on one segment are candidates
                     for default gateway and each sends a different preference
                     value, the Linux hosts connected to the segment weren't
                     selecting the right one in all possible combinations of
                     LOW/MEDIUM/HIGH preference.
      
      This patch changes two files:
      include/linux/icmpv6.h
                     Puts the "router_pref" bitfield in the right place
                     (as RFC 4191 says) and names the bit left over by this fix
                     "home_agent" (RFC 3775 says that is its function).
      
      net/ipv6/ndisc.c
                     Corrects the binary logic behind the updating of the
                     router preference in the flags of the routing table.
      
      Result:
                     With these two fixes applied, the default route used by
                     the system was consistent with the rules in RFC 4191
                     when the preference value in router advertisements
                     changes.
      Signed-off-by: Pedro Ribeiro <pribeiro@net.ipl.pt>
      Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
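
The bit placement discussed in this message can be illustrated with a decoder for the Router Advertisement flags octet. This is a sketch based on my reading of RFC 4191 and RFC 3775 (M, O, H, a 2-bit preference, then reserved bits); the struct and function names are hypothetical and do not mirror the kernel's bitfield definition.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RA_FLAG_MANAGED    0x80u
#define RA_FLAG_OTHER      0x40u
#define RA_FLAG_HOME_AGENT 0x20u   /* the bit the message names "home_agent" */
#define RA_PREF_SHIFT      3
#define RA_PREF_MASK       0x3u

struct ra_flags_sketch {
    bool managed;
    bool other;
    bool home_agent;
    int router_pref;    /* +1 = HIGH, 0 = MEDIUM, -1 = LOW */
};

void decode_ra_flags(uint8_t flags, struct ra_flags_sketch *out)
{
    unsigned int prf;

    assert(out != NULL);

    out->managed = (flags & RA_FLAG_MANAGED) != 0;
    out->other = (flags & RA_FLAG_OTHER) != 0;
    out->home_agent = (flags & RA_FLAG_HOME_AGENT) != 0;

    /* The preference is a 2-bit signed value: 01 high, 00 medium, 11 low. */
    prf = (flags >> RA_PREF_SHIFT) & RA_PREF_MASK;
    switch (prf) {
    case 0x1:
        out->router_pref = 1;
        break;
    case 0x3:
        out->router_pref = -1;
        break;
    default:
        /* 00 is medium; the reserved value 10 is treated as medium too. */
        out->router_pref = 0;
        break;
    }
}
```

If the preference bits were read from the wrong position, HIGH and LOW advertisements could be mistaken for each other or for the home-agent bit, which matches the symptom described above.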