1. 29 4月, 2011 1 次提交
    • E
      inet: add RCU protection to inet->opt · f6d8bd05
      Eric Dumazet 提交于
      We lack proper synchronization to manipulate inet->opt ip_options
      
      Problem is ip_make_skb() calls ip_setup_cork() and
      ip_setup_cork() possibly makes a copy of ipc->opt (struct ip_options),
      without any protection against another thread manipulating inet->opt.
      
      Another thread can change inet->opt pointer and free old one under us.
      
      Use RCU to protect inet->opt (changed to inet->inet_opt).
      
      Instead of handling atomic refcounts, just copy ip_options when
      necessary, to avoid cache line dirtying.
      
      We cant insert an rcu_head in struct ip_options since its included in
      skb->cb[], so this patch is large because I had to introduce a new
      ip_options_rcu structure.
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f6d8bd05
  2. 28 4月, 2011 2 次提交
    • D
      ipv4: Remove erroneous check in igmpv3_newpack() and igmp_send_report(). · 2e97e980
      David S. Miller 提交于
      Output route resolution never returns a route with rt_src set to zero
      (which is INADDR_ANY).
      
      Even if the flow key for the output route lookup specifies INADDR_ANY
      for the source address, the output route resolution chooses a real
      source address to use in the final route.
      
      This test has existed forever in igmp_send_report() and David Stevens
      simply copied over the erroneous test when implementing support for
      IGMPv3.
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Reviewed-by: NEric Dumazet <eric.dumazet@gmail.com>
      2e97e980
    • D
      ipv4: Sanitize and simplify ip_route_{connect,newports}() · 2d7192d6
      David S. Miller 提交于
      These functions are used together as a unit for route resolution
      during connect().  They address the chicken-and-egg problem that
      exists when ports need to be allocated during connect() processing,
      yet such port allocations require addressing information from the
      routing code.
      
      It's currently more heavy handed than it needs to be, and in
      particular we allocate and initialize a flow object twice.
      
      Let the callers provide the on-stack flow object.  That way we only
      need to initialize it once in the ip_route_connect() call.
      
      Later, if ip_route_newports() needs to do anything, it re-uses that
      flow object as-is except for the ports which it updates before the
      route re-lookup.
      
      Also, describe why this set of facilities are needed and how it works
      in a big comment.
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Reviewed-by: NEric Dumazet <eric.dumazet@gmail.com>
      2d7192d6
  3. 26 4月, 2011 1 次提交
    • H
      net: provide cow_metrics() methods to blackhole dst_ops · 0972ddb2
      Held Bernhard 提交于
      Since commit 62fa8a84 (net: Implement read-only protection and COW'ing
      of metrics.) the kernel throws an oops.
      
      [  101.620985] BUG: unable to handle kernel NULL pointer dereference at
                 (null)
      [  101.621050] IP: [<          (null)>]           (null)
      [  101.621084] PGD 6e53c067 PUD 3dd6a067 PMD 0
      [  101.621122] Oops: 0010 [#1] SMP
      [  101.621153] last sysfs file: /sys/devices/virtual/ppp/ppp/uevent
      [  101.621192] CPU 2
      [  101.621206] Modules linked in: l2tp_ppp pppox ppp_generic slhc
      l2tp_netlink l2tp_core deflate zlib_deflate twofish_x86_64
      twofish_common des_generic cbc ecb sha1_generic hmac af_key
      iptable_filter snd_pcm_oss snd_mixer_oss snd_seq snd_seq_device loop
      snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_intel snd_hda_codec
      snd_pcm snd_timer snd i2c_i801 iTCO_wdt psmouse soundcore snd_page_alloc
      evdev uhci_hcd ehci_hcd thermal
      [  101.621552]
      [  101.621567] Pid: 5129, comm: openl2tpd Not tainted 2.6.39-rc4-Quad #3
      Gigabyte Technology Co., Ltd. G33-DS3R/G33-DS3R
      [  101.621637] RIP: 0010:[<0000000000000000>]  [<          (null)>]   (null)
      [  101.621684] RSP: 0018:ffff88003ddeba60  EFLAGS: 00010202
      [  101.621716] RAX: ffff88003ddb5600 RBX: ffff88003ddb5600 RCX:
      0000000000000020
      [  101.621758] RDX: ffffffff81a69a00 RSI: ffffffff81b7ee61 RDI:
      ffff88003ddb5600
      [  101.621800] RBP: ffff8800537cd900 R08: 0000000000000000 R09:
      ffff88003ddb5600
      [  101.621840] R10: 0000000000000005 R11: 0000000000014b38 R12:
      ffff88003ddb5600
      [  101.621881] R13: ffffffff81b7e480 R14: ffffffff81b7e8b8 R15:
      ffff88003ddebad8
      [  101.621924] FS:  00007f06e4182700(0000) GS:ffff88007fd00000(0000)
      knlGS:0000000000000000
      [  101.621971] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  101.622005] CR2: 0000000000000000 CR3: 0000000045274000 CR4:
      00000000000006e0
      [  101.622046] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
      0000000000000000
      [  101.622087] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
      0000000000000400
      [  101.622129] Process openl2tpd (pid: 5129, threadinfo
      ffff88003ddea000, task ffff88003de9a280)
      [  101.622177] Stack:
      [  101.622191]  ffffffff81447efa ffff88007d3ded80 ffff88003de9a280
      ffff88007d3ded80
      [  101.622245]  0000000000000001 ffff88003ddebbb8 ffffffff8148d5a7
      0000000000000212
      [  101.622299]  ffff88003dcea000 ffff88003dcea188 ffffffff00000001
      ffffffff81b7e480
      [  101.622353] Call Trace:
      [  101.622374]  [<ffffffff81447efa>] ? ipv4_blackhole_route+0x1ba/0x210
      [  101.622415]  [<ffffffff8148d5a7>] ? xfrm_lookup+0x417/0x510
      [  101.622450]  [<ffffffff8127672a>] ? extract_buf+0x9a/0x140
      [  101.622485]  [<ffffffff8144c6a0>] ? __ip_flush_pending_frames+0x70/0x70
      [  101.622526]  [<ffffffff8146fbbf>] ? udp_sendmsg+0x62f/0x810
      [  101.622562]  [<ffffffff813f98a6>] ? sock_sendmsg+0x116/0x130
      [  101.622599]  [<ffffffff8109df58>] ? find_get_page+0x18/0x90
      [  101.622633]  [<ffffffff8109fd6a>] ? filemap_fault+0x12a/0x4b0
      [  101.622668]  [<ffffffff813fb5c4>] ? move_addr_to_kernel+0x64/0x90
      [  101.622706]  [<ffffffff81405d5a>] ? verify_iovec+0x7a/0xf0
      [  101.622739]  [<ffffffff813fc772>] ? sys_sendmsg+0x292/0x420
      [  101.622774]  [<ffffffff810b994a>] ? handle_pte_fault+0x8a/0x7c0
      [  101.622810]  [<ffffffff810b76fe>] ? __pte_alloc+0xae/0x130
      [  101.622844]  [<ffffffff810ba2f8>] ? handle_mm_fault+0x138/0x380
      [  101.622880]  [<ffffffff81024af9>] ? do_page_fault+0x189/0x410
      [  101.622915]  [<ffffffff813fbe03>] ? sys_getsockname+0xf3/0x110
      [  101.622952]  [<ffffffff81450c4d>] ? ip_setsockopt+0x4d/0xa0
      [  101.622986]  [<ffffffff813f9932>] ? sockfd_lookup_light+0x22/0x90
      [  101.623024]  [<ffffffff814b61fb>] ? system_call_fastpath+0x16/0x1b
      [  101.623060] Code:  Bad RIP value.
      [  101.623090] RIP  [<          (null)>]           (null)
      [  101.623125]  RSP <ffff88003ddeba60>
      [  101.623146] CR2: 0000000000000000
      [  101.650871] ---[ end trace ca3856a7d8e8dad4 ]---
      [  101.651011] __sk_free: optmem leakage (160 bytes) detected.
      
      The oops happens in dst_metrics_write_ptr()
      include/net/dst.h:124: return dst->ops->cow_metrics(dst, p);
      
      dst->ops->cow_metrics is NULL and causes the oops.
      
      Provide cow_metrics() methods, like we did in commit 214f45c9
      (net: provide default_advmss() methods to blackhole dst_ops)
      Signed-off-by: NHeld Bernhard <berny156@gmx.de>
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0972ddb2
  4. 23 4月, 2011 1 次提交
  5. 18 4月, 2011 1 次提交
  6. 15 4月, 2011 2 次提交
  7. 14 4月, 2011 1 次提交
  8. 13 4月, 2011 2 次提交
    • J
      net: Do not wrap sysctl igmp_max_memberships in IP_MULTICAST · 192910a6
      Joakim Tjernlund 提交于
      controlling igmp_max_membership is useful even when IP_MULTICAST
      is off.
      Quagga(an OSPF deamon) uses multicast addresses for all interfaces
      using a single socket and hits igmp_max_membership limit when
      there are 20 interfaces or more.
      Always export sysctl igmp_max_memberships in proc, just like
      igmp_max_msf
      Signed-off-by: NJoakim Tjernlund <Joakim.Tjernlund@transmode.se>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      192910a6
    • E
      inetpeer: reduce stack usage · 66944e1c
      Eric Dumazet 提交于
      On 64bit arches, we use 752 bytes of stack when cleanup_once() is called
      from inet_getpeer().
      
      Lets share the avl stack to save ~376 bytes.
      
      Before patch :
      
      # objdump -d net/ipv4/inetpeer.o | scripts/checkstack.pl
      
      0x000006c3 unlink_from_pool [inetpeer.o]:		376
      0x00000721 unlink_from_pool [inetpeer.o]:		376
      0x00000cb1 inet_getpeer [inetpeer.o]:			376
      0x00000e6d inet_getpeer [inetpeer.o]:			376
      0x0004 inet_initpeers [inetpeer.o]:			112
      # size net/ipv4/inetpeer.o
         text	   data	    bss	    dec	    hex	filename
         5320	    432	     21	   5773	   168d	net/ipv4/inetpeer.o
      
      After patch :
      
      objdump -d net/ipv4/inetpeer.o | scripts/checkstack.pl
      0x00000c11 inet_getpeer [inetpeer.o]:			376
      0x00000dcd inet_getpeer [inetpeer.o]:			376
      0x00000ab9 peer_check_expire [inetpeer.o]:		328
      0x00000b7f peer_check_expire [inetpeer.o]:		328
      0x0004 inet_initpeers [inetpeer.o]:			112
      # size net/ipv4/inetpeer.o
         text	   data	    bss	    dec	    hex	filename
         5163	    432	     21	   5616	   15f0	net/ipv4/inetpeer.o
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      Cc: Scot Doyle <lkml@scotdoyle.com>
      Cc: Stephen Hemminger <shemminger@vyatta.com>
      Cc: Hiroaki SHIMODA <shimoda.hiroaki@gmail.com>
      Reviewed-by: NHiroaki SHIMODA <shimoda.hiroaki@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      66944e1c
  9. 11 4月, 2011 2 次提交
  10. 08 4月, 2011 1 次提交
  11. 05 4月, 2011 1 次提交
    • T
      net: Allow no-cache copy from user on transmit · c6e1a0d1
      Tom Herbert 提交于
      This patch uses __copy_from_user_nocache on transmit to bypass data
      cache for a performance improvement.  skb_add_data_nocache and
      skb_copy_to_page_nocache can be called by sendmsg functions to use
      this feature, initial support is in tcp_sendmsg.  This functionality is
      configurable per device using ethtool.
      
      Presumably, this feature would only be useful when the driver does
      not touch the data.  The feature is turned on by default if a device
      indicates that it does some form of checksum offload; it is off by
      default for devices that do no checksum offload or indicate no checksum
      is necessary.  For the former case copy-checksum is probably done
      anyway, in the latter case the device is likely loopback in which case
      the no cache copy is probably not beneficial.
      
      This patch was tested using 200 instances of netperf TCP_RR with
      1400 byte request and one byte reply.  Platform is 16 core AMD x86.
      
      No-cache copy disabled:
         672703 tps, 97.13% utilization
         50/90/99% latency:244.31 484.205 1028.41
      
      No-cache copy enabled:
         702113 tps, 96.16% utilization,
         50/90/99% latency 238.56 467.56 956.955
      
      Using 14000 byte request and response sizes demonstrate the
      effects more dramatically:
      
      No-cache copy disabled:
         79571 tps, 34.34 %utlization
         50/90/95% latency 1584.46 2319.59 5001.76
      
      No-cache copy enabled:
         83856 tps, 34.81% utilization
         50/90/95% latency 2508.42 2622.62 2735.88
      
      Note especially the effect on latency tail (95th percentile).
      
      This seems to provide a nice performance improvement and is
      consistent in the tests I ran.  Presumably, this would provide
      the greatest benfits in the presence of an application workload
      stressing the cache and a lot of transmit data happening.
      Signed-off-by: NTom Herbert <therbert@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c6e1a0d1
  12. 04 4月, 2011 3 次提交
  13. 02 4月, 2011 1 次提交
  14. 31 3月, 2011 8 次提交
  15. 30 3月, 2011 1 次提交
  16. 29 3月, 2011 1 次提交
  17. 28 3月, 2011 1 次提交
    • J
      ipv4: Fix IP timestamp option (IPOPT_TS_PRESPEC) handling in ip_options_echo() · 8628bd8a
      Jan Luebbe 提交于
      The current handling of echoed IP timestamp options with prespecified
      addresses is rather broken since the 2.2.x kernels. As far as i understand
      it, it should behave like when originating packets.
      
      Currently it will only timestamp the next free slot if:
       - there is space for *two* timestamps
       - some random data from the echoed packet taken as an IP is *not* a local IP
      
      This first is caused by an off-by-one error. 'soffset' points to the next
      free slot and so we only need to have 'soffset + 7 <= optlen'.
      
      The second bug is using sptr as the start of the option, when it really is
      set to 'skb_network_header(skb)'. I just use dptr instead which points to
      the timestamp option.
      
      Finally it would only timestamp for non-local IPs, which we shouldn't do.
      So instead we exclude all unicast destinations, similar to what we do in
      ip_options_compile().
      Signed-off-by: NJan Luebbe <jluebbe@debian.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8628bd8a
  18. 26 3月, 2011 1 次提交
  19. 25 3月, 2011 3 次提交
  20. 24 3月, 2011 2 次提交
  21. 23 3月, 2011 2 次提交
  22. 22 3月, 2011 2 次提交
    • J
      ipv4: optimize route adding on secondary promotion · 04024b93
      Julian Anastasov 提交于
      Optimize the calling of fib_add_ifaddr for all
      secondary addresses after the promoted one to start from
      their place, not from the new place of the promoted
      secondary. It will save some CPU cycles because we
      are sure the promoted secondary was first for the subnet
      and all next secondaries do not change their place.
      Signed-off-by: NJulian Anastasov <ja@ssi.bg>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      04024b93
    • J
      ipv4: remove the routes on secondary promotion · 2d230e2b
      Julian Anastasov 提交于
      The secondary address promotion relies on fib_sync_down_addr
      to remove all routes created for the secondary addresses when
      the old primary address is deleted. It does not happen for cases
      when the primary address is also in another subnet. Fix that
      by deleting local and broadcast routes for all secondaries while
      they are on device list and by faking that all addresses from
      this subnet are to be deleted. It relies on fib_del_ifaddr being
      able to ignore the IPs from the concerned subnet while checking
      for duplication.
      Signed-off-by: NJulian Anastasov <ja@ssi.bg>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2d230e2b