1. 13 Oct 2009, 2 commits
    • udp: Fix udp_poll() and ioctl() · 85584672
      Committed by Eric Dumazet
      udp_poll() can in some circumstances drop frames with incorrect checksums.
      
      The problem is that we now have to lock the socket while dropping
      frames, or risk sk_forward_alloc corruption.
      
      This bug has been present since commit 95766fff
      ([UDP]: Add memory accounting.)
      
      While we are at it, we can correct ioctl(SIOCINQ) to also drop bad frames.
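      To make the locking pattern concrete, here is a hedged sketch modeled
      on the first_packet_length() helper in net/ipv4/udp.c of that era; it
      illustrates the shape of the fix, not the verbatim patch.

      #include <linux/skbuff.h>
      #include <net/sock.h>
      #include <net/udp.h>

      /*
       * Scan the receive queue for the first valid datagram, dropping frames
       * with bad checksums on the way. Unlinking happens under the queue
       * spinlock; the actual freeing, which updates the memory-accounting
       * field sk_forward_alloc, happens under lock_sock().
       */
      static unsigned int first_packet_length(struct sock *sk)
      {
              struct sk_buff_head list_kill, *rcvq = &sk->sk_receive_queue;
              struct sk_buff *skb;
              unsigned int res;

              __skb_queue_head_init(&list_kill);

              spin_lock_bh(&rcvq->lock);
              while ((skb = skb_peek(rcvq)) != NULL &&
                     udp_lib_checksum_complete(skb)) {
                      __skb_unlink(skb, rcvq);           /* bad checksum */
                      __skb_queue_tail(&list_kill, skb);
              }
              res = skb ? skb->len : 0;
              spin_unlock_bh(&rcvq->lock);

              if (!skb_queue_empty(&list_kill)) {
                      lock_sock(sk);          /* protects sk_forward_alloc */
                      __skb_queue_purge(&list_kill);
                      sk_mem_reclaim_partial(sk);
                      release_sock(sk);
              }
              return res;
      }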
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: Generalize socket rx gap / receive queue overflow cmsg · 3b885787
      Committed by Neil Horman
      Create a new socket-level option to report the number of receive queue overflows
      
      Recently I augmented the AF_PACKET protocol to report the number of
      frames lost on the socket receive queue between any two enqueued frames.
      That value was exported via a SOL_PACKET level cmsg. After I completed
      that work it was requested that the feature be generalized so that any
      datagram-oriented socket could make use of it. This patch does that: it
      creates a new SOL_SOCKET level option called SO_RXQ_OVFL which, when
      enabled, exports a SOL_SOCKET level cmsg that reports the number of
      times the sk_receive_queue overflowed between any two given frames. It
      also augments the AF_PACKET protocol to take advantage of this new
      feature (it previously did not touch sk->sk_drops, which this patch uses
      to record the overflow count). Tested successfully by me; a userspace
      usage sketch follows the notes below.
      
      Notes:
      
      1) Unlike my previous patch, this patch simply records the sk_drops
      value, which is not a number of drops between packets but rather a total
      number of drops. Deltas must be computed in user space.

      2) While this patch is aimed at datagram-oriented protocols, the option
      will also be accepted by non-datagram-oriented protocols. I'm not sure
      that's agreeable to everyone, but my argument in favor is that, for
      protocols to which this option doesn't apply, sk_drops will always be
      zero, and reporting no drops on a receive queue that isn't used by those
      non-participating protocols seems reasonable. This also saves us having
      to code a per-protocol opt-in mechanism.
      
      3) This applies cleanly to net-next assuming that commit
      97775007 (my af packet cmsg patch) is reverted
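      As a usage illustration (the sketch referenced above): a minimal, hedged
      userspace program that enables SO_RXQ_OVFL and reads the cumulative drop
      counter from the SOL_SOCKET cmsg on every received datagram. The
      fallback #define is an assumption for older headers; the port is
      illustrative, and deltas between successive values are computed here,
      in user space.

      #include <stdio.h>
      #include <stdint.h>
      #include <string.h>
      #include <sys/uio.h>
      #include <sys/socket.h>
      #include <netinet/in.h>

      #ifndef SO_RXQ_OVFL
      #define SO_RXQ_OVFL 40          /* x86 value; assumption for old headers */
      #endif

      int main(void)
      {
              int fd = socket(AF_INET, SOCK_DGRAM, 0);
              int one = 1;
              struct sockaddr_in addr = {
                      .sin_family      = AF_INET,
                      .sin_port        = htons(9999),   /* illustrative port */
                      .sin_addr.s_addr = htonl(INADDR_ANY),
              };
              char data[2048], cbuf[CMSG_SPACE(sizeof(uint32_t))];
              struct iovec iov = { .iov_base = data, .iov_len = sizeof(data) };
              struct msghdr msg = {
                      .msg_iov        = &iov,
                      .msg_iovlen     = 1,
                      .msg_control    = cbuf,
                      .msg_controllen = sizeof(cbuf),
              };
              struct cmsghdr *cmsg;
              uint32_t drops;

              bind(fd, (struct sockaddr *)&addr, sizeof(addr));
              setsockopt(fd, SOL_SOCKET, SO_RXQ_OVFL, &one, sizeof(one));

              while (recvmsg(fd, &msg, 0) >= 0) {
                      for (cmsg = CMSG_FIRSTHDR(&msg); cmsg;
                           cmsg = CMSG_NXTHDR(&msg, cmsg)) {
                              if (cmsg->cmsg_level == SOL_SOCKET &&
                                  cmsg->cmsg_type == SO_RXQ_OVFL) {
                                      memcpy(&drops, CMSG_DATA(cmsg),
                                             sizeof(drops));
                                      /* cumulative counter: diff in userspace */
                                      printf("drops so far: %u\n", drops);
                              }
                      }
                      msg.msg_controllen = sizeof(cbuf);  /* reset each loop */
              }
              return 0;
      }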
      Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 08 Oct 2009, 1 commit
    • udp: dynamically size hash tables at boot time · f86dcc5a
      Committed by Eric Dumazet
      UDP_HTABLE_SIZE was initially defined as 128, which is a bit small for
      several setups.

      4000 active UDP sockets -> 32 sockets per chain on average. An
      incoming frame has to be checked against every socket in its chain to
      find the best match, so long chains hurt latency.

      Instead of a fixed-size hash table that can't suit every need, let the
      UDP stack choose its table size at boot time, as the tcp and ip route
      hashes do, using the alloc_large_system_hash() helper.
      
      Add an optional boot parameter, uhash_entries=x, so that an admin can
      force a size between 256 and 65536 if needed, like thash_entries and
      rhash_entries.
      
      dmesg logs two new lines:
      [    0.647039] UDP hash table entries: 512 (order: 0, 4096 bytes)
      [    0.647099] UDP Lite hash table entries: 512 (order: 0, 4096 bytes)
      
      The maximal size on 64-bit arches would be 65536 slots, i.e. 1 MByte
      with non-debug spinlocks.
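      For illustration, a hedged sketch of the boot-time sizing described
      above; the scale factor, flags, and init loop are assumptions rather
      than the verbatim patch, and the log/mask values are what
      alloc_large_system_hash() reports back.

      #include <linux/bootmem.h>
      #include <net/udp.h>

      /* set via the uhash_entries= boot parameter (parsing omitted) */
      static unsigned long uhash_entries __initdata;

      void __init udp_table_init(struct udp_table *table, const char *name)
      {
              unsigned int i, log, mask;

              /* Size the table from available memory at boot, like the tcp
               * and route hashes; uhash_entries overrides when given. */
              table->hash = alloc_large_system_hash(name,
                                                    sizeof(struct udp_hslot),
                                                    uhash_entries,
                                                    21,       /* scale (assumption) */
                                                    0,        /* flags */
                                                    &log,     /* log2 of size */
                                                    &mask,    /* size - 1 */
                                                    64 * 1024 /* max slots */);
              for (i = 0; i <= mask; i++) {
                      INIT_HLIST_HEAD(&table->hash[i].head);
                      spin_lock_init(&table->hash[i].lock);
              }
      }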
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 02 Oct 2009, 1 commit
  4. 01 Oct 2009, 1 commit
  5. 03 Sep 2009, 1 commit
    • ip: Report qdisc packet drops · 6ce9e7b5
      Committed by Eric Dumazet
      Christoph Lameter pointed out that packet drops at the qdisc level were
      not accounted in SNMP counters. Only if the application sets IP_RECVERR
      are drops reported to the user (as -ENOBUFS errors) and SNMP counters
      updated.

      IP_RECVERR enables extended reliable error message passing, but that
      should not be required for system-wide SNMP stats to be updated.

      This patch changes things a bit so that SNMP counters are updated
      regardless of whether IP_RECVERR is set on the socket.

      Example after a UDP tx flood:
      # netstat -s 
      ...
      IP:
          1487048 outgoing packets dropped
      ...
      Udp:
      ...
          SndbufErrors: 1487048
      
      
      send() syscalls do, however, still return an OK status, so as not to
      break applications.

      Note: the send() manual page explicitly says of the ENOBUFS error:
      
       "The output queue for a network interface was full.
        This generally indicates that the interface has stopped sending,
        but may be caused by transient congestion.
        (Normally, this does not occur in Linux. Packets are just silently
        dropped when a device queue overflows.) "
      
      This is not true for IP_RECVERR-enabled sockets: a send() syscall
      that hits a qdisc drop returns an ENOBUFS error.
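      To illustrate the visible behavior difference, a hedged userspace
      sketch: with IP_RECVERR enabled, a qdisc-level drop surfaces as ENOBUFS
      from send(); without it, send() keeps returning success while the SNMP
      counters climb.

      #include <errno.h>
      #include <stdio.h>
      #include <sys/socket.h>
      #include <netinet/in.h>

      /* fd is assumed to be a connected UDP socket flooding a slow device */
      static void tx_flood(int fd, const char *buf, size_t len)
      {
              int on = 1;

              /* opt in to extended error reporting */
              setsockopt(fd, IPPROTO_IP, IP_RECVERR, &on, sizeof(on));

              for (;;) {
                      if (send(fd, buf, len, 0) < 0) {
                              if (errno == ENOBUFS) {
                                      /* qdisc dropped this packet; with the
                                       * patch, SNMP SndbufErrors is bumped
                                       * whether or not IP_RECVERR is set */
                                      continue;
                              }
                              perror("send");
                              break;
                      }
              }
      }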
      
      Many thanks to Christoph, David, and last but not least, Alexey !
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  6. 18 Jul 2009, 1 commit
  7. 13 Jul 2009, 1 commit
  8. 18 Jun 2009, 1 commit
  9. 03 Jun 2009, 1 commit
  10. 11 Apr 2009, 1 commit
    • ipv6: Fix NULL pointer dereference with time-wait sockets · 499923c7
      Committed by Vlad Yasevich
      Commit b2f5e7cd
      (ipv6: Fix conflict resolutions during ipv6 binding)
      introduced a regression in which time-wait sockets were
      not treated correctly. This resulted in the following oops:
      
      BUG: unable to handle kernel NULL pointer dereference at 0000000000000062
      IP: [<ffffffff805d7d61>] ipv4_rcv_saddr_equal+0x61/0x70
      ...
      Call Trace:
      [<ffffffffa033847b>] ipv6_rcv_saddr_equal+0x1bb/0x250 [ipv6]
      [<ffffffffa03505a8>] inet6_csk_bind_conflict+0x88/0xd0 [ipv6]
      [<ffffffff805bb18e>] inet_csk_get_port+0x1ee/0x400
      [<ffffffffa0319b7f>] inet6_bind+0x1cf/0x3a0 [ipv6]
      [<ffffffff8056d17c>] ? sockfd_lookup_light+0x3c/0xd0
      [<ffffffff8056ed49>] sys_bind+0x89/0x100
      [<ffffffff80613ea2>] ? trace_hardirqs_on_thunk+0x3a/0x3c
      [<ffffffff8020bf9b>] system_call_fastpath+0x16/0x1b
      Tested-by: Brian Haley <brian.haley@hp.com>
      Tested-by: Ed Tomlinson <edt@aei.ca>
      Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  11. 25 Mar 2009, 1 commit
  12. 24 Mar 2009, 1 commit
    • udp: Wrong locking code in udp seq_file infrastructure · 30842f29
      Committed by Vitaly Mayatskikh
      Reading zero bytes from /proc/net/udp, or from similar files that use
      the same udp seq_file infrastructure, panics the kernel like this:
      
      =====================================
      [ BUG: bad unlock balance detected! ]
      -------------------------------------
      read/1985 is trying to release lock (&table->hash[i].lock) at:
      [<ffffffff81321d83>] udp_seq_stop+0x27/0x29
      but there are no more locks to release!
      
      other info that might help us debug this:
      1 lock held by read/1985:
       #0:  (&p->lock){--..}, at: [<ffffffff810eefb6>] seq_read+0x38/0x348
      
      stack backtrace:
      Pid: 1985, comm: read Not tainted 2.6.29-rc8 #9
      Call Trace:
       [<ffffffff81321d83>] ? udp_seq_stop+0x27/0x29
       [<ffffffff8106dab9>] print_unlock_inbalance_bug+0xd6/0xe1
       [<ffffffff8106db62>] lock_release_non_nested+0x9e/0x1c6
       [<ffffffff810ef030>] ? seq_read+0xb2/0x348
       [<ffffffff8106bdba>] ? mark_held_locks+0x68/0x86
       [<ffffffff81321d83>] ? udp_seq_stop+0x27/0x29
       [<ffffffff8106dde7>] lock_release+0x15d/0x189
       [<ffffffff8137163c>] _spin_unlock_bh+0x1e/0x34
       [<ffffffff81321d83>] udp_seq_stop+0x27/0x29
       [<ffffffff810ef239>] seq_read+0x2bb/0x348
       [<ffffffff810eef7e>] ? seq_read+0x0/0x348
       [<ffffffff8111aedd>] proc_reg_read+0x90/0xaf
       [<ffffffff810d878f>] vfs_read+0xa6/0x103
       [<ffffffff8106bfac>] ? trace_hardirqs_on_caller+0x12f/0x153
       [<ffffffff810d88a2>] sys_read+0x45/0x69
       [<ffffffff8101123a>] system_call_fastpath+0x16/0x1b
      BUG: scheduling while atomic: read/1985/0xffffff00
      INFO: lockdep is turned off.
      Modules linked in: cpufreq_ondemand acpi_cpufreq freq_table dm_multipath kvm ppdev snd_hda_codec_analog snd_hda_intel snd_hda_codec snd_hwdep snd_seq_dummy snd_seq_oss snd_seq_midi_event arc4 snd_seq ecb
      thinkpad_acpi snd_seq_device iwl3945 hwmon sdhci_pci snd_pcm_oss sdhci rfkill mmc_core snd_mixer_oss i2c_i801 mac80211 yenta_socket ricoh_mmc i2c_core iTCO_wdt snd_pcm iTCO_vendor_support rsrc_nonstatic
      snd_timer snd lib80211 cfg80211 soundcore snd_page_alloc video parport_pc output parport e1000e [last unloaded: scsi_wait_scan]
      Pid: 1985, comm: read Not tainted 2.6.29-rc8 #9
      Call Trace:
       [<ffffffff8106b456>] ? __debug_show_held_locks+0x1b/0x24
       [<ffffffff81043660>] __schedule_bug+0x7e/0x83
       [<ffffffff8136ede9>] schedule+0xce/0x838
       [<ffffffff810d7972>] ? fsnotify_access+0x5f/0x67
       [<ffffffff810112d0>] ? sysret_careful+0xb/0x37
       [<ffffffff8106be9c>] ? trace_hardirqs_on_caller+0x1f/0x153
       [<ffffffff8137127b>] ? trace_hardirqs_on_thunk+0x3a/0x3f
       [<ffffffff810112f6>] sysret_careful+0x31/0x37
      read[1985]: segfault at 7fffc479bfe8 ip 0000003e7420a180 sp 00007fffc479bfa0 error 6
      Kernel panic - not syncing: Aiee, killing interrupt handler!
      
      udp_seq_stop() tries to unlock a spinlock that was never locked. The
      locking was lost when the global udp_hash_lock was split into per-chain
      spinlocks.
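      A hedged sketch of the contract at issue, with illustrative names:
      ->stop() must only unlock a chain that ->start()/->next() actually
      locked, which requires the iterator to track the held bucket once the
      global lock became per-chain locks. This shows the shape of the fix,
      not the verbatim patch.

      #include <linux/seq_file.h>
      #include <net/udp.h>

      /* Hypothetical iterator state and index helper, for illustration. */
      struct udp_iter_state {
              struct udp_table *udp_table;
              int               bucket;       /* locked chain, or -1 */
      };
      static void *udp_get_idx(struct seq_file *seq, loff_t pos);

      static void *udp_seq_start(struct seq_file *seq, loff_t *pos)
      {
              struct udp_iter_state *state = seq->private;

              state->bucket = -1;             /* nothing locked yet */
              return *pos ? udp_get_idx(seq, *pos - 1) : SEQ_START_TOKEN;
      }

      static void udp_seq_stop(struct seq_file *seq, void *v)
      {
              struct udp_iter_state *state = seq->private;

              /* A zero-byte read reaches ->stop() without ->start() having
               * locked any chain: unlock only what is actually held. */
              if (state->bucket >= 0)
                      spin_unlock_bh(&state->udp_table->hash[state->bucket].lock);
      }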
      
      Signed-off-by: Vitaly Mayatskikh <v.mayatskih@gmail.com>
      Acked-by: Eric Dumazet <dada1@cosmosbay.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  13. 14 Mar 2009, 1 commit
  14. 16 Feb 2009, 1 commit
  15. 06 Feb 2009, 2 commits
  16. 03 Feb 2009, 1 commit
  17. 27 Jan 2009, 1 commit
  18. 26 Nov 2008, 1 commit
  19. 25 Nov 2008, 1 commit
    • net: avoid a pair of dst_hold()/dst_release() in ip_append_data() · 2e77d89b
      Committed by Eric Dumazet
      We can reduce pressure on the dst entry refcount, which slows down the
      UDP transmit path on SMP machines. This pressure is visible on RTP
      servers delivering content to media gateways, especially big ones
      handling thousands of streams. Several cpus send UDP frames to the same
      destination and hence use the same dst entry.
      
      This patch makes ip_append_data() eventually steal the refcount its
      callers had to take on the dst entry.
      
      This doesn't avoid all refcounting, but still gives speedups on SMP
      on the UDP/RAW transmit path.
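      A hedged before/after fragment of the idea; field names (cork.dst,
      rt->u.dst) follow the kernel of that era but are assumptions here, and
      error handling is elided.

      #include <net/inet_sock.h>
      #include <net/route.h>

      static void append_data_before(struct inet_sock *inet, struct rtable *rt)
      {
              dst_hold(&rt->u.dst);           /* callee took its own ref */
              inet->cork.dst = &rt->u.dst;
              /* ... the caller later dropped its own ref: ip_rt_put(rt) */
      }

      static void append_data_after(struct inet_sock *inet, struct rtable *rt)
      {
              inet->cork.dst = &rt->u.dst;    /* steal the caller's ref */
              /* ... the caller must no longer call ip_rt_put(rt) here */
      }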
      Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  20. 20 Nov 2008, 1 commit
  21. 17 Nov 2008, 1 commit
  22. 02 Nov 2008, 2 commits
  23. 31 Oct 2008, 2 commits
  24. 30 Oct 2008, 2 commits
  25. 29 Oct 2008, 3 commits
    • udp: calculate udp_mem based on low memory instead of all memory · 8203efb3
      Committed by Eric Dumazet
      This patch mimics commit 57413ebc
      (tcp: calculate tcp_mem based on low memory instead of all memory)
      
      The udp_mem array, which contains limits on the total amount of memory
      used by UDP sockets, is calculated based on nr_all_pages. On a 32-bit
      x86 system, we should base it on the number of lowmem pages instead.
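      A hedged sketch of the sizing change, mirroring the tcp_mem
      calculation; the constants and the division are assumptions here.

      #include <linux/kernel.h>
      #include <linux/swap.h>
      #include <net/udp.h>

      void __init udp_init(void)
      {
              unsigned long limit;

              /* Base the limits on lowmem pages (nr_free_buffer_pages())
               * instead of nr_all_pages, so highmem on 32-bit x86 no longer
               * inflates udp_mem. */
              limit = nr_free_buffer_pages() / 8;
              limit = max(limit, 128UL);
              sysctl_udp_mem[0] = limit / 4 * 3;
              sysctl_udp_mem[1] = limit;
              sysctl_udp_mem[2] = sysctl_udp_mem[0] * 2;
      }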
      Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • udp: RCU handling for Unicast packets. · 271b72c7
      Committed by Eric Dumazet
      Goals are:

      1) Optimize handling of incoming unicast UDP frames, so that no memory
       writes happen in the fast path.

       Note: multicasts and broadcasts still need to take a lock, because
       doing a fully lockless lookup in this case is difficult.

      2) No expensive operations in the socket bind/unhash phases:
        - No expensive synchronize_rcu() calls.
      
        - No rcu_head added to the socket structure. That would increase
        memory needs but, more importantly, would force us to use call_rcu()
        calls, which have the bad property of making socket structures cold.
        (The rcu grace period between a socket's freeing and its potential
         reuse leaves that socket cold in the CPU cache.)
        David did a previous patch using call_rcu() and noticed a 20%
        impact on TCP connection rates.
        Quoting Christoph Lameter:
         "Right. That results in cacheline cooldown. You'd want to recycle
          the object as they are cache hot on a per cpu basis. That is screwed
          up by the delayed regular rcu processing. We have seen multiple
          regressions due to cacheline cooldown.
          The only choice in cacheline hot sensitive areas is to deal with the
          complexity that comes with SLAB_DESTROY_BY_RCU or give up on RCU."

        - Because udp sockets are allocated from a dedicated kmem_cache,
        SLAB_DESTROY_BY_RCU can help here.
      
      Theory of operation
      -------------------
      
      As the lookup is lock-free (using rcu_read_lock()/rcu_read_unlock()),
      special care must be taken by readers and writers.

      Use of SLAB_DESTROY_BY_RCU is tricky too, because a socket can be freed,
      reused, and inserted in a different chain or, in the worst case, in the
      same chain while readers are doing lookups at the same time.

      In order to avoid loops, a reader must check that each socket found in a
      chain really belongs to the chain the reader was traversing. If it finds
      a mismatch, the lookup must start again at the beginning. This *restart*
      loop is the reason we had to use rdlock for the multicast case: we don't
      want to send the same message several times to the same socket.

      We use RCU only for the fast path.
      Thus, /proc/net/udp still takes spinlocks.
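      A hedged sketch of the reader-side discipline described above; match()
      is a hypothetical predicate standing in for the real 4-tuple comparison,
      the iteration macro is shown in its modern 3-argument form, and the
      era-appropriate atomic_inc_not_zero() on sk_refcnt would be
      refcount_inc_not_zero() in later kernels.

      #include <linux/rculist.h>
      #include <net/sock.h>

      /* Hypothetical predicate: does this socket match the 4-tuple/dif? */
      static bool match(const struct sock *sk, __be32 saddr, __be16 sport,
                        __be32 daddr, __be16 dport, int dif);

      static struct sock *udp_lookup_rcu(struct hlist_head *chain,
                                         __be32 saddr, __be16 sport,
                                         __be32 daddr, __be16 dport, int dif)
      {
              struct sock *sk;

              rcu_read_lock();
      begin:
              hlist_for_each_entry_rcu(sk, chain, sk_node) {
                      if (!match(sk, saddr, sport, daddr, dport, dif))
                              continue;
                      /* SLAB_DESTROY_BY_RCU: the object may be freed and
                       * reused at any time, so take a reference only if the
                       * refcount is still non-zero ... */
                      if (unlikely(!atomic_inc_not_zero(&sk->sk_refcnt)))
                              goto begin;     /* freed under us: restart */
                      /* ... and re-check the keys afterwards, since the
                       * socket may have been reused for another flow or
                       * moved to a different chain */
                      if (unlikely(!match(sk, saddr, sport, daddr, dport, dif))) {
                              sock_put(sk);
                              goto begin;
                      }
                      rcu_read_unlock();
                      return sk;
              }
              rcu_read_unlock();
              return NULL;
      }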
      Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • udp: introduce struct udp_table and multiple spinlocks · 645ca708
      Committed by Eric Dumazet
      UDP sockets are hashed into a 128-slot hash table.

      This hash table is protected by *one* rwlock.

      This rwlock is read-locked each time an incoming UDP message is handled.

      It is write-locked each time a socket must be inserted into the
      hash table (bind time) or deleted from it (close time).

      This is not scalable on SMP machines:

      1) Even in read mode, lock() and unlock() are atomic operations and
       must dirty a contended cache line, shared by all cpus.

      2) A writer might be starved if many readers are 'in flight'. This can
       happen on a machine with a NIC receiving many UDP messages. User
       processes can then be delayed a long time at socket creation/dismantle
       time.
      
      This patch prepares the RCU migration by introducing 'struct udp_table'
      and 'struct udp_hslot', and by using one spinlock per chain, to reduce
      contention on the central rwlock.

      Introducing one spinlock per chain reduces latencies for port
      randomization on heavily loaded UDP servers. It also speeds up
      binding to specific ports.

      udp_lib_unhash() was uninlined, as it had become too big.

      Some cleanups were done to ease review of the following patch
      (RCUification of UDP unicast lookups); see the structure sketch below.
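      For reference, a hedged sketch of the two structures as the changelog
      describes them; the alignment attribute is an assumption, keeping each
      chain head and its lock together in memory.

      #include <linux/list.h>
      #include <linux/spinlock.h>

      #define UDP_HTABLE_SIZE 128     /* same 128 slots as before */

      /* One hash chain plus the spinlock that protects it. */
      struct udp_hslot {
              struct hlist_head head;
              spinlock_t        lock;
      } __attribute__((aligned(2 * sizeof(long))));

      /* The whole table: one lock per chain instead of one global rwlock. */
      struct udp_table {
              struct udp_hslot hash[UDP_HTABLE_SIZE];
      };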
      Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  26. 14 Oct 2008, 1 commit
  27. 10 Oct 2008, 1 commit
  28. 09 Oct 2008, 1 commit
    • udp: Improve port randomization · 9088c560
      Committed by Eric Dumazet
      Current UDP port allocation is suboptimal.
      We select the shortest chain and choose a port (out of 512)
      that will hash into that shortest chain.

      First, this can yield not-so-random ports, giving attackers
      more opportunities to break the system.

      Second, scanning the whole table to find the shortest chain
      can consume a lot of CPU.

      Third, in some pathological cases we can fail to find
      a free port even though there are plenty of them.

      This patch zaps the search for a short chain and uses only
      one random seed, as sketched below. The problem of long chains
      should be addressed another way, since we can get long chains
      even with non-random ports.
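      A hedged sketch of the simplified strategy: pick a single random
      starting offset in the allowed range and probe linearly from it, rather
      than scanning all chains for the shortest one. Helper names are
      illustrative; in-kernel the randomness came from net_random() and the
      in-use check from udp_lib_lport_inuse().

      static int pick_udp_port(unsigned int low, unsigned int high,
                               unsigned int rand,
                               int (*port_in_use)(unsigned int))
      {
              unsigned int range = high - low + 1;
              unsigned int first = low + rand % range;
              unsigned int i, port;

              /* one random seed, then a linear probe: no shortest-chain scan */
              for (i = 0; i < range; i++) {
                      port = low + (first - low + i) % range;
                      if (!port_in_use(port))
                              return port;
              }
              return -1;      /* range genuinely exhausted */
      }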
      
      Based on a report and patch from Vitaly Mayatskikh
      Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  29. 08 Oct 2008, 3 commits
  30. 01 Oct 2008, 1 commit
  31. 16 Sep 2008, 1 commit