1. 12 7月, 2011 1 次提交
    • E
      inetpeer: kill inet_putpeer race · 6d1a3e04
      Eric Dumazet 提交于
      We currently can free inetpeer entries too early :
      
      [  782.636674] WARNING: kmemcheck: Caught 32-bit read from uninitialized memory (f130f44c)
      [  782.636677] 1f7b13c100000000000000000000000002000000000000000000000000000000
      [  782.636686]  i i i i u u u u i i i i u u u u i i i i u u u u u u u u u u u u
      [  782.636694]                          ^
      [  782.636696]
      [  782.636698] Pid: 4638, comm: ssh Not tainted 3.0.0-rc5+ #270 Hewlett-Packard HP Compaq 6005 Pro SFF PC/3047h
      [  782.636702] EIP: 0060:[<c13fefbb>] EFLAGS: 00010286 CPU: 0
      [  782.636707] EIP is at inet_getpeer+0x25b/0x5a0
      [  782.636709] EAX: 00000002 EBX: 00010080 ECX: f130f3c0 EDX: f0209d30
      [  782.636711] ESI: 0000bc87 EDI: 0000ea60 EBP: f0209ddc ESP: c173134c
      [  782.636712]  DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
      [  782.636714] CR0: 8005003b CR2: f0beca80 CR3: 30246000 CR4: 000006d0
      [  782.636716] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
      [  782.636717] DR6: ffff4ff0 DR7: 00000400
      [  782.636718]  [<c13fbf76>] rt_set_nexthop.clone.45+0x56/0x220
      [  782.636722]  [<c13fc449>] __ip_route_output_key+0x309/0x860
      [  782.636724]  [<c141dc54>] tcp_v4_connect+0x124/0x450
      [  782.636728]  [<c142ce43>] inet_stream_connect+0xa3/0x270
      [  782.636731]  [<c13a8da1>] sys_connect+0xa1/0xb0
      [  782.636733]  [<c13a99dd>] sys_socketcall+0x25d/0x2a0
      [  782.636736]  [<c149deb8>] sysenter_do_call+0x12/0x28
      [  782.636738]  [<ffffffff>] 0xffffffff
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6d1a3e04
  2. 11 7月, 2011 1 次提交
  3. 06 7月, 2011 1 次提交
  4. 05 7月, 2011 1 次提交
    • M
      net: bind() fix error return on wrong address family · c349a528
      Marcus Meissner 提交于
      Hi,
      
      Reinhard Max also pointed out that the error should EAFNOSUPPORT according
      to POSIX.
      
      The Linux manpages have it as EINVAL, some other OSes (Minix, HPUX, perhaps BSD) use
      EAFNOSUPPORT. Windows uses WSAEFAULT according to MSDN.
      
      Other protocols error values in their af bind() methods in current mainline git as far
      as a brief look shows:
      	EAFNOSUPPORT: atm, appletalk, l2tp, llc, phonet, rxrpc
      	EINVAL: ax25, bluetooth, decnet, econet, ieee802154, iucv, netlink, netrom, packet, rds, rose, unix, x25,
      	No check?: can/raw, ipv6/raw, irda, l2tp/l2tp_ip
      
      Ciao, Marcus
      Signed-off-by: NMarcus Meissner <meissner@suse.de>
      Cc: Reinhard Max <max@suse.de>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c349a528
  5. 02 7月, 2011 5 次提交
  6. 29 6月, 2011 1 次提交
    • J
      netfilter: Fix ip_route_me_harder triggering ip_rt_bug · ed6e4ef8
      Julian Anastasov 提交于
      	Avoid creating input routes with ip_route_me_harder.
      It does not work for locally generated packets. Instead,
      restrict sockets to provide valid saddr for output route (or
      unicast saddr for transparent proxy). For other traffic
      allow saddr to be unicast or local but if callers forget
      to check saddr type use 0 for the output route.
      
      	The resulting handling should be:
      
      - REJECT TCP:
      	- in INPUT we can provide addr_type = RTN_LOCAL but
      	better allow rejecting traffic delivered with
      	local route (no IP address => use RTN_UNSPEC to
      	allow also RTN_UNICAST).
      	- FORWARD: RTN_UNSPEC => allow RTN_LOCAL/RTN_UNICAST
      	saddr, add fix to ignore RTN_BROADCAST and RTN_MULTICAST
      	- OUTPUT: RTN_UNSPEC
      
      - NAT, mangle, ip_queue, nf_ip_reroute: RTN_UNSPEC in LOCAL_OUT
      
      - IPVS:
      	- use RTN_LOCAL in LOCAL_OUT and FORWARD after SNAT
      	to restrict saddr to be local
      Signed-off-by: NJulian Anastasov <ja@ssi.bg>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ed6e4ef8
  7. 28 6月, 2011 2 次提交
  8. 22 6月, 2011 4 次提交
  9. 21 6月, 2011 1 次提交
  10. 19 6月, 2011 1 次提交
  11. 18 6月, 2011 2 次提交
    • E
      inet_diag: fix inet_diag_bc_audit() · eeb14972
      Eric Dumazet 提交于
      A malicious user or buggy application can inject code and trigger an
      infinite loop in inet_diag_bc_audit()
      
      Also make sure each instruction is aligned on 4 bytes boundary, to avoid
      unaligned accesses.
      Reported-by: NDan Rosenberg <drosenberg@vsecurity.com>
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      eeb14972
    • E
      net: rfs: enable RFS before first data packet is received · 1eddcead
      Eric Dumazet 提交于
      Le jeudi 16 juin 2011 à 23:38 -0400, David Miller a écrit :
      > From: Ben Hutchings <bhutchings@solarflare.com>
      > Date: Fri, 17 Jun 2011 00:50:46 +0100
      >
      > > On Wed, 2011-06-15 at 04:15 +0200, Eric Dumazet wrote:
      > >> @@ -1594,6 +1594,7 @@ int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb)
      > >>  			goto discard;
      > >>
      > >>  		if (nsk != sk) {
      > >> +			sock_rps_save_rxhash(nsk, skb->rxhash);
      > >>  			if (tcp_child_process(sk, nsk, skb)) {
      > >>  				rsk = nsk;
      > >>  				goto reset;
      > >>
      > >
      > > I haven't tried this, but it looks reasonable to me.
      > >
      > > What about IPv6?  The logic in tcp_v6_do_rcv() looks very similar.
      >
      > Indeed ipv6 side needs the same fix.
      >
      > Eric please add that part and resubmit.  And in fact I might stick
      > this into net-2.6 instead of net-next-2.6
      >
      
      OK, here is the net-2.6 based one then, thanks !
      
      [PATCH v2] net: rfs: enable RFS before first data packet is received
      
      First packet received on a passive tcp flow is not correctly RFS
      steered.
      
      One sock_rps_record_flow() call is missing in inet_accept()
      
      But before that, we also must record rxhash when child socket is setup.
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      CC: Tom Herbert <therbert@google.com>
      CC: Ben Hutchings <bhutchings@solarflare.com>
      CC: Jamal Hadi Salim <hadi@cyberus.ca>
      Signed-off-by: NDavid S. Miller <davem@conan.davemloft.net>
      1eddcead
  12. 16 6月, 2011 5 次提交
  13. 12 6月, 2011 1 次提交
    • E
      snmp: reduce percpu needs by 50% · 8f0ea0fe
      Eric Dumazet 提交于
      SNMP mibs use two percpu arrays, one used in BH context, another in USER
      context. With increasing number of cpus in machines, and fact that ipv6
      uses per network device ipstats_mib, this is consuming a lot of memory
      if many network devices are registered.
      
      commit be281e55 (ipv6: reduce per device ICMP mib sizes) shrinked
      percpu needs for ipv6, but we can reduce memory use a bit more.
      
      With recent percpu infrastructure (irqsafe_cpu_inc() ...), we no longer
      need this BH/USER separation since we can update counters in a single
      x86 instruction, regardless of the BH/USER context.
      
      Other arches than x86 might need to disable irq in their
      irqsafe_cpu_inc() implementation : If this happens to be a problem, we
      can make SNMP_ARRAY_SZ arch dependent, but a previous poll
      ( https://lkml.org/lkml/2011/3/17/174 ) to arch maintainers did not
      raise strong opposition.
      
      Only on 32bit arches, we need to disable BH for 64bit counters updates
      done from USER context (currently used for IP MIB)
      
      This also reduces vmlinux size :
      
      1) x86_64 build
      $ size vmlinux.before vmlinux.after
         text	   data	    bss	    dec	    hex	filename
      7853650	1293772	1896448	11043870	 a8841e	vmlinux.before
      7850578	1293772	1896448	11040798	 a8781e	vmlinux.after
      
      2) i386  build
      $ size vmlinux.before vmlinux.afterpatch
         text	   data	    bss	    dec	    hex	filename
      6039335	 635076	3670016	10344427	 9dd7eb	vmlinux.before
      6037342	 635076	3670016	10342434	 9dd022	vmlinux.afterpatch
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      CC: Andi Kleen <andi@firstfloor.org>
      CC: Ingo Molnar <mingo@elte.hu>
      CC: Tejun Heo <tj@kernel.org>
      CC: Christoph Lameter <cl@linux-foundation.org>
      CC: Benjamin Herrenschmidt <benh@kernel.crashing.org
      CC: linux-arch@vger.kernel.org
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8f0ea0fe
  14. 10 6月, 2011 2 次提交
  15. 09 6月, 2011 3 次提交
    • E
      net: pmtu_expires fixes · fe6fe792
      Eric Dumazet 提交于
      commit 2c8cec5c (ipv4: Cache learned PMTU information in inetpeer)
      added some racy peer->pmtu_expires accesses.
      
      As its value can be changed by another cpu/thread, we should be more
      careful, reading its value once.
      
      Add peer_pmtu_expired() and peer_pmtu_cleaned() helpers
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      fe6fe792
    • E
      inetpeer: remove unused list · 4b9d9be8
      Eric Dumazet 提交于
      Andi Kleen and Tim Chen reported huge contention on inetpeer
      unused_peers.lock, on memcached workload on a 40 core machine, with
      disabled route cache.
      
      It appears we constantly flip peers refcnt between 0 and 1 values, and
      we must insert/remove peers from unused_peers.list, holding a contended
      spinlock.
      
      Remove this list completely and perform a garbage collection on-the-fly,
      at lookup time, using the expired nodes we met during the tree
      traversal.
      
      This removes a lot of code, makes locking more standard, and obsoletes
      two sysctls (inet_peer_gc_mintime and inet_peer_gc_maxtime). This also
      removes two pointers in inet_peer structure.
      
      There is still a false sharing effect because refcnt is in first cache
      line of object [were the links and keys used by lookups are located], we
      might move it at the end of inet_peer structure to let this first cache
      line mostly read by cpus.
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      CC: Andi Kleen <andi@firstfloor.org>
      CC: Tim Chen <tim.c.chen@linux.intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4b9d9be8
    • J
      tcp: RFC2988bis + taking RTT sample from 3WHS for the passive open side · 9ad7c049
      Jerry Chu 提交于
      This patch lowers the default initRTO from 3secs to 1sec per
      RFC2988bis. It falls back to 3secs if the SYN or SYN-ACK packet
      has been retransmitted, AND the TCP timestamp option is not on.
      
      It also adds support to take RTT sample during 3WHS on the passive
      open side, just like its active open counterpart, and uses it, if
      valid, to seed the initRTO for the data transmission phase.
      
      The patch also resets ssthresh to its initial default at the
      beginning of the data transmission phase, and reduces cwnd to 1 if
      there has been MORE THAN ONE retransmission during 3WHS per RFC5681.
      Signed-off-by: NH.K. Jerry Chu <hkchu@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9ad7c049
  16. 06 6月, 2011 4 次提交
  17. 02 6月, 2011 1 次提交
  18. 01 6月, 2011 1 次提交
    • C
      ip_options_compile: properly handle unaligned pointer · 48bdf072
      Chris Metcalf 提交于
      The current code takes an unaligned pointer and does htonl() on it to
      make it big-endian, then does a memcpy().  The problem is that the
      compiler decides that since the pointer is to a __be32, it is legal
      to optimize the copy into a processor word store.  However, on an
      architecture that does not handled unaligned writes in kernel space,
      this produces an unaligned exception fault.
      
      The solution is to track the pointer as a "char *" (which removes a bunch
      of unpleasant casts in any case), and then just use put_unaligned_be32()
      to write the value to memory.
      Signed-off-by: NChris Metcalf <cmetcalf@tilera.com>
      Signed-off-by: NDavid S. Miller <davem@zippy.davemloft.net>
      48bdf072
  19. 28 5月, 2011 1 次提交
  20. 25 5月, 2011 1 次提交
  21. 24 5月, 2011 1 次提交