1. 03 May 2010, 1 commit
  2. 02 May 2010, 1 commit
    • net: sock_def_readable() and friends RCU conversion · 43815482
      Authored by Eric Dumazet
      The sk_callback_lock rwlock actually protects the sk->sk_sleep
      pointer, so we currently need two atomic operations (and the
      associated dirtying) per incoming packet.
      
      An RCU conversion is pretty much needed:
      
      1) Add a new structure, called "struct socket_wq", to hold all fields
      that will need rcu_read_lock() protection (currently: a
      wait_queue_head_t and a struct fasync_struct pointer).
      
      [A future patch will add a list anchor for wakeup coalescing.]
      
      2) Attach one such structure to each "struct socket" created in
      sock_alloc_inode().
      
      3) Respect the RCU grace period when freeing a "struct socket_wq".
      
      4) Replace the sk_sleep pointer in "struct sock" with sk_wq, a
      pointer to "struct socket_wq".
      
      5) Change the sk_sleep() function to use the new sk->sk_wq instead
      of sk->sk_sleep.
      
      6) Replace sk_has_sleeper() with wq_has_sleeper(), which must be
      used inside an rcu_read_lock() section.
      
      7) Change all sk_has_sleeper() callers to:
        - Use rcu_read_lock() instead of read_lock(&sk->sk_callback_lock)
        - Use wq_has_sleeper() to decide whether any tasks need waking
        - Use rcu_read_unlock() instead of read_unlock(&sk->sk_callback_lock)
      
      8) sock_wake_async() is modified to use RCU protection as well.
      
      9) Exceptions:
        macvtap, drivers/net/tun.c, and af_unix use an integrated "struct
      socket_wq" instead of a dynamically allocated one, so they don't need
      RCU freeing.
      
      Some cleanups or follow-ups are probably needed (a possible
      conversion of sk_callback_lock to a spinlock, for example).
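      
      A minimal sketch of the resulting wakeup pattern, condensed from the
      steps above (the struct fields follow the commit text;
      example_def_readable is an illustrative name, not the actual kernel
      function):
      
          struct socket_wq {
                  wait_queue_head_t       wait;
                  struct fasync_struct    *fasync_list;
                  struct rcu_head         rcu;    /* freed after an RCU grace period */
          };
      
          /* Wakeup side: rcu_read_lock() replaces read_lock(&sk->sk_callback_lock). */
          static void example_def_readable(struct sock *sk)
          {
                  struct socket_wq *wq;
      
                  rcu_read_lock();
                  wq = rcu_dereference(sk->sk_wq);
                  if (wq_has_sleeper(wq))         /* must run under rcu_read_lock() */
                          wake_up_interruptible(&wq->wait);
                  rcu_read_unlock();
          }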
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 01 May 2010, 7 commits
  4. 29 April 2010, 8 commits
    • net: ip_queue_rcv_skb() helper · f84af32c
      Authored by Eric Dumazet
      When queueing an skb to a socket, we can immediately release its dst
      if the target socket does not use IP_CMSG_PKTINFO.
      
      tcp_data_queue() can drop the dst too.
      
      This benefits from a hot cache line and avoids having the receiver,
      possibly on another CPU, dirty the cache line itself.
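      
      A hedged sketch of the helper this describes (simplified; the exact
      call sites are per the commit text):
      
          /* Drop the dst on the hot (softirq) CPU when the receiving socket
           * will never look at it, i.e. IP_CMSG_PKTINFO is not requested. */
          static int example_ip_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
          {
                  if (!(inet_sk(sk)->cmsg_flags & IP_CMSG_PKTINFO))
                          skb_dst_drop(skb);
                  return sock_queue_rcv_skb(sk, skb);
          }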
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: speedup udp receive path · 4b0b72f7
      Authored by Eric Dumazet
      Since commit 95766fff ([UDP]: Add memory accounting.), 
      each received packet needs one extra sock_lock()/sock_release() pair.
      
      This added latency because of possible backlog handling. Then later,
      ticket spinlocks added yet another latency source in case of DDOS.
      
      This patch introduces lock_sock_bh() and unlock_sock_bh()
      synchronization primitives, avoiding one atomic operation and backlog
      processing.
      
      skb_free_datagram_locked() uses them instead of the full-blown
      lock_sock()/release_sock() pair. The skb is orphaned inside the
      locked section for proper socket memory reclaim, and finally freed
      outside of it.
      
      The UDP receive path now takes the socket spinlock only once.
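      
      A sketch of what such primitives can look like, assuming they reduce
      to the socket spinlock (the real definitions may differ):
      
          /* One spinlock acquisition with BHs disabled: no extra atomic and
           * no backlog processing, unlike lock_sock()/release_sock(). */
          static inline void example_lock_sock_bh(struct sock *sk)
          {
                  spin_lock_bh(&sk->sk_lock.slock);
          }
      
          static inline void example_unlock_sock_bh(struct sock *sk)
          {
                  spin_unlock_bh(&sk->sk_lock.slock);
          }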
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • sctp: Fix skb_over_panic resulting from multiple invalid parameter errors (CVE-2010-1173) (v4) · 5fa782c2
      Authored by Neil Horman
      OK, version 4.
      
      Change notes:
      1) Minor cleanups, from Vlad's notes
      
      Summary:
      
      Hey-
      	Recently, it was reported to me that the kernel could oops in the
      following way:
      
      <5> kernel BUG at net/core/skbuff.c:91!
      <5> invalid operand: 0000 [#1]
      <5> Modules linked in: sctp netconsole nls_utf8 autofs4 sunrpc iptable_filter
      ip_tables cpufreq_powersave parport_pc lp parport vmblock(U) vsock(U) vmci(U)
      vmxnet(U) vmmemctl(U) vmhgfs(U) acpiphp dm_mirror dm_mod button battery ac md5
      ipv6 uhci_hcd ehci_hcd snd_ens1371 snd_rawmidi snd_seq_device snd_pcm_oss
      snd_mixer_oss snd_pcm snd_timer snd_page_alloc snd_ac97_codec snd soundcore
      pcnet32 mii floppy ext3 jbd ata_piix libata mptscsih mptsas mptspi mptscsi
      mptbase sd_mod scsi_mod
      <5> CPU:    0
      <5> EIP:    0060:[<c02bff27>]    Not tainted VLI
      <5> EFLAGS: 00010216   (2.6.9-89.0.25.EL)
      <5> EIP is at skb_over_panic+0x1f/0x2d
      <5> eax: 0000002c   ebx: c033f461   ecx: c0357d96   edx: c040fd44
      <5> esi: c033f461   edi: df653280   ebp: 00000000   esp: c040fd40
      <5> ds: 007b   es: 007b   ss: 0068
      <5> Process swapper (pid: 0, threadinfo=c040f000 task=c0370be0)
      <5> Stack: c0357d96 e0c29478 00000084 00000004 c033f461 df653280 d7883180
      e0c2947d
      <5>        00000000 00000080 df653490 00000004 de4f1ac0 de4f1ac0 00000004
      df653490
      <5>        00000001 e0c2877a 08000800 de4f1ac0 df653490 00000000 e0c29d2e
      00000004
      <5> Call Trace:
      <5>  [<e0c29478>] sctp_addto_chunk+0xb0/0x128 [sctp]
      <5>  [<e0c2947d>] sctp_addto_chunk+0xb5/0x128 [sctp]
      <5>  [<e0c2877a>] sctp_init_cause+0x3f/0x47 [sctp]
      <5>  [<e0c29d2e>] sctp_process_unk_param+0xac/0xb8 [sctp]
      <5>  [<e0c29e90>] sctp_verify_init+0xcc/0x134 [sctp]
      <5>  [<e0c20322>] sctp_sf_do_5_1B_init+0x83/0x28e [sctp]
      <5>  [<e0c25333>] sctp_do_sm+0x41/0x77 [sctp]
      <5>  [<c01555a4>] cache_grow+0x140/0x233
      <5>  [<e0c26ba1>] sctp_endpoint_bh_rcv+0xc5/0x108 [sctp]
      <5>  [<e0c2b863>] sctp_inq_push+0xe/0x10 [sctp]
      <5>  [<e0c34600>] sctp_rcv+0x454/0x509 [sctp]
      <5>  [<e084e017>] ipt_hook+0x17/0x1c [iptable_filter]
      <5>  [<c02d005e>] nf_iterate+0x40/0x81
      <5>  [<c02e0bb9>] ip_local_deliver_finish+0x0/0x151
      <5>  [<c02e0c7f>] ip_local_deliver_finish+0xc6/0x151
      <5>  [<c02d0362>] nf_hook_slow+0x83/0xb5
      <5>  [<c02e0bb2>] ip_local_deliver+0x1a2/0x1a9
      <5>  [<c02e0bb9>] ip_local_deliver_finish+0x0/0x151
      <5>  [<c02e103e>] ip_rcv+0x334/0x3b4
      <5>  [<c02c66fd>] netif_receive_skb+0x320/0x35b
      <5>  [<e0a0928b>] init_stall_timer+0x67/0x6a [uhci_hcd]
      <5>  [<c02c67a4>] process_backlog+0x6c/0xd9
      <5>  [<c02c690f>] net_rx_action+0xfe/0x1f8
      <5>  [<c012a7b1>] __do_softirq+0x35/0x79
      <5>  [<c0107efb>] handle_IRQ_event+0x0/0x4f
      <5>  [<c01094de>] do_softirq+0x46/0x4d
      
      It's an skb_over_panic BUG halt that results from processing an INIT
      chunk in which too many of its variable-length parameters are in some
      way malformed.
      
      The problem is in sctp_process_unk_param():
      
          if (NULL == *errp)
                  *errp = sctp_make_op_error_space(asoc, chunk,
                                                   ntohs(chunk->chunk_hdr->length));
      
          if (*errp) {
                  sctp_init_cause(*errp, SCTP_ERROR_UNKNOWN_PARAM,
                                  WORD_ROUND(ntohs(param.p->length)));
                  sctp_addto_chunk(*errp,
                                   WORD_ROUND(ntohs(param.p->length)),
                                   param.v);
      
      When we allocate an error chunk, we assume that the worst-case
      scenario requires chunk_hdr->length bytes of data, which would be
      correct nominally, given that we call sctp_addto_chunk for the
      violating parameter.  Unfortunately, sctp_init_cause also inserts an
      sctp_errhdr_t structure into the error chunk, so the worst-case
      situation, in which all parameters are in violation, requires
      chunk_hdr->length + (sizeof(sctp_errhdr_t) * param_count) bytes of
      data.
      
      The result of this error is that a deliberately malformed packet sent
      to a listening host can cause a remote DoS, described in
      CVE-2010-1173:
      http://cve.mitre.org/cgi-bin/cvename.cgi?name=2010-1173
      
      We move to a strategy whereby we allocate a fixed-size error chunk
      and ignore errors we don't have space to report.  I've tested the
      fix below and confirmed that it resolves the issue.
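      
      A hedged sketch of that strategy; the *_fixed helper names follow the
      patch's approach, but treat the exact signatures as illustrative:
      
          /* Allocate one bounded error chunk up front; report a malformed
           * parameter only if it still fits, otherwise silently skip it. */
          if (NULL == *errp)
                  *errp = sctp_make_op_error_fixed(asoc, chunk);
      
          if (*errp) {
                  if (!sctp_init_cause_fixed(*errp, SCTP_ERROR_UNKNOWN_PARAM,
                                             WORD_ROUND(ntohs(param.p->length))))
                          sctp_addto_chunk_fixed(*errp,
                                                 WORD_ROUND(ntohs(param.p->length)),
                                                 param.v);
          }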
      Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
      Acked-by: Vlad Yasevich <vladislav.yasevich@hp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • caif: Disconnect without waiting for response · 8d545c8f
      Authored by Sjur Braendeland
      Changes:
      o The function cfcnfg_disconn_adapt_layer is changed to do an
        asynchronous disconnect, without waiting for any response from the
        modem. Because of this, the function cfcnfg_linkdestroy_rsp no
        longer does anything.
      o Because a disconnect may take down a connection before a connect
        response is received, the function cfcnfg_linkup_rsp checks whether
        the client is still waiting for the response; if not, a disconnect
        request is sent to the modem.
      o cfctrl no longer keeps track of pending disconnect requests.
      o Added the function cfctrl_cancel_req, which is used to delete a
        pending connect request if a disconnect is done before the connect
        response is received.
      o Removed the unused function cfctrl_insert_req2.
      o Added better handling of connect rejects from the modem.
      Signed-off-by: Sjur Braendeland <sjur.brandeland@stericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • caif: Add reference counting to service layer · 5b208656
      Authored by Sjur Braendeland
      Changes:
      o Added the functions cfsrvl_get and cfsrvl_put.
      o Added release_client support, for use by the socket and net device.
      o Increased the reference count for in-flight packets from cfmuxl.
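      
      A minimal sketch of the get/put pattern this describes (the struct
      layout and free path here are assumptions for illustration, not the
      actual caif code):
      
          struct example_srvl {
                  struct cflayer layer;
                  atomic_t ref;           /* held by clients and in-flight packets */
          };
      
          static inline void example_srvl_get(struct example_srvl *s)
          {
                  atomic_inc(&s->ref);
          }
      
          static inline void example_srvl_put(struct example_srvl *s)
          {
                  if (atomic_dec_and_test(&s->ref))
                          kfree(s);       /* last reference dropped */
          }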
      Signed-off-by: Sjur Braendeland <sjur.brandeland@stericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • caif: Rename functions in cfcnfg and caif_dev · e539d83c
      Authored by Sjur Braendeland
      Changes:
       o Renamed cfcnfg_del_adapt_layer to cfcnfg_disconn_adapt_layer
       o Fixed typo cfcfg to cfcnfg
       o Renamed linkid to channel_id
       o Updated documentation in caif_dev.h
       o Minor formatting changes
      Signed-off-by: Sjur Braendeland <sjur.brandeland@stericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • sctp: Fix oops when sending queued ASCONF chunks · c0786693
      Authored by Vlad Yasevich
      When we finish processing an ASCONF_ACK chunk, we try to send the
      next queued ASCONF.  This runs the SCTP state machine recursively,
      and it is not prepared to do so.
      
      kernel BUG at kernel/timer.c:790!
      invalid opcode: 0000 [#1] SMP
      last sysfs file: /sys/module/ipv6/initstate
      Modules linked in: sha256_generic sctp libcrc32c ipv6 dm_multipath
      uinput 8139too i2c_piix4 8139cp mii i2c_core pcspkr virtio_net joydev
      floppy virtio_blk virtio_pci [last unloaded: scsi_wait_scan]
      
      Pid: 0, comm: swapper Not tainted 2.6.34-rc4 #15 /Bochs
      EIP: 0060:[<c044a2ef>] EFLAGS: 00010286 CPU: 0
      EIP is at add_timer+0xd/0x1b
      EAX: cecbab14 EBX: 000000f0 ECX: c0957b1c EDX: 03595cf4
      ESI: cecba800 EDI: cf276f00 EBP: c0957aa0 ESP: c0957aa0
       DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
      Process swapper (pid: 0, ti=c0956000 task=c0988ba0 task.ti=c0956000)
      Stack:
       c0957ae0 d1851214 c0ab62e4 c0ab5f26 0500ffff 00000004 00000005 00000004
      <0> 00000000 d18694fd 00000004 1666b892 cecba800 cecba800 c0957b14
      00000004
      <0> c0957b94 d1851b11 ceda8b00 cecba800 cf276f00 00000001 c0957b14
      000000d0
      Call Trace:
       [<d1851214>] ? sctp_side_effects+0x607/0xdfc [sctp]
       [<d1851b11>] ? sctp_do_sm+0x108/0x159 [sctp]
       [<d1863386>] ? sctp_pname+0x0/0x1d [sctp]
       [<d1861a56>] ? sctp_primitive_ASCONF+0x36/0x3b [sctp]
       [<d185657c>] ? sctp_process_asconf_ack+0x2a4/0x2d3 [sctp]
       [<d184e35c>] ? sctp_sf_do_asconf_ack+0x1dd/0x2b4 [sctp]
       [<d1851ac1>] ? sctp_do_sm+0xb8/0x159 [sctp]
       [<d1863334>] ? sctp_cname+0x0/0x52 [sctp]
       [<d1854377>] ? sctp_assoc_bh_rcv+0xac/0xe1 [sctp]
       [<d1858f0f>] ? sctp_inq_push+0x2d/0x30 [sctp]
       [<d186329d>] ? sctp_rcv+0x797/0x82e [sctp]
      Tested-by: Wei Yongjun <yjwei@cn.fujitsu.com>
      Signed-off-by: Yuansong Qiao <ysqiao@research.ait.ie>
      Signed-off-by: Shuaijun Zhang <szhang@research.ait.ie>
      Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • sctp: avoid irq lock inversion while call sk->sk_data_ready() · 561b1733
      Authored by Wei Yongjun
      sk->sk_data_ready() of an SCTP socket can be called from both BH and
      non-BH contexts, but the default sk->sk_data_ready(),
      sock_def_readable(), cannot be used in this case.  Therefore, we have
      to add a new function, sctp_data_ready(), which performs the
      sk->sk_data_ready() wakeup with BHs disabled.
      
      =========================================================
      [ INFO: possible irq lock inversion dependency detected ]
      2.6.33-rc6 #129
      ---------------------------------------------------------
      sctp_darn/1517 just changed the state of lock:
       (clock-AF_INET){++.?..}, at: [<c06aab60>] sock_def_readable+0x20/0x80
      but this lock took another, SOFTIRQ-unsafe lock in the past:
       (slock-AF_INET){+.-...}
      
      and interrupts could create inverse lock ordering between them.
      
      other info that might help us debug this:
      1 lock held by sctp_darn/1517:
       #0:  (sk_lock-AF_INET){+.+.+.}, at: [<cdfe363d>] sctp_sendmsg+0x23d/0xc00 [sctp]
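      
      A sketch of such a BH-safe wrapper, mirroring sock_def_readable() but
      with _bh locking as the description says (illustrative, not the exact
      patch):
      
          static void example_sctp_data_ready(struct sock *sk, int len)
          {
                  /* the _bh variant makes the lock safe against softirq callers */
                  read_lock_bh(&sk->sk_callback_lock);
                  if (sk_has_sleeper(sk))
                          wake_up_interruptible_sync_poll(sk_sleep(sk),
                                          POLLIN | POLLRDNORM | POLLRDBAND);
                  sk_wake_async(sk, SOCK_WAKE_WAITD, POLL_IN);
                  read_unlock_bh(&sk->sk_callback_lock);
          }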
      Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
      Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  5. 28 April 2010, 7 commits
  6. 26 April 2010, 1 commit
  7. 24 April 2010, 2 commits
  8. 23 April 2010, 4 commits
    • X25: Add if_x25.h and x25 to device identifiers · 5ebfbc06
      Authored by Andrew Hendry
      V2: feedback from John Hughes.
      - Add a header for userspace implementations such as xot/xoe to use
      - Use explicit values for interface stability
      - No changes to the driver patches
      
      V1:
      - Use identifiers instead of magic numbers for the X.25
        layer-3-to-device interface.
      - Also fixed checkpatch notes on the updated code.
      
      [ Add new user header to include/linux/Kbuild  -DaveM ]
      Signed-off-by: Andrew Hendry <andrew.hendry@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • dst: rcu check refinement · f68c224f
      Authored by Eric Dumazet
      __sk_dst_get() might be called from softirq context, with the socket
      lock held.
      
      [  159.026180] include/net/sock.h:1200 invoked rcu_dereference_check()
      without protection!
      [  159.026261] 
      [  159.026261] other info that might help us debug this:
      [  159.026263] 
      [  159.026425] 
      [  159.026426] rcu_scheduler_active = 1, debug_locks = 0
      [  159.026552] 2 locks held by swapper/0:
      [  159.026609]  #0:  (&icsk->icsk_retransmit_timer){+.-...}, at:
      [<ffffffff8104fc15>] run_timer_softirq+0x105/0x350
      [  159.026839]  #1:  (slock-AF_INET){+.-...}, at: [<ffffffff81392b8f>]
      tcp_write_timer+0x2f/0x1e0
      [  159.027063] 
      [  159.027064] stack backtrace:
      [  159.027172] Pid: 0, comm: swapper Not tainted
      2.6.34-rc5-03707-gde498c89-dirty #36
      [  159.027252] Call Trace:
      [  159.027306]  <IRQ>  [<ffffffff810718ef>] lockdep_rcu_dereference
      +0xaf/0xc0
      [  159.027411]  [<ffffffff8138e4f7>] tcp_current_mss+0xa7/0xb0
      [  159.027537]  [<ffffffff8138fa49>] tcp_write_wakeup+0x89/0x190
      [  159.027600]  [<ffffffff81391936>] tcp_send_probe0+0x16/0x100
      [  159.027726]  [<ffffffff81392cd9>] tcp_write_timer+0x179/0x1e0
      [  159.027790]  [<ffffffff8104fca1>] run_timer_softirq+0x191/0x350
      [  159.027980]  [<ffffffff810477ed>] __do_softirq+0xcd/0x200
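      
      A sketch of the refined check, built from the helpers visible in the
      splat above (illustrative):
      
          /* The dst may legitimately be read under rcu_read_lock() OR with
           * the socket locked/owned, so the lockdep condition accepts all
           * three cases. */
          static inline struct dst_entry *example_sk_dst_get(struct sock *sk)
          {
                  return rcu_dereference_check(sk->sk_dst_cache,
                                               rcu_read_lock_held() ||
                                               sock_owned_by_user(sk) ||
                                               lockdep_is_held(&sk->sk_lock.slock));
          }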
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: fix outsegs stat for TSO segments · aa2ea058
      Authored by Tom Herbert
      Account for the TSO segments of an skb in the TCP_MIB_OUTSEGS
      counter.  Without this, the counter can be off by orders of magnitude
      from the actual number of segments sent.
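      
      A one-line sketch of the change described (MIB macro usage assumed
      from the existing SNMP helpers; illustrative):
      
          /* Count every TSO sub-segment, not just one per transmitted skb. */
          TCP_ADD_STATS(sock_net(sk), TCP_MIB_OUTSEGS, tcp_skb_pcount(skb));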
      Signed-off-by: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  9. 22 April 2010, 1 commit
  10. 21 April 2010, 1 commit
  11. 20 April 2010, 5 commits
  12. 17 April 2010, 2 commits
    • rfs: Receive Flow Steering · fec5e652
      Authored by Tom Herbert
      This patch implements receive flow steering (RFS).  RFS steers
      received packets for layer 3 and 4 processing to the CPU where
      the application for the corresponding flow is running.  RFS is an
      extension of Receive Packet Steering (RPS).
      
      The basic idea of RFS is that when an application calls recvmsg
      (or sendmsg), the application's running CPU is stored in a hash
      table that is indexed by the connection's rxhash, which is stored in
      the socket structure.  The rxhash is carried in skbs received on
      the connection from netif_receive_skb.  For each received packet,
      the associated rxhash is used to look up the CPU in the hash table;
      if a valid CPU is set, the packet is steered to that CPU using
      the RPS mechanisms.
      
      The complication with this simple approach is that it would
      potentially allow OOO packets.  If threads are thrashing around CPUs,
      or multiple threads are trying to read from the same sockets, a
      quickly changing CPU value in the hash table could cause rampant OOO
      packets-- we consider this a non-starter.
      
      To avoid OOO packets, this solution implements two types of hash
      tables: rps_sock_flow_table and rps_dev_flow_table.
      
      rps_sock_flow_table is a global hash table.  Each entry is just a
      CPU number, and it is populated in recvmsg and sendmsg as described
      above.  This table contains the "desired" CPUs for flows.
      
      rps_dev_flow_table is specific to each device queue.  Each entry
      contains a CPU and a tail queue counter.  The CPU is the "current"
      CPU for a matching flow.  The tail queue counter holds the value
      of a tail queue counter for the associated CPU's backlog queue at
      the time of last enqueue for a flow matching the entry.
      
      Each backlog queue has a queue head counter which is incremented
      on dequeue, and so a queue tail counter is computed as queue head
      count + queue length.  When a packet is enqueued on a backlog queue,
      the current value of the queue tail counter is saved in the hash
      entry of the rps_dev_flow_table.
      
      And now the trick: when selecting the CPU for RPS (get_rps_cpu),
      the rps_sock_flow table and the rps_dev_flow table for the RX queue
      are consulted.  When the desired CPU for the flow (found in the
      rps_sock_flow table) does not match the current CPU (found in the
      rps_dev_flow table), the current CPU is changed to the desired CPU
      if one of the following is true (see the sketch after this list):
      
      - The current CPU is unset (equal to RPS_NO_CPU)
      - The current CPU is offline
      - The current CPU's queue head counter >= the queue tail counter in
      the rps_dev_flow table.  This checks whether the queue tail has
      advanced beyond the last packet that was enqueued using this table
      entry, which guarantees that all packets queued using this entry
      have been dequeued, thus preserving in-order delivery.
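      
      A condensed sketch of that selection rule (field and macro names
      follow the description above; details simplified):
      
          struct example_dev_flow {
                  u16 cpu;                  /* "current" CPU for the flow */
                  unsigned int last_qtail;  /* backlog tail count at last enqueue */
          };
      
          static u16 example_select_cpu(u16 desired, struct example_dev_flow *rflow,
                                        unsigned int queue_head)
          {
                  u16 cur = rflow->cpu;
      
                  if (desired != cur &&
                      (cur == RPS_NO_CPU ||
                       !cpu_online(cur) ||
                       /* head passed the last enqueue: the old backlog has
                        * drained, so moving the flow cannot reorder packets */
                       (int)(queue_head - rflow->last_qtail) >= 0))
                          rflow->cpu = cur = desired;
      
                  return cur;
          }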
      
      Making each queue have its own rps_dev_flow table has two advantages:
      1) the tail queue counters will be written on each receive, so
      keeping the table local to the interrupting CPU is good for locality.
      2) it allows lockless access to the table: the CPU number and queue
      tail counter need to be accessed together under mutual exclusion,
      which holds because netif_receive_skb is assumed to be called only
      from device napi_poll, which is non-reentrant.
      
      This patch implements RFS for TCP and connected UDP sockets.
      It should be usable for other flow-oriented protocols.
      
      There are two configuration parameters for RFS.  The
      "rps_flow_entries" kernel init parameter sets the number of
      entries in the rps_sock_flow_table; the per-rxqueue sysfs entry
      "rps_flow_cnt" contains the number of entries in the rps_dev_flow
      table for that rxqueue.  Both are rounded up to a power of two.
      
      The obvious benefit of RFS (over just RPS) is that it achieves
      CPU locality between the receive processing for a flow and the
      application's processing; this can result in increased performance
      (higher pps, lower latency).
      
      The benefits of RFS are dependent on cache hierarchy, application
      load, and other factors.  On simple benchmarks, we don't necessarily
      see improvement and sometimes see degradation.  However, for more
      complex benchmarks and for applications where cache pressure is
      much higher this technique seems to perform very well.
      
      Below are some benchmark results which show the potential benefit of
      this patch.  The netperf test has 500 instances of the netperf TCP_RR
      test with 1-byte requests and responses.  The RPC test is a
      request/response test similar in structure to the netperf RR test,
      with 100 threads on each host, but doing more work in userspace than
      netperf.
      
      e1000e on 8 core Intel
         No RFS or RPS             104K tps at 30% CPU
         No RFS (best RPS config)  290K tps at 63% CPU
         RFS                       303K tps at 61% CPU
      
      RPC test	tps	CPU%	50/90/99% usec latency	Latency StdDev
        No RFS/RPS	103K	48%	757/900/3185		4472.35
        RPS only:	174K	73%	415/993/2468		491.66
        RFS		223K	73%	379/651/1382		315.61
      Signed-off-by: Tom Herbert <therbert@google.com>
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mac80211: add LDPC control flag · 0a56bd0a
      Authored by Luis R. Rodriguez
      LDPC will be enabled through the rate control algorithm for each
      buffer via the tx_info flags.
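      
      A sketch of a rate-control algorithm requesting LDPC for one buffer
      (the flag is the one this commit adds; the surrounding condition is
      hypothetical):
      
          struct ieee80211_tx_info *info = IEEE80211_SKB_CB(skb);
      
          if (peer_supports_ldpc)                 /* hypothetical condition */
                  info->flags |= IEEE80211_TX_CTL_LDPC;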
      Signed-off-by: Luis R. Rodriguez <lrodriguez@atheros.com>
      Signed-off-by: John W. Linville <linville@tuxdriver.com>