1. 25 9月, 2012 1 次提交
    • E
      net: use a per task frag allocator · 5640f768
      Eric Dumazet 提交于
      We currently use a per socket order-0 page cache for tcp_sendmsg()
      operations.
      
      This page is used to build fragments for skbs.
      
      Its done to increase probability of coalescing small write() into
      single segments in skbs still in write queue (not yet sent)
      
      But it wastes a lot of memory for applications handling many mostly
      idle sockets, since each socket holds one page in sk->sk_sndmsg_page
      
      Its also quite inefficient to build TSO 64KB packets, because we need
      about 16 pages per skb on arches where PAGE_SIZE = 4096, so we hit
      page allocator more than wanted.
      
      This patch adds a per task frag allocator and uses bigger pages,
      if available. An automatic fallback is done in case of memory pressure.
      
      (up to 32768 bytes per frag, thats order-3 pages on x86)
      
      This increases TCP stream performance by 20% on loopback device,
      but also benefits on other network devices, since 8x less frags are
      mapped on transmit and unmapped on tx completion. Alexander Duyck
      mentioned a probable performance win on systems with IOMMU enabled.
      
      Its possible some SG enabled hardware cant cope with bigger fragments,
      but their ndo_start_xmit() should already handle this, splitting a
      fragment in sub fragments, since some arches have PAGE_SIZE=65536
      
      Successfully tested on various ethernet devices.
      (ixgbe, igb, bnx2x, tg3, mellanox mlx4)
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Ben Hutchings <bhutchings@solarflare.com>
      Cc: Vijay Subramanian <subramanian.vijay@gmail.com>
      Cc: Alexander Duyck <alexander.h.duyck@intel.com>
      Tested-by: NVijay Subramanian <subramanian.vijay@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5640f768
  2. 15 8月, 2012 2 次提交
  3. 02 8月, 2012 1 次提交
  4. 01 8月, 2012 6 次提交
    • M
      netvm: prevent a stream-specific deadlock · c76562b6
      Mel Gorman 提交于
      This patch series is based on top of "Swap-over-NBD without deadlocking
      v15" as it depends on the same reservation of PF_MEMALLOC reserves logic.
      
      When a user or administrator requires swap for their application, they
      create a swap partition and file, format it with mkswap and activate it
      with swapon.  In diskless systems this is not an option so if swap if
      required then swapping over the network is considered.  The two likely
      scenarios are when blade servers are used as part of a cluster where the
      form factor or maintenance costs do not allow the use of disks and thin
      clients.
      
      The Linux Terminal Server Project recommends the use of the Network Block
      Device (NBD) for swap but this is not always an option.  There is no
      guarantee that the network attached storage (NAS) device is running Linux
      or supports NBD.  However, it is likely that it supports NFS so there are
      users that want support for swapping over NFS despite any performance
      concern.  Some distributions currently carry patches that support swapping
      over NFS but it would be preferable to support it in the mainline kernel.
      
      Patch 1 avoids a stream-specific deadlock that potentially affects TCP.
      
      Patch 2 is a small modification to SELinux to avoid using PFMEMALLOC
      	reserves.
      
      Patch 3 adds three helpers for filesystems to handle swap cache pages.
      	For example, page_file_mapping() returns page->mapping for
      	file-backed pages and the address_space of the underlying
      	swap file for swap cache pages.
      
      Patch 4 adds two address_space_operations to allow a filesystem
      	to pin all metadata relevant to a swapfile in memory. Upon
      	successful activation, the swapfile is marked SWP_FILE and
      	the address space operation ->direct_IO is used for writing
      	and ->readpage for reading in swap pages.
      
      Patch 5 notes that patch 3 is bolting
      	filesystem-specific-swapfile-support onto the side and that
      	the default handlers have different information to what
      	is available to the filesystem. This patch refactors the
      	code so that there are generic handlers for each of the new
      	address_space operations.
      
      Patch 6 adds an API to allow a vector of kernel addresses to be
      	translated to struct pages and pinned for IO.
      
      Patch 7 adds support for using highmem pages for swap by kmapping
      	the pages before calling the direct_IO handler.
      
      Patch 8 updates NFS to use the helpers from patch 3 where necessary.
      
      Patch 9 avoids setting PF_private on PG_swapcache pages within NFS.
      
      Patch 10 implements the new swapfile-related address_space operations
      	for NFS and teaches the direct IO handler how to manage
      	kernel addresses.
      
      Patch 11 prevents page allocator recursions in NFS by using GFP_NOIO
      	where appropriate.
      
      Patch 12 fixes a NULL pointer dereference that occurs when using
      	swap-over-NFS.
      
      With the patches applied, it is possible to mount a swapfile that is on an
      NFS filesystem.  Swap performance is not great with a swap stress test
      taking roughly twice as long to complete than if the swap device was
      backed by NBD.
      
      This patch: netvm: prevent a stream-specific deadlock
      
      It could happen that all !SOCK_MEMALLOC sockets have buffered so much data
      that we're over the global rmem limit.  This will prevent SOCK_MEMALLOC
      buffers from receiving data, which will prevent userspace from running,
      which is needed to reduce the buffered data.
      
      Fix this by exempting the SOCK_MEMALLOC sockets from the rmem limit.  Once
      this change it applied, it is important that sockets that set
      SOCK_MEMALLOC do not clear the flag until the socket is being torn down.
      If this happens, a warning is generated and the tokens reclaimed to avoid
      accounting errors until the bug is fixed.
      
      [davem@davemloft.net: Warning about clearing SOCK_MEMALLOC]
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NDavid S. Miller <davem@davemloft.net>
      Acked-by: NRik van Riel <riel@redhat.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Mike Christie <michaelc@cs.wisc.edu>
      Cc: Eric B Munson <emunson@mgebm.net>
      Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c76562b6
    • M
      netvm: set PF_MEMALLOC as appropriate during SKB processing · b4b9e355
      Mel Gorman 提交于
      In order to make sure pfmemalloc packets receive all memory needed to
      proceed, ensure processing of pfmemalloc SKBs happens under PF_MEMALLOC.
      This is limited to a subset of protocols that are expected to be used for
      writing to swap.  Taps are not allowed to use PF_MEMALLOC as these are
      expected to communicate with userspace processes which could be paged out.
      
      [a.p.zijlstra@chello.nl: Ideas taken from various patches]
      [jslaby@suse.cz: Lock imbalance fix]
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NDavid S. Miller <davem@davemloft.net>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mike Christie <michaelc@cs.wisc.edu>
      Cc: Eric B Munson <emunson@mgebm.net>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b4b9e355
    • M
      netvm: allow skb allocation to use PFMEMALLOC reserves · c93bdd0e
      Mel Gorman 提交于
      Change the skb allocation API to indicate RX usage and use this to fall
      back to the PFMEMALLOC reserve when needed.  SKBs allocated from the
      reserve are tagged in skb->pfmemalloc.  If an SKB is allocated from the
      reserve and the socket is later found to be unrelated to page reclaim, the
      packet is dropped so that the memory remains available for page reclaim.
      Network protocols are expected to recover from this packet loss.
      
      [a.p.zijlstra@chello.nl: Ideas taken from various patches]
      [davem@davemloft.net: Use static branches, coding style corrections]
      [sebastian@breakpoint.cc: Avoid unnecessary cast, fix !CONFIG_NET build]
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NDavid S. Miller <davem@davemloft.net>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mike Christie <michaelc@cs.wisc.edu>
      Cc: Eric B Munson <emunson@mgebm.net>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c93bdd0e
    • M
      netvm: allow the use of __GFP_MEMALLOC by specific sockets · 7cb02404
      Mel Gorman 提交于
      Allow specific sockets to be tagged SOCK_MEMALLOC and use __GFP_MEMALLOC
      for their allocations.  These sockets will be able to go below watermarks
      and allocate from the emergency reserve.  Such sockets are to be used to
      service the VM (iow.  to swap over).  They must be handled kernel side,
      exposing such a socket to user-space is a bug.
      
      There is a risk that the reserves be depleted so for now, the
      administrator is responsible for increasing min_free_kbytes as necessary
      to prevent deadlock for their workloads.
      
      [a.p.zijlstra@chello.nl: Original patches]
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NDavid S. Miller <davem@davemloft.net>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mike Christie <michaelc@cs.wisc.edu>
      Cc: Eric B Munson <emunson@mgebm.net>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7cb02404
    • M
      net: introduce sk_gfp_atomic() to allow addition of GFP flags depending on the individual socket · 99a1dec7
      Mel Gorman 提交于
      Introduce sk_gfp_atomic(), this function allows to inject sock specific
      flags to each sock related allocation.  It is only used on allocation
      paths that may be required for writing pages back to network storage.
      
      [davem@davemloft.net: Use sk_gfp_atomic only when necessary]
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NDavid S. Miller <davem@davemloft.net>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Mike Christie <michaelc@cs.wisc.edu>
      Cc: Eric B Munson <emunson@mgebm.net>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      99a1dec7
    • A
      memcg: rename config variables · c255a458
      Andrew Morton 提交于
      Sanity:
      
      CONFIG_CGROUP_MEM_RES_CTLR -> CONFIG_MEMCG
      CONFIG_CGROUP_MEM_RES_CTLR_SWAP -> CONFIG_MEMCG_SWAP
      CONFIG_CGROUP_MEM_RES_CTLR_SWAP_ENABLED -> CONFIG_MEMCG_SWAP_ENABLED
      CONFIG_CGROUP_MEM_RES_CTLR_KMEM -> CONFIG_MEMCG_KMEM
      
      [mhocko@suse.cz: fix missed bits]
      Cc: Glauber Costa <glommer@parallels.com>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c255a458
  5. 23 7月, 2012 1 次提交
    • E
      tcp: dont drop MTU reduction indications · 563d34d0
      Eric Dumazet 提交于
      ICMP messages generated in output path if frame length is bigger than
      mtu are actually lost because socket is owned by user (doing the xmit)
      
      One example is the ipgre_tunnel_xmit() calling
      icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED, htonl(mtu));
      
      We had a similar case fixed in commit a34a101e (ipv6: disable GSO on
      sockets hitting dst_allfrag).
      
      Problem of such fix is that it relied on retransmit timers, so short tcp
      sessions paid a too big latency increase price.
      
      This patch uses the tcp_release_cb() infrastructure so that MTU
      reduction messages (ICMP messages) are not lost, and no extra delay
      is added in TCP transmits.
      Reported-by: NMaciej Żenczykowski <maze@google.com>
      Diagnosed-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Nandita Dukkipati <nanditad@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Cc: Tore Anderson <tore@fud.no>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      563d34d0
  6. 12 7月, 2012 1 次提交
    • E
      tcp: TCP Small Queues · 46d3ceab
      Eric Dumazet 提交于
      This introduce TSQ (TCP Small Queues)
      
      TSQ goal is to reduce number of TCP packets in xmit queues (qdisc &
      device queues), to reduce RTT and cwnd bias, part of the bufferbloat
      problem.
      
      sk->sk_wmem_alloc not allowed to grow above a given limit,
      allowing no more than ~128KB [1] per tcp socket in qdisc/dev layers at a
      given time.
      
      TSO packets are sized/capped to half the limit, so that we have two
      TSO packets in flight, allowing better bandwidth use.
      
      As a side effect, setting the limit to 40000 automatically reduces the
      standard gso max limit (65536) to 40000/2 : It can help to reduce
      latencies of high prio packets, having smaller TSO packets.
      
      This means we divert sock_wfree() to a tcp_wfree() handler, to
      queue/send following frames when skb_orphan() [2] is called for the
      already queued skbs.
      
      Results on my dev machines (tg3/ixgbe nics) are really impressive,
      using standard pfifo_fast, and with or without TSO/GSO.
      
      Without reduction of nominal bandwidth, we have reduction of buffering
      per bulk sender :
      < 1ms on Gbit (instead of 50ms with TSO)
      < 8ms on 100Mbit (instead of 132 ms)
      
      I no longer have 4 MBytes backlogged in qdisc by a single netperf
      session, and both side socket autotuning no longer use 4 Mbytes.
      
      As skb destructor cannot restart xmit itself ( as qdisc lock might be
      taken at this point ), we delegate the work to a tasklet. We use one
      tasklest per cpu for performance reasons.
      
      If tasklet finds a socket owned by the user, it sets TSQ_OWNED flag.
      This flag is tested in a new protocol method called from release_sock(),
      to eventually send new segments.
      
      [1] New /proc/sys/net/ipv4/tcp_limit_output_bytes tunable
      [2] skb_orphan() is usually called at TX completion time,
        but some drivers call it in their start_xmit() handler.
        These drivers should at least use BQL, or else a single TCP
        session can still fill the whole NIC TX ring, since TSQ will
        have no effect.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Dave Taht <dave.taht@bufferbloat.net>
      Cc: Tom Herbert <therbert@google.com>
      Cc: Matt Mathis <mattmathis@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Nandita Dukkipati <nanditad@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      46d3ceab
  7. 26 6月, 2012 2 次提交
  8. 20 6月, 2012 1 次提交
    • D
      ipv4: Early TCP socket demux. · 41063e9d
      David S. Miller 提交于
      Input packet processing for local sockets involves two major demuxes.
      One for the route and one for the socket.
      
      But we can optimize this down to one demux for certain kinds of local
      sockets.
      
      Currently we only do this for established TCP sockets, but it could
      at least in theory be expanded to other kinds of connections.
      
      If a TCP socket is established then it's identity is fully specified.
      
      This means that whatever input route was used during the three-way
      handshake must work equally well for the rest of the connection since
      the keys will not change.
      
      Once we move to established state, we cache the receive packet's input
      route to use later.
      
      Like the existing cached route in sk->sk_dst_cache used for output
      packets, we have to check for route invalidations using dst->obsolete
      and dst->ops->check().
      
      Early demux occurs outside of a socket locked section, so when a route
      invalidation occurs we defer the fixup of sk->sk_rx_dst until we are
      actually inside of established state packet processing and thus have
      the socket locked.
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      41063e9d
  9. 30 5月, 2012 1 次提交
    • G
      memcg: decrement static keys at real destroy time · 3f134619
      Glauber Costa 提交于
      We call the destroy function when a cgroup starts to be removed, such as
      by a rmdir event.
      
      However, because of our reference counters, some objects are still
      inflight.  Right now, we are decrementing the static_keys at destroy()
      time, meaning that if we get rid of the last static_key reference, some
      objects will still have charges, but the code to properly uncharge them
      won't be run.
      
      This becomes a problem specially if it is ever enabled again, because now
      new charges will be added to the staled charges making keeping it pretty
      much impossible.
      
      We just need to be careful with the static branch activation: since there
      is no particular preferred order of their activation, we need to make sure
      that we only start using it after all call sites are active.  This is
      achieved by having a per-memcg flag that is only updated after
      static_key_slow_inc() returns.  At this time, we are sure all sites are
      active.
      
      This is made per-memcg, not global, for a reason: it also has the effect
      of making socket accounting more consistent.  The first memcg to be
      limited will trigger static_key() activation, therefore, accounting.  But
      all the others will then be accounted no matter what.  After this patch,
      only limited memcgs will have its sockets accounted.
      
      [akpm@linux-foundation.org: move enum sock_flag_bits into sock.h,
                                  document enum sock_flag_bits,
                                  convert memcg_proto_active() and memcg_proto_activated() to test_bit(),
                                  redo tcp_update_limit() comment to 80 cols]
      Signed-off-by: NGlauber Costa <glommer@parallels.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Li Zefan <lizefan@huawei.com>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Acked-by: NDavid Miller <davem@davemloft.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3f134619
  10. 17 5月, 2012 2 次提交
  11. 01 5月, 2012 1 次提交
    • E
      net: fix sk_sockets_allocated_read_positive · 518fbf9c
      Eric Dumazet 提交于
      Denys Fedoryshchenko reported frequent crashes on a proxy server and kindly
      provided a lockdep report that explains it all :
      
        [  762.903868]
        [  762.903880] =================================
        [  762.903890] [ INFO: inconsistent lock state ]
        [  762.903903] 3.3.4-build-0061 #8 Not tainted
        [  762.904133] ---------------------------------
        [  762.904344] inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage.
        [  762.904542] squid/1603 [HC0[0]:SC0[0]:HE1:SE1] takes:
        [  762.904542]  (key#3){+.?...}, at: [<c0232cc4>]
      __percpu_counter_sum+0xd/0x58
        [  762.904542] {IN-SOFTIRQ-W} state was registered at:
        [  762.904542]   [<c0158b84>] __lock_acquire+0x284/0xc26
        [  762.904542]   [<c01598e8>] lock_acquire+0x71/0x85
        [  762.904542]   [<c0349765>] _raw_spin_lock+0x33/0x40
        [  762.904542]   [<c0232c93>] __percpu_counter_add+0x58/0x7c
        [  762.904542]   [<c02cfde1>] sk_clone_lock+0x1e5/0x200
        [  762.904542]   [<c0303ee4>] inet_csk_clone_lock+0xe/0x78
        [  762.904542]   [<c0315778>] tcp_create_openreq_child+0x1b/0x404
        [  762.904542]   [<c031339c>] tcp_v4_syn_recv_sock+0x32/0x1c1
        [  762.904542]   [<c031615a>] tcp_check_req+0x1fd/0x2d7
        [  762.904542]   [<c0313f77>] tcp_v4_do_rcv+0xab/0x194
        [  762.904542]   [<c03153bb>] tcp_v4_rcv+0x3b3/0x5cc
        [  762.904542]   [<c02fc0c4>] ip_local_deliver_finish+0x13a/0x1e9
        [  762.904542]   [<c02fc539>] NF_HOOK.clone.11+0x46/0x4d
        [  762.904542]   [<c02fc652>] ip_local_deliver+0x41/0x45
        [  762.904542]   [<c02fc4d1>] ip_rcv_finish+0x31a/0x33c
        [  762.904542]   [<c02fc539>] NF_HOOK.clone.11+0x46/0x4d
        [  762.904542]   [<c02fc857>] ip_rcv+0x201/0x23e
        [  762.904542]   [<c02daa3a>] __netif_receive_skb+0x319/0x368
        [  762.904542]   [<c02dac07>] netif_receive_skb+0x4e/0x7d
        [  762.904542]   [<c02dacf6>] napi_skb_finish+0x1e/0x34
        [  762.904542]   [<c02db122>] napi_gro_receive+0x20/0x24
        [  762.904542]   [<f85d1743>] e1000_receive_skb+0x3f/0x45 [e1000e]
        [  762.904542]   [<f85d3464>] e1000_clean_rx_irq+0x1f9/0x284 [e1000e]
        [  762.904542]   [<f85d3926>] e1000_clean+0x62/0x1f4 [e1000e]
        [  762.904542]   [<c02db228>] net_rx_action+0x90/0x160
        [  762.904542]   [<c012a445>] __do_softirq+0x7b/0x118
        [  762.904542] irq event stamp: 156915469
        [  762.904542] hardirqs last  enabled at (156915469): [<c019b4f4>]
      __slab_alloc.clone.58.clone.63+0xc4/0x2de
        [  762.904542] hardirqs last disabled at (156915468): [<c019b452>]
      __slab_alloc.clone.58.clone.63+0x22/0x2de
        [  762.904542] softirqs last  enabled at (156915466): [<c02ce677>]
      lock_sock_nested+0x64/0x6c
        [  762.904542] softirqs last disabled at (156915464): [<c0349914>]
      _raw_spin_lock_bh+0xe/0x45
        [  762.904542]
        [  762.904542] other info that might help us debug this:
        [  762.904542]  Possible unsafe locking scenario:
        [  762.904542]
        [  762.904542]        CPU0
        [  762.904542]        ----
        [  762.904542]   lock(key#3);
        [  762.904542]   <Interrupt>
        [  762.904542]     lock(key#3);
        [  762.904542]
        [  762.904542]  *** DEADLOCK ***
        [  762.904542]
        [  762.904542] 1 lock held by squid/1603:
        [  762.904542]  #0:  (sk_lock-AF_INET){+.+.+.}, at: [<c03055c0>]
      lock_sock+0xa/0xc
        [  762.904542]
        [  762.904542] stack backtrace:
        [  762.904542] Pid: 1603, comm: squid Not tainted 3.3.4-build-0061 #8
        [  762.904542] Call Trace:
        [  762.904542]  [<c0347b73>] ? printk+0x18/0x1d
        [  762.904542]  [<c015873a>] valid_state+0x1f6/0x201
        [  762.904542]  [<c0158816>] mark_lock+0xd1/0x1bb
        [  762.904542]  [<c015876b>] ? mark_lock+0x26/0x1bb
        [  762.904542]  [<c015805d>] ? check_usage_forwards+0x77/0x77
        [  762.904542]  [<c0158bf8>] __lock_acquire+0x2f8/0xc26
        [  762.904542]  [<c0159b8e>] ? mark_held_locks+0x5d/0x7b
        [  762.904542]  [<c0159cf6>] ? trace_hardirqs_on+0xb/0xd
        [  762.904542]  [<c0158dd4>] ? __lock_acquire+0x4d4/0xc26
        [  762.904542]  [<c01598e8>] lock_acquire+0x71/0x85
        [  762.904542]  [<c0232cc4>] ? __percpu_counter_sum+0xd/0x58
        [  762.904542]  [<c0349765>] _raw_spin_lock+0x33/0x40
        [  762.904542]  [<c0232cc4>] ? __percpu_counter_sum+0xd/0x58
        [  762.904542]  [<c0232cc4>] __percpu_counter_sum+0xd/0x58
        [  762.904542]  [<c02cebc4>] __sk_mem_schedule+0xdd/0x1c7
        [  762.904542]  [<c02d178d>] ? __alloc_skb+0x76/0x100
        [  762.904542]  [<c0305e8e>] sk_wmem_schedule+0x21/0x2d
        [  762.904542]  [<c0306370>] sk_stream_alloc_skb+0x42/0xaa
        [  762.904542]  [<c0306567>] tcp_sendmsg+0x18f/0x68b
        [  762.904542]  [<c031f3dc>] ? ip_fast_csum+0x30/0x30
        [  762.904542]  [<c0320193>] inet_sendmsg+0x53/0x5a
        [  762.904542]  [<c02cb633>] sock_aio_write+0xd2/0xda
        [  762.904542]  [<c015876b>] ? mark_lock+0x26/0x1bb
        [  762.904542]  [<c01a1017>] do_sync_write+0x9f/0xd9
        [  762.904542]  [<c01a2111>] ? file_free_rcu+0x2f/0x2f
        [  762.904542]  [<c01a17a1>] vfs_write+0x8f/0xab
        [  762.904542]  [<c01a284d>] ? fget_light+0x75/0x7c
        [  762.904542]  [<c01a1900>] sys_write+0x3d/0x5e
        [  762.904542]  [<c0349ec9>] syscall_call+0x7/0xb
        [  762.904542]  [<c0340000>] ? rp_sidt+0x41/0x83
      
      Bug is that sk_sockets_allocated_read_positive() calls
      percpu_counter_sum_positive() without BH being disabled.
      
      This bug was added in commit 180d8cd9
      (foundations of per-cgroup memory pressure controlling.), since previous
      code was using percpu_counter_read_positive() which is IRQ safe.
      
      In __sk_mem_schedule() we dont need the precise count of allocated
      sockets and can revert to previous behavior.
      Reported-by: NDenys Fedoryshchenko <denys@visp.net.lb>
      Sined-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Glauber Costa <glommer@parallels.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      518fbf9c
  12. 24 4月, 2012 1 次提交
    • E
      net: add a limit parameter to sk_add_backlog() · f545a38f
      Eric Dumazet 提交于
      sk_add_backlog() & sk_rcvqueues_full() hard coded sk_rcvbuf as the
      memory limit. We need to make this limit a parameter for TCP use.
      
      No functional change expected in this patch, all callers still using the
      old sk_rcvbuf limit.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Cc: Maciej Żenczykowski <maze@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Cc: Rick Jones <rick.jones2@hp.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f545a38f
  13. 22 4月, 2012 1 次提交
  14. 18 4月, 2012 1 次提交
  15. 11 4月, 2012 1 次提交
    • G
      cgroup: pass struct mem_cgroup instead of struct cgroup to socket memcg · 1d62e436
      Glauber Costa 提交于
      The only reason cgroup was used, was to be consistent with the populate()
      interface. Now that we're getting rid of it, not only we no longer need
      it, but we also *can't* call it this way.
      
      Since we will no longer rely on populate(), this will be called from
      create(). During create, the association between struct mem_cgroup
      and struct cgroup does not yet exist, since cgroup internals hasn't
      yet initialized its bookkeeping. This means we would not be able
      to draw the memcg pointer from the cgroup pointer in these
      functions, which is highly undesirable.
      Signed-off-by: NGlauber Costa <glommer@parallels.com>
      Acked-by: NKamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      CC: Li Zefan <lizefan@huawei.com>
      CC: Johannes Weiner <hannes@cmpxchg.org>
      CC: Michal Hocko <mhocko@suse.cz>
      1d62e436
  16. 24 3月, 2012 1 次提交
    • H
      poll: add poll_requested_events() and poll_does_not_wait() functions · 626cf236
      Hans Verkuil 提交于
      In some cases the poll() implementation in a driver has to do different
      things depending on the events the caller wants to poll for.  An example
      is when a driver needs to start a DMA engine if the caller polls for
      POLLIN, but doesn't want to do that if POLLIN is not requested but instead
      only POLLOUT or POLLPRI is requested.  This is something that can happen
      in the video4linux subsystem among others.
      
      Unfortunately, the current epoll/poll/select implementation doesn't
      provide that information reliably.  The poll_table_struct does have it: it
      has a key field with the event mask.  But once a poll() call matches one
      or more bits of that mask any following poll() calls are passed a NULL
      poll_table pointer.
      
      Also, the eventpoll implementation always left the key field at ~0 instead
      of using the requested events mask.
      
      This was changed in eventpoll.c so the key field now contains the actual
      events that should be polled for as set by the caller.
      
      The solution to the NULL poll_table pointer is to set the qproc field to
      NULL in poll_table once poll() matches the events, not the poll_table
      pointer itself.  That way drivers can obtain the mask through a new
      poll_requested_events inline.
      
      The poll_table_struct can still be NULL since some kernel code calls it
      internally (netfs_state_poll() in ./drivers/staging/pohmelfs/netfs.h).  In
      that case poll_requested_events() returns ~0 (i.e.  all events).
      
      Very rarely drivers might want to know whether poll_wait will actually
      wait.  If another earlier file descriptor in the set already matched the
      events the caller wanted to wait for, then the kernel will return from the
      select() call without waiting.  This might be useful information in order
      to avoid doing expensive work.
      
      A new helper function poll_does_not_wait() is added that drivers can use
      to detect this situation.  This is now used in sock_poll_wait() in
      include/net/sock.h.  This was the only place in the kernel that needed
      this information.
      
      Drivers should no longer access any of the poll_table internals, but use
      the poll_requested_events() and poll_does_not_wait() access functions
      instead.  In order to enforce that the poll_table fields are now prepended
      with an underscore and a comment was added warning against using them
      directly.
      
      This required a change in unix_dgram_poll() in unix/af_unix.c which used
      the key field to get the requested events.  It's been replaced by a call
      to poll_requested_events().
      
      For qproc it was especially important to change its name since the
      behavior of that field changes with this patch since this function pointer
      can now be NULL when that wasn't possible in the past.
      
      Any driver accessing the qproc or key fields directly will now fail to compile.
      
      Some notes regarding the correctness of this patch: the driver's poll()
      function is called with a 'struct poll_table_struct *wait' argument.  This
      pointer may or may not be NULL, drivers can never rely on it being one or
      the other as that depends on whether or not an earlier file descriptor in
      the select()'s fdset matched the requested events.
      
      There are only three things a driver can do with the wait argument:
      
      1) obtain the key field:
      
      	events = wait ? wait->key : ~0;
      
         This will still work although it should be replaced with the new
         poll_requested_events() function (which does exactly the same).
         This will now even work better, since wait is no longer set to NULL
         unnecessarily.
      
      2) use the qproc callback. This could be deadly since qproc can now be
         NULL. Renaming qproc should prevent this from happening. There are no
         kernel drivers that actually access this callback directly, BTW.
      
      3) test whether wait == NULL to determine whether poll would return without
         waiting. This is no longer sufficient as the correct test is now
         wait == NULL || wait->_qproc == NULL.
      
         However, the worst that can happen here is a slight performance hit in
         the case where wait != NULL and wait->_qproc == NULL. In that case the
         driver will assume that poll_wait() will actually add the fd to the set
         of waiting file descriptors. Of course, poll_wait() will not do that
         since it tests for wait->_qproc. This will not break anything, though.
      
         There is only one place in the whole kernel where this happens
         (sock_poll_wait() in include/net/sock.h) and that code will be replaced
         by a call to poll_does_not_wait() in the next patch.
      
         Note that even if wait->_qproc != NULL drivers cannot rely on poll_wait()
         actually waiting. The next file descriptor from the set might match the
         event mask and thus any possible waits will never happen.
      Signed-off-by: NHans Verkuil <hans.verkuil@cisco.com>
      Reviewed-by: NJonathan Corbet <corbet@lwn.net>
      Reviewed-by: NAl Viro <viro@zeniv.linux.org.uk>
      Cc: Davide Libenzi <davidel@xmailserver.org>
      Signed-off-by: NHans de Goede <hdegoede@redhat.com>
      Cc: Mauro Carvalho Chehab <mchehab@infradead.org>
      Cc: David Miller <davem@davemloft.net>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      626cf236
  17. 24 2月, 2012 2 次提交
  18. 22 2月, 2012 1 次提交
    • P
      sock: Introduce the SO_PEEK_OFF sock option · ef64a54f
      Pavel Emelyanov 提交于
      This one specifies where to start MSG_PEEK-ing queue data from. When
      set to negative value means that MSG_PEEK works as ususally -- peeks
      from the head of the queue always.
      
      When some bytes are peeked from queue and the peeking offset is non
      negative it is moved forward so that the next peek will return next
      portion of data.
      
      When non-peeking recvmsg occurs and the peeking offset is non negative
      is is moved backward so that the next peek will still peek the proper
      data (i.e. the one that would have been picked if there were no non
      peeking recv in between).
      
      The offset is set using per-proto opteration to let the protocol handle
      the locking issues and to check whether the peeking offset feature is
      supported by the protocol the socket belongs to.
      Signed-off-by: NPavel Emelyanov <xemul@parallels.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ef64a54f
  19. 14 2月, 2012 1 次提交
  20. 03 2月, 2012 1 次提交
    • L
      cgroup: remove cgroup_subsys argument from callbacks · 761b3ef5
      Li Zefan 提交于
      The argument is not used at all, and it's not necessary, because
      a specific callback handler of course knows which subsys it
      belongs to.
      
      Now only ->pupulate() takes this argument, because the handlers of
      this callback always call cgroup_add_file()/cgroup_add_files().
      
      So we reduce a few lines of code, though the shrinking of object size
      is minimal.
      
       16 files changed, 113 insertions(+), 162 deletions(-)
      
         text    data     bss     dec     hex filename
      5486240  656987 7039960 13183187         c928d3 vmlinux.o.orig
      5486170  656987 7039960 13183117         c9288d vmlinux.o
      Signed-off-by: NLi Zefan <lizf@cn.fujitsu.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      761b3ef5
  21. 27 1月, 2012 1 次提交
  22. 23 1月, 2012 3 次提交
  23. 09 1月, 2012 1 次提交
  24. 08 1月, 2012 1 次提交
    • G
      net: fix sock_clone reference mismatch with tcp memcontrol · f3f511e1
      Glauber Costa 提交于
      Sockets can also be created through sock_clone. Because it copies
      all data in the sock structure, it also copies the memcg-related pointer,
      and all should be fine. However, since we now use reference counts in
      socket creation, we are left with some sockets that have no reference
      counts. It matters when we destroy them, since it leads to a mismatch.
      Signed-off-by: NGlauber Costa <glommer@parallels.com>
      CC: David S. Miller <davem@davemloft.net>
      CC: Greg Thelen <gthelen@google.com>
      CC: Hiroyouki Kamezawa <kamezawa.hiroyu@jp.fujitsu.com>
      CC: Laurent Chavey <chavey@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f3f511e1
  25. 23 12月, 2011 1 次提交
  26. 17 12月, 2011 1 次提交
  27. 14 12月, 2011 1 次提交
  28. 13 12月, 2011 2 次提交