1. 09 8月, 2011 1 次提交
    • C
      slub: Fix full list corruption if debugging is on · 6fbabb20
      Christoph Lameter 提交于
      When a slab is freed by __slab_free() and the slab can only contain a
      single object ever then it was full (and therefore not on the partial
      lists but on the full list in the debug case) before we reached
      slab_empty.
      
      This caused the following full list corruption when SLUB debugging was enabled:
      
        [ 5913.233035] ------------[ cut here ]------------
        [ 5913.233097] WARNING: at lib/list_debug.c:53 __list_del_entry+0x8d/0x98()
        [ 5913.233101] Hardware name: Adamo 13
        [ 5913.233105] list_del corruption. prev->next should be ffffea000434fd20, but was ffffea0004199520
        [ 5913.233108] Modules linked in: nfs fscache fuse ebtable_nat ebtables ppdev parport_pc lp parport ipt_MASQUERADE iptable_nat nf_nat nfsd lockd nfs_acl auth_rpcgss xt_CHECKSUM sunrpc iptable_mangle bridge stp llc cpufreq_ondemand acpi_cpufreq freq_table mperf ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables rfcomm bnep arc4 iwlagn snd_hda_codec_hdmi snd_hda_codec_idt snd_hda_intel btusb mac80211 snd_hda_codec bluetooth snd_hwdep snd_seq snd_seq_device snd_pcm usb_debug dell_wmi sparse_keymap cdc_ether usbnet cdc_acm uvcvideo cdc_wdm mii cfg80211 snd_timer dell_laptop videodev dcdbas snd microcode v4l2_compat_ioctl32 soundcore joydev tg3 pcspkr snd_page_alloc iTCO_wdt i2c_i801 rfkill iTCO_vendor_support wmi virtio_net kvm_intel kvm ipv6 xts gf128mul dm_crypt i915 drm_kms_helper drm i2c_algo_bit i2c_core video [last unloaded: scsi_wait_scan]
        [ 5913.233213] Pid: 0, comm: swapper Not tainted 3.0.0+ #127
        [ 5913.233213] Call Trace:
        [ 5913.233213]  <IRQ>  [<ffffffff8105df18>] warn_slowpath_common+0x83/0x9b
        [ 5913.233213]  [<ffffffff8105dfd3>] warn_slowpath_fmt+0x46/0x48
        [ 5913.233213]  [<ffffffff8127e7c1>] __list_del_entry+0x8d/0x98
        [ 5913.233213]  [<ffffffff8127e7da>] list_del+0xe/0x2d
        [ 5913.233213]  [<ffffffff814e0430>] __slab_free+0x1db/0x235
        [ 5913.233213]  [<ffffffff811706ab>] ? bvec_free_bs+0x35/0x37
        [ 5913.233213]  [<ffffffff811706ab>] ? bvec_free_bs+0x35/0x37
        [ 5913.233213]  [<ffffffff811706ab>] ? bvec_free_bs+0x35/0x37
        [ 5913.233213]  [<ffffffff81133085>] kmem_cache_free+0x88/0x102
        [ 5913.233213]  [<ffffffff811706ab>] bvec_free_bs+0x35/0x37
        [ 5913.233213]  [<ffffffff811706e1>] bio_free+0x34/0x64
        [ 5913.233213]  [<ffffffff813dc390>] dm_bio_destructor+0x12/0x14
        [ 5913.233213]  [<ffffffff8116fef6>] bio_put+0x2b/0x2d
        [ 5913.233213]  [<ffffffff813dccab>] clone_endio+0x9e/0xb4
        [ 5913.233213]  [<ffffffff8116f7dd>] bio_endio+0x2d/0x2f
        [ 5913.233213]  [<ffffffffa00148da>] crypt_dec_pending+0x5c/0x8b [dm_crypt]
        [ 5913.233213]  [<ffffffffa00150a9>] crypt_endio+0x78/0x81 [dm_crypt]
      
      [ Full discussion here: https://lkml.org/lkml/2011/8/4/375 ]
      
      Make sure that we remove such a slab also from the full lists.
      Reported-and-tested-by: NDave Jones <davej@redhat.com>
      Reported-and-tested-by: NXiaotian Feng <xtfeng@gmail.com>
      Signed-off-by: NChristoph Lameter <cl@linux.com>
      Signed-off-by: NPekka Enberg <penberg@kernel.org>
      6fbabb20
  2. 04 8月, 2011 17 次提交
    • P
      slab, lockdep: Annotate the locks before using them · 30765b92
      Peter Zijlstra 提交于
      Fernando found we hit the regular OFF_SLAB 'recursion' before we
      annotate the locks, cure this.
      
      The relevant portion of the stack-trace:
      
      > [    0.000000]  [<c085e24f>] rt_spin_lock+0x50/0x56
      > [    0.000000]  [<c04fb406>] __cache_free+0x43/0xc3
      > [    0.000000]  [<c04fb23f>] kmem_cache_free+0x6c/0xdc
      > [    0.000000]  [<c04fb2fe>] slab_destroy+0x4f/0x53
      > [    0.000000]  [<c04fb396>] free_block+0x94/0xc1
      > [    0.000000]  [<c04fc551>] do_tune_cpucache+0x10b/0x2bb
      > [    0.000000]  [<c04fc8dc>] enable_cpucache+0x7b/0xa7
      > [    0.000000]  [<c0bd9d3c>] kmem_cache_init_late+0x1f/0x61
      > [    0.000000]  [<c0bba687>] start_kernel+0x24c/0x363
      > [    0.000000]  [<c0bba0ba>] i386_start_kernel+0xa9/0xaf
      Reported-by: NFernando Lopez-Lezcano <nando@ccrma.Stanford.EDU>
      Acked-by: NPekka Enberg <penberg@kernel.org>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/1311888176.2617.379.camel@laptopSigned-off-by: NIngo Molnar <mingo@elte.hu>
      30765b92
    • P
      slab, lockdep: Annotate slab -> rcu -> debug_object -> slab · 83835b3d
      Peter Zijlstra 提交于
      Lockdep thinks there's lock recursion through:
      
      	kmem_cache_free()
      	  cache_flusharray()
      	    spin_lock(&l3->list_lock)  <----------------.
      	    free_block()                                |
      	      slab_destroy()                            |
      		call_rcu()                              |
      		  debug_object_activate()               |
      		    debug_object_init()                 |
      		      __debug_object_init()             |
      			kmem_cache_alloc()              |
      			  cache_alloc_refill()          |
      			    spin_lock(&l3->list_lock) --'
      
      Now debug objects doesn't use SLAB_DESTROY_BY_RCU and hence there is no
      actual possibility of recursing. Luckily debug objects marks it slab
      with SLAB_DEBUG_OBJECTS so we can identify the thing.
      
      Mark all SLAB_DEBUG_OBJECTS (all one!) slab caches with a special
      lockdep key so that lockdep sees its a different cachep.
      
      Also add a WARN on trying to create a SLAB_DESTROY_BY_RCU |
      SLAB_DEBUG_OBJECTS cache, to avoid possible future trouble.
      Reported-and-tested-by: NSebastian Siewior <sebastian@breakpoint.cc>
      [ fixes to the initial patch ]
      Reported-by: NThomas Gleixner <tglx@linutronix.de>
      Acked-by: NPekka Enberg <penberg@kernel.org>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/1311341165.27400.58.camel@twinsSigned-off-by: NIngo Molnar <mingo@elte.hu>
      83835b3d
    • H
      mm: clarify the radix_tree exceptional cases · 8079b1c8
      Hugh Dickins 提交于
      Make the radix_tree exceptional cases, mostly in filemap.c, clearer.
      
      It's hard to devise a suitable snappy name that illuminates the use by
      shmem/tmpfs for swap, while keeping filemap/pagecache/radix_tree
      generality.  And akpm points out that /* radix_tree_deref_retry(page) */
      comments look like calls that have been commented out for unknown
      reason.
      
      Skirt the naming difficulty by rearranging these blocks to handle the
      transient radix_tree_deref_retry(page) case first; then just explain the
      remaining shmem/tmpfs swap case in a comment.
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8079b1c8
    • H
      tmpfs radix_tree: locate_item to speed up swapoff · e504f3fd
      Hugh Dickins 提交于
      We have already acknowledged that swapoff of a tmpfs file is slower than
      it was before conversion to the generic radix_tree: a little slower
      there will be acceptable, if the hotter paths are faster.
      
      But it was a shock to find swapoff of a 500MB file 20 times slower on my
      laptop, taking 10 minutes; and at that rate it significantly slows down
      my testing.
      
      Now, most of that turned out to be overhead from PROVE_LOCKING and
      PROVE_RCU: without those it was only 4 times slower than before; and
      more realistic tests on other machines don't fare as badly.
      
      I've tried a number of things to improve it, including tagging the swap
      entries, then doing lookup by tag: I'd expected that to halve the time,
      but in practice it's erratic, and often counter-productive.
      
      The only change I've so far found to make a consistent improvement, is
      to short-circuit the way we go back and forth, gang lookup packing
      entries into the array supplied, then shmem scanning that array for the
      target entry.  Scanning in place doubles the speed, so it's now only
      twice as slow as before (or three times slower when the PROVEs are on).
      
      So, add radix_tree_locate_item() as an expedient, once-off,
      single-caller hack to do the lookup directly in place.  #ifdef it on
      CONFIG_SHMEM and CONFIG_SWAP, as much to document its limited
      applicability as save space in other configurations.  And, sadly,
      #include sched.h for cond_resched().
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e504f3fd
    • H
      mm: a few small updates for radix-swap · 31475dd6
      Hugh Dickins 提交于
      Remove PageSwapBacked (!page_is_file_cache) cases from
      add_to_page_cache_locked() and add_to_page_cache_lru(): those pages now
      go through shmem_add_to_page_cache().
      
      Remove a comment on maximum tmpfs size from fsstack_copy_inode_size(),
      and add a comment on swap entries to invalidate_mapping_pages().
      
      And mincore_page() uses find_get_page() on what might be shmem or a
      tmpfs file: allow for a radix_tree_exceptional_entry(), and proceed to
      find_get_page() on swapper_space if so (oh, swapper_space needs #ifdef).
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      31475dd6
    • H
      tmpfs: use kmemdup for short symlinks · 69f07ec9
      Hugh Dickins 提交于
      But we've not yet removed the old swp_entry_t i_direct[16] from
      shmem_inode_info.  That's because it was still being shared with the
      inline symlink.  Remove it now (saving 64 or 128 bytes from shmem inode
      size), and use kmemdup() for short symlinks, say, those up to 128 bytes.
      
      I wonder why mpol_free_shared_policy() is done in shmem_destroy_inode()
      rather than shmem_evict_inode(), where we usually do such freeing? I
      guess it doesn't matter, and I'm not into NUMA mpol testing right now.
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Reviewed-by: NPekka Enberg <penberg@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      69f07ec9
    • H
      tmpfs: convert shmem_writepage and enable swap · 6922c0c7
      Hugh Dickins 提交于
      Convert shmem_writepage() to use shmem_delete_from_page_cache() to use
      shmem_radix_tree_replace() to substitute swap entry for page pointer
      atomically in the radix tree.
      
      As with shmem_add_to_page_cache(), it's not entirely satisfactory to be
      copying such code from delete_from_swap_cache, but again judged easier
      to sell than making its other callers go through the extras.
      
      Remove the toy implementation's shmem_put_swap() and shmem_get_swap(),
      now unreferenced, and the hack to disable swap: it's now good to go.
      
      The way things have worked out, info->lock no longer helps to guard the
      shmem_swaplist: we increment swapped under shmem_swaplist_mutex only.
      That global mutex exclusion between shmem_writepage() and shmem_unuse()
      is not pretty, and we ought to find another way; but it's been forced on
      us by recent race discoveries, not a consequence of this patchset.
      
      And what has become of the WARN_ON_ONCE(1) free_swap_and_cache() if a
      swap entry was found already present? That's no longer possible, the
      (unknown) one inserting this page into filecache would hit the swap
      entry occupying that slot.
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6922c0c7
    • H
      tmpfs: convert mem_cgroup shmem to radix-swap · aa3b1895
      Hugh Dickins 提交于
      Remove mem_cgroup_shmem_charge_fallback(): it was only required when we
      had to move swappage to filecache with GFP_NOWAIT.
      
      Remove the GFP_NOWAIT special case from mem_cgroup_cache_charge(), by
      moving its call out from shmem_add_to_page_cache() to two of thats three
      callers.  But leave it doing mem_cgroup_uncharge_cache_page() on error:
      although asymmetrical, it's easier for all 3 callers to handle.
      
      These two changes would also be appropriate if anyone were to start
      using shmem_read_mapping_page_gfp() with GFP_NOWAIT.
      
      Remove mem_cgroup_get_shmem_target(): mc_handle_file_pte() can test
      radix_tree_exceptional_entry() to get what it needs for itself.
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      aa3b1895
    • H
      tmpfs: convert shmem_getpage_gfp to radix-swap · 54af6042
      Hugh Dickins 提交于
      Convert shmem_getpage_gfp(), the engine-room of shmem, to expect page or
      swap entry returned from radix tree by find_lock_page().
      
      Whereas the repetitive old method proceeded mainly under info->lock,
      dropping and repeating whenever one of the conditions needed was not
      met, now we can proceed without it, leaving shmem_add_to_page_cache() to
      check for a race.
      
      This way there is no need to preallocate a page, no need for an early
      radix_tree_preload(), no need for mem_cgroup_shmem_charge_fallback().
      
      Move the error unwinding down to the bottom instead of repeating it
      throughout.  ENOSPC handling is a little different from before: there is
      no longer any race between find_lock_page() and finding swap, but we can
      arrive at ENOSPC before calling shmem_recalc_inode(), which might
      occasionally discover freed space.
      
      Be stricter to check i_size before returning.  info->lock is used for
      little but alloced, swapped, i_blocks updates.  Move i_blocks updates
      out from under the max_blocks check, so even an unlimited size=0 mount
      can show accurate du.
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      54af6042
    • H
      tmpfs: convert shmem_unuse_inode to radix-swap · 46f65ec1
      Hugh Dickins 提交于
      Convert shmem_unuse_inode() to use a lockless gang lookup of the radix
      tree, searching for matching swap.
      
      This is somewhat slower than the old method: because of repeated radix
      tree descents, because of copying entries up, but probably most because
      the old method noted and skipped once a vector page was cleared of swap.
      Perhaps we can devise a use of radix tree tagging to achieve that later.
      
      shmem_add_to_page_cache() uses shmem_radix_tree_replace() to compensate
      for the lockless lookup by checking that the expected entry is in place,
      under lock.  It is not very satisfactory to be copying this much from
      add_to_page_cache_locked(), but I think easier to sell than insisting
      that every caller of add_to_page_cache*() go through the extras.
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      46f65ec1
    • H
      tmpfs: convert shmem_truncate_range to radix-swap · 7a5d0fbb
      Hugh Dickins 提交于
      Disable the toy swapping implementation in shmem_writepage() - it's hard
      to support two schemes at once - and convert shmem_truncate_range() to a
      lockless gang lookup of swap entries along with pages, freeing both.
      
      Since the second loop tightens its noose until all entries of either
      kind have been squeezed out (and we shall make sure that there's not an
      instant when neither is visible), there is no longer a need for yet
      another pass below.
      
      shmem_radix_tree_replace() compensates for the lockless lookup by
      checking that the expected entry is in place, under lock, before
      replacing it.  Here it just deletes, but will be used in later patches
      to substitute swap entry for page or page for swap entry.
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7a5d0fbb
    • H
      tmpfs: copy truncate_inode_pages_range · bda97eab
      Hugh Dickins 提交于
      Bring truncate.c's code for truncate_inode_pages_range() inline into
      shmem_truncate_range(), replacing its first call (there's a followup
      call below, but leave that one, it will disappear next).
      
      Don't play with it yet, apart from leaving out the cleancache flush, and
      (importantly) the nrpages == 0 skip, and moving shmem_setattr()'s
      partial page preparation into its partial page handling.
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bda97eab
    • H
      tmpfs: miscellaneous trivial cleanups · 41ffe5d5
      Hugh Dickins 提交于
      While it's at its least, make a number of boring nitpicky cleanups to
      shmem.c, mostly for consistency of variable naming.  Things like "swap"
      instead of "entry", "pgoff_t index" instead of "unsigned long idx".
      
      And since everything else here is prefixed "shmem_", better change
      init_tmpfs() to shmem_init().
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      41ffe5d5
    • H
      tmpfs: demolish old swap vector support · 285b2c4f
      Hugh Dickins 提交于
      The maximum size of a shmem/tmpfs file has been limited by the maximum
      size of its triple-indirect swap vector.  With 4kB page size, maximum
      filesize was just over 2TB on a 32-bit kernel, but sadly one eighth of
      that on a 64-bit kernel.  (With 8kB page size, maximum filesize was just
      over 4TB on a 64-bit kernel, but 16TB on a 32-bit kernel,
      MAX_LFS_FILESIZE being then more restrictive than swap vector layout.)
      
      It's a shame that tmpfs should be more restrictive than ramfs, and this
      limitation has now been noticed.  Add another level to the swap vector?
      No, it became obscure and hard to maintain, once I complicated it to
      make use of highmem pages nine years ago: better choose another way.
      
      Surely, if 2.4 had had the radix tree pagecache introduced in 2.5, then
      tmpfs would never have invented its own peculiar radix tree: we would
      have fitted swap entries into the common radix tree instead, in much the
      same way as we fit swap entries into page tables.
      
      And why should each file have a separate radix tree for its pages and
      for its swap entries? The swap entries are required precisely where and
      when the pages are not.  We want to put them together in a single radix
      tree: which can then avoid much of the locking which was needed to
      prevent them from being exchanged underneath us.
      
      This also avoids the waste of memory devoted to swap vectors, first in
      the shmem_inode itself, then at least two more pages once a file grew
      beyond 16 data pages (pages accounted by df and du, but not by memcg).
      Allocated upfront, to avoid allocation when under swapping pressure, but
      pure waste when CONFIG_SWAP is not set - I have never spattered around
      the ifdefs to prevent that, preferring this move to sharing the common
      radix tree instead.
      
      There are three downsides to sharing the radix tree.  One, that it binds
      tmpfs more tightly to the rest of mm, either requiring knowledge of swap
      entries in radix tree there, or duplication of its code here in shmem.c.
      I believe that the simplications and memory savings (and probable higher
      performance, not yet measured) justify that.
      
      Two, that on HIGHMEM systems with SWAP enabled, it's the lowmem radix
      nodes that cannot be freed under memory pressure - whereas before it was
      the less precious highmem swap vector pages that could not be freed.
      I'm hoping that 64-bit has now been accessible for long enough, that the
      highmem argument has grown much less persuasive.
      
      Three, that swapoff is slower than it used to be on tmpfs files, since
      it's using a simple generic mechanism not tailored to it: I find this
      noticeable, and shall want to improve, but maybe nobody else will
      notice.
      
      So...  now remove most of the old swap vector code from shmem.c.  But,
      for the moment, keep the simple i_direct vector of 16 pages, with simple
      accessors shmem_put_swap() and shmem_get_swap(), as a toy implementation
      to help mark where swap needs to be handled in subsequent patches.
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      285b2c4f
    • H
      mm: let swap use exceptional entries · a2c16d6c
      Hugh Dickins 提交于
      If swap entries are to be stored along with struct page pointers in a
      radix tree, they need to be distinguished as exceptional entries.
      
      Most of the handling of swap entries in radix tree will be contained in
      shmem.c, but a few functions in filemap.c's common code need to check
      for their appearance: find_get_page(), find_lock_page(),
      find_get_pages() and find_get_pages_contig().
      
      So as not to slow their fast paths, tuck those checks inside the
      existing checks for unlikely radix_tree_deref_slot(); except for
      find_lock_page(), where it is an added test.  And make it a BUG in
      find_get_pages_tag(), which is not applied to tmpfs files.
      
      A part of the reason for eliminating shmem_readpage() earlier, was to
      minimize the places where common code would need to allow for swap
      entries.
      
      The swp_entry_t known to swapfile.c must be massaged into a slightly
      different form when stored in the radix tree, just as it gets massaged
      into a pte_t when stored in page tables.
      
      In an i386 kernel this limits its information (type and page offset) to
      30 bits: given 32 "types" of swapfile and 4kB pagesize, that's a maximum
      swapfile size of 128GB.  Which is less than the 512GB we previously
      allowed with X86_PAE (where the swap entry can occupy the entire upper
      32 bits of a pte_t), but not a new limitation on 32-bit without PAE; and
      there's not a new limitation on 64-bit (where swap filesize is already
      limited to 16TB by a 32-bit page offset).  Thirty areas of 128GB is
      probably still enough swap for a 64GB 32-bit machine.
      
      Provide swp_to_radix_entry() and radix_to_swp_entry() conversions, and
      enforce filesize limit in read_swap_header(), just as for ptes.
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a2c16d6c
    • H
      radix_tree: exceptional entries and indices · 6328650b
      Hugh Dickins 提交于
      A patchset to extend tmpfs to MAX_LFS_FILESIZE by abandoning its
      peculiar swap vector, instead keeping a file's swap entries in the same
      radix tree as its struct page pointers: thus saving memory, and
      simplifying its code and locking.
      
      This patch:
      
      The radix_tree is used by several subsystems for different purposes.  A
      major use is to store the struct page pointers of a file's pagecache for
      memory management.  But what if mm wanted to store something other than
      page pointers there too?
      
      The low bit of a radix_tree entry is already used to denote an indirect
      pointer, for internal use, and the unlikely radix_tree_deref_retry()
      case.
      
      Define the next bit as denoting an exceptional entry, and supply inline
      functions radix_tree_exception() to return non-0 in either unlikely
      case, and radix_tree_exceptional_entry() to return non-0 in the second
      case.
      
      If a subsystem already uses radix_tree with that bit set, no problem: it
      does not affect internal workings at all, but is defined for the
      convenience of those storing well-aligned pointers in the radix_tree.
      
      The radix_tree_gang_lookups have an implicit assumption that the caller
      can deduce the offset of each entry returned e.g.  by the page->index of
      a struct page.  But that may not be feasible for some kinds of item to
      be stored there.
      
      radix_tree_gang_lookup_slot() allow for an optional indices argument,
      output array in which to return those offsets.  The same could be added
      to other radix_tree_gang_lookups, but for now keep it to the only one
      for which we need it.
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6328650b
    • A
      fault-injection: add ability to export fault_attr in arbitrary directory · dd48c085
      Akinobu Mita 提交于
      init_fault_attr_dentries() is used to export fault_attr via debugfs.
      But it can only export it in debugfs root directory.
      
      Per Forlin is working on mmc_fail_request which adds support to inject
      data errors after a completed host transfer in MMC subsystem.
      
      The fault_attr for mmc_fail_request should be defined per mmc host and
      export it in debugfs directory per mmc host like
      /sys/kernel/debug/mmc0/mmc_fail_request.
      
      init_fault_attr_dentries() doesn't help for mmc_fail_request.  So this
      introduces fault_create_debugfs_attr() which is able to create a
      directory in the arbitrary directory and replace
      init_fault_attr_dentries().
      
      [akpm@linux-foundation.org: extraneous semicolon, per Randy]
      Signed-off-by: NAkinobu Mita <akinobu.mita@gmail.com>
      Tested-by: NPer Forlin <per.forlin@linaro.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Matt Mackall <mpm@selenic.com>
      Cc: Randy Dunlap <rdunlap@xenotime.net>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      dd48c085
  3. 03 8月, 2011 1 次提交
    • H
      HWPoison: add memory_failure_queue() · ea8f5fb8
      Huang Ying 提交于
      memory_failure() is the entry point for HWPoison memory error
      recovery.  It must be called in process context.  But commonly
      hardware memory errors are notified via MCE or NMI, so some delayed
      execution mechanism must be used.  In MCE handler, a work queue + ring
      buffer mechanism is used.
      
      In addition to MCE, now APEI (ACPI Platform Error Interface) GHES
      (Generic Hardware Error Source) can be used to report memory errors
      too.  To add support to APEI GHES memory recovery, a mechanism similar
      to that of MCE is implemented.  memory_failure_queue() is the new
      entry point that can be called in IRQ context.  The next step is to
      make MCE handler uses this interface too.
      Signed-off-by: NHuang Ying <ying.huang@intel.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLen Brown <len.brown@intel.com>
      ea8f5fb8
  4. 02 8月, 2011 1 次提交
  5. 31 7月, 2011 1 次提交
  6. 28 7月, 2011 1 次提交
  7. 27 7月, 2011 16 次提交
    • A
      atomic: use <linux/atomic.h> · 60063497
      Arun Sharma 提交于
      This allows us to move duplicated code in <asm/atomic.h>
      (atomic_inc_not_zero() for now) to <linux/atomic.h>
      Signed-off-by: NArun Sharma <asharma@fb.com>
      Reviewed-by: NEric Dumazet <eric.dumazet@gmail.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: David Miller <davem@davemloft.net>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Acked-by: NMike Frysinger <vapier@gentoo.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      60063497
    • A
      fail_page_alloc: simplify debugfs initialization · b2588c4b
      Akinobu Mita 提交于
      Now cleanup_fault_attr_dentries() recursively removes a directory, So we
      can simplify the error handling in the initialization code and no need
      to hold dentry structs for each debugfs file.
      Signed-off-by: NAkinobu Mita <akinobu.mita@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b2588c4b
    • A
      failslab: simplify debugfs initialization · 810f09b8
      Akinobu Mita 提交于
      Now cleanup_fault_attr_dentries() recursively removes a directory, So we
      can simplify the error handling in the initialization code and no need
      to hold dentry structs for each debugfs file.
      Signed-off-by: NAkinobu Mita <akinobu.mita@gmail.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Matt Mackall <mpm@selenic.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      810f09b8
    • A
      fault-injection: use debugfs_remove_recursive · 7f5ddcc8
      Akinobu Mita 提交于
      Use debugfs_remove_recursive() to simplify initialization and
      deinitialization of fault injection debugfs files.
      Signed-off-by: NAkinobu Mita <akinobu.mita@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7f5ddcc8
    • M
      cpusets: randomize node rotor used in cpuset_mem_spread_node() · 778d3b0f
      Michal Hocko 提交于
      [ This patch has already been accepted as commit 0ac0c0d0 but later
        reverted (commit 35926ff5) because it itroduced arch specific
        __node_random which was defined only for x86 code so it broke other
        archs.  This is a followup without any arch specific code.  Other than
        that there are no functional changes.]
      
      Some workloads that create a large number of small files tend to assign
      too many pages to node 0 (multi-node systems).  Part of the reason is
      that the rotor (in cpuset_mem_spread_node()) used to assign nodes starts
      at node 0 for newly created tasks.
      
      This patch changes the rotor to be initialized to a random node number
      of the cpuset.
      
      [akpm@linux-foundation.org: fix layout]
      [Lee.Schermerhorn@hp.com: Define stub numa_random() for !NUMA configuration]
      [mhocko@suse.cz: Make it arch independent]
      [akpm@linux-foundation.org: fix CONFIG_NUMA=y, MAX_NUMNODES>1 build]
      Signed-off-by: NJack Steiner <steiner@sgi.com>
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Reviewed-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Paul Menage <menage@google.com>
      Cc: Jack Steiner <steiner@sgi.com>
      Cc: Robin Holt <holt@sgi.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Jack Steiner <steiner@sgi.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Paul Menage <menage@google.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Robin Holt <holt@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      778d3b0f
    • M
      memcg: get rid of percpu_charge_mutex lock · 8521fc50
      Michal Hocko 提交于
      percpu_charge_mutex protects from multiple simultaneous per-cpu charge
      caches draining because we might end up having too many work items.  At
      least this was the case until commit 26fe6168 ("memcg: fix percpu
      cached charge draining frequency") when we introduced a more targeted
      draining for async mode.
      
      Now that also sync draining is targeted we can safely remove mutex
      because we will not send more work than the current number of CPUs.
      FLUSHING_CACHED_CHARGE protects from sending the same work multiple
      times and stock->nr_pages == 0 protects from pointless sending a work if
      there is obviously nothing to be done.  This is of course racy but we
      can live with it as the race window is really small (we would have to
      see FLUSHING_CACHED_CHARGE cleared while nr_pages would be still
      non-zero).
      
      The only remaining place where we can race is synchronous mode when we
      rely on FLUSHING_CACHED_CHARGE test which might have been set by other
      drainer on the same group but we should wait in that case as well.
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8521fc50
    • M
      memcg: add mem_cgroup_same_or_subtree() helper · 3e92041d
      Michal Hocko 提交于
      We are checking whether a given two groups are same or at least in the
      same subtree of a hierarchy at several places.  Let's make a helper for
      it to make code easier to read.
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3e92041d
    • M
      memcg: unify sync and async per-cpu charge cache draining · d38144b7
      Michal Hocko 提交于
      Currently we have two ways how to drain per-CPU caches for charges.
      drain_all_stock_sync will synchronously drain all caches while
      drain_all_stock_async will asynchronously drain only those that refer to
      a given memory cgroup or its subtree in hierarchy.  Targeted async
      draining has been introduced by 26fe6168 (memcg: fix percpu cached
      charge draining frequency) to reduce the cpu workers number.
      
      sync draining is currently triggered only from mem_cgroup_force_empty
      which is triggered only by userspace (mem_cgroup_force_empty_write) or
      when a cgroup is removed (mem_cgroup_pre_destroy).  Although these are
      not usually frequent operations it still makes some sense to do targeted
      draining as well, especially if the box has many CPUs.
      
      This patch unifies both methods to use the single code (drain_all_stock)
      which relies on the original async implementation and just adds
      flush_work to wait on all caches that are still under work for the sync
      mode.  We are using FLUSHING_CACHED_CHARGE bit check to prevent from
      waiting on a work that we haven't triggered.  Please note that both sync
      and async functions are currently protected by percpu_charge_mutex so we
      cannot race with other drainers.
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d38144b7
    • M
      memcg: do not try to drain per-cpu caches without pages · d1a05b69
      Michal Hocko 提交于
      drain_all_stock_async tries to optimize a work to be done on the work
      queue by excluding any work for the current CPU because it assumes that
      the context we are called from already tried to charge from that cache
      and it's failed so it must be empty already.
      
      While the assumption is correct we can optimize it even more by checking
      the current number of pages in the cache.  This will also reduce a work
      on other CPUs with an empty stock.
      
      For the current CPU we can simply call drain_local_stock rather than
      deferring it to the work queue.
      
      [kamezawa.hiroyu@jp.fujitsu.com: use drain_local_stock for current CPU optimization]
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d1a05b69
    • K
      memcg: add memory.vmscan_stat · 82f9d486
      KAMEZAWA Hiroyuki 提交于
      The commit log of 0ae5e89c ("memcg: count the soft_limit reclaim
      in...") says it adds scanning stats to memory.stat file.  But it doesn't
      because we considered we needed to make a concensus for such new APIs.
      
      This patch is a trial to add memory.scan_stat. This shows
        - the number of scanned pages(total, anon, file)
        - the number of rotated pages(total, anon, file)
        - the number of freed pages(total, anon, file)
        - the number of elaplsed time (including sleep/pause time)
      
        for both of direct/soft reclaim.
      
      The biggest difference with oringinal Ying's one is that this file
      can be reset by some write, as
      
        # echo 0 ...../memory.scan_stat
      
      Example of output is here. This is a result after make -j 6 kernel
      under 300M limit.
      
        [kamezawa@bluextal ~]$ cat /cgroup/memory/A/memory.scan_stat
        [kamezawa@bluextal ~]$ cat /cgroup/memory/A/memory.vmscan_stat
        scanned_pages_by_limit 9471864
        scanned_anon_pages_by_limit 6640629
        scanned_file_pages_by_limit 2831235
        rotated_pages_by_limit 4243974
        rotated_anon_pages_by_limit 3971968
        rotated_file_pages_by_limit 272006
        freed_pages_by_limit 2318492
        freed_anon_pages_by_limit 962052
        freed_file_pages_by_limit 1356440
        elapsed_ns_by_limit 351386416101
        scanned_pages_by_system 0
        scanned_anon_pages_by_system 0
        scanned_file_pages_by_system 0
        rotated_pages_by_system 0
        rotated_anon_pages_by_system 0
        rotated_file_pages_by_system 0
        freed_pages_by_system 0
        freed_anon_pages_by_system 0
        freed_file_pages_by_system 0
        elapsed_ns_by_system 0
        scanned_pages_by_limit_under_hierarchy 9471864
        scanned_anon_pages_by_limit_under_hierarchy 6640629
        scanned_file_pages_by_limit_under_hierarchy 2831235
        rotated_pages_by_limit_under_hierarchy 4243974
        rotated_anon_pages_by_limit_under_hierarchy 3971968
        rotated_file_pages_by_limit_under_hierarchy 272006
        freed_pages_by_limit_under_hierarchy 2318492
        freed_anon_pages_by_limit_under_hierarchy 962052
        freed_file_pages_by_limit_under_hierarchy 1356440
        elapsed_ns_by_limit_under_hierarchy 351386416101
        scanned_pages_by_system_under_hierarchy 0
        scanned_anon_pages_by_system_under_hierarchy 0
        scanned_file_pages_by_system_under_hierarchy 0
        rotated_pages_by_system_under_hierarchy 0
        rotated_anon_pages_by_system_under_hierarchy 0
        rotated_file_pages_by_system_under_hierarchy 0
        freed_pages_by_system_under_hierarchy 0
        freed_anon_pages_by_system_under_hierarchy 0
        freed_file_pages_by_system_under_hierarchy 0
        elapsed_ns_by_system_under_hierarchy 0
      
      total_xxxx is for hierarchy management.
      
      This will be useful for further memcg developments and need to be
      developped before we do some complicated rework on LRU/softlimit
      management.
      
      This patch adds a new struct memcg_scanrecord into scan_control struct.
      sc->nr_scanned at el is not designed for exporting information.  For
      example, nr_scanned is reset frequentrly and incremented +2 at scanning
      mapped pages.
      
      To avoid complexity, I added a new param in scan_control which is for
      exporting scanning score.
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Ying Han <yinghan@google.com>
      Cc: Andrew Bresticker <abrestic@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      82f9d486
    • D
      memcg: fix behavior of mem_cgroup_resize_limit() · 108b6a78
      Daisuke Nishimura 提交于
      Commit 22a668d7 ("memcg: fix behavior under memory.limit equals to
      memsw.limit") introduced "memsw_is_minimum" flag, which becomes true
      when mem_limit == memsw_limit.  The flag is checked at the beginning of
      reclaim, and "noswap" is set if the flag is true, because using swap is
      meaningless in this case.
      
      This works well in most cases, but when we try to shrink mem_limit,
      which is the same as memsw_limit now, we might fail to shrink mem_limit
      because swap doesn't used.
      
      This patch fixes this behavior by:
       - check MEM_CGROUP_RECLAIM_SHRINK at the begining of reclaim
       - If it is set, don't set "noswap" flag even if memsw_is_minimum is true.
      Signed-off-by: NDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Ying Han <yinghan@google.com>
      Cc: <stable@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      108b6a78
    • K
      memcg: fix vmscan count in small memcgs · 4508378b
      KAMEZAWA Hiroyuki 提交于
      Commit 246e87a9 ("memcg: fix get_scan_count() for small targets")
      fixes the memcg/kswapd behavior against small targets and prevent vmscan
      priority too high.
      
      But the implementation is too naive and adds another problem to small
      memcg.  It always force scan to 32 pages of file/anon and doesn't handle
      swappiness and other rotate_info.  It makes vmscan to scan anon LRU
      regardless of swappiness and make reclaim bad.  This patch fixes it by
      adjusting scanning count with regard to swappiness at el.
      
      At a test "cat 1G file under 300M limit." (swappiness=20)
       before patch
              scanned_pages_by_limit 360919
              scanned_anon_pages_by_limit 180469
              scanned_file_pages_by_limit 180450
              rotated_pages_by_limit 31
              rotated_anon_pages_by_limit 25
              rotated_file_pages_by_limit 6
              freed_pages_by_limit 180458
              freed_anon_pages_by_limit 19
              freed_file_pages_by_limit 180439
              elapsed_ns_by_limit 429758872
       after patch
              scanned_pages_by_limit 180674
              scanned_anon_pages_by_limit 24
              scanned_file_pages_by_limit 180650
              rotated_pages_by_limit 35
              rotated_anon_pages_by_limit 24
              rotated_file_pages_by_limit 11
              freed_pages_by_limit 180634
              freed_anon_pages_by_limit 0
              freed_file_pages_by_limit 180634
              elapsed_ns_by_limit 367119089
              scanned_pages_by_system 0
      
      the numbers of scanning anon are decreased(as expected), and elapsed time
      reduced. By this patch, small memcgs will work better.
      (*) Because the amount of file-cache is much bigger than anon,
          recalaim_stat's rotate-scan counter make scanning files more.
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4508378b
    • M
      memcg: change memcg_oom_mutex to spinlock · 1af8efe9
      Michal Hocko 提交于
      memcg_oom_mutex is used to protect memcg OOM path and eventfd interface
      for oom_control.  None of the critical sections which it protects sleep
      (eventfd_signal works from atomic context and the rest are simple linked
      list resp.  oom_lock atomic operations).
      
      Mutex is also too heavyweight for those code paths because it triggers a
      lot of scheduling.  It also makes makes convoying effects more visible
      when we have a big number of oom killing because we take the lock
      mutliple times during mem_cgroup_handle_oom so we have multiple places
      where many processes can sleep.
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1af8efe9
    • M
      memcg: make oom_lock 0 and 1 based rather than counter · 79dfdacc
      Michal Hocko 提交于
      Commit 867578cb ("memcg: fix oom kill behavior") introduced a oom_lock
      counter which is incremented by mem_cgroup_oom_lock when we are about to
      handle memcg OOM situation.  mem_cgroup_handle_oom falls back to a sleep
      if oom_lock > 1 to prevent from multiple oom kills at the same time.
      The counter is then decremented by mem_cgroup_oom_unlock called from the
      same function.
      
      This works correctly but it can lead to serious starvations when we have
      many processes triggering OOM and many CPUs available for them (I have
      tested with 16 CPUs).
      
      Consider a process (call it A) which gets the oom_lock (the first one
      that got to mem_cgroup_handle_oom and grabbed memcg_oom_mutex) and other
      processes that are blocked on the mutex.  While A releases the mutex and
      calls mem_cgroup_out_of_memory others will wake up (one after another)
      and increase the counter and fall into sleep (memcg_oom_waitq).
      
      Once A finishes mem_cgroup_out_of_memory it takes the mutex again and
      decreases oom_lock and wakes other tasks (if releasing memory by
      somebody else - e.g.  killed process - hasn't done it yet).
      
      A testcase would look like:
        Assume malloc XXX is a program allocating XXX Megabytes of memory
        which touches all allocated pages in a tight loop
        # swapoff SWAP_DEVICE
        # cgcreate -g memory:A
        # cgset -r memory.oom_control=0   A
        # cgset -r memory.limit_in_bytes= 200M
        # for i in `seq 100`
        # do
        #     cgexec -g memory:A   malloc 10 &
        # done
      
      The main problem here is that all processes still race for the mutex and
      there is no guarantee that we will get counter back to 0 for those that
      got back to mem_cgroup_handle_oom.  In the end the whole convoy
      in/decreases the counter but we do not get to 1 that would enable
      killing so nothing useful can be done.  The time is basically unbounded
      because it highly depends on scheduling and ordering on mutex (I have
      seen this taking hours...).
      
      This patch replaces the counter by a simple {un}lock semantic.  As
      mem_cgroup_oom_{un}lock works on the a subtree of a hierarchy we have to
      make sure that nobody else races with us which is guaranteed by the
      memcg_oom_mutex.
      
      We have to be careful while locking subtrees because we can encounter a
      subtree which is already locked: hierarchy:
      
                A
              /   \
             B     \
            /\      \
           C  D     E
      
      B - C - D tree might be already locked.  While we want to enable locking
      E subtree because OOM situations cannot influence each other we
      definitely do not want to allow locking A.
      
      Therefore we have to refuse lock if any subtree is already locked and
      clear up the lock for all nodes that have been set up to the failure
      point.
      
      On the other hand we have to make sure that the rest of the world will
      recognize that a group is under OOM even though it doesn't have a lock.
      Therefore we have to introduce under_oom variable which is incremented
      and decremented for the whole subtree when we enter resp.  leave
      mem_cgroup_handle_oom.  under_oom, unlike oom_lock, doesn't need be
      updated under memcg_oom_mutex because its users only check a single
      group and they use atomic operations for that.
      
      This can be checked easily by the following test case:
      
        # cgcreate -g memory:A
        # cgset -r memory.use_hierarchy=1 A
        # cgset -r memory.oom_control=1   A
        # cgset -r memory.limit_in_bytes= 100M
        # cgset -r memory.memsw.limit_in_bytes= 100M
        # cgcreate -g memory:A/B
        # cgset -r memory.oom_control=1 A/B
        # cgset -r memory.limit_in_bytes=20M
        # cgset -r memory.memsw.limit_in_bytes=20M
        # cgexec -g memory:A/B malloc 30  &    #->this will be blocked by OOM of group B
        # cgexec -g memory:A   malloc 80  &    #->this will be blocked by OOM of group A
      
      While B gets oom_lock A will not get it.  Both of them go into sleep and
      wait for an external action.  We can make the limit higher for A to
      enforce waking it up
      
        # cgset -r memory.memsw.limit_in_bytes=300M A
        # cgset -r memory.limit_in_bytes=300M A
      
      malloc in A has to wake up even though it doesn't have oom_lock.
      
      Finally, the unlock path is very easy because we always unlock only the
      subtree we have locked previously while we always decrement under_oom.
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      79dfdacc
    • K
      memcg: consolidate memory cgroup lru stat functions · bb2a0de9
      KAMEZAWA Hiroyuki 提交于
      In mm/memcontrol.c, there are many lru stat functions as..
      
        mem_cgroup_zone_nr_lru_pages
        mem_cgroup_node_nr_file_lru_pages
        mem_cgroup_nr_file_lru_pages
        mem_cgroup_node_nr_anon_lru_pages
        mem_cgroup_nr_anon_lru_pages
        mem_cgroup_node_nr_unevictable_lru_pages
        mem_cgroup_nr_unevictable_lru_pages
        mem_cgroup_node_nr_lru_pages
        mem_cgroup_nr_lru_pages
        mem_cgroup_get_local_zonestat
      
      Some of them are under #ifdef MAX_NUMNODES >1 and others are not.
      This seems bad. This patch consolidates all functions into
      
        mem_cgroup_zone_nr_lru_pages()
        mem_cgroup_node_nr_lru_pages()
        mem_cgroup_nr_lru_pages()
      
      For these functions, "which LRU?" information is passed by a mask.
      
      example:
        mem_cgroup_nr_lru_pages(mem, BIT(LRU_ACTIVE_ANON))
      
      And I added some macro as ALL_LRU, ALL_LRU_FILE, ALL_LRU_ANON.
      
      example:
        mem_cgroup_nr_lru_pages(mem, ALL_LRU)
      
      BTW, considering layout of NUMA memory placement of counters, this patch seems
      to be better.
      
      Now, when we gather all LRU information, we scan in following orer
          for_each_lru -> for_each_node -> for_each_zone.
      
      This means we'll touch cache lines in different node in turn.
      
      After patch, we'll scan
          for_each_node -> for_each_zone -> for_each_lru(mask)
      
      Then, we'll gather information in the same cacheline at once.
      
      [akpm@linux-foundation.org: fix warnigns, build error]
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bb2a0de9
    • K
      memcg: export memory cgroup's swappiness with mem_cgroup_swappiness() · 1f4c025b
      KAMEZAWA Hiroyuki 提交于
      Each memory cgroup has a 'swappiness' value which can be accessed by
      get_swappiness(memcg).  The major user is try_to_free_mem_cgroup_pages()
      and swappiness is passed by argument.  It's propagated by scan_control.
      
      get_swappiness() is a static function but some planned updates will need
      to get swappiness from files other than memcontrol.c This patch exports
      get_swappiness() as mem_cgroup_swappiness().  With this, we can remove the
      argument of swapiness from try_to_free...  and drop swappiness from
      scan_control.  only memcg uses it.
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Ying Han <yinghan@google.com>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1f4c025b
  8. 26 7月, 2011 2 次提交