1. 18 Sep, 2012 1 commit
  2. 01 Aug, 2012 6 commits
    • memcg: fix memory accounting scalability in shrink_page_list · 69980e31
      Tim Chen authored
      I noticed in a multi-process parallel file-reading benchmark I ran on an
      8-socket machine that throughput slowed down by a factor of 8 when I ran
      the benchmark within a cgroup container.  I traced the problem to the
      following code path (see below) when we are trying to reclaim memory from
      file cache.  The res_counter_uncharge function is called on every page
      that's reclaimed, which creates heavy lock contention.  The patch below
      allows the reclaimed pages to be uncharged from the resource counter in
      a batch, which recovers the regression.
      
      Tim
      
           40.67%           usemem  [kernel.kallsyms]                   [k] _raw_spin_lock
                            |
                            --- _raw_spin_lock
                               |
                               |--92.61%-- res_counter_uncharge
                               |          |
                               |          |--100.00%-- __mem_cgroup_uncharge_common
                               |          |          |
                               |          |          |--100.00%-- mem_cgroup_uncharge_cache_page
                               |          |          |          __remove_mapping
                               |          |          |          shrink_page_list
                               |          |          |          shrink_inactive_list
                               |          |          |          shrink_mem_cgroup_zone
                               |          |          |          shrink_zone
                               |          |          |          do_try_to_free_pages
                               |          |          |          try_to_free_pages
                               |          |          |          __alloc_pages_nodemask
                               |          |          |          alloc_pages_current
      Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
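The batching idea in the commit above can be illustrated with a small user-space sketch. It is not the kernel patch: `struct counter`, `counter_uncharge`, `reclaim_per_page` and `reclaim_batched` are hypothetical names, and a pthread mutex stands in for the spinlock that protects the resource counter. The point is only that the batched path takes the lock once per batch instead of once per page.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for a lock-protected resource counter. */
struct counter {
    pthread_mutex_t lock;
    unsigned long usage;      /* pages currently charged */
    unsigned long lock_taken; /* lock acquisitions, kept for illustration */
};

static void counter_init(struct counter *c, unsigned long usage)
{
    pthread_mutex_init(&c->lock, NULL);
    c->usage = usage;
    c->lock_taken = 0;
}

/* Uncharge nr pages under a single lock acquisition. */
static void counter_uncharge(struct counter *c, unsigned long nr)
{
    pthread_mutex_lock(&c->lock);
    assert(nr <= c->usage);
    c->usage -= nr;
    c->lock_taken++;
    pthread_mutex_unlock(&c->lock);
}

/* Per-page path: one lock round trip for every reclaimed page. */
static void reclaim_per_page(struct counter *c, unsigned long nr_pages)
{
    for (unsigned long i = 0; i < nr_pages; i++)
        counter_uncharge(c, 1);
}

/* Batched path: count freed pages locally, then uncharge them at once. */
static void reclaim_batched(struct counter *c, unsigned long nr_pages)
{
    unsigned long batch = 0;

    for (unsigned long i = 0; i < nr_pages; i++)
        batch++; /* page freed; remember it instead of uncharging now */
    if (batch)
        counter_uncharge(c, batch);
}
```

Reclaiming 32 pages through `reclaim_per_page` takes the lock 32 times, while `reclaim_batched` takes it once. On a machine with many sockets, that difference is what determines how much the lock is contended.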
    • memcg: further prevent OOM with too many dirty pages · c3b94f44
      Hugh Dickins authored
      The may_enter_fs test turns out to be too restrictive: though I saw no
      problem with it when testing on 3.5-rc6, it very soon OOMed when I tested
      on 3.5-rc6-mm1.  I don't know what the difference there is, perhaps I just
      slightly changed the way I started off the testing: dd if=/dev/zero
      of=/mnt/temp bs=1M count=1024; rm -f /mnt/temp; sync repeatedly, in 20M
      memory.limit_in_bytes cgroup to ext4 on USB stick.
      
      ext4 (and gfs2 and xfs) turn out to allocate new pages for writing with
      AOP_FLAG_NOFS: that seems a little worrying, and it's unclear to me why
      the transaction needs to be started even before allocating pagecache
      memory.  But it may not be worth worrying about these days: if direct
      reclaim avoids FS writeback, does __GFP_FS now mean anything?
      
      Anyway, we insisted on the may_enter_fs test to avoid hangs with the loop
      device; but since that also masks off __GFP_IO, we can test for __GFP_IO
      directly, ignoring may_enter_fs and __GFP_FS.
      
      But even so, the test still OOMs sometimes: when originally testing on
      3.5-rc6, it OOMed about one time in five or ten; when testing just now on
      3.5-rc6-mm1, it OOMed on the first iteration.
      
      This residual problem comes from an accumulation of pages under ordinary
      writeback, not marked PageReclaim, so rightly not causing the memcg check
      to wait on their writeback: these too can prevent shrink_page_list() from
      freeing any pages, so many times that memcg reclaim fails and OOMs.
      
      Deal with these in the same way as direct reclaim now deals with dirty FS
      pages: mark them PageReclaim.  It is appropriate to rotate these to tail
      of list when writepage completes, but more importantly, the PageReclaim
      flag makes memcg reclaim wait on them if encountered again.  Increment
      NR_VMSCAN_IMMEDIATE?  That's arguable: I chose not.
      
      Setting PageReclaim here may occasionally race with end_page_writeback()
      clearing it: lru_deactivate_fn() already faced the same race, and
      correctly concluded that the window is small and the issue non-critical.
      
      With these changes, the test runs indefinitely without OOMing on ext4,
      ext3 and ext2: I'll move on to test with other filesystems later.
      
      Trivia: invert conditions for a clearer block without an else, and goto
      keep_locked to do the unlock_page.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujtisu.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Ying Han <yinghan@google.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
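The decision described in the commit above, for a page found under writeback, can be modelled roughly as follows. This is a user-space sketch based only on the changelog text, not the code in shrink_page_list(): `struct page_model`, `writeback_action` and `SKETCH_GFP_IO` are hypothetical names, and the flag value is an arbitrary stand-in for __GFP_IO.

```c
#include <stdbool.h>

#define SKETCH_GFP_IO 0x1u /* hypothetical stand-in for __GFP_IO */

enum wb_action {
    WB_MARK_AND_SKIP, /* set PageReclaim and keep the page on the list */
    WB_WAIT           /* throttle: wait for writeback to complete */
};

/* Minimal model of the one page flag that matters here. */
struct page_model {
    bool reclaim; /* PageReclaim */
};

/*
 * Called for a page that is under writeback.  Only memcg reclaim waits,
 * and only when the page was already marked PageReclaim on an earlier
 * pass and the allocation context allows I/O.  Every other case marks
 * the page so that a later memcg pass will wait on it.
 */
static enum wb_action writeback_action(struct page_model *page,
                                       bool global_reclaim,
                                       unsigned int gfp_mask)
{
    if (global_reclaim || !page->reclaim || !(gfp_mask & SKETCH_GFP_IO)) {
        page->reclaim = true;
        return WB_MARK_AND_SKIP;
    }
    return WB_WAIT;
}
```

In this model a page under ordinary writeback is marked on the first memcg pass and waited on if it is encountered again. A caller that has masked off I/O, such as the loop driver case mentioned in the changelog, never waits.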
    • memcg: prevent OOM with too many dirty pages · e62e384e
      Michal Hocko authored
      The current implementation of dirty pages throttling is not memcg aware
      which makes it easy to have memcg LRUs full of dirty pages.  Without
      throttling, these LRUs can be scanned faster than the rate of writeback,
      leading to memcg OOM conditions when the hard limit is small.
      
      This patch fixes the problem by throttling the allocating process
      (possibly a writer) during the hard limit reclaim by waiting on
      PageReclaim pages.  We are waiting only for PageReclaim pages because
      those are the pages that made one full round over LRU and that means that
      the writeback is much slower than scanning.
      
      The solution is far from ideal (the long-term solution is memcg-aware
      dirty throttling), but it is meant to be a band-aid until we have a real
      fix.  We are seeing this happen during nightly backups, which are placed
      into containers to prevent eviction of the real working set.
      
      The change affects only memcg reclaim, and only when we encounter
      PageReclaim pages, which is a signal that reclaim is not keeping up
      with the writers, so somebody should be throttled.  This could be
      unfair, because somebody else from the group may get throttled on behalf
      of the writer; but since writers need to allocate as well, and they
      allocate at a higher rate, the probability that only innocent processes
      are penalized is not that high.
      
      I have tested this change by a simple dd copying /dev/zero to tmpfs or
      ext3 running under small memcg (1G copy under 5M, 60M, 300M and 2G
      containers) and dd got killed by OOM killer every time.  With the patch I
      could run the dd with the same size under 5M controller without any OOM.
      The issue is more visible with slower devices for output.
      
      * With the patch
      ================
      * tmpfs size=2G
      ---------------
      $ vim cgroup_cache_oom_test.sh
      $ ./cgroup_cache_oom_test.sh 5M
      using Limit 5M for group
      1000+0 records in
      1000+0 records out
      1048576000 bytes (1.0 GB) copied, 30.4049 s, 34.5 MB/s
      $ ./cgroup_cache_oom_test.sh 60M
      using Limit 60M for group
      1000+0 records in
      1000+0 records out
      1048576000 bytes (1.0 GB) copied, 31.4561 s, 33.3 MB/s
      $ ./cgroup_cache_oom_test.sh 300M
      using Limit 300M for group
      1000+0 records in
      1000+0 records out
      1048576000 bytes (1.0 GB) copied, 20.4618 s, 51.2 MB/s
      $ ./cgroup_cache_oom_test.sh 2G
      using Limit 2G for group
      1000+0 records in
      1000+0 records out
      1048576000 bytes (1.0 GB) copied, 1.42172 s, 738 MB/s
      
      * ext3
      ------
      $ ./cgroup_cache_oom_test.sh 5M
      using Limit 5M for group
      1000+0 records in
      1000+0 records out
      1048576000 bytes (1.0 GB) copied, 27.9547 s, 37.5 MB/s
      $ ./cgroup_cache_oom_test.sh 60M
      using Limit 60M for group
      1000+0 records in
      1000+0 records out
      1048576000 bytes (1.0 GB) copied, 30.3221 s, 34.6 MB/s
      $ ./cgroup_cache_oom_test.sh 300M
      using Limit 300M for group
      1000+0 records in
      1000+0 records out
      1048576000 bytes (1.0 GB) copied, 24.5764 s, 42.7 MB/s
      $ ./cgroup_cache_oom_test.sh 2G
      using Limit 2G for group
      1000+0 records in
      1000+0 records out
      1048576000 bytes (1.0 GB) copied, 3.35828 s, 312 MB/s
      
      * Without the patch
      ===================
      * tmpfs size=2G
      ---------------
      $ ./cgroup_cache_oom_test.sh 5M
      using Limit 5M for group
      ./cgroup_cache_oom_test.sh: line 46:  4668 Killed                  dd if=/dev/zero of=$OUT/zero bs=1M count=$count
      $ ./cgroup_cache_oom_test.sh 60M
      using Limit 60M for group
      1000+0 records in
      1000+0 records out
      1048576000 bytes (1.0 GB) copied, 25.4989 s, 41.1 MB/s
      $ ./cgroup_cache_oom_test.sh 300M
      using Limit 300M for group
      1000+0 records in
      1000+0 records out
      1048576000 bytes (1.0 GB) copied, 24.3928 s, 43.0 MB/s
      $ ./cgroup_cache_oom_test.sh 2G
      using Limit 2G for group
      1000+0 records in
      1000+0 records out
      1048576000 bytes (1.0 GB) copied, 1.49797 s, 700 MB/s
      
      * ext3
      ------
      $ ./cgroup_cache_oom_test.sh 5M
      using Limit 5M for group
      ./cgroup_cache_oom_test.sh: line 46:  4689 Killed                  dd if=/dev/zero of=$OUT/zero bs=1M count=$count
      $ ./cgroup_cache_oom_test.sh 60M
      using Limit 60M for group
      ./cgroup_cache_oom_test.sh: line 46:  4692 Killed                  dd if=/dev/zero of=$OUT/zero bs=1M count=$count
      $ ./cgroup_cache_oom_test.sh 300M
      using Limit 300M for group
      1000+0 records in
      1000+0 records out
      1048576000 bytes (1.0 GB) copied, 20.248 s, 51.8 MB/s
      $ ./cgroup_cache_oom_test.sh 2G
      using Limit 2G for group
      1000+0 records in
      1000+0 records out
      1048576000 bytes (1.0 GB) copied, 2.85201 s, 368 MB/s
      
      [akpm@linux-foundation.org: tweak changelog, reordered the test to optimize for CONFIG_CGROUP_MEM_RES_CTLR=n]
      [hughd@google.com: fix deadlock with loop driver]
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujtisu.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Ying Han <yinghan@google.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Reviewed-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Fengguang Wu <fengguang.wu@intel.com>
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: account for the number of times direct reclaimers get throttled · 68243e76
      Mel Gorman authored
      Under significant pressure when writing back to network-backed storage,
      direct reclaimers may get throttled.  This is expected to be a short-lived
      event and the processes get woken up again but processes do get stalled.
      This patch counts how many times such stalling occurs.  It's up to the
      administrator whether to reduce these stalls by increasing
      min_free_kbytes.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: David Miller <davem@davemloft.net>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mike Christie <michaelc@cs.wisc.edu>
      Cc: Eric B Munson <emunson@mgebm.net>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: throttle direct reclaimers if PF_MEMALLOC reserves are low and swap is backed by network storage · 5515061d
      Mel Gorman authored
      If swap is backed by network storage such as NBD, there is a risk that a
      large number of reclaimers can hang the system by consuming all
      PF_MEMALLOC reserves.  To avoid these hangs, the administrator must tune
      min_free_kbytes in advance which is a bit fragile.
      
      This patch throttles direct reclaimers if half the PF_MEMALLOC reserves
      are in use.  If the system is routinely getting throttled the system
      administrator can increase min_free_kbytes so degradation is smoother but
      the system will keep running.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: David Miller <davem@davemloft.net>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mike Christie <michaelc@cs.wisc.edu>
      Cc: Eric B Munson <emunson@mgebm.net>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
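The "half the reserves are in use" rule from the commit above can be sketched as a simple check. This is an illustration under assumptions, not the kernel implementation: `struct zone_model` and `reserves_ok` are hypothetical names, and the sketch assumes the reserve is the sum of per-zone minimum watermarks (the quantity that min_free_kbytes influences).

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-zone view: the minimum watermark and current free pages. */
struct zone_model {
    unsigned long min_watermark; /* pages held back as the emergency reserve */
    unsigned long free_pages;    /* pages currently free in this zone */
};

/*
 * Returns true while more than half of the summed reserve is still free.
 * When it returns false, a direct reclaimer would be throttled in this model.
 */
static bool reserves_ok(const struct zone_model *zones, size_t nr_zones)
{
    unsigned long reserve = 0;
    unsigned long free_pages = 0;

    for (size_t i = 0; i < nr_zones; i++) {
        reserve += zones[i].min_watermark;
        free_pages += zones[i].free_pages;
    }
    return free_pages > reserve / 2;
}
```

With two zones whose watermarks sum to 200 pages, 110 free pages pass the check and 90 do not. Raising the watermarks (in the real system, through min_free_kbytes) makes throttling start while more pages are still free, which is what the changelog means by smoother degradation.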
    • memcg: rename config variables · c255a458
      Andrew Morton authored
      Sanity:
      
      CONFIG_CGROUP_MEM_RES_CTLR -> CONFIG_MEMCG
      CONFIG_CGROUP_MEM_RES_CTLR_SWAP -> CONFIG_MEMCG_SWAP
      CONFIG_CGROUP_MEM_RES_CTLR_SWAP_ENABLED -> CONFIG_MEMCG_SWAP_ENABLED
      CONFIG_CGROUP_MEM_RES_CTLR_KMEM -> CONFIG_MEMCG_KMEM
      
      [mhocko@suse.cz: fix missed bits]
      Cc: Glauber Costa <glommer@parallels.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 18 Jul, 2012 1 commit
    • mm: fix lost kswapd wakeup in kswapd_stop() · 1c7e7f6c
      Aaditya Kumar authored
      Offlining memory may block forever, waiting for kswapd() to wake up
      because kswapd() does not check the event kthread->should_stop before
      sleeping.
      
      The proper pattern, from Documentation/memory-barriers.txt, is:
      
         ---  waker  ---
         event_indicated = 1;
         wake_up_process(event_daemon);
      
         ---  sleeper  ---
         for (;;) {
            set_current_state(TASK_UNINTERRUPTIBLE);
            if (event_indicated)
               break;
            schedule();
         }
      
         set_current_state() may be wrapped by:
            prepare_to_wait();
      
      In the kswapd() case, event_indicated is kthread->should_stop.
      
        === offlining memory (waker) ===
         kswapd_stop()
            kthread_stop()
               kthread->should_stop = 1
               wake_up_process()
               wait_for_completion()
      
        ===  kswapd_try_to_sleep (sleeper) ===
         kswapd_try_to_sleep()
            prepare_to_wait()
                 .
                 .
            schedule()
                 .
                 .
            finish_wait()
      
      The schedule() needs to be protected by a test of kthread->should_stop,
      which is wrapped by kthread_should_stop().
      
      Reproducer:
         Do heavy file I/O in background.
         Do a memory offline/online in a tight loop
      Signed-off-by: Aaditya Kumar <aaditya.kumar@ap.sony.com>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
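The lost wake-up described in the commit above can be reproduced in a deterministic, single-threaded model. This is a sketch, not kernel code: `struct task`, `stop_task` and `try_to_sleep` are hypothetical names that mimic kthread_stop() and the prepare_to_wait()/schedule() sequence. The model covers only the ordering in which the waker runs before the sleeper prepares to wait.

```c
#include <stdbool.h>

enum task_state { RUNNING, SLEEPING };

struct task {
    enum task_state state;
    bool should_stop; /* plays the role of kthread->should_stop */
};

/* Waker: indicate the event, then wake the task (a no-op if it is running). */
static void stop_task(struct task *t)
{
    t->should_stop = true;
    t->state = RUNNING;
}

/*
 * Sleeper: returns true if the task ends up blocked.  With check_stop set,
 * the event is tested after the state change and before "schedule()", as in
 * the pattern quoted from Documentation/memory-barriers.txt.
 */
static bool try_to_sleep(struct task *t, bool check_stop)
{
    t->state = SLEEPING; /* prepare_to_wait() */
    if (check_stop && t->should_stop) {
        t->state = RUNNING; /* finish_wait(): the event already happened */
        return false;
    }
    /* schedule(): a task left in SLEEPING stays blocked until woken */
    return t->state == SLEEPING;
}
```

If `stop_task` runs first, the sleeper without the check goes to sleep after its only wake-up has already been delivered, so nothing will wake it again. With the check, it sees `should_stop` and skips the sleep.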
  4. 12 Jul, 2012 1 commit
    • memory hotplug: fix invalid memory access caused by stale kswapd pointer · d8adde17
      Jiang Liu authored
      kswapd_stop() is called to destroy the kswapd work thread when all memory
      of a NUMA node has been offlined.  But kswapd_stop() only terminates the
      work thread without resetting NODE_DATA(nid)->kswapd to NULL.  The stale
      pointer will prevent kswapd_run() from creating a new work thread when
      adding memory to the memory-less NUMA node again.  Eventually the stale
      pointer may cause invalid memory access.
      
      An example stack dump is shown below.  It was reproduced with 2.6.32,
      but the latest kernel has the same issue.
      
        BUG: unable to handle kernel NULL pointer dereference at (null)
        IP: [<ffffffff81051a94>] exit_creds+0x12/0x78
        PGD 0
        Oops: 0000 [#1] SMP
        last sysfs file: /sys/devices/system/memory/memory391/state
        CPU 11
        Modules linked in: cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq microcode fuse loop dm_mod tpm_tis rtc_cmos i2c_i801 rtc_core tpm serio_raw pcspkr sg tpm_bios igb i2c_core iTCO_wdt rtc_lib mptctl iTCO_vendor_support button dca bnx2 usbhid hid uhci_hcd ehci_hcd usbcore sd_mod crc_t10dif edd ext3 mbcache jbd fan ide_pci_generic ide_core ata_generic ata_piix libata thermal processor thermal_sys hwmon mptsas mptscsih mptbase scsi_transport_sas scsi_mod
        Pid: 7949, comm: sh Not tainted 2.6.32.12-qiuxishi-5-default #92 Tecal RH2285
        RIP: 0010:exit_creds+0x12/0x78
        RSP: 0018:ffff8806044f1d78  EFLAGS: 00010202
        RAX: 0000000000000000 RBX: ffff880604f22140 RCX: 0000000000019502
        RDX: 0000000000000000 RSI: 0000000000000202 RDI: 0000000000000000
        RBP: ffff880604f22150 R08: 0000000000000000 R09: ffffffff81a4dc10
        R10: 00000000000032a0 R11: ffff880006202500 R12: 0000000000000000
        R13: 0000000000c40000 R14: 0000000000008000 R15: 0000000000000001
        FS:  00007fbc03d066f0(0000) GS:ffff8800282e0000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
        CR2: 0000000000000000 CR3: 000000060f029000 CR4: 00000000000006e0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
        Process sh (pid: 7949, threadinfo ffff8806044f0000, task ffff880603d7c600)
        Stack:
         ffff880604f22140 ffffffff8103aac5 ffff880604f22140 ffffffff8104d21e
         ffff880006202500 0000000000008000 0000000000c38000 ffffffff810bd5b1
         0000000000000000 ffff880603d7c600 00000000ffffdd29 0000000000000003
        Call Trace:
          __put_task_struct+0x5d/0x97
          kthread_stop+0x50/0x58
          offline_pages+0x324/0x3da
          memory_block_change_state+0x179/0x1db
          store_mem_state+0x9e/0xbb
          sysfs_write_file+0xd0/0x107
          vfs_write+0xad/0x169
          sys_write+0x45/0x6e
          system_call_fastpath+0x16/0x1b
        Code: ff 4d 00 0f 94 c0 84 c0 74 08 48 89 ef e8 1f fd ff ff 5b 5d 31 c0 41 5c c3 53 48 8b 87 20 06 00 00 48 89 fb 48 8b bf 18 06 00 00 <8b> 00 48 c7 83 18 06 00 00 00 00 00 00 f0 ff 0f 0f 94 c0 84 c0
        RIP  exit_creds+0x12/0x78
         RSP <ffff8806044f1d78>
        CR2: 0000000000000000
      
      [akpm@linux-foundation.org: add pglist_data.kswapd locking comments]
      Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
      Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Acked-by: David Rientjes <rientjes@google.com>
      Reviewed-by: Minchan Kim <minchan@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
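The stale-pointer problem in the commit above can be modelled in a few lines. This is a user-space sketch with hypothetical names (`struct worker`, `struct node_model`, `node_run`, `node_stop`). The `reset_pointer` argument exists only to contrast the old and the fixed behaviour and has no counterpart in the kernel.

```c
#include <stdbool.h>
#include <stddef.h>

struct worker {
    bool alive; /* false once the thread has been stopped */
};

struct node_model {
    struct worker *kswapd; /* plays the role of NODE_DATA(nid)->kswapd */
};

/* Start a worker for the node unless the pointer says one already exists. */
static bool node_run(struct node_model *n, struct worker *storage)
{
    if (n->kswapd)
        return false; /* running, or a stale pointer claims it is */
    storage->alive = true;
    n->kswapd = storage;
    return true;
}

/* Stop the worker; the fix is to clear the pointer as well. */
static void node_stop(struct node_model *n, bool reset_pointer)
{
    if (!n->kswapd)
        return;
    n->kswapd->alive = false;
    if (reset_pointer)
        n->kswapd = NULL;
}
```

Without the reset, a later `node_run` refuses to start a new worker and the node keeps pointing at a dead one. In the real kernel, that state is what leads to the invalid access when the node is offlined again.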
  5. 28 Jun, 2012 2 commits
  6. 30 May, 2012 28 commits
  7. 26 Apr, 2012 1 commit