1. 18 July 2012 (1 commit)
    • mm: fix lost kswapd wakeup in kswapd_stop() · 1c7e7f6c
      Committed by Aaditya Kumar
      Offlining memory may block forever, waiting for kswapd() to wake up
      because kswapd() does not check the event kthread->should_stop before
      sleeping.
      
      The proper pattern, from Documentation/memory-barriers.txt, is:
      
         ---  waker  ---
         event_indicated = 1;
         wake_up_process(event_daemon);
      
         ---  sleeper  ---
         for (;;) {
            set_current_state(TASK_UNINTERRUPTIBLE);
            if (event_indicated)
               break;
            schedule();
         }
      
         set_current_state() may be wrapped by:
            prepare_to_wait();
      
      In the kswapd() case, event_indicated is kthread->should_stop.
      
        === offlining memory (waker) ===
         kswapd_stop()
            kthread_stop()
               kthread->should_stop = 1
               wake_up_process()
               wait_for_completion()
      
        ===  kswapd_try_to_sleep (sleeper) ===
         kswapd_try_to_sleep()
            prepare_to_wait()
                 .
                 .
            schedule()
                 .
                 .
            finish_wait()
      
      The schedule() needs to be protected by a test of kthread->should_stop,
      which is wrapped by kthread_should_stop().
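
      A simplified sketch of the fixed sleeper (illustrative only; the real
      kswapd_try_to_sleep() does more bookkeeping than shown here):

         static void kswapd_try_to_sleep(pg_data_t *pgdat, int order,
                                         int classzone_idx)
         {
            DEFINE_WAIT(wait);

            prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);

            /*
             * Re-check the stop flag after prepare_to_wait().  If
             * kthread_stop() has already set it, sleeping here would
             * miss the wakeup and block the offlining path forever.
             */
            if (!kthread_should_stop())
               schedule();

            finish_wait(&pgdat->kswapd_wait, &wait);
         }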
      
      Reproducer:
         Do heavy file I/O in the background.
         Do memory offline/online in a tight loop.
      Signed-off-by: Aaditya Kumar <aaditya.kumar@ap.sony.com>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 12 July 2012 (1 commit)
    • memory hotplug: fix invalid memory access caused by stale kswapd pointer · d8adde17
      Committed by Jiang Liu
      kswapd_stop() is called to destroy the kswapd work thread when all memory
      of a NUMA node has been offlined.  But kswapd_stop() only terminates the
      work thread without resetting NODE_DATA(nid)->kswapd to NULL.  The stale
      pointer will prevent kswapd_run() from creating a new work thread when
      adding memory to the memory-less NUMA node again.  Eventually the stale
      pointer may cause invalid memory access.
      
      An example stack dump is shown below.  It was reproduced with 2.6.32,
      but the latest kernel has the same issue.
      
        BUG: unable to handle kernel NULL pointer dereference at (null)
        IP: [<ffffffff81051a94>] exit_creds+0x12/0x78
        PGD 0
        Oops: 0000 [#1] SMP
        last sysfs file: /sys/devices/system/memory/memory391/state
        CPU 11
        Modules linked in: cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq microcode fuse loop dm_mod tpm_tis rtc_cmos i2c_i801 rtc_core tpm serio_raw pcspkr sg tpm_bios igb i2c_core iTCO_wdt rtc_lib mptctl iTCO_vendor_support button dca bnx2 usbhid hid uhci_hcd ehci_hcd usbcore sd_mod crc_t10dif edd ext3 mbcache jbd fan ide_pci_generic ide_core ata_generic ata_piix libata thermal processor thermal_sys hwmon mptsas mptscsih mptbase scsi_transport_sas scsi_mod
        Pid: 7949, comm: sh Not tainted 2.6.32.12-qiuxishi-5-default #92 Tecal RH2285
        RIP: 0010:exit_creds+0x12/0x78
        RSP: 0018:ffff8806044f1d78  EFLAGS: 00010202
        RAX: 0000000000000000 RBX: ffff880604f22140 RCX: 0000000000019502
        RDX: 0000000000000000 RSI: 0000000000000202 RDI: 0000000000000000
        RBP: ffff880604f22150 R08: 0000000000000000 R09: ffffffff81a4dc10
        R10: 00000000000032a0 R11: ffff880006202500 R12: 0000000000000000
        R13: 0000000000c40000 R14: 0000000000008000 R15: 0000000000000001
        FS:  00007fbc03d066f0(0000) GS:ffff8800282e0000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
        CR2: 0000000000000000 CR3: 000000060f029000 CR4: 00000000000006e0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
        Process sh (pid: 7949, threadinfo ffff8806044f0000, task ffff880603d7c600)
        Stack:
         ffff880604f22140 ffffffff8103aac5 ffff880604f22140 ffffffff8104d21e
         ffff880006202500 0000000000008000 0000000000c38000 ffffffff810bd5b1
         0000000000000000 ffff880603d7c600 00000000ffffdd29 0000000000000003
        Call Trace:
          __put_task_struct+0x5d/0x97
          kthread_stop+0x50/0x58
          offline_pages+0x324/0x3da
          memory_block_change_state+0x179/0x1db
          store_mem_state+0x9e/0xbb
          sysfs_write_file+0xd0/0x107
          vfs_write+0xad/0x169
          sys_write+0x45/0x6e
          system_call_fastpath+0x16/0x1b
        Code: ff 4d 00 0f 94 c0 84 c0 74 08 48 89 ef e8 1f fd ff ff 5b 5d 31 c0 41 5c c3 53 48 8b 87 20 06 00 00 48 89 fb 48 8b bf 18 06 00 00 <8b> 00 48 c7 83 18 06 00 00 00 00 00 00 f0 ff 0f 0f 94 c0 84 c0
        RIP  exit_creds+0x12/0x78
         RSP <ffff8806044f1d78>
        CR2: 0000000000000000
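
      The fix, sketched below (a simplified illustration rather than the
      verbatim upstream diff), is to clear the per-node pointer once the
      thread has been stopped, so that a later kswapd_run() can create a
      fresh kswapd for the node:

         void kswapd_stop(int nid)
         {
            struct task_struct *kswapd = NODE_DATA(nid)->kswapd;

            if (kswapd) {
               kthread_stop(kswapd);
               /* Drop the stale pointer so kswapd_run() starts a new
                * thread when memory is added to this node again. */
               NODE_DATA(nid)->kswapd = NULL;
            }
         }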
      
      [akpm@linux-foundation.org: add pglist_data.kswapd locking comments]
      Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
      Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Acked-by: David Rientjes <rientjes@google.com>
      Reviewed-by: Minchan Kim <minchan@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 30 May 2012 (28 commits)
  4. 26 April 2012 (1 commit)
  5. 13 April 2012 (1 commit)
  6. 25 March 2012 (1 commit)
  7. 23 March 2012 (1 commit)
  8. 22 March 2012 (6 commits)
    • mm: forbid lumpy-reclaim in shrink_active_list() · 1480de03
      Committed by Konstantin Khlebnikov
      Reset the reclaim mode in shrink_active_list() to RECLAIM_MODE_SINGLE |
      RECLAIM_MODE_ASYNC.  (The sync/async flag is used only in
      shrink_page_list() and does not affect shrink_active_list().)
      
      Currently shrink_active_list() sometimes works in lumpy-reclaim mode if
      RECLAIM_MODE_LUMPYRECLAIM is left over from an earlier
      shrink_inactive_list().  Meanwhile, in age_active_anon()
      sc->reclaim_mode is completely zero.  So the current behaviour is too
      complex and confusing, and it looks like a bug.
      
      In general, shrink_active_list() populates the inactive list for the
      next shrink_inactive_list().  A lumpy shrink_inactive_list() isolates
      pages around the chosen one from both the active and inactive lists.
      So there is no reason for lumpy isolation in shrink_active_list().
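
      A minimal sketch of the change (illustrative; the helper and the exact
      function signature in the actual diff may differ) is to pin the reclaim
      mode at the top of shrink_active_list() so leftover lumpy state cannot
      leak in:

         static void shrink_active_list(unsigned long nr_to_scan,
                                        struct mem_cgroup_zone *mz,
                                        struct scan_control *sc,
                                        int priority, int file)
         {
            /*
             * Active-list scanning only refills the inactive list, one
             * page at a time, so never isolate lumpily here.
             */
            sc->reclaim_mode = RECLAIM_MODE_SINGLE | RECLAIM_MODE_ASYNC;

            /* ... isolate, rotate and put back pages as before ... */
         }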
      
      See also: https://lkml.org/lkml/2012/3/15/583
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Proposed-by: Hugh Dickins <hughd@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • cpuset: mm: reduce large amounts of memory barrier related damage v3 · cc9a6c87
      Committed by Mel Gorman
      Commit c0ff7453 ("cpuset,mm: fix no node to alloc memory when
      changing cpuset's mems") wins a super prize for the largest number of
      memory barriers entered into fast paths for one commit.
      
      [get|put]_mems_allowed is incredibly heavy, with pairs of full memory
      barriers inserted into a number of hot paths.  This was detected while
      investigating a large page allocator slowdown introduced some time
      after 2.6.32.  The largest portion of this overhead was shown by
      oprofile to be at an mfence introduced by this commit into the page
      allocator hot path.
      
      For extra style points, the commit introduced the use of yield() in an
      implementation of what looks like a spinning mutex.
      
      This patch replaces the full memory barriers on both the read and write
      sides with a sequence counter, using only read barriers on the fast-path
      side.  This is much cheaper on some architectures, including x86.  The
      main bulk of the patch is the retry logic for when the nodemask changes
      in a manner that can cause a false failure.
      
      While updating the nodemask, a check is made to see if a false failure
      is a risk.  If it is, the sequence number gets bumped and parallel
      allocators will briefly stall while the nodemask update takes place.
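
      A hedged sketch of the scheme (the field name mems_seq and the helper
      try_alloc() below are illustrative placeholders, not necessarily what
      the patch adds): the writer bumps a sequence counter around the
      nodemask update, and the allocator fast path retries only when an
      allocation failed while the counter changed underneath it.

         /* writer: cpuset code updating a task's nodemask */
         static void update_task_nodemask(struct task_struct *tsk,
                                          const nodemask_t *newmask)
         {
            write_seqcount_begin(&tsk->mems_seq);   /* illustrative field */
            tsk->mems_allowed = *newmask;
            write_seqcount_end(&tsk->mems_seq);
         }

         /* reader: allocation fast path pays only read barriers */
         static struct page *alloc_pages_nodemask_retry(gfp_t gfp,
                                                        unsigned int order)
         {
            struct page *page;
            unsigned int seq;

            do {
               seq = read_seqcount_begin(&current->mems_seq);
               page = try_alloc(gfp, order);   /* placeholder for the
                                                  real allocation path */
            } while (!page && read_seqcount_retry(&current->mems_seq, seq));

            return page;
         }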
      
      In a page fault test microbenchmark, oprofile samples from
      __alloc_pages_nodemask went from 4.53% of all samples to 1.15%.  The
      actual results were
      
                                   3.3.0-rc3          3.3.0-rc3
                                   rc3-vanilla        nobarrier-v2r1
          Clients   1 UserTime       0.07 (  0.00%)   0.08 (-14.19%)
          Clients   2 UserTime       0.07 (  0.00%)   0.07 (  2.72%)
          Clients   4 UserTime       0.08 (  0.00%)   0.07 (  3.29%)
          Clients   1 SysTime        0.70 (  0.00%)   0.65 (  6.65%)
          Clients   2 SysTime        0.85 (  0.00%)   0.82 (  3.65%)
          Clients   4 SysTime        1.41 (  0.00%)   1.41 (  0.32%)
          Clients   1 WallTime       0.77 (  0.00%)   0.74 (  4.19%)
          Clients   2 WallTime       0.47 (  0.00%)   0.45 (  3.73%)
          Clients   4 WallTime       0.38 (  0.00%)   0.37 (  1.58%)
          Clients   1 Flt/sec/cpu  497620.28 (  0.00%) 520294.53 (  4.56%)
          Clients   2 Flt/sec/cpu  414639.05 (  0.00%) 429882.01 (  3.68%)
          Clients   4 Flt/sec/cpu  257959.16 (  0.00%) 258761.48 (  0.31%)
          Clients   1 Flt/sec      495161.39 (  0.00%) 517292.87 (  4.47%)
          Clients   2 Flt/sec      820325.95 (  0.00%) 850289.77 (  3.65%)
          Clients   4 Flt/sec      1020068.93 (  0.00%) 1022674.06 (  0.26%)
          MMTests Statistics: duration
          Sys Time Running Test (seconds)             135.68    132.17
          User+Sys Time Running Test (seconds)         164.2    160.13
          Total Elapsed Time (seconds)                123.46    120.87
      
      The overall improvement is small, but the System CPU time is much
      improved and roughly correlates with what oprofile reported (these
      performance figures were taken without profiling, so some skew is
      expected).  The actual number of page faults is noticeably improved.
      
      For benchmarks like kernel builds, the overall benefit is marginal but
      the system CPU time is slightly reduced.
      
      To test the actual bug the commit fixed, I opened two terminals.  The
      first ran within a cpuset and continually ran a small program that
      faulted 100M of anonymous data.  In a second window, the nodemask of the
      cpuset was continually randomised in a loop.
      
      Without the commit, the program would fail every so often (usually
      within 10 seconds) and obviously with the commit everything worked fine.
      With this patch applied, it also worked fine so the fix should be
      functionally equivalent.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: Miao Xie <miaox@cn.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/vmscan.c: fix spelling error · c7cfa37b
      Committed by Copot Alexandru
      s/noticable/noticeable/
      Signed-off-by: Copot Alexandru <alex.mihai.c@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • vmscan: handle isolated pages with lru lock released · d563c050
      Committed by Hillf Danton
      When shrinking the inactive lru list, isolated pages are queued on a
      locally private list, so the lock-hold time can be reduced if pages are
      counted without lock protection.
      
      To achieve that, first, updating the reclaim stats is delayed until the
      putback stage, after the lru lock has been reacquired.
      
      Second, operations on vm and zone stats are now protected by disabling
      preemption, as they are per-cpu operations.
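
      A small sketch of the second point (the event, counter and local
      variable names below are illustrative, assuming the usual
      shrink_inactive_list()/putback locals): the double-underscore stat
      helpers are non-atomic per-cpu operations, so preemption is disabled
      around them while the lru lock is not held.

         preempt_disable();
         /* per-cpu, non-atomic variants: safe only with preemption off */
         __count_vm_events(PGACTIVATE, nr_activate);
         __mod_zone_page_state(zone, NR_ISOLATED_ANON, -nr_anon);
         preempt_enable();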
      Signed-off-by: Hillf Danton <dhillf@gmail.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: vmscan: forcibly scan highmem if there are too many buffer_heads pinning highmem · cc715d99
      Committed by Mel Gorman
      Stuart Foster reported on bugzilla that copying large amounts of data
      from NTFS caused an OOM kill on 32-bit x86 with 16G of memory.  Andrew
      Morton correctly identified that the problem was that NTFS was using
      512-byte blocks, meaning each page had 8 buffer_heads in low memory
      pinning it.
      
      In the past, direct reclaim used to scan highmem even if the allocating
      process did not specify __GFP_HIGHMEM, but it no longer does.  kswapd
      also no longer reclaims from zones that are above the high watermark.
      The intention in both cases was to minimise unnecessary reclaim.  The
      downside, on machines with large amounts of highmem, is that lowmem can
      be fully consumed by buffer_heads with nothing trying to free them.
      
      The following patch is based on a suggestion by Andrew Morton to extend
      the buffer_heads_over_limit case to force kswapd and direct reclaim to
      scan the highmem zone regardless of the allocation request or watermarks.
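
      A hedged sketch of the direct-reclaim side of that idea (illustrative;
      the actual patch adjusts the kswapd path as well, and zone, z, zonelist
      and sc are assumed to be the usual reclaim locals/arguments):

         /* in the zone-selection step of direct reclaim (sketch) */

         /*
          * If lowmem is pinned by buffer_heads, let reclaim consider
          * highmem even though the allocation did not ask for
          * __GFP_HIGHMEM: freeing highmem page cache also releases the
          * lowmem buffer_heads attached to it.
          */
         if (buffer_heads_over_limit)
            sc->gfp_mask |= __GFP_HIGHMEM;

         for_each_zone_zonelist_nodemask(zone, z, zonelist,
                         gfp_zone(sc->gfp_mask), sc->nodemask) {
            /* ... existing per-zone shrinking ... */
         }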
      
      Addresses https://bugzilla.kernel.org/show_bug.cgi?id=42578
      
      [hughd@google.com: move buffer_heads_over_limit check up]
      [akpm@linux-foundation.org: buffer_heads_over_limit is unlikely]
      Reported-by: Stuart Foster <smf.linux@ntlworld.com>
      Tested-by: Stuart Foster <smf.linux@ntlworld.com>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: stable <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • vmscan: only defer compaction for failed order and higher · aff62249
      Committed by Rik van Riel
      Currently a failed order-9 (transparent hugepage) compaction can lead to
      memory compaction being temporarily disabled for a memory zone, even if
      we only need compaction for an order-2 allocation, e.g. for jumbo-frame
      networking.
      
      The fix is relatively straightforward: keep track of the highest order at
      which compaction is succeeding, and only defer compaction for orders at
      which compaction is failing.
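
      A sketch of the bookkeeping (field and helper names are chosen for
      illustration and may not match the patch exactly); tracking the lowest
      order that failed is equivalent to tracking the highest order still
      succeeding:

         /* called when compaction fails for 'order' in this zone */
         void defer_compaction(struct zone *zone, int order)
         {
            zone->compact_considered = 0;
            zone->compact_defer_shift++;

            if (zone->compact_defer_shift > COMPACT_MAX_DEFER_SHIFT)
               zone->compact_defer_shift = COMPACT_MAX_DEFER_SHIFT;

            /* remember the smallest order that has failed */
            if (order < zone->compact_order_failed)
               zone->compact_order_failed = order;
         }

         /* returns true if compaction should be skipped for this order */
         bool compaction_deferred(struct zone *zone, int order)
         {
            unsigned long defer_limit = 1UL << zone->compact_defer_shift;

            /* smaller orders than the failed one may still compact */
            if (order < zone->compact_order_failed)
               return false;

            if (++zone->compact_considered > defer_limit)
               zone->compact_considered = defer_limit;

            return zone->compact_considered < defer_limit;
         }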
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>