1. 16 June 2011 (17 commits)
    • mm: compaction: ensure that the compaction free scanner does not move to the next zone · 7454f4ba
      Authored by Mel Gorman
      Compaction works with two scanners, a migration and a free scanner.  When
      the scanners crossover, migration within the zone is complete.  The
      location of the scanner is recorded on each cycle to avoid excessive
      scanning.
      
      When a zone is small and mostly reserved, it's very easy for the migration
      scanner to be close to the end of the zone.  Then the following situation
      can occur:
      
        o migration scanner isolates some pages near the end of the zone
        o free scanner starts at the end of the zone but finds that the
          migration scanner is already there
        o free scanner gets reinitialised for the next cycle as
          cc->migrate_pfn + pageblock_nr_pages
          moving the free scanner into the next zone
        o migration scanner moves into the next zone
      
      When this happens, NR_ISOLATED accounting goes haywire because some of the
      accounting happens against the wrong zone.  One zone's counter remains
      positive while the other goes negative, even though the overall global
      count is accurate.  This was reported on X86-32 with !SMP because !SMP
      allows the negative counters to be visible; in theory the bug should be
      possible elsewhere as well.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7454f4ba
    • compaction: checks correct fragmentation index · a582a738
      Authored by Shaohua Li
      fragmentation_index() returns -1000 when the allocation might succeed.
      This doesn't match the comment and code in compaction_suitable().  I
      think compaction_suitable() should return COMPACT_PARTIAL in the -1000
      case, because in this case the allocation could succeed depending on
      the watermarks.
      
      The impact of this is that compaction starts and compact_finished() is
      called, which rechecks the watermarks and the free lists.  The end
      result is the same (compaction does not go ahead) but the path taken is
      more expensive.
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Shaohua Li <shaohua.li@intel.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a582a738
    • mm/memory-failure.c: fix page isolated count mismatch · 5db8a73a
      Authored by Minchan Kim
      Pages isolated for migration are accounted with the vmstat counters
      NR_ISOLATE_[ANON|FILE].  Callers of migrate_pages() are expected to
      increment these counters when pages are isolated from the LRU.  Once the
      pages have been migrated, they are put back on the LRU or freed and the
      isolated count is decremented.
      
      Memory failure is not properly accounting for pages it isolates causing
      the NR_ISOLATED counters to be negative.  On SMP builds, this goes
      unnoticed as negative counters are treated as 0 due to expected per-cpu
      drift.  On UP builds, the counter is treated by too_many_isolated() as a
      large value causing processes to enter D state during page reclaim or
      compaction.  This patch accounts for pages isolated by memory failure
      correctly.
      
      [mel@csn.ul.ie: rewrote changelog]
      Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5db8a73a
    • memcg: avoid percpu cached charge draining at softlimit · fbc29a25
      Authored by KAMEZAWA Hiroyuki
      Based on Michal Hocko's comment.
      
      We are not draining per cpu cached charges during soft limit reclaim
      because background reclaim doesn't care about charges.  It tries to free
      some memory and charges will not give any.
      
      Cached charges might influence only selection of the biggest soft limit
      offender but as the call is done only after the selection has been already
      done it makes no change.
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fbc29a25
    • memcg: fix percpu cached charge draining frequency · 26fe6168
      Authored by KAMEZAWA Hiroyuki
      For performance, memory cgroup caches some "charge" from res_counter into
      per cpu cache.  This works well but because it's cache, it needs to be
      flushed in some cases.  Typical cases are
      
         1. when someone hit limit.
      
         2. when rmdir() is called and the charges need to be 0.
      
      But case 1 has a problem.
      
      Recently, with large SMP machines, we see many kworker runs caused by
      flushing memcg's cache.  The bad thing in the implementation is that the
      drain code is called even on a cpu whose cache belongs to a memcg
      unrelated to the one that hit its limit.
      
      This patch:
              A) checks whether the percpu cache contains useful data,
              B) checks that no other asynchronous percpu drain is running,
              C) doesn't call the local cpu callback (the local cpu drains directly).
      
      (*) This patch avoids changing the calling conditions for the hard limit.
      
      When I run "cat 1Gfile > /dev/null" under 300M limit memcg,
      
      [Before]
      13767 kamezawa  20   0 98.6m  424  416 D 10.0  0.0   0:00.61 cat
         58 root      20   0     0    0    0 S  0.6  0.0   0:00.09 kworker/2:1
         60 root      20   0     0    0    0 S  0.6  0.0   0:00.08 kworker/4:1
          4 root      20   0     0    0    0 S  0.3  0.0   0:00.02 kworker/0:0
         57 root      20   0     0    0    0 S  0.3  0.0   0:00.05 kworker/1:1
         61 root      20   0     0    0    0 S  0.3  0.0   0:00.05 kworker/5:1
         62 root      20   0     0    0    0 S  0.3  0.0   0:00.05 kworker/6:1
         63 root      20   0     0    0    0 S  0.3  0.0   0:00.05 kworker/7:1
      
      [After]
       2676 root      20   0 98.6m  416  416 D  9.3  0.0   0:00.87 cat
       2626 kamezawa  20   0 15192 1312  920 R  0.3  0.0   0:00.28 top
          1 root      20   0 19384 1496 1204 S  0.0  0.0   0:00.66 init
          2 root      20   0     0    0    0 S  0.0  0.0   0:00.00 kthreadd
          3 root      20   0     0    0    0 S  0.0  0.0   0:00.00 ksoftirqd/0
          4 root      20   0     0    0    0 S  0.0  0.0   0:00.00 kworker/0:0
      
      [akpm@linux-foundation.org: make percpu_charge_mutex static, tweak comments]
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Tested-by: Ying Han <yinghan@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      26fe6168
    • memcg: fix wrong check of noswap with softlimit · 7ae534d0
      Authored by KAMEZAWA Hiroyuki
      Hierarchical reclaim doesn't swap out if the memsw and resource limits
      are the same (memsw_is_minimum == true) because we would hit the
      mem+swap limit anyway (during hard limit reclaim).
      
      For the soft limit we shouldn't consider memsw_is_minimum at all,
      because it doesn't make much sense there.  Either the soft limit is
      below the hard limit, in which case we cannot hit the mem+swap limit, or
      direct reclaim takes precedence.
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7ae534d0
    • memcg: fix init_page_cgroup nid with sparsemem · 37573e8c
      Authored by KAMEZAWA Hiroyuki
      Commit 21a3c964 ("memcg: allocate memory cgroup structures in local
      nodes") made page_cgroup allocation NUMA aware.  But that caused a
      problem: https://bugzilla.kernel.org/show_bug.cgi?id=36192.
      
      The problem was getting an NID from invalid struct pages, which were not
      initialized because they were out-of-node, i.e. outside [node_start_pfn,
      node_end_pfn).
      
      Now, with sparsemem, page_cgroup_init scans pfn from 0 to max_pfn.  But
      this may scan a pfn which is not on any node and can access memmap which
      is not initialized.
      
      This makes page_cgroup_init() for SPARSEMEM node aware and removes the
      code that got the nid from page->flags, so a valid NID is always used.
      
      [akpm@linux-foundation.org: try to fix up comments]
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      37573e8c
    • mm: memory.numa_stat: fix file permission · 89577127
      Authored by KAMEZAWA Hiroyuki
      Commit 406eb0c9 ("memcg: add memory.numastat api for numa
      statistics") added the memory.numa_stat file for the memory cgroup.  But
      the file permissions were wrong:
      
        [kamezawa@bluextal linux-2.6]$ ls -l /cgroup/memory/A/memory.numa_stat
        ---------- 1 root root 0 Jun  9 18:36 /cgroup/memory/A/memory.numa_stat
      
      This patch fixes the permissions:
      
        [root@bluextal kamezawa]# ls -l /cgroup/memory/A/memory.numa_stat
        -r--r--r-- 1 root root 0 Jun 10 16:49 /cgroup/memory/A/memory.numa_stat
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Ying Han <yinghan@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      89577127
    • mm: fix negative commitlimit when gigantic hugepages are allocated · b0320c7b
      Authored by Rafael Aquini
      When 1GB hugepages are allocated on a system, free(1) reports less
      available memory than is really installed in the box.  Also, if the
      total size of hugepages allocated on a system is over half of the total
      memory size, CommitLimit becomes a negative number.
      
      The problem is that gigantic hugepages (order > MAX_ORDER) can only be
      allocated at boot with bootmem, so their frames are not accounted to
      'totalram_pages'.  They are, however, accounted to hugetlb_total_pages().
      
      What happens to turn CommitLimit into a negative number is this
      calculation, in fs/proc/meminfo.c:
      
              allowed = ((totalram_pages - hugetlb_total_pages())
                      * sysctl_overcommit_ratio / 100) + total_swap_pages;
      
      A similar calculation occurs in __vm_enough_memory() in mm/mmap.c.
      
      Also, every vm statistic that depends on 'totalram_pages' will render
      confusing values, as if the system were 'missing' part of its memory.
      
      Impact of this bug:
      
      When gigantic hugepages are allocated and sysctl_overcommit_memory ==
      OVERCOMMIT_NEVER, __vm_enough_memory() goes through the 'allowed'
      calculation mentioned above and might end up mistakenly returning
      -ENOMEM, thus forcing the system to start reclaiming pages earlier than
      usual, which can hurt overall system performance, depending on the
      workload.
      
      Besides the aforementioned scenario, I can only think of this causing
      annoyances with memory reports from /proc/meminfo and free(1).
      
      [akpm@linux-foundation.org: standardize comment layout]
      Reported-by: Russ Anderson <rja@sgi.com>
      Signed-off-by: Rafael Aquini <aquini@linux.com>
      Acked-by: Russ Anderson <rja@sgi.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b0320c7b
    • mm/memory_hotplug.c: fix building of node hotplug zonelist · 959ecc48
      Authored by KAMEZAWA Hiroyuki
      During memory hotplug we refresh zonelists when we online a page in a new
      zone.  It means that the node's zonelist is not initialized until pages
      are onlined.  So for example, "nid" passed by MEM_GOING_ONLINE notifier
      will point to NODE_DATA(nid) which has no zone fallback list.  Moreover,
      if we hot-add cpu-only nodes, alloc_pages() will do no fallback.
      
      This patch builds a zonelist when a new pgdat becomes available.
      
      Note: in production at Fujitsu, memory is onlined before cpus, and our
            servers didn't have any memory-less nodes, so we saw no problems.
      
            But recent changes around MEM_GOING_ONLINE+page_cgroup
            will access the uninitialized zonelist of a node.
            In any case, memory-less nodes exist and need this care.
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      959ecc48
    • mm: compaction: fix special case -1 order checks · 3957c776
      Authored by Michal Hocko
      Commit 56de7263 ("mm: compaction: direct compact when a high-order
      allocation fails") introduced a check for cc->order == -1 in
      compact_finished.  We should continue compacting in that case because
      the request came from userspace and there is no particular order to
      compact for.  A similar check was added by commit 82478fb7 ("mm:
      compaction: prevent division-by-zero during user-requested compaction")
      for compaction_suitable().
      
      The check is, however, done after zone_watermark_ok(), which uses order
      as the right-hand argument of shifts.  Not only is the watermark check
      pointless when we can break out without it, it also evaluates 1 << -1,
      which is not well defined (at least by the C standard).  Let's move the
      -1 check above zone_watermark_ok().
      
      [minchan.kim@gmail.com: caught compaction_suitable]
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hioryu@jp.fujitsu.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3957c776
    • mm: fix wrong kunmap_atomic() pointer · 5f1a1907
      Authored by Steven Rostedt
      Running a ktest.pl test, I hit the following bug on x86_32:
      
        ------------[ cut here ]------------
        WARNING: at arch/x86/mm/highmem_32.c:81 __kunmap_atomic+0x64/0xc1()
         Hardware name:
        Modules linked in:
        Pid: 93, comm: sh Not tainted 2.6.39-test+ #1
        Call Trace:
         [<c04450da>] warn_slowpath_common+0x7c/0x91
         [<c042f5df>] ? __kunmap_atomic+0x64/0xc1
         [<c042f5df>] ? __kunmap_atomic+0x64/0xc1^M
         [<c0445111>] warn_slowpath_null+0x22/0x24
         [<c042f5df>] __kunmap_atomic+0x64/0xc1
         [<c04d4a22>] unmap_vmas+0x43a/0x4e0
         [<c04d9065>] exit_mmap+0x91/0xd2
         [<c0443057>] mmput+0x43/0xad
         [<c0448358>] exit_mm+0x111/0x119
         [<c044855f>] do_exit+0x1ff/0x5fa
         [<c0454ea2>] ? set_current_blocked+0x3c/0x40
         [<c0454f24>] ? sigprocmask+0x7e/0x8e
         [<c0448b55>] do_group_exit+0x65/0x88
         [<c0448b90>] sys_exit_group+0x18/0x1c
         [<c0c3915f>] sysenter_do_call+0x12/0x38
        ---[ end trace 8055f74ea3c0eb62 ]---
      
      Running a ktest.pl git bisect, found the culprit: commit e303297e
      ("mm: extended batches for generic mmu_gather")
      
      But although this was the commit triggering the bug, it was not the one
      originally responsible for the bug.  That was commit d16dfc55 ("mm:
      mmu_gather rework").
      
      The code in zap_pte_range() has something that looks like the following:
      
      	pte =  pte_offset_map_lock(mm, pmd, addr, &ptl);
      	do {
      		[...]
      	} while (pte++, addr += PAGE_SIZE, addr != end);
      	pte_unmap_unlock(pte - 1, ptl);
      
      The pte starts off pointing at the first element in the page table
      directory that was returned by the pte_offset_map_lock().  When it's done
      with the page, pte will be pointing to anything between the next entry and
      the first entry of the next page inclusive.  By doing a pte - 1, this puts
      the pte back onto the original page, which is all that pte_unmap_unlock()
      needs.
      
      In most archs (64 bit), this is not an issue as the pte is ignored in the
      pte_unmap_unlock().  But on 32 bit archs, where things may be kmapped, it
      is essential that the pte passed to pte_unmap_unlock() resides on the same
      page that was given by pte_offset_map_lock().
      
      The problem came in d16dfc55 ("mm: mmu_gather rework") where it introduced
      a "break;" from the while loop.  This alone did not seem to easily trigger
      the bug.  But the modifications made by e303297e caused that "break;" to
      be hit on the first iteration, before the pte++.
      
      The pte not being incremented will now cause pte_unmap_unlock(pte - 1) to
      be pointing to the previous page.  This will cause the wrong page to be
      unmapped, and also trigger the warning above.
      
      The simple solution is to just save the pointer given by
      pte_offset_map_lock() and use it in the unlock.
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5f1a1907
    • vmscan: implement swap token priority aging · d7911ef3
      Authored by KOSAKI Motohiro
      While testing the memcg-aware swap token, I observed that the swap token
      was often grabbed by an intermittently running process (e.g. init,
      auditd) and never released.
      
      Why?
      
      Some processes (e.g. init, auditd, audispd) wake up when another process
      is exiting.  When the exiting process leaves the swap token without an
      owner, the first process to page in afterwards grabs it.  Thus the
      intermittently running processes above often end up with the token.
      
      Also, swap token priority is currently only decreased in the page fault
      path.  So if the process sleeps immediately after grabbing the token,
      its priority is never decreased.  That's obviously undesirable.
      
      This patch implements a very crude (and lightweight) priority aging.  It
      only affects the corner case above and doesn't change the performance of
      swap-heavy workloads (e.g. a multi-process qsbench load).
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d7911ef3
    • vmscan: implement swap token trace · 83cd81a3
      Authored by KOSAKI Motohiro
      This is useful for observing swap token activity.
      
      example output:
      
             zsh-1845  [000]   598.962716: update_swap_token_priority: mm=ffff88015eaf7700 old_prio=1 new_prio=0
          memtoy-1830  [001]   602.033900: update_swap_token_priority: mm=ffff880037a45880 old_prio=947 new_prio=949
          memtoy-1830  [000]   602.041509: update_swap_token_priority: mm=ffff880037a45880 old_prio=949 new_prio=951
          memtoy-1830  [000]   602.051959: update_swap_token_priority: mm=ffff880037a45880 old_prio=951 new_prio=953
          memtoy-1830  [000]   602.052188: update_swap_token_priority: mm=ffff880037a45880 old_prio=953 new_prio=955
          memtoy-1830  [001]   602.427184: put_swap_token: token_mm=ffff880037a45880
             zsh-1789  [000]   602.427281: replace_swap_token: old_token_mm=(null) old_prio=0 new_token_mm=ffff88015eaf7018 new_prio=2
             zsh-1789  [001]   602.433456: update_swap_token_priority: mm=ffff88015eaf7018 old_prio=2 new_prio=4
             zsh-1789  [000]   602.437613: update_swap_token_priority: mm=ffff88015eaf7018 old_prio=4 new_prio=6
             zsh-1789  [000]   602.443924: update_swap_token_priority: mm=ffff88015eaf7018 old_prio=6 new_prio=8
             zsh-1789  [000]   602.451873: update_swap_token_priority: mm=ffff88015eaf7018 old_prio=8 new_prio=10
             zsh-1789  [001]   602.462639: update_swap_token_priority: mm=ffff88015eaf7018 old_prio=10 new_prio=12
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      83cd81a3
    • vmscan,memcg: memcg aware swap token · a433658c
      Authored by KOSAKI Motohiro
      Currently, memcg reclaim can disable the swap token even if the swap
      token mm doesn't belong to the memcg being reclaimed.  That's slightly
      risky: if an admin creates a very small mem-cgroup and someone runs a
      contentious, heavy memory-pressure workload in it, every task is going
      to lose the swap token and the system may become unresponsive.  That's
      bad.
      
      This patch adds a 'memcg' parameter to disable_swap_token(); if the
      parameter doesn't match the swap token owner's memcg, the VM doesn't
      disable the token.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a433658c
    • mm/memory.c: fix kernel-doc notation · 0164f69d
      Authored by Randy Dunlap
      Fix new kernel-doc warnings in mm/memory.c:
      
        Warning(mm/memory.c:1327): No description found for parameter 'tlb'
        Warning(mm/memory.c:1327): Excess function parameter 'tlbp' description in 'unmap_vmas'
      Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0164f69d
    • mm: remove khugepaged double thp vmstat update with CONFIG_NUMA=n · f300ea49
      Authored by Andrea Arcangeli
      Johannes noticed that the vmstat update is already taken care of by
      khugepaged_alloc_hugepage() internally.  The only places required to
      update the vmstat are the callers of alloc_hugepage() (callers of
      khugepaged_alloc_hugepage() aren't).
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Reported-by: Johannes Weiner <jweiner@redhat.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Acked-by: Johannes Weiner <jweiner@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f300ea49
  2. 06 June 2011 (1 commit)
  3. 04 June 2011 (3 commits)
    • more conservative S_NOSEC handling · 9e1f1de0
      Authored by Al Viro
      Caching "we have already removed suid/caps" was overenthusiastic as merged.
      On network filesystems we might have had suid/caps set on another client,
      silently picked by this client on revalidate, all of that *without* clearing
      the S_NOSEC flag.
      
      AFAICS, the only reasonably sane way to deal with that is
      	* new superblock flag; unless set, S_NOSEC is not going to be set.
      	* local block filesystems set it in their ->mount() (more accurately,
      mount_bdev() does, so does btrfs ->mount(), users of mount_bdev() other than
      local block ones clear it)
      	* if any network filesystem (or a cluster one) wants to use S_NOSEC,
      it'll need to set MS_NOSEC in sb->s_flags *AND* take care to clear S_NOSEC when
      inode attribute changes are picked from other clients.
      
      It's not an earth-shattering hole (anybody that can set suid on another client
      will almost certainly be able to write to the file before doing that anyway),
      but it's a bug that needs fixing.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      9e1f1de0
    • SLAB: Record actual last user of freed objects. · a947eb95
      Authored by Suleiman Souhlal
      Currently, when using CONFIG_DEBUG_SLAB, we record kfree() or
      kmem_cache_free() as the last user of freed objects, which is not
      very useful, so change it to the caller of those functions instead.
      Acked-by: David Rientjes <rientjes@google.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Signed-off-by: Suleiman Souhlal <suleiman@google.com>
      Signed-off-by: Pekka Enberg <penberg@kernel.org>
      a947eb95
    • slub: always align cpu_slab to honor cmpxchg_double requirement · d4d84fef
      Authored by Chris Metcalf
      On an architecture without CMPXCHG_LOCAL but with DEBUG_VM enabled,
      the VM_BUG_ON() in __pcpu_double_call_return_bool() will cause an early
      panic during boot unless we always align cpu_slab properly.
      
      In principle we could remove the alignment-testing VM_BUG_ON() for
      architectures that don't have CMPXCHG_LOCAL, but leaving it in means
      that new code will tend not to break x86 even if it is introduced
      on another platform, and it's low cost to require alignment.
      Acked-by: David Rientjes <rientjes@google.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
      Signed-off-by: Pekka Enberg <penberg@kernel.org>
      d4d84fef
  4. 02 June 2011 (1 commit)
  5. 30 May 2011 (1 commit)
  6. 29 May 2011 (4 commits)
    • mm: fix page_lock_anon_vma leaving mutex locked · eee0f252
      Authored by Hugh Dickins
      On one machine I've been getting hangs, a page fault's anon_vma_prepare()
      waiting in anon_vma_lock(), other processes waiting for that page's lock.
      
      This is a replay of last year's f1819427 "mm: fix hang on
      anon_vma->root->lock".
      
      The new page_lock_anon_vma() places too much faith in its refcount: when
      it has acquired the mutex_trylock(), it's possible that a racing task in
      anon_vma_alloc() has just reallocated the struct anon_vma, set refcount
      to 1, and is about to reset its anon_vma->root.
      
      Fix this by saving anon_vma->root, and relying on the usual page_mapped()
      check instead of a refcount check: if page is still mapped, the anon_vma
      is still ours; if page is not still mapped, we're no longer interested.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      eee0f252
    • mm: fix kernel BUG at mm/rmap.c:1017! · 5dbe0af4
      Authored by Hugh Dickins
      I've hit the "address >= vma->vm_end" check in do_page_add_anon_rmap()
      just once.  The stack showed khugepaged allocation trying to compact
      pages: the call to page_add_anon_rmap() coming from remove_migration_pte().
      
      That path holds anon_vma lock, but does not hold mmap_sem: it can
      therefore race with a split_vma(), and in commit 5f70b962 "mmap:
      avoid unnecessary anon_vma lock" we just took away the anon_vma lock
      protection when adjusting vma->vm_end.
      
      I don't think that particular BUG_ON ever caught anything interesting,
      so better replace it by a comment, than reinstate the anon_vma locking.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5dbe0af4
    • tmpfs: fix race between truncate and writepage · 826267cf
      Authored by Hugh Dickins
      While running fsx on tmpfs with a memhog then swapoff, swapoff was hanging
      (interruptibly), repeatedly failing to locate the owner of a 0xff entry in
      the swap_map.
      
      Although shmem_writepage() does abandon when it sees incoming page index
      is beyond eof, there was still a window in which shmem_truncate_range()
      could come in between writepage's dropping lock and updating swap_map,
      find the half-completed swap_map entry, and in trying to free it,
      leave it in a state that swap_shmem_alloc() could not correct.
      
      Arguably a bug in __swap_duplicate()'s and swap_entry_free()'s handling
      of the different cases, but easiest to fix by moving swap_shmem_alloc()
      under cover of the lock.
      
      More interesting than the bug: it's been there since 2.6.33, why could
      I not see it with earlier kernels?  The mmotm of two weeks ago seems to
      have some magic for generating races, this is just one of three I found.
      
      With yesterday's git I first saw this in mainline, bisected in search of
      that magic, but the easy reproducibility evaporated.  Oh well, fix the bug.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: stable@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      826267cf
    • Cache xattr security drop check for write v2 · 69b45732
      Authored by Andi Kleen
      Some recent benchmarking on btrfs showed that a major scaling bottleneck
      on large systems on btrfs is currently the xattr lookup on every write.
      
      Why xattr lookup on every write I hear you ask?
      
      write wants to drop suid and security-related xattrs that could set
      capabilities for executables.  To do that it currently looks up
      security.capability on EVERY write (even for non-executables) to decide
      whether to drop it or not.
      
      In btrfs this causes an additional tree walk, hitting some per-filesystem
      locks and quite bad scalability.  In a simple read workload on an 8S
      system I saw over 90% CPU time in spinlocks related to that.
      
      Chris Mason tells me this is also a problem in ext4, where it hits
      the global mbcache lock.
      
      This patch adds a simple per-inode flag to avoid this problem.  We only
      do the lookup once per file and then, if there is no xattr, cache
      the decision.  All xattr changes clear the flag.
      
      I also used the same flag to avoid the suid check, although
      that one is pretty cheap.
      
      A file system can also set this flag when it creates the inode,
      if it has a cheap way to do so.  This is done for some common file systems
      in followon patches.
      
      With this patch a major part of the lock contention disappears
      for btrfs. Some testing on smaller systems didn't show significant
      performance changes, but at least it helps the larger systems
      and is generally more efficient.
      
      v2: Rename is_sgid. add file system helper.
      Cc: chris.mason@oracle.com
      Cc: josef@redhat.com
      Cc: viro@zeniv.linux.org.uk
      Cc: agruen@linbit.com
      Cc: Serge E. Hallyn <serue@us.ibm.com>
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      69b45732
  7. 28 May 2011 (1 commit)
  8. 27 May 2011 (12 commits)