1. 10 Oct, 2014 40 commits
    • mm: memcontrol: fix transparent huge page allocations under pressure · b70a2a21
      Committed by Johannes Weiner
      In a memcg with even just moderate cache pressure, success rates for
      transparent huge page allocations drop to zero, wasting a lot of effort
      that the allocator puts into assembling these pages.
      
      The reason for this is that the memcg reclaim code was never designed for
      higher-order charges.  It reclaims in small batches until there is room
      for at least one page.  Huge page charges only succeed when these batches
      add up over a series of huge faults, which is unlikely under any
      significant load involving order-0 allocations in the group.
      
      Remove that loop on the memcg side in favor of passing the actual reclaim
      goal to direct reclaim, which is already set up and optimized to meet
      higher-order goals efficiently.
      
      This brings memcg's THP policy in line with the system policy: if the
      allocator painstakingly assembles a hugepage, memcg will at least make an
      honest effort to charge it.  As a result, transparent hugepage allocation
      rates amid cache activity are drastically improved:
      
                                            vanilla                 patched
      pgalloc                 4717530.80 (  +0.00%)   4451376.40 (  -5.64%)
      pgfault                  491370.60 (  +0.00%)    225477.40 ( -54.11%)
      pgmajfault                    2.00 (  +0.00%)         1.80 (  -6.67%)
      thp_fault_alloc               0.00 (  +0.00%)       531.60 (+100.00%)
      thp_fault_fallback          749.00 (  +0.00%)       217.40 ( -70.88%)
      
      [ Note: this may in turn increase memory consumption from internal
        fragmentation, which is an inherent risk of transparent hugepages.
        Some setups may have to adjust the memcg limits accordingly to
        accommodate this - or, if the machine is already packed to capacity,
        disable the transparent huge page feature. ]
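
The idea of handing the real charge size to reclaim can be sketched in user space. This is a toy model only: `toy_group`, `toy_reclaim` and `toy_try_charge` are hypothetical names, and the reclaim step simply subtracts pages instead of scanning LRU lists.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a memcg; every name here is hypothetical. */
struct toy_group {
    size_t usage;       /* pages currently charged */
    size_t limit;       /* hard limit in pages */
    size_t reclaimable; /* charged pages that reclaim could free */
};

/* Free up to `goal` pages and return how many were actually freed. */
static size_t toy_reclaim(struct toy_group *g, size_t goal)
{
    size_t freed = goal < g->reclaimable ? goal : g->reclaimable;

    g->reclaimable -= freed;
    g->usage -= freed;
    return freed;
}

/* Charge nr_pages, asking reclaim for the full charge size if over limit. */
static bool toy_try_charge(struct toy_group *g, size_t nr_pages)
{
    assert(g->reclaimable <= g->usage);

    if (g->usage + nr_pages > g->limit)
        toy_reclaim(g, nr_pages); /* the goal is the real charge, not one page */

    if (g->usage + nr_pages > g->limit)
        return false;

    g->usage += nr_pages;
    return true;
}
```

With a group at its limit and enough reclaimable cache, a 512-page charge succeeds in a single attempt because reclaim was asked for all 512 pages up front.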

      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcontrol: simplify detecting when the memory+swap limit is hit · 3fbe7244
      Committed by Johannes Weiner
      When attempting to charge pages, we first charge the memory counter and
      then the memory+swap counter.  If one of the counters is at its limit, we
      enter reclaim, but if it's the memory+swap counter, reclaim shouldn't swap
      because that wouldn't change the situation.  However, if the counters have
      the same limits, we never get to the memory+swap limit.  To know whether
      reclaim should swap or not, there is a state flag that indicates whether
      the limits are equal and whether hitting the memory limit implies hitting
      the memory+swap limit.
      
      Just try the memory+swap counter first.
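
A minimal sketch of the reordered charge, assuming two simple counters. The types and function names are hypothetical and do not mirror the kernel's res_counter API.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical counter: usage and limit in pages. */
struct toy_counter {
    unsigned long usage;
    unsigned long limit;
};

enum toy_charge_result {
    TOY_CHARGED,
    TOY_RECLAIM_WITHOUT_SWAP, /* memory+swap limit hit: swapping cannot help */
    TOY_RECLAIM_WITH_SWAP,    /* only the memory limit hit: swapping may help */
};

static bool toy_counter_charge(struct toy_counter *c, unsigned long nr)
{
    assert(c->usage <= c->limit);
    if (nr > c->limit - c->usage)
        return false;
    c->usage += nr;
    return true;
}

/* Try memory+swap first, so no "limits are equal" state flag is needed. */
static enum toy_charge_result toy_charge(struct toy_counter *mem,
                                         struct toy_counter *memsw,
                                         unsigned long nr)
{
    if (!toy_counter_charge(memsw, nr))
        return TOY_RECLAIM_WITHOUT_SWAP;

    if (!toy_counter_charge(mem, nr)) {
        memsw->usage -= nr; /* undo the memory+swap charge */
        return TOY_RECLAIM_WITH_SWAP;
    }
    return TOY_CHARGED;
}
```

When both limits are equal, the memory+swap counter fails first, so the caller learns directly that reclaim should not swap.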

      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Vladimir Davydov <vdavydov@parallels.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcontrol: do not kill uncharge batching in free_pages_and_swap_cache · aabfb572
      Committed by Michal Hocko
      free_pages_and_swap_cache limits release_pages to PAGEVEC_SIZE chunks.
      This is not a big deal for the normal release path but it completely kills
      memcg uncharge batching which reduces res_counter spin_lock contention.
      Dave has noticed this with his page fault scalability test case on a large
      machine when the lock was basically dominating on all CPUs:
      
          80.18%    80.18%  [kernel]               [k] _raw_spin_lock
                        |
                        --- _raw_spin_lock
                           |
                           |--66.59%-- res_counter_uncharge_until
                           |          res_counter_uncharge
                           |          uncharge_batch
                           |          uncharge_list
                           |          mem_cgroup_uncharge_list
                           |          release_pages
                           |          free_pages_and_swap_cache
                           |          tlb_flush_mmu_free
                           |          |
                           |          |--90.12%-- unmap_single_vma
                           |          |          unmap_vmas
                           |          |          unmap_region
                           |          |          do_munmap
                           |          |          vm_munmap
                           |          |          sys_munmap
                           |          |          system_call_fastpath
                           |          |          __GI___munmap
                           |          |
                           |           --9.88%-- tlb_flush_mmu
                           |                     tlb_finish_mmu
                           |                     unmap_region
                           |                     do_munmap
                           |                     vm_munmap
                           |                     sys_munmap
                           |                     system_call_fastpath
                           |                     __GI___munmap
      
      In his case the load was running in the root memcg and that part has been
      handled by reverting 05b84301 ("mm: memcontrol: use root_mem_cgroup
      res_counter") because this is a clear regression, but the problem remains
      inside dedicated memcgs.
      
      There is no reason to limit release_pages to PAGEVEC_SIZE batches other
      than lru_lock held times.  This logic, however, can be moved inside the
      function.  mem_cgroup_uncharge_list and free_hot_cold_page_list do not
      hold any lock for the whole pages_to_free list so it is safe to call them
      in a single run.
      
      The release_pages() code was previously breaking the lru_lock each
      PAGEVEC_SIZE pages (ie, 14 pages).  However this code has no usage of
      pagevecs so switch to breaking the lock at least every SWAP_CLUSTER_MAX
      (32) pages.  This means that the lock acquisition frequency is
      approximately halved and the max hold times are approximately doubled.
      
      The now unneeded batching is removed from free_pages_and_swap_cache().
      
      Also update the grossly out-of-date release_pages documentation.
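
The "halved acquisitions, doubled hold time" arithmetic can be checked with a small counting model. It assumes, unlike the kernel, that every page needs the lock; the constants only mirror the values quoted in the message and the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PAGEVEC_SIZE     14
#define TOY_SWAP_CLUSTER_MAX 32

/* Count lock acquisitions when the lock is dropped after every `batch` pages. */
static size_t toy_count_lock_acquisitions(size_t nr_pages, size_t batch)
{
    size_t acquisitions = 0;
    size_t held_for = 0; /* pages handled since the lock was last taken */
    int locked = 0;

    assert(batch > 0);

    for (size_t i = 0; i < nr_pages; i++) {
        if (locked && held_for == batch)
            locked = 0; /* break the lock to bound the hold time */

        if (!locked) {
            locked = 1;
            held_for = 0;
            acquisitions++;
        }
        held_for++; /* page i is released under the lock */
    }
    return acquisitions;
}
```

For 448 pages this gives 32 acquisitions with a batch of 14 and 14 acquisitions with a batch of 32, which matches the rough factor of two in the text.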

      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reported-by: Dave Hansen <dave@sr71.net>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: dmapool: add/remove sysfs file outside of the pool lock lock · 01c2965f
      Committed by Sebastian Andrzej Siewior
      cat /sys/.../pools followed by removal of the device leads to:
      
      |======================================================
      |[ INFO: possible circular locking dependency detected ]
      |3.17.0-rc4+ #1498 Not tainted
      |-------------------------------------------------------
      |rmmod/2505 is trying to acquire lock:
      | (s_active#28){++++.+}, at: [<c017f754>] kernfs_remove_by_name_ns+0x3c/0x88
      |
      |but task is already holding lock:
      | (pools_lock){+.+.+.}, at: [<c011494c>] dma_pool_destroy+0x18/0x17c
      |
      |which lock already depends on the new lock.
      |the existing dependency chain (in reverse order) is:
      |
      |-> #1 (pools_lock){+.+.+.}:
      |   [<c0114ae8>] show_pools+0x30/0xf8
      |   [<c0313210>] dev_attr_show+0x1c/0x48
      |   [<c0180e84>] sysfs_kf_seq_show+0x88/0x10c
      |   [<c017f960>] kernfs_seq_show+0x24/0x28
      |   [<c013efc4>] seq_read+0x1b8/0x480
      |   [<c011e820>] vfs_read+0x8c/0x148
      |   [<c011ea10>] SyS_read+0x40/0x8c
      |   [<c000e960>] ret_fast_syscall+0x0/0x48
      |
      |-> #0 (s_active#28){++++.+}:
      |   [<c017e9ac>] __kernfs_remove+0x258/0x2ec
      |   [<c017f754>] kernfs_remove_by_name_ns+0x3c/0x88
      |   [<c0114a7c>] dma_pool_destroy+0x148/0x17c
      |   [<c03ad288>] hcd_buffer_destroy+0x20/0x34
      |   [<c03a4780>] usb_remove_hcd+0x110/0x1a4
      
      The problem is the lock order of pools_lock and kernfs_mutex in
      dma_pool_destroy() vs show_pools() call path.
      
      This patch moves the creation of the sysfs file outside of the
      pools_lock mutex.  The newly added pools_reg_lock ensures that the create
      and destroy code paths do not race over whether the sysfs file has to be
      deleted (and whether it was deleted before we try to create a new one),
      and over what to do if device_create_file() fails.
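
The resulting lock nesting can be modelled roughly as follows. Booleans stand in for the two mutexes, all `toy_` names are hypothetical, and the error path for a failed file creation is left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy lock state; real code would use mutexes. */
static bool toy_reg_lock_held;   /* models pools_reg_lock */
static bool toy_pools_lock_held; /* models pools_lock */

struct toy_dev {
    int nr_pools;
    bool has_sysfs_file;
};

static void toy_sysfs_set_file(struct toy_dev *dev, bool present)
{
    /* The point of the change: never touch sysfs under the pools lock. */
    assert(toy_reg_lock_held && !toy_pools_lock_held);
    dev->has_sysfs_file = present;
}

static void toy_pool_create(struct toy_dev *dev)
{
    bool first;

    toy_reg_lock_held = true;
    toy_pools_lock_held = true;
    first = (dev->nr_pools == 0);
    dev->nr_pools++;
    toy_pools_lock_held = false; /* drop the inner lock before sysfs work */

    if (first)
        toy_sysfs_set_file(dev, true);
    toy_reg_lock_held = false;
}

static void toy_pool_destroy(struct toy_dev *dev)
{
    bool last;

    toy_reg_lock_held = true;
    toy_pools_lock_held = true;
    dev->nr_pools--;
    last = (dev->nr_pools == 0);
    toy_pools_lock_held = false;

    if (last)
        toy_sysfs_set_file(dev, false);
    toy_reg_lock_held = false;
}
```

The outer registration lock serialises create against destroy, while the list lock that show_pools() also takes is never held across a sysfs call.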

      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: move memcg_update_cache_size() to slab_common.c · 6f817f4c
      Committed by Vladimir Davydov
      While growing per memcg caches arrays, we jump between memcontrol.c and
      slab_common.c in a weird way:
      
        memcg_alloc_cache_id - memcontrol.c
          memcg_update_all_caches - slab_common.c
            memcg_update_cache_size - memcontrol.c
      
      There's absolutely no reason why memcg_update_cache_size can't live on the
      slab's side though.  So let's move it there and settle it comfortably amid
      per-memcg cache allocation functions.
      
      Besides, this patch cleans this function up a bit, removing all the
      useless comments from it, and renames it to memcg_update_cache_params to
      conform to memcg_alloc/free_cache_params, which we already have in
      slab_common.c.

      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: don't call memcg_update_all_caches if new cache id fits · f3bb3043
      Committed by Vladimir Davydov
      memcg_update_all_caches grows arrays of per-memcg caches, so we only need
      to call it when memcg_limited_groups_array_size is increased.  However,
      currently we invoke it each time a new kmem-active memory cgroup is
      created.  Then it just iterates over all slab_caches and does nothing
      (memcg_update_cache_size returns immediately).
      
      This patch fixes this insanity.  In the meantime it moves the code dealing
      with id allocations to separate functions, memcg_alloc_cache_id and
      memcg_free_cache_id.
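
A rough model of "only grow when the new id does not fit". The starting size of 4 and the `2 * (id + 1)` growth rule are assumptions made for illustration, and all `toy_` names are hypothetical.

```c
#include <assert.h>

#define TOY_MIN_ARRAY_SIZE 4

static int toy_array_size = TOY_MIN_ARRAY_SIZE; /* slots in each per-cache array */
static int toy_next_id;                          /* next unused cache id */
static int toy_grow_calls;                       /* times all arrays were regrown */

/* Stand-in for walking every slab cache and reallocating its array. */
static void toy_update_all_caches(int new_size)
{
    assert(new_size > toy_array_size);
    toy_array_size = new_size;
    toy_grow_calls++;
}

/* Hand out an id; touch the arrays only when the id does not fit. */
static int toy_alloc_cache_id(void)
{
    int id = toy_next_id++;

    if (id >= toy_array_size)
        toy_update_all_caches(2 * (id + 1));
    return id;
}
```

The first four ids cause no walk over the caches at all; the fifth triggers exactly one regrow.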

      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: move memcg_{alloc,free}_cache_params to slab_common.c · 33a690c4
      Committed by Vladimir Davydov
      The only reason why they live in memcontrol.c is that we get/put css
      reference to the owner memory cgroup in them.  However, we can do that in
      memcg_{un,}register_cache.  OTOH, there are several reasons to move them
      to slab_common.c.
      
      First, I think that the fewer public interface functions we have in
      memcontrol.h the better.  Since the functions I move don't depend on
      memcontrol, I think it's worth making them private to slab, especially
      taking into account that the arrays are defined on the slab's side too.
      
      Second, the way per-memcg arrays are updated looks rather awkward: it
      proceeds from memcontrol.c (__memcg_activate_kmem) to slab_common.c
      (memcg_update_all_caches) and back to memcontrol.c again
      (memcg_update_array_size).  In the following patches I move the function
      relocating the arrays (memcg_update_array_size) to slab_common.c and
      therefore get rid of this circular call path.  I think we should have the
      cache allocation stuff in the same place where we have relocation, because
      it's easier to follow the code then.  So I move the array alloc/free
      functions to slab_common.c too.
      
      The third point isn't obvious.  I'm going to make the list_lru structure
      per-memcg to allow targeted kmem reclaim.  That means we will have
      per-memcg arrays in list_lrus too.  It turns out that it's much easier to
      update these arrays in list_lru.c rather than in memcontrol.c, because all
      the stuff we need is defined there.  This patch makes the memcg caches arrays
      allocation path conform to that of the upcoming list_lru.
      
      So let's move these functions to slab_common.c and make them static.

      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/debug.c: use pr_emerg() · 7a82ca0d
      Committed by Andrew Morton
      - s/KERN_ALERT/pr_emerg/: we're going BUG so let's maximize the chances
        of getting the message out.
      
      - convert debug.c to pr_foo()
      
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: use VM_BUG_ON_MM where possible · 96dad67f
      Committed by Sasha Levin
      Dump the contents of the relevant mm_struct when we hit the bug condition.

      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: introduce VM_BUG_ON_MM · 31c9afa6
      Committed by Sasha Levin
      Very similar to VM_BUG_ON_PAGE and VM_BUG_ON_VMA, dump the mm_struct when
      the bug is hit.
      
      [akpm@linux-foundation.org: coding-style fixes]
      [mhocko@suse.cz: fix build]
      [mhocko@suse.cz: fix build some more]
      [akpm@linux-foundation.org: do strange things to avoid doing strange things for the comma separators]
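
The dump-then-BUG pattern can be illustrated with a user-space macro. This is only a sketch: `toy_mm` is a heavily trimmed, hypothetical stand-in for the kernel structure, and a counter replaces `BUG()` so the example keeps running.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical, heavily trimmed stand-in for the kernel's mm_struct. */
struct toy_mm {
    unsigned long total_vm;
    int map_count;
};

static int toy_bug_hits; /* replaces BUG() so the sketch can keep running */

/* Print the fields that help diagnose the failed check. */
static void toy_dump_mm(const struct toy_mm *mm)
{
    assert(mm != NULL);
    fprintf(stderr, "mm %p total_vm %lu map_count %d\n",
            (const void *)mm, mm->total_vm, mm->map_count);
}

/* Dump the mm first, then record the bug, if the condition holds. */
#define TOY_BUG_ON_MM(cond, mm)      \
    do {                             \
        if (cond) {                  \
            toy_dump_mm(mm);         \
            toy_bug_hits++;          \
        }                            \
    } while (0)
```

The `do { ... } while (0)` wrapper keeps the macro usable as a single statement, for example in an unbraced `if`.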

      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      Cc: Dave Jones <davej@redhat.com>
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: move debug code out of page_alloc.c · 82742a3a
      Committed by Sasha Levin
      dump_page() and dump_vma() are not specific to page_alloc.c, move them out
      so page_alloc.c won't turn into the unofficial debug repository.

      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: softdirty: unmapped addresses between VMAs are clean · 81d0fa62
      Committed by Peter Feiner
      If a /proc/pid/pagemap read spans a [VMA, an unmapped region, then a
      VM_SOFTDIRTY VMA], the virtual pages in the unmapped region are reported
      as softdirty.  Here's a program to demonstrate the bug:
      
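      #include <assert.h>
      #include <fcntl.h>
      #include <stdint.h>
      #include <sys/mman.h>
      #include <unistd.h>

      int main(void) {
      	const uint64_t PAGEMAP_SOFTDIRTY = 1ULL << 55;
      	uint64_t pme[3];
      	int fd = open("/proc/self/pagemap", O_RDONLY);
      	char *m = mmap(NULL, 3 * getpagesize(), PROT_READ,
      	               MAP_ANONYMOUS | MAP_SHARED, -1, 0);
      	/* Unmap the middle page so the read spans VMA, hole, VMA. */
      	munmap(m + getpagesize(), getpagesize());
      	/* Each pagemap entry is 8 bytes, indexed by virtual page number. */
      	pread(fd, pme, sizeof(pme), (unsigned long) m / getpagesize() * 8);
      	assert(pme[0] & PAGEMAP_SOFTDIRTY);    /* passes */
      	assert(!(pme[1] & PAGEMAP_SOFTDIRTY)); /* fails before this patch */
      	assert(pme[2] & PAGEMAP_SOFTDIRTY);    /* passes */
      	return 0;
      }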
      
      (Note that all pages in new VMAs are softdirty until cleared).
      
      Tested:
      	Used the program given above. I'm going to include this code in
      	a selftest in the future.
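
For reading pagemap entries outside of that test program, a few small helpers may be handy. This is a sketch with hypothetical names; bit 55 is the soft-dirty bit used above, and bit 63 is assumed to be the "page present" bit as described in the kernel's pagemap documentation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_PM_SOFT_DIRTY (UINT64_C(1) << 55) /* soft-dirty bit, as in the test */
#define TOY_PM_PRESENT    (UINT64_C(1) << 63) /* assumed "page present" bit */

static bool toy_pme_soft_dirty(uint64_t pme)
{
    return (pme & TOY_PM_SOFT_DIRTY) != 0;
}

static bool toy_pme_present(uint64_t pme)
{
    return (pme & TOY_PM_PRESENT) != 0;
}

/* Byte offset within /proc/<pid>/pagemap of the entry covering vaddr. */
static uint64_t toy_pagemap_offset(uint64_t vaddr, uint64_t page_size)
{
    assert(page_size != 0);
    return vaddr / page_size * sizeof(uint64_t);
}
```

The offset helper is the same computation as the `pread()` offset in the program above: one 8-byte entry per virtual page.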
      
      [n-horiguchi@ah.jp.nec.com: prevent pagemap_pte_range() from overrunning]

      Signed-off-by: Peter Feiner <pfeiner@google.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Jamie Liu <jamieliu@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: page_alloc: default node-ordering on 64-bit NUMA, zone-ordering on 32-bit · 3193913c
      Committed by Mel Gorman
      Zones are allocated by the page allocator in either node or zone order.
      Node ordering is preferred in terms of locality and is applied
      automatically in one of three cases:
      
        1. If a node has only low memory
      
        2. If DMA/DMA32 is a high percentage of memory
      
        3. If low memory on a single node is greater than 70% of the node size
      
      Otherwise zone ordering is used to preserve low memory for devices that
      require it.  Unfortunately a consequence of this is that applications
      running on a machine with balanced NUMA nodes will experience different
      performance characteristics depending on which node they happen to start
      from.
      
      The point of zone ordering is to protect lower zones for devices that
      require DMA/DMA32 memory.  When NUMA was first introduced, this was
      critical as 32-bit NUMA machines existed and exhausting low memory
      triggered OOMs easily as so many allocations required low memory.  On
      64-bit machines the primary concern is devices that are 32-bit only which
      is less severe than the low memory exhaustion problem on 32-bit NUMA.  It
      seems there are really few devices that depend on it.
      
      AGP -- I assume this is getting more rare but even then I think the allocations
      	happen early in boot time where lowmem pressure is less of a problem
      
      DRM -- If the device is 32-bit only then there may be lowmem pressure. I didn't
      	evaluate these in detail but it looks like some of these are mobile
      	graphics cards. Not many NUMA laptops out there. DRM folk should know
      	better though.
      
      Some TV cards -- Much demand for 32-bit capable TV cards on NUMA machines?
      
      B43 wireless card -- again not really a NUMA thing.
      
      I cannot find a good reason to incur a performance penalty on all 64-bit NUMA
      machines in case someone throws a brain damaged TV or graphics card in there.
      This patch defaults to node-ordering on 64-bit NUMA machines. I was tempted
      to make it default everywhere but I understand that some embedded arches may
      be using 32-bit NUMA where I cannot predict the consequences.
      
      The performance impact depends on the workload and the characteristics of the
      machine and the machine I tested on had a large Normal zone on node 0 so the
      impact is within the noise for the majority of tests. The allocation stats
      show more allocation requests were from DMA32 and local node. Running SpecJBB
      with multiple JVMs and automatic NUMA balancing disabled the results were
      
      specjbb
                           3.17.0-rc2            3.17.0-rc2
                              vanilla        nodeorder-v1r1
      Min    1      29534.00 (  0.00%)     30020.00 (  1.65%)
      Min    10    115717.00 (  0.00%)    134038.00 ( 15.83%)
      Min    19    109718.00 (  0.00%)    114186.00 (  4.07%)
      Min    28    104459.00 (  0.00%)    103639.00 ( -0.78%)
      Min    37     98245.00 (  0.00%)    103756.00 (  5.61%)
      Min    46     97198.00 (  0.00%)     96197.00 ( -1.03%)
      Mean   1      30953.25 (  0.00%)     31917.75 (  3.12%)
      Mean   10    124432.50 (  0.00%)    140904.00 ( 13.24%)
      Mean   19    116033.50 (  0.00%)    119294.75 (  2.81%)
      Mean   28    108365.25 (  0.00%)    106879.50 ( -1.37%)
      Mean   37    102984.75 (  0.00%)    106924.25 (  3.83%)
      Mean   46    100783.25 (  0.00%)    105368.50 (  4.55%)
      Stddev 1       1260.38 (  0.00%)      1109.66 ( 11.96%)
      Stddev 10      7434.03 (  0.00%)      5171.91 ( 30.43%)
      Stddev 19      8453.84 (  0.00%)      5309.59 ( 37.19%)
      Stddev 28      4184.55 (  0.00%)      2906.63 ( 30.54%)
      Stddev 37      5409.49 (  0.00%)      3192.12 ( 40.99%)
      Stddev 46      4521.95 (  0.00%)      7392.52 (-63.48%)
      Max    1      32738.00 (  0.00%)     32719.00 ( -0.06%)
      Max    10    136039.00 (  0.00%)    148614.00 (  9.24%)
      Max    19    130566.00 (  0.00%)    127418.00 ( -2.41%)
      Max    28    115404.00 (  0.00%)    111254.00 ( -3.60%)
      Max    37    112118.00 (  0.00%)    111732.00 ( -0.34%)
      Max    46    108541.00 (  0.00%)    116849.00 (  7.65%)
      TPut   1     123813.00 (  0.00%)    127671.00 (  3.12%)
      TPut   10    497730.00 (  0.00%)    563616.00 ( 13.24%)
      TPut   19    464134.00 (  0.00%)    477179.00 (  2.81%)
      TPut   28    433461.00 (  0.00%)    427518.00 ( -1.37%)
      TPut   37    411939.00 (  0.00%)    427697.00 (  3.83%)
      TPut   46    403133.00 (  0.00%)    421474.00 (  4.55%)
      
                                  3.17.0-rc2      3.17.0-rc2
                                     vanilla  nodeorder-v1r1
      DMA allocs                           0           0
      DMA32 allocs                        57     1491992
      Normal allocs                 32543566    30026383
      Movable allocs                       0           0
      Direct pages scanned                 0           0
      Kswapd pages scanned                 0           0
      Kswapd pages reclaimed               0           0
      Direct pages reclaimed               0           0
      Kswapd efficiency                 100%        100%
      Kswapd velocity                  0.000       0.000
      Direct efficiency                 100%        100%
      Direct velocity                  0.000       0.000
      Percentage direct scans             0%          0%
      Zone normal velocity             0.000       0.000
      Zone dma32 velocity              0.000       0.000
      Zone dma velocity                0.000       0.000
      THP fault alloc                  55164       52987
      THP collapse alloc                 139         147
      THP splits                          26          21
      NUMA alloc hit                 4169066     4250692
      NUMA alloc miss                      0           0
      
      Note that there were more DMA32 allocations with the patch applied.  In this
      particular case there was no difference in numa_hit and numa_miss. The
      expectation is that DMA32 was being used at the low watermark instead of
      falling into the slow path. kswapd was not woken but it's not woken for
      THP allocations.
      
      On 32-bit, this patch defaults to zone-ordering as low memory depletion
      can be a serious problem on 32-bit large memory machines. If the default
      ordering were node, processes on node 0 would deplete the Normal zone
      due to normal activity.  The problem is worse if CONFIG_HIGHPTE is not
      set. If combined with large amounts of dirty/writeback pages in Normal
      zone then there is also a high risk of OOM. The heuristics are removed
      as it's not clear they were ever important on 32-bit. They were only
      relevant for setting node-ordering on 64-bit.
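
The difference between the two orderings is easiest to see on a tiny machine. The sketch below is a toy model, not the kernel's zonelist code: it assumes node 0 is the local node, that nodes are already sorted by distance, and it only knows about DMA32 and Normal zones. All names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum toy_zone_type { TOY_ZONE_DMA32, TOY_ZONE_NORMAL, TOY_NR_ZONE_TYPES };
enum toy_order { TOY_ORDER_NODE, TOY_ORDER_ZONE };

struct toy_zoneref {
    int node;
    enum toy_zone_type type;
};

/* Default chosen by word size, as the message describes. */
static enum toy_order toy_default_order(int bits_per_long)
{
    return bits_per_long == 64 ? TOY_ORDER_NODE : TOY_ORDER_ZONE;
}

/* Fill `out` with the fallback order; present[node][type] marks existing zones. */
static size_t toy_build_zonelist(enum toy_order order, int nr_nodes,
                                 const int present[][TOY_NR_ZONE_TYPES],
                                 struct toy_zoneref *out)
{
    size_t n = 0;

    assert(nr_nodes > 0 && out != NULL);

    if (order == TOY_ORDER_NODE) {
        /* Exhaust every zone of the nearest node before going remote. */
        for (int node = 0; node < nr_nodes; node++)
            for (int t = TOY_NR_ZONE_TYPES - 1; t >= 0; t--)
                if (present[node][t])
                    out[n++] = (struct toy_zoneref){ node, (enum toy_zone_type)t };
    } else {
        /* Exhaust the highest zone on all nodes before touching lower zones. */
        for (int t = TOY_NR_ZONE_TYPES - 1; t >= 0; t--)
            for (int node = 0; node < nr_nodes; node++)
                if (present[node][t])
                    out[n++] = (struct toy_zoneref){ node, (enum toy_zone_type)t };
    }
    return n;
}
```

With DMA32 and Normal on node 0 and only Normal on node 1, node ordering yields N0-Normal, N0-DMA32, N1-Normal, while zone ordering yields N0-Normal, N1-Normal, N0-DMA32. Under zone ordering a task on node 0 goes remote before it uses local DMA32 memory.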

      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: page_alloc: Make paranoid check in move_freepages a VM_BUG_ON · 97ee4ba7
      Committed by Mel Gorman
      Since 2.6.24 there has been a paranoid check in move_freepages that looks
      up the zone of two pages.  This is a very slow path and the only time I've
      seen this bug trigger recently is when memory initialisation was broken
      during patch development.  Despite the fact it's a slow path, this patch
      converts the check to a VM_BUG_ON anyway as it has served its purpose by
      now.

      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: David Rientjes <rientjes@google.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: fix a deadlock while o2net_wq doing direct memory reclaim · b246d3d1
      Committed by Xue jiufei
      Fix a deadlock problem caused by direct memory reclaim in o2net_wq.  The
      situation is as follows:
      
      1) Receive a connect message from another node, node queues a
         work_struct o2net_listen_work.
      
      2) o2net_wq processes this work and calls the following functions:
      
      o2net_wq
      -> o2net_accept_one
        -> sock_create_lite
          -> sock_alloc()
            -> kmem_cache_alloc with GFP_KERNEL
              -> ____cache_alloc_node
                -> __alloc_pages_nodemask
                  -> do_try_to_free_pages
                    -> shrink_slab
                      -> evict
                        -> ocfs2_evict_inode
                          -> ocfs2_drop_lock
                            -> dlmunlock
                              -> o2net_send_message_vec
      
         then o2net_wq waits for the unlock reply from the master.
      
      3) The TCP layer receives the reply, calls o2net_data_ready() and queues
         sc_rx_work, which waits for o2net_wq to process it.
      
      4) o2net_wq is a single-threaded workqueue that processes work items one
         by one.  It is still busy with o2net_listen_work and cannot handle
         sc_rx_work, so we deadlock.
      
      Junxiao Bi's patch "mm: clear __GFP_FS when PF_MEMALLOC_NOIO is set"
      (http://ozlabs.org/~akpm/mmots/broken-out/mm-clear-__gfp_fs-when-pf_memalloc_noio-is-set.patch)
      clears __GFP_FS in memalloc_noio_flags() besides __GFP_IO.  We use
      memalloc_noio_save() to set process flag PF_MEMALLOC_NOIO so that all
      allocations done by this process are done as if GFP_NOIO was specified.
      We are not reentering filesystem while doing memory reclaim.

      Signed-off-by: joyce.xue <xuejiufei@huawei.com>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: clear __GFP_FS when PF_MEMALLOC_NOIO is set · 934f3072
      Committed by Junxiao Bi
      Commit 21caf2fc ("mm: teach mm by current context info to not do I/O
      during memory allocation") introduced the PF_MEMALLOC_NOIO flag to avoid
      doing I/O inside memory allocation.  __GFP_IO is cleared when this flag is
      set, but __GFP_FS implies __GFP_IO, so it should be cleared as well.
      Otherwise the allocation may still run into I/O, for example in the
      superblock shrinker, and the kernel can hit the deadlock described in
      that commit.
      
      See Dave Chinner's comment about io in superblock shrinker:
      
      Filesystem shrinkers do indeed perform IO from the superblock shrinker and
      have for years.  Even clean inodes can require IO before they can be freed
      - e.g.  on an orphan list, need truncation of post-eof blocks, need to
      wait for ordered operations to complete before it can be freed, etc.
      
      IOWs, Ext4, btrfs and XFS all can issue and/or block on arbitrary amounts
      of IO in the superblock shrinker context.  XFS, in particular, has been
      doing transactions and IO from the VFS inode cache shrinker since it was
      first introduced....
      
      Fix this by clearing __GFP_FS in memalloc_noio_flags().  This function
      masks every gfp_mask that is passed into the filesystem for processes
      that set PF_MEMALLOC_NOIO in the direct reclaim path.
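
The masking itself amounts to a single bit operation. The sketch below uses illustrative toy constants and a hypothetical function name rather than the kernel's headers.

```c
#include <assert.h>

/* Illustrative bit values; the real ones live in the kernel headers. */
#define TOY_GFP_IO           0x40u
#define TOY_GFP_FS           0x80u
#define TOY_PF_MEMALLOC_NOIO 0x00080000u

/* Strip both I/O and filesystem permission when the task forbids I/O. */
static unsigned int toy_memalloc_noio_flags(unsigned int task_flags,
                                            unsigned int gfp_mask)
{
    if (task_flags & TOY_PF_MEMALLOC_NOIO)
        gfp_mask &= ~(TOY_GFP_IO | TOY_GFP_FS);
    return gfp_mask;
}
```

Before the fix, only the I/O bit would have been cleared in the model above, leaving reclaim free to enter filesystem shrinkers that may themselves issue I/O.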
      
      v1 thread at: https://lkml.org/lkml/2014/9/3/32

      Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: joyce.xue <xuejiufei@huawei.com>
      Cc: Ming Lei <ming.lei@canonical.com>
      Cc: Trond Myklebust <trond.myklebust@primarydata.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/compaction.c: fix warning of 'flags' may be used uninitialized · b8b2d825
      Committed by Xiubo Li
      CC      mm/compaction.o
      mm/compaction.c: In function isolate_freepages_block:
      mm/compaction.c:364:37: warning: flags may be used uninitialized in this function [-Wmaybe-uninitialized]
             && compact_unlock_should_abort(&cc->zone->lock, flags,
                                           ^

      Signed-off-by: Xiubo Li <Li.Xiubo@freescale.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/mmap.c: clean up CONFIG_DEBUG_VM_RB checks · ff26f70f
      Committed by Andrew Morton
      - be consistent in printing the test which failed
      
      - one message was actually wrong (a<b != b>a)
      
      - don't print second bogus warning if browse_rb() failed
      
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ff26f70f
    • J
      mm: clean up zone flags · 57054651
      Johannes Weiner committed
      Page reclaim tests zone_is_reclaim_dirty(), but the site that actually
      sets this state does zone_set_flag(zone, ZONE_TAIL_LRU_DIRTY), sending the
      reader through layers of indirection just to track down a simple bit.
      
      Remove all zone flag wrappers and just use bitops against zone->flags
      directly.  It's just as readable and the lines are barely any longer.
      
      Also rename ZONE_TAIL_LRU_DIRTY to ZONE_DIRTY to match ZONE_WRITEBACK, and
      remove the zone_flags_t typedef.
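
As a rough illustration of "bitops against zone->flags directly", the following userspace sketch uses non-atomic stand-ins for the kernel's bit operations. The `sketch_` names and the bit numbers are assumptions; only ZONE_DIRTY and ZONE_WRITEBACK are names taken from the text.

```c
#include <stdbool.h>

/* Bit numbers are illustrative; the names mirror ZONE_DIRTY / ZONE_WRITEBACK. */
enum sketch_zone_flag {
	SKETCH_ZONE_DIRTY = 0,
	SKETCH_ZONE_WRITEBACK = 1,
};

struct sketch_zone {
	unsigned long flags;
};

/* Non-atomic stand-ins for the kernel's set_bit/clear_bit/test_bit. */
static void sketch_set_bit(int nr, unsigned long *addr)
{
	*addr |= 1UL << nr;
}

static void sketch_clear_bit(int nr, unsigned long *addr)
{
	*addr &= ~(1UL << nr);
}

static bool sketch_test_bit(int nr, const unsigned long *addr)
{
	return (*addr >> nr) & 1UL;
}
```

With this shape, the site that sets the state and the site that tests it both name the same flag, e.g. `sketch_set_bit(SKETCH_ZONE_DIRTY, &zone.flags)` and `sketch_test_bit(SKETCH_ZONE_DIRTY, &zone.flags)`, with no wrapper in between.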
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: David Rientjes <rientjes@google.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      57054651
    • M
      mm/page-writeback.c: use min3/max3 macros to avoid shadow warnings · 7c809968
      Mark Rustad committed
      Nested calls to min/max functions result in shadow warnings in W=2 builds.
      Avoid the warning by using the min3 and max3 macros to get the min/max of
      3 values instead of nested calls.
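
The following sketch shows why nesting can shadow and how a flat three-argument form avoids it. It assumes GNU C extensions (statement expressions and `__typeof__`); the `sketch_` macros are illustrative and are not the kernel's definitions of min(), min3() or max3().

```c
/*
 * Two-argument minimum with private temporaries.  Writing
 * sketch_min(a, sketch_min(b, c)) expands the inner macro inside the outer
 * one, so the inner min_a/min_b shadow the outer min_a/min_b.
 */
#define sketch_min(x, y) ({                                   \
	__typeof__(x) min_a = (x);                            \
	__typeof__(y) min_b = (y);                            \
	min_a < min_b ? min_a : min_b; })

/*
 * Three-argument forms written flat: every temporary has a distinct name
 * and no macro expansion is nested inside another.
 */
#define sketch_min3(x, y, z) ({                               \
	__typeof__(x) min3_a = (x);                           \
	__typeof__(y) min3_b = (y);                           \
	__typeof__(z) min3_c = (z);                           \
	__typeof__(x) min3_ab = min3_a < min3_b ? min3_a : min3_b; \
	min3_ab < min3_c ? min3_ab : min3_c; })

#define sketch_max3(x, y, z) ({                               \
	__typeof__(x) max3_a = (x);                           \
	__typeof__(y) max3_b = (y);                           \
	__typeof__(z) max3_c = (z);                           \
	__typeof__(x) max3_ab = max3_a > max3_b ? max3_a : max3_b; \
	max3_ab > max3_c ? max3_ab : max3_c; })
```

The nested form still computes the right value; the difference is only that it trips shadow warnings, which the flat form does not.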
      Signed-off-by: Mark Rustad <mark.d.rustad@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7c809968
    • W
      mm: page_alloc: avoid wakeup kswapd on the unintended node · 7ade3c99
      Weijie Yang committed
      When entering the page_alloc slowpath, we wake up kswapd on every pgdat
      according to the zonelist and high_zoneidx.  However, this doesn't take
      nodemask into account, and could prematurely wake up kswapd on some
      unintended nodes.
      
      This patch uses for_each_zone_zonelist_nodemask() instead of
      for_each_zone_zonelist() in wake_all_kswapds() to avoid the above
      situation.
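
A simplified model of the difference between the two iterators is sketched below. The zonelist entry type, the bitmask representation of the nodemask, and all `sketch_` names are assumptions made for illustration; the kernel's data structures are different.

```c
#include <stddef.h>

/* Illustrative zonelist entry: which node the zone is on and its zone index. */
struct sketch_zoneref {
	int node;
	int zone_idx;
};

/*
 * Walk the zonelist and collect the nodes whose kswapd would be woken.
 * Zones above high_zoneidx are skipped.  If nodemask is non-NULL, nodes
 * outside the mask are skipped too, which models the nodemask-aware
 * iterator; a NULL nodemask models the old behaviour.
 * Returns a bitmask of woken nodes.
 */
static unsigned long sketch_wake_all_kswapds(const struct sketch_zoneref *zl,
					     size_t nr, int high_zoneidx,
					     const unsigned long *nodemask)
{
	unsigned long woken = 0;
	size_t i;

	for (i = 0; i < nr; i++) {
		if (zl[i].zone_idx > high_zoneidx)
			continue;
		if (nodemask && !(*nodemask & (1UL << zl[i].node)))
			continue;
		woken |= 1UL << zl[i].node;
	}
	return woken;
}
```

For a zonelist spanning nodes 0 and 1 and an allocation restricted to node 1, the unfiltered walk wakes both nodes while the filtered walk wakes only node 1.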
      Signed-off-by: Weijie Yang <weijie.yang@samsung.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7ade3c99
    • S
      mm: convert a few VM_BUG_ON callers to VM_BUG_ON_VMA · 81d1b09c
      Sasha Levin committed
      Trivially convert a few VM_BUG_ON calls to VM_BUG_ON_VMA to extract
      more information when they trigger.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      81d1b09c
    • S
      mm: introduce VM_BUG_ON_VMA · fa3759cc
      Sasha Levin committed
      Very similar to VM_BUG_ON_PAGE but dumps VMA information instead.
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fa3759cc
    • S
      mm: introduce dump_vma · 0bf55139
      Sasha Levin committed
      Introduce a helper to dump information about a VMA.  This also makes
      dump_page_flags more generic and reuses it, so the output looks very
      similar to dump_page:
      
      [   61.903437] vma ffff88070f88be00 start 00007fff25970000 end 00007fff25992000
      [   61.903437] next ffff88070facd600 prev ffff88070face400 mm ffff88070fade000
      [   61.903437] prot 8000000000000025 anon_vma ffff88070fa1e200 vm_ops           (null)
      [   61.903437] pgoff 7ffffffdd file           (null) private_data           (null)
      [   61.909129] flags: 0x100173(read|write|mayread|maywrite|mayexec|growsdown|account)
      
      [akpm@linux-foundation.org: make dump_vma() require CONFIG_DEBUG_VM]
      [swarren@nvidia.com: fix dump_vma() compilation]
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Stephen Warren <swarren@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0bf55139
    • R
      mm/slab.c: use __seq_open_private() instead of seq_open() · b208ce32
      Rob Jones committed
      Using __seq_open_private() removes boilerplate code from slabstats_open().
      
      The resultant code is shorter and easier to follow.
      
      This patch does not change any functionality.
      Signed-off-by: Rob Jones <rob.jones@codethink.co.uk>
      Acked-by: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b208ce32
    • R
      mm/vmalloc.c: use seq_open_private() instead of seq_open() · 703394c1
      Rob Jones committed
      Using seq_open_private() removes boilerplate code from vmalloc_open().
      
      The resultant code is shorter and easier to follow.
      
      However, please note that seq_open_private() calls kzalloc() rather than
      kmalloc() which may affect timing due to the memory initialisation
      overhead.
      Signed-off-by: Rob Jones <rob.jones@codethink.co.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      703394c1
    • A
      include/linux/migrate.h: remove migrate_page #define · 1c93923c
      Andrew Morton committed
      This is designed to avoid a few ifdefs in .c files but it's obnoxious
      because it can cause unsuspecting "migrate_page" symbols to get turned into
      "NULL".
      
      Just nuke it and use the ifdefs.
      
      Cc: Konstantin Khlebnikov <k.khlebnikov@samsung.com>
      Cc: Rafael Aquini <aquini@redhat.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1c93923c
    • O
      mempolicy: unexport get_vma_policy() and remove its "task" arg · dd6eecb9
      Oleg Nesterov committed
      - get_vma_policy(task) is not safe if task != current, remove this
        argument.
      
      - get_vma_policy() no longer has callers outside of mempolicy.c,
        make it static.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dd6eecb9
    • O
      mempolicy: kill do_set_mempolicy()->down_write(&mm->mmap_sem) · 2c7c3a7d
      Oleg Nesterov committed
      Remove down_write(&mm->mmap_sem) in do_set_mempolicy(). This logic
      was never correct and it is no longer needed, see the previous patch.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2c7c3a7d
    • O
      mempolicy: fix show_numa_map() vs exec() + do_set_mempolicy() race · 498f2371
      Oleg Nesterov committed
      9e781440 "hold task->mempolicy while numa_maps scans." fixed the
      race with the exiting task but this is not enough.
      
      The current code assumes that get_vma_policy(task) should either see
      task->mempolicy == NULL or it should be equal to ->task_mempolicy saved
      by hold_task_mempolicy(), so we can never race with __mpol_put(). But
      this can only work if we can't race with do_set_mempolicy(), and thus
      we can't race with another do_set_mempolicy() or do_exit() after that.
      
      However, do_set_mempolicy()->down_write(mmap_sem) can not prevent this
      race. This task can exec, change its ->mm, and call do_set_mempolicy()
      after that; in this case they take 2 different locks.
      
      Change hold_task_mempolicy() to use get_task_policy(), it never returns
      NULL, and change show_numa_map() to use __get_vma_policy() or fall back
      to proc_priv->task_mempolicy.
      
      Note: this is the minimal fix, we will clean up this code later. I think
      hold_task_mempolicy() and release_task_mempolicy() should die, we can
      move this logic into show_numa_map(). Or we can move get_task_policy()
      outside of ->mmap_sem and !CONFIG_NUMA code at least.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      498f2371
    • O
      mempolicy: introduce __get_vma_policy(), export get_task_policy() · 74d2c3a0
      Oleg Nesterov committed
      Extract the code which looks for vma's policy from get_vma_policy()
      into the new helper, __get_vma_policy(). Export get_task_policy().
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      74d2c3a0
    • O
      mempolicy: remove the "task" arg of vma_policy_mof() and simplify it · 6b6482bb
      Oleg Nesterov committed
      1. vma_policy_mof(task) is simply not safe unless task == current,
         it can race with do_exit()->mpol_put(). Remove this arg and update
         its single caller.
      
      2. vma can not be NULL, remove this check and simplify the code.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6b6482bb
    • O
      mempolicy: sanitize the usage of get_task_policy() · 8d90274b
      Oleg Nesterov committed
      Cleanup + preparation. Every user of get_task_policy() calls it
      unconditionally, even if it is not going to use the result.
      
      get_task_policy() is cheap but still this does not look clean, plus
      the code looks simpler if get_task_policy() is called only when this
      is really needed.
      
      Note: I hope this is correct, but it is not clear why vma_policy_mof()
      doesn't fall back to get_task_policy() if ->get_policy() returns NULL.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8d90274b
    • O
      mempolicy: change get_task_policy() to return default_policy rather than NULL · f15ca78e
      Oleg Nesterov committed
      Every caller of get_task_policy() falls back to default_policy if it
      returns NULL. Change get_task_policy() to do this.
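
The change can be pictured with a simplified userspace model: the NULL fallback moves from every caller into the accessor itself. The struct layouts and `sketch_` names below are assumptions for illustration, and the real function may consider more than the task's own policy.

```c
#include <stddef.h>

/* Illustrative stand-ins for the kernel's mempolicy and task structures. */
struct sketch_mempolicy {
	int mode;
};

struct sketch_task {
	struct sketch_mempolicy *mempolicy;
};

/* System-wide fallback policy, used when the task has none of its own. */
static struct sketch_mempolicy sketch_default_policy = { 0 };

/*
 * After the change: never returns NULL, so callers no longer need their
 * own "if (!pol) pol = &default_policy;" fallback.
 */
static struct sketch_mempolicy *sketch_get_task_policy(struct sketch_task *p)
{
	return p->mempolicy ? p->mempolicy : &sketch_default_policy;
}
```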
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f15ca78e
    • O
      mempolicy: change alloc_pages_vma() to use mpol_cond_put() · 2386740d
      Oleg Nesterov committed
      Trivial cleanup. alloc_pages_vma() can use mpol_cond_put().
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2386740d
    • J
      mm: remove noisy remainder of the scan_unevictable interface · 1f13ae39
      Johannes Weiner committed
      The deprecation warnings for the scan_unevictable interface are triggered
      by scripts doing `sysctl -a | grep something else'.  This is annoying and
      not helpful.
      
      The interface has been defunct since 264e56d8 ("mm: disable user
      interface to manually rescue unevictable pages"), which was in 2011, and
      there haven't been any reports of use cases for it, only reports that the
      deprecation warnings are annoying.  It's unlikely that anybody is using
      this interface specifically at this point, so remove it.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1f13ae39
    • C
      prctl: PR_SET_MM -- introduce PR_SET_MM_MAP operation · f606b77f
      Cyrill Gorcunov committed
      During development of checkpoint/restore (c/r) we noticed that supporting
      user namespaces runs into a problem with capabilities in the
      prctl(PR_SET_MM, ...) call: once a new user namespace is created,
      capable(CAP_SYS_RESOURCE) no longer passes.
      
      One approach is to eliminate the CAP_SYS_RESOURCE check but pass all new
      values in one bundle, which allows the kernel to run more thorough sanity
      checks on the values and at the same time lets us support
      checkpoint/restore of user namespaces.
      
      Thus a new command, PR_SET_MM_MAP, is introduced.  It takes a pointer to a
      prctl_mm_map structure which carries all the members to be updated.
      
      	prctl(PR_SET_MM, PR_SET_MM_MAP, struct prctl_mm_map *, size)
      
      	struct prctl_mm_map {
      		__u64	start_code;
      		__u64	end_code;
      		__u64	start_data;
      		__u64	end_data;
      		__u64	start_brk;
      		__u64	brk;
      		__u64	start_stack;
      		__u64	arg_start;
      		__u64	arg_end;
      		__u64	env_start;
      		__u64	env_end;
      		__u64	*auxv;
      		__u32	auxv_size;
      		__u32	exe_fd;
      	};
      
      All members except @exe_fd correspond to members of struct mm_struct.  To
      show which values these members may take, here are their meanings.
      
       - start_code, end_code: represent bounds of executable code area
       - start_data, end_data: represent bounds of data area
       - start_brk, brk: used to calculate bounds for brk() syscall
       - start_stack: used when accounting space needed for command
         line arguments, environment and shmat() syscall
       - arg_start, arg_end, env_start, env_end: represent memory area
         supplied for command line arguments and environment variables
       - auxv, auxv_size: carry the auxiliary vector (ELF format specifics)
       - exe_fd: file descriptor number for executable link (/proc/self/exe)
      
      Thus we apply the following requirements to the values:
      
      1) Any member except @auxv, @auxv_size, @exe_fd is an address in user
         space, thus it must lie inside the [mmap_min_addr, mmap_max_addr)
         interval.
      
      2) While @[start|end]_code and @[start|end]_data may point to nonexistent
         VMAs (say a program maps its own new .text and .data segments during
         execution), the rest of the members must belong to a VMA that exists.
      
      3) Addresses must be ordered, i.e. a @start_ member must not be greater
         than or equal to the corresponding @end_ member.
      
      4) As in the regular ELF loading procedure, we require that @start_brk and
         @brk be greater than @end_data.
      
      5) If the RLIMIT_DATA rlimit is set to non-infinity, new values should not
         exceed the existing limit.  The same applies to RLIMIT_STACK.
      
      6) The auxiliary vector size must not exceed the existing one (which is
         predefined as AT_VECTOR_SIZE and depends on the architecture).
      
      7) The file descriptor passed in @exe_fd should point to an executable
         file (because we use the existing prctl_set_mm_exe_file_locked helper,
         it ensures that the file we are going to use as the exe link has all
         required permissions granted).
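
Requirements 1, 3 and 4 are pure address arithmetic and can be sketched in userspace C as below. This follows the wording of the list above rather than the kernel source; the struct, the function names, and passing the address bounds as parameters are assumptions, and the checks that need VMA lookups, rlimits, the auxiliary vector or the file descriptor (2, 5, 6, 7) are left out.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Address members only; mirrors the field names of struct prctl_mm_map. */
struct sketch_mm_map {
	uint64_t start_code, end_code;
	uint64_t start_data, end_data;
	uint64_t start_brk, brk;
	uint64_t start_stack;
	uint64_t arg_start, arg_end;
	uint64_t env_start, env_end;
};

static bool sketch_validate_mm_map(const struct sketch_mm_map *m,
				   uint64_t min_addr, uint64_t max_addr)
{
	const uint64_t addrs[] = {
		m->start_code, m->end_code, m->start_data, m->end_data,
		m->start_brk, m->brk, m->start_stack,
		m->arg_start, m->arg_end, m->env_start, m->env_end,
	};
	size_t i;

	/* Requirement 1: every address lies in [min_addr, max_addr). */
	for (i = 0; i < sizeof(addrs) / sizeof(addrs[0]); i++)
		if (addrs[i] < min_addr || addrs[i] >= max_addr)
			return false;

	/* Requirement 3: a start must not be greater than or equal to its end. */
	if (m->start_code >= m->end_code || m->start_data >= m->end_data ||
	    m->arg_start >= m->arg_end || m->env_start >= m->env_end)
		return false;

	/* Requirement 4: start_brk and brk must be greater than end_data. */
	if (m->start_brk <= m->end_data || m->brk <= m->end_data)
		return false;

	return true;
}
```

Validating the whole bundle at once is what lets the ordering and heap checks compare members against each other, which a one-field-at-a-time interface cannot do.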
      
      Now about where these members are involved inside kernel code:
      
       - @start_code and @end_code are used in /proc/$pid/[stat|statm] output;
      
       - @start_data and @end_data are used in /proc/$pid/[stat|statm] output;
         they are also considered when checking whether there is enough space
         for the brk() syscall result if RLIMIT_DATA is set;
      
       - @start_brk is shown in /proc/$pid/stat output and accounted for in the
         brk() syscall if RLIMIT_DATA is set; this member is also tested to
         find a symbolic name of an mmap event for the perf system (we choose
         whether the event is generated for the "heap" area); one more
         application is selinux -- we test whether a process has the
         PROCESS__EXECHEAP permission when it tries to make the heap area
         executable with the mprotect() syscall;
      
       - @brk is the current value for the brk() syscall, which lies inside the
         heap area; it is shown in /proc/$pid/stat.  When the brk() syscall
         successfully provides a new memory area to user space, mm::brk is
         updated upon brk() completion to carry the new value;
      
         Both @start_brk and @brk are actively used in /proc/$pid/maps
         and /proc/$pid/smaps output to find a symbolic name "heap" for
         VMA being scanned;
      
       - @start_stack is printed out in /proc/$pid/stat and used to
         find a symbolic name "stack" for the task and its threads in
         /proc/$pid/maps and /proc/$pid/smaps output; as with
         @start_brk, the perf system uses it for event naming.
         The kernel also treats this member as the start address of where
         to map vDSO pages and uses it to check whether there is enough
         space for the shmat() syscall;
      
       - @arg_start, @arg_end, @env_start and @env_end are printed out
         in /proc/$pid/stat.  The data these members represent can also
         be accessed by reading /proc/$pid/environ or /proc/$pid/cmdline.
         The kernel checks any attempt to read these areas with the
         access_process_vm helper, so a user must have sufficient rights
         for this action;
      
       - @auxv and @auxv_size may be read from /proc/$pid/auxv.  Strictly
         speaking, the kernel doesn't care much about exactly which data is
         sitting there because it is solely for userspace;
      
       - @exe_fd is referred to from /proc/$pid/exe and when generating
         a coredump.  We use the prctl_set_mm_exe_file_locked helper to update
         this member, so exe-file link modification remains a one-shot
         action.
      
      Still, note that updating the exe-file link no longer requires the
      sys-resource capability; after all, there is not much benefit in
      preventing a process from setting up its own file link (there are a number
      of ways to execute one's own code -- ptrace, ld-preload -- so the only
      reliable way to find out exactly which code is executed is to inspect the
      running program's memory).  Still, we require the caller to be at least the
      user-namespace root user.
      
      I believe the old interface should be deprecated and ripped out in a
      couple of kernel releases if no one objects.
      
      To test whether the new interface is implemented in the kernel, one can
      pass the PR_SET_MM_MAP_SIZE opcode; the kernel then returns the size of the
      currently supported struct prctl_mm_map.
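
A userspace probe along these lines might look like the sketch below (Linux only). It assumes the size is written through the pointer passed as the third prctl() argument; the fallback constants are meant to match linux/prctl.h for headers that predate the interface, and the helper name is hypothetical.

```c
#include <sys/prctl.h>

/* Fallbacks for older headers; values assumed to match linux/prctl.h. */
#ifndef PR_SET_MM
#define PR_SET_MM 35
#endif
#ifndef PR_SET_MM_MAP_SIZE
#define PR_SET_MM_MAP_SIZE 15
#endif

/*
 * Ask the kernel how large its struct prctl_mm_map is.
 * Returns the size in bytes, or -1 if the kernel rejects the opcode
 * (for example because the interface is not built in).
 */
static long sketch_probe_mm_map_size(void)
{
	unsigned int size = 0;

	if (prctl(PR_SET_MM, PR_SET_MM_MAP_SIZE, (unsigned long)&size, 0UL, 0UL) != 0)
		return -1;
	return (long)size;
}
```

A caller can compare the returned size against its own `sizeof(struct prctl_mm_map)` before attempting PR_SET_MM_MAP, and fall back to the older per-field opcodes when the probe returns -1.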
      
      [akpm@linux-foundation.org: fix 80-col wordwrap in macro definitions]
      Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Tejun Heo <tj@kernel.org>
      Acked-by: Andrew Vagin <avagin@openvz.org>
      Tested-by: Andrew Vagin <avagin@openvz.org>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Vasiliy Kulikov <segoon@openwall.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Julien Tinnes <jln@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f606b77f
    • C
      prctl: PR_SET_MM -- factor out mmap_sem when updating mm::exe_file · 71fe97e1
      Cyrill Gorcunov committed
      Instead of taking mm->mmap_sem inside prctl_set_mm_exe_file(), move it out
      and rename the helper to prctl_set_mm_exe_file_locked().  This allows the
      function to be reused in the next patch.
      Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Andrew Vagin <avagin@openvz.org>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Vasiliy Kulikov <segoon@openwall.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Julien Tinnes <jln@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      71fe97e1
    • C
      mm: use may_adjust_brk helper · 8764b338
      Cyrill Gorcunov committed
      Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Andrew Vagin <avagin@openvz.org>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Vasiliy Kulikov <segoon@openwall.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Julien Tinnes <jln@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8764b338
    • C
      mm: introduce check_data_rlimit helper · 9c599024
      Cyrill Gorcunov committed
      To eliminate code duplication, let's introduce a check_data_rlimit helper,
      which we will use in the brk() and prctl() syscalls.
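
The commit message does not show the helper, so the following is only a guess at its general shape: a check that the heap size plus the data segment size stays within RLIMIT_DATA. The parameter list, the `sketch_` names, the choice of -ENOSPC as the error value, and the infinity sentinel are all assumptions.

```c
#include <errno.h>
#include <limits.h>

/* Stand-in for RLIM_INFINITY in this sketch. */
#define SKETCH_RLIM_INFINITY ULONG_MAX

/*
 * Return 0 if a heap ending at new_brk (starting at start_brk), together
 * with the data segment [start_data, end_data), fits within rlim;
 * otherwise return a negative errno.  An infinite limit always passes.
 */
static int sketch_check_data_rlimit(unsigned long rlim,
				    unsigned long new_brk,
				    unsigned long start_brk,
				    unsigned long end_data,
				    unsigned long start_data)
{
	if (rlim < SKETCH_RLIM_INFINITY) {
		if ((new_brk - start_brk) + (end_data - start_data) > rlim)
			return -ENOSPC;
	}
	return 0;
}
```

Sharing one function like this means brk() and prctl() cannot drift apart in how they interpret the limit.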
      Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Andrew Vagin <avagin@openvz.org>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Vasiliy Kulikov <segoon@openwall.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Julien Tinnes <jln@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9c599024