1. 01 12月, 2012 3 次提交
    • M
      mm: avoid waking kswapd for THP allocations when compaction is deferred or contended · 782fd304
      Mel Gorman 提交于
      With "mm: vmscan: scale number of pages reclaimed by reclaim/compaction
      based on failures" reverted, Zdenek Kabelac reported the following
      
        Hmm,  so it's just took longer to hit the problem and observe
        kswapd0 spinning on my CPU again - it's not as endless like before -
        but still it easily eats minutes - it helps to turn off  Firefox
        or TB  (memory hungry apps) so kswapd0 stops soon - and restart
        those apps again.  (And I still have like >1GB of cached memory)
      
        kswapd0         R  running task        0    30      2 0x00000000
        Call Trace:
          preempt_schedule+0x42/0x60
          _raw_spin_unlock+0x55/0x60
          put_super+0x31/0x40
          drop_super+0x22/0x30
          prune_super+0x149/0x1b0
          shrink_slab+0xba/0x510
      
      The sysrq+m indicates the system has no swap so it'll never reclaim
      anonymous pages as part of reclaim/compaction.  That is one part of the
      problem but not the root cause as file-backed pages could also be
      reclaimed.
      
      The likely underlying problem is that kswapd is woken up or kept awake
      for each THP allocation request in the page allocator slow path.
      
      If compaction fails for the requesting process then compaction will be
      deferred for a time and direct reclaim is avoided.  However, if there
      are a storm of THP requests that are simply rejected, it will still be
      the the case that kswapd is awake for a prolonged period of time as
      pgdat->kswapd_max_order is updated each time.  This is noticed by the
      main kswapd() loop and it will not call kswapd_try_to_sleep().  Instead
      it will loopp, shrinking a small number of pages and calling
      shrink_slab() on each iteration.
      
      This patch defers when kswapd gets woken up for THP allocations.  For
      !THP allocations, kswapd is always woken up.  For THP allocations,
      kswapd is woken up iff the process is willing to enter into direct
      reclaim/compaction.
      
      [akpm@linux-foundation.org: fix typo in comment]
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Cc: Zdenek Kabelac <zkabelac@redhat.com>
      Cc: Seth Jennings <sjenning@linux.vnet.ibm.com>
      Cc: Jiri Slaby <jirislaby@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Robert Jennings <rcj@linux.vnet.ibm.com>
      Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      782fd304
    • A
      revert "Revert "mm: remove __GFP_NO_KSWAPD"" · a5091539
      Andrew Morton 提交于
      It apepars that this patch was innocent, and we hope that "mm: avoid
      waking kswapd for THP allocations when compaction is deferred or
      contended" will fix the final kswapd-spinning cause.
      
      Cc: Zdenek Kabelac <zkabelac@redhat.com>
      Cc: Seth Jennings <sjenning@linux.vnet.ibm.com>
      Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
      Cc: Jiri Slaby <jirislaby@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Robert Jennings <rcj@linux.vnet.ibm.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a5091539
    • M
      mm: compaction: fix return value of capture_free_page() · 58d00209
      Mel Gorman 提交于
      Commit ef6c5be6 ("fix incorrect NR_FREE_PAGES accounting (appears
      like memory leak)") fixes a NR_FREE_PAGE accounting leak but missed the
      return value which was also missed by this reviewer until today.
      
      That return value is used by compaction when adding pages to a list of
      isolated free pages and without this follow-up fix, there is a risk of
      free list corruption.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      58d00209
  2. 27 11月, 2012 1 次提交
    • M
      Revert "mm: remove __GFP_NO_KSWAPD" · 82b212f4
      Mel Gorman 提交于
      With "mm: vmscan: scale number of pages reclaimed by reclaim/compaction
      based on failures" reverted, Zdenek Kabelac reported the following
      
        Hmm,  so it's just took longer to hit the problem and observe
        kswapd0 spinning on my CPU again - it's not as endless like before -
        but still it easily eats minutes - it helps to	turn off  Firefox
        or TB  (memory hungry apps) so kswapd0 stops soon - and restart
        those apps again.  (And I still have like >1GB of cached memory)
      
        kswapd0         R  running task        0    30      2 0x00000000
        Call Trace:
          preempt_schedule+0x42/0x60
          _raw_spin_unlock+0x55/0x60
          put_super+0x31/0x40
          drop_super+0x22/0x30
          prune_super+0x149/0x1b0
          shrink_slab+0xba/0x510
      
      The sysrq+m indicates the system has no swap so it'll never reclaim
      anonymous pages as part of reclaim/compaction.  That is one part of the
      problem but not the root cause as file-backed pages could also be
      reclaimed.
      
      The likely underlying problem is that kswapd is woken up or kept awake
      for each THP allocation request in the page allocator slow path.
      
      If compaction fails for the requesting process then compaction will be
      deferred for a time and direct reclaim is avoided.  However, if there
      are a storm of THP requests that are simply rejected, it will still be
      the the case that kswapd is awake for a prolonged period of time as
      pgdat->kswapd_max_order is updated each time.  This is noticed by the
      main kswapd() loop and it will not call kswapd_try_to_sleep().  Instead
      it will loopp, shrinking a small number of pages and calling
      shrink_slab() on each iteration.
      
      The temptation is to supply a patch that checks if kswapd was woken for
      THP and if so ignore pgdat->kswapd_max_order but it'll be a hack and not
      backed up by proper testing.  As 3.7 is very close to release and this
      is not a bug we should release with, a safer path is to revert "mm:
      remove __GFP_NO_KSWAPD" for now and revisit it with the view to ironing
      out the balance_pgdat() logic in general.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Cc: Zdenek Kabelac <zkabelac@redhat.com>
      Cc: Seth Jennings <sjenning@linux.vnet.ibm.com>
      Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
      Cc: Jiri Slaby <jirislaby@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Robert Jennings <rcj@linux.vnet.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      82b212f4
  3. 22 11月, 2012 1 次提交
    • D
      fix incorrect NR_FREE_PAGES accounting (appears like memory leak) · ef6c5be6
      Dave Hansen 提交于
      There have been some 3.7-rc reports of vm issues, including some kswapd
      bugs and, more importantly, some memory "leaks":
      
      	http://www.spinics.net/lists/linux-mm/msg46187.html
      	https://bugzilla.kernel.org/show_bug.cgi?id=50181
      
      Commit 1fb3f8ca ("mm: compaction: capture a suitable high-order page
      immediately when it is made available") took split_free_page() and
      reused it for the compaction code.  It does something curious with
      capture_free_page() (previously known as split_free_page()):
      
        int capture_free_page(struct page *page, int alloc_order,
        ...
                __mod_zone_page_state(zone, NR_FREE_PAGES, -(1UL << order));
      
        -       /* Split into individual pages */
        -       set_page_refcounted(page);
        -       split_page(page, order);
        +       if (alloc_order != order)
        +               expand(zone, page, alloc_order, order,
        +                       &zone->free_area[order], migratetype);
      
      Note that expand() puts the pages _back_ in the allocator, but it does
      not bump NR_FREE_PAGES.  We "return" 'alloc_order' worth of pages, but
      we accounted for removing 'order' in the __mod_zone_page_state() call.
      
      For the old split_page()-style use (order==alloc_order) the bug will not
      trigger.  But, when called from the compaction code where we
      occasionally get a larger page out of the buddy allocator than we need,
      we will run in to this.
      
      This patch simply changes the NR_FREE_PAGES manipulation to the correct
      'alloc_order' instead of 'order'.
      
      I've been able to repeatedly trigger this in my testing environment.
      The amount "leaked" very closely tracks the imbalance I see in buddy
      pages vs.  NR_FREE_PAGES.  I have confirmed that this patch fixes the
      imbalance
      Signed-off-by: NDave Hansen <dave@linux.vnet.ibm.com>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ef6c5be6
  4. 17 11月, 2012 2 次提交
    • A
      revert "mm: fix-up zone present pages" · 5576646f
      Andrew Morton 提交于
      Revert commit 7f1290f2 ("mm: fix-up zone present pages")
      
      That patch tried to fix a issue when calculating zone->present_pages,
      but it caused a regression on 32bit systems with HIGHMEM.  With that
      change, reset_zone_present_pages() resets all zone->present_pages to
      zero, and fixup_zone_present_pages() is called to recalculate
      zone->present_pages when the boot allocator frees core memory pages into
      buddy allocator.  Because highmem pages are not freed by bootmem
      allocator, all highmem zones' present_pages becomes zero.
      
      Various options for improving the situation are being discussed but for
      now, let's return to the 3.6 code.
      
      Cc: Jianguo Wu <wujianguo@huawei.com>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Cc: Petr Tesarik <ptesarik@suse.cz>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Tested-by: NChris Clayton <chris2553@googlemail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5576646f
    • H
      memcg: fix hotplugged memory zone oops · bea8c150
      Hugh Dickins 提交于
      When MEMCG is configured on (even when it's disabled by boot option),
      when adding or removing a page to/from its lru list, the zone pointer
      used for stats updates is nowadays taken from the struct lruvec.  (On
      many configurations, calculating zone from page is slower.)
      
      But we have no code to update all the lruvecs (per zone, per memcg) when
      a memory node is hotadded.  Here's an extract from the oops which
      results when running numactl to bind a program to a newly onlined node:
      
        BUG: unable to handle kernel NULL pointer dereference at 0000000000000f60
        IP:  __mod_zone_page_state+0x9/0x60
        Pid: 1219, comm: numactl Not tainted 3.6.0-rc5+ #180 Bochs Bochs
        Process numactl (pid: 1219, threadinfo ffff880039abc000, task ffff8800383c4ce0)
        Call Trace:
          __pagevec_lru_add_fn+0xdf/0x140
          pagevec_lru_move_fn+0xb1/0x100
          __pagevec_lru_add+0x1c/0x30
          lru_add_drain_cpu+0xa3/0x130
          lru_add_drain+0x2f/0x40
         ...
      
      The natural solution might be to use a memcg callback whenever memory is
      hotadded; but that solution has not been scoped out, and it happens that
      we do have an easy location at which to update lruvec->zone.  The lruvec
      pointer is discovered either by mem_cgroup_zone_lruvec() or by
      mem_cgroup_page_lruvec(), and both of those do know the right zone.
      
      So check and set lruvec->zone in those; and remove the inadequate
      attempt to set lruvec->zone from lruvec_init(), which is called before
      NODE_DATA(node) has been allocated in such cases.
      
      Ah, there was one exceptionr.  For no particularly good reason,
      mem_cgroup_force_empty_list() has its own code for deciding lruvec.
      Change it to use the standard mem_cgroup_zone_lruvec() and
      mem_cgroup_get_lru_size() too.  In fact it was already safe against such
      an oops (the lru lists in danger could only be empty), but we're better
      proofed against future changes this way.
      
      I've marked this for stable (3.6) since we introduced the problem in 3.5
      (now closed to stable); but I have no idea if this is the only fix
      needed to get memory hotadd working with memcg in 3.6, and received no
      answer when I enquired twice before.
      Reported-by: NTang Chen <tangchen@cn.fujitsu.com>
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bea8c150
  5. 26 10月, 2012 2 次提交
  6. 23 10月, 2012 1 次提交
  7. 09 10月, 2012 21 次提交
  8. 18 9月, 2012 1 次提交
  9. 22 8月, 2012 2 次提交
    • M
      mm: compaction: Abort async compaction if locks are contended or taking too long · c67fe375
      Mel Gorman 提交于
      Jim Schutt reported a problem that pointed at compaction contending
      heavily on locks.  The workload is straight-forward and in his own words;
      
      	The systems in question have 24 SAS drives spread across 3 HBAs,
      	running 24 Ceph OSD instances, one per drive.  FWIW these servers
      	are dual-socket Intel 5675 Xeons w/48 GB memory.  I've got ~160
      	Ceph Linux clients doing dd simultaneously to a Ceph file system
      	backed by 12 of these servers.
      
      Early in the test everything looks fine
      
        procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu-------
         r  b       swpd       free       buff      cache   si   so    bi    bo   in   cs  us sy  id wa st
        31 15          0     287216        576   38606628    0    0     2  1158    2   14   1  3  95  0  0
        27 15          0     225288        576   38583384    0    0    18 2222016 203357 134876  11 56  17 15  0
        28 17          0     219256        576   38544736    0    0    11 2305932 203141 146296  11 49  23 17  0
         6 18          0     215596        576   38552872    0    0     7 2363207 215264 166502  12 45  22 20  0
        22 18          0     226984        576   38596404    0    0     3 2445741 223114 179527  12 43  23 22  0
      
      and then it goes to pot
      
        procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu-------
         r  b       swpd       free       buff      cache   si   so    bi    bo   in   cs  us sy  id wa st
        163  8          0     464308        576   36791368    0    0    11 22210  866  536   3 13  79  4  0
        207 14          0     917752        576   36181928    0    0   712 1345376 134598 47367   7 90   1  2  0
        123 12          0     685516        576   36296148    0    0   429 1386615 158494 60077   8 84   5  3  0
        123 12          0     598572        576   36333728    0    0  1107 1233281 147542 62351   7 84   5  4  0
        622  7          0     660768        576   36118264    0    0   557 1345548 151394 59353   7 85   4  3  0
        223 11          0     283960        576   36463868    0    0    46 1107160 121846 33006   6 93   1  1  0
      
      Note that system CPU usage is very high blocks being written out has
      dropped by 42%. He analysed this with perf and found
      
        perf record -g -a sleep 10
        perf report --sort symbol --call-graph fractal,5
          34.63%  [k] _raw_spin_lock_irqsave
                  |
                  |--97.30%-- isolate_freepages
                  |          compaction_alloc
                  |          unmap_and_move
                  |          migrate_pages
                  |          compact_zone
                  |          compact_zone_order
                  |          try_to_compact_pages
                  |          __alloc_pages_direct_compact
                  |          __alloc_pages_slowpath
                  |          __alloc_pages_nodemask
                  |          alloc_pages_vma
                  |          do_huge_pmd_anonymous_page
                  |          handle_mm_fault
                  |          do_page_fault
                  |          page_fault
                  |          |
                  |          |--87.39%-- skb_copy_datagram_iovec
                  |          |          tcp_recvmsg
                  |          |          inet_recvmsg
                  |          |          sock_recvmsg
                  |          |          sys_recvfrom
                  |          |          system_call
                  |          |          __recv
                  |          |          |
                  |          |           --100.00%-- (nil)
                  |          |
                  |           --12.61%-- memcpy
                   --2.70%-- [...]
      
      There was other data but primarily it is all showing that compaction is
      contended heavily on the zone->lock and zone->lru_lock.
      
      commit [b2eef8c0: mm: compaction: minimise the time IRQs are disabled
      while isolating pages for migration] noted that it was possible for
      migration to hold the lru_lock for an excessive amount of time. Very
      broadly speaking this patch expands the concept.
      
      This patch introduces compact_checklock_irqsave() to check if a lock
      is contended or the process needs to be scheduled. If either condition
      is true then async compaction is aborted and the caller is informed.
      The page allocator will fail a THP allocation if compaction failed due
      to contention. This patch also introduces compact_trylock_irqsave()
      which will acquire the lock only if it is not contended and the process
      does not need to schedule.
      Reported-by: NJim Schutt <jaschut@sandia.gov>
      Tested-by: NJim Schutt <jaschut@sandia.gov>
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c67fe375
    • A
      mm: correct page->pfmemalloc to fix deactivate_slab regression · b121186a
      Alex Shi 提交于
      Commit cfd19c5a ("mm: only set page->pfmemalloc when
      ALLOC_NO_WATERMARKS was used") tried to narrow down page->pfmemalloc
      setting, but it missed some places the pfmemalloc should be set.
      
      So, in __slab_alloc, the unalignment pfmemalloc and ALLOC_NO_WATERMARKS
      cause incorrect deactivate_slab() on our core2 server:
      
          64.73%           fio  [kernel.kallsyms]     [k] _raw_spin_lock
                           |
                           --- _raw_spin_lock
                              |
                              |---0.34%-- deactivate_slab
                              |          __slab_alloc
                              |          kmem_cache_alloc
                              |          |
      
      That causes our fio sync write performance to have a 40% regression.
      
      Move the checking in get_page_from_freelist() which resolves this issue.
      Signed-off-by: NAlex Shi <alex.shi@intel.com>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Cc: David Miller <davem@davemloft.net
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Tested-by: NEric Dumazet <eric.dumazet@gmail.com>
      Tested-by: NSage Weil <sage@inktank.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b121186a
  10. 03 8月, 2012 1 次提交
    • L
      mm: remove node_start_pfn checking in new WARN_ON for now · 8783b6e2
      Linus Torvalds 提交于
      Borislav Petkov reports that the new warning added in commit
      88fdf75d ("mm: warn if pg_data_t isn't initialized with zero")
      triggers for him, and it is the node_start_pfn field that has already
      been initialized once.
      
      The call trace looks like this:
      
        x86_64_start_kernel ->
          x86_64_start_reservations ->
          start_kernel ->
          setup_arch ->
          paging_init ->
          zone_sizes_init ->
          free_area_init_nodes ->
          free_area_init_node
      
      and (with the warning replaced by debug output), Borislav sees
      
        On node 0 totalpages: 4193848
          DMA zone: 64 pages used for memmap
          DMA zone: 6 pages reserved
          DMA zone: 3890 pages, LIFO batch:0
          DMA32 zone: 16320 pages used for memmap
          DMA32 zone: 798464 pages, LIFO batch:31
          Normal zone: 52736 pages used for memmap
          Normal zone: 3322368 pages, LIFO batch:31
        free_area_init_node: pgdat->node_start_pfn: 4423680      <----
        On node 1 totalpages: 4194304
          Normal zone: 65536 pages used for memmap
          Normal zone: 4128768 pages, LIFO batch:31
        free_area_init_node: pgdat->node_start_pfn: 8617984      <----
        On node 2 totalpages: 4194304
          Normal zone: 65536 pages used for memmap
          Normal zone: 4128768 pages, LIFO batch:31
        free_area_init_node: pgdat->node_start_pfn: 12812288     <----
        On node 3 totalpages: 4194304
          Normal zone: 65536 pages used for memmap
          Normal zone: 4128768 pages, LIFO batch:31
      
      so remove the bogus warning for now to avoid annoying people.  Minchan
      Kim is looking at it.
      Reported-by: NBorislav Petkov <bp@amd64.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8783b6e2
  11. 01 8月, 2012 5 次提交
    • M
      mm: remove redundant initialization · 6527af5d
      Minchan Kim 提交于
      pg_data_t is zeroed before reaching free_area_init_core(), so remove the
      now unnecessary initializations.
      Signed-off-by: NMinchan Kim <minchan@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6527af5d
    • M
      mm: warn if pg_data_t isn't initialized with zero · 88fdf75d
      Minchan Kim 提交于
      Warn if memory-hotplug/boot code doesn't initialize pg_data_t with zero
      when it is allocated.  Arch code and memory hotplug already initiailize
      pg_data_t.  So this warning should never happen.  I select fields randomly
      near the beginning, middle and end of pg_data_t for checking.
      
      This patch isn't for performance but for removing initialization code
      which is necessary to add whenever we adds new field to pg_data_t or zone.
      
      Firstly, Andrew suggested clearing out of pg_data_t in MM core part but
      Tejun doesn't like it because in the future, some archs can initialize
      some fields in arch code and pass them into general MM part so blindly
      clearing it out in mm core part would be very annoying.
      Signed-off-by: NMinchan Kim <minchan@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      88fdf75d
    • M
      mm: throttle direct reclaimers if PF_MEMALLOC reserves are low and swap is... · 5515061d
      Mel Gorman 提交于
      mm: throttle direct reclaimers if PF_MEMALLOC reserves are low and swap is backed by network storage
      
      If swap is backed by network storage such as NBD, there is a risk that a
      large number of reclaimers can hang the system by consuming all
      PF_MEMALLOC reserves.  To avoid these hangs, the administrator must tune
      min_free_kbytes in advance which is a bit fragile.
      
      This patch throttles direct reclaimers if half the PF_MEMALLOC reserves
      are in use.  If the system is routinely getting throttled the system
      administrator can increase min_free_kbytes so degradation is smoother but
      the system will keep running.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Cc: David Miller <davem@davemloft.net>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mike Christie <michaelc@cs.wisc.edu>
      Cc: Eric B Munson <emunson@mgebm.net>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5515061d
    • M
      mm: ignore mempolicies when using ALLOC_NO_WATERMARK · 183f6371
      Mel Gorman 提交于
      The reserve is proportionally distributed over all !highmem zones in the
      system.  So we need to allow an emergency allocation access to all zones.
      In order to do that we need to break out of any mempolicy boundaries we
      might have.
      
      In my opinion that does not break mempolicies as those are user oriented
      and not system oriented.  That is, system allocations are not guaranteed
      to be within mempolicy boundaries.  For instance IRQs do not even have a
      mempolicy.
      
      So breaking out of mempolicy boundaries for 'rare' emergency allocations,
      which are always system allocations (as opposed to user) is ok.
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Cc: David Miller <davem@davemloft.net>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Mike Christie <michaelc@cs.wisc.edu>
      Cc: Eric B Munson <emunson@mgebm.net>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      183f6371
    • M
      mm: only set page->pfmemalloc when ALLOC_NO_WATERMARKS was used · cfd19c5a
      Mel Gorman 提交于
      __alloc_pages_slowpath() is called when the number of free pages is below
      the low watermark.  If the caller is entitled to use ALLOC_NO_WATERMARKS
      then the page will be marked page->pfmemalloc.  This protects more pages
      than are strictly necessary as we only need to protect pages allocated
      below the min watermark (the pfmemalloc reserves).
      
      This patch only sets page->pfmemalloc when ALLOC_NO_WATERMARKS was
      required to allocate the page.
      
      [rientjes@google.com: David noticed the problem during review]
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Cc: David Miller <davem@davemloft.net>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mike Christie <michaelc@cs.wisc.edu>
      Cc: Eric B Munson <emunson@mgebm.net>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      cfd19c5a