1. 05 6月, 2014 40 次提交
    • M
      mm/msync.c: sync only the requested range in msync() · 7fc34a62
      Matthew Wilcox 提交于
      msync() currently syncs more than POSIX requires or BSD or Solaris
      implement.  It is supposed to be equivalent to fdatasync(), not fsync(),
      and it is only supposed to sync the portion of the file that overlaps the
      range passed to msync.
      
      If the VMA is non-linear, fall back to syncing the entire file, but we
      still optimise to only fdatasync() the entire file, not the full fsync().
      
      akpm: there are obvious concerns with bck-compatibility: is anyone relying
      on the undocumented side-effect for their data integrity?  And how would
      they ever know if this change broke their data integrity?
      
      We think the risk is reasonably low, and this patch brings the kernel into
      line with other OS's and with what the manpage has always said...
      Signed-off-by: NMatthew Wilcox <matthew.r.wilcox@intel.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Acked-by: NJeff Moyer <jmoyer@redhat.com>
      Cc: Chris Mason <clm@fb.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7fc34a62
    • C
      hwpoison: remove unused global variable in do_machine_check() · 65eb7182
      Chen Yucong 提交于
      Remove an unused global variable mce_entry and relative operations in
      do_machine_check().
      Signed-off-by: NChen Yucong <slaoub@gmail.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      65eb7182
    • V
      mm, compaction: properly signal and act upon lock and need_sched() contention · be976572
      Vlastimil Babka 提交于
      Compaction uses compact_checklock_irqsave() function to periodically check
      for lock contention and need_resched() to either abort async compaction,
      or to free the lock, schedule and retake the lock.  When aborting,
      cc->contended is set to signal the contended state to the caller.  Two
      problems have been identified in this mechanism.
      
      First, compaction also calls directly cond_resched() in both scanners when
      no lock is yet taken.  This call either does not abort async compaction,
      or set cc->contended appropriately.  This patch introduces a new
      compact_should_abort() function to achieve both.  In isolate_freepages(),
      the check frequency is reduced to once by SWAP_CLUSTER_MAX pageblocks to
      match what the migration scanner does in the preliminary page checks.  In
      case a pageblock is found suitable for calling isolate_freepages_block(),
      the checks within there are done on higher frequency.
      
      Second, isolate_freepages() does not check if isolate_freepages_block()
      aborted due to contention, and advances to the next pageblock.  This
      violates the principle of aborting on contention, and might result in
      pageblocks not being scanned completely, since the scanning cursor is
      advanced.  This problem has been noticed in the code by Joonsoo Kim when
      reviewing related patches.  This patch makes isolate_freepages_block()
      check the cc->contended flag and abort.
      
      In case isolate_freepages() has already isolated some pages before
      aborting due to contention, page migration will proceed, which is OK since
      we do not want to waste the work that has been done, and page migration
      has own checks for contention.  However, we do not want another isolation
      attempt by either of the scanners, so cc->contended flag check is added
      also to compaction_alloc() and compact_finished() to make sure compaction
      is aborted right after the migration.
      
      The outcome of the patch should be reduced lock contention by async
      compaction and lower latencies for higher-order allocations where direct
      compaction is involved.
      
      [akpm@linux-foundation.org: fix typo in comment]
      Reported-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Reviewed-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Rik van Riel <riel@redhat.com>
      Acked-by: NMichal Nazarewicz <mina86@mina86.com>
      Tested-by: NShawn Guo <shawn.guo@linaro.org>
      Tested-by: NKevin Hilman <khilman@linaro.org>
      Tested-by: NStephen Warren <swarren@nvidia.com>
      Tested-by: NFabio Estevam <fabio.estevam@freescale.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      be976572
    • F
      fs/hugetlbfs/inode.c: remove null test before kfree · 6e6870d4
      Fabian Frederick 提交于
      Fix checkpatch warning:
      WARNING: kfree(NULL) is safe this check is probably not required
      Signed-off-by: NFabian Frederick <fabf@skynet.be>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6e6870d4
    • F
      fs/hugetlbfs/inode.c: use static const for dentry_operations · be1d2cf5
      Fabian Frederick 提交于
      ...like other filesystems.
      Signed-off-by: NFabian Frederick <fabf@skynet.be>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      be1d2cf5
    • F
      fs/hugetlbfs/inode.c: add static to hugetlbfs_i_mmap_mutex_key · 422b2448
      Fabian Frederick 提交于
      hugetlbfs_i_mmap_mutex_key is only used in inode.c
      Signed-off-by: NFabian Frederick <fabf@skynet.be>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      422b2448
    • J
      mm/vmscan.c: use DIV_ROUND_UP for calculation of zone's balance_gap and correct comments. · 4be89a34
      Jianyu Zhan 提交于
      Currently, we use (zone->managed_pages + KSWAPD_ZONE_BALANCE_GAP_RATIO-1)
      / KSWAPD_ZONE_BALANCE_GAP_RATIO to avoid a zero gap value.  It's better to
      use DIV_ROUND_UP macro for neater code and clear meaning.
      
      Besides, the gap value is calculated against the per-zone "managed pages",
      not "present pages".  This patch also corrects the comment and do some
      rephrasing.
      Signed-off-by: NJianyu Zhan <nasa4836@gmail.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Acked-by: NRafael Aquini <aquini@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4be89a34
    • A
      include/linux/gfp.h: exclude duplicate header · b7596fb4
      Andy Shevchenko 提交于
      mmdebug.h is included twice.
      Signed-off-by: NAndy Shevchenko <andriy.shevchenko@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b7596fb4
    • J
      mm, hugetlb: move the error handle logic out of normal code path · 8f34af6f
      Jianyu Zhan 提交于
      alloc_huge_page() now mixes normal code path with error handle logic.
      This patches move out the error handle logic, to make normal code path
      more clean and redue code duplicate.
      Signed-off-by: NJianyu Zhan <nasa4836@gmail.com>
      Acked-by: NDavidlohr Bueso <davidlohr@hp.com>
      Reviewed-by: NMichal Hocko <mhocko@suse.cz>
      Reviewed-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8f34af6f
    • N
      mm/memory-failure.c: move comment · 6edd6cc6
      Naoya Horiguchi 提交于
      The comment about pages under writeback is far from the relevant code, so
      let's move it to the right place.
      Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6edd6cc6
    • M
      mm: avoid unnecessary atomic operations during end_page_writeback() · 888cf2db
      Mel Gorman 提交于
      If a page is marked for immediate reclaim then it is moved to the tail of
      the LRU list.  This occurs when the system is under enough memory pressure
      for pages under writeback to reach the end of the LRU but we test for this
      using atomic operations on every writeback.  This patch uses an optimistic
      non-atomic test first.  It'll miss some pages in rare cases but the
      consequences are not severe enough to warrant such a penalty.
      
      While the function does not dominate profiles during a simple dd test the
      cost of it is reduced.
      
      73048     0.7428  vmlinux-3.15.0-rc5-mmotm-20140513 end_page_writeback
      23740     0.2409  vmlinux-3.15.0-rc5-lessatomic     end_page_writeback
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      888cf2db
    • M
      mm: page_alloc: calculate classzone_idx once from the zonelist ref · d8846374
      Mel Gorman 提交于
      There is no need to calculate zone_idx(preferred_zone) multiple times
      or use the pgdat to figure it out.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NRik van Riel <riel@redhat.com>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d8846374
    • M
      mm: non-atomically mark page accessed during page cache allocation where possible · 2457aec6
      Mel Gorman 提交于
      aops->write_begin may allocate a new page and make it visible only to have
      mark_page_accessed called almost immediately after.  Once the page is
      visible the atomic operations are necessary which is noticable overhead
      when writing to an in-memory filesystem like tmpfs but should also be
      noticable with fast storage.  The objective of the patch is to initialse
      the accessed information with non-atomic operations before the page is
      visible.
      
      The bulk of filesystems directly or indirectly use
      grab_cache_page_write_begin or find_or_create_page for the initial
      allocation of a page cache page.  This patch adds an init_page_accessed()
      helper which behaves like the first call to mark_page_accessed() but may
      called before the page is visible and can be done non-atomically.
      
      The primary APIs of concern in this care are the following and are used
      by most filesystems.
      
      	find_get_page
      	find_lock_page
      	find_or_create_page
      	grab_cache_page_nowait
      	grab_cache_page_write_begin
      
      All of them are very similar in detail to the patch creates a core helper
      pagecache_get_page() which takes a flags parameter that affects its
      behavior such as whether the page should be marked accessed or not.  Then
      old API is preserved but is basically a thin wrapper around this core
      function.
      
      Each of the filesystems are then updated to avoid calling
      mark_page_accessed when it is known that the VM interfaces have already
      done the job.  There is a slight snag in that the timing of the
      mark_page_accessed() has now changed so in rare cases it's possible a page
      gets to the end of the LRU as PageReferenced where as previously it might
      have been repromoted.  This is expected to be rare but it's worth the
      filesystem people thinking about it in case they see a problem with the
      timing change.  It is also the case that some filesystems may be marking
      pages accessed that previously did not but it makes sense that filesystems
      have consistent behaviour in this regard.
      
      The test case used to evaulate this is a simple dd of a large file done
      multiple times with the file deleted on each iterations.  The size of the
      file is 1/10th physical memory to avoid dirty page balancing.  In the
      async case it will be possible that the workload completes without even
      hitting the disk and will have variable results but highlight the impact
      of mark_page_accessed for async IO.  The sync results are expected to be
      more stable.  The exception is tmpfs where the normal case is for the "IO"
      to not hit the disk.
      
      The test machine was single socket and UMA to avoid any scheduling or NUMA
      artifacts.  Throughput and wall times are presented for sync IO, only wall
      times are shown for async as the granularity reported by dd and the
      variability is unsuitable for comparison.  As async results were variable
      do to writback timings, I'm only reporting the maximum figures.  The sync
      results were stable enough to make the mean and stddev uninteresting.
      
      The performance results are reported based on a run with no profiling.
      Profile data is based on a separate run with oprofile running.
      
      async dd
                                          3.15.0-rc3            3.15.0-rc3
                                             vanilla           accessed-v2
      ext3    Max      elapsed     13.9900 (  0.00%)     11.5900 ( 17.16%)
      tmpfs	Max      elapsed      0.5100 (  0.00%)      0.4900 (  3.92%)
      btrfs   Max      elapsed     12.8100 (  0.00%)     12.7800 (  0.23%)
      ext4	Max      elapsed     18.6000 (  0.00%)     13.3400 ( 28.28%)
      xfs	Max      elapsed     12.5600 (  0.00%)      2.0900 ( 83.36%)
      
      The XFS figure is a bit strange as it managed to avoid a worst case by
      sheer luck but the average figures looked reasonable.
      
              samples percentage
      ext3       86107    0.9783  vmlinux-3.15.0-rc4-vanilla        mark_page_accessed
      ext3       23833    0.2710  vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
      ext3        5036    0.0573  vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
      ext4       64566    0.8961  vmlinux-3.15.0-rc4-vanilla        mark_page_accessed
      ext4        5322    0.0713  vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
      ext4        2869    0.0384  vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
      xfs        62126    1.7675  vmlinux-3.15.0-rc4-vanilla        mark_page_accessed
      xfs         1904    0.0554  vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
      xfs          103    0.0030  vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
      btrfs      10655    0.1338  vmlinux-3.15.0-rc4-vanilla        mark_page_accessed
      btrfs       2020    0.0273  vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
      btrfs        587    0.0079  vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
      tmpfs      59562    3.2628  vmlinux-3.15.0-rc4-vanilla        mark_page_accessed
      tmpfs       1210    0.0696  vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
      tmpfs         94    0.0054  vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
      
      [akpm@linux-foundation.org: don't run init_page_accessed() against an uninitialised pointer]
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Tested-by: NPrabhakar Lad <prabhakar.csengg@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2457aec6
    • M
      fs: buffer: do not use unnecessary atomic operations when discarding buffers · e7470ee8
      Mel Gorman 提交于
      Discarding buffers uses a bunch of atomic operations when discarding
      buffers because ......  I can't think of a reason.  Use a cmpxchg loop to
      clear all the necessary flags.  In most (all?) cases this will be a single
      atomic operations.
      
      [akpm@linux-foundation.org: move BUFFER_FLAGS_DISCARD into the .c file]
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e7470ee8
    • M
      mm: do not use unnecessary atomic operations when adding pages to the LRU · 6fb81a17
      Mel Gorman 提交于
      When adding pages to the LRU we clear the active bit unconditionally.
      As the page could be reachable from other paths we cannot use unlocked
      operations without risk of corruption such as a parallel
      mark_page_accessed.  This patch tests if is necessary to clear the
      active flag before using an atomic operation.  This potentially opens a
      tiny race when PageActive is checked as mark_page_accessed could be
      called after PageActive was checked.  The race already exists but this
      patch changes it slightly.  The consequence is that that the page may be
      promoted to the active list that might have been left on the inactive
      list before the patch.  It's too tiny a race and too marginal a
      consequence to always use atomic operations for.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6fb81a17
    • M
      mm: do not use atomic operations when releasing pages · e3741b50
      Mel Gorman 提交于
      There should be no references to it any more and a parallel mark should
      not be reordered against us.  Use non-locked varient to clear page active.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NRik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e3741b50
    • M
      mm: shmem: avoid atomic operation during shmem_getpage_gfp · 07a42788
      Mel Gorman 提交于
      shmem_getpage_gfp uses an atomic operation to set the SwapBacked field
      before it's even added to the LRU or visible.  This is unnecessary as what
      could it possible race against?  Use an unlocked variant.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NRik van Riel <riel@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      07a42788
    • M
      mm: page_alloc: convert hot/cold parameter and immediate callers to bool · b745bc85
      Mel Gorman 提交于
      cold is a bool, make it one.  Make the likely case the "if" part of the
      block instead of the else as according to the optimisation manual this is
      preferred.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NRik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b745bc85
    • M
      mm: page_alloc: use unsigned int for order in more places · 7aeb09f9
      Mel Gorman 提交于
      X86 prefers the use of unsigned types for iterators and there is a
      tendency to mix whether a signed or unsigned type if used for page order.
      This converts a number of sites in mm/page_alloc.c to use unsigned int for
      order where possible.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NRik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7aeb09f9
    • M
      mm: page_alloc: lookup pageblock migratetype with IRQs enabled during free · cfc47a28
      Mel Gorman 提交于
      get_pageblock_migratetype() is called during free with IRQs disabled.
      This is unnecessary and disables IRQs for longer than necessary.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NRik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      cfc47a28
    • M
      mm: page_alloc: reduce number of times page_to_pfn is called · dc4b0caf
      Mel Gorman 提交于
      In the free path we calculate page_to_pfn multiple times. Reduce that.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NRik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      dc4b0caf
    • M
      mm: page_alloc: use word-based accesses for get/set pageblock bitmaps · e58469ba
      Mel Gorman 提交于
      The test_bit operations in get/set pageblock flags are expensive.  This
      patch reads the bitmap on a word basis and use shifts and masks to isolate
      the bits of interest.  Similarly masks are used to set a local copy of the
      bitmap and then use cmpxchg to update the bitmap if there have been no
      other changes made in parallel.
      
      In a test running dd onto tmpfs the overhead of the pageblock-related
      functions went from 1.27% in profiles to 0.5%.
      
      In addition to the performance benefits, this patch closes races that are
      possible between:
      
      a) get_ and set_pageblock_migratetype(), where get_pageblock_migratetype()
         reads part of the bits before and other part of the bits after
         set_pageblock_migratetype() has updated them.
      
      b) set_pageblock_migratetype() and set_pageblock_skip(), where the non-atomic
         read-modify-update set bit operation in set_pageblock_skip() will cause
         lost updates to some bits changed in the set_pageblock_migratetype().
      
      Joonsoo Kim first reported the case a) via code inspection.  Vlastimil
      Babka's testing with a debug patch showed that either a) or b) occurs
      roughly once per mmtests' stress-highalloc benchmark (although not
      necessarily in the same pageblock).  Furthermore during development of
      unrelated compaction patches, it was observed that frequent calls to
      {start,undo}_isolate_page_range() the race occurs several thousands of
      times and has resulted in NULL pointer dereferences in move_freepages()
      and free_one_page() in places where free_list[migratetype] is
      manipulated by e.g.  list_move().  Further debugging confirmed that
      migratetype had invalid value of 6, causing out of bounds access to the
      free_list array.
      
      That confirmed that the race exist, although it may be extremely rare,
      and currently only fatal where page isolation is performed due to
      memory hot remove.  Races on pageblocks being updated by
      set_pageblock_migratetype(), where both old and new migratetype are
      lower MIGRATE_RESERVE, currently cannot result in an invalid value
      being observed, although theoretically they may still lead to
      unexpected creation or destruction of MIGRATE_RESERVE pageblocks.
      Furthermore, things could get suddenly worse when memory isolation is
      used more, or when new migratetypes are added.
      
      After this patch, the race has no longer been observed in testing.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Reported-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Reported-and-tested-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e58469ba
    • M
      mm: page_alloc: take the ALLOC_NO_WATERMARK check out of the fast path · 5dab2911
      Mel Gorman 提交于
      ALLOC_NO_WATERMARK is set in a few cases.  Always by kswapd, always for
      __GFP_MEMALLOC, sometimes for swap-over-nfs, tasks etc.  Each of these
      cases are relatively rare events but the ALLOC_NO_WATERMARK check is an
      unlikely branch in the fast path.  This patch moves the check out of the
      fast path and after it has been determined that the watermarks have not
      been met.  This helps the common fast path at the cost of making the slow
      path slower and hitting kswapd with a performance cost.  It's a reasonable
      tradeoff.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5dab2911
    • M
      mm: page_alloc: only check the alloc flags and gfp_mask for dirty once · a6e21b14
      Mel Gorman 提交于
      Currently it's calculated once per zone in the zonelist.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a6e21b14
    • M
      mm: page_alloc: only check the zone id check if pages are buddies · d34c5fa0
      Mel Gorman 提交于
      A node/zone index is used to check if pages are compatible for merging
      but this happens unconditionally even if the buddy page is not free. Defer
      the calculation as long as possible. Ideally we would check the zone boundary
      but nodes can overlap.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NRik van Riel <riel@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d34c5fa0
    • M
      mm: page_alloc: use jump labels to avoid checking number_of_cpusets · 664eedde
      Mel Gorman 提交于
      If cpusets are not in use then we still check a global variable on every
      page allocation.  Use jump labels to avoid the overhead.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      664eedde
    • M
      include/linux/jump_label.h: expose the reference count · ea5e9539
      Mel Gorman 提交于
      This patch exposes the jump_label reference count in preparation for the
      next patch.  cpusets cares about both the jump_label being enabled and how
      many users of the cpusets there currently are.
      Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ea5e9539
    • M
      mm: page_alloc: do not treat a zone that cannot be used for dirty pages as "full" · 800a1e75
      Mel Gorman 提交于
      If a zone cannot be used for a dirty page then it gets marked "full" which
      is cached in the zlc and later potentially skipped by allocation requests
      that have nothing to do with dirty zones.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      800a1e75
    • M
      mm: page_alloc: do not update zlc unless the zlc is active · 65bb3719
      Mel Gorman 提交于
      The zlc is used on NUMA machines to quickly skip over zones that are full.
       However it is always updated, even for the first zone scanned when the
      zlc might not even be active.  As it's a write to a bitmap that
      potentially bounces cache line it's deceptively expensive and most
      machines will not care.  Only update the zlc if it was active.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      65bb3719
    • V
      slab: delete cache from list after __kmem_cache_shutdown succeeds · 0bd62b11
      Vladimir Davydov 提交于
      Currently, on kmem_cache_destroy we delete the cache from the slab_list
      before __kmem_cache_shutdown, inserting it back to the list on failure.
      Initially, this was done, because we could release the slab_mutex in
      __kmem_cache_shutdown to delete sysfs slub entry, but since commit
      41a21285 ("slub: use sysfs'es release mechanism for kmem_cache") we
      remove sysfs entry later in kmem_cache_destroy after dropping the
      slab_mutex, so that no implementation of __kmem_cache_shutdown can ever
      release the lock.  Therefore we can simplify the code a bit by moving
      list_del after __kmem_cache_shutdown.
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0bd62b11
    • V
      memcg: cleanup kmem cache creation/destruction functions naming · 776ed0f0
      Vladimir Davydov 提交于
      Current names are rather inconsistent. Let's try to improve them.
      
      Brief change log:
      
      ** old name **                          ** new name **
      
      kmem_cache_create_memcg                 memcg_create_kmem_cache
      memcg_kmem_create_cache                 memcg_regsiter_cache
      memcg_kmem_destroy_cache                memcg_unregister_cache
      
      kmem_cache_destroy_memcg_children       memcg_cleanup_cache_params
      mem_cgroup_destroy_all_caches           memcg_unregister_all_caches
      
      create_work                             memcg_register_cache_work
      memcg_create_cache_work_func            memcg_register_cache_func
      memcg_create_cache_enqueue              memcg_schedule_register_cache
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      776ed0f0
    • A
      mm/dmapool.c: reuse devres_release() to free resources · 172cb4b3
      Andy Shevchenko 提交于
      Instead of calling an additional routine in dmam_pool_destroy() rely on
      what dmam_pool_release() is doing.
      Signed-off-by: NAndy Shevchenko <andriy.shevchenko@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      172cb4b3
    • M
      cma: increase CMA_ALIGNMENT upper limit to 12 · fe54b1fd
      Marc Carino 提交于
      Some systems require a larger maximum PAGE_SIZE order for CMA allocations.
       To accommodate such systems, increase the upper-bound of the
      CMA_ALIGNMENT range to 12 (which ends up being 16MB on systems with 4K
      pages).
      Signed-off-by: NMarc Carino <marc.ceeeee@gmail.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fe54b1fd
    • D
      swap: change swap_list_head to plist, add swap_avail_head · 18ab4d4c
      Dan Streetman 提交于
      Originally get_swap_page() started iterating through the singly-linked
      list of swap_info_structs using swap_list.next or highest_priority_index,
      which both were intended to point to the highest priority active swap
      target that was not full.  The first patch in this series changed the
      singly-linked list to a doubly-linked list, and removed the logic to start
      at the highest priority non-full entry; it starts scanning at the highest
      priority entry each time, even if the entry is full.
      
      Replace the manually ordered swap_list_head with a plist, swap_active_head.
      Add a new plist, swap_avail_head.  The original swap_active_head plist
      contains all active swap_info_structs, as before, while the new
      swap_avail_head plist contains only swap_info_structs that are active and
      available, i.e. not full.  Add a new spinlock, swap_avail_lock, to protect
      the swap_avail_head list.
      
      Mel Gorman suggested using plists since they internally handle ordering
      the list entries based on priority, which is exactly what swap was doing
      manually.  All the ordering code is now removed, and swap_info_struct
      entries and simply added to their corresponding plist and automatically
      ordered correctly.
      
      Using a new plist for available swap_info_structs simplifies and
      optimizes get_swap_page(), which no longer has to iterate over full
      swap_info_structs.  Using a new spinlock for swap_avail_head plist
      allows each swap_info_struct to add or remove themselves from the
      plist when they become full or not-full; previously they could not
      do so because the swap_info_struct->lock is held when they change
      from full<->not-full, and the swap_lock protecting the main
      swap_active_head must be ordered before any swap_info_struct->lock.
      Signed-off-by: NDan Streetman <ddstreet@ieee.org>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Cc: Shaohua Li <shli@fusionio.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
      Cc: Weijie Yang <weijieut@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Bob Liu <bob.liu@oracle.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      18ab4d4c
    • D
      lib/plist: add plist_requeue · a75f232c
      Dan Streetman 提交于
      Add plist_requeue(), which moves the specified plist_node after all other
      same-priority plist_nodes in the list.  This is essentially an optimized
      plist_del() followed by plist_add().
      
      This is needed by swap, which (with the next patch in this set) uses a
      plist of available swap devices.  When a swap device (either a swap
      partition or swap file) are added to the system with swapon(), the device
      is added to a plist, ordered by the swap device's priority.  When swap
      needs to allocate a page from one of the swap devices, it takes the page
      from the first swap device on the plist, which is the highest priority
      swap device.  The swap device is left in the plist until all its pages are
      used, and then removed from the plist when it becomes full.
      
      However, as described in man 2 swapon, swap must allocate pages from swap
      devices with the same priority in round-robin order; to do this, on each
      swap page allocation, swap uses a page from the first swap device in the
      plist, and then calls plist_requeue() to move that swap device entry to
      after any other same-priority swap devices.  The next swap page allocation
      will again use a page from the first swap device in the plist and requeue
      it, and so on, resulting in round-robin usage of equal-priority swap
      devices.
      
      Also add plist_test_requeue() test function, for use by plist_test() to
      test plist_requeue() function.
      Signed-off-by: NDan Streetman <ddstreet@ieee.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Shaohua Li <shli@fusionio.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
      Cc: Weijie Yang <weijieut@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Bob Liu <bob.liu@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a75f232c
    • D
      lib/plist: add helper functions · fd16618e
      Dan Streetman 提交于
      Add PLIST_HEAD() to plist.h, equivalent to LIST_HEAD() from list.h, to
      define and initialize a struct plist_head.
      
      Add plist_for_each_continue() and plist_for_each_entry_continue(),
      equivalent to list_for_each_continue() and list_for_each_entry_continue(),
      to iterate over a plist continuing after the current position.
      
      Add plist_prev() and plist_next(), equivalent to (struct list_head*)->prev
      and ->next, implemented by list_prev_entry() and list_next_entry(), to
      access the prev/next struct plist_node entry.  These are needed because
      unlike struct list_head, direct access of the prev/next struct plist_node
      isn't possible; the list must be navigated via the contained struct
      list_head.  e.g.  instead of accessing the prev by list_prev_entry(node,
      node_list) it can be accessed by plist_prev(node).
      Signed-off-by: NDan Streetman <ddstreet@ieee.org>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Shaohua Li <shli@fusionio.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
      Cc: Weijie Yang <weijieut@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Bob Liu <bob.liu@oracle.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fd16618e
    • D
      swap: change swap_info singly-linked list to list_head · adfab836
      Dan Streetman 提交于
      The logic controlling the singly-linked list of swap_info_struct entries
      for all active, i.e.  swapon'ed, swap targets is rather complex, because:
      
       - it stores the entries in priority order
       - there is a pointer to the highest priority entry
       - there is a pointer to the highest priority not-full entry
       - there is a highest_priority_index variable set outside the swap_lock
       - swap entries of equal priority should be used equally
      
      this complexity leads to bugs such as: https://lkml.org/lkml/2014/2/13/181
      where different priority swap targets are incorrectly used equally.
      
      That bug probably could be solved with the existing singly-linked lists,
      but I think it would only add more complexity to the already difficult to
      understand get_swap_page() swap_list iteration logic.
      
      The first patch changes from a singly-linked list to a doubly-linked list
      using list_heads; the highest_priority_index and related code are removed
      and get_swap_page() starts each iteration at the highest priority
      swap_info entry, even if it's full.  While this does introduce unnecessary
      list iteration (i.e.  Schlemiel the painter's algorithm) in the case where
      one or more of the highest priority entries are full, the iteration and
      manipulation code is much simpler and behaves correctly re: the above bug;
      and the fourth patch removes the unnecessary iteration.
      
      The second patch adds some minor plist helper functions; nothing new
      really, just functions to match existing regular list functions.  These
      are used by the next two patches.
      
      The third patch adds plist_requeue(), which is used by get_swap_page() in
      the next patch - it performs the requeueing of same-priority entries
      (which moves the entry to the end of its priority in the plist), so that
      all equal-priority swap_info_structs get used equally.
      
      The fourth patch converts the main list into a plist, and adds a new plist
      that contains only swap_info entries that are both active and not full.
      As Mel suggested using plists allows removing all the ordering code from
      swap - plists handle ordering automatically.  The list naming is also
      clarified now that there are two lists, with the original list changed
      from swap_list_head to swap_active_head and the new list named
      swap_avail_head.  A new spinlock is also added for the new list, so
      swap_info entries can be added or removed from the new list immediately as
      they become full or not full.
      
      This patch (of 4):
      
      Replace the singly-linked list tracking active, i.e.  swapon'ed,
      swap_info_struct entries with a doubly-linked list using struct
      list_heads.  Simplify the logic iterating and manipulating the list of
      entries, especially get_swap_page(), by using standard list_head
      functions, and removing the highest priority iteration logic.
      
      The change fixes the bug:
      https://lkml.org/lkml/2014/2/13/181
      in which different priority swap entries after the highest priority entry
      are incorrectly used equally in pairs.  The swap behavior is now as
      advertised, i.e. different priority swap entries are used in order, and
      equal priority swap targets are used concurrently.
      Signed-off-by: NDan Streetman <ddstreet@ieee.org>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Cc: Shaohua Li <shli@fusionio.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
      Cc: Weijie Yang <weijieut@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Bob Liu <bob.liu@oracle.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      adfab836
    • J
      mm: fold mlocked_vma_newpage() into its only call site · 7ee07a44
      Jianyu Zhan 提交于
      In previous commit(mm: use the light version __mod_zone_page_state in
      mlocked_vma_newpage()) a irq-unsafe __mod_zone_page_state is used.  And as
      suggested by Andrew, to reduce the risks that new call sites incorrectly
      using mlocked_vma_newpage() without knowing they are adding racing, this
      patch folds mlocked_vma_newpage() into its only call site,
      page_add_new_anon_rmap, to make it open-cocded for people to know what is
      going on.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: NJianyu Zhan <nasa4836@gmail.com>
      Suggested-by: NAndrew Morton <akpm@linux-foundation.org>
      Suggested-by: NHugh Dickins <hughd@google.com>
      Acked-by: NHugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7ee07a44
    • J
      mm: use the light version __mod_zone_page_state in mlocked_vma_newpage() · bea04b07
      Jianyu Zhan 提交于
      mlocked_vma_newpage() is called with pte lock held(a spinlock), which
      implies preemtion disabled, and the vm stat counter is not modified from
      interrupt context, so we need not use an irq-safe mod_zone_page_state()
      here, using a light-weight version __mod_zone_page_state() would be OK.
      
      This patch also documents __mod_zone_page_state() and some of its
      callsites.  The comment above __mod_zone_page_state() is from Hugh
      Dickins, and acked by Christoph.
      
      Most credits to Hugh and Christoph for the clarification on the usage of
      the __mod_zone_page_state().
      
      [akpm@linux-foundation.org: coding-style fixes]
      Suggested-by: NAndrew Morton <akpm@linux-foundation.org>
      Acked-by: NHugh Dickins <hughd@google.com>
      Signed-off-by: NJianyu Zhan <nasa4836@gmail.com>
      Reviewed-by: NChristoph Lameter <cl@linux.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bea04b07
    • V
      mm/compaction: avoid rescanning pageblocks in isolate_freepages · e9ade569
      Vlastimil Babka 提交于
      The compaction free scanner in isolate_freepages() currently remembers PFN
      of the highest pageblock where it successfully isolates, to be used as the
      starting pageblock for the next invocation.  The rationale behind this is
      that page migration might return free pages to the allocator when
      migration fails and we don't want to skip them if the compaction
      continues.
      
      Since migration now returns free pages back to compaction code where they
      can be reused, this is no longer a concern.  This patch changes
      isolate_freepages() so that the PFN for restarting is updated with each
      pageblock where isolation is attempted.  Using stress-highalloc from
      mmtests, this resulted in 10% reduction of the pages scanned by the free
      scanner.
      
      Note that the somewhat similar functionality that records highest
      successful pageblock in zone->compact_cached_free_pfn, remains unchanged.
      This cache is used when the whole compaction is restarted, not for
      multiple invocations of the free scanner during single compaction.
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
      Acked-by: NMichal Nazarewicz <mina86@mina86.com>
      Reviewed-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Rik van Riel <riel@redhat.com>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e9ade569