1. 16 Apr 2015 (2 commits)
  2. 11 Feb 2015 (1 commit)
  3. 21 Jan 2015 (1 commit)
  4. 10 Oct 2014 (1 commit)
    • mm: memcontrol: do not kill uncharge batching in free_pages_and_swap_cache · aabfb572
      Committed by Michal Hocko
      free_pages_and_swap_cache limits release_pages to PAGEVEC_SIZE chunks.
      This is not a big deal for the normal release path but it completely kills
      memcg uncharge batching which reduces res_counter spin_lock contention.
      Dave has noticed this with his page fault scalability test case on a large
      machine when the lock was basically dominating on all CPUs:
      
          80.18%    80.18%  [kernel]               [k] _raw_spin_lock
                        |
                        --- _raw_spin_lock
                           |
                           |--66.59%-- res_counter_uncharge_until
                           |          res_counter_uncharge
                           |          uncharge_batch
                           |          uncharge_list
                           |          mem_cgroup_uncharge_list
                           |          release_pages
                           |          free_pages_and_swap_cache
                           |          tlb_flush_mmu_free
                           |          |
                           |          |--90.12%-- unmap_single_vma
                           |          |          unmap_vmas
                           |          |          unmap_region
                           |          |          do_munmap
                           |          |          vm_munmap
                           |          |          sys_munmap
                           |          |          system_call_fastpath
                           |          |          __GI___munmap
                           |          |
                           |           --9.88%-- tlb_flush_mmu
                           |                     tlb_finish_mmu
                           |                     unmap_region
                           |                     do_munmap
                           |                     vm_munmap
                           |                     sys_munmap
                           |                     system_call_fastpath
                           |                     __GI___munmap
      
      In his case the load was running in the root memcg and that part has been
      handled by reverting 05b84301 ("mm: memcontrol: use root_mem_cgroup
      res_counter") because this is a clear regression, but the problem remains
      inside dedicated memcgs.
      
      There is no reason to limit release_pages to PAGEVEC_SIZE batches other
      than lru_lock held times.  This logic, however, can be moved inside the
      function.  mem_cgroup_uncharge_list and free_hot_cold_page_list do not
      hold any lock for the whole pages_to_free list so it is safe to call them
      in a single run.
      
      The release_pages() code was previously breaking the lru_lock every
      PAGEVEC_SIZE pages (i.e., 14 pages).  However, this code makes no use of
      pagevecs, so switch to breaking the lock at least every SWAP_CLUSTER_MAX
      (32) pages.  This means that the lock acquisition frequency is
      approximately halved and the maximum hold times are approximately doubled.
      (A simplified sketch of the batching scheme follows this entry.)

      The now unneeded batching is removed from free_pages_and_swap_cache().

      Also update the grossly out-of-date release_pages() documentation.
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reported-by: Dave Hansen <dave@sr71.net>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      aabfb572
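      A minimal, self-contained user-space C sketch of the batching idea above,
      using simplified stand-ins (struct page, a pthread mutex for lru_lock,
      uncharge_batch(), release_page_list()) rather than the kernel's types and
      APIs; it only illustrates breaking the lock every SWAP_CLUSTER_MAX pages
      while keeping a single uncharge flush for the whole list.

      #include <pthread.h>

      #define SWAP_CLUSTER_MAX 32

      struct page { struct page *next; int on_lru; };

      static pthread_mutex_t lru_lock = PTHREAD_MUTEX_INITIALIZER;

      /* Stand-in for the memcg/statistics work that benefits from batching. */
      static void uncharge_batch(unsigned long nr_pages) { (void)nr_pages; }

      /* Release a whole list, breaking lru_lock at most every SWAP_CLUSTER_MAX
       * pages instead of once per PAGEVEC_SIZE chunk. */
      static void release_page_list(struct page *list)
      {
              unsigned long nr_uncharged = 0;
              int lock_batch = 0, locked = 0;

              for (struct page *p = list; p; p = p->next) {
                      if (p->on_lru) {
                              if (!locked) {
                                      pthread_mutex_lock(&lru_lock);
                                      locked = 1;
                                      lock_batch = 0;
                              }
                              p->on_lru = 0;
                              if (++lock_batch >= SWAP_CLUSTER_MAX) {
                                      pthread_mutex_unlock(&lru_lock);
                                      locked = 0;
                              }
                      }
                      nr_uncharged++;
              }
              if (locked)
                      pthread_mutex_unlock(&lru_lock);

              uncharge_batch(nr_uncharged);   /* one flush for the whole run */
      }

      int main(void)
      {
              struct page pages[100] = { { 0 } };

              for (int i = 0; i < 99; i++) {
                      pages[i].next = &pages[i + 1];
                      pages[i].on_lru = i % 2;
              }
              release_page_list(pages);
              return 0;
      }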
  5. 09 Aug 2014 (3 commits)
    • mm: memcontrol: use page lists for uncharge batching · 747db954
      Committed by Johannes Weiner
      Pages are now uncharged at release time, and all sources of batched
      uncharges operate on lists of pages.  Directly use those lists, and
      get rid of the per-task batching state.
      
      This also batches statistics accounting, in addition to the res
      counter charges, to reduce IRQ disabling and re-enabling.  (A simplified
      sketch of list-based uncharge batching follows this entry.)
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      747db954
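      A minimal user-space C sketch of list-based uncharge batching as described
      above; struct mem_cgroup, struct page, uncharge_batch() and uncharge_list()
      are simplified stand-ins, not the kernel API.  Per-page work is folded into
      one update per run of same-memcg pages.

      #include <stdio.h>
      #include <stddef.h>

      struct mem_cgroup { unsigned long charge; };
      struct page { struct page *next; struct mem_cgroup *memcg; unsigned long nr_pages; };

      /* Flush one batch: a single "res counter" and statistics update
       * instead of one per page (IRQs would be toggled once here). */
      static void uncharge_batch(struct mem_cgroup *memcg, unsigned long nr_pages)
      {
              memcg->charge -= nr_pages;
      }

      static void uncharge_list(struct page *list)
      {
              struct mem_cgroup *batch_memcg = NULL;
              unsigned long batch_pages = 0;

              for (struct page *p = list; p; p = p->next) {
                      if (!p->memcg)
                              continue;
                      if (p->memcg != batch_memcg) {
                              if (batch_memcg)
                                      uncharge_batch(batch_memcg, batch_pages);
                              batch_memcg = p->memcg;
                              batch_pages = 0;
                      }
                      batch_pages += p->nr_pages;
                      p->memcg = NULL;        /* page no longer holds a charge */
              }
              if (batch_memcg)
                      uncharge_batch(batch_memcg, batch_pages);
      }

      int main(void)
      {
              struct mem_cgroup cg = { .charge = 513 };
              struct page b = { .next = NULL, .memcg = &cg, .nr_pages = 512 };
              struct page a = { .next = &b, .memcg = &cg, .nr_pages = 1 };

              uncharge_list(&a);
              printf("remaining charge: %lu\n", cg.charge);   /* 0 */
              return 0;
      }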
    • mm: memcontrol: rewrite uncharge API · 0a31bc97
      Committed by Johannes Weiner
      The memcg uncharging code that is involved towards the end of a page's
      lifetime - truncation, reclaim, swapout, migration - is impressively
      complicated and fragile.
      
      Because anonymous and file pages were always charged before they had their
      page->mapping established, uncharges had to happen when the page type
      could still be known from the context; as in unmap for anonymous, page
      cache removal for file and shmem pages, and swap cache truncation for swap
      pages.  However, these operations happen well before the page is actually
      freed, and so a lot of synchronization is necessary:
      
      - Charging, uncharging, page migration, and charge migration all need
        to take a per-page bit spinlock as they could race with uncharging.
      
      - Swap cache truncation happens during both swap-in and swap-out, and
        possibly repeatedly before the page is actually freed.  This means
        that the memcg swapout code is called from many contexts that make
        no sense and it has to figure out the direction from page state to
        make sure memory and memory+swap are always correctly charged.
      
      - On page migration, the old page might be unmapped but then reused,
        so memcg code has to prevent untimely uncharging in that case.
        Because this code - which should be a simple charge transfer - is so
        special-cased, it is not reusable for replace_page_cache().
      
      But now that charged pages always have a page->mapping, introduce
      mem_cgroup_uncharge(), which is called after the final put_page(), when we
      know for sure that nobody is looking at the page anymore.
      
      For page migration, introduce mem_cgroup_migrate(), which is called after
      the migration is successful and the new page is fully rmapped.  Because
      the old page is no longer uncharged after migration, prevent double
      charges by decoupling the page's memcg association (PCG_USED and
      pc->mem_cgroup) from the page holding an actual charge.  The new bits
      PCG_MEM and PCG_MEMSW represent the respective charges and are transferred
      to the new page during migration.
      
      mem_cgroup_migrate() is suitable for replace_page_cache() as well,
      which gets rid of mem_cgroup_replace_page_cache().  However, care
      needs to be taken because both the source and the target page can
      already be charged and on the LRU when fuse is splicing: grab the page
      lock on the charge moving side to prevent changing pc->mem_cgroup of a
      page under migration.  Also, the lruvecs of both pages change as we
      uncharge the old and charge the new during migration, and putback may
      race with us, so grab the lru lock and isolate the pages iff on LRU to
      prevent races and ensure the pages are on the right lruvec afterward.
      
      Swap accounting is massively simplified: because the page is no longer
      uncharged as early as swap cache deletion, a new mem_cgroup_swapout() can
      transfer the page's memory+swap charge (PCG_MEMSW) to the swap entry
      before the final put_page() in page reclaim.
      
      Finally, page_cgroup changes are now protected by whatever protection the
      page itself offers: anonymous pages are charged under the page table lock,
      whereas page cache insertions, swapin, and migration hold the page lock.
      Uncharging happens under full exclusion with no outstanding references.
      Charging and uncharging also ensure that the page is off-LRU, which
      serializes against charge migration.  Remove the very costly page_cgroup
      lock and set pc->flags non-atomically.
      
      [mhocko@suse.cz: mem_cgroup_charge_statistics needs preempt_disable]
      [vdavydov@parallels.com: fix flags definition]
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Tested-by: Jet Chen <jet.chen@intel.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Tested-by: Felipe Balbi <balbi@ti.com>
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0a31bc97
    • mm: memcontrol: rewrite charge API · 00501b53
      Committed by Johannes Weiner
      These patches rework memcg charge lifetime to integrate more naturally
      with the lifetime of user pages.  This drastically simplifies the code and
      reduces charging and uncharging overhead.  The most expensive part of
      charging and uncharging is the page_cgroup bit spinlock, which is removed
      entirely after this series.
      
      Here are the top-10 profile entries of a stress test that reads a 128G
      sparse file on a freshly booted box, without even a dedicated cgroup
      (i.e. executing in the root memcg).  Before:
      
          15.36%              cat  [kernel.kallsyms]   [k] copy_user_generic_string
          13.31%              cat  [kernel.kallsyms]   [k] memset
          11.48%              cat  [kernel.kallsyms]   [k] do_mpage_readpage
           4.23%              cat  [kernel.kallsyms]   [k] get_page_from_freelist
           2.38%              cat  [kernel.kallsyms]   [k] put_page
           2.32%              cat  [kernel.kallsyms]   [k] __mem_cgroup_commit_charge
           2.18%          kswapd0  [kernel.kallsyms]   [k] __mem_cgroup_uncharge_common
           1.92%          kswapd0  [kernel.kallsyms]   [k] shrink_page_list
           1.86%              cat  [kernel.kallsyms]   [k] __radix_tree_lookup
           1.62%              cat  [kernel.kallsyms]   [k] __pagevec_lru_add_fn
      
      After:
      
          15.67%           cat  [kernel.kallsyms]   [k] copy_user_generic_string
          13.48%           cat  [kernel.kallsyms]   [k] memset
          11.42%           cat  [kernel.kallsyms]   [k] do_mpage_readpage
           3.98%           cat  [kernel.kallsyms]   [k] get_page_from_freelist
           2.46%           cat  [kernel.kallsyms]   [k] put_page
           2.13%       kswapd0  [kernel.kallsyms]   [k] shrink_page_list
           1.88%           cat  [kernel.kallsyms]   [k] __radix_tree_lookup
           1.67%           cat  [kernel.kallsyms]   [k] __pagevec_lru_add_fn
           1.39%       kswapd0  [kernel.kallsyms]   [k] free_pcppages_bulk
           1.30%           cat  [kernel.kallsyms]   [k] kfree
      
      As you can see, the memcg footprint has shrunk quite a bit.
      
         text    data     bss     dec     hex filename
        37970    9892     400   48262    bc86 mm/memcontrol.o.old
        35239    9892     400   45531    b1db mm/memcontrol.o
      
      This patch (of 4):
      
      The memcg charge API charges pages before they are rmapped - i.e.  have an
      actual "type" - and so every callsite needs its own set of charge and
      uncharge functions to know what type is being operated on.  Worse,
      uncharge has to happen from a context that is still type-specific, rather
      than at the end of the page's lifetime with exclusive access, and so
      requires a lot of synchronization.
      
      Rewrite the charge API to provide a generic set of try_charge(),
      commit_charge() and cancel_charge() transaction operations, much like
      what's currently done for swap-in:
      
        mem_cgroup_try_charge() attempts to reserve a charge, reclaiming
        pages from the memcg if necessary.
      
        mem_cgroup_commit_charge() commits the page to the charge once it
        has a valid page->mapping and PageAnon() reliably tells the type.
      
        mem_cgroup_cancel_charge() aborts the transaction.
      
      This reduces the charge API and enables subsequent patches to
      drastically simplify uncharging.  (A simplified sketch of the
      try/commit/cancel pattern follows this entry.)

      As pages need to be committed after rmap is established but before they
      are added to the LRU, page_add_new_anon_rmap() must stop doing LRU
      additions again.  Revive lru_cache_add_active_or_unevictable().
      
      [hughd@google.com: fix shmem_unuse]
      [hughd@google.com: Add comments on the private use of -EAGAIN]
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      00501b53
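      A minimal user-space C sketch of the try/commit/cancel transaction shape
      described above; the memcg_* names, types and the trivial "limit" check are
      assumptions for illustration, not the kernel's mem_cgroup_*_charge() API.

      #include <stdbool.h>
      #include <stdio.h>

      struct mem_cgroup { long usage, limit; };
      struct page { struct mem_cgroup *memcg; };

      static bool memcg_try_charge(struct mem_cgroup *cg, long nr, struct mem_cgroup **out)
      {
              if (cg->usage + nr > cg->limit)
                      return false;           /* the kernel would try reclaim here */
              cg->usage += nr;
              *out = cg;
              return true;
      }

      static void memcg_commit_charge(struct page *page, struct mem_cgroup *cg)
      {
              page->memcg = cg;               /* page is now rmapped and has a "type" */
      }

      static void memcg_cancel_charge(struct mem_cgroup *cg, long nr)
      {
              cg->usage -= nr;                /* back out the reservation */
      }

      /* Typical call site: reserve first, bind the charge only once the page is
       * fully set up, otherwise roll back. */
      static int fault_in_page(struct page *page, struct mem_cgroup *cg, bool setup_ok)
      {
              struct mem_cgroup *charged;

              if (!memcg_try_charge(cg, 1, &charged))
                      return -1;
              if (!setup_ok) {                /* e.g. rmap/page table setup failed */
                      memcg_cancel_charge(charged, 1);
                      return -1;
              }
              memcg_commit_charge(page, charged);
              return 0;
      }

      int main(void)
      {
              struct mem_cgroup cg = { .usage = 0, .limit = 4 };
              struct page pg = { 0 };
              int ret = fault_in_page(&pg, &cg, true);

              printf("fault: %d, usage: %ld\n", ret, cg.usage);
              return 0;
      }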
  6. 07 Aug 2014 (2 commits)
    • mm: pagemap: avoid unnecessary overhead when tracepoints are deactivated · 24b7e581
      Committed by Mel Gorman
      This was formerly the series "Improve sequential read throughput" which
      noted some major differences in performance of tiobench since 3.0.
      While there are a number of factors, two that dominated were the
      introduction of the fair zone allocation policy and changes to CFQ.
      
      The behaviour of fair zone allocation policy makes more sense than
      tiobench as a benchmark and CFQ defaults were not changed due to
      insufficient benchmarking.
      
      This series is what's left.  It's one functional fix to the fair zone
      allocation policy when used on NUMA machines and a reduction of overhead
      in general.  tiobench was used for the comparison despite its flaws as
      an IO benchmark as in this case we are primarily interested in the
      overhead of page allocator and page reclaim activity.
      
      On UMA, it makes little difference to overhead
      
                3.16.0-rc3   3.16.0-rc3
                   vanilla lowercost-v5
      User          383.61      386.77
      System        403.83      401.74
      Elapsed      5411.50     5413.11
      
      On a 4-socket NUMA machine it's a bit more noticeable
      
                3.16.0-rc3   3.16.0-rc3
                   vanilla lowercost-v5
      User          746.94      802.00
      System      65336.22    40852.33
      Elapsed     27553.52    27368.46
      
      This patch (of 6):
      
      The LRU insertion and activate tracepoints take a PFN as a parameter,
      forcing the overhead onto the caller.  Move the overhead to the tracepoint
      fast-assign method to ensure the cost is only incurred when the
      tracepoint is active.  (A simplified sketch follows this entry.)
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      24b7e581
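      A small user-space C sketch of the idea above: the caller passes only the
      page, and the (potentially costly) PFN derivation happens inside the trace
      function, guarded by an "enabled" flag.  The flag and function names are
      stand-ins, not the kernel's tracepoint/static-key machinery.

      #include <stdbool.h>
      #include <stdint.h>
      #include <stdio.h>

      struct page { uintptr_t addr; };

      static bool trace_lru_insertion_enabled;        /* stand-in for the static key */

      /* Pretend this is the cost the caller used to pay unconditionally. */
      static unsigned long page_to_pfn(const struct page *page)
      {
              return (unsigned long)(page->addr >> 12);
      }

      /* Before: trace_lru_insertion(page, page_to_pfn(page), ...) at every call
       * site.  After: the PFN is computed only when the tracepoint is active. */
      static void trace_lru_insertion(const struct page *page)
      {
              if (!trace_lru_insertion_enabled)
                      return;
              printf("mm_lru_insertion pfn=%lu\n", page_to_pfn(page));
      }

      int main(void)
      {
              struct page p = { .addr = 0x1234000 };

              trace_lru_insertion(&p);                /* tracing off: no PFN math */
              trace_lru_insertion_enabled = true;
              trace_lru_insertion(&p);                /* PFN computed only now */
              return 0;
      }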
    • mm: replace init_page_accessed by __SetPageReferenced · eb39d618
      Committed by Hugh Dickins
      Do we really need an exported alias for __SetPageReferenced()?  Its
      callers had better know what they're doing, in which case the page will
      not already be marked referenced.  Kill init_page_accessed() and just use
      __SetPageReferenced() inline.  (A simplified sketch of atomic versus
      non-atomic flag setting follows this entry.)
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Prabhakar Lad <prabhakar.csengg@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      eb39d618
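      A hedged user-space analogue (C11 atomics) of the distinction above: an
      atomic SetPageReferenced-style helper is needed once the page is visible to
      other CPUs, while a __SetPageReferenced-style helper may use a plain update
      while the page is still private to its allocator.  These are illustrative
      stand-ins, not the kernel's page-flag implementation.

      #include <stdatomic.h>

      #define PG_REFERENCED (1u << 1)

      struct page { _Atomic unsigned int flags; };

      /* Atomic variant: safe even when other CPUs may update flags concurrently. */
      static void set_page_referenced(struct page *page)
      {
              atomic_fetch_or(&page->flags, PG_REFERENCED);
      }

      /* Non-atomic variant: a plain read-modify-write is enough while nobody
       * else can see the page yet. */
      static void __set_page_referenced(struct page *page)
      {
              unsigned int f = atomic_load_explicit(&page->flags, memory_order_relaxed);

              atomic_store_explicit(&page->flags, f | PG_REFERENCED, memory_order_relaxed);
      }

      int main(void)
      {
              struct page newly_allocated = { 0 };

              __set_page_referenced(&newly_allocated);  /* page not yet published */
              set_page_referenced(&newly_allocated);    /* required once it is */
              return 0;
      }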
  7. 05 Jun 2014 (9 commits)
    • mm: non-atomically mark page accessed during page cache allocation where possible · 2457aec6
      Committed by Mel Gorman
      aops->write_begin may allocate a new page and make it visible only to have
      mark_page_accessed called almost immediately after.  Once the page is
      visible the atomic operations are necessary, which is noticeable overhead
      when writing to an in-memory filesystem like tmpfs but should also be
      noticeable with fast storage.  The objective of the patch is to initialise
      the accessed information with non-atomic operations before the page is
      visible.
      
      The bulk of filesystems directly or indirectly use
      grab_cache_page_write_begin or find_or_create_page for the initial
      allocation of a page cache page.  This patch adds an init_page_accessed()
      helper which behaves like the first call to mark_page_accessed() but may
      be called before the page is visible and can be done non-atomically.

      The primary APIs of concern in this case are the following and are used
      by most filesystems.

      	find_get_page
      	find_lock_page
      	find_or_create_page
      	grab_cache_page_nowait
      	grab_cache_page_write_begin

      All of them are very similar in detail, so the patch creates a core helper
      pagecache_get_page() which takes a flags parameter that affects its
      behavior, such as whether the page should be marked accessed or not.  The
      old API is preserved but is basically a thin wrapper around this core
      function (sketched after this entry).
      
      Each of the filesystems are then updated to avoid calling
      mark_page_accessed when it is known that the VM interfaces have already
      done the job.  There is a slight snag in that the timing of the
      mark_page_accessed() has now changed, so in rare cases it's possible a page
      gets to the end of the LRU as PageReferenced whereas previously it might
      have been repromoted.  This is expected to be rare, but it's worth the
      filesystem people thinking about it in case they see a problem with the
      timing change.  It is also the case that some filesystems may now be
      marking pages accessed that previously they did not, but it makes sense
      that filesystems have consistent behaviour in this regard.
      
      The test case used to evaluate this is a simple dd of a large file done
      multiple times with the file deleted on each iteration.  The size of the
      file is 1/10th of physical memory to avoid dirty page balancing.  In the
      async case it will be possible that the workload completes without even
      hitting the disk and will have variable results but highlight the impact
      of mark_page_accessed for async IO.  The sync results are expected to be
      more stable.  The exception is tmpfs where the normal case is for the "IO"
      to not hit the disk.
      
      The test machine was single socket and UMA to avoid any scheduling or NUMA
      artifacts.  Throughput and wall times are presented for sync IO, only wall
      times are shown for async as the granularity reported by dd and the
      variability is unsuitable for comparison.  As async results were variable
      due to writeback timings, I'm only reporting the maximum figures.  The sync
      results were stable enough to make the mean and stddev uninteresting.
      
      The performance results are reported based on a run with no profiling.
      Profile data is based on a separate run with oprofile running.
      
      async dd
                                          3.15.0-rc3            3.15.0-rc3
                                             vanilla           accessed-v2
      ext3    Max      elapsed     13.9900 (  0.00%)     11.5900 ( 17.16%)
      tmpfs	Max      elapsed      0.5100 (  0.00%)      0.4900 (  3.92%)
      btrfs   Max      elapsed     12.8100 (  0.00%)     12.7800 (  0.23%)
      ext4	Max      elapsed     18.6000 (  0.00%)     13.3400 ( 28.28%)
      xfs	Max      elapsed     12.5600 (  0.00%)      2.0900 ( 83.36%)
      
      The XFS figure is a bit strange as it managed to avoid a worst case by
      sheer luck but the average figures looked reasonable.
      
              samples percentage
      ext3       86107    0.9783  vmlinux-3.15.0-rc4-vanilla        mark_page_accessed
      ext3       23833    0.2710  vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
      ext3        5036    0.0573  vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
      ext4       64566    0.8961  vmlinux-3.15.0-rc4-vanilla        mark_page_accessed
      ext4        5322    0.0713  vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
      ext4        2869    0.0384  vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
      xfs        62126    1.7675  vmlinux-3.15.0-rc4-vanilla        mark_page_accessed
      xfs         1904    0.0554  vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
      xfs          103    0.0030  vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
      btrfs      10655    0.1338  vmlinux-3.15.0-rc4-vanilla        mark_page_accessed
      btrfs       2020    0.0273  vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
      btrfs        587    0.0079  vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
      tmpfs      59562    3.2628  vmlinux-3.15.0-rc4-vanilla        mark_page_accessed
      tmpfs       1210    0.0696  vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
      tmpfs         94    0.0054  vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
      
      [akpm@linux-foundation.org: don't run init_page_accessed() against an uninitialised pointer]
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Tested-by: Prabhakar Lad <prabhakar.csengg@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2457aec6
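      A minimal user-space C sketch of the "core helper plus thin wrappers"
      structure described above.  The GET_* flag names and the one-entry "cache"
      are assumptions for illustration only; they are not the kernel's
      pagecache_get_page() flags or behaviour.

      #include <stdbool.h>
      #include <stdio.h>

      #define GET_LOCK     0x01
      #define GET_ACCESSED 0x02
      #define GET_CREAT    0x04

      struct page { bool locked, referenced; };

      static struct page cache_slot;          /* toy one-entry "page cache" */
      static bool cache_populated;

      static struct page *pagecache_get_page(unsigned long index, unsigned int flags)
      {
              struct page *page = cache_populated ? &cache_slot : NULL;

              (void)index;
              if (!page && (flags & GET_CREAT)) {
                      page = &cache_slot;
                      cache_populated = true;
                      /* brand new page: the referenced bit can be set before
                       * the page becomes visible, i.e. non-atomically */
                      if (flags & GET_ACCESSED)
                              page->referenced = true;
              } else if (page && (flags & GET_ACCESSED)) {
                      page->referenced = true;        /* mark_page_accessed() analogue */
              }
              if (page && (flags & GET_LOCK))
                      page->locked = true;
              return page;
      }

      /* The old entry points become thin wrappers around the core helper. */
      static struct page *find_get_page(unsigned long index)
      {
              return pagecache_get_page(index, 0);
      }

      static struct page *find_or_create_page(unsigned long index)
      {
              return pagecache_get_page(index, GET_LOCK | GET_ACCESSED | GET_CREAT);
      }

      int main(void)
      {
              printf("miss: %p\n", (void *)find_get_page(0));
              printf("create: %p\n", (void *)find_or_create_page(0));
              return 0;
      }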
    • mm: do not use unnecessary atomic operations when adding pages to the LRU · 6fb81a17
      Committed by Mel Gorman
      When adding pages to the LRU we clear the active bit unconditionally.
      As the page could be reachable from other paths we cannot use unlocked
      operations without risk of corruption, such as a parallel
      mark_page_accessed.  This patch tests whether it is necessary to clear the
      active flag before using an atomic operation.  This potentially opens a
      tiny race when PageActive is checked, as mark_page_accessed could be
      called after PageActive was checked.  The race already exists but this
      patch changes it slightly.  The consequence is that a page may be
      promoted to the active list that would have been left on the inactive
      list before the patch.  It's too tiny a race and too marginal a
      consequence to always use atomic operations for.  (A simplified sketch of
      the check-before-atomic-clear pattern follows this entry.)
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6fb81a17
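      A small C11 sketch of the check-before-atomic-clear pattern described
      above, with a stand-in flags word rather than the kernel's page flags.
      The cheap relaxed read avoids the atomic read-modify-write entirely when
      the active bit is already clear; the tiny test-to-clear window is the race
      the changelog accepts as harmless.

      #include <stdatomic.h>
      #include <stdio.h>

      #define PG_ACTIVE (1u << 0)

      struct page { _Atomic unsigned int flags; };

      static void clear_active_if_set(struct page *page)
      {
              if (atomic_load_explicit(&page->flags, memory_order_relaxed) & PG_ACTIVE)
                      atomic_fetch_and(&page->flags, ~PG_ACTIVE);
      }

      int main(void)
      {
              struct page p;

              atomic_init(&p.flags, PG_ACTIVE);
              clear_active_if_set(&p);        /* atomic clear happens */
              clear_active_if_set(&p);        /* bit already clear: no atomic RMW */
              printf("flags: %u\n", atomic_load(&p.flags));
              return 0;
      }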
    • mm: do not use atomic operations when releasing pages · e3741b50
      Committed by Mel Gorman
      There should be no references to the page any more and a parallel mark
      should not be reordered against us.  Use the non-locked variant to clear
      the page's active flag.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e3741b50
    • mm: page_alloc: convert hot/cold parameter and immediate callers to bool · b745bc85
      Committed by Mel Gorman
      cold is a bool, make it one.  Make the likely case the "if" part of the
      block instead of the else as according to the optimisation manual this is
      preferred.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b745bc85
    • mm: introduce compound_head_by_tail() · d2ee40ea
      Committed by Jianyu Zhan
      Currently, in put_compound_page(), we have
      
      ======
      if (likely(!PageTail(page))) {                  <------  (1)
              if (put_page_testzero(page)) {
                       /*
                        * By the time all refcounts have been released
                        * split_huge_page cannot run anymore from under us.
                        */
                       if (PageHead(page))
                               __put_compound_page(page);
                       else
                               __put_single_page(page);
               }
               return;
      }
      
      /* __split_huge_page_refcount can run under us */
      page_head = compound_head(page);        <------------ (2)
      ======
      
      If at (1) we fail the check, this means the page is *likely* a tail page.

      Then at (2), as compound_head(page) is inlined, it is:
      
      ======
      static inline struct page *compound_head(struct page *page)
      {
                if (unlikely(PageTail(page))) {           <----------- (3)
                    struct page *head = page->first_page;
      
                      smp_rmb();
                      if (likely(PageTail(page)))
                              return head;
              }
              return page;
      }
      ======
      
      Here, the unlikely() at (3) is a negative hint for this case, because the
      page is *likely* a tail page.  So the check at (3) is not good here, and a
      helper is introduced for this case.

      This patch therefore introduces compound_head_by_tail(), which deals with a
      possible tail page (though it could be split by a racing thread), and makes
      compound_head() a wrapper on it (sketched after this entry).
      
      This patch has no functional change, and it reduces the object
      size slightly:
         text    data     bss     dec     hex  filename
        11003    1328      16   12347    303b  mm/swap.o.orig
        10971    1328      16   12315    301b  mm/swap.o.patched
      
      I've run "perf top -e branch-miss" to observe branch misses in this case.
      As Michael points out, it's a slow path, so this case happens only very
      rarely.  But I grep'ed the code base and found there are still some other
      call sites that could benefit from this helper.  Given that it bloats the
      source by only 5 lines, while reducing the object size, I still believe
      this helper deserves to exist.
      Signed-off-by: Jianyu Zhan <nasa4836@gmail.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Jiang Liu <liuj97@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d2ee40ea
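      A self-contained C sketch of the split described above: a helper for
      callers that already expect a tail page, with the generic compound_head()
      kept as a wrapper.  The toy struct page and PageTail() below are stand-ins;
      the smp_rmb() of the real code is only noted in a comment.

      #include <stdbool.h>
      #include <stddef.h>

      struct page { bool tail; struct page *first_page; };

      static bool PageTail(const struct page *page) { return page->tail; }

      /* For callers that already know the page is (probably) a tail page:
       * no misleading unlikely(PageTail()) hint on their hot branch. */
      static struct page *compound_head_by_tail(struct page *page)
      {
              struct page *head = page->first_page;

              /* the kernel re-checks PageTail() after an smp_rmb() here, in case
               * a racing split turned the tail into a standalone page */
              if (PageTail(page))
                      return head;
              return page;
      }

      /* Generic helper stays a thin wrapper for callers with no prior knowledge. */
      static struct page *compound_head(struct page *page)
      {
              if (PageTail(page))
                      return compound_head_by_tail(page);
              return page;
      }

      int main(void)
      {
              struct page head = { .tail = false, .first_page = NULL };
              struct page tail = { .tail = true,  .first_page = &head };

              return compound_head(&tail) == &head ? 0 : 1;
      }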
    • mm/swap.c: split put_compound_page() · 4bd3e8f7
      Committed by Jianyu Zhan
      Currently, put_compound_page() carefully handles tricky cases to avoid
      racing with compound page releasing or splitting, which makes it quite
      lengthy (about 200+ lines) and deeply indented, and therefore quite hard
      to follow and maintain.

      Now, based on the two helpers introduced in the previous patch ("mm/swap.c:
      introduce put_[un]refcounted_compound_page helpers for splitting
      put_compound_page"), this patch replaces those two lengthy code paths with
      these two helpers, respectively.  It also rephrases some comments.

      After this patch, put_compound_page() is very compact and thus easy to
      read and maintain.

      After the split, the object file is the same size as the original one.
      In fact, I've diffed the original and patched disassembly of
      put_compound_page(), and they are 100% the same!

      This shows that the split has no functional change, but it improves
      readability.

      This patch and the previous one grow the code by 32 lines, mostly due to
      comments.
      Signed-off-by: Jianyu Zhan <nasa4836@gmail.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Jiang Liu <liuj97@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4bd3e8f7
    • mm/swap.c: introduce put_[un]refcounted_compound_page helpers for splitting put_compound_page() · c747ce79
      Committed by Jianyu Zhan
      Currently, put_compound_page() carefully handles tricky cases to avoid
      racing with compound page releasing or splitting, which makes it quite
      lengthy (about 200+ lines) and deeply indented, and therefore quite hard
      to follow and maintain.
      
      This patch and the next patch refactor this function.
      
      Based on the code skeleton of put_compound_page:
      
      put_compound_page:
              if !PageTail(page)
              	put head page fastpath;
      		return;
      
              /* else PageTail */
              page_head = compound_head(page)
              if !__compound_tail_refcounted(page_head)
      		put head page optimal path; <---(1)
      		return;
              else
      		put head page slowpath; <--- (2)
                      return;
      
      This patch introduces two helpers, put_[un]refcounted_compound_page,
      handling code path (1) and code path (2), respectively.  They are both
      tagged __always_inline, thus eliminating function call overhead and making
      them operate the same way as before.

      They are copied almost verbatim (except in one place, where a "goto
      out_put_single" is expanded), with some comments rephrased.
      Signed-off-by: Jianyu Zhan <nasa4836@gmail.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Jiang Liu <liuj97@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c747ce79
    • mm: replace __get_cpu_var uses with this_cpu_ptr · 7c8e0181
      Committed by Christoph Lameter
      Replace places where __get_cpu_var() is used for an address calculation
      with this_cpu_ptr().
      Signed-off-by: Christoph Lameter <cl@linux.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7c8e0181
    • mm/swap.c: clean up *lru_cache_add* functions · 2329d375
      Committed by Jianyu Zhan
      In mm/swap.c, __lru_cache_add() is exported, but there are actually no
      users outside this file.

      This patch unexports __lru_cache_add() and makes it static.  It also
      exports lru_cache_add_file(), as it is used by cifs and fuse, which can be
      loaded as modules.
      Signed-off-by: Jianyu Zhan <nasa4836@gmail.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Bob Liu <bob.liu@oracle.com>
      Cc: Seth Jennings <sjenning@linux.vnet.ibm.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Rafael Aquini <aquini@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2329d375
  8. 04 Apr 2014 (2 commits)
    • mm: thrash detection-based file cache sizing · a528910e
      Committed by Johannes Weiner
      The VM maintains cached filesystem pages on two types of lists.  One
      list holds the pages recently faulted into the cache, the other list
      holds pages that have been referenced repeatedly on that first list.
      The idea is to prefer reclaiming young pages over those that have shown
      to benefit from caching in the past.  We call the recently used list the
      "inactive list" and the frequently used list the "active list".  The
      existing scheme for balancing the two lists, however, ultimately was not
      significantly better than a FIFO policy and still thrashed the cache based
      on eviction speed, rather than actual demand for cache.
      
      This patch solves one half of the problem by decoupling the ability to
      detect working set changes from the inactive list size.  By maintaining
      a history of recently evicted file pages it can detect frequently used
      pages with an arbitrarily small inactive list size, and subsequently
      apply pressure on the active list based on actual demand for cache, not
      just overall eviction speed.
      
      Every zone maintains a counter that tracks inactive list aging speed.
      When a page is evicted, a snapshot of this counter is stored in the
      now-empty page cache radix tree slot.  On refault, the minimum access
      distance of the page can be assessed, to evaluate whether the page
      should be part of the active list or not.  (A toy model of this scheme
      follows this entry.)

      This fixes the VM's blindness towards working set changes in excess of
      the inactive list.  And it's the foundation for further improving the
      protection ability and reducing the current 50% minimum inactive list
      size.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Reviewed-by: Minchan Kim <minchan@kernel.org>
      Reviewed-by: Bob Liu <bob.liu@oracle.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Luigi Semenzato <semenzato@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Metin Doslu <metin@citusdata.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Ozgun Erdogan <ozgun@citusdata.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Roman Gushchin <klamm@yandex-team.ru>
      Cc: Ryan Mallon <rmallon@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a528910e
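      A toy C model of the scheme described above, under the assumption of a
      single global "inactive_age" counter and made-up helper names: the counter
      value at eviction time is remembered as a shadow entry, and on refault the
      distance between then and now decides whether the page is activated.

      #include <stdbool.h>
      #include <stdio.h>

      static unsigned long inactive_age;      /* stand-in for the per-zone counter */

      /* Eviction: bump the counter and remember the snapshot in the page's old
       * cache slot (the "shadow entry" stored in the radix tree). */
      static unsigned long remember_eviction(void)
      {
              return ++inactive_age;
      }

      /* Refault: the page was asked for again after refault_distance further
       * evictions; if that distance fits within the inactive list, a slightly
       * bigger list would have kept it cached, so treat it as working set. */
      static bool refault_should_activate(unsigned long shadow, unsigned long inactive_size)
      {
              unsigned long refault_distance = inactive_age - shadow;

              return refault_distance <= inactive_size;
      }

      int main(void)
      {
              unsigned long shadow = remember_eviction();

              for (int i = 0; i < 10; i++)
                      remember_eviction();    /* other pages evicted meanwhile */
              printf("activate: %d\n", refault_should_activate(shadow, 32) ? 1 : 0);
              return 0;
      }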
    • mm + fs: prepare for non-page entries in page cache radix trees · 0cd6144a
      Committed by Johannes Weiner
      shmem mappings already contain exceptional entries where swap slot
      information is remembered.
      
      To be able to store eviction information for regular page cache, prepare
      every site dealing with the radix trees directly to handle entries other
      than pages.
      
      The common lookup functions will filter out non-page entries and return
      NULL for page cache holes, just as before.  But provide a raw version of
      the API which returns non-page entries as well, and switch shmem over to
      use it.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Reviewed-by: Minchan Kim <minchan@kernel.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Bob Liu <bob.liu@oracle.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Luigi Semenzato <semenzato@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Metin Doslu <metin@citusdata.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Ozgun Erdogan <ozgun@citusdata.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Roman Gushchin <klamm@yandex-team.ru>
      Cc: Ryan Mallon <rmallon@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0cd6144a
  9. 04 Mar 2014 (1 commit)
    • mm: close PageTail race · 668f9abb
      Committed by David Rientjes
      Commit bf6bddf1 ("mm: introduce compaction and migration for
      ballooned pages") introduces page_count(page) into memory compaction
      which dereferences page->first_page if PageTail(page).
      
      This results in a very rare NULL pointer dereference on the
      aforementioned page_count(page).  Indeed, anything that does
      compound_head(), including page_count() is susceptible to racing with
      prep_compound_page() and seeing a NULL or dangling page->first_page
      pointer.
      
      This patch uses Andrea's implementation of compound_trans_head() that
      deals with such a race and makes it the default compound_head()
      implementation.  This includes a read memory barrier that ensures that
      if PageTail(head) is true that we return a head page that is neither
      NULL nor dangling.  The patch then adds a store memory barrier to
      prep_compound_page() to ensure page->first_page is set.
      
      This is the safest way to ensure we see the head page that we are
      expecting; PageTail(page) is already in the unlikely() path and the
      memory barriers are unfortunately required.  (A hedged user-space analogue
      of the barrier pairing follows this entry.)

      Hugetlbfs is the exception: we don't enforce a store memory barrier
      during init since no race is possible.
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Holger Kiehl <Holger.Kiehl@dwd.de>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Rafael Aquini <aquini@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      668f9abb
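      A hedged user-space analogue of the barrier pairing described above, using
      C11 release/acquire atomics instead of the kernel's smp_wmb()/smp_rmb();
      the types and names are stand-ins.  The writer publishes first_page before
      the tail flag, so a reader that observes the flag also sees a valid head
      pointer, which is the property the fix relies on.

      #include <stdatomic.h>

      struct page {
              _Atomic int tail;                       /* stand-in for PageTail() */
              struct page *_Atomic first_page;
      };

      /* Writer side (prep_compound_page analogue). */
      static void prep_tail(struct page *tail, struct page *head)
      {
              atomic_store_explicit(&tail->first_page, head, memory_order_relaxed);
              atomic_store_explicit(&tail->tail, 1, memory_order_release);
      }

      /* Reader side (compound_head analogue): if the tail flag is seen, the
       * release/acquire pairing guarantees first_page is neither NULL nor
       * dangling. */
      static struct page *compound_head(struct page *page)
      {
              if (atomic_load_explicit(&page->tail, memory_order_acquire))
                      return atomic_load_explicit(&page->first_page, memory_order_relaxed);
              return page;
      }

      int main(void)
      {
              struct page head = { 0 }, tail = { 0 };

              prep_tail(&tail, &head);
              return compound_head(&tail) == &head ? 0 : 1;
      }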
  10. 24 Jan 2014 (1 commit)
  11. 22 Jan 2014 (4 commits)
    • mm/swap.c: reorganize put_compound_page() · 26296ad2
      Committed by Andrew Morton
      Tweak it to save a tab stop and make the code layout slightly less nutty.
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: Pravin Shelar <pshelar@nicira.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Ben Hutchings <bhutchings@solarflare.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      26296ad2
    • mm: hugetlbfs: use __compound_tail_refcounted in __get_page_tail too · 3bfcd13e
      Committed by Andrea Arcangeli
      Also remove hugetlb.h which isn't needed anymore as PageHeadHuge is
      handled in mm.h.
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: Pravin Shelar <pshelar@nicira.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Ben Hutchings <bhutchings@solarflare.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3bfcd13e
    • mm: tail page refcounting optimization for slab and hugetlbfs · 44518d2b
      Committed by Andrea Arcangeli
      This skips the _mapcount mangling for slab and hugetlbfs pages.
      
      The main trouble in doing this is to guarantee that PageSlab and
      PageHeadHuge remains constant for all get_page/put_page run on the tail
      of slab or hugetlbfs compound pages.  Otherwise if they're set during
      get_page but not set during put_page, the _mapcount of the tail page
      would underflow.
      
      PageHeadHuge will remain true until the compound page is released and
      enters the buddy allocator, so it won't risk changing even if the tail
      page is the last reference left on the page.
      
      PG_slab instead is cleared before the slab frees the head page with
      put_page, so if the tail pin is released after the slab freed the page,
      we would have a problem.  But in the slab case the tail pin cannot be
      the last reference left on the page.  This is because the slab code is
      free to reuse the compound page after a kfree/kmem_cache_free without
      having to check if there's any tail pin left.  In turn all tail pins
      must be always released while the head is still pinned by the slab code
      and so we know PG_slab will be still set too.
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: Pravin Shelar <pshelar@nicira.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Ben Hutchings <bhutchings@solarflare.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      44518d2b
    • mm: hugetlbfs: move the put/get_page slab and hugetlbfs optimization in a faster path · ebf360f9
      Committed by Andrea Arcangeli
      We don't actually need a reference on the head page in the slab and
      hugetlbfs paths, as long as we add a smp_rmb() which should be faster
      than get_page_unless_zero.
      
      [akpm@linux-foundation.org: fix typo in comment]
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: Pravin Shelar <pshelar@nicira.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Ben Hutchings <bhutchings@solarflare.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ebf360f9
  12. 22 Nov 2013 (1 commit)
    • mm: hugetlbfs: fix hugetlbfs optimization · 27c73ae7
      Committed by Andrea Arcangeli
      Commit 7cb2ef56 ("mm: fix aio performance regression for database
      caused by THP") can cause dereference of a dangling pointer if
      split_huge_page runs during PageHuge() if there are updates to the
      tail_page->private field.
      
      Also it is repeating compound_head twice for hugetlbfs and it is running
      compound_head+compound_trans_head for THP when a single one is needed in
      both cases.
      
      The new code within the PageSlab() check doesn't need to verify that the
      THP page size is never bigger than the smallest hugetlbfs page size, to
      avoid memory corruption.
      
      A longstanding theoretical race condition was found while fixing the
      above (see the change right after the skip_unlock label, that is
      relevant for the compound_lock path too).
      
      By re-establishing the _mapcount tail refcounting for all compound
      pages, this also fixes the below problem:
      
        echo 0 >/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
      
        BUG: Bad page state in process bash  pfn:59a01
        page:ffffea000139b038 count:0 mapcount:10 mapping:          (null) index:0x0
        page flags: 0x1c00000000008000(tail)
        Modules linked in:
        CPU: 6 PID: 2018 Comm: bash Not tainted 3.12.0+ #25
        Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
        Call Trace:
          dump_stack+0x55/0x76
          bad_page+0xd5/0x130
          free_pages_prepare+0x213/0x280
          __free_pages+0x36/0x80
          update_and_free_page+0xc1/0xd0
          free_pool_huge_page+0xc2/0xe0
          set_max_huge_pages.part.58+0x14c/0x220
          nr_hugepages_store_common.isra.60+0xd0/0xf0
          nr_hugepages_store+0x13/0x20
          kobj_attr_store+0xf/0x20
          sysfs_write_file+0x189/0x1e0
          vfs_write+0xc5/0x1f0
          SyS_write+0x55/0xb0
          system_call_fastpath+0x16/0x1b
      Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Tested-by: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: Pravin Shelar <pshelar@nicira.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Ben Hutchings <bhutchings@solarflare.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      27c73ae7
  13. 08 Nov 2013 (1 commit)
  14. 13 Sep 2013 (1 commit)
  15. 12 Sep 2013 (1 commit)
    • mm: fix aio performance regression for database caused by THP · 7cb2ef56
      Committed by Khalid Aziz
      I am working with a tool that simulates oracle database I/O workload.
      This tool (orion to be specific -
      <http://docs.oracle.com/cd/E11882_01/server.112/e16638/iodesign.htm#autoId24>)
      allocates hugetlbfs pages using shmget() with SHM_HUGETLB flag.  It then
      does aio into these pages from flash disks using various common block
      sizes used by database.  I am looking at performance with two of the most
      common block sizes - 1M and 64K.  aio performance with these two block
      sizes plunged after Transparent HugePages was introduced in the kernel.
      Here are performance numbers:
      
      		pre-THP		2.6.39		3.11-rc5
      1M read		8384 MB/s	5629 MB/s	6501 MB/s
      64K read	7867 MB/s	4576 MB/s	4251 MB/s
      
      I have narrowed the performance impact down to the overheads introduced by
      THP in __get_page_tail() and put_compound_page() routines.  perf top shows
      >40% of cycles being spent in these two routines.  Every time direct I/O
      to hugetlbfs pages starts, kernel calls get_page() to grab a reference to
      the pages and calls put_page() when I/O completes to put the reference
      away.  THP introduced significant amount of locking overhead to get_page()
      and put_page() when dealing with compound pages because hugepages can be
      split underneath get_page() and put_page().  It added this overhead
      irrespective of whether it is dealing with hugetlbfs pages or transparent
      hugepages.  This resulted in 20%-45% drop in aio performance when using
      hugetlbfs pages.
      
      Since hugetlbfs pages can not be split, there is no reason to go through
      all the locking overhead for these pages from what I can see.  I added
      code to __get_page_tail() and put_compound_page() to bypass all the
      locking code when working with hugetlbfs pages.  This improved performance
      significantly.  Performance numbers with this patch:
      
      		pre-THP		3.11-rc5	3.11-rc5 + Patch
      1M read		8384 MB/s	6501 MB/s	8371 MB/s
      64K read	7867 MB/s	4251 MB/s	6510 MB/s
      
      Performance with 64K read is still lower than what it was before THP, but
      still a 53% improvement.  It does mean there is more work to be done but I
      will take a 53% improvement for now.
      
      Please take a look at the following patch and let me know if it looks
      reasonable.
      
      [akpm@linux-foundation.org: tweak comments]
      Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: Pravin B Shelar <pshelar@nicira.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7cb2ef56
  16. 01 Aug 2013 (2 commits)
    • thp, mm: avoid PageUnevictable on active/inactive lru lists · e180cf80
      Committed by Kirill A. Shutemov
      Active/inactive lru lists can contain unevictable pages (i.e. ramfs pages
      that have been placed on the LRU lists when first allocated), but these
      pages must not have PageUnevictable set - otherwise shrink_[in]active_list
      goes crazy:
      
      kernel BUG at /home/space/kas/git/public/linux-next/mm/vmscan.c:1122!
      
      1090 static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
      1091                 struct lruvec *lruvec, struct list_head *dst,
      1092                 unsigned long *nr_scanned, struct scan_control *sc,
      1093                 isolate_mode_t mode, enum lru_list lru)
      1094 {
      ...
      1108                 switch (__isolate_lru_page(page, mode)) {
      1109                 case 0:
      ...
      1116                 case -EBUSY:
      ...
      1121                 default:
      1122                         BUG();
      1123                 }
      1124         }
      ...
      1130 }
      
      __isolate_lru_page() returns EINVAL for PageUnevictable(page).
      
      For lru_add_page_tail(), it means we should not set PageUnevictable()
      for tail pages unless we're sure that it will go to LRU_UNEVICTABLE.
      Let's just copy PG_active and PG_unevictable from head page in
      __split_huge_page_refcount(), it will simplify lru_add_page_tail().
      
      This will fix one more bug in lru_add_page_tail(): if
      page_evictable(page_tail) is false and PageLRU(page) is true, page_tail
      will go to the same lru as page, but nobody cares to sync page_tail
      active/inactive state with page.  So we can end up with inactive page on
      active lru.  The patch will fix it as well since we copy PG_active from
      head page.
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e180cf80
    • mm/swap.c: clear PageActive before adding pages onto unevictable list · ef2a2cbd
      Committed by Naoya Horiguchi
      As a result of commit 13f7f789 ("mm: pagevec: defer deciding which
      LRU to add a page to until pagevec drain time"), pages on unevictable
      lists can have both of PageActive and PageUnevictable set.  This is not
      only confusing, but also corrupts page migration and
      shrink_[in]active_list.
      
      This patch fixes the problem by adding ClearPageActive before adding
      pages onto the unevictable list.  It also cleans up VM_BUG_ONs.
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ef2a2cbd
  17. 04 Jul 2013 (5 commits)
    • mm: remove lru parameter from __lru_cache_add and lru_cache_add_lru · c53954a0
      Committed by Mel Gorman
      Similar to __pagevec_lru_add, this patch removes the LRU parameter from
      __lru_cache_add and lru_cache_add_lru, as the caller does not control the
      exact LRU the page gets added to.  lru_cache_add_lru gets renamed to
      lru_cache_add, as the name is silly without the lru parameter.  With the
      parameter removed, the caller is required to indicate whether it wants
      the page added to the active or inactive list by setting or clearing
      PageActive respectively.
      
      [akpm@linux-foundation.org: Suggested the patch]
      [gang.chen@asianux.com: fix used-unintialized warning]
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Signed-off-by: NChen Gang <gang.chen@asianux.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Alexey Lyahkov <alexey.lyashkov@gmail.com>
      Cc: Andrew Perepechko <anserper@ya.ru>
      Cc: Robin Dong <sanbai@taobao.com>
      Cc: Theodore Tso <tytso@mit.edu>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Bernd Schubert <bernd.schubert@fastmail.fm>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c53954a0
    • M
      mm: remove lru parameter from __pagevec_lru_add and remove parts of pagevec API · a0b8cab3
      Mel Gorman authored
      Now that the LRU to add a page to is decided at LRU-add time, remove the
      misleading lru parameter from __pagevec_lru_add.  A consequence of this
      is that the pagevec_lru_add_file, pagevec_lru_add_anon and similar
      helpers are misleading as the caller no longer has direct control over
      what LRU the page is added to.  Unused helpers are removed by this patch
      and existing users of pagevec_lru_add_file() are converted to use
      lru_cache_add_file() directly and use the per-cpu pagevecs instead of
      creating their own pagevec.
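
      A hedged before/after sketch of such a conversion (the caller shown is
      generic, not one of the call sites actually touched by the patch):

              /* Before: the caller batched pages onto a private pagevec. */
              struct pagevec lru_pvec;

              pagevec_init(&lru_pvec, 0);
              ...
              if (!pagevec_add(&lru_pvec, page))
                      pagevec_lru_add_file(&lru_pvec);
              ...
              pagevec_lru_add_file(&lru_pvec);        /* drain the remainder */

              /* After: rely on the shared per-cpu pagevecs instead. */
              lru_cache_add_file(page);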
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Alexey Lyahkov <alexey.lyashkov@gmail.com>
      Cc: Andrew Perepechko <anserper@ya.ru>
      Cc: Robin Dong <sanbai@taobao.com>
      Cc: Theodore Tso <tytso@mit.edu>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Bernd Schubert <bernd.schubert@fastmail.fm>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a0b8cab3
    • M
      mm: activate !PageLRU pages on mark_page_accessed if page is on local pagevec · 059285a2
      Mel Gorman authored
      If a page is on a pagevec then it is !PageLRU and mark_page_accessed()
      may fail to move a page to the active list as expected.  Now that the
      LRU is selected at LRU drain time, mark pages PageActive if they are on
      the local pagevec so they get moved to the correct list at LRU drain
      time.  Using a debugging patch it was found that, for a simple git
      checkout based workload, pages were never added to the active file
      list in practice, but with this patch applied they are.
      
      				before   after
      LRU Add Active File                  0      750583
      LRU Add Active Anon            2640587     2702818
      LRU Add Inactive File          8833662     8068353
      LRU Add Inactive Anon              207         200
      
      Note that only pages on the local pagevec are considered on purpose.  A
      !PageLRU page could be in the process of being released, reclaimed,
      migrated or on a remote pagevec that is currently being drained.
      Marking it PageActive is vulnerable to races where PageLRU and Active
      bits are checked at the wrong time.  Page reclaim will trigger
      VM_BUG_ONs but depending on when the race hits, it could also free a
      PageActive page to the page allocator and trigger a bad_page warning.
      Similarly a potential race exists between a per-cpu drain on a pagevec
      list and an activation on a remote CPU.
      
      				lru_add_drain_cpu
      				__pagevec_lru_add
      				  lru = page_lru(page);
      mark_page_accessed
        if (PageLRU(page))
          activate_page
        else
          SetPageActive
      				  SetPageLRU(page);
      				  add_page_to_lru_list(page, lruvec, lru);
      
      In this case a PageActive page is added to the inactive list and later
      the inactive/active stats will get skewed.  While the PageActive checks
      in vmscan could be removed and potentially dealt with, a skew in the
      statistics would be very difficult to detect.  Hence this patch deals
      just with the common case where a page being marked accessed has just
      been added to the local pagevec.
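
      A condensed sketch of the local-pagevec search (modelled on the helper
      this patch introduces; the helper and per-cpu pagevec names are
      assumptions here):

              static void __lru_cache_activate_page(struct page *page)
              {
                      struct pagevec *pvec = &get_cpu_var(lru_add_pvec);
                      int i;

                      /*
                       * Search backwards: if the page is on this CPU's
                       * pagevec at all, it was almost certainly just added.
                       */
                      for (i = pagevec_count(pvec) - 1; i >= 0; i--) {
                              if (pvec->pages[i] == page) {
                                      SetPageActive(page);
                                      break;
                              }
                      }

                      put_cpu_var(lru_add_pvec);
              }

      mark_page_accessed() calls this helper instead of activate_page() when
      it sees a referenced page that is not (yet) PageLRU.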
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Alexey Lyahkov <alexey.lyashkov@gmail.com>
      Cc: Andrew Perepechko <anserper@ya.ru>
      Cc: Robin Dong <sanbai@taobao.com>
      Cc: Theodore Tso <tytso@mit.edu>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Bernd Schubert <bernd.schubert@fastmail.fm>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      059285a2
    • M
      mm: pagevec: defer deciding which LRU to add a page to until pagevec drain time · 13f7f789
      Mel Gorman authored
      mark_page_accessed() cannot activate an inactive page that is located on
      an inactive LRU pagevec.  Hints from filesystems may be ignored as a
      result.  In preparation for fixing that problem, this patch removes the
      per-LRU pagevecs and leaves just one pagevec.  The decision about which
      LRU the page is finally added to is deferred until the pagevec is drained.
      
      This means that fewer pagevecs are available and potentially there is
      greater contention on the LRU lock.  However, this only applies in the
      case where there is an almost perfect mix of file, anon, active and
      inactive pages being added to the LRU.  In practice I expect that we are
      adding a stream of pages of a particular type and that the changes in
      contention will barely be measurable.
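
      A minimal sketch of the drain-time decision (modelled on the
      __pagevec_lru_add_fn() callback; statistics updates and debug checks are
      omitted, and the names are assumptions rather than quotes from the
      patch):

              static void __pagevec_lru_add_fn(struct page *page,
                                               struct lruvec *lruvec, void *arg)
              {
                      /*
                       * page_lru() derives the target list from PG_active,
                       * PG_unevictable and whether the page is file-backed.
                       */
                      enum lru_list lru = page_lru(page);

                      SetPageLRU(page);
                      add_page_to_lru_list(page, lruvec, lru);
              }

      Because the list is computed from page flags here, callers only need to
      get PG_active right before handing the page to the pagevec.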
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Alexey Lyahkov <alexey.lyashkov@gmail.com>
      Cc: Andrew Perepechko <anserper@ya.ru>
      Cc: Robin Dong <sanbai@taobao.com>
      Cc: Theodore Tso <tytso@mit.edu>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Bernd Schubert <bernd.schubert@fastmail.fm>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      13f7f789
    • M
      mm: add tracepoints for LRU activation and insertions · c6286c98
      Mel Gorman authored
      Andrew Perepechko reported a problem whereby pages are being prematurely
      evicted as the mark_page_accessed() hint is ignored for pages that are
      currently on a pagevec --
      http://www.spinics.net/lists/linux-ext4/msg37340.html .
      
      Alexey Lyahkov and Robin Dong have also reported problems recently that
      could be due to hot pages reaching the end of the inactive list too
      quickly and being reclaimed.
      
      Rather than addressing this on a per-filesystem basis, this series aims
      to fix the mark_page_accessed() interface by deferring what LRU a page
      is added to until pagevec drain time and allowing mark_page_accessed() to call
      SetPageActive on a pagevec page.
      
      Patch 1 adds two tracepoints for LRU page activation and insertion. Using
      	these tracepoints it's possible to build a model of pages in the
      	LRU that can be processed offline.
      
      Patch 2 defers making the decision on what LRU to add a page to until
      	the pagevec is drained.
      
      Patch 3 searches the local pagevec for pages to mark PageActive on
      	mark_page_accessed. The changelog explains why only the local
      	pagevec is examined.
      
      Patches 4 and 5 tidy up the API.
      
      postmark, a dd-based test, and fs-mark in both single and threaded mode were
      run but none of them showed any performance degradation or gain as a
      result of the patch.
      
      Using patch 1, I built a *very* basic model of the LRU to examine
      offline what the average ages of different page types on the LRU were in
      milliseconds.  Of course, capturing the trace distorts the test as it's
      written to local disk, but that does not matter for the purposes of this
      test.  The average ages of pages in milliseconds were
      
      				    vanilla deferdrain
      Average age mapped anon:               1454       1250
      Average age mapped file:             127841     155552
      Average age unmapped anon:               85        235
      Average age unmapped file:            73633      38884
      Average age unmapped buffers:         74054     116155
      
      The LRU activity was mostly file pages, which you'd expect for a dd-based
      workload.  Note that the average age of buffer pages is increased by the
      series; this is expected because the buffer pages are now getting added
      to the active list when drained from the pagevecs.  The average age of
      unmapped file pages is decreased as they are still added to the inactive
      list and are reclaimed before the buffers.
      
      There is no guarantee this is a universal win for all workloads and it
      would be nice if the filesystem people gave some thought as to whether
      this decision is generally a win or a loss.
      
      This patch:
      
      Using these tracepoints it is possible to model LRU activity and the
      average residency of pages of different types.  This can be used to
      debug problems related to premature reclaim of pages of particular
      types.
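
      A sketch of where the two hooks conceptually sit; the tracepoint and
      function names reflect one reading of the patch and should be treated as
      assumptions, as should the argument lists:

              /* __pagevec_lru_add_fn(): a page has just been placed on an LRU */
              SetPageLRU(page);
              add_page_to_lru_list(page, lruvec, lru);
              trace_mm_lru_insertion(page, lru);

              /* __activate_page(): a page is promoted from inactive to active
               * (lru here is the base, inactive list) */
              SetPageActive(page);
              add_page_to_lru_list(page, lruvec, lru + LRU_ACTIVE);
              trace_mm_lru_activate(page);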
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Alexey Lyahkov <alexey.lyashkov@gmail.com>
      Cc: Andrew Perepechko <anserper@ya.ru>
      Cc: Robin Dong <sanbai@taobao.com>
      Cc: Theodore Tso <tytso@mit.edu>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Bernd Schubert <bernd.schubert@fastmail.fm>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c6286c98
  18. 08 May 2013, 1 commit
  19. 30 Apr 2013, 1 commit