1. 23 10月, 2017 1 次提交
  2. 07 9月, 2017 7 次提交
    • D
      mm: add /proc/pid/smaps_rollup · 493b0e9d
      Daniel Colascione 提交于
      /proc/pid/smaps_rollup is a new proc file that improves the performance
      of user programs that determine aggregate memory statistics (e.g., total
      PSS) of a process.
      
      Android regularly "samples" the memory usage of various processes in
      order to balance its memory pool sizes.  This sampling process involves
      opening /proc/pid/smaps and summing certain fields.  For very large
      processes, sampling memory use this way can take several hundred
      milliseconds, due mostly to the overhead of the seq_printf calls in
      task_mmu.c.
      
      smaps_rollup improves the situation.  It contains most of the fields of
      /proc/pid/smaps, but instead of a set of fields for each VMA,
      smaps_rollup instead contains one synthetic smaps-format entry
      representing the whole process.  In the single smaps_rollup synthetic
      entry, each field is the summation of the corresponding field in all of
      the real-smaps VMAs.  Using a common format for smaps_rollup and smaps
      allows userspace parsers to repurpose parsers meant for use with
      non-rollup smaps for smaps_rollup, and it allows userspace to switch
      between smaps_rollup and smaps at runtime (say, based on the
      availability of smaps_rollup in a given kernel) with minimal fuss.
      
      By using smaps_rollup instead of smaps, a caller can avoid the
      significant overhead of formatting, reading, and parsing each of a large
      process's potentially very numerous memory mappings.  For sampling
      system_server's PSS in Android, we measured a 12x speedup, representing
      a savings of several hundred milliseconds.
      
      One alternative to a new per-process proc file would have been including
      PSS information in /proc/pid/status.  We considered this option but
      thought that PSS would be too expensive (by a few orders of magnitude)
      to collect relative to what's already emitted as part of
      /proc/pid/status, and slowing every user of /proc/pid/status for the
      sake of readers that happen to want PSS feels wrong.
      
      The code itself works by reusing the existing VMA-walking framework we
      use for regular smaps generation and keeping the mem_size_stats
      structure around between VMA walks instead of using a fresh one for each
      VMA.  In this way, summation happens automatically.  We let seq_file
      walk over the VMAs just as it does for regular smaps and just emit
      nothing to the seq_file until we hit the last VMA.
      
      Benchmarks:
      
          using smaps:
          iterations:1000 pid:1163 pss:220023808
          0m29.46s real 0m08.28s user 0m20.98s system
      
          using smaps_rollup:
          iterations:1000 pid:1163 pss:220702720
          0m04.39s real 0m00.03s user 0m04.31s system
      
      We're using the PSS samples we collect asynchronously for
      system-management tasks like fine-tuning oom_adj_score, memory use
      tracking for debugging, application-level memory-use attribution, and
      deciding whether we want to kill large processes during system idle
      maintenance windows.  Android has been using PSS for these purposes for
      a long time; as the average process VMA count has increased and and
      devices become more efficiency-conscious, PSS-collection inefficiency
      has started to matter more.  IMHO, it'd be a lot safer to optimize the
      existing PSS-collection model, which has been fine-tuned over the years,
      instead of changing the memory tracking approach entirely to work around
      smaps-generation inefficiency.
      
      Tim said:
      
      : There are two main reasons why Android gathers PSS information:
      :
      : 1. Android devices can show the user the amount of memory used per
      :    application via the settings app.  This is a less important use case.
      :
      : 2. We log PSS to help identify leaks in applications.  We have found
      :    an enormous number of bugs (in the Android platform, in Google's own
      :    apps, and in third-party applications) using this data.
      :
      : To do this, system_server (the main process in Android userspace) will
      : sample the PSS of a process three seconds after it changes state (for
      : example, app is launched and becomes the foreground application) and about
      : every ten minutes after that.  The net result is that PSS collection is
      : regularly running on at least one process in the system (usually a few
      : times a minute while the screen is on, less when screen is off due to
      : suspend).  PSS of a process is an incredibly useful stat to track, and we
      : aren't going to get rid of it.  We've looked at some very hacky approaches
      : using RSS ("take the RSS of the target process, subtract the RSS of the
      : zygote process that is the parent of all Android apps") to reduce the
      : accounting time, but it regularly overestimated the memory used by 20+
      : percent.  Accordingly, I don't think that there's a good alternative to
      : using PSS.
      :
      : We started looking into PSS collection performance after we noticed random
      : frequency spikes while a phone's screen was off; occasionally, one of the
      : CPU clusters would ramp to a high frequency because there was 200-300ms of
      : constant CPU work from a single thread in the main Android userspace
      : process.  The work causing the spike (which is reasonable governor
      : behavior given the amount of CPU time needed) was always PSS collection.
      : As a result, Android is burning more power than we should be on PSS
      : collection.
      :
      : The other issue (and why I'm less sure about improving smaps as a
      : long-term solution) is that the number of VMAs per process has increased
      : significantly from release to release.  After trying to figure out why we
      : were seeing these 200-300ms PSS collection times on Android O but had not
      : noticed it in previous versions, we found that the number of VMAs in the
      : main system process increased by 50% from Android N to Android O (from
      : ~1800 to ~2700) and varying increases in every userspace process.  Android
      : M to N also had an increase in the number of VMAs, although not as much.
      : I'm not sure why this is increasing so much over time, but thinking about
      : ASLR and ways to make ASLR better, I expect that this will continue to
      : increase going forward.  I would not be surprised if we hit 5000 VMAs on
      : the main Android process (system_server) by 2020.
      :
      : If we assume that the number of VMAs is going to increase over time, then
      : doing anything we can do to reduce the overhead of each VMA during PSS
      : collection seems like the right way to go, and that means outputting an
      : aggregate statistic (to avoid whatever overhead there is per line in
      : writing smaps and in reading each line from userspace).
      
      Link: http://lkml.kernel.org/r/20170812022148.178293-1-dancol@google.comSigned-off-by: NDaniel Colascione <dancol@google.com>
      Cc: Tim Murray <timmurray@google.com>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Sonny Rao <sonnyrao@chromium.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      493b0e9d
    • A
      swap: choose swap device according to numa node · a2468cc9
      Aaron Lu 提交于
      If the system has more than one swap device and swap device has the node
      information, we can make use of this information to decide which swap
      device to use in get_swap_pages() to get better performance.
      
      The current code uses a priority based list, swap_avail_list, to decide
      which swap device to use and if multiple swap devices share the same
      priority, they are used round robin.  This patch changes the previous
      single global swap_avail_list into a per-numa-node list, i.e.  for each
      numa node, it sees its own priority based list of available swap
      devices.  Swap device's priority can be promoted on its matching node's
      swap_avail_list.
      
      The current swap device's priority is set as: user can set a >=0 value,
      or the system will pick one starting from -1 then downwards.  The
      priority value in the swap_avail_list is the negated value of the swap
      device's due to plist being sorted from low to high.  The new policy
      doesn't change the semantics for priority >=0 cases, the previous
      starting from -1 then downwards now becomes starting from -2 then
      downwards and -1 is reserved as the promoted value.
      
      Take 4-node EX machine as an example, suppose 4 swap devices are
      available, each sit on a different node:
      swapA on node 0
      swapB on node 1
      swapC on node 2
      swapD on node 3
      
      After they are all swapped on in the sequence of ABCD.
      
      Current behaviour:
      their priorities will be:
      swapA: -1
      swapB: -2
      swapC: -3
      swapD: -4
      And their position in the global swap_avail_list will be:
      swapA   -> swapB   -> swapC   -> swapD
      prio:1     prio:2     prio:3     prio:4
      
      New behaviour:
      their priorities will be(note that -1 is skipped):
      swapA: -2
      swapB: -3
      swapC: -4
      swapD: -5
      And their positions in the 4 swap_avail_lists[nid] will be:
      swap_avail_lists[0]: /* node 0's available swap device list */
      swapA   -> swapB   -> swapC   -> swapD
      prio:1     prio:3     prio:4     prio:5
      swap_avali_lists[1]: /* node 1's available swap device list */
      swapB   -> swapA   -> swapC   -> swapD
      prio:1     prio:2     prio:4     prio:5
      swap_avail_lists[2]: /* node 2's available swap device list */
      swapC   -> swapA   -> swapB   -> swapD
      prio:1     prio:2     prio:3     prio:5
      swap_avail_lists[3]: /* node 3's available swap device list */
      swapD   -> swapA   -> swapB   -> swapC
      prio:1     prio:2     prio:3     prio:4
      
      To see the effect of the patch, a test that starts N process, each mmap
      a region of anonymous memory and then continually write to it at random
      position to trigger both swap in and out is used.
      
      On a 2 node Skylake EP machine with 64GiB memory, two 170GB SSD drives
      are used as swap devices with each attached to a different node, the
      result is:
      
      runtime=30m/processes=32/total test size=128G/each process mmap region=4G
      kernel         throughput
      vanilla        13306
      auto-binding   15169 +14%
      
      runtime=30m/processes=64/total test size=128G/each process mmap region=2G
      kernel         throughput
      vanilla        11885
      auto-binding   14879 +25%
      
      [aaron.lu@intel.com: v2]
        Link: http://lkml.kernel.org/r/20170814053130.GD2369@aaronlu.sh.intel.com
        Link: http://lkml.kernel.org/r/20170816024439.GA10925@aaronlu.sh.intel.com
      [akpm@linux-foundation.org: use kmalloc_array()]
      Link: http://lkml.kernel.org/r/20170814053130.GD2369@aaronlu.sh.intel.com
      Link: http://lkml.kernel.org/r/20170816024439.GA10925@aaronlu.sh.intel.comSigned-off-by: NAaron Lu <aaron.lu@intel.com>
      Cc: "Chen, Tim C" <tim.c.chen@intel.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a2468cc9
    • H
      mm, swap: add sysfs interface for VMA based swap readahead · d9bfcfdc
      Huang Ying 提交于
      The sysfs interface to control the VMA based swap readahead is added as
      follow,
      
      /sys/kernel/mm/swap/vma_ra_enabled
      
      Enable the VMA based swap readahead algorithm, or use the original
      global swap readahead algorithm.
      
      /sys/kernel/mm/swap/vma_ra_max_order
      
      Set the max order of the readahead window size for the VMA based swap
      readahead algorithm.
      
      The corresponding ABI documentation is added too.
      
      Link: http://lkml.kernel.org/r/20170807054038.1843-5-ying.huang@intel.comSigned-off-by: N"Huang, Ying" <ying.huang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Tim Chen <tim.c.chen@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d9bfcfdc
    • J
      fscache: remove unused ->now_uncached callback · 26b433d0
      Jan Kara 提交于
      Patch series "Ranged pagevec lookup", v2.
      
      In this series I make pagevec_lookup() update the index (to be
      consistent with pagevec_lookup_tag() and also as a preparation for
      ranged lookups), provide ranged variant of pagevec_lookup() and use it
      in places where it makes sense.  This not only removes some common code
      but is also a measurable performance win for some use cases (see patch
      4/10) where radix tree is sparse and searching & grabing of a page after
      the end of the range has measurable overhead.
      
      This patch (of 10):
      
      The callback doesn't ever get called.  Remove it.
      
      Link: http://lkml.kernel.org/r/20170726114704.7626-2-jack@suse.czSigned-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      26b433d0
    • M
      mm, page_alloc: rip out ZONELIST_ORDER_ZONE · c9bff3ee
      Michal Hocko 提交于
      Patch series "cleanup zonelists initialization", v1.
      
      This is aimed at cleaning up the zonelists initialization code we have
      but the primary motivation was bug report [2] which got resolved but the
      usage of stop_machine is just too ugly to live.  Most patches are
      straightforward but 3 of them need a special consideration.
      
      Patch 1 removes zone ordered zonelists completely.  I am CCing linux-api
      because this is a user visible change.  As I argue in the patch
      description I do not think we have a strong usecase for it these days.
      I have kept sysctl in place and warn into the log if somebody tries to
      configure zone lists ordering.  If somebody has a real usecase for it we
      can revert this patch but I do not expect anybody will actually notice
      runtime differences.  This patch is not strictly needed for the rest but
      it made patch 6 easier to implement.
      
      Patch 7 removes stop_machine from build_all_zonelists without adding any
      special synchronization between iterators and updater which I _believe_
      is acceptable as explained in the changelog.  I hope I am not missing
      anything.
      
      Patch 8 then removes zonelists_mutex which is kind of ugly as well and
      not really needed AFAICS but a care should be taken when double checking
      my thinking.
      
      This patch (of 9):
      
      Supporting zone ordered zonelists costs us just a lot of code while the
      usefulness is arguable if existent at all.  Mel has already made node
      ordering default on 64b systems.  32b systems are still using
      ZONELIST_ORDER_ZONE because it is considered better to fallback to a
      different NUMA node rather than consume precious lowmem zones.
      
      This argument is, however, weaken by the fact that the memory reclaim
      has been reworked to be node rather than zone oriented.  This means that
      lowmem requests have to skip over all highmem pages on LRUs already and
      so zone ordering doesn't save the reclaim time much.  So the only
      advantage of the zone ordering is under a light memory pressure when
      highmem requests do not ever hit into lowmem zones and the lowmem
      pressure doesn't need to reclaim.
      
      Considering that 32b NUMA systems are rather suboptimal already and it
      is generally advisable to use 64b kernel on such a HW I believe we
      should rather care about the code maintainability and just get rid of
      ZONELIST_ORDER_ZONE altogether.  Keep systcl in place and warn if
      somebody tries to set zone ordering either from kernel command line or
      the sysctl.
      
      [mhocko@suse.com: reading vm.numa_zonelist_order will never terminate]
      Link: http://lkml.kernel.org/r/20170721143915.14161-2-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
      Cc: <linux-api@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c9bff3ee
    • M
      zram: add config and doc file for writeback feature · 5a47074f
      Minchan Kim 提交于
      This patch adds document and kconfig for using of writeback feature.
      
      Link: http://lkml.kernel.org/r/1498459987-24562-10-git-send-email-minchan@kernel.orgSigned-off-by: NMinchan Kim <minchan@kernel.org>
      Cc: Juneho Choi <juno.choi@lge.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5a47074f
    • R
      dax: use common 4k zero page for dax mmap reads · 91d25ba8
      Ross Zwisler 提交于
      When servicing mmap() reads from file holes the current DAX code
      allocates a page cache page of all zeroes and places the struct page
      pointer in the mapping->page_tree radix tree.
      
      This has three major drawbacks:
      
      1) It consumes memory unnecessarily. For every 4k page that is read via
         a DAX mmap() over a hole, we allocate a new page cache page. This
         means that if you read 1GiB worth of pages, you end up using 1GiB of
         zeroed memory. This is easily visible by looking at the overall
         memory consumption of the system or by looking at /proc/[pid]/smaps:
      
      	7f62e72b3000-7f63272b3000 rw-s 00000000 103:00 12   /root/dax/data
      	Size:            1048576 kB
      	Rss:             1048576 kB
      	Pss:             1048576 kB
      	Shared_Clean:          0 kB
      	Shared_Dirty:          0 kB
      	Private_Clean:   1048576 kB
      	Private_Dirty:         0 kB
      	Referenced:      1048576 kB
      	Anonymous:             0 kB
      	LazyFree:              0 kB
      	AnonHugePages:         0 kB
      	ShmemPmdMapped:        0 kB
      	Shared_Hugetlb:        0 kB
      	Private_Hugetlb:       0 kB
      	Swap:                  0 kB
      	SwapPss:               0 kB
      	KernelPageSize:        4 kB
      	MMUPageSize:           4 kB
      	Locked:                0 kB
      
      2) It is slower than using a common zero page because each page fault
         has more work to do. Instead of just inserting a common zero page we
         have to allocate a page cache page, zero it, and then insert it. Here
         are the average latencies of dax_load_hole() as measured by ftrace on
         a random test box:
      
          Old method, using zeroed page cache pages:	3.4 us
          New method, using the common 4k zero page:	0.8 us
      
         This was the average latency over 1 GiB of sequential reads done by
         this simple fio script:
      
           [global]
           size=1G
           filename=/root/dax/data
           fallocate=none
           [io]
           rw=read
           ioengine=mmap
      
      3) The fact that we had to check for both DAX exceptional entries and
         for page cache pages in the radix tree made the DAX code more
         complex.
      
      Solve these issues by following the lead of the DAX PMD code and using a
      common 4k zero page instead.  As with the PMD code we will now insert a
      DAX exceptional entry into the radix tree instead of a struct page
      pointer which allows us to remove all the special casing in the DAX
      code.
      
      Note that we do still pretty aggressively check for regular pages in the
      DAX radix tree, especially where we take action based on the bits set in
      the page.  If we ever find a regular page in our radix tree now that
      most likely means that someone besides DAX is inserting pages (which has
      happened lots of times in the past), and we want to find that out early
      and fail loudly.
      
      This solution also removes the extra memory consumption.  Here is that
      same /proc/[pid]/smaps after 1GiB of reading from a hole with the new
      code:
      
      	7f2054a74000-7f2094a74000 rw-s 00000000 103:00 12   /root/dax/data
      	Size:            1048576 kB
      	Rss:                   0 kB
      	Pss:                   0 kB
      	Shared_Clean:          0 kB
      	Shared_Dirty:          0 kB
      	Private_Clean:         0 kB
      	Private_Dirty:         0 kB
      	Referenced:            0 kB
      	Anonymous:             0 kB
      	LazyFree:              0 kB
      	AnonHugePages:         0 kB
      	ShmemPmdMapped:        0 kB
      	Shared_Hugetlb:        0 kB
      	Private_Hugetlb:       0 kB
      	Swap:                  0 kB
      	SwapPss:               0 kB
      	KernelPageSize:        4 kB
      	MMUPageSize:           4 kB
      	Locked:                0 kB
      
      Overall system memory consumption is similarly improved.
      
      Another major change is that we remove dax_pfn_mkwrite() from our fault
      flow, and instead rely on the page fault itself to make the PTE dirty
      and writeable.  The following description from the patch adding the
      vm_insert_mixed_mkwrite() call explains this a little more:
      
         "To be able to use the common 4k zero page in DAX we need to have our
          PTE fault path look more like our PMD fault path where a PTE entry
          can be marked as dirty and writeable as it is first inserted rather
          than waiting for a follow-up dax_pfn_mkwrite() =>
          finish_mkwrite_fault() call.
      
          Right now we can rely on having a dax_pfn_mkwrite() call because we
          can distinguish between these two cases in do_wp_page():
      
                  case 1: 4k zero page => writable DAX storage
                  case 2: read-only DAX storage => writeable DAX storage
      
          This distinction is made by via vm_normal_page(). vm_normal_page()
          returns false for the common 4k zero page, though, just as it does
          for DAX ptes. Instead of special casing the DAX + 4k zero page case
          we will simplify our DAX PTE page fault sequence so that it matches
          our DAX PMD sequence, and get rid of the dax_pfn_mkwrite() helper.
          We will instead use dax_iomap_fault() to handle write-protection
          faults.
      
          This means that insert_pfn() needs to follow the lead of
          insert_pfn_pmd() and allow us to pass in a 'mkwrite' flag. If
          'mkwrite' is set insert_pfn() will do the work that was previously
          done by wp_page_reuse() as part of the dax_pfn_mkwrite() call path"
      
      Link: http://lkml.kernel.org/r/20170724170616.25810-4-ross.zwisler@linux.intel.comSigned-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      91d25ba8
  3. 05 9月, 2017 32 次提交