1. 07 9月, 2017 40 次提交
    • R
      x86,mpx: make mpx depend on x86-64 to free up VMA flag · df3735c5
      Rik van Riel 提交于
      Patch series "mm,fork,security: introduce MADV_WIPEONFORK", v4.
      
      If a child process accesses memory that was MADV_WIPEONFORK, it will get
      zeroes.  The address ranges are still valid, they are just empty.
      
      If a child process accesses memory that was MADV_DONTFORK, it will get a
      segmentation fault, since those address ranges are no longer valid in
      the child after fork.
      
      Since MADV_DONTFORK also seems to be used to allow very large programs
      to fork in systems with strict memory overcommit restrictions, changing
      the semantics of MADV_DONTFORK might break existing programs.
      
      The use case is libraries that store or cache information, and want to
      know that they need to regenerate it in the child process after fork.
      
      Examples of this would be:
       - systemd/pulseaudio API checks (fail after fork) (replacing a getpid
         check, which is too slow without a PID cache)
       - PKCS#11 API reinitialization check (mandated by specification)
       - glibc's upcoming PRNG (reseed after fork)
       - OpenSSL PRNG (reseed after fork)
      
      The security benefits of a forking server having a re-inialized PRNG in
      every child process are pretty obvious.  However, due to libraries
      having all kinds of internal state, and programs getting compiled with
      many different versions of each library, it is unreasonable to expect
      calling programs to re-initialize everything manually after fork.
      
      A further complication is the proliferation of clone flags, programs
      bypassing glibc's functions to call clone directly, and programs calling
      unshare, causing the glibc pthread_atfork hook to not get called.
      
      It would be better to have the kernel take care of this automatically.
      
      The patchset also adds MADV_KEEPONFORK, to undo the effects of a prior
      MADV_WIPEONFORK.
      
      This is similar to the OpenBSD minherit syscall with MAP_INHERIT_ZERO:
      
          https://man.openbsd.org/minherit.2
      
      This patch (of 2):
      
      MPX only seems to be available on 64 bit CPUs, starting with Skylake and
      Goldmont.  Move VM_MPX into the 64 bit only portion of vma->vm_flags, in
      order to free up a VMA flag.
      
      Link: http://lkml.kernel.org/r/20170811212829.29186-2-riel@redhat.comSigned-off-by: NRik van Riel <riel@redhat.com>
      Acked-by: NDave Hansen <dave.hansen@intel.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Florian Weimer <fweimer@redhat.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Will Drewry <wad@chromium.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Colm MacCártaigh <colm@allcosts.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      df3735c5
    • D
      mm: add /proc/pid/smaps_rollup · 493b0e9d
      Daniel Colascione 提交于
      /proc/pid/smaps_rollup is a new proc file that improves the performance
      of user programs that determine aggregate memory statistics (e.g., total
      PSS) of a process.
      
      Android regularly "samples" the memory usage of various processes in
      order to balance its memory pool sizes.  This sampling process involves
      opening /proc/pid/smaps and summing certain fields.  For very large
      processes, sampling memory use this way can take several hundred
      milliseconds, due mostly to the overhead of the seq_printf calls in
      task_mmu.c.
      
      smaps_rollup improves the situation.  It contains most of the fields of
      /proc/pid/smaps, but instead of a set of fields for each VMA,
      smaps_rollup instead contains one synthetic smaps-format entry
      representing the whole process.  In the single smaps_rollup synthetic
      entry, each field is the summation of the corresponding field in all of
      the real-smaps VMAs.  Using a common format for smaps_rollup and smaps
      allows userspace parsers to repurpose parsers meant for use with
      non-rollup smaps for smaps_rollup, and it allows userspace to switch
      between smaps_rollup and smaps at runtime (say, based on the
      availability of smaps_rollup in a given kernel) with minimal fuss.
      
      By using smaps_rollup instead of smaps, a caller can avoid the
      significant overhead of formatting, reading, and parsing each of a large
      process's potentially very numerous memory mappings.  For sampling
      system_server's PSS in Android, we measured a 12x speedup, representing
      a savings of several hundred milliseconds.
      
      One alternative to a new per-process proc file would have been including
      PSS information in /proc/pid/status.  We considered this option but
      thought that PSS would be too expensive (by a few orders of magnitude)
      to collect relative to what's already emitted as part of
      /proc/pid/status, and slowing every user of /proc/pid/status for the
      sake of readers that happen to want PSS feels wrong.
      
      The code itself works by reusing the existing VMA-walking framework we
      use for regular smaps generation and keeping the mem_size_stats
      structure around between VMA walks instead of using a fresh one for each
      VMA.  In this way, summation happens automatically.  We let seq_file
      walk over the VMAs just as it does for regular smaps and just emit
      nothing to the seq_file until we hit the last VMA.
      
      Benchmarks:
      
          using smaps:
          iterations:1000 pid:1163 pss:220023808
          0m29.46s real 0m08.28s user 0m20.98s system
      
          using smaps_rollup:
          iterations:1000 pid:1163 pss:220702720
          0m04.39s real 0m00.03s user 0m04.31s system
      
      We're using the PSS samples we collect asynchronously for
      system-management tasks like fine-tuning oom_adj_score, memory use
      tracking for debugging, application-level memory-use attribution, and
      deciding whether we want to kill large processes during system idle
      maintenance windows.  Android has been using PSS for these purposes for
      a long time; as the average process VMA count has increased and and
      devices become more efficiency-conscious, PSS-collection inefficiency
      has started to matter more.  IMHO, it'd be a lot safer to optimize the
      existing PSS-collection model, which has been fine-tuned over the years,
      instead of changing the memory tracking approach entirely to work around
      smaps-generation inefficiency.
      
      Tim said:
      
      : There are two main reasons why Android gathers PSS information:
      :
      : 1. Android devices can show the user the amount of memory used per
      :    application via the settings app.  This is a less important use case.
      :
      : 2. We log PSS to help identify leaks in applications.  We have found
      :    an enormous number of bugs (in the Android platform, in Google's own
      :    apps, and in third-party applications) using this data.
      :
      : To do this, system_server (the main process in Android userspace) will
      : sample the PSS of a process three seconds after it changes state (for
      : example, app is launched and becomes the foreground application) and about
      : every ten minutes after that.  The net result is that PSS collection is
      : regularly running on at least one process in the system (usually a few
      : times a minute while the screen is on, less when screen is off due to
      : suspend).  PSS of a process is an incredibly useful stat to track, and we
      : aren't going to get rid of it.  We've looked at some very hacky approaches
      : using RSS ("take the RSS of the target process, subtract the RSS of the
      : zygote process that is the parent of all Android apps") to reduce the
      : accounting time, but it regularly overestimated the memory used by 20+
      : percent.  Accordingly, I don't think that there's a good alternative to
      : using PSS.
      :
      : We started looking into PSS collection performance after we noticed random
      : frequency spikes while a phone's screen was off; occasionally, one of the
      : CPU clusters would ramp to a high frequency because there was 200-300ms of
      : constant CPU work from a single thread in the main Android userspace
      : process.  The work causing the spike (which is reasonable governor
      : behavior given the amount of CPU time needed) was always PSS collection.
      : As a result, Android is burning more power than we should be on PSS
      : collection.
      :
      : The other issue (and why I'm less sure about improving smaps as a
      : long-term solution) is that the number of VMAs per process has increased
      : significantly from release to release.  After trying to figure out why we
      : were seeing these 200-300ms PSS collection times on Android O but had not
      : noticed it in previous versions, we found that the number of VMAs in the
      : main system process increased by 50% from Android N to Android O (from
      : ~1800 to ~2700) and varying increases in every userspace process.  Android
      : M to N also had an increase in the number of VMAs, although not as much.
      : I'm not sure why this is increasing so much over time, but thinking about
      : ASLR and ways to make ASLR better, I expect that this will continue to
      : increase going forward.  I would not be surprised if we hit 5000 VMAs on
      : the main Android process (system_server) by 2020.
      :
      : If we assume that the number of VMAs is going to increase over time, then
      : doing anything we can do to reduce the overhead of each VMA during PSS
      : collection seems like the right way to go, and that means outputting an
      : aggregate statistic (to avoid whatever overhead there is per line in
      : writing smaps and in reading each line from userspace).
      
      Link: http://lkml.kernel.org/r/20170812022148.178293-1-dancol@google.comSigned-off-by: NDaniel Colascione <dancol@google.com>
      Cc: Tim Murray <timmurray@google.com>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Sonny Rao <sonnyrao@chromium.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      493b0e9d
    • H
      mm: hugetlb: clear target sub-page last when clearing huge page · c79b57e4
      Huang Ying 提交于
      Huge page helps to reduce TLB miss rate, but it has higher cache
      footprint, sometimes this may cause some issue.  For example, when
      clearing huge page on x86_64 platform, the cache footprint is 2M.  But
      on a Xeon E5 v3 2699 CPU, there are 18 cores, 36 threads, and only 45M
      LLC (last level cache).  That is, in average, there are 2.5M LLC for
      each core and 1.25M LLC for each thread.
      
      If the cache pressure is heavy when clearing the huge page, and we clear
      the huge page from the begin to the end, it is possible that the begin
      of huge page is evicted from the cache after we finishing clearing the
      end of the huge page.  And it is possible for the application to access
      the begin of the huge page after clearing the huge page.
      
      To help the above situation, in this patch, when we clear a huge page,
      the order to clear sub-pages is changed.  In quite some situation, we
      can get the address that the application will access after we clear the
      huge page, for example, in a page fault handler.  Instead of clearing
      the huge page from begin to end, we will clear the sub-pages farthest
      from the the sub-page to access firstly, and clear the sub-page to
      access last.  This will make the sub-page to access most cache-hot and
      sub-pages around it more cache-hot too.  If we cannot know the address
      the application will access, the begin of the huge page is assumed to be
      the the address the application will access.
      
      With this patch, the throughput increases ~28.3% in vm-scalability
      anon-w-seq test case with 72 processes on a 2 socket Xeon E5 v3 2699
      system (36 cores, 72 threads).  The test case creates 72 processes, each
      process mmap a big anonymous memory area and writes to it from the begin
      to the end.  For each process, other processes could be seen as other
      workload which generates heavy cache pressure.  At the same time, the
      cache miss rate reduced from ~33.4% to ~31.7%, the IPC (instruction per
      cycle) increased from 0.56 to 0.74, and the time spent in user space is
      reduced ~7.9%
      
      Christopher Lameter suggests to clear bytes inside a sub-page from end
      to begin too.  But tests show no visible performance difference in the
      tests.  May because the size of page is small compared with the cache
      size.
      
      Thanks Andi Kleen to propose to use address to access to determine the
      order of sub-pages to clear.
      
      The hugetlbfs access address could be improved, will do that in another
      patch.
      
      [ying.huang@intel.com: improve readability of clear_huge_page()]
        Link: http://lkml.kernel.org/r/20170830051842.1397-1-ying.huang@intel.com
      Link: http://lkml.kernel.org/r/20170815014618.15842-1-ying.huang@intel.comSuggested-by: NAndi Kleen <andi.kleen@intel.com>
      Signed-off-by: N"Huang, Ying" <ying.huang@intel.com>
      Acked-by: NJan Kara <jack@suse.cz>
      Reviewed-by: NMichal Hocko <mhocko@suse.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Nadia Yvette Chambers <nyc@holomorphy.com>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c79b57e4
    • A
      mm: oom: let oom_reap_task and exit_mmap run concurrently · 21292580
      Andrea Arcangeli 提交于
      This is purely required because exit_aio() may block and exit_mmap() may
      never start, if the oom_reap_task cannot start running on a mm with
      mm_users == 0.
      
      At the same time if the OOM reaper doesn't wait at all for the memory of
      the current OOM candidate to be freed by exit_mmap->unmap_vmas, it would
      generate a spurious OOM kill.
      
      If it wasn't because of the exit_aio or similar blocking functions in
      the last mmput, it would be enough to change the oom_reap_task() in the
      case it finds mm_users == 0, to wait for a timeout or to wait for
      __mmput to set MMF_OOM_SKIP itself, but it's not just exit_mmap the
      problem here so the concurrency of exit_mmap and oom_reap_task is
      apparently warranted.
      
      It's a non standard runtime, exit_mmap() runs without mmap_sem, and
      oom_reap_task runs with the mmap_sem for reading as usual (kind of
      MADV_DONTNEED).
      
      The race between the two is solved with a combination of
      tsk_is_oom_victim() (serialized by task_lock) and MMF_OOM_SKIP
      (serialized by a dummy down_write/up_write cycle on the same lines of
      the ksm_exit method).
      
      If the oom_reap_task() may be running concurrently during exit_mmap,
      exit_mmap will wait it to finish in down_write (before taking down mm
      structures that would make the oom_reap_task fail with use after free).
      
      If exit_mmap comes first, oom_reap_task() will skip the mm if
      MMF_OOM_SKIP is already set and in turn all memory is already freed and
      furthermore the mm data structures may already have been taken down by
      free_pgtables.
      
      [aarcange@redhat.com: incremental one liner]
        Link: http://lkml.kernel.org/r/20170726164319.GC29716@redhat.com
      [rientjes@google.com: remove unused mmput_async]
        Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1708141733130.50317@chino.kir.corp.google.com
      [aarcange@redhat.com: microoptimization]
        Link: http://lkml.kernel.org/r/20170817171240.GB5066@redhat.com
      Link: http://lkml.kernel.org/r/20170726162912.GA29716@redhat.com
      Fixes: 26db62f1 ("oom: keep mm of the killed task available")
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Reported-by: NDavid Rientjes <rientjes@google.com>
      Tested-by: NDavid Rientjes <rientjes@google.com>
      Reviewed-by: NMichal Hocko <mhocko@suse.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      21292580
    • A
      swap: choose swap device according to numa node · a2468cc9
      Aaron Lu 提交于
      If the system has more than one swap device and swap device has the node
      information, we can make use of this information to decide which swap
      device to use in get_swap_pages() to get better performance.
      
      The current code uses a priority based list, swap_avail_list, to decide
      which swap device to use and if multiple swap devices share the same
      priority, they are used round robin.  This patch changes the previous
      single global swap_avail_list into a per-numa-node list, i.e.  for each
      numa node, it sees its own priority based list of available swap
      devices.  Swap device's priority can be promoted on its matching node's
      swap_avail_list.
      
      The current swap device's priority is set as: user can set a >=0 value,
      or the system will pick one starting from -1 then downwards.  The
      priority value in the swap_avail_list is the negated value of the swap
      device's due to plist being sorted from low to high.  The new policy
      doesn't change the semantics for priority >=0 cases, the previous
      starting from -1 then downwards now becomes starting from -2 then
      downwards and -1 is reserved as the promoted value.
      
      Take 4-node EX machine as an example, suppose 4 swap devices are
      available, each sit on a different node:
      swapA on node 0
      swapB on node 1
      swapC on node 2
      swapD on node 3
      
      After they are all swapped on in the sequence of ABCD.
      
      Current behaviour:
      their priorities will be:
      swapA: -1
      swapB: -2
      swapC: -3
      swapD: -4
      And their position in the global swap_avail_list will be:
      swapA   -> swapB   -> swapC   -> swapD
      prio:1     prio:2     prio:3     prio:4
      
      New behaviour:
      their priorities will be(note that -1 is skipped):
      swapA: -2
      swapB: -3
      swapC: -4
      swapD: -5
      And their positions in the 4 swap_avail_lists[nid] will be:
      swap_avail_lists[0]: /* node 0's available swap device list */
      swapA   -> swapB   -> swapC   -> swapD
      prio:1     prio:3     prio:4     prio:5
      swap_avali_lists[1]: /* node 1's available swap device list */
      swapB   -> swapA   -> swapC   -> swapD
      prio:1     prio:2     prio:4     prio:5
      swap_avail_lists[2]: /* node 2's available swap device list */
      swapC   -> swapA   -> swapB   -> swapD
      prio:1     prio:2     prio:3     prio:5
      swap_avail_lists[3]: /* node 3's available swap device list */
      swapD   -> swapA   -> swapB   -> swapC
      prio:1     prio:2     prio:3     prio:4
      
      To see the effect of the patch, a test that starts N process, each mmap
      a region of anonymous memory and then continually write to it at random
      position to trigger both swap in and out is used.
      
      On a 2 node Skylake EP machine with 64GiB memory, two 170GB SSD drives
      are used as swap devices with each attached to a different node, the
      result is:
      
      runtime=30m/processes=32/total test size=128G/each process mmap region=4G
      kernel         throughput
      vanilla        13306
      auto-binding   15169 +14%
      
      runtime=30m/processes=64/total test size=128G/each process mmap region=2G
      kernel         throughput
      vanilla        11885
      auto-binding   14879 +25%
      
      [aaron.lu@intel.com: v2]
        Link: http://lkml.kernel.org/r/20170814053130.GD2369@aaronlu.sh.intel.com
        Link: http://lkml.kernel.org/r/20170816024439.GA10925@aaronlu.sh.intel.com
      [akpm@linux-foundation.org: use kmalloc_array()]
      Link: http://lkml.kernel.org/r/20170814053130.GD2369@aaronlu.sh.intel.com
      Link: http://lkml.kernel.org/r/20170816024439.GA10925@aaronlu.sh.intel.comSigned-off-by: NAaron Lu <aaron.lu@intel.com>
      Cc: "Chen, Tim C" <tim.c.chen@intel.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a2468cc9
    • M
      mm: replace TIF_MEMDIE checks by tsk_is_oom_victim · da99ecf1
      Michal Hocko 提交于
      TIF_MEMDIE is set only to the tasks whick were either directly selected
      by the OOM killer or passed through mark_oom_victim from the allocator
      path.  tsk_is_oom_victim is more generic and allows to identify all
      tasks (threads) which share the mm with the oom victim.
      
      Please note that the freezer still needs to check TIF_MEMDIE because we
      cannot thaw tasks which do not participage in oom_victims counting
      otherwise a !TIF_MEMDIE task could interfere after oom_disbale returns.
      
      Link: http://lkml.kernel.org/r/20170810075019.28998-3-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Roman Gushchin <guro@fb.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      da99ecf1
    • M
      mm, oom: do not rely on TIF_MEMDIE for memory reserves access · cd04ae1e
      Michal Hocko 提交于
      For ages we have been relying on TIF_MEMDIE thread flag to mark OOM
      victims and then, among other things, to give these threads full access
      to memory reserves.  There are few shortcomings of this implementation,
      though.
      
      First of all and the most serious one is that the full access to memory
      reserves is quite dangerous because we leave no safety room for the
      system to operate and potentially do last emergency steps to move on.
      
      Secondly this flag is per task_struct while the OOM killer operates on
      mm_struct granularity so all processes sharing the given mm are killed.
      Giving the full access to all these task_structs could lead to a quick
      memory reserves depletion.  We have tried to reduce this risk by giving
      TIF_MEMDIE only to the main thread and the currently allocating task but
      that doesn't really solve this problem while it surely opens up a room
      for corner cases - e.g.  GFP_NO{FS,IO} requests might loop inside the
      allocator without access to memory reserves because a particular thread
      was not the group leader.
      
      Now that we have the oom reaper and that all oom victims are reapable
      after 1b51e65e ("oom, oom_reaper: allow to reap mm shared by the
      kthreads") we can be more conservative and grant only partial access to
      memory reserves because there are reasonable chances of the parallel
      memory freeing.  We still want some access to reserves because we do not
      want other consumers to eat up the victim's freed memory.  oom victims
      will still contend with __GFP_HIGH users but those shouldn't be so
      aggressive to starve oom victims completely.
      
      Introduce ALLOC_OOM flag and give all tsk_is_oom_victim tasks access to
      the half of the reserves.  This makes the access to reserves independent
      on which task has passed through mark_oom_victim.  Also drop any usage
      of TIF_MEMDIE from the page allocator proper and replace it by
      tsk_is_oom_victim as well which will make page_alloc.c completely
      TIF_MEMDIE free finally.
      
      CONFIG_MMU=n doesn't have oom reaper so let's stick to the original
      ALLOC_NO_WATERMARKS approach.
      
      There is a demand to make the oom killer memcg aware which will imply
      many tasks killed at once.  This change will allow such a usecase
      without worrying about complete memory reserves depletion.
      
      Link: http://lkml.kernel.org/r/20170810075019.28998-2-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NMel Gorman <mgorman@techsingularity.net>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Roman Gushchin <guro@fb.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      cd04ae1e
    • V
      z3fold: use per-cpu unbuddied lists · d30561c5
      Vitaly Wool 提交于
      It's been noted that z3fold doesn't scale well when it's run in a large
      number of threads on many cores, which can be easily reproduced with fio
      'randrw' test with --numjobs=32.  E.g.  the result for 1 cluster (4 cores)
      is:
      
      Run status group 0 (all jobs):
         READ: io=244785MB, aggrb=496883KB/s, minb=15527KB/s, ...
        WRITE: io=246735MB, aggrb=500841KB/s, minb=15651KB/s, ...
      
      While for 8 cores (2 clusters) the result is:
      
      Run status group 0 (all jobs):
         READ: io=244785MB, aggrb=265942KB/s, minb=8310KB/s, ...
        WRITE: io=246735MB, aggrb=268060KB/s, minb=8376KB/s, ...
      
      The bottleneck here is the pool lock which many threads become waiting
      upon.  To reduce that spin lock contention, z3fold can operate only on
      the lists local to the current CPU whenever possible.  Due to the nature
      of z3fold unbuddied list handling (it only takes the first entry off the
      list on a hot path), if the z3fold pool is big enough and balanced well
      enough, limiting search to only local unbuddied list doesn't lead to a
      significant compression ratio degrade (2.57x vs 2.65x in our
      measurements).
      
      This patch also introduces two worker threads: one for async in-page
      object layout optimization and one for releasing freed pages.  This is
      done to speed up z3fold_free() which is often on a hot path.
      
      The fio results for 8-core case are now the following:
      
      Run status group 0 (all jobs):
         READ: io=244785MB, aggrb=1568.3MB/s, minb=50182KB/s, ...
        WRITE: io=246735MB, aggrb=1580.8MB/s, minb=50582KB/s, ...
      
      So we're in for almost 6x performance increase.
      
      Link: http://lkml.kernel.org/r/20170806181443.f9b65018f8bde25ef990f9e8@gmail.comSigned-off-by: NVitaly Wool <vitalywool@gmail.com>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d30561c5
    • H
      mm, swap: don't use VMA based swap readahead if HDD is used as swap · 81a0298b
      Huang Ying 提交于
      VMA based swap readahead will readahead the virtual pages that is
      continuous in the virtual address space.  While the original swap
      readahead will readahead the swap slots that is continuous in the swap
      device.  Although VMA based swap readahead is more correct for the swap
      slots to be readahead, it will trigger more small random readings, which
      may cause the performance of HDD (hard disk) to degrade heavily, and may
      finally exceed the benefit.
      
      To avoid the issue, in this patch, if the HDD is used as swap, the VMA
      based swap readahead will be disabled, and the original swap readahead
      will be used instead.
      
      Link: http://lkml.kernel.org/r/20170807054038.1843-6-ying.huang@intel.comSigned-off-by: N"Huang, Ying" <ying.huang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Tim Chen <tim.c.chen@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      81a0298b
    • H
      mm, swap: add sysfs interface for VMA based swap readahead · d9bfcfdc
      Huang Ying 提交于
      The sysfs interface to control the VMA based swap readahead is added as
      follow,
      
      /sys/kernel/mm/swap/vma_ra_enabled
      
      Enable the VMA based swap readahead algorithm, or use the original
      global swap readahead algorithm.
      
      /sys/kernel/mm/swap/vma_ra_max_order
      
      Set the max order of the readahead window size for the VMA based swap
      readahead algorithm.
      
      The corresponding ABI documentation is added too.
      
      Link: http://lkml.kernel.org/r/20170807054038.1843-5-ying.huang@intel.comSigned-off-by: N"Huang, Ying" <ying.huang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Tim Chen <tim.c.chen@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d9bfcfdc
    • H
      mm, swap: VMA based swap readahead · ec560175
      Huang Ying 提交于
      The swap readahead is an important mechanism to reduce the swap in
      latency.  Although pure sequential memory access pattern isn't very
      popular for anonymous memory, the space locality is still considered
      valid.
      
      In the original swap readahead implementation, the consecutive blocks in
      swap device are readahead based on the global space locality estimation.
      But the consecutive blocks in swap device just reflect the order of page
      reclaiming, don't necessarily reflect the access pattern in virtual
      memory.  And the different tasks in the system may have different access
      patterns, which makes the global space locality estimation incorrect.
      
      In this patch, when page fault occurs, the virtual pages near the fault
      address will be readahead instead of the swap slots near the fault swap
      slot in swap device.  This avoid to readahead the unrelated swap slots.
      At the same time, the swap readahead is changed to work on per-VMA from
      globally.  So that the different access patterns of the different VMAs
      could be distinguished, and the different readahead policy could be
      applied accordingly.  The original core readahead detection and scaling
      algorithm is reused, because it is an effect algorithm to detect the
      space locality.
      
      The test and result is as follow,
      
      Common test condition
      =====================
      
      Test Machine: Xeon E5 v3 (2 sockets, 72 threads, 32G RAM) Swap device:
      NVMe disk
      
      Micro-benchmark with combined access pattern
      ============================================
      
      vm-scalability, sequential swap test case, 4 processes to eat 50G
      virtual memory space, repeat the sequential memory writing until 300
      seconds.  The first round writing will trigger swap out, the following
      rounds will trigger sequential swap in and out.
      
      At the same time, run vm-scalability random swap test case in
      background, 8 processes to eat 30G virtual memory space, repeat the
      random memory write until 300 seconds.  This will trigger random swap-in
      in the background.
      
      This is a combined workload with sequential and random memory accessing
      at the same time.  The result (for sequential workload) is as follow,
      
      			Base		Optimized
      			----		---------
      throughput		345413 KB/s	414029 KB/s (+19.9%)
      latency.average		97.14 us	61.06 us (-37.1%)
      latency.50th		2 us		1 us
      latency.60th		2 us		1 us
      latency.70th		98 us		2 us
      latency.80th		160 us		2 us
      latency.90th		260 us		217 us
      latency.95th		346 us		369 us
      latency.99th		1.34 ms		1.09 ms
      ra_hit%			52.69%		99.98%
      
      The original swap readahead algorithm is confused by the background
      random access workload, so readahead hit rate is lower.  The VMA-base
      readahead algorithm works much better.
      
      Linpack
      =======
      
      The test memory size is bigger than RAM to trigger swapping.
      
      			Base		Optimized
      			----		---------
      elapsed_time		393.49 s	329.88 s (-16.2%)
      ra_hit%			86.21%		98.82%
      
      The score of base and optimized kernel hasn't visible changes.  But the
      elapsed time reduced and readahead hit rate improved, so the optimized
      kernel runs better for startup and tear down stages.  And the absolute
      value of readahead hit rate is high, shows that the space locality is
      still valid in some practical workloads.
      
      Link: http://lkml.kernel.org/r/20170807054038.1843-4-ying.huang@intel.comSigned-off-by: N"Huang, Ying" <ying.huang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Tim Chen <tim.c.chen@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ec560175
    • H
      mm, swap: fix swap readahead marking · c4fa6309
      Huang Ying 提交于
      In the original implementation, it is possible that the existing pages
      in the swap cache (not newly readahead) could be marked as the readahead
      pages.  This will cause the statistics of swap readahead be wrong and
      influence the swap readahead algorithm too.
      
      This is fixed via marking a page as the readahead page only if it is
      newly allocated and read from the disk.
      
      When testing with linpack, after the fixing the swap readahead hit rate
      increased from ~66% to ~86%.
      
      Link: http://lkml.kernel.org/r/20170807054038.1843-3-ying.huang@intel.comSigned-off-by: N"Huang, Ying" <ying.huang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Tim Chen <tim.c.chen@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c4fa6309
    • H
      mm, swap: add swap readahead hit statistics · cbc65df2
      Huang Ying 提交于
      Patch series "mm, swap: VMA based swap readahead", v4.
      
      The swap readahead is an important mechanism to reduce the swap in
      latency.  Although pure sequential memory access pattern isn't very
      popular for anonymous memory, the space locality is still considered
      valid.
      
      In the original swap readahead implementation, the consecutive blocks in
      swap device are readahead based on the global space locality estimation.
      But the consecutive blocks in swap device just reflect the order of page
      reclaiming, don't necessarily reflect the access pattern in virtual
      memory space.  And the different tasks in the system may have different
      access patterns, which makes the global space locality estimation
      incorrect.
      
      In this patchset, when page fault occurs, the virtual pages near the
      fault address will be readahead instead of the swap slots near the fault
      swap slot in swap device.  This avoid to readahead the unrelated swap
      slots.  At the same time, the swap readahead is changed to work on
      per-VMA from globally.  So that the different access patterns of the
      different VMAs could be distinguished, and the different readahead
      policy could be applied accordingly.  The original core readahead
      detection and scaling algorithm is reused, because it is an effect
      algorithm to detect the space locality.
      
      In addition to the swap readahead changes, some new sysfs interface is
      added to show the efficiency of the readahead algorithm and some other
      swap statistics.
      
      This new implementation will incur more small random read, on SSD, the
      improved correctness of estimation and readahead target should beat the
      potential increased overhead, this is also illustrated in the test
      results below.  But on HDD, the overhead may beat the benefit, so the
      original implementation will be used by default.
      
      The test and result is as follow,
      
      Common test condition
      =====================
      
      Test Machine: Xeon E5 v3 (2 sockets, 72 threads, 32G RAM)
      Swap device: NVMe disk
      
      Micro-benchmark with combined access pattern
      ============================================
      
      vm-scalability, sequential swap test case, 4 processes to eat 50G
      virtual memory space, repeat the sequential memory writing until 300
      seconds.  The first round writing will trigger swap out, the following
      rounds will trigger sequential swap in and out.
      
      At the same time, run vm-scalability random swap test case in
      background, 8 processes to eat 30G virtual memory space, repeat the
      random memory write until 300 seconds.  This will trigger random swap-in
      in the background.
      
      This is a combined workload with sequential and random memory accessing
      at the same time.  The result (for sequential workload) is as follow,
      
      			Base		Optimized
      			----		---------
      throughput		345413 KB/s	414029 KB/s (+19.9%)
      latency.average		97.14 us	61.06 us (-37.1%)
      latency.50th		2 us		1 us
      latency.60th		2 us		1 us
      latency.70th		98 us		2 us
      latency.80th		160 us		2 us
      latency.90th		260 us		217 us
      latency.95th		346 us		369 us
      latency.99th		1.34 ms		1.09 ms
      ra_hit%			52.69%		99.98%
      
      The original swap readahead algorithm is confused by the background
      random access workload, so readahead hit rate is lower.  The VMA-base
      readahead algorithm works much better.
      
      Linpack
      =======
      
      The test memory size is bigger than RAM to trigger swapping.
      
      			Base		Optimized
      			----		---------
      elapsed_time		393.49 s	329.88 s (-16.2%)
      ra_hit%			86.21%		98.82%
      
      The score of base and optimized kernel hasn't visible changes.  But the
      elapsed time reduced and readahead hit rate improved, so the optimized
      kernel runs better for startup and tear down stages.  And the absolute
      value of readahead hit rate is high, shows that the space locality is
      still valid in some practical workloads.
      
      This patch (of 5):
      
      The statistics for total readahead pages and total readahead hits are
      recorded and exported via the following sysfs interface.
      
      /sys/kernel/mm/swap/ra_hits
      /sys/kernel/mm/swap/ra_total
      
      With them, the efficiency of the swap readahead could be measured, so
      that the swap readahead algorithm and parameters could be tuned
      accordingly.
      
      [akpm@linux-foundation.org: don't display swap stats if CONFIG_SWAP=n]
      Link: http://lkml.kernel.org/r/20170807054038.1843-2-ying.huang@intel.comSigned-off-by: N"Huang, Ying" <ying.huang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Tim Chen <tim.c.chen@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      cbc65df2
    • B
      mm/vmalloc.c: don't reinvent the wheel but use existing llist API · 894e58c1
      Byungchul Park 提交于
      Although llist provides proper APIs, they are not used.  Make them used.
      
      Link: http://lkml.kernel.org/r/1502095374-16112-1-git-send-email-byungchul.park@lge.comSigned-off-by: NByungchul Park <byungchul.park@lge.com>
      Cc: zijun_hu <zijun_hu@htc.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      894e58c1
    • S
      mm/vmstat.c: fix wrong comment · f113e641
      SeongJae Park 提交于
      Comment for pagetypeinfo_showblockcount() is mistakenly duplicated from
      pagetypeinfo_show_free()'s comment.  This commit fixes it.
      
      Link: http://lkml.kernel.org/r/20170809185816.11244-1-sj38.park@gmail.com
      Fixes: 467c996c ("Print out statistics in relation to fragmentation avoidance to /proc/pagetypeinfo")
      Signed-off-by: NSeongJae Park <sj38.park@gmail.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f113e641
    • M
      selftests/memfd: add memfd_create hugetlbfs selftest · 1f522a48
      Mike Kravetz 提交于
      With the addition of hugetlbfs support in memfd_create, the memfd
      selftests should verify correct functionality with hugetlbfs.
      
      Instead of writing a separate memfd hugetlbfs test, modify the
      memfd_test program to take an optional argument 'hugetlbfs'.  If the
      hugetlbfs argument is specified, basic memfd_create functionality will
      be exercised on hugetlbfs.  If hugetlbfs is not specified, the current
      functionality of the test is unchanged.
      
      Note that many of the tests in memfd_test test file sealing operations.
      hugetlbfs does not support file sealing, therefore for hugetlbfs all
      sealing related tests are skipped.
      
      In order to test on hugetlbfs, there needs to be preallocated huge
      pages.  A new script (run_tests) is added.  This script will first run
      the existing memfd_create tests.  It will then, attempt to allocate the
      required number of huge pages before running the hugetlbfs test.  At the
      end of testing, it will release any huge pages allocated for testing
      purposes.
      
      Link: http://lkml.kernel.org/r/1502495772-24736-3-git-send-email-mike.kravetz@oracle.comSigned-off-by: NMike Kravetz <mike.kravetz@oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1f522a48
    • M
      mm/shmem: add hugetlbfs support to memfd_create() · 749df87b
      Mike Kravetz 提交于
      This patch came out of discussions in this e-mail thread:
        http://lkml.kernel.org/r/1499357846-7481-1-git-send-email-mike.kravetz%40oracle.com
      
      The Oracle JVM team is developing a new garbage collection model.  This
      new model requires multiple mappings of the same anonymous memory.  One
      straight forward way to accomplish this is with memfd_create.  They can
      use the returned fd to create multiple mappings of the same memory.
      
      The JVM today has an option to use (static hugetlb) huge pages.  If this
      option is specified, they would like to use the same garbage collection
      model requiring multiple mappings to the same memory.  Using hugetlbfs,
      it is possible to explicitly mount a filesystem and specify file paths
      in order to get an fd that can be used for multiple mappings.  However,
      this introduces additional system admin work and coordination.
      
      Ideally they would like to get a hugetlbfs fd without requiring explicit
      mounting of a filesystem.  Today, mmap and shmget can make use of
      hugetlbfs without explicitly mounting a filesystem.  The patch adds this
      functionality to memfd_create.
      
      Add a new flag MFD_HUGETLB to memfd_create() that will specify the file
      to be created resides in the hugetlbfs filesystem.  This is the generic
      hugetlbfs filesystem not associated with any specific mount point.  As
      with other system calls that request hugetlbfs backed pages, there is
      the ability to encode huge page size in the flag arguments.
      
      hugetlbfs does not support sealing operations, therefore specifying
      MFD_ALLOW_SEALING with MFD_HUGETLB will result in EINVAL.
      
      Of course, the memfd_man page would need updating if this type of
      functionality moves forward.
      
      Link: http://lkml.kernel.org/r/1502149672-7759-2-git-send-email-mike.kravetz@oracle.comSigned-off-by: NMike Kravetz <mike.kravetz@oracle.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      749df87b
    • D
      mm, devm_memremap_pages: use multi-order radix for ZONE_DEVICE lookups · ab1b597e
      Dan Williams 提交于
      devm_memremap_pages() records mapped ranges in pgmap_radix with an entry
      per section's worth of memory (128MB).  The key for each of those
      entries is a section number.
      
      This leads to false positives when devm_memremap_pages() is passed a
      section-unaligned range as lookups in the misalignment fail to return
      NULL.  We can close this hole by using the pfn as the key for entries in
      the tree.  The number of entries required to describe a remapped range
      is reduced by leveraging multi-order entries.
      
      In practice this approach usually yields just one entry in the tree if
      the size and starting address are of the same power-of-2 alignment.
      Previously we always needed nr_entries = mapping_size / 128MB.
      
      Link: https://lists.01.org/pipermail/linux-nvdimm/2016-August/006666.html
      Link: http://lkml.kernel.org/r/150215410565.39310.13767886055248249438.stgit@dwillia2-desk3.amr.corp.intel.comSigned-off-by: NDan Williams <dan.j.williams@intel.com>
      Reported-by: NToshi Kani <toshi.kani@hpe.com>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ab1b597e
    • W
      mm/vmalloc.c: halve the number of comparisons performed in pcpu_get_vm_areas() · c568da28
      Wei Yang 提交于
      In pcpu_get_vm_areas(), it checks each range is not overlapped.  To make
      sure it is, only (N^2)/2 comparison is necessary, while current code
      does N^2 times.  By starting from the next range, it achieves the goal
      and the continue could be removed.
      
      Also,
      
       - the overlap check of two ranges could be done with one clause
      
       - one typo in comment is fixed.
      
      Link: http://lkml.kernel.org/r/20170803063822.48702-1-richard.weiyang@gmail.comSigned-off-by: NWei Yang <richard.weiyang@gmail.com>
      Acked-by: NTejun Heo <tj@kernel.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c568da28
    • W
      mm/vmstat: fix divide error at __fragmentation_index · 88d6ac40
      Wen Yang 提交于
      When order is -1 or too big, *1UL << order* will be 0, which will cause
      a divide error.  Although it seems that all callers of
      __fragmentation_index() will only do so with a valid order, the patch
      can make it more robust.
      
      Should prevent reoccurrences of
      https://bugzilla.kernel.org/show_bug.cgi?id=196555
      
      Link: http://lkml.kernel.org/r/1501751520-2598-1-git-send-email-wen.yang99@zte.com.cnSigned-off-by: NWen Yang <wen.yang99@zte.com.cn>
      Reviewed-by: NJiang Biao <jiang.biao2@zte.com.cn>
      Suggested-by: NVlastimil Babka <vbabka@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      88d6ac40
    • M
      mm, hugetlb: do not allocate non-migrateable gigantic pages from movable zones · 79b63f12
      Michal Hocko 提交于
      alloc_gigantic_page doesn't consider movability of the gigantic hugetlb
      when scanning eligible ranges for the allocation.  As 1GB hugetlb pages
      are not movable currently this can break the movable zone assumption
      that all allocations are migrateable and as such break memory hotplug.
      
      Reorganize the code and use the standard zonelist allocations scheme
      that we use for standard hugetbl pages.  htlb_alloc_mask will ensure
      that only migratable hugetlb pages will ever see a movable zone.
      
      Link: http://lkml.kernel.org/r/20170803083549.21407-1-mhocko@kernel.org
      Fixes: 944d9fec ("hugetlb: add support for gigantic page allocation at runtime")
      Signed-off-by: NMichal Hocko <mhocko@suse.com>
      Reviewed-by: NMike Kravetz <mike.kravetz@oracle.com>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      79b63f12
    • A
      userfaultfd: provide pid in userfault msg - add feat union · a36985d3
      Andrea Arcangeli 提交于
      No ABI change, but this will make it more explicit to software that ptid
      is only available if requested by passing UFFD_FEATURE_THREAD_ID to
      UFFDIO_API.  The fact it's a union will also self document it shouldn't
      be taken for granted there's a tpid there.
      
      Link: http://lkml.kernel.org/r/20170802165145.22628-7-aarcange@redhat.comSigned-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Alexey Perevalov <a.perevalov@samsung.com>
      Cc: Maxime Coquelin <maxime.coquelin@redhat.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a36985d3
    • A
      userfaultfd: provide pid in userfault msg · 9d4ac934
      Alexey Perevalov 提交于
      It could be useful for calculating downtime during postcopy live
      migration per vCPU.  Side observer or application itself will be
      informed about proper task's sleep during userfaultfd processing.
      
      Process's thread id is being provided when user requeste it by setting
      UFFD_FEATURE_THREAD_ID bit into uffdio_api.features.
      
      Link: http://lkml.kernel.org/r/20170802165145.22628-6-aarcange@redhat.comSigned-off-by: NAlexey Perevalov <a.perevalov@samsung.com>
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Maxime Coquelin <maxime.coquelin@redhat.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9d4ac934
    • A
      userfaultfd: call userfaultfd_unmap_prep only if __split_vma succeeds · 2376dd7c
      Andrea Arcangeli 提交于
      A __split_vma is not a worthy event to report, and it's definitely not a
      unmap so it would be incorrect to report unmap for the whole region to
      the userfaultfd manager if a __split_vma fails.
      
      So only call userfaultfd_unmap_prep after the __vma_splitting is over
      and do_munmap cannot fail anymore.
      
      Also add unlikely because it's better to optimize for the vast majority
      of apps that aren't using userfaultfd in a non cooperative way.  Ideally
      we should also find a way to eliminate the branch entirely if
      CONFIG_USERFAULTFD=n, but it would complicate things so stick to
      unlikely for now.
      
      Link: http://lkml.kernel.org/r/20170802165145.22628-5-aarcange@redhat.comSigned-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Alexey Perevalov <a.perevalov@samsung.com>
      Cc: Maxime Coquelin <maxime.coquelin@redhat.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2376dd7c
    • A
      userfaultfd: selftest: explicit failure if the SIGBUS test failed · d312cb1e
      Andrea Arcangeli 提交于
      Showing zero in the output isn't very self explanatory as a successful
      result.  Show a more explicit error output if the test fails.
      
      Link: http://lkml.kernel.org/r/20170802165145.22628-4-aarcange@redhat.comSigned-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Alexey Perevalov <a.perevalov@samsung.com>
      Cc: Maxime Coquelin <maxime.coquelin@redhat.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d312cb1e
    • A
      userfaultfd: selftest: exercise UFFDIO_COPY/ZEROPAGE -EEXIST · 67e80328
      Andrea Arcangeli 提交于
      This will retry the UFFDIO_COPY/ZEROPAGE to verify it returns -EEXIST at
      the first invocation and then later every 10 seconds.
      
      In the filebacked MAP_SHARED case this also verifies the -EEXIST
      triggered in the filesystem pagecache insertion, if the offset in the
      file was not a hole.
      
      shmem MAP_SHARED tries to index the newly allocated pagecache in the
      radix tree before checking the pagetable so it doesn't need any
      assistance to exercise that case.
      
      hugetlbfs checks the pmd to be not none before trying to index the
      hugetlbfs page in the radix tree, so it requires to run UFFDIO_COPY into
      an alias mapping (the alternative would be to use MADV_DONTNEED to only
      zap the pagetables, but that doesn't work on hugetlbfs).
      
      [akpm@linux-foundation.org: fix uffdio_zeropage(), per Mike Kravetz]
      Link: http://lkml.kernel.org/r/20170802165145.22628-3-aarcange@redhat.comSigned-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Alexey Perevalov <a.perevalov@samsung.com>
      Cc: Maxime Coquelin <maxime.coquelin@redhat.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      67e80328
    • P
      userfaultfd: selftest: add tests for UFFD_FEATURE_SIGBUS feature · 81aac3a1
      Prakash Sangappa 提交于
      Add tests for UFFD_FEATURE_SIGBUS feature.  The tests will verify signal
      delivery instead of userfault events.  Also, test use of UFFDIO_COPY to
      allocate memory and retry accessing monitored area after signal
      delivery.
      
      Also fix a bug in uffd_poll_thread() where 'uffd' is leaked.
      
      Link: http://lkml.kernel.org/r/1501552446-748335-3-git-send-email-prakash.sangappa@oracle.comSigned-off-by: NPrakash Sangappa <prakash.sangappa@oracle.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      81aac3a1
    • P
      mm: userfaultfd: add feature to request for a signal delivery · 2d6d6f5a
      Prakash Sangappa 提交于
      In some cases, userfaultfd mechanism should just deliver a SIGBUS signal
      to the faulting process, instead of the page-fault event.  Dealing with
      page-fault event using a monitor thread can be an overhead in these
      cases.  For example applications like the database could use the
      signaling mechanism for robustness purpose.
      
      Database uses hugetlbfs for performance reason.  Files on hugetlbfs
      filesystem are created and huge pages allocated using fallocate() API.
      Pages are deallocated/freed using fallocate() hole punching support.
      These files are mmapped and accessed by many processes as shared memory.
      The database keeps track of which offsets in the hugetlbfs file have
      pages allocated.
      
      Any access to mapped address over holes in the file, which can occur due
      to bugs in the application, is considered invalid and expect the process
      to simply receive a SIGBUS.  However, currently when a hole in the file
      is accessed via the mapped address, kernel/mm attempts to automatically
      allocate a page at page fault time, resulting in implicitly filling the
      hole in the file.  This may not be the desired behavior for applications
      like the database that want to explicitly manage page allocations of
      hugetlbfs files.
      
      Using userfaultfd mechanism with this support to get a signal, database
      application can prevent pages from being allocated implicitly when
      processes access mapped address over holes in the file.
      
      This patch adds UFFD_FEATURE_SIGBUS feature to userfaultfd mechnism to
      request for a SIGBUS signal.
      
      See following for previous discussion about the database requirement
      leading to this proposal as suggested by Andrea.
      
      http://www.spinics.net/lists/linux-mm/msg129224.html
      
      Link: http://lkml.kernel.org/r/1501552446-748335-2-git-send-email-prakash.sangappa@oracle.comSigned-off-by: NPrakash Sangappa <prakash.sangappa@oracle.com>
      Reviewed-by: NMike Rapoport <rppt@linux.vnet.ibm.com>
      Reviewed-by: NAndrea Arcangeli <aarcange@redhat.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2d6d6f5a
    • M
      mm: rename global_page_state to global_zone_page_state · c41f012a
      Michal Hocko 提交于
      global_page_state is error prone as a recent bug report pointed out [1].
      It only returns proper values for zone based counters as the enum it
      gets suggests.  We already have global_node_page_state so let's rename
      global_page_state to global_zone_page_state to be more explicit here.
      All existing users seems to be correct:
      
      $ git grep "global_page_state(NR_" | sed 's@.*(\(NR_[A-Z_]*\)).*@\1@' | sort | uniq -c
            2 NR_BOUNCE
            2 NR_FREE_CMA_PAGES
           11 NR_FREE_PAGES
            1 NR_KERNEL_STACK_KB
            1 NR_MLOCK
            2 NR_PAGETABLE
      
      This patch shouldn't introduce any functional change.
      
      [1] http://lkml.kernel.org/r/201707260628.v6Q6SmaS030814@www262.sakura.ne.jp
      
      Link: http://lkml.kernel.org/r/20170801134256.5400-2-hannes@cmpxchg.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c41f012a
    • M
      mm: shm: use new hugetlb size encoding definitions · 4da243ac
      Mike Kravetz 提交于
      Use the common definitions from hugetlb_encode.h header file for
      encoding hugetlb size definitions in shmget system call flags.
      
      In addition, move these definitions from the internal (kernel) to user
      (uapi) header file.
      
      Link: http://lkml.kernel.org/r/1501527386-10736-4-git-send-email-mike.kravetz@oracle.comSigned-off-by: NMike Kravetz <mike.kravetz@oracle.com>
      Suggested-by: NMatthew Wilcox <willy@infradead.org>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Davidlohr Bueso <dbueso@suse.de>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4da243ac
    • M
      mm: arch: consolidate mmap hugetlb size encodings · aafd4562
      Mike Kravetz 提交于
      A non-default huge page size can be encoded in the flags argument of the
      mmap system call.  The definitions for these encodings are in arch
      specific header files.  However, all architectures use the same values.
      
      Consolidate all the definitions in the primary user header file
      (uapi/linux/mman.h).  Include definitions for all known huge page sizes.
      Use the generic encoding definitions in hugetlb_encode.h as the basis
      for these definitions.
      
      Link: http://lkml.kernel.org/r/1501527386-10736-3-git-send-email-mike.kravetz@oracle.comSigned-off-by: NMike Kravetz <mike.kravetz@oracle.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Davidlohr Bueso <dbueso@suse.de>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      aafd4562
    • M
      mm: hugetlb: define system call hugetlb size encodings in single file · e652f694
      Mike Kravetz 提交于
      Patch series "Consolidate system call hugetlb page size encodings".
      
      These patches are the result of discussions in
      https://lkml.org/lkml/2017/3/8/548.  The following changes are made in the
      patch set:
      
      1) Put all the log2 encoded huge page size definitions in a common
         header file.  The idea is have a set of definitions that can be use as
         the basis for system call specific definitions such as MAP_HUGE_* and
         SHM_HUGE_*.
      
      2) Remove MAP_HUGE_* definitions in arch specific files.  All these
         definitions are the same.  Consolidate all definitions in the primary
         user header file (uapi/linux/mman.h).
      
      3) Remove SHM_HUGE_* definitions intended for user space from kernel
         header file, and add to user (uapi/linux/shm.h) header file.  Add
         definitions for all known huge page size encodings as in mmap.
      
      This patch (of 3):
      
      If hugetlb pages are requested in mmap or shmget system calls, a huge
      page size other than default can be requested.  This is accomplished by
      encoding the log2 of the huge page size in the upper bits of the flag
      argument.  asm-generic and arch specific headers all define the same
      values for these encodings.
      
      Put common definitions in a single header file.  The primary uapi header
      files for mmap and shm will use these definitions as a basis for
      definitions specific to those system calls.
      
      Link: http://lkml.kernel.org/r/1501527386-10736-2-git-send-email-mike.kravetz@oracle.comSigned-off-by: NMike Kravetz <mike.kravetz@oracle.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Davidlohr Bueso <dbueso@suse.de>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e652f694
    • J
    • J
      fs/sync.c: remove unnecessary NULL f_mapping check in sync_file_range · de23abd1
      Jeff Layton 提交于
      fsync codepath assumes that f_mapping can never be NULL, but
      sync_file_range has a check for that.
      
      Remove the one from sync_file_range as I don't see how you'd ever get a
      NULL pointer in here.
      
      Link: http://lkml.kernel.org/r/20170525110509.9434-1-jlayton@redhat.comSigned-off-by: NJeff Layton <jlayton@redhat.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      de23abd1
    • M
      userfaultfd: selftest: enable testing of UFFDIO_ZEROPAGE for shmem · 824f9739
      Mike Rapoport 提交于
      Link: http://lkml.kernel.org/r/1497939652-16528-8-git-send-email-rppt@linux.vnet.ibm.comSigned-off-by: NMike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Pavel Emelyanov <xemul@virtuozzo.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      824f9739
    • M
      userfaultfd: report UFFDIO_ZEROPAGE as available for shmem VMAs · ce53e8e6
      Mike Rapoport 提交于
      Now when shmem VMAs can be filled with zero page via userfaultfd we can
      report that UFFDIO_ZEROPAGE is available for those VMAs
      
      Link: http://lkml.kernel.org/r/1497939652-16528-7-git-send-email-rppt@linux.vnet.ibm.comSigned-off-by: NMike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Pavel Emelyanov <xemul@virtuozzo.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ce53e8e6
    • M
      userfaultfd: shmem: wire up shmem_mfill_zeropage_pte · 8fb44e54
      Mike Rapoport 提交于
      For shmem VMAs we can use shmem_mfill_zeropage_pte for UFFDIO_ZEROPAGE
      
      Link: http://lkml.kernel.org/r/1497939652-16528-6-git-send-email-rppt@linux.vnet.ibm.comSigned-off-by: NMike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Pavel Emelyanov <xemul@virtuozzo.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8fb44e54
    • M
      userfaultfd: mcopy_atomic: introduce mfill_atomic_pte helper · 3217d3c7
      Mike Rapoport 提交于
      Shuffle the code a bit to improve readability.
      
      Link: http://lkml.kernel.org/r/1497939652-16528-5-git-send-email-rppt@linux.vnet.ibm.comSigned-off-by: NMike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Pavel Emelyanov <xemul@virtuozzo.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3217d3c7
    • M
      userfaultfd: shmem: add shmem_mfill_zeropage_pte for userfaultfd support · 8d103963
      Mike Rapoport 提交于
      shmem_mfill_zeropage_pte is the low level routine that implements the
      userfaultfd UFFDIO_ZEROPAGE command.  Since for shmem mappings zero
      pages are always allocated and accounted, the new method is a slight
      extension of the existing shmem_mcopy_atomic_pte.
      
      Link: http://lkml.kernel.org/r/1497939652-16528-4-git-send-email-rppt@linux.vnet.ibm.comSigned-off-by: NMike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Pavel Emelyanov <xemul@virtuozzo.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8d103963
    • M
      shmem: introduce shmem_inode_acct_block · 0f079694
      Mike Rapoport 提交于
      The shmem_acct_block and the update of used_blocks are following one
      another in all the places they are used.  Combine these two into a
      helper function.
      
      Link: http://lkml.kernel.org/r/1497939652-16528-3-git-send-email-rppt@linux.vnet.ibm.comSigned-off-by: NMike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Pavel Emelyanov <xemul@virtuozzo.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0f079694