1. 07 Sep 2017, 32 commits
    • mm, swap: VMA based swap readahead · ec560175
      Committed by Huang Ying
      The swap readahead is an important mechanism to reduce the swap in
      latency.  Although pure sequential memory access pattern isn't very
      popular for anonymous memory, the space locality is still considered
      valid.
      
      In the original swap readahead implementation, consecutive blocks in the
      swap device are read ahead based on a global estimate of spatial locality.
      But consecutive blocks in the swap device merely reflect the order of page
      reclaim; they don't necessarily reflect the access pattern in virtual
      memory.  And different tasks in the system may have different access
      patterns, which makes the global spatial locality estimate incorrect.
      
      In this patch, when a page fault occurs, the virtual pages near the fault
      address are read ahead, instead of the swap slots near the faulting swap
      slot in the swap device.  This avoids reading ahead unrelated swap slots.
      At the same time, swap readahead is changed to work per-VMA instead of
      globally, so that the different access patterns of different VMAs can be
      distinguished and a different readahead policy applied accordingly.  The
      original core readahead detection and scaling algorithm is reused, because
      it is an effective algorithm for detecting spatial locality.
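
      The sketch below is purely illustrative (it is not the code from this
      patch): it only shows the basic idea of reading ahead a window of virtual
      pages around the faulting address, clamped to the current VMA, rather
      than a window of consecutive slots on the swap device.  The
      swapin_one_page() helper is hypothetical.

      	static void vma_readahead_sketch(struct vm_area_struct *vma,
      					 unsigned long fault_addr,
      					 unsigned int win_pages)
      	{
      		unsigned long start = fault_addr - (win_pages / 2) * PAGE_SIZE;
      		unsigned long end = fault_addr + (win_pages / 2 + 1) * PAGE_SIZE;
      		unsigned long addr;

      		/* Never read ahead outside the faulting VMA. */
      		if (start < vma->vm_start)
      			start = vma->vm_start;
      		if (end > vma->vm_end)
      			end = vma->vm_end;

      		for (addr = start; addr < end; addr += PAGE_SIZE)
      			swapin_one_page(vma, addr);	/* hypothetical helper */
      	}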
      
      The test setup and results are as follows.
      
      Common test condition
      =====================
      
      Test Machine: Xeon E5 v3 (2 sockets, 72 threads, 32G RAM)
      Swap device: NVMe disk
      
      Micro-benchmark with combined access pattern
      ============================================
      
      vm-scalability sequential swap test case: 4 processes consume 50G of
      virtual memory space, repeating sequential memory writes for 300 seconds.
      The first round of writing triggers swap-out; the following rounds trigger
      sequential swap-in and swap-out.

      At the same time, the vm-scalability random swap test case runs in the
      background: 8 processes consume 30G of virtual memory space, repeating
      random memory writes for 300 seconds.  This triggers random swap-in in
      the background.

      This is a combined workload with sequential and random memory accesses at
      the same time.  The results (for the sequential workload) are as follows,
      
      			Base		Optimized
      			----		---------
      throughput		345413 KB/s	414029 KB/s (+19.9%)
      latency.average		97.14 us	61.06 us (-37.1%)
      latency.50th		2 us		1 us
      latency.60th		2 us		1 us
      latency.70th		98 us		2 us
      latency.80th		160 us		2 us
      latency.90th		260 us		217 us
      latency.95th		346 us		369 us
      latency.99th		1.34 ms		1.09 ms
      ra_hit%			52.69%		99.98%
      
      The original swap readahead algorithm is confused by the background random
      access workload, so its readahead hit rate is lower.  The VMA-based
      readahead algorithm works much better.
      
      Linpack
      =======
      
      The test memory size is bigger than RAM to trigger swapping.
      
      			Base		Optimized
      			----		---------
      elapsed_time		393.49 s	329.88 s (-16.2%)
      ra_hit%			86.21%		98.82%
      
      The Linpack scores of the base and optimized kernels show no visible
      change.  But the elapsed time is reduced and the readahead hit rate
      improved, so the optimized kernel performs better during the startup and
      teardown stages.  The high absolute readahead hit rate also shows that
      spatial locality is still valid in some practical workloads.
      
      Link: http://lkml.kernel.org/r/20170807054038.1843-4-ying.huang@intel.com
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Tim Chen <tim.c.chen@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ec560175
    • mm, swap: add swap readahead hit statistics · cbc65df2
      Committed by Huang Ying
      Patch series "mm, swap: VMA based swap readahead", v4.
      
      The swap readahead is an important mechanism to reduce the swap in
      latency.  Although pure sequential memory access pattern isn't very
      popular for anonymous memory, the space locality is still considered
      valid.
      
      In the original swap readahead implementation, consecutive blocks in the
      swap device are read ahead based on a global estimate of spatial locality.
      But consecutive blocks in the swap device merely reflect the order of page
      reclaim; they don't necessarily reflect the access pattern in virtual
      memory space.  And different tasks in the system may have different access
      patterns, which makes the global spatial locality estimate incorrect.
      
      In this patchset, when a page fault occurs, the virtual pages near the
      fault address are read ahead, instead of the swap slots near the faulting
      swap slot in the swap device.  This avoids reading ahead unrelated swap
      slots.  At the same time, swap readahead is changed to work per-VMA
      instead of globally, so that the different access patterns of different
      VMAs can be distinguished and a different readahead policy applied
      accordingly.  The original core readahead detection and scaling algorithm
      is reused, because it is an effective algorithm for detecting spatial
      locality.
      
      In addition to the swap readahead changes, some new sysfs interfaces are
      added to show the efficiency of the readahead algorithm and some other
      swap statistics.

      The new implementation will incur more small random reads.  On SSD, the
      improved accuracy of the estimation and readahead target should outweigh
      the potential extra overhead, as illustrated in the test results below.
      But on HDD, the overhead may outweigh the benefit, so the original
      implementation is used there by default.
      
      The test setup and results are as follows.
      
      Common test condition
      =====================
      
      Test Machine: Xeon E5 v3 (2 sockets, 72 threads, 32G RAM)
      Swap device: NVMe disk
      
      Micro-benchmark with combined access pattern
      ============================================
      
      vm-scalability sequential swap test case: 4 processes consume 50G of
      virtual memory space, repeating sequential memory writes for 300 seconds.
      The first round of writing triggers swap-out; the following rounds trigger
      sequential swap-in and swap-out.

      At the same time, the vm-scalability random swap test case runs in the
      background: 8 processes consume 30G of virtual memory space, repeating
      random memory writes for 300 seconds.  This triggers random swap-in in
      the background.

      This is a combined workload with sequential and random memory accesses at
      the same time.  The results (for the sequential workload) are as follows,
      
      			Base		Optimized
      			----		---------
      throughput		345413 KB/s	414029 KB/s (+19.9%)
      latency.average		97.14 us	61.06 us (-37.1%)
      latency.50th		2 us		1 us
      latency.60th		2 us		1 us
      latency.70th		98 us		2 us
      latency.80th		160 us		2 us
      latency.90th		260 us		217 us
      latency.95th		346 us		369 us
      latency.99th		1.34 ms		1.09 ms
      ra_hit%			52.69%		99.98%
      
      The original swap readahead algorithm is confused by the background random
      access workload, so its readahead hit rate is lower.  The VMA-based
      readahead algorithm works much better.
      
      Linpack
      =======
      
      The test memory size is bigger than RAM to trigger swapping.
      
      			Base		Optimized
      			----		---------
      elapsed_time		393.49 s	329.88 s (-16.2%)
      ra_hit%			86.21%		98.82%
      
      The Linpack scores of the base and optimized kernels show no visible
      change.  But the elapsed time is reduced and the readahead hit rate
      improved, so the optimized kernel performs better during the startup and
      teardown stages.  The high absolute readahead hit rate also shows that
      spatial locality is still valid in some practical workloads.
      
      This patch (of 5):
      
      The statistics for total readahead pages and total readahead hits are
      recorded and exported via the following sysfs interface.
      
      /sys/kernel/mm/swap/ra_hits
      /sys/kernel/mm/swap/ra_total
      
      With them, the efficiency of swap readahead can be measured, so that the
      swap readahead algorithm and parameters can be tuned accordingly.
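
      A minimal userspace sketch for consuming the two new files (paths as
      exported by this patch; error handling kept short):

      	#include <stdio.h>

      	static unsigned long long read_counter(const char *path)
      	{
      		unsigned long long val = 0;
      		FILE *f = fopen(path, "r");

      		if (f) {
      			if (fscanf(f, "%llu", &val) != 1)
      				val = 0;
      			fclose(f);
      		}
      		return val;
      	}

      	int main(void)
      	{
      		unsigned long long hits = read_counter("/sys/kernel/mm/swap/ra_hits");
      		unsigned long long total = read_counter("/sys/kernel/mm/swap/ra_total");

      		if (total)
      			printf("ra_hit%% = %.2f\n", 100.0 * hits / total);
      		else
      			printf("no readahead recorded\n");
      		return 0;
      	}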
      
      [akpm@linux-foundation.org: don't display swap stats if CONFIG_SWAP=n]
      Link: http://lkml.kernel.org/r/20170807054038.1843-2-ying.huang@intel.com
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Tim Chen <tim.c.chen@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cbc65df2
    • mm/shmem: add hugetlbfs support to memfd_create() · 749df87b
      Committed by Mike Kravetz
      This patch came out of discussions in this e-mail thread:
        http://lkml.kernel.org/r/1499357846-7481-1-git-send-email-mike.kravetz%40oracle.com
      
      The Oracle JVM team is developing a new garbage collection model.  This
      new model requires multiple mappings of the same anonymous memory.  One
      straightforward way to accomplish this is with memfd_create.  They can
      use the returned fd to create multiple mappings of the same memory.
      
      The JVM today has an option to use (static hugetlb) huge pages.  If this
      option is specified, they would like to use the same garbage collection
      model requiring multiple mappings to the same memory.  Using hugetlbfs,
      it is possible to explicitly mount a filesystem and specify file paths
      in order to get an fd that can be used for multiple mappings.  However,
      this introduces additional system admin work and coordination.
      
      Ideally they would like to get a hugetlbfs fd without requiring explicit
      mounting of a filesystem.  Today, mmap and shmget can make use of
      hugetlbfs without explicitly mounting a filesystem.  The patch adds this
      functionality to memfd_create.
      
      Add a new flag MFD_HUGETLB to memfd_create() that specifies that the file
      to be created resides in the hugetlbfs filesystem.  This is the generic
      hugetlbfs filesystem, not associated with any specific mount point.  As
      with other system calls that request hugetlbfs-backed pages, the huge page
      size can be encoded in the flags argument.
      
      hugetlbfs does not support sealing operations, therefore specifying
      MFD_ALLOW_SEALING with MFD_HUGETLB will result in EINVAL.
      
      Of course, the memfd_create man page would need updating if this type of
      functionality moves forward.
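
      A minimal sketch of the new usage (assumes a libc that exposes
      memfd_create() and kernel headers that define MFD_HUGETLB; on older
      systems the raw syscall can be used instead, and enough huge pages must
      be reserved, e.g. via vm.nr_hugepages):

      	#define _GNU_SOURCE
      	#include <stdio.h>
      	#include <unistd.h>
      	#include <sys/mman.h>
      	#include <linux/memfd.h>	/* MFD_HUGETLB */

      	int main(void)
      	{
      		size_t len = 2UL * 1024 * 1024;	/* one default-size (2MB) huge page */
      		int fd = memfd_create("gc-heap", MFD_HUGETLB);

      		if (fd < 0 || ftruncate(fd, len) < 0) {
      			perror("memfd_create/ftruncate");
      			return 1;
      		}

      		/* Two independent mappings of the same hugetlbfs-backed memory. */
      		char *a = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
      		char *b = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

      		if (a == MAP_FAILED || b == MAP_FAILED) {
      			perror("mmap");
      			return 1;
      		}
      		a[0] = 42;
      		printf("second mapping sees %d\n", b[0]);	/* prints 42 */
      		return 0;
      	}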
      
      Link: http://lkml.kernel.org/r/1502149672-7759-2-git-send-email-mike.kravetz@oracle.com
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      749df87b
    • userfaultfd: provide pid in userfault msg - add feat union · a36985d3
      Committed by Andrea Arcangeli
      No ABI change, but this will make it more explicit to software that ptid
      is only available if requested by passing UFFD_FEATURE_THREAD_ID to
      UFFDIO_API.  The fact that it's a union also self-documents that it
      shouldn't be taken for granted that a ptid is there.
      
      Link: http://lkml.kernel.org/r/20170802165145.22628-7-aarcange@redhat.com
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Alexey Perevalov <a.perevalov@samsung.com>
      Cc: Maxime Coquelin <maxime.coquelin@redhat.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a36985d3
    • userfaultfd: provide pid in userfault msg · 9d4ac934
      Committed by Alexey Perevalov
      This could be useful for calculating per-vCPU downtime during postcopy
      live migration.  A side observer, or the application itself, can be
      informed which task is sleeping during userfaultfd processing.

      The faulting process's thread id is provided when the user requests it by
      setting the UFFD_FEATURE_THREAD_ID bit in uffdio_api.features.
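
      A condensed sketch of the userspace side (struct and field names as
      introduced by this series; registration of a memory range and error
      handling are omitted):

      	#include <stdio.h>
      	#include <fcntl.h>
      	#include <unistd.h>
      	#include <sys/ioctl.h>
      	#include <sys/syscall.h>
      	#include <linux/userfaultfd.h>

      	int main(void)
      	{
      		int uffd = syscall(SYS_userfaultfd, O_CLOEXEC);
      		struct uffdio_api api = {
      			.api = UFFD_API,
      			.features = UFFD_FEATURE_THREAD_ID,	/* ask for thread ids */
      		};
      		struct uffd_msg msg;

      		if (uffd < 0 || ioctl(uffd, UFFDIO_API, &api) < 0)
      			return 1;

      		/* ... UFFDIO_REGISTER a range here, then in the handler: ... */
      		if (read(uffd, &msg, sizeof(msg)) == sizeof(msg) &&
      		    msg.event == UFFD_EVENT_PAGEFAULT)
      			printf("fault at 0x%llx from tid %u\n",
      			       (unsigned long long)msg.arg.pagefault.address,
      			       msg.arg.pagefault.feat.ptid);
      		return 0;
      	}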
      
      Link: http://lkml.kernel.org/r/20170802165145.22628-6-aarcange@redhat.com
      Signed-off-by: Alexey Perevalov <a.perevalov@samsung.com>
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Maxime Coquelin <maxime.coquelin@redhat.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9d4ac934
    • mm: userfaultfd: add feature to request for a signal delivery · 2d6d6f5a
      Committed by Prakash Sangappa
      In some cases, the userfaultfd mechanism should just deliver a SIGBUS
      signal to the faulting process, instead of a page-fault event.  Dealing
      with page-fault events using a monitor thread can be an overhead in these
      cases.  For example, applications like a database could use the signaling
      mechanism for robustness purposes.

      The database uses hugetlbfs for performance reasons.  Files on a hugetlbfs
      filesystem are created and huge pages allocated using the fallocate() API.
      Pages are deallocated/freed using fallocate() hole punching support.
      These files are mmapped and accessed by many processes as shared memory.
      The database keeps track of which offsets in the hugetlbfs file have
      pages allocated.

      Any access to a mapped address over holes in the file, which can occur due
      to bugs in the application, is considered invalid, and the expectation is
      that the process simply receives a SIGBUS.  However, currently when a hole
      in the file is accessed via the mapped address, the kernel/mm attempts to
      automatically allocate a page at page fault time, implicitly filling the
      hole in the file.  This may not be the desired behavior for applications
      like the database that want to explicitly manage page allocations of
      hugetlbfs files.

      Using the userfaultfd mechanism with this support to get a signal, a
      database application can prevent pages from being allocated implicitly
      when processes access mapped addresses over holes in the file.

      This patch adds the UFFD_FEATURE_SIGBUS feature to the userfaultfd
      mechanism to request a SIGBUS signal.
      
      See following for previous discussion about the database requirement
      leading to this proposal as suggested by Andrea.
      
      http://www.spinics.net/lists/linux-mm/msg129224.html
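
      A minimal sketch of how an application would opt in (addr/len stand for
      the application's hugetlbfs mapping; error handling trimmed):

      	#include <fcntl.h>
      	#include <unistd.h>
      	#include <sys/ioctl.h>
      	#include <sys/syscall.h>
      	#include <linux/userfaultfd.h>

      	/* After this returns, touching an unallocated hole inside the
      	 * registered range raises SIGBUS instead of queueing an event. */
      	static int uffd_sigbus_register(void *addr, size_t len)
      	{
      		int uffd = syscall(SYS_userfaultfd, O_CLOEXEC);
      		struct uffdio_api api = {
      			.api = UFFD_API,
      			.features = UFFD_FEATURE_SIGBUS,
      		};
      		struct uffdio_register reg = {
      			.range = { .start = (unsigned long)addr, .len = len },
      			.mode = UFFDIO_REGISTER_MODE_MISSING,
      		};

      		if (uffd < 0 || ioctl(uffd, UFFDIO_API, &api) < 0 ||
      		    ioctl(uffd, UFFDIO_REGISTER, &reg) < 0)
      			return -1;
      		return uffd;
      	}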
      
      Link: http://lkml.kernel.org/r/1501552446-748335-2-git-send-email-prakash.sangappa@oracle.com
      Signed-off-by: Prakash Sangappa <prakash.sangappa@oracle.com>
      Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2d6d6f5a
    • mm: rename global_page_state to global_zone_page_state · c41f012a
      Committed by Michal Hocko
      global_page_state is error prone, as a recent bug report pointed out [1].
      It only returns proper values for zone-based counters, as the enum it
      takes suggests.  We already have global_node_page_state, so let's rename
      global_page_state to global_zone_page_state to be more explicit here.
      All existing users seem to be correct:
      
      $ git grep "global_page_state(NR_" | sed 's@.*(\(NR_[A-Z_]*\)).*@\1@' | sort | uniq -c
            2 NR_BOUNCE
            2 NR_FREE_CMA_PAGES
           11 NR_FREE_PAGES
            1 NR_KERNEL_STACK_KB
            1 NR_MLOCK
            2 NR_PAGETABLE
      
      This patch shouldn't introduce any functional change.
      
      [1] http://lkml.kernel.org/r/201707260628.v6Q6SmaS030814@www262.sakura.ne.jp
      
      Link: http://lkml.kernel.org/r/20170801134256.5400-2-hannes@cmpxchg.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c41f012a
    • mm: shm: use new hugetlb size encoding definitions · 4da243ac
      Committed by Mike Kravetz
      Use the common definitions from hugetlb_encode.h header file for
      encoding hugetlb size definitions in shmget system call flags.
      
      In addition, move these definitions from the internal (kernel) to user
      (uapi) header file.
      
      Link: http://lkml.kernel.org/r/1501527386-10736-4-git-send-email-mike.kravetz@oracle.com
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Suggested-by: Matthew Wilcox <willy@infradead.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Davidlohr Bueso <dbueso@suse.de>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4da243ac
    • mm: arch: consolidate mmap hugetlb size encodings · aafd4562
      Committed by Mike Kravetz
      A non-default huge page size can be encoded in the flags argument of the
      mmap system call.  The definitions for these encodings are in arch
      specific header files.  However, all architectures use the same values.
      
      Consolidate all the definitions in the primary user header file
      (uapi/linux/mman.h).  Include definitions for all known huge page sizes.
      Use the generic encoding definitions in hugetlb_encode.h as the basis
      for these definitions.
      
      Link: http://lkml.kernel.org/r/1501527386-10736-3-git-send-email-mike.kravetz@oracle.com
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Davidlohr Bueso <dbueso@suse.de>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      aafd4562
    • mm: hugetlb: define system call hugetlb size encodings in single file · e652f694
      Committed by Mike Kravetz
      Patch series "Consolidate system call hugetlb page size encodings".
      
      These patches are the result of discussions in
      https://lkml.org/lkml/2017/3/8/548.  The following changes are made in the
      patch set:
      
      1) Put all the log2-encoded huge page size definitions in a common
         header file.  The idea is to have a set of definitions that can be used
         as the basis for system call specific definitions such as MAP_HUGE_*
         and SHM_HUGE_*.
      
      2) Remove MAP_HUGE_* definitions in arch specific files.  All these
         definitions are the same.  Consolidate all definitions in the primary
         user header file (uapi/linux/mman.h).
      
      3) Remove SHM_HUGE_* definitions intended for user space from kernel
         header file, and add to user (uapi/linux/shm.h) header file.  Add
         definitions for all known huge page size encodings as in mmap.
      
      This patch (of 3):
      
      If hugetlb pages are requested in the mmap or shmget system calls, a huge
      page size other than the default can be requested.  This is accomplished
      by encoding the log2 of the huge page size in the upper bits of the flags
      argument.  asm-generic and arch-specific headers all define the same
      values for these encodings.
      
      Put common definitions in a single header file.  The primary uapi header
      files for mmap and shm will use these definitions as a basis for
      definitions specific to those system calls.
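
      As a sketch of what the encoding looks like from userspace (the shift
      value of 26 matches the existing MAP_HUGE_SHIFT/SHM_HUGE_SHIFT
      convention; the fallback defines are only for older headers):

      	#include <stdio.h>
      	#include <sys/mman.h>

      	#ifndef MAP_HUGE_SHIFT
      	#define MAP_HUGE_SHIFT	26
      	#endif
      	#ifndef MAP_HUGE_2MB
      	#define MAP_HUGE_2MB	(21 << MAP_HUGE_SHIFT)	/* log2(2MB) = 21 */
      	#endif

      	int main(void)
      	{
      		size_t len = 2UL * 1024 * 1024;
      		void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
      			       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_2MB,
      			       -1, 0);

      		if (p == MAP_FAILED) {
      			perror("mmap(MAP_HUGETLB | MAP_HUGE_2MB)");
      			return 1;
      		}
      		printf("2MB huge page mapped at %p\n", p);
      		return 0;
      	}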
      
      Link: http://lkml.kernel.org/r/1501527386-10736-2-git-send-email-mike.kravetz@oracle.com
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Davidlohr Bueso <dbueso@suse.de>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e652f694
    • userfaultfd: shmem: add shmem_mfill_zeropage_pte for userfaultfd support · 8d103963
      Committed by Mike Rapoport
      shmem_mfill_zeropage_pte is the low level routine that implements the
      userfaultfd UFFDIO_ZEROPAGE command.  Since for shmem mappings zero
      pages are always allocated and accounted, the new method is a slight
      extension of the existing shmem_mcopy_atomic_pte.
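
      For reference, a sketch of the userspace command this routine serves (the
      uffd is assumed to already be registered for the shmem range; error
      handling trimmed):

      	#include <sys/ioctl.h>
      	#include <linux/userfaultfd.h>

      	/* Resolve a missing-page fault by installing a zero page. */
      	static int zerofill_fault(int uffd, unsigned long fault_addr,
      				  unsigned long page_size)
      	{
      		struct uffdio_zeropage zp = {
      			.range = {
      				.start = fault_addr & ~(page_size - 1),
      				.len = page_size,
      			},
      			.mode = 0,
      		};

      		return ioctl(uffd, UFFDIO_ZEROPAGE, &zp);
      	}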
      
      Link: http://lkml.kernel.org/r/1497939652-16528-4-git-send-email-rppt@linux.vnet.ibm.com
      Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Pavel Emelyanov <xemul@virtuozzo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8d103963
    • mm, THP, swap: add THP swapping out fallback counting · fe490cc0
      Committed by Huang Ying
      When swapping out a THP (Transparent Huge Page), instead of swapping out
      the THP as a whole, sometimes we have to fall back to splitting the THP
      into normal pages before swapping, because no free swap clusters are
      available, the cgroup limit is exceeded, etc.  To count the number of
      fallbacks, a new VM event THP_SWPOUT_FALLBACK is added, and it is counted
      when we fall back to splitting the THP.
      
      Link: http://lkml.kernel.org/r/20170724051840.2309-13-ying.huang@intel.com
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Ross Zwisler <ross.zwisler@intel.com> [for brd.c, zram_drv.c, pmem.c]
      Cc: Vishal L Verma <vishal.l.verma@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fe490cc0
    • mm, THP, swap: support splitting THP for THP swap out · 59807685
      Committed by Huang Ying
      After adding swap-out support for THP (Transparent Huge Page), it is
      possible that a THP in the swap cache (partly swapped out) needs to be
      split.  To split such a THP, the swap cluster backing the THP needs to be
      split too, that is, the CLUSTER_FLAG_HUGE flag needs to be cleared for the
      swap cluster.  This patch implements that.

      Because writing a THP to swap requires the THP to stay a huge page during
      the write, the PageWriteback flag is checked before splitting.
      
      Link: http://lkml.kernel.org/r/20170724051840.2309-8-ying.huang@intel.com
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Ross Zwisler <ross.zwisler@intel.com> [for brd.c, zram_drv.c, pmem.c]
      Cc: Vishal L Verma <vishal.l.verma@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      59807685
    • mm: test code to write THP to swap device as a whole · 225311a4
      Committed by Huang Ying
      To support delaying the splitting of a THP (Transparent Huge Page) until
      after it has been swapped out, we need to enhance the swap writing code to
      support writing a THP as a whole.  This will improve swap write IO
      performance.  As Ming Lei <ming.lei@redhat.com> pointed out, this should
      be based on multipage bvec support, which hasn't been merged yet.  So this
      patch is only for testing the functionality of the other patches in the
      series, and it will be reimplemented after multipage bvec support is
      merged.
      
      Link: http://lkml.kernel.org/r/20170724051840.2309-7-ying.huang@intel.com
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Ross Zwisler <ross.zwisler@intel.com> [for brd.c, zram_drv.c, pmem.c]
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Vishal L Verma <vishal.l.verma@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      225311a4
    • mm, THP, swap: make reuse_swap_page() works for THP swapped out · ba3c4ce6
      Committed by Huang Ying
      Now that THP (Transparent Huge Page) splitting can be delayed until after
      the THP has been swapped out, it is possible that some page table mappings
      of the THP have been turned into swap entries.  So reuse_swap_page() needs
      to check the swap count in addition to the map count as before.  This
      patch does that.

      In the huge PMD write-protect fault handler, in addition to the page map
      count, the swap count needs to be checked too, so the page lock needs to
      be acquired as well when calling reuse_swap_page(), in addition to the
      page table lock.
      
      [ying.huang@intel.com: silence a compiler warning]
        Link: http://lkml.kernel.org/r/87bmnzizjy.fsf@yhuang-dev.intel.com
      Link: http://lkml.kernel.org/r/20170724051840.2309-4-ying.huang@intel.com
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Ross Zwisler <ross.zwisler@intel.com> [for brd.c, zram_drv.c, pmem.c]
      Cc: Vishal L Verma <vishal.l.verma@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ba3c4ce6
    • mm, THP, swap: support to reclaim swap space for THP swapped out · e0709829
      Committed by Huang Ying
      The normal swap slot reclaiming can be done when the swap count reaches
      SWAP_HAS_CACHE.  But for the swap slot which is backing a THP, all swap
      slots backing one THP must be reclaimed together, because the swap slot
      may be used again when the THP is swapped out again later.  So the swap
      slots backing one THP can be reclaimed together when the swap count for
      all swap slots for the THP reached SWAP_HAS_CACHE.  In the patch, the
      functions to check whether the swap count for all swap slots backing one
      THP reached SWAP_HAS_CACHE are implemented and used when checking
      whether a swap slot can be reclaimed.
      
      To make it easier to determine whether a swap slot is backing a THP, a
      new swap cluster flag named CLUSTER_FLAG_HUGE is added to mark a swap
      cluster which is backing a THP (Transparent Huge Page).  Because swapping
      a THP in as a whole isn't supported yet, after the THP is deleted from the
      swap cache (for example, when swap-out has finished), the
      CLUSTER_FLAG_HUGE flag is cleared, so that the normal pages inside the THP
      can be swapped in individually.
      
      [ying.huang@intel.com: fix swap_page_trans_huge_swapped on HDD]
        Link: http://lkml.kernel.org/r/874ltsm0bi.fsf@yhuang-dev.intel.com
      Link: http://lkml.kernel.org/r/20170724051840.2309-3-ying.huang@intel.com
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Ross Zwisler <ross.zwisler@intel.com> [for brd.c, zram_drv.c, pmem.c]
      Cc: Vishal L Verma <vishal.l.verma@intel.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e0709829
    • mm: memcontrol: use int for event/state parameter in several functions · 04fecbf5
      Committed by Matthias Kaehlcke
      Several functions use an enum type as parameter for an event/state, but
      are called in some locations with an argument of a different enum type.
      Adjust the interface of these functions to reality by changing the
      parameter to int.
      
      This fixes a ton of enum-conversion warnings that are generated when
      building the kernel with clang.
      
      [mka@chromium.org: also change parameter type of inc/dec/mod_memcg_page_state()]
        Link: http://lkml.kernel.org/r/20170728213442.93823-1-mka@chromium.org
      Link: http://lkml.kernel.org/r/20170727211004.34435-1-mka@chromium.org
      Signed-off-by: Matthias Kaehlcke <mka@chromium.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Doug Anderson <dianders@chromium.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      04fecbf5
    • mm: remove nr_pages argument from pagevec_lookup{,_range}() · 397162ff
      Committed by Jan Kara
      All users of pagevec_lookup() and pagevec_lookup_range() now pass
      PAGEVEC_SIZE as a desired number of pages.
      
      Just drop the argument.
      
      Link: http://lkml.kernel.org/r/20170726114704.7626-11-jack@suse.cz
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      397162ff
    • mm: implement find_get_pages_range() · b947cee4
      Committed by Jan Kara
      Implement a variant of find_get_pages() that stops iterating at a given
      index.  This may be a substantial performance gain if the mapping is
      sparse.  See the following commit for details.  Furthermore, lots of users
      of this function (through pagevec_lookup()) actually want a range lookup,
      and all of them are currently open-coding it.
      
      Also create corresponding pagevec_lookup_range() function.
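
      After the whole series (including the later removal of the nr_pages
      argument), the intended in-kernel calling convention looks roughly like
      the following sketch (not code from this patch; the lookup advances the
      index for the caller):

      	pgoff_t index = start;
      	struct pagevec pvec;
      	int i;

      	pagevec_init(&pvec, 0);
      	while (pagevec_lookup_range(&pvec, mapping, &index, end)) {
      		for (i = 0; i < pagevec_count(&pvec); i++) {
      			struct page *page = pvec.pages[i];
      			/* ... process page ... */
      		}
      		pagevec_release(&pvec);
      	}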
      
      Link: http://lkml.kernel.org/r/20170726114704.7626-4-jack@suse.cz
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b947cee4
    • mm: make pagevec_lookup() update index · d72dc8a2
      Committed by Jan Kara
      Make pagevec_lookup() (and underlying find_get_pages()) update index to
      the next page where iteration should continue.  Most callers want this
      and also pagevec_lookup_tag() already does this.
      
      Link: http://lkml.kernel.org/r/20170726114704.7626-3-jack@suse.cz
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d72dc8a2
    • fscache: remove unused ->now_uncached callback · 26b433d0
      Committed by Jan Kara
      Patch series "Ranged pagevec lookup", v2.
      
      In this series I make pagevec_lookup() update the index (to be consistent
      with pagevec_lookup_tag() and also as a preparation for ranged lookups),
      provide a ranged variant of pagevec_lookup(), and use it in places where
      it makes sense.  This not only removes some common code but is also a
      measurable performance win for some use cases (see patch 4/10) where the
      radix tree is sparse and searching for and grabbing a page after the end
      of the range has measurable overhead.
      
      This patch (of 10):
      
      The callback doesn't ever get called.  Remove it.
      
      Link: http://lkml.kernel.org/r/20170726114704.7626-2-jack@suse.cz
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      26b433d0
    • mm, memory_hotplug: get rid of zonelists_mutex · b93e0f32
      Committed by Michal Hocko
      zonelists_mutex was introduced by commit 4eaf3f64 ("mem-hotplug: fix
      potential race while building zonelist for new populated zone") to
      protect zonelist building from races.  This is no longer needed though
      because both memory online and offline are fully serialized.  New users
      have grown since then.
      
      Notably, setup_per_zone_wmarks wants to prevent races between memory
      hotplug, khugepaged setup and manual min_free_kbytes updates via sysctl
      (see cfd3da1e ("mm: Serialize access to min_free_kbytes")).  Let's
      add a private lock for that purpose.  This will not prevent us from seeing
      a memory hotplug operation halfway through, but that shouldn't be a big
      deal because memory hotplug will update watermarks explicitly, so we will
      eventually get a full picture.  The lock just makes sure we won't race
      when updating watermarks, which could lead to weird results.
      
      Also __build_all_zonelists manipulates global data so add a private lock
      for it as well.  This doesn't seem to be necessary today but it is more
      robust to have a lock there.
      
      While we are at it make sure we document that memory online/offline
      depends on a full serialization either via mem_hotplug_begin() or
      device_lock.
      
      Link: http://lkml.kernel.org/r/20170721143915.14161-9-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Haicheng Li <haicheng.li@linux.intel.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b93e0f32
    • mm, memory_hotplug: drop zone from build_all_zonelists · 72675e13
      Committed by Michal Hocko
      build_all_zonelists gets a zone parameter to initialize zone's pagesets.
      There is only a single user which gives a non-NULL zone parameter and
      that one doesn't really need the rest of the build_all_zonelists (see
      commit 6dcd73d7 ("memory-hotplug: allocate zone's pcp before
      onlining pages")).
      
      Therefore remove setup_zone_pageset from build_all_zonelists and call it
      from its only user directly.  This also removes a pointless zonelists
      rebuild, which is always good.
      
      Link: http://lkml.kernel.org/r/20170721143915.14161-5-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      72675e13
    • mm, page_alloc: rip out ZONELIST_ORDER_ZONE · c9bff3ee
      Committed by Michal Hocko
      Patch series "cleanup zonelists initialization", v1.
      
      This is aimed at cleaning up the zonelists initialization code we have.
      The primary motivation was bug report [2], which got resolved, but the
      usage of stop_machine is just too ugly to live.  Most patches are
      straightforward, but three of them need special consideration.
      
      Patch 1 removes zone ordered zonelists completely.  I am CCing linux-api
      because this is a user visible change.  As I argue in the patch
      description I do not think we have a strong usecase for it these days.
      I have kept sysctl in place and warn into the log if somebody tries to
      configure zone lists ordering.  If somebody has a real usecase for it we
      can revert this patch but I do not expect anybody will actually notice
      runtime differences.  This patch is not strictly needed for the rest but
      it made patch 6 easier to implement.
      
      Patch 7 removes stop_machine from build_all_zonelists without adding any
      special synchronization between iterators and updater which I _believe_
      is acceptable as explained in the changelog.  I hope I am not missing
      anything.
      
      Patch 8 then removes zonelists_mutex, which is kind of ugly as well and
      not really needed AFAICS, but care should be taken when double checking
      my thinking.
      
      This patch (of 9):
      
      Supporting zone-ordered zonelists costs us a lot of code while its
      usefulness is arguable, if existent at all.  Mel has already made node
      ordering default on 64b systems.  32b systems are still using
      ZONELIST_ORDER_ZONE because it is considered better to fallback to a
      different NUMA node rather than consume precious lowmem zones.
      
      This argument is, however, weakened by the fact that memory reclaim has
      been reworked to be node rather than zone oriented.  This means that
      lowmem requests have to skip over all highmem pages on LRUs already and
      so zone ordering doesn't save the reclaim time much.  So the only
      advantage of the zone ordering is under a light memory pressure when
      highmem requests do not ever hit into lowmem zones and the lowmem
      pressure doesn't need to reclaim.
      
      Considering that 32b NUMA systems are rather suboptimal already and it
      is generally advisable to use a 64b kernel on such HW, I believe we
      should rather care about code maintainability and just get rid of
      ZONELIST_ORDER_ZONE altogether.  Keep the sysctl in place and warn if
      somebody tries to set zone ordering either from the kernel command line
      or the sysctl.
      
      [mhocko@suse.com: reading vm.numa_zonelist_order will never terminate]
      Link: http://lkml.kernel.org/r/20170721143915.14161-2-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
      Cc: <linux-api@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c9bff3ee
    • mm, memory_hotplug: display allowed zones in the preferred ordering · e5e68930
      Committed by Michal Hocko
      Prior to commit f1dd2cd1 ("mm, memory_hotplug: do not associate
      hotadded memory to zones until online") we used to allow changing the
      valid zone types of a memory block if it is adjacent to a different zone
      type.
      
      This fact was reflected in memoryNN/valid_zones by the ordering of
      printed zones.  The first one was default (echo online > memoryNN/state)
      and the other one could be onlined explicitly by online_{movable,kernel}.
      
      This behavior was removed by the said patch and as such the ordering was
      not all that important.  In most cases a kernel zone would be default
      anyway.  The only exception is movable_node handled by "mm,
      memory_hotplug: support movable_node for hotpluggable nodes".
      
      Let's reintroduce this behavior because a later patch will remove the
      zone overlap restriction, so the user will be allowed to online a kernel
      resp. movable block regardless of its placement.  The original behavior
      will then become significant again, because it would be non-trivial for
      users to see what the default zone to online into is.

      The implementation is really simple.  Pull zone selection out of
      move_pfn_range into a zone_for_pfn_range helper and use it in
      show_valid_zones to display the zone for default onlining, and then both
      kernel and movable if they are allowed.  The default online zone is not
      duplicated.
      
      Link: http://lkml.kernel.org/r/20170714121233.16861-2-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
      Cc: Yasuaki Ishimatsu <yasu.isimatu@gmail.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Kani Toshimitsu <toshi.kani@hpe.com>
      Cc: <slaoub@gmail.com>
      Cc: Daniel Kiper <daniel.kiper@oracle.com>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e5e68930
    • mm: track actual nr_scanned during shrink_slab() · d460acb5
      Committed by Chris Wilson
      Some shrinkers may only be able to free a batch of objects at a time, and
      so free more than the requested nr_to_scan in one pass, whilst other
      shrinkers may find themselves unable to scan as many objects as they
      counted, and so underreport.

      Account for the extra freed/scanned objects against the total number of
      objects we intend to scan; otherwise we may end up penalising the slab far
      more than intended.  Similarly, we want to add the underperforming scan to
      the deferred pass so that we try harder and harder in future passes.
      
      Link: http://lkml.kernel.org/r/20170822135325.9191-1-chris@chris-wilson.co.uk
      Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d460acb5
    • mm: add SLUB free list pointer obfuscation · 2482ddec
      Committed by Kees Cook
      This SLUB free list pointer obfuscation code is modified from Brad
      Spengler/PaX Team's code in the last public patch of grsecurity/PaX
      based on my understanding of the code.  Changes or omissions from the
      original code are mine and don't reflect the original grsecurity/PaX
      code.
      
      This adds a per-cache random value to SLUB caches that is XORed with
      their freelist pointer address and value.  This adds nearly zero
      overhead and frustrates the very common heap overflow exploitation
      method of overwriting freelist pointers.
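
      A sketch of the idea (names approximate, not the exact code from this
      patch): the stored freelist pointer is XORed with a per-cache secret and
      with the address where the pointer lives, so an attacker who overwrites
      or leaks the stored value learns nothing useful without the secret.

      	/* Encode and decode are the same operation: XOR is its own inverse. */
      	static inline void *obfuscate_freelist_ptr(unsigned long per_cache_random,
      						   void *ptr, unsigned long ptr_addr)
      	{
      		return (void *)((unsigned long)ptr ^ per_cache_random ^ ptr_addr);
      	}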
      
      A recent example of the attack is written up here:
      
        http://cyseclabs.com/blog/cve-2016-6187-heap-off-by-one-exploit
      
      and there is a section dedicated to the technique in the book "A Guide to
      Kernel Exploitation: Attacking the Core".
      
      This is based on patches by Daniel Micay, and refactored to minimize the
      use of #ifdef.
      
      With 200-count cycles of "hackbench -g 20 -l 1000" I saw the following
      run times:
      
       before:
       	mean 10.11882499999999999995
      	variance .03320378329145728642
      	stdev .18221905304181911048
      
        after:
      	mean 10.12654000000000000014
      	variance .04700556623115577889
      	stdev .21680767106160192064
      
      The difference gets lost in the noise, but if the above is to be taken
      literally, using CONFIG_FREELIST_HARDENED is 0.07% slower.
      
      Link: http://lkml.kernel.org/r/20170802180609.GA66807@beast
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Suggested-by: Daniel Micay <danielmicay@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Tycho Andersen <tycho@docker.com>
      Cc: Alexander Popov <alex.popov@linux.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2482ddec
    • dax: move all DAX radix tree defs to fs/dax.c · 527b19d0
      Committed by Ross Zwisler
      Now that we no longer insert struct page pointers in DAX radix trees the
      page cache code no longer needs to know anything about DAX exceptional
      entries.  Move all the DAX exceptional entry definitions from dax.h to
      fs/dax.c.
      
      Link: http://lkml.kernel.org/r/20170724170616.25810-6-ross.zwisler@linux.intel.com
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Suggested-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      527b19d0
    • dax: remove DAX code from page_cache_tree_insert() · d01ad197
      Committed by Ross Zwisler
      Now that we no longer insert struct page pointers in DAX radix trees we
      can remove the special casing for DAX in page_cache_tree_insert().
      
      This also allows us to make dax_wake_mapping_entry_waiter() local to
      fs/dax.c, removing it from dax.h.
      
      Link: http://lkml.kernel.org/r/20170724170616.25810-5-ross.zwisler@linux.intel.com
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Suggested-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d01ad197
    • dax: use common 4k zero page for dax mmap reads · 91d25ba8
      Committed by Ross Zwisler
      When servicing mmap() reads from file holes the current DAX code
      allocates a page cache page of all zeroes and places the struct page
      pointer in the mapping->page_tree radix tree.
      
      This has three major drawbacks:
      
      1) It consumes memory unnecessarily. For every 4k page that is read via
         a DAX mmap() over a hole, we allocate a new page cache page. This
         means that if you read 1GiB worth of pages, you end up using 1GiB of
         zeroed memory. This is easily visible by looking at the overall
         memory consumption of the system or by looking at /proc/[pid]/smaps:
      
      	7f62e72b3000-7f63272b3000 rw-s 00000000 103:00 12   /root/dax/data
      	Size:            1048576 kB
      	Rss:             1048576 kB
      	Pss:             1048576 kB
      	Shared_Clean:          0 kB
      	Shared_Dirty:          0 kB
      	Private_Clean:   1048576 kB
      	Private_Dirty:         0 kB
      	Referenced:      1048576 kB
      	Anonymous:             0 kB
      	LazyFree:              0 kB
      	AnonHugePages:         0 kB
      	ShmemPmdMapped:        0 kB
      	Shared_Hugetlb:        0 kB
      	Private_Hugetlb:       0 kB
      	Swap:                  0 kB
      	SwapPss:               0 kB
      	KernelPageSize:        4 kB
      	MMUPageSize:           4 kB
      	Locked:                0 kB
      
      2) It is slower than using a common zero page because each page fault
         has more work to do. Instead of just inserting a common zero page we
         have to allocate a page cache page, zero it, and then insert it. Here
         are the average latencies of dax_load_hole() as measured by ftrace on
         a random test box:
      
          Old method, using zeroed page cache pages:	3.4 us
          New method, using the common 4k zero page:	0.8 us
      
         This was the average latency over 1 GiB of sequential reads done by
         this simple fio script:
      
           [global]
           size=1G
           filename=/root/dax/data
           fallocate=none
           [io]
           rw=read
           ioengine=mmap
      
      3) The fact that we had to check for both DAX exceptional entries and
         for page cache pages in the radix tree made the DAX code more
         complex.
      
      Solve these issues by following the lead of the DAX PMD code and using a
      common 4k zero page instead.  As with the PMD code we will now insert a
      DAX exceptional entry into the radix tree instead of a struct page
      pointer which allows us to remove all the special casing in the DAX
      code.
      
      Note that we do still pretty aggressively check for regular pages in the
      DAX radix tree, especially where we take action based on the bits set in
      the page.  If we ever find a regular page in our radix tree now that
      most likely means that someone besides DAX is inserting pages (which has
      happened lots of times in the past), and we want to find that out early
      and fail loudly.
      
      This solution also removes the extra memory consumption.  Here is that
      same /proc/[pid]/smaps after 1GiB of reading from a hole with the new
      code:
      
      	7f2054a74000-7f2094a74000 rw-s 00000000 103:00 12   /root/dax/data
      	Size:            1048576 kB
      	Rss:                   0 kB
      	Pss:                   0 kB
      	Shared_Clean:          0 kB
      	Shared_Dirty:          0 kB
      	Private_Clean:         0 kB
      	Private_Dirty:         0 kB
      	Referenced:            0 kB
      	Anonymous:             0 kB
      	LazyFree:              0 kB
      	AnonHugePages:         0 kB
      	ShmemPmdMapped:        0 kB
      	Shared_Hugetlb:        0 kB
      	Private_Hugetlb:       0 kB
      	Swap:                  0 kB
      	SwapPss:               0 kB
      	KernelPageSize:        4 kB
      	MMUPageSize:           4 kB
      	Locked:                0 kB
      
      Overall system memory consumption is similarly improved.
      
      Another major change is that we remove dax_pfn_mkwrite() from our fault
      flow, and instead rely on the page fault itself to make the PTE dirty
      and writeable.  The following description from the patch adding the
      vm_insert_mixed_mkwrite() call explains this a little more:
      
         "To be able to use the common 4k zero page in DAX we need to have our
          PTE fault path look more like our PMD fault path where a PTE entry
          can be marked as dirty and writeable as it is first inserted rather
          than waiting for a follow-up dax_pfn_mkwrite() =>
          finish_mkwrite_fault() call.
      
          Right now we can rely on having a dax_pfn_mkwrite() call because we
          can distinguish between these two cases in do_wp_page():
      
                  case 1: 4k zero page => writable DAX storage
                  case 2: read-only DAX storage => writeable DAX storage
      
    This distinction is made via vm_normal_page(). vm_normal_page()
          returns false for the common 4k zero page, though, just as it does
          for DAX ptes. Instead of special casing the DAX + 4k zero page case
          we will simplify our DAX PTE page fault sequence so that it matches
          our DAX PMD sequence, and get rid of the dax_pfn_mkwrite() helper.
          We will instead use dax_iomap_fault() to handle write-protection
          faults.
      
          This means that insert_pfn() needs to follow the lead of
          insert_pfn_pmd() and allow us to pass in a 'mkwrite' flag. If
          'mkwrite' is set insert_pfn() will do the work that was previously
          done by wp_page_reuse() as part of the dax_pfn_mkwrite() call path"
      
      Link: http://lkml.kernel.org/r/20170724170616.25810-4-ross.zwisler@linux.intel.com
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      91d25ba8
    • R
      mm: add vm_insert_mixed_mkwrite() · b2770da6
      Committed by Ross Zwisler
      When servicing mmap() reads from file holes the current DAX code
      allocates a page cache page of all zeroes and places the struct page
      pointer in the mapping->page_tree radix tree.  This has three major
      drawbacks:
      
      1) It consumes memory unnecessarily. For every 4k page that is read via
         a DAX mmap() over a hole, we allocate a new page cache page. This
         means that if you read 1GiB worth of pages, you end up using 1GiB of
         zeroed memory.
      
      2) It is slower than using a common zero page because each page fault
         has more work to do. Instead of just inserting a common zero page we
         have to allocate a page cache page, zero it, and then insert it.
      
      3) The fact that we had to check for both DAX exceptional entries and
         for page cache pages in the radix tree made the DAX code more
         complex.
      
      This series solves these issues by following the lead of the DAX PMD
      code and using a common 4k zero page instead.  This reduces memory usage
      and decreases latencies for some workloads, and it simplifies the DAX
      code, removing over 100 lines in total.
      
      This patch (of 5):
      
      To be able to use the common 4k zero page in DAX we need to have our PTE
      fault path look more like our PMD fault path where a PTE entry can be
      marked as dirty and writeable as it is first inserted rather than
      waiting for a follow-up dax_pfn_mkwrite() => finish_mkwrite_fault()
      call.
      
      Right now we can rely on having a dax_pfn_mkwrite() call because we can
      distinguish between these two cases in do_wp_page():
      
      	case 1: 4k zero page => writable DAX storage
      	case 2: read-only DAX storage => writeable DAX storage
      
      This distinction is made via vm_normal_page().  vm_normal_page()
      returns false for the common 4k zero page, though, just as it does for
      DAX ptes.  Instead of special casing the DAX + 4k zero page case we will
      simplify our DAX PTE page fault sequence so that it matches our DAX PMD
      sequence, and get rid of the dax_pfn_mkwrite() helper.  We will instead
      use dax_iomap_fault() to handle write-protection faults.
      
      This means that insert_pfn() needs to follow the lead of
      insert_pfn_pmd() and allow us to pass in a 'mkwrite' flag.  If 'mkwrite'
      is set insert_pfn() will do the work that was previously done by
      wp_page_reuse() as part of the dax_pfn_mkwrite() call path.
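
      A minimal sketch of how a PTE fault path might use the new helper; the
      wrapper below is illustrative, and only vm_insert_mixed() and
      vm_insert_mixed_mkwrite() are the interfaces touched by this patch:

        #include <linux/mm.h>
        #include <linux/pfn_t.h>

        /* Illustrative wrapper: map @pfn at the faulting address.  For write
         * faults the PTE is made dirty and writeable up front, so no
         * follow-up ->pfn_mkwrite() round trip is needed. */
        static int dax_insert_pfn(struct vm_fault *vmf, pfn_t pfn, bool write)
        {
                int error;

                if (write)
                        error = vm_insert_mixed_mkwrite(vmf->vma, vmf->address, pfn);
                else
                        error = vm_insert_mixed(vmf->vma, vmf->address, pfn);

                return error ? VM_FAULT_SIGBUS : VM_FAULT_NOPAGE;
        }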
      
      Link: http://lkml.kernel.org/r/20170724170616.25810-2-ross.zwisler@linux.intel.com
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b2770da6
  2. 02 Sep, 2017 1 commit
  3. 01 Sep, 2017 7 commits
    • C
      ANDROID: binder: Add BINDER_GET_NODE_DEBUG_INFO ioctl · abcc6153
      Committed by Colin Cross
      The BINDER_GET_NODE_DEBUG_INFO ioctl returns debug info for a node.
      Each successive call that passes in the previous return value returns
      the next node.  The data will be used by libmemunreachable to mark
      pointers with kernel references as reachable.
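
      A hedged userspace sketch of that iteration pattern; the field names
      follow struct binder_node_debug_info from the binder uapi header, but
      treat the details here as assumptions rather than a reference:

        #include <stdio.h>
        #include <sys/ioctl.h>
        #include <linux/android/binder.h>

        /* Walk the binder nodes for this fd by feeding each returned ptr
         * back into the next call; a returned ptr of 0 means no more nodes. */
        static void dump_binder_nodes(int binder_fd)
        {
                struct binder_node_debug_info info = { .ptr = 0 };

                do {
                        if (ioctl(binder_fd, BINDER_GET_NODE_DEBUG_INFO, &info) < 0)
                                break;
                        if (info.ptr)
                                printf("node ptr=%llx cookie=%llx\n",
                                       (unsigned long long)info.ptr,
                                       (unsigned long long)info.cookie);
                } while (info.ptr);
        }
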
      Signed-off-by: Colin Cross <ccross@android.com>
      Signed-off-by: Martijn Coenen <maco@android.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      abcc6153
    • J
      include/linux/compiler.h: don't perform compiletime_assert with -O0 · c03567a8
      Committed by Joe Stringer
      Commit c7acec71 ("kernel.h: handle pointers to arrays better in
      container_of()") made use of __compiletime_assert() from container_of(),
      thus increasing the usage of this macro and allowing developers to
      notice type conflicts in their use of container_of() at compile time.
      
      However, the implementation of __compiletime_assert relies on compiler
      optimizations to report an error.  This means that if a developer uses
      "-O0" with any code that performs container_of(), the compiler will always
      report an error regardless of whether there is an actual problem in the
      code.
      
      This patch disables __compiletime_assert() when optimizations are
      disabled, allowing such code to compile with CFLAGS="-O0"; a sketch of
      the shape of the fix follows the example below.
      
      Example compilation failure:
      
      ./include/linux/compiler.h:547:38: error: call to `__compiletime_assert_94' declared with attribute error: pointer type mismatch in container_of()
        _compiletime_assert(condition, msg, __compiletime_assert_, __LINE__)
                                            ^
      ./include/linux/compiler.h:530:4: note: in definition of macro `__compiletime_assert'
          prefix ## suffix();    \
          ^~~~~~
      ./include/linux/compiler.h:547:2: note: in expansion of macro `_compiletime_assert'
        _compiletime_assert(condition, msg, __compiletime_assert_, __LINE__)
        ^~~~~~~~~~~~~~~~~~~
      ./include/linux/build_bug.h:46:37: note: in expansion of macro `compiletime_assert'
       #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
                                           ^~~~~~~~~~~~~~~~~~
      ./include/linux/kernel.h:860:2: note: in expansion of macro `BUILD_BUG_ON_MSG'
        BUILD_BUG_ON_MSG(!__same_type(*(ptr), ((type *)0)->member) && \
        ^~~~~~~~~~~~~~~~
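
      For reference, a rough sketch of the shape of the fix, following the
      do{}while(0) suggestion below; the optimized branch is elided and the
      exact hunk lives in include/linux/compiler.h:

        #ifdef __OPTIMIZE__
        /* ... existing definition, which relies on the compiler eliminating
         * the call to the __compiletime_error()-annotated function ... */
        #else
        /* Without optimizations the error branch can never be proven dead,
         * so make the assertion a no-op instead of a guaranteed failure. */
        # define __compiletime_assert(condition, msg, prefix, suffix) \
                do { } while (0)
        #endif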
      
      [akpm@linux-foundation.org: use do{}while(0), per Michal]
      Link: http://lkml.kernel.org/r/20170829230114.11662-1-joe@ovn.org
      Fixes: c7acec71 ("kernel.h: handle pointers to arrays better in container_of()")
      Signed-off-by: Joe Stringer <joe@ovn.org>
      Cc: Ian Abbott <abbotti@mev.co.uk>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Kees Cook <keescook@chromium.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c03567a8
    • J
      mm/mmu_notifier: kill invalidate_page · 5f32b265
      Committed by Jérôme Glisse
      The invalidate_page callback suffered from two pitfalls.  First, it used
      to happen after the page table lock was released, so a new page might
      have been set up before the call to invalidate_page() happened.
      
      This was fixed, in a somewhat roundabout way, by commit c7ab0d2f ("mm:
      convert try_to_unmap_one() to use page_vma_mapped_walk()"), which moved
      the callback under the page table lock.  That change, however, also
      broke several existing users of the mmu_notifier API that assumed they
      could sleep inside this callback.
      
      The second pitfall was that invalidate_page() was the only callback that
      did not take an address range for the invalidation; it was handed a
      single address and a page instead.  Many callback implementers assumed
      the page could never be a THP and thus failed to invalidate the
      appropriate range for THP.
      
      By killing this callback we unify the mmu_notifier callback API to
      always take a virtual address range as input.
      
      Finally, this also simplifies life for mmu_notifier users, as there are
      now two clear choices (see the sketch below):
        - invalidate_range_start()/end() callbacks, which are allowed to sleep
        - invalidate_range(), which cannot sleep but is called right after the
          page table update, under the page table lock
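
      A hedged sketch of what those two choices look like from a driver's
      point of view; the ops structure and callback bodies are illustrative,
      only the signatures come from include/linux/mmu_notifier.h:

        #include <linux/mmu_notifier.h>

        static void my_invalidate_range_start(struct mmu_notifier *mn,
                                              struct mm_struct *mm,
                                              unsigned long start,
                                              unsigned long end)
        {
                /* Allowed to sleep: take mutexes, wait for device access
                 * to the range to drain, etc. */
        }

        static void my_invalidate_range(struct mmu_notifier *mn,
                                        struct mm_struct *mm,
                                        unsigned long start,
                                        unsigned long end)
        {
                /* Must not sleep: called right after the page table update,
                 * while the page table lock is still held. */
        }

        static const struct mmu_notifier_ops my_notifier_ops = {
                .invalidate_range_start = my_invalidate_range_start,
                .invalidate_range       = my_invalidate_range,
        };
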
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Cc: Bernhard Held <berny156@gmx.de>
      Cc: Adam Borowski <kilobyte@angband.pl>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Wanpeng Li <kernellwp@gmail.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Takashi Iwai <tiwai@suse.de>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: axie <axie@amd.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5f32b265
    • J
      dax: update to new mmu_notifier semantic · a4d1a885
      Committed by Jérôme Glisse
      Replace all mmu_notifier_invalidate_page() calls by *_invalidate_range()
      and make sure it is bracketed by calls to *_invalidate_range_start()/end().
      
      Note that because we cannot presume the pmd or pte value, we have to
      assume the worst and unconditionally report an invalidation as
      happening.
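
      A hedged sketch of the bracketing pattern; the helper below is a
      stand-in for illustration, not the actual fs/dax.c change:

        #include <linux/mm.h>
        #include <linux/mmu_notifier.h>

        /* Invalidate one page's worth of mapping while keeping the notifier
         * calls properly bracketed. */
        static void invalidate_one_page(struct mm_struct *mm, unsigned long address)
        {
                unsigned long start = address & PAGE_MASK;
                unsigned long end = start + PAGE_SIZE;

                mmu_notifier_invalidate_range_start(mm, start, end);
                /* ... clear and flush the PTE under the page table lock ... */
                /* We cannot presume the old pte/pmd value, so report the
                 * invalidation unconditionally. */
                mmu_notifier_invalidate_range(mm, start, end);
                mmu_notifier_invalidate_range_end(mm, start, end);
        }
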
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Bernhard Held <berny156@gmx.de>
      Cc: Adam Borowski <kilobyte@angband.pl>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Wanpeng Li <kernellwp@gmail.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Takashi Iwai <tiwai@suse.de>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: axie <axie@amd.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a4d1a885
    • B
      <linux/uaccess.h>: Fix copy_in_user() declaration · f58e76c1
      Committed by Bart Van Assche
      copy_in_user() copies data from user-space address @from to user-
      space address @to. Hence declare both @from and @to as user-space
      pointers.
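
      A sketch of the corrected prototype; the actual declaration lives in
      include/linux/uaccess.h and the modifiers shown here are an
      approximation:

        /* Both ends of the copy are user-space buffers, so both pointers
         * carry the __user annotation. */
        unsigned long __must_check
        copy_in_user(void __user *to, const void __user *from, unsigned long n);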
      
      Fixes: commit d597580d ("generic ...copy_..._user primitives")
      Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      f58e76c1
    • C
      annotate RWF_... flags · ddef7ed2
      Committed by Christoph Hellwig
      [AV: added missing annotations in syscalls.h/compat.h]
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      ddef7ed2
    • A