1. 01 Dec 2019, 3 commits
  2. 07 Nov 2019, 1 commit
    • mm: thp: handle page cache THP correctly in PageTransCompoundMap · 169226f7
      Authored by Yang Shi
      We have a use case that uses tmpfs as the QEMU memory backend, and we
      would like to take advantage of THP as well.  But our testing shows
      the EPT is not PMD mapped even though the underlying THP pages are
      PMD mapped on the host.  The number shown by
      /sys/kernel/debug/kvm/largepages is much smaller than the number of
      PMD mapped shmem pages, as below:
      
        7f2778200000-7f2878200000 rw-s 00000000 00:14 262232 /dev/shm/qemu_back_mem.mem.Hz2hSf (deleted)
        Size:            4194304 kB
        [snip]
        AnonHugePages:         0 kB
        ShmemPmdMapped:   579584 kB
        [snip]
        Locked:                0 kB
      
        cat /sys/kernel/debug/kvm/largepages
        12
      
      And some benchmarks do worse than with anonymous THPs.
      
      By digging into the code we figured out that commit 127393fb ("mm:
      thp: kvm: fix memory corruption in KVM with THP enabled") checks
      whether there is a single PTE mapping on the page for anonymous THP
      when setting up the EPT map.  But the _mapcount < 0 check doesn't
      work for page cache THP, since every subpage of a page cache THP gets
      its _mapcount incremented once the THP is PMD mapped, so
      PageTransCompoundMap() always returns false for page cache THP.  This
      prevents KVM from setting up PMD mapped EPT entries.
      
      So we need to handle page cache THP correctly.  However, when a page
      cache THP's PMD gets split, the kernel just removes the mapping
      instead of setting up PTE mappings the way anonymous THP does.  And
      before KVM calls get_user_pages(), the subpages may get PTE mapped
      even though the page is still a THP, since the page cache THP may be
      mapped by other processes in the meantime.

      So check the subpage's _mapcount as well as whether the THP has been
      PTE mapped.  Although this may report some false negatives (PTE
      mappings set up by other processes), making the check fully accurate
      does not look trivial.
      
      With this fix, /sys/kernel/debug/kvm/largepages shows that a
      reasonable number of pages are PMD mapped by EPT, as below:
      
        7fbeaee00000-7fbfaee00000 rw-s 00000000 00:14 275464 /dev/shm/qemu_back_mem.mem.SKUvat (deleted)
        Size:            4194304 kB
        [snip]
        AnonHugePages:         0 kB
        ShmemPmdMapped:   557056 kB
        [snip]
        Locked:                0 kB
      
        cat /sys/kernel/debug/kvm/largepages
        271
      
      And the benchmarks perform the same as with anonymous THP.
      
      [yang.shi@linux.alibaba.com: v4]
        Link: http://lkml.kernel.org/r/1571865575-42913-1-git-send-email-yang.shi@linux.alibaba.com
      Link: http://lkml.kernel.org/r/1571769577-89735-1-git-send-email-yang.shi@linux.alibaba.com
      Fixes: dd78fedd ("rmap: support file thp")
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Reported-by: Gang Deng <gavin.dg@linux.alibaba.com>
      Tested-by: Gang Deng <gavin.dg@linux.alibaba.com>
      Suggested-by: Hugh Dickins <hughd@google.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: <stable@vger.kernel.org>	[4.8+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      169226f7
  3. 06 Nov 2019, 1 commit
    • mm: Add write-protect and clean utilities for address space ranges · c5acad84
      Authored by Thomas Hellstrom
      Add two utilities to 1) write-protect and 2) clean all PTEs pointing
      into a range of an address space.
      The utilities are intended to aid in tracking dirty pages (either
      driver-allocated system memory or PCI device memory).
      The write-protect utility should be used in conjunction with
      page_mkwrite() and pfn_mkwrite() to trigger write page-faults on page
      accesses. Typically one would want to use this on sparse accesses into
      large memory regions. The clean utility should be used to utilize
      hardware dirtying functionality and avoid the overhead of page-faults,
      typically on large accesses into small memory regions.
      
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Signed-off-by: Thomas Hellstrom <thellstrom@vmware.com>
      Acked-by: Andrew Morton <akpm@linux-foundation.org>
      c5acad84
  4. 27 Sep 2019, 1 commit
    • mm: treewide: clarify pgtable_page_{ctor,dtor}() naming · b4ed71f5
      Authored by Mark Rutland
      The naming of pgtable_page_{ctor,dtor}() seems to have confused a few
      people, and until recently arm64 used these erroneously/pointlessly for
      other levels of page table.
      
      To make it incredibly clear that these only apply to the PTE level, and to
      align with the naming of pgtable_pmd_page_{ctor,dtor}(), let's rename them
      to pgtable_pte_page_{ctor,dtor}().
      
      These changes were generated with the following shell script:
      
      ----
      git grep -lw 'pgtable_page_.tor' | while read FILE; do
          sed -i '{s/pgtable_page_ctor/pgtable_pte_page_ctor/}' $FILE;
          sed -i '{s/pgtable_page_dtor/pgtable_pte_page_dtor/}' $FILE;
      done
      ----
      
      ... with the documentation re-flowed to remain under 80 columns, and
      whitespace fixed up in macros to keep backslashes aligned.
      
      There should be no functional change as a result of this patch.
      
      Link: http://lkml.kernel.org/r/20190722141133.3116-1-mark.rutland@arm.com
      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
      Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>	[m68k]
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b4ed71f5
  5. 25 Sep 2019, 8 commits
  6. 07 Sep 2019, 1 commit
  7. 19 Jul 2019, 2 commits
  8. 17 Jul 2019, 4 commits
  9. 16 Jul 2019, 2 commits
  10. 15 Jul 2019, 1 commit
  11. 13 Jul 2019, 5 commits
    • mm: security: introduce init_on_alloc=1 and init_on_free=1 boot options · 6471384a
      Authored by Alexander Potapenko
      Patch series "add init_on_alloc/init_on_free boot options", v10.
      
      Provide init_on_alloc and init_on_free boot options.
      
      These are aimed at preventing possible information leaks and making the
      control-flow bugs that depend on uninitialized values more deterministic.
      
      Enabling either of the options guarantees that the memory returned by the
      page allocator and SL[AU]B is initialized with zeroes.  SLOB allocator
      isn't supported at the moment, as its emulation of kmem caches complicates
      handling of SLAB_TYPESAFE_BY_RCU caches correctly.
      
      Enabling init_on_free also guarantees that pages and heap objects are
      initialized right after they're freed, so it won't be possible to access
      stale data by using a dangling pointer.
      
      As suggested by Michal Hocko, right now we don't let heap users
      disable initialization for certain allocations.  There's not enough
      evidence that doing so can speed up real-life cases, and introducing
      ways to opt out may result in things going out of control.
      
      This patch (of 2):
      
      The new options are needed to prevent possible information leaks and make
      control-flow bugs that depend on uninitialized values more deterministic.
      
      This is expected to be on-by-default on Android and Chrome OS.  And it
      gives the opportunity for anyone else to use it under distros too via the
      boot args.  (The init_on_free feature is regularly requested by folks
      where memory forensics is included in their threat models.)
      
      init_on_alloc=1 makes the kernel initialize newly allocated pages and heap
      objects with zeroes.  Initialization is done at allocation time at the
      places where checks for __GFP_ZERO are performed.
      
      init_on_free=1 makes the kernel initialize freed pages and heap objects
      with zeroes upon their deletion.  This helps to ensure sensitive data
      doesn't leak via use-after-free accesses.
      
      Both init_on_alloc=1 and init_on_free=1 guarantee that the allocator
      returns zeroed memory.  The two exceptions are slab caches with
      constructors and SLAB_TYPESAFE_BY_RCU flag.  Those are never
      zero-initialized to preserve their semantics.
      
      Both init_on_alloc and init_on_free default to zero, but those defaults
      can be overridden with CONFIG_INIT_ON_ALLOC_DEFAULT_ON and
      CONFIG_INIT_ON_FREE_DEFAULT_ON.
      
      If either SLUB poisoning or page poisoning is enabled, those options take
      precedence over init_on_alloc and init_on_free: initialization is only
      applied to unpoisoned allocations.
      
      Slowdown for the new features compared to init_on_free=0, init_on_alloc=0:
      
      hackbench, init_on_free=1:  +7.62% sys time (st.err 0.74%)
      hackbench, init_on_alloc=1: +7.75% sys time (st.err 2.14%)
      
      Linux build with -j12, init_on_free=1:  +8.38% wall time (st.err 0.39%)
      Linux build with -j12, init_on_free=1:  +24.42% sys time (st.err 0.52%)
      Linux build with -j12, init_on_alloc=1: -0.13% wall time (st.err 0.42%)
      Linux build with -j12, init_on_alloc=1: +0.57% sys time (st.err 0.40%)
      
      The slowdown for init_on_free=0, init_on_alloc=0 compared to the baseline
      is within the standard error.
      
      The new features are also going to pave the way for hardware memory
      tagging (e.g.  arm64's MTE), which will require both on_alloc and on_free
      hooks to set the tags for heap objects.  With MTE, tagging will have the
      same cost as memory initialization.
      
      Although init_on_free is rather costly, there are paranoid use-cases where
      in-memory data lifetime is desired to be minimized.  There are various
      arguments for/against the realism of the associated threat models, but
      given that we'll need the infrastructure for MTE anyway, and there are
      people who want wipe-on-free behavior no matter what the performance cost,
      it seems reasonable to include it in this series.
      
      [glider@google.com: v8]
        Link: http://lkml.kernel.org/r/20190626121943.131390-2-glider@google.com
      [glider@google.com: v9]
        Link: http://lkml.kernel.org/r/20190627130316.254309-2-glider@google.com
      [glider@google.com: v10]
        Link: http://lkml.kernel.org/r/20190628093131.199499-2-glider@google.com
      Link: http://lkml.kernel.org/r/20190617151050.92663-2-glider@google.com
      Signed-off-by: Alexander Potapenko <glider@google.com>
      Acked-by: Kees Cook <keescook@chromium.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>	[page and dmapool parts]
      Acked-by: James Morris <jamorris@linux.microsoft.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
      Cc: "Serge E. Hallyn" <serge@hallyn.com>
      Cc: Nick Desaulniers <ndesaulniers@google.com>
      Cc: Kostya Serebryany <kcc@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Sandeep Patil <sspatil@android.com>
      Cc: Laura Abbott <labbott@redhat.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Jann Horn <jannh@google.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Marco Elver <elver@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6471384a
    • mm/pgtable: drop pgtable_t variable from pte_fn_t functions · 8b1e0f81
      Authored by Anshuman Khandual
      Drop the pgtable_t variable from all implementations of pte_fn_t, as
      none of them uses it.  apply_to_pte_range() should stop computing it
      as well.  This should save some cycles.
      
      Link: http://lkml.kernel.org/r/1556803126-26596-1-git-send-email-anshuman.khandual@arm.com
      Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Acked-by: Matthew Wilcox <willy@infradead.org>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: <jglisse@redhat.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8b1e0f81
    • mm, debug_pagealloc: use a page type instead of page_ext flag · 3972f6bb
      Authored by Vlastimil Babka
      When debug_pagealloc is enabled, we currently allocate the page_ext
      array to mark guard pages with the PAGE_EXT_DEBUG_GUARD flag.  Now that
      we have the page_type field in struct page, we can use that instead, as
      guard pages are neither PageSlab nor mapped to userspace.  This reduces
      memory overhead when debug_pagealloc is enabled and there are no other
      features requiring the page_ext array.
      
      Link: http://lkml.kernel.org/r/20190603143451.27353-4-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3972f6bb
    • mm, debug_pagealloc: use static keys to enable debugging · 96a2b03f
      Authored by Vlastimil Babka
      Patch series "debug_pagealloc improvements".
      
      I have been recently debugging some pcplist corruptions, where it would be
      useful to perform struct page checks immediately as pages are allocated
      from and freed to pcplists, which is now only possible by rebuilding the
      kernel with CONFIG_DEBUG_VM (details in Patch 2 changelog).
      
      To make this kind of debugging simpler in future on a distro kernel, I
      have improved CONFIG_DEBUG_PAGEALLOC so that it has even smaller overhead
      when not enabled at boot time (Patch 1) and also when enabled (Patch 3),
      and extended it to perform the struct page checks more often when enabled
      (Patch 2).  Now it can be configured in when building a distro kernel
      without extra overhead, and debugging page use after free or double free
      can be enabled simply by rebooting with debug_pagealloc=on.
      
      This patch (of 3):
      
      CONFIG_DEBUG_PAGEALLOC has been redesigned by 031bc574
      ("mm/debug-pagealloc: make debug-pagealloc boottime configurable") to
      allow being always enabled in a distro kernel, but only perform its
      expensive functionality when booted with debug_pagealloc=on.  We can
      further reduce the overhead when not boot-enabled (including page
      allocator fast paths) using static keys.  This patch introduces one for
      debug_pagealloc core functionality, and another for the optional guard
      page functionality (enabled by booting with debug_guardpage_minorder=X).
      
      Link: http://lkml.kernel.org/r/20190603143451.27353-2-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      96a2b03f
    • mm/nvdimm: add is_ioremap_addr and use that to check ioremap address · 9bd3bb67
      Authored by Aneesh Kumar K.V
      Architectures like powerpc use different address ranges for ioremap
      and vmalloc mappings.  The memunmap() check used by the nvdimm layer
      wrongly used is_vmalloc_addr() to check for the ioremap range, which
      fails on ppc64.  This results in ppc64 not freeing the ioremap
      mapping.  The side effect of this is an unbind failure during module
      unload with the papr_scm nvdimm driver.
      
      Link: http://lkml.kernel.org/r/20190701134038.14165-1-aneesh.kumar@linux.ibm.com
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Fixes: b5beae5e ("powerpc/pseries: Add driver for PAPR SCM regions")
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9bd3bb67
  12. 03 Jul 2019, 2 commits
  13. 18 Jun 2019, 2 commits
    • mm: Add write-protect and clean utilities for address space ranges · 4fe51e9e
      Authored by Thomas Hellstrom
      Add two utilities to a) write-protect and b) clean all PTEs pointing
      into a range of an address space.
      The utilities are intended to aid in tracking dirty pages (either
      driver-allocated system memory or PCI device memory).
      The write-protect utility should be used in conjunction with
      page_mkwrite() and pfn_mkwrite() to trigger write page-faults on page
      accesses. Typically one would want to use this on sparse accesses into
      large memory regions. The clean utility should be used to utilize
      hardware dirtying functionality and avoid the overhead of page-faults,
      typically on large accesses into small memory regions.
      
      The added file "as_dirty_helpers.c" is initially listed as maintained by
      VMware under our DRM driver. If somebody would like it elsewhere,
      that's of course no problem.
      
      Notable changes since RFC:
      - Added comments to help avoid the usage of these function for VMAs
        it's not intended for. We also do advisory checks on the vm_flags and
        warn on illegal usage.
      - Perform the pte modifications the same way softdirty does.
      - Add mmu_notifier range invalidation calls.
      - Add a config option so that this code is not unconditionally included.
      - Tell the mmu_gather code about pending tlb flushes.
      
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Souptick Joarder <jrdr.linux@gmail.com>
      Cc: "Jérôme Glisse" <jglisse@redhat.com>
      Cc: linux-mm@kvack.org
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Thomas Hellstrom <thellstrom@vmware.com>
      Reviewed-by: Ralph Campbell <rcampbell@nvidia.com> #v1
      4fe51e9e
    • mm: Add an apply_to_pfn_range interface · 29875a52
      Authored by Thomas Hellstrom
      This is basically apply_to_page_range with added functionality:
      Allocating missing parts of the page table becomes optional, which
      means that the function can be guaranteed not to error if allocation
      is disabled. Also passing of the closure struct and callback function
      becomes different and more in line with how things are done elsewhere.
      
      Finally, we keep apply_to_page_range() as a wrapper around apply_to_pfn_range().
      
      The reason for not using the page-walk code is that we want to perform
      the page-walk on vmas pointing to an address space without requiring the
      mmap_sem to be held rather than on vmas belonging to a process with the
      mmap_sem held.
      
      Notable changes since RFC:
      Don't export apply_to_pfn_range.
      
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Souptick Joarder <jrdr.linux@gmail.com>
      Cc: "Jérôme Glisse" <jglisse@redhat.com>
      Cc: linux-mm@kvack.org
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Thomas Hellstrom <thellstrom@vmware.com>
      Reviewed-by: Ralph Campbell <rcampbell@nvidia.com> #v1
      29875a52
  14. 08 Jun 2019, 1 commit
  15. 15 May 2019, 6 commits
    • mm: move buddy list manipulations into helpers · b03641af
      Authored by Dan Williams
      In preparation for runtime randomization of the zone lists, take all
      (well, most of) the list_*() functions in the buddy allocator and put
      them in helper functions.  Provide a common control point for injecting
      additional behavior when freeing pages.
      
      [dan.j.williams@intel.com: fix buddy list helpers]
        Link: http://lkml.kernel.org/r/155033679702.1773410.13041474192173212653.stgit@dwillia2-desk3.amr.corp.intel.com
      [vbabka@suse.cz: remove del_page_from_free_area() migratetype parameter]
        Link: http://lkml.kernel.org/r/4672701b-6775-6efd-0797-b6242591419e@suse.cz
      Link: http://lkml.kernel.org/r/154899812264.3165233.5219320056406926223.stgit@dwillia2-desk3.amr.corp.intel.com
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Tested-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Keith Busch <keith.busch@intel.com>
      Cc: Robert Elliott <elliott@hpe.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b03641af
    • mm: introduce new vm_map_pages() and vm_map_pages_zero() API · a667d745
      Authored by Souptick Joarder
      Patch series "mm: Use vm_map_pages() and vm_map_pages_zero() API", v5.
      
      This patch (of 5):
      
      Previously, drivers had their own ways of mapping a range of kernel
      pages/memory into a user vma, done by invoking vm_insert_page()
      within a loop.
      
      As this pattern is common across different drivers, it can be generalized
      by creating new functions and using them across the drivers.
      
      vm_map_pages() is the API which can be used to map kernel memory/pages
      in drivers which have considered vm_pgoff.
      
      vm_map_pages_zero() is the API which can be used to map a range of kernel
      memory/pages in drivers which have not considered vm_pgoff.  vm_pgoff is
      passed as default 0 for those drivers.
      
      We _could_ then at a later point "fix" these drivers which are using
      vm_map_pages_zero() to behave according to the normal vm_pgoff
      offsetting simply by removing the _zero suffix on the function name.
      And if that causes regressions, it gives us an easy way to revert.
      
      Tested on Rockchip hardware and display is working, including talking to
      Lima via prime.
      
      Link: http://lkml.kernel.org/r/751cb8a0f4c3e67e95c58a3b072937617f338eea.1552921225.git.jrdr.linux@gmail.com
      Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
      Suggested-by: Russell King <linux@armlinux.org.uk>
      Suggested-by: Matthew Wilcox <willy@infradead.org>
      Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
      Tested-by: Heiko Stuebner <heiko@sntech.de>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Robin Murphy <robin.murphy@arm.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Thierry Reding <treding@nvidia.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Stefan Richter <stefanr@s5r6.in-berlin.de>
      Cc: Sandy Huang <hjc@rock-chips.com>
      Cc: David Airlie <airlied@linux.ie>
      Cc: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
      Cc: Joerg Roedel <joro@8bytes.org>
      Cc: Pawel Osciak <pawel@osciak.com>
      Cc: Kyungmin Park <kyungmin.park@samsung.com>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Mauro Carvalho Chehab <mchehab@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a667d745
    • mm: use mm_zero_struct_page from SPARC on all 64b architectures · 5470dea4
      Authored by Alexander Duyck
      Patch series "Deferred page init improvements", v7.
      
      This patchset is essentially a refactor of the page initialization logic
      that is meant to provide for better code reuse while providing a
      significant improvement in deferred page initialization performance.
      
      In my testing on an x86_64 system with 384GB of RAM I have seen the
      following.  In the case of regular memory initialization the deferred init
      time was decreased from 3.75s to 1.38s on average.  This amounts to a 172%
      improvement for the deferred memory initialization performance.
      
      I have called out the improvement observed with each patch.
      
      This patch (of 4):
      
      Use the same approach that was already in use on Sparc on all the
      architectures that support a 64b long.
      
      This is mostly motivated by the fact that 7 to 10 store/move instructions
      are likely always going to be faster than having to call into a function
      that is not specialized for handling page init.
      
      An added advantage to doing it this way is that the compiler can get away
      with combining writes in the __init_single_page call.  As a result the
      memset call will be reduced to only about 4 write operations, or at least
      that is what I am seeing with GCC 6.2 as the flags, LRU pointers, and
      count/mapcount seem to be cancelling out at least 4 of the 8 assignments
      on my system.
      
      One change I had to make to the function was to reduce the minimum
      struct page size to 56 bytes to support some powerpc64 configurations.
      
      This change should introduce no change on SPARC since it already had this
      code.  In the case of x86_64 I saw a reduction from 3.75s to 2.80s when
      initializing 384GB of RAM per node.  Pavel Tatashin tested on a system
      with Broadcom's Stingray CPU and 48GB of RAM and found that
      __init_single_page() takes 19.30ns / 64-byte struct page before this patch
      and with this patch it takes 17.33ns / 64-byte struct page.  Mike Rapoport
      ran a similar test on a OpenPower (S812LC 8348-21C) with Power8 processor
      and 128GB of RAM.  His results per 64-byte struct page were 4.68ns before,
      and 4.59ns after this patch.
      
      Link: http://lkml.kernel.org/r/20190405221213.12227.9392.stgit@localhost.localdomain
      Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <yi.z.zhang@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5470dea4
    • mm: introduce put_user_page*(), placeholder versions · fc1d8e7c
      Authored by John Hubbard
      A discussion of the overall problem is below.
      
      As mentioned in patch 0001, the steps to fix the problem are:
      
      1) Provide put_user_page*() routines, intended to be used
         for releasing pages that were pinned via get_user_pages*().
      
      2) Convert all of the call sites for get_user_pages*(), to
         invoke put_user_page*(), instead of put_page(). This involves dozens of
         call sites, and will take some time.
      
      3) After (2) is complete, use get_user_pages*() and put_user_page*() to
         implement tracking of these pages. This tracking will be separate from
         the existing struct page refcounting.
      
      4) Use the tracking and identification of these pages, to implement
         special handling (especially in writeback paths) when the pages are
         backed by a filesystem.
      
      Overview
      ========
      
      Some kernel components (file systems, device drivers) need to access
      memory that is specified via process virtual address.  For a long time,
      the API to achieve that was get_user_pages ("GUP") and its variations.
      However, GUP has critical limitations that have been overlooked; in
      particular, GUP does not interact correctly with filesystems in all
      situations.  That means that file-backed memory + GUP is a recipe for
      potential problems, some of which have already occurred in the field.
      
      GUP was first introduced for Direct IO (O_DIRECT), allowing filesystem
      code to get the struct page behind a virtual address and to let storage
      hardware perform a direct copy to or from that page.  This is a
      short-lived access pattern, and as such, the window for a concurrent
      writeback of a GUP'd page was small enough that there were not (we think)
      any reported problems.  Also, userspace was expected to understand and
      accept that Direct IO was not synchronized with memory-mapped access to
      that data, nor with any process address space changes such as munmap(),
      mremap(), etc.
      
      Over the years, more GUP uses have appeared (virtualization, device
      drivers, RDMA) that can keep the pages they get via GUP for a long period
      of time (seconds, minutes, hours, days, ...).  This long-term pinning
      makes an underlying design problem more obvious.
      
      In fact, there are a number of key problems inherent to GUP:
      
      Interactions with file systems
      ==============================
      
      File systems expect to be able to write back data, both to reclaim pages,
      and for data integrity.  Allowing other hardware (NICs, GPUs, etc) to gain
      write access to the file memory pages means that such hardware can dirty
      the pages, without the filesystem being aware.  This can, in some cases
      (depending on filesystem, filesystem options, block device, block device
      options, and other variables), lead to data corruption, and also to kernel
      bugs of the form:
      
          kernel BUG at /build/linux-fQ94TU/linux-4.4.0/fs/ext4/inode.c:1899!
          backtrace:
              ext4_writepage
              __writepage
              write_cache_pages
              ext4_writepages
              do_writepages
              __writeback_single_inode
              writeback_sb_inodes
              __writeback_inodes_wb
              wb_writeback
              wb_workfn
              process_one_work
              worker_thread
              kthread
              ret_from_fork
      
      ...which is due to the file system asserting that there are still buffer
      heads attached:
      
              ({                                                      \
                      BUG_ON(!PagePrivate(page));                     \
                      ((struct buffer_head *)page_private(page));     \
              })
      
      Dave Chinner's description of this is very clear:
      
          "The fundamental issue is that ->page_mkwrite must be called on every
          write access to a clean file backed page, not just the first one.
          How long the GUP reference lasts is irrelevant, if the page is clean
          and you need to dirty it, you must call ->page_mkwrite before it is
          marked writeable and dirtied. Every. Time."
      
       This is just one symptom of the larger design problem: real filesystems
       that actually write to a backing device do not actually support
      get_user_pages() being called on their pages, and letting hardware write
      directly to those pages--even though that pattern has been going on since
      about 2005 or so.
      
      Long term GUP
      =============
      
      Long term GUP is an issue when FOLL_WRITE is specified to GUP (so, a
      writeable mapping is created), and the pages are file-backed.  That can
      lead to filesystem corruption.  What happens is that when a file-backed
      page is being written back, it is first mapped read-only in all of the CPU
      page tables; the file system then assumes that nobody can write to the
      page, and that the page content is therefore stable.  Unfortunately, the
       GUP callers generally do not monitor changes to the CPU page tables; they
      instead assume that the following pattern is safe (it's not):
      
          get_user_pages()
      
          Hardware can keep a reference to those pages for a very long time,
          and write to it at any time.  Because "hardware" here means "devices
          that are not a CPU", this activity occurs without any interaction with
          the kernel's file system code.
      
          for each page
              set_page_dirty
              put_page()
      
      In fact, the GUP documentation even recommends that pattern.
      
      Anyway, the file system assumes that the page is stable (nothing is
      writing to the page), and that is a problem: stable page content is
      necessary for many filesystem actions during writeback, such as checksum,
      encryption, RAID striping, etc.  Furthermore, filesystem features like COW
       (copy on write) or snapshot also rely on being able to use a new page
       for that memory range inside the file.
      
      Corruption during write back is clearly possible here.  To solve that, one
      idea is to identify pages that have active GUP, so that we can use a
      bounce page to write stable data to the filesystem.  The filesystem would
      work on the bounce page, while any of the active GUP might write to the
      original page.  This would avoid the stable page violation problem, but
      note that it is only part of the overall solution, because other problems
      remain.
      
      Other filesystem features that need to replace the page with a new one can
      be inhibited for pages that are GUP-pinned.  This will, however, alter and
      limit some of those filesystem features.  The only fix for that would be
      to require GUP users to monitor and respond to CPU page table updates.
      Subsystems such as ODP and HMM do this, for example.  This aspect of the
      problem is still under discussion.
      
      Direct IO
      =========
      
      Direct IO can cause corruption, if userspace does Direct-IO that writes to
      a range of virtual addresses that are mmap'd to a file.  The pages written
      to are file-backed pages that can be under write back, while the Direct IO
      is taking place.  Here, Direct IO races with a write back: it calls GUP
      before page_mkclean() has replaced the CPU pte with a read-only entry.
      The race window is pretty small, which is probably why years have gone by
      before we noticed this problem: Direct IO is generally very quick, and
       tends to finish up before the filesystem gets around to doing anything
       with the page contents.  However, it's still a real problem.  The
       solution is
       to never let GUP return pages that are under write back, but instead,
      force GUP to take a write fault on those pages.  That way, GUP will
      properly synchronize with the active write back.  This does not change the
      required GUP behavior, it just avoids that race.
      
      Details
      =======
      
      Introduces put_user_page(), which simply calls put_page().  This provides
      a way to update all get_user_pages*() callers, so that they call
      put_user_page(), instead of put_page().
      
      Also introduces put_user_pages(), and a few dirty/locked variations, as a
      replacement for release_pages(), and also as a replacement for open-coded
      loops that release multiple pages.  These may be used for subsequent
      performance improvements, via batching of pages to be released.
      
      This is the first step of fixing a problem (also described in [1] and [2])
      with interactions between get_user_pages ("gup") and filesystems.
      
      Problem description: let's start with a bug report.  Below, is what
      happens sometimes, under memory pressure, when a driver pins some pages
      via gup, and then marks those pages dirty, and releases them.  Note that
      the gup documentation actually recommends that pattern.  The problem is
      that the filesystem may do a writeback while the pages were gup-pinned,
      and then the filesystem believes that the pages are clean.  So, when the
      driver later marks the pages as dirty, that conflicts with the
      filesystem's page tracking and results in a BUG(), like this one that I
      experienced:
      
          kernel BUG at /build/linux-fQ94TU/linux-4.4.0/fs/ext4/inode.c:1899!
          backtrace:
              ext4_writepage
              __writepage
              write_cache_pages
              ext4_writepages
              do_writepages
              __writeback_single_inode
              writeback_sb_inodes
              __writeback_inodes_wb
              wb_writeback
              wb_workfn
              process_one_work
              worker_thread
              kthread
              ret_from_fork
      
      ...which is due to the file system asserting that there are still buffer
      heads attached:
      
              ({                                                      \
                      BUG_ON(!PagePrivate(page));                     \
                      ((struct buffer_head *)page_private(page));     \
              })
      
      Dave Chinner's description of this is very clear:
      
          "The fundamental issue is that ->page_mkwrite must be called on
          every write access to a clean file backed page, not just the first
          one.  How long the GUP reference lasts is irrelevant, if the page is
          clean and you need to dirty it, you must call ->page_mkwrite before it
          is marked writeable and dirtied.  Every.  Time."
      
       This is just one symptom of the larger design problem: real filesystems
       that actually write to a backing device do not actually support
      get_user_pages() being called on their pages, and letting hardware write
      directly to those pages--even though that pattern has been going on since
      about 2005 or so.
      
       The steps to fix it are:
      
      1) (This patch): provide put_user_page*() routines, intended to be used
         for releasing pages that were pinned via get_user_pages*().
      
      2) Convert all of the call sites for get_user_pages*(), to
         invoke put_user_page*(), instead of put_page(). This involves dozens of
         call sites, and will take some time.
      
      3) After (2) is complete, use get_user_pages*() and put_user_page*() to
         implement tracking of these pages. This tracking will be separate from
         the existing struct page refcounting.
      
      4) Use the tracking and identification of these pages, to implement
         special handling (especially in writeback paths) when the pages are
         backed by a filesystem.
      
      [1] https://lwn.net/Articles/774411/ : "DMA and get_user_pages()"
      [2] https://lwn.net/Articles/753027/ : "The Trouble with get_user_pages()"
      
       Link: http://lkml.kernel.org/r/20190327023632.13307-2-jhubbard@nvidia.com
       Signed-off-by: John Hubbard <jhubbard@nvidia.com>
       Reviewed-by: Jan Kara <jack@suse.cz>
       Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>		[docs]
       Reviewed-by: Ira Weiny <ira.weiny@intel.com>
       Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
       Reviewed-by: Christoph Lameter <cl@linux.com>
       Tested-by: Ira Weiny <ira.weiny@intel.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fc1d8e7c
    • I
      mm/gup: change GUP fast to use flags rather than a write 'bool' · 73b0140b
       Authored by Ira Weiny
      To facilitate additional options to get_user_pages_fast() change the
      singular write parameter to be gup_flags.
      
      This patch does not change any functionality.  New functionality will
      follow in subsequent patches.
      
      Some of the get_user_pages_fast() call sites were unchanged because they
      already passed FOLL_WRITE or 0 for the write parameter.
      
      NOTE: It was suggested to change the ordering of the get_user_pages_fast()
      arguments to ensure that callers were converted.  This breaks the current
      GUP call site convention of having the returned pages be the final
      parameter.  So the suggestion was rejected.
      
      Link: http://lkml.kernel.org/r/20190328084422.29911-4-ira.weiny@intel.com
       Link: http://lkml.kernel.org/r/20190317183438.2057-4-ira.weiny@intel.com
       Signed-off-by: Ira Weiny <ira.weiny@intel.com>
       Reviewed-by: Mike Marshall <hubcap@omnibond.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Hogan <jhogan@kernel.org>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      73b0140b
    • I
      mm/gup: replace get_user_pages_longterm() with FOLL_LONGTERM · 932f4a63
       Authored by Ira Weiny
       Patch series "Add FOLL_LONGTERM to GUP fast and use it".
      
      HFI1, qib, and mthca, use get_user_pages_fast() due to its performance
      advantages.  These pages can be held for a significant time.  But
      get_user_pages_fast() does not protect against mapping FS DAX pages.
      
      Introduce FOLL_LONGTERM and use this flag in get_user_pages_fast() which
      retains the performance while also adding the FS DAX checks.  XDP has also
      shown interest in using this functionality.[1]
      
      In addition we change get_user_pages() to use the new FOLL_LONGTERM flag
      and remove the specialized get_user_pages_longterm call.
      
      [1] https://lkml.org/lkml/2019/3/19/939
      
      "longterm" is a relative thing and at this point is probably a misnomer.
      This is really flagging a pin which is going to be given to hardware and
      can't move.  I've thought of a couple of alternative names but I think we
      have to settle on if we are going to use FL_LAYOUT or something else to
      solve the "longterm" problem.  Then I think we can change the flag to a
      better name.
      
      Secondly, it depends on how often you are registering memory.  I have
      spoken with some RDMA users who consider MR in the performance path...
      For the overall application performance.  I don't have the numbers as the
      tests for HFI1 were done a long time ago.  But there was a significant
      advantage.  Some of which is probably due to the fact that you don't have
      to hold mmap_sem.
      
      Finally, architecturally I think it would be good for everyone to use
      *_fast.  There are patches submitted to the RDMA list which would allow
       the use of *_fast (they rework the use of mmap_sem) and as soon as they
      are accepted I'll submit a patch to convert the RDMA core as well.  Also
      to this point others are looking to use *_fast.
      
       As an aside, Jason pointed out in my previous submission that *_fast and
      *_unlocked look very much the same.  I agree and I think further cleanup
      will be coming.  But I'm focused on getting the final solution for DAX at
      the moment.
      
      This patch (of 7):
      
      This patch starts a series which aims to support FOLL_LONGTERM in
       get_user_pages_fast().  Some callers would like to do a longterm pin
       (user controlled pin) of pages with the fast variant of GUP for
       performance purposes.
      
      Rather than have a separate get_user_pages_longterm() call, introduce
      FOLL_LONGTERM and change the longterm callers to use it.
      
      This patch does not change any functionality.  In the short term
      "longterm" or user controlled pins are unsafe for Filesystems and FS DAX
      in particular has been blocked.  However, callers of get_user_pages_fast()
      were not "protected".
      
      FOLL_LONGTERM can _only_ be supported with get_user_pages[_fast]() as it
      requires vmas to determine if DAX is in use.
      
      NOTE: In merging with the CMA changes we opt to change the
      get_user_pages() call in check_and_migrate_cma_pages() to a call of
      __get_user_pages_locked() on the newly migrated pages.  This makes the
      code read better in that we are calling __get_user_pages_locked() on the
      pages before and after a potential migration.
      
       As a side effect some of the interfaces are cleaned up but this is not the
      primary purpose of the series.
      
      In review[1] it was asked:
      
      <quote>
      > This I don't get - if you do lock down long term mappings performance
      > of the actual get_user_pages call shouldn't matter to start with.
      >
      > What do I miss?
      
      A couple of points.
      
      First "longterm" is a relative thing and at this point is probably a
      misnomer.  This is really flagging a pin which is going to be given to
      hardware and can't move.  I've thought of a couple of alternative names
      but I think we have to settle on if we are going to use FL_LAYOUT or
      something else to solve the "longterm" problem.  Then I think we can
      change the flag to a better name.
      
       Second, it depends on how often you are registering memory.  I have spoken
      with some RDMA users who consider MR in the performance path...  For the
      overall application performance.  I don't have the numbers as the tests
      for HFI1 were done a long time ago.  But there was a significant
      advantage.  Some of which is probably due to the fact that you don't have
      to hold mmap_sem.
      
      Finally, architecturally I think it would be good for everyone to use
      *_fast.  There are patches submitted to the RDMA list which would allow
       the use of *_fast (they rework the use of mmap_sem) and as soon as they
      are accepted I'll submit a patch to convert the RDMA core as well.  Also
      to this point others are looking to use *_fast.
      
       As an aside, Jason pointed out in my previous submission that *_fast and
      *_unlocked look very much the same.  I agree and I think further cleanup
      will be coming.  But I'm focused on getting the final solution for DAX at
      the moment.
      
      </quote>
      
      [1] https://lore.kernel.org/lkml/20190220180255.GA12020@iweiny-DESK2.sc.intel.com/T/#md6abad2569f3bf6c1f03686c8097ab6563e94965
      
      [ira.weiny@intel.com: v3]
        Link: http://lkml.kernel.org/r/20190328084422.29911-2-ira.weiny@intel.com
      Link: http://lkml.kernel.org/r/20190328084422.29911-2-ira.weiny@intel.com
       Link: http://lkml.kernel.org/r/20190317183438.2057-2-ira.weiny@intel.com
       Signed-off-by: Ira Weiny <ira.weiny@intel.com>
       Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: James Hogan <jhogan@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Mike Marshall <hubcap@omnibond.com>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      932f4a63