1. 07 3月, 2016 8 次提交
    • T
      powerpc/ftrace: Add Kconfig & Make glue for mprofile-kernel · 8c50b72a
      Torsten Duwe 提交于
      Firstly we add logic to Kconfig to allow a user to choose if they want
      mprofile-kernel. This has to be user-selectable because only some
      current toolchains support it. If we enabled it unconditionally we would
      prevent some users from building the kernel entirely.
      
      Arguably it would be nice if we could detect if mprofile-kernel was
      available, and use it then. However that would violate the principle of
      least surprise because a user having choosen options such as live
      patching, would then see them quietly disabled at build time.
      
      We also make the user selectable option negative, ie. it disables when
      selected, so that allyesconfig continues to build on old toolchains.
      
      Once we've decided we do want to use mprofile-kernel, we then add a
      script which checks it actually works. That is because there are
      versions of gcc that accept the flag but don't generate correct code.
      
      Due to the way kconfig works, we can't error out when we detect a
      non-working toolchain. If we did a user would never be able to modify
      their config and run oldconfig - because the check would block oldconfig
      from running. Instead we emit a warning and add a bogus flag to CFLAGS
      so that the build will fail.
      Signed-off-by: NTorsten Duwe <duwe@suse.de>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      8c50b72a
    • T
      powerpc/ftrace: Add support for -mprofile-kernel ftrace ABI · 15308664
      Torsten Duwe 提交于
      The gcc switch -mprofile-kernel defines a new ABI for calling _mcount()
      very early in the function with minimal overhead.
      
      Although mprofile-kernel has been available since GCC 3.4, there were
      bugs which were only fixed recently. Currently it is known to work in
      GCC 4.9, 5 and 6.
      
      Additionally there are two possible code sequences generated by the
      flag, the first uses mflr/std/bl and the second is optimised to omit the
      std. Currently only gcc 6 has the optimised sequence. This patch
      supports both sequences.
      
      Initial work started by Vojtech Pavlik, used with permission.
      
      Key changes:
       - rework _mcount() to work for both the old and new ABIs.
       - implement new versions of ftrace_caller() and ftrace_graph_caller()
         which deal with the new ABI.
       - updates to __ftrace_make_nop() to recognise the new mcount calling
         sequence.
       - updates to __ftrace_make_call() to recognise the nop'ed sequence.
       - implement ftrace_modify_call().
       - updates to the module loader to surpress the toc save in the module
         stub when calling mcount with the new ABI.
      Reviewed-by: NBalbir Singh <bsingharora@gmail.com>
      Signed-off-by: NTorsten Duwe <duwe@suse.de>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      15308664
    • T
      powerpc/ftrace: Use $(CC_FLAGS_FTRACE) when disabling ftrace · 9a7841ae
      Torsten Duwe 提交于
      Rather than open-coding -pg whereever we want to disable ftrace, use the
      existing $(CC_FLAGS_FTRACE) variable.
      
      This has the advantage that it will work in future when we use a
      different set of flags to enable ftrace.
      Signed-off-by: NTorsten Duwe <duwe@suse.de>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      9a7841ae
    • T
      powerpc/ftrace: Use generic ftrace_modify_all_code() · c96f8385
      Torsten Duwe 提交于
      Convert powerpc's arch_ftrace_update_code() from its own version to use
      the generic default functionality (without stop_machine -- our
      instructions are properly aligned and the replacements atomic).
      
      With this we gain error checking and the much-needed function_trace_op
      handling.
      Reviewed-by: NBalbir Singh <bsingharora@gmail.com>
      Reviewed-by: NKamalesh Babulal <kamalesh@linux.vnet.ibm.com>
      Signed-off-by: NTorsten Duwe <duwe@suse.de>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      c96f8385
    • M
      powerpc/module: Create a special stub for ftrace_caller() · 336a7b5d
      Michael Ellerman 提交于
      In order to support the new -mprofile-kernel ABI, we need to be able to
      call from the module back to ftrace_caller() (in the kernel) without
      using the module's r2. That is because the function in this module which
      is calling ftrace_caller() may not have setup r2, if it doesn't
      otherwise need it (ie. it accesses no globals).
      
      To make that work we add a new stub which is used for calling
      ftrace_caller(), which uses the kernel toc instead of the module toc.
      Reviewed-by: NBalbir Singh <bsingharora@gmail.com>
      Reviewed-by: NTorsten Duwe <duwe@suse.de>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      336a7b5d
    • M
      powerpc/module: Mark module stubs with a magic value · f17c4e01
      Michael Ellerman 提交于
      When a module is loaded, calls out to the kernel go via a stub which is
      generated at runtime. One of these stubs is used to call _mcount(),
      which is the default target of tracing calls generated by the compiler
      with -pg.
      
      If dynamic ftrace is enabled (which it typically is), another stub is
      used to call ftrace_caller(), which is the target of tracing calls when
      ftrace is actually active.
      
      ftrace then wants to disable the calls to _mcount() at module startup,
      and enable/disable the calls to ftrace_caller() when enabling/disabling
      tracing - all of these it does by patching the code.
      
      As part of that code patching, the ftrace code wants to confirm that the
      branch it is about to modify, is in fact a call to a module stub which
      calls _mcount() or ftrace_caller().
      
      Currently it does that by inspecting the instructions and confirming
      they are what it expects. Although that works, the code to do it is
      pretty intricate because it requires lots of knowledge about the exact
      format of the stub.
      
      We can make that process easier by marking the generated stubs with a
      magic value, and then looking for that magic value. Altough this is not
      as rigorous as the current method, I believe it is sufficient in
      practice.
      Reviewed-by: NBalbir Singh <bsingharora@gmail.com>
      Reviewed-by: NTorsten Duwe <duwe@suse.de>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      f17c4e01
    • M
      powerpc/module: Only try to generate the ftrace_caller() stub once · 136cd345
      Michael Ellerman 提交于
      Currently we generate the module stub for ftrace_caller() at the bottom
      of apply_relocate_add(). However apply_relocate_add() is potentially
      called more than once per module, which means we will try to generate
      the ftrace_caller() stub multiple times.
      
      Although the current code deals with that correctly, ie. it only
      generates a stub the first time, it would be clearer to only try to
      generate the stub once.
      
      Note also on first reading it may appear that we generate a different
      stub for each section that requires relocation, but that is not the
      case. The code in stub_for_addr() that searches for an existing stub
      uses sechdrs[me->arch.stubs_section], ie. the single stub section for
      this module.
      
      A cleaner approach is to only generate the ftrace_caller() stub once,
      from module_finalize(). Although the original code didn't check to see
      if the stub was actually generated correctly, it seems prudent to add a
      check, so do that. And an additional benefit is we can clean the ifdefs
      up a little.
      
      Finally we must propagate the const'ness of some of the pointers passed
      to module_finalize(), but that is also an improvement.
      Reviewed-by: NBalbir Singh <bsingharora@gmail.com>
      Reviewed-by: NTorsten Duwe <duwe@suse.de>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      136cd345
    • M
      powerpc: Create a helper for getting the kernel toc value · a5cab83c
      Michael Ellerman 提交于
      Move the logic to work out the kernel toc pointer into a header. This is
      a good cleanup, and also means we can use it elsewhere in future.
      Reviewed-by: NKamalesh Babulal <kamalesh@linux.vnet.ibm.com>
      Reviewed-by: NTorsten Duwe <duwe@suse.de>
      Reviewed-by: NBalbir Singh <bsingharora@gmail.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      Tested-by: NKamalesh Babulal <kamalesh@linux.vnet.ibm.com>
      a5cab83c
  2. 28 1月, 2016 2 次提交
    • A
      powerpc/mm: Fixup _HPAGE_CHG_MASK · 2d19fc63
      Aneesh Kumar K.V 提交于
      This was wrongly updated by commit 7aa9a23c ("powerpc, thp: remove
      infrastructure for handling splitting PMDs") during the last merge
      window. Fix it up.
      
      This could lead to incorrect behaviour in THP and/or mprotect(), at a
      minimum.
      
      Fixes: 7aa9a23c ("powerpc, thp: remove infrastructure for handling splitting PMDs")
      Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      2d19fc63
    • M
      powerpc/perf: Remove PPMU_HAS_SSLOT flag for Power8 · 370f06c8
      Madhavan Srinivasan 提交于
      Commit 7a786832 ("powerpc/perf: Add an explict flag indicating
      presence of SLOT field") introduced the PPMU_HAS_SSLOT flag to remove
      the assumption that MMCRA[SLOT] was present when PPMU_ALT_SIPR was not
      set.
      
      That commit's changelog also mentions that Power8 does not support
      MMCRA[SLOT]. However when the Power8 PMU support was merged, it
      errnoeously included the PPMU_HAS_SSLOT flag.
      
      So remove PPMU_HAS_SSLOT from the Power8 flags.
      
      mpe: On systems where MMCRA[SLOT] exists, the field occupies bits 37:39
      (IBM numbering). On Power8 bit 37 is reserved, and 38:39 overlap with
      the high bits of the Threshold Event Counter Mantissa. I am not aware of
      any published events which use the threshold counting mechanism, which
      would cause the mantissa bits to be set. So in practice this bug is
      unlikely to trigger.
      
      Fixes: e05b9b9e ("powerpc/perf: Power8 PMU support")
      Signed-off-by: NMadhavan Srinivasan <maddy@linux.vnet.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      370f06c8
  3. 27 1月, 2016 1 次提交
  4. 25 1月, 2016 1 次提交
  5. 23 1月, 2016 1 次提交
    • A
      wrappers for ->i_mutex access · 5955102c
      Al Viro 提交于
      parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested},
      inode_foo(inode) being mutex_foo(&inode->i_mutex).
      
      Please, use those for access to ->i_mutex; over the coming cycle
      ->i_mutex will become rwsem, with ->lookup() done with it held
      only shared.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      5955102c
  6. 21 1月, 2016 6 次提交
    • S
      powerpc: Remove newly added extra definition of pmd_dirty · 0e2bce74
      Stephen Rothwell 提交于
      Commit d5d6a443 ("arch/powerpc/include/asm/pgtable-ppc64.h:
      add pmd_[dirty|mkclean] for THP") added a new identical definition
      of pmd_dirty(). Remove it again.
      
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NStephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      0e2bce74
    • A
      powerpc: Simplify module TOC handling · c153693d
      Alan Modra 提交于
      PowerPC64 uses the symbol .TOC. much as other targets use
      _GLOBAL_OFFSET_TABLE_. It identifies the value of the GOT pointer (or in
      powerpc parlance, the TOC pointer). Global offset tables are generally
      local to an executable or shared library, or in the kernel, module. Thus
      it does not make sense for a module to resolve a relocation against
      .TOC. to the kernel's .TOC. value. A module has its own .TOC., and
      indeed the powerpc64 module relocation processing ignores the kernel
      value of .TOC. and instead calculates a module-local value.
      
      This patch removes code involved in exporting the kernel .TOC., tweaks
      modpost to ignore an undefined .TOC., and the module loader to twiddle
      the section symbol so that .TOC. isn't seen as undefined.
      
      Note that if the kernel was compiled with -msingle-pic-base then ELFv2
      would not have function global entry code setting up r2. In that case
      the module call stubs would need to be modified to set up r2 using the
      kernel .TOC. value, requiring some of this code to be reinstated.
      
      mpe: Furthermore a change in binutils master (not yet released) causes
      the current way we handle the TOC to no longer work when building with
      MODVERSIONS=y and RELOCATABLE=n. The symptom is that modules can not be
      loaded due to there being no version found for TOC.
      
      Cc: stable@vger.kernel.org # 3.16+
      Signed-off-by: NAlan Modra <amodra@gmail.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      c153693d
    • C
      powerpc: Wire up copy_file_range() syscall · d7f9ee60
      Chandan Rajendra 提交于
      Test runs on a ppc64 BE guest succeeded using modified fstests.
      
      Also tested on ppc64 LE using a home made test - mpe.
      Signed-off-by: NChandan Rajendra <chandan@linux.vnet.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      d7f9ee60
    • C
      dma-mapping: always provide the dma_map_ops based implementation · e1c7e324
      Christoph Hellwig 提交于
      Move the generic implementation to <linux/dma-mapping.h> now that all
      architectures support it and remove the HAVE_DMA_ATTR Kconfig symbol now
      that everyone supports them.
      
      [valentinrothberg@gmail.com: remove leftovers in Kconfig]
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Aurelien Jacquiot <a-jacquiot@ti.com>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Haavard Skinnemoen <hskinnemoen@gmail.com>
      Cc: Hans-Christian Egtvedt <egtvedt@samfundet.no>
      Cc: Helge Deller <deller@gmx.de>
      Cc: James Hogan <james.hogan@imgtec.com>
      Cc: Jesper Nilsson <jesper.nilsson@axis.com>
      Cc: Koichi Yasutake <yasutake.koichi@jp.panasonic.com>
      Cc: Ley Foon Tan <lftan@altera.com>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Mikael Starvik <starvik@axis.com>
      Cc: Steven Miao <realmz6@gmail.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Joerg Roedel <jroedel@suse.de>
      Cc: Sebastian Ott <sebott@linux.vnet.ibm.com>
      Signed-off-by: NValentin Rothberg <valentinrothberg@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e1c7e324
    • D
      powerpc: enable UBSAN support · bf76f73c
      Daniel Axtens 提交于
      This hooks up UBSAN support for PowerPC.
      
      So far it's found some interesting cases where we don't properly sanitise
      input to shifts, including one in our futex handling.  Nothing critical,
      but interesting and worth fixing.
      
      [valentinrothberg@gmail.com: arch/powerpc/Kconfig: fix typo in select statement]
      Signed-off-by: NDaniel Axtens <dja@axtens.net>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Tested-by: NAndrew Donnellan <andrew.donnellan@au1.ibm.com>
      Acked-by: NMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: NValentin Rothberg <valentinrothberg@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bf76f73c
    • R
      powerpc/fadump: rename cpu_online_mask member of struct fadump_crash_info_header · a0512164
      Rasmus Villemoes 提交于
      The four cpumasks cpu_{possible,online,present,active}_bits are exposed
      readonly via the corresponding const variables cpu_xyz_mask.  But they are
      also accessible for arbitrary writing via the exposed functions
      set_cpu_xyz.  There's quite a bit of code throughout the kernel which
      iterates over or otherwise accesses these bitmaps, and having the access
      go via the cpu_xyz_mask variables is nowadays [1] simply a useless
      indirection.
      
      It may be that any problem in CS can be solved by an extra level of
      indirection, but that doesn't mean every extra indirection solves a
      problem.  In this case, it even necessitates some minor ugliness (see
      4/6).
      
      Patch 1/6 is new in v2, and fixes a build failure on ppc by renaming a
      struct member, to avoid problems when the identifier cpu_online_mask
      becomes a macro later in the series.  The next four patches eliminate the
      cpu_xyz_mask variables by simply exposing the actual bitmaps, after
      renaming them to discourage direct access - that still happens through
      cpu_xyz_mask, which are now simply macros with the same type and value as
      they used to have.
      
      After that, there's no longer any reason to have the setter functions be
      out-of-line: The boolean parameter is almost always a literal true or
      false, so by making them static inlines they will usually compile to one
      or two instructions.
      
      For a defconfig build on x86_64, bloat-o-meter says we save ~3000 bytes.
      We also save a little stack (stackdelta says 127 functions have a 16 byte
      smaller stack frame, while two grow by that amount).  Mostly because, when
      iterating over the mask, gcc typically loads the value of cpu_xyz_mask
      into a callee-saved register and from there into %rdi before each
      find_next_bit call - now it can just load the appropriate immediate
      address into %rdi before each call.
      
      [1] See Rusty's kind explanation
      http://thread.gmane.org/gmane.linux.kernel/2047078/focus=2047722 for
      some historic context.
      
      This patch (of 6):
      
      As preparation for eliminating the indirect access to the various global
      cpu_*_bits bitmaps via the pointer variables cpu_*_mask, rename the
      cpu_online_mask member of struct fadump_crash_info_header to simply
      online_mask, thus allowing cpu_online_mask to become a macro.
      Signed-off-by: NRasmus Villemoes <linux@rasmusvillemoes.dk>
      Acked-by: NMichael Ellerman <mpe@ellerman.id.au>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a0512164
  7. 17 1月, 2016 1 次提交
  8. 16 1月, 2016 6 次提交
    • D
      mm, dax, pmem: introduce pfn_t · 34c0fd54
      Dan Williams 提交于
      For the purpose of communicating the optional presence of a 'struct
      page' for the pfn returned from ->direct_access(), introduce a type that
      encapsulates a page-frame-number plus flags.  These flags contain the
      historical "page_link" encoding for a scatterlist entry, but can also
      denote "device memory".  Where "device memory" is a set of pfns that are
      not part of the kernel's linear mapping by default, but are accessed via
      the same memory controller as ram.
      
      The motivation for this new type is large capacity persistent memory
      that needs struct page entries in the 'memmap' to support 3rd party DMA
      (i.e.  O_DIRECT I/O with a persistent memory source/target).  However,
      we also need it in support of maintaining a list of mapped inodes which
      need to be unmapped at driver teardown or freeze_bdev() time.
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      34c0fd54
    • D
      kvm: rename pfn_t to kvm_pfn_t · ba049e93
      Dan Williams 提交于
      To date, we have implemented two I/O usage models for persistent memory,
      PMEM (a persistent "ram disk") and DAX (mmap persistent memory into
      userspace).  This series adds a third, DAX-GUP, that allows DAX mappings
      to be the target of direct-i/o.  It allows userspace to coordinate
      DMA/RDMA from/to persistent memory.
      
      The implementation leverages the ZONE_DEVICE mm-zone that went into
      4.3-rc1 (also discussed at kernel summit) to flag pages that are owned
      and dynamically mapped by a device driver.  The pmem driver, after
      mapping a persistent memory range into the system memmap via
      devm_memremap_pages(), arranges for DAX to distinguish pfn-only versus
      page-backed pmem-pfns via flags in the new pfn_t type.
      
      The DAX code, upon seeing a PFN_DEV+PFN_MAP flagged pfn, flags the
      resulting pte(s) inserted into the process page tables with a new
      _PAGE_DEVMAP flag.  Later, when get_user_pages() is walking ptes it keys
      off _PAGE_DEVMAP to pin the device hosting the page range active.
      Finally, get_page() and put_page() are modified to take references
      against the device driver established page mapping.
      
      Finally, this need for "struct page" for persistent memory requires
      memory capacity to store the memmap array.  Given the memmap array for a
      large pool of persistent may exhaust available DRAM introduce a
      mechanism to allocate the memmap from persistent memory.  The new
      "struct vmem_altmap *" parameter to devm_memremap_pages() enables
      arch_add_memory() to use reserved pmem capacity rather than the page
      allocator.
      
      This patch (of 18):
      
      The core has developed a need for a "pfn_t" type [1].  Move the existing
      pfn_t in KVM to kvm_pfn_t [2].
      
      [1]: https://lists.01.org/pipermail/linux-nvdimm/2015-September/002199.html
      [2]: https://lists.01.org/pipermail/linux-nvdimm/2015-September/002218.htmlSigned-off-by: NDan Williams <dan.j.williams@intel.com>
      Acked-by: NChristoffer Dall <christoffer.dall@linaro.org>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ba049e93
    • M
      arch/powerpc/include/asm/pgtable-ppc64.h: add pmd_[dirty|mkclean] for THP · d5d6a443
      Minchan Kim 提交于
      MADV_FREE needs pmd_dirty and pmd_mkclean for detecting recent overwrite
      of the contents since MADV_FREE syscall is called for THP page.
      
      This patch adds pmd_dirty and pmd_mkclean for THP page MADV_FREE
      support.
      Signed-off-by: NMinchan Kim <minchan@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: <yalin.wang2010@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chen Gang <gang.chen.5i5j@gmail.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Daniel Micay <danielmicay@gmail.com>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Jason Evans <je@fb.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mika Penttil <mika.penttila@nextfour.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Roland Dreier <roland@kernel.org>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d5d6a443
    • K
      powerpc, thp: remove infrastructure for handling splitting PMDs · 7aa9a23c
      Kirill A. Shutemov 提交于
      With new refcounting we don't need to mark PMDs splitting.  Let's drop
      code to handle this.
      
      pmdp_splitting_flush() is not needed too: on splitting PMD we will do
      pmdp_clear_flush() + set_pte_at().  pmdp_clear_flush() will do IPI as
      needed for fast_gup.
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Tested-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Reviewed-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Steve Capper <steve.capper@linaro.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7aa9a23c
    • K
      mm: drop tail page refcounting · ddc58f27
      Kirill A. Shutemov 提交于
      Tail page refcounting is utterly complicated and painful to support.
      
      It uses ->_mapcount on tail pages to store how many times this page is
      pinned.  get_page() bumps ->_mapcount on tail page in addition to
      ->_count on head.  This information is required by split_huge_page() to
      be able to distribute pins from head of compound page to tails during
      the split.
      
      We will need ->_mapcount to account PTE mappings of subpages of the
      compound page.  We eliminate need in current meaning of ->_mapcount in
      tail pages by forbidding split entirely if the page is pinned.
      
      The only user of tail page refcounting is THP which is marked BROKEN for
      now.
      
      Let's drop all this mess.  It makes get_page() and put_page() much
      simpler.
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Tested-by: NSasha Levin <sasha.levin@oracle.com>
      Tested-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NJerome Marchand <jmarchan@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Steve Capper <steve.capper@linaro.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ddc58f27
    • K
      thp: rename split_huge_page_pmd() to split_huge_pmd() · 78ddc534
      Kirill A. Shutemov 提交于
      We are going to decouple splitting THP PMD from splitting underlying
      compound page.
      
      This patch renames split_huge_page_pmd*() functions to split_huge_pmd*()
      to reflect the fact that it doesn't imply page splitting, only PMD.
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Tested-by: NSasha Levin <sasha.levin@oracle.com>
      Tested-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: NJerome Marchand <jmarchan@redhat.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Steve Capper <steve.capper@linaro.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      78ddc534
  9. 15 1月, 2016 1 次提交
    • V
      kmemcg: account certain kmem allocations to memcg · 5d097056
      Vladimir Davydov 提交于
      Mark those kmem allocations that are known to be easily triggered from
      userspace as __GFP_ACCOUNT/SLAB_ACCOUNT, which makes them accounted to
      memcg.  For the list, see below:
      
       - threadinfo
       - task_struct
       - task_delay_info
       - pid
       - cred
       - mm_struct
       - vm_area_struct and vm_region (nommu)
       - anon_vma and anon_vma_chain
       - signal_struct
       - sighand_struct
       - fs_struct
       - files_struct
       - fdtable and fdtable->full_fds_bits
       - dentry and external_name
       - inode for all filesystems. This is the most tedious part, because
         most filesystems overwrite the alloc_inode method.
      
      The list is far from complete, so feel free to add more objects.
      Nevertheless, it should be close to "account everything" approach and
      keep most workloads within bounds.  Malevolent users will be able to
      breach the limit, but this was possible even with the former "account
      everything" approach (simply because it did not account everything in
      fact).
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: NVladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5d097056
  10. 14 1月, 2016 1 次提交
  11. 13 1月, 2016 5 次提交
  12. 12 1月, 2016 2 次提交
    • H
      powerpc/mm: fix _PAGE_SWP_SOFT_DIRTY breaking swapoff · 2f10f1a7
      Hugh Dickins 提交于
      Swapoff after swapping hangs on the G5, when CONFIG_CHECKPOINT_RESTORE=y
      but CONFIG_MEM_SOFT_DIRTY is not set.  That's because the non-zero
      _PAGE_SWP_SOFT_DIRTY bit, added by CONFIG_HAVE_ARCH_SOFT_DIRTY=y, is not
      discounted when CONFIG_MEM_SOFT_DIRTY is not set: so swap ptes cannot be
      recognized.
      
      (I suspect that the peculiar dependence of HAVE_ARCH_SOFT_DIRTY on
      CHECKPOINT_RESTORE in arch/powerpc/Kconfig comes from an incomplete
      attempt to solve this problem.)
      
      It's true that the relationship between CONFIG_HAVE_ARCH_SOFT_DIRTY and
      and CONFIG_MEM_SOFT_DIRTY is too confusing, and it's true that swapoff
      should be made more robust; but nevertheless, fix up the powerpc ifdefs
      as x86_64 and s390 (which met the same problem) have them, defining the
      bits as 0 if CONFIG_MEM_SOFT_DIRTY is not set.
      
      Fixes: 7207f436 ("powerpc/mm: Add page soft dirty tracking")
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Reviewed-by: NCyrill Gorcunov <gorcunov@openvz.org>
      Acked-by: NLaurent Dufour <ldufour@linux.vnet.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      2f10f1a7
    • A
      powerpc/mm: Fix _PAGE_PTE breaking swapoff · 44734f23
      Aneesh Kumar K.V 提交于
      Core kernel expects swp_entry_t to consist of only swap type and swap
      offset. We should not leak pte bits into swp_entry_t. This breaks
      swapoff which use the swap type and offset to build a swp_entry_t and
      later compare that to the swp_entry_t obtained from linux page table
      pte. Leaking pte bits into swp_entry_t breaks that comparison and
      results in us looping in try_to_unuse.
      
      The stack trace can be anywhere below try_to_unuse() in mm/swapfile.c,
      since swapoff is circling around and around that function, reading from
      each used swap block into a page, then trying to find where that page
      belongs, looking at every non-file pte of every mm that ever swapped.
      
      Fixes: 6a119eae ("powerpc/mm: Add a _PAGE_PTE bit")
      Reported-by: NHugh Dickins <hughd@google.com>
      Suggested-by: NHugh Dickins <hughd@google.com>
      Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: NHugh Dickins <hughd@google.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      44734f23
  13. 11 1月, 2016 5 次提交