1. 16 3月, 2016 4 次提交
    • J
      mm: migrate: do not touch page->mem_cgroup of live pages · 6a93ca8f
      Johannes Weiner 提交于
      Changing a page's memcg association complicates dealing with the page,
      so we want to limit this as much as possible.  Page migration e.g.  does
      not have to do that.  Just like page cache replacement, it can forcibly
      charge a replacement page, and then uncharge the old page when it gets
      freed.  Temporarily overcharging the cgroup by a single page is not an
      issue in practice, and charging is so cheap nowadays that this is much
      preferrable to the headache of messing with live pages.
      
      The only place that still changes the page->mem_cgroup binding of live
      pages is when pages move along with a task to another cgroup.  But that
      path isolates the page from the LRU, takes the page lock, and the move
      lock (lock_page_memcg()).  That means page->mem_cgroup is always stable
      in callers that have the page isolated from the LRU or locked.  Lighter
      unlocked paths, like writeback accounting, can use lock_page_memcg().
      
      [akpm@linux-foundation.org: fix build]
      [vdavydov@virtuozzo.com: fix lockdep splat]
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NVladimir Davydov <vdavydov@virtuozzo.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6a93ca8f
    • L
      mm/page_poisoning.c: allow for zero poisoning · 1414c7f4
      Laura Abbott 提交于
      By default, page poisoning uses a poison value (0xaa) on free.  If this
      is changed to 0, the page is not only sanitized but zeroing on alloc
      with __GFP_ZERO can be skipped as well.  The tradeoff is that detecting
      corruption from the poisoning is harder to detect.  This feature also
      cannot be used with hibernation since pages are not guaranteed to be
      zeroed after hibernation.
      
      Credit to Grsecurity/PaX team for inspiring this work
      Signed-off-by: NLaura Abbott <labbott@fedoraproject.org>
      Acked-by: NRafael J. Wysocki <rjw@rjwysocki.net>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Mathias Krause <minipli@googlemail.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Jianyu Zhan <nasa4836@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1414c7f4
    • L
      mm/page_poison.c: enable PAGE_POISONING as a separate option · 8823b1db
      Laura Abbott 提交于
      Page poisoning is currently set up as a feature if architectures don't
      have architecture debug page_alloc to allow unmapping of pages.  It has
      uses apart from that though.  Clearing of the pages on free provides an
      increase in security as it helps to limit the risk of information leaks.
      Allow page poisoning to be enabled as a separate option independent of
      kernel_map pages since the two features do separate work.  Because of
      how hiberanation is implemented, the checks on alloc cannot occur if
      hibernation is enabled.  The runtime alloc checks can also be enabled
      with an option when !HIBERNATION.
      
      Credit to Grsecurity/PaX team for inspiring this work
      Signed-off-by: NLaura Abbott <labbott@fedoraproject.org>
      Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Mathias Krause <minipli@googlemail.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Jianyu Zhan <nasa4836@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8823b1db
    • J
      mm/slab: clean up DEBUG_PAGEALLOC processing code · 40b44137
      Joonsoo Kim 提交于
      Currently, open code for checking DEBUG_PAGEALLOC cache is spread to
      some sites.  It makes code unreadable and hard to change.
      
      This patch cleans up this code.  The following patch will change the
      criteria for DEBUG_PAGEALLOC cache so this clean-up will help it, too.
      
      [akpm@linux-foundation.org: fix build with CONFIG_DEBUG_PAGEALLOC=n]
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      40b44137
  2. 04 2月, 2016 2 次提交
    • K
      mm: polish virtual memory accounting · 30bdbb78
      Konstantin Khlebnikov 提交于
      * add VM_STACK as alias for VM_GROWSUP/DOWN depending on architecture
      * always account VMAs with flag VM_STACK as stack (as it was before)
      * cleanup classifying helpers
      * update comments and documentation
      Signed-off-by: NKonstantin Khlebnikov <koct9i@gmail.com>
      Tested-by: NSudip Mukherjee <sudipm.mukherjee@gmail.com>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      30bdbb78
    • J
      proc: revert /proc/<pid>/maps [stack:TID] annotation · 65376df5
      Johannes Weiner 提交于
      Commit b7643757 ("procfs: mark thread stack correctly in
      proc/<pid>/maps") added [stack:TID] annotation to /proc/<pid>/maps.
      
      Finding the task of a stack VMA requires walking the entire thread list,
      turning this into quadratic behavior: a thousand threads means a
      thousand stacks, so the rendering of /proc/<pid>/maps needs to look at a
      million combinations.
      
      The cost is not in proportion to the usefulness as described in the
      patch.
      
      Drop the [stack:TID] annotation to make /proc/<pid>/maps (and
      /proc/<pid>/numa_maps) usable again for higher thread counts.
      
      The [stack] annotation inside /proc/<pid>/task/<tid>/maps is retained, as
      identifying the stack VMA there is an O(1) operation.
      
      Siddesh said:
       "The end users needed a way to identify thread stacks programmatically and
        there wasn't a way to do that.  I'm afraid I no longer remember (or have
        access to the resources that would aid my memory since I changed
        employers) the details of their requirement.  However, I did do this on my
        own time because I thought it was an interesting project for me and nobody
        really gave any feedback then as to its utility, so as far as I am
        concerned you could roll back the main thread maps information since the
        information is available in the thread-specific files"
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Siddhesh Poyarekar <siddhesh.poyarekar@gmail.com>
      Cc: Shaohua Li <shli@fb.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      65376df5
  3. 30 1月, 2016 1 次提交
    • T
      memremap: Change region_intersects() to take @flags and @desc · 1c29f25b
      Toshi Kani 提交于
      Change region_intersects() to identify a target with @flags and
      @desc, instead of @name with strcmp().
      
      Change the callers of region_intersects(), memremap() and
      devm_memremap(), to set IORESOURCE_SYSTEM_RAM in @flags and
      IORES_DESC_NONE in @desc when searching System RAM.
      
      Also, export region_intersects() so that the ACPI EINJ error
      injection driver can call this function in a later patch.
      Signed-off-by: NToshi Kani <toshi.kani@hpe.com>
      Signed-off-by: NBorislav Petkov <bp@suse.de>
      Acked-by: NDan Williams <dan.j.williams@intel.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jakub Sitnicki <jsitnicki@gmail.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jiang Liu <jiang.liu@linux.intel.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Luis R. Rodriguez <mcgrof@suse.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Toshi Kani <toshi.kani@hp.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: linux-arch@vger.kernel.org
      Cc: linux-mm <linux-mm@kvack.org>
      Link: http://lkml.kernel.org/r/1453841853-11383-13-git-send-email-bp@alien8.deSigned-off-by: NIngo Molnar <mingo@kernel.org>
      1c29f25b
  4. 16 1月, 2016 14 次提交
  5. 15 1月, 2016 4 次提交
    • K
      mm: rework virtual memory accounting · 84638335
      Konstantin Khlebnikov 提交于
      When inspecting a vague code inside prctl(PR_SET_MM_MEM) call (which
      testing the RLIMIT_DATA value to figure out if we're allowed to assign
      new @start_brk, @brk, @start_data, @end_data from mm_struct) it's been
      commited that RLIMIT_DATA in a form it's implemented now doesn't do
      anything useful because most of user-space libraries use mmap() syscall
      for dynamic memory allocations.
      
      Linus suggested to convert RLIMIT_DATA rlimit into something suitable
      for anonymous memory accounting.  But in this patch we go further, and
      the changes are bundled together as:
      
       * keep vma counting if CONFIG_PROC_FS=n, will be used for limits
       * replace mm->shared_vm with better defined mm->data_vm
       * account anonymous executable areas as executable
       * account file-backed growsdown/up areas as stack
       * drop struct file* argument from vm_stat_account
       * enforce RLIMIT_DATA for size of data areas
      
      This way code looks cleaner: now code/stack/data classification depends
      only on vm_flags state:
      
       VM_EXEC & ~VM_WRITE            -> code  (VmExe + VmLib in proc)
       VM_GROWSUP | VM_GROWSDOWN      -> stack (VmStk)
       VM_WRITE & ~VM_SHARED & !stack -> data  (VmData)
      
      The rest (VmSize - VmData - VmStk - VmExe - VmLib) could be called
      "shared", but that might be strange beast like readonly-private or VM_IO
      area.
      
       - RLIMIT_AS            limits whole address space "VmSize"
       - RLIMIT_STACK         limits stack "VmStk" (but each vma individually)
       - RLIMIT_DATA          now limits "VmData"
      Signed-off-by: NKonstantin Khlebnikov <koct9i@gmail.com>
      Signed-off-by: NCyrill Gorcunov <gorcunov@openvz.org>
      Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
      Cc: Vegard Nossum <vegard.nossum@oracle.com>
      Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Cc: Willy Tarreau <w@1wt.eu>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Kees Cook <keescook@google.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Cc: Pavel Emelyanov <xemul@virtuozzo.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      84638335
    • M
      mm: allow GFP_{FS,IO} for page_cache_read page cache allocation · c20cd45e
      Michal Hocko 提交于
      page_cache_read has been historically using page_cache_alloc_cold to
      allocate a new page.  This means that mapping_gfp_mask is used as the
      base for the gfp_mask.  Many filesystems are setting this mask to
      GFP_NOFS to prevent from fs recursion issues.  page_cache_read is called
      from the vm_operations_struct::fault() context during the page fault.
      This context doesn't need the reclaim protection normally.
      
      ceph and ocfs2 which call filemap_fault from their fault handlers seem
      to be OK because they are not taking any fs lock before invoking generic
      implementation.  xfs which takes XFS_MMAPLOCK_SHARED is safe from the
      reclaim recursion POV because this lock serializes truncate and punch
      hole with the page faults and it doesn't get involved in the reclaim.
      
      There is simply no reason to deliberately use a weaker allocation
      context when a __GFP_FS | __GFP_IO can be used.  The GFP_NOFS protection
      might be even harmful.  There is a push to fail GFP_NOFS allocations
      rather than loop within allocator indefinitely with a very limited
      reclaim ability.  Once we start failing those requests the OOM killer
      might be triggered prematurely because the page cache allocation failure
      is propagated up the page fault path and end up in
      pagefault_out_of_memory.
      
      We cannot play with mapping_gfp_mask directly because that would be racy
      wrt.  parallel page faults and it might interfere with other users who
      really rely on NOFS semantic from the stored gfp_mask.  The mask is also
      inode proper so it would even be a layering violation.  What we can do
      instead is to push the gfp_mask into struct vm_fault and allow fs layer
      to overwrite it should the callback need to be called with a different
      allocation context.
      
      Initialize the default to (mapping_gfp_mask | __GFP_FS | __GFP_IO)
      because this should be safe from the page fault path normally.  Why do
      we care about mapping_gfp_mask at all then? Because this doesn't hold
      only reclaim protection flags but it also might contain zone and
      movability restrictions (GFP_DMA32, __GFP_MOVABLE and others) so we have
      to respect those.
      Signed-off-by: NMichal Hocko <mhocko@suse.com>
      Reported-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Acked-by: NJan Kara <jack@suse.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c20cd45e
    • D
      mm: mmap: add new /proc tunable for mmap_base ASLR · d07e2259
      Daniel Cashman 提交于
      Address Space Layout Randomization (ASLR) provides a barrier to
      exploitation of user-space processes in the presence of security
      vulnerabilities by making it more difficult to find desired code/data
      which could help an attack.  This is done by adding a random offset to
      the location of regions in the process address space, with a greater
      range of potential offset values corresponding to better protection/a
      larger search-space for brute force, but also to greater potential for
      fragmentation.
      
      The offset added to the mmap_base address, which provides the basis for
      the majority of the mappings for a process, is set once on process exec
      in arch_pick_mmap_layout() and is done via hard-coded per-arch values,
      which reflect, hopefully, the best compromise for all systems.  The
      trade-off between increased entropy in the offset value generation and
      the corresponding increased variability in address space fragmentation
      is not absolute, however, and some platforms may tolerate higher amounts
      of entropy.  This patch introduces both new Kconfig values and a sysctl
      interface which may be used to change the amount of entropy used for
      offset generation on a system.
      
      The direct motivation for this change was in response to the
      libstagefright vulnerabilities that affected Android, specifically to
      information provided by Google's project zero at:
      
        http://googleprojectzero.blogspot.com/2015/09/stagefrightened.html
      
      The attack presented therein, by Google's project zero, specifically
      targeted the limited randomness used to generate the offset added to the
      mmap_base address in order to craft a brute-force-based attack.
      Concretely, the attack was against the mediaserver process, which was
      limited to respawning every 5 seconds, on an arm device.  The hard-coded
      8 bits used resulted in an average expected success rate of defeating
      the mmap ASLR after just over 10 minutes (128 tries at 5 seconds a
      piece).  With this patch, and an accompanying increase in the entropy
      value to 16 bits, the same attack would take an average expected time of
      over 45 hours (32768 tries), which makes it both less feasible and more
      likely to be noticed.
      
      The introduced Kconfig and sysctl options are limited by per-arch
      minimum and maximum values, the minimum of which was chosen to match the
      current hard-coded value and the maximum of which was chosen so as to
      give the greatest flexibility without generating an invalid mmap_base
      address, generally a 3-4 bits less than the number of bits in the
      user-space accessible virtual address space.
      
      When decided whether or not to change the default value, a system
      developer should consider that mmap_base address could be placed
      anywhere up to 2^(value) bits away from the non-randomized location,
      which would introduce variable-sized areas above and below the mmap_base
      address such that the maximum vm_area_struct size may be reduced,
      preventing very large allocations.
      
      This patch (of 4):
      
      ASLR only uses as few as 8 bits to generate the random offset for the
      mmap base address on 32 bit architectures.  This value was chosen to
      prevent a poorly chosen value from dividing the address space in such a
      way as to prevent large allocations.  This may not be an issue on all
      platforms.  Allow the specification of a minimum number of bits so that
      platforms desiring greater ASLR protection may determine where to place
      the trade-off.
      Signed-off-by: NDaniel Cashman <dcashman@google.com>
      Cc: Russell King <linux@arm.linux.org.uk>
      Acked-by: NKees Cook <keescook@chromium.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Don Zickus <dzickus@redhat.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Heinrich Schuchardt <xypron.glpk@gmx.de>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mark Salyzyn <salyzyn@android.com>
      Cc: Jeff Vander Stoep <jeffv@google.com>
      Cc: Nick Kralevich <nnk@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Hector Marco-Gisbert <hecmargi@upv.es>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d07e2259
    • J
      mm, shmem: add internal shmem resident memory accounting · eca56ff9
      Jerome Marchand 提交于
      Currently looking at /proc/<pid>/status or statm, there is no way to
      distinguish shmem pages from pages mapped to a regular file (shmem pages
      are mapped to /dev/zero), even though their implication in actual memory
      use is quite different.
      
      The internal accounting currently counts shmem pages together with
      regular files.  As a preparation to extend the userspace interfaces,
      this patch adds MM_SHMEMPAGES counter to mm_rss_stat to account for
      shmem pages separately from MM_FILEPAGES.  The next patch will expose it
      to userspace - this patch doesn't change the exported values yet, by
      adding up MM_SHMEMPAGES to MM_FILEPAGES at places where MM_FILEPAGES was
      used before.  The only user-visible change after this patch is the OOM
      killer message that separates the reported "shmem-rss" from "file-rss".
      
      [vbabka@suse.cz: forward-porting, tweak changelog]
      Signed-off-by: NJerome Marchand <jmarchan@redhat.com>
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NKonstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NHugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      eca56ff9
  6. 07 11月, 2015 3 次提交
  7. 06 11月, 2015 3 次提交
    • E
      mm: introduce VM_LOCKONFAULT · de60f5f1
      Eric B Munson 提交于
      The cost of faulting in all memory to be locked can be very high when
      working with large mappings.  If only portions of the mapping will be used
      this can incur a high penalty for locking.
      
      For the example of a large file, this is the usage pattern for a large
      statical language model (probably applies to other statical or graphical
      models as well).  For the security example, any application transacting in
      data that cannot be swapped out (credit card data, medical records, etc).
      
      This patch introduces the ability to request that pages are not
      pre-faulted, but are placed on the unevictable LRU when they are finally
      faulted in.  The VM_LOCKONFAULT flag will be used together with VM_LOCKED
      and has no effect when set without VM_LOCKED.  Setting the VM_LOCKONFAULT
      flag for a VMA will cause pages faulted into that VMA to be added to the
      unevictable LRU when they are faulted or if they are already present, but
      will not cause any missing pages to be faulted in.
      
      Exposing this new lock state means that we cannot overload the meaning of
      the FOLL_POPULATE flag any longer.  Prior to this patch it was used to
      mean that the VMA for a fault was locked.  This means we need the new
      FOLL_MLOCK flag to communicate the locked state of a VMA.  FOLL_POPULATE
      will now only control if the VMA should be populated and in the case of
      VM_LOCKONFAULT, it will not be set.
      Signed-off-by: NEric B Munson <emunson@akamai.com>
      Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Shuah Khan <shuahkh@osg.samsung.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      de60f5f1
    • V
      mm: do not inc NR_PAGETABLE if ptlock_init failed · 706874e9
      Vladimir Davydov 提交于
      If ALLOC_SPLIT_PTLOCKS is defined, ptlock_init may fail, in which case we
      shouldn't increment NR_PAGETABLE.
      
      Since small allocations, such as ptlock, normally do not fail (currently
      they can fail if kmemcg is used though), this patch does not really fix
      anything and should be considered as a code cleanup.
      Signed-off-by: NVladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      706874e9
    • R
      mm: use only per-device readahead limit · 600e19af
      Roman Gushchin 提交于
      Maximal readahead size is limited now by two values:
       1) by global 2Mb constant (MAX_READAHEAD in max_sane_readahead())
       2) by configurable per-device value* (bdi->ra_pages)
      
      There are devices, which require custom readahead limit.
      For instance, for RAIDs it's calculated as number of devices
      multiplied by chunk size times 2.
      
      Readahead size can never be larger than bdi->ra_pages * 2 value
      (POSIX_FADV_SEQUNTIAL doubles readahead size).
      
      If so, why do we need two limits?
      I suggest to completely remove this max_sane_readahead() stuff and
      use per-device readahead limit everywhere.
      
      Also, using right readahead size for RAID disks can significantly
      increase i/o performance:
      
      before:
        dd if=/dev/md2 of=/dev/null bs=100M count=100
        100+0 records in
        100+0 records out
        10485760000 bytes (10 GB) copied, 12.9741 s, 808 MB/s
      
      after:
        $ dd if=/dev/md2 of=/dev/null bs=100M count=100
        100+0 records in
        100+0 records out
        10485760000 bytes (10 GB) copied, 8.91317 s, 1.2 GB/s
      
      (It's an 8-disks RAID5 storage).
      
      This patch doesn't change sys_readahead and madvise(MADV_WILLNEED)
      behavior introduced by 6d2be915 ("mm/readahead.c: fix readahead
      failure for memoryless NUMA nodes and limit readahead pages").
      Signed-off-by: NRoman Gushchin <klamm@yandex-team.ru>
      Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: onstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      600e19af
  8. 02 10月, 2015 1 次提交
    • G
      memcg: fix dirty page migration · 0610c25d
      Greg Thelen 提交于
      The problem starts with a file backed dirty page which is charged to a
      memcg.  Then page migration is used to move oldpage to newpage.
      
      Migration:
       - copies the oldpage's data to newpage
       - clears oldpage.PG_dirty
       - sets newpage.PG_dirty
       - uncharges oldpage from memcg
       - charges newpage to memcg
      
      Clearing oldpage.PG_dirty decrements the charged memcg's dirty page
      count.
      
      However, because newpage is not yet charged, setting newpage.PG_dirty
      does not increment the memcg's dirty page count.  After migration
      completes newpage.PG_dirty is eventually cleared, often in
      account_page_cleaned().  At this time newpage is charged to a memcg so
      the memcg's dirty page count is decremented which causes underflow
      because the count was not previously incremented by migration.  This
      underflow causes balance_dirty_pages() to see a very large unsigned
      number of dirty memcg pages which leads to aggressive throttling of
      buffered writes by processes in non root memcg.
      
      This issue:
       - can harm performance of non root memcg buffered writes.
       - can report too small (even negative) values in
         memory.stat[(total_)dirty] counters of all memcg, including the root.
      
      To avoid polluting migrate.c with #ifdef CONFIG_MEMCG checks, introduce
      page_memcg() and set_page_memcg() helpers.
      
      Test:
          0) setup and enter limited memcg
          mkdir /sys/fs/cgroup/test
          echo 1G > /sys/fs/cgroup/test/memory.limit_in_bytes
          echo $$ > /sys/fs/cgroup/test/cgroup.procs
      
          1) buffered writes baseline
          dd if=/dev/zero of=/data/tmp/foo bs=1M count=1k
          sync
          grep ^dirty /sys/fs/cgroup/test/memory.stat
      
          2) buffered writes with compaction antagonist to induce migration
          yes 1 > /proc/sys/vm/compact_memory &
          rm -rf /data/tmp/foo
          dd if=/dev/zero of=/data/tmp/foo bs=1M count=1k
          kill %
          sync
          grep ^dirty /sys/fs/cgroup/test/memory.stat
      
          3) buffered writes without antagonist, should match baseline
          rm -rf /data/tmp/foo
          dd if=/dev/zero of=/data/tmp/foo bs=1M count=1k
          sync
          grep ^dirty /sys/fs/cgroup/test/memory.stat
      
                             (speed, dirty residue)
                   unpatched                       patched
          1) 841 MB/s 0 dirty pages          886 MB/s 0 dirty pages
          2) 611 MB/s -33427456 dirty pages  793 MB/s 0 dirty pages
          3) 114 MB/s -33427456 dirty pages  891 MB/s 0 dirty pages
      
          Notice that unpatched baseline performance (1) fell after
          migration (3): 841 -> 114 MB/s.  In the patched kernel, post
          migration performance matches baseline.
      
      Fixes: c4843a75 ("memcg: add per cgroup dirty page accounting")
      Signed-off-by: NGreg Thelen <gthelen@google.com>
      Reported-by: NDave Hansen <dave.hansen@intel.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: <stable@vger.kernel.org>	[4.2+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0610c25d
  9. 11 9月, 2015 1 次提交
  10. 09 9月, 2015 5 次提交
  11. 05 9月, 2015 2 次提交