1. 20 10月, 2016 2 次提交
  2. 08 10月, 2016 38 次提交
    • P
      console: don't prefer first registered if DT specifies stdout-path · 05fd007e
      Paul Burton 提交于
      If a device tree specifies a preferred device for kernel console output
      via the stdout-path or linux,stdout-path chosen node properties or the
      stdout alias then the kernel ought to honor it & output the kernel
      console to that device.  As it stands, this isn't the case.  Whilst we
      parse the stdout-path properties & set an of_stdout variable from
      of_alias_scan(), and use that from of_console_check() to determine
      whether to add a console device as a preferred console whilst
      registering it, we also prefer the first registered console if no other
      has been selected at the time of its registration.
      
      This means that if a console other than the one the device tree selects
      via stdout-path is registered first, we will switch to using it & when
      the stdout-path console is later registered the call to
      add_preferred_console() via of_console_check() is too late to do
      anything useful.  In practice this seems to mean that we switch to the
      dummy console device fairly early & see no further console output:
      
          Console: colour dummy device 80x25
          console [tty0] enabled
          bootconsole [ns16550a0] disabled
      
      Fix this by not automatically preferring the first registered console if
      one is specified by the device tree.  This allows consoles to be
      registered but not enabled, and once the driver for the console selected
      by stdout-path calls of_console_check() the driver will be added to the
      list of preferred consoles before any other console has been enabled.
      When that console is then registered via register_console() it will be
      enabled as expected.
      
      Link: http://lkml.kernel.org/r/20160809151937.26118-1-paul.burton@imgtec.comSigned-off-by: NPaul Burton <paul.burton@imgtec.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Paul Burton <paul.burton@imgtec.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Ivan Delalande <colona@arista.com>
      Cc: Thierry Reding <treding@nvidia.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Jan Kara <jack@suse.com>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Rob Herring <robh+dt@kernel.org>
      Cc: Frank Rowand <frowand.list@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      05fd007e
    • A
      cred: simpler, 1D supplementary groups · 81243eac
      Alexey Dobriyan 提交于
      Current supplementary groups code can massively overallocate memory and
      is implemented in a way so that access to individual gid is done via 2D
      array.
      
      If number of gids is <= 32, memory allocation is more or less tolerable
      (140/148 bytes).  But if it is not, code allocates full page (!)
      regardless and, what's even more fun, doesn't reuse small 32-entry
      array.
      
      2D array means dependent shifts, loads and LEAs without possibility to
      optimize them (gid is never known at compile time).
      
      All of the above is unnecessary.  Switch to the usual
      trailing-zero-len-array scheme.  Memory is allocated with
      kmalloc/vmalloc() and only as much as needed.  Accesses become simpler
      (LEA 8(gi,idx,4) or even without displacement).
      
      Maximum number of gids is 65536 which translates to 256KB+8 bytes.  I
      think kernel can handle such allocation.
      
      On my usual desktop system with whole 9 (nine) aux groups, struct
      group_info shrinks from 148 bytes to 44 bytes, yay!
      
      Nice side effects:
      
       - "gi->gid[i]" is shorter than "GROUP_AT(gi, i)", less typing,
      
       - fix little mess in net/ipv4/ping.c
         should have been using GROUP_AT macro but this point becomes moot,
      
       - aux group allocation is persistent and should be accounted as such.
      
      Link: http://lkml.kernel.org/r/20160817201927.GA2096@p183.telecom.bySigned-off-by: NAlexey Dobriyan <adobriyan@gmail.com>
      Cc: Vasily Kulikov <segoon@openwall.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      81243eac
    • C
      nmi_backtrace: generate one-line reports for idle cpus · 6727ad9e
      Chris Metcalf 提交于
      When doing an nmi backtrace of many cores, most of which are idle, the
      output is a little overwhelming and very uninformative.  Suppress
      messages for cpus that are idling when they are interrupted and just
      emit one line, "NMI backtrace for N skipped: idling at pc 0xNNN".
      
      We do this by grouping all the cpuidle code together into a new
      .cpuidle.text section, and then checking the address of the interrupted
      PC to see if it lies within that section.
      
      This commit suitably tags x86 and tile idle routines, and only adds in
      the minimal framework for other architectures.
      
      Link: http://lkml.kernel.org/r/1472487169-14923-5-git-send-email-cmetcalf@mellanox.comSigned-off-by: NChris Metcalf <cmetcalf@mellanox.com>
      Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Tested-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Tested-by: Daniel Thompson <daniel.thompson@linaro.org> [arm]
      Tested-by: NPetr Mladek <pmladek@suse.com>
      Cc: Aaron Tomlin <atomlin@redhat.com>
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6727ad9e
    • C
      nmi_backtrace: add more trigger_*_cpu_backtrace() methods · 9a01c3ed
      Chris Metcalf 提交于
      Patch series "improvements to the nmi_backtrace code" v9.
      
      This patch series modifies the trigger_xxx_backtrace() NMI-based remote
      backtracing code to make it more flexible, and makes a few small
      improvements along the way.
      
      The motivation comes from the task isolation code, where there are
      scenarios where we want to be able to diagnose a case where some cpu is
      about to interrupt a task-isolated cpu.  It can be helpful to see both
      where the interrupting cpu is, and also an approximation of where the
      cpu that is being interrupted is.  The nmi_backtrace framework allows us
      to discover the stack of the interrupted cpu.
      
      I've tested that the change works as desired on tile, and build-tested
      x86, arm, mips, and sparc64.  For x86 I confirmed that the generic
      cpuidle stuff as well as the architecture-specific routines are in the
      new cpuidle section.  For arm, mips, and sparc I just build-tested it
      and made sure the generic cpuidle routines were in the new cpuidle
      section, but I didn't attempt to figure out which the platform-specific
      idle routines might be.  That might be more usefully done by someone
      with platform experience in follow-up patches.
      
      This patch (of 4):
      
      Currently you can only request a backtrace of either all cpus, or all
      cpus but yourself.  It can also be helpful to request a remote backtrace
      of a single cpu, and since we want that, the logical extension is to
      support a cpumask as the underlying primitive.
      
      This change modifies the existing lib/nmi_backtrace.c code to take a
      cpumask as its basic primitive, and modifies the linux/nmi.h code to use
      the new "cpumask" method instead.
      
      The existing clients of nmi_backtrace (arm and x86) are converted to
      using the new cpumask approach in this change.
      
      The other users of the backtracing API (sparc64 and mips) are converted
      to use the cpumask approach rather than the all/allbutself approach.
      The mips code ignored the "include_self" boolean but with this change it
      will now also dump a local backtrace if requested.
      
      Link: http://lkml.kernel.org/r/1472487169-14923-2-git-send-email-cmetcalf@mellanox.comSigned-off-by: NChris Metcalf <cmetcalf@mellanox.com>
      Tested-by: Daniel Thompson <daniel.thompson@linaro.org> [arm]
      Reviewed-by: NAaron Tomlin <atomlin@redhat.com>
      Reviewed-by: NPetr Mladek <pmladek@suse.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: David Miller <davem@davemloft.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9a01c3ed
    • J
      min/max: remove sparse warnings when they're nested · 589a9785
      Johannes Berg 提交于
      Currently, when min/max are nested within themselves, sparse will warn:
      
          warning: symbol '_min1' shadows an earlier one
          originally declared here
          warning: symbol '_min1' shadows an earlier one
          originally declared here
          warning: symbol '_min2' shadows an earlier one
          originally declared here
      
      This also immediately happens when min3() or max3() are used.
      
      Since sparse implements __COUNTER__, we can use __UNIQUE_ID() to
      generate unique variable names, avoiding this.
      
      Link: http://lkml.kernel.org/r/1471519773-29882-1-git-send-email-johannes@sipsolutions.netSigned-off-by: NJohannes Berg <johannes.berg@intel.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      589a9785
    • J
      seq/proc: modify seq_put_decimal_[u]ll to take a const char *, not char · 75ba1d07
      Joe Perches 提交于
      Allow some seq_puts removals by taking a string instead of a single
      char.
      
      [akpm@linux-foundation.org: update vmstat_show(), per Joe]
      Link: http://lkml.kernel.org/r/667e1cf3d436de91a5698170a1e98d882905e956.1470704995.git.joe@perches.comSigned-off-by: NJoe Perches <joe@perches.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      75ba1d07
    • Z
      linux/mm.h: canonicalize macro PAGE_ALIGNED() definition · 1061b0d2
      zijun_hu 提交于
      The macro PAGE_ALIGNED() is prone to cause error because it doesn't
      follow convention to parenthesize parameter @addr within macro body, for
      example unsigned long *ptr = kmalloc(...); PAGE_ALIGNED(ptr + 16); for
      the left parameter of macro IS_ALIGNED(), (unsigned long)(ptr + 16) is
      desired but the actual one is (unsigned long)ptr + 16.
      
      It is fixed by simply canonicalizing macro PAGE_ALIGNED() definition.
      
      Link: http://lkml.kernel.org/r/57EA6AE7.7090807@zoho.comSigned-off-by: Nzijun_hu <zijun_hu@htc.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1061b0d2
    • Z
      mm: remove unnecessary condition in remove_inode_hugepages · 72e2936c
      zhong jiang 提交于
      When the huge page is added to the page cahce (huge_add_to_page_cache),
      the page private flag will be cleared.  since this code
      (remove_inode_hugepages) will only be called for pages in the page
      cahce, PagePrivate(page) will always be false.
      
      The patch remove the code without any functional change.
      
      Link: http://lkml.kernel.org/r/1475113323-29368-1-git-send-email-zhongjiang@huawei.comSigned-off-by: Nzhong jiang <zhongjiang@huawei.com>
      Reviewed-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reviewed-by: NMike Kravetz <mike.kravetz@oracle.com>
      Tested-by: NMike Kravetz <mike.kravetz@oracle.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      72e2936c
    • M
      mm: consolidate warn_alloc_failed users · 7877cdcc
      Michal Hocko 提交于
      warn_alloc_failed is currently used from the page and vmalloc
      allocators.  This is a good reuse of the code except that vmalloc would
      appreciate a slightly different warning message.  This is already
      handled by the fmt parameter except that
      
        "%s: page allocation failure: order:%u, mode:%#x(%pGg)"
      
      is printed anyway.  This might be quite misleading because it might be a
      vmalloc failure which leads to the warning while the page allocator is
      not the culprit here.  Fix this by always using the fmt string and only
      print the context that makes sense for the particular context (e.g.
      order makes only very little sense for the vmalloc context).
      
      Rename the function to not miss any user and also because a later patch
      will reuse it also for !failure cases.
      
      Link: http://lkml.kernel.org/r/20160929084407.7004-2-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7877cdcc
    • A
      mm: vma_merge: fix vm_page_prot SMP race condition against rmap_walk · e86f15ee
      Andrea Arcangeli 提交于
      The rmap_walk can access vm_page_prot (and potentially vm_flags in the
      pte/pmd manipulations).  So it's not safe to wait the caller to update
      the vm_page_prot/vm_flags after vma_merge returned potentially removing
      the "next" vma and extending the "current" vma over the
      next->vm_start,vm_end range, but still with the "current" vma
      vm_page_prot, after releasing the rmap locks.
      
      The vm_page_prot/vm_flags must be transferred from the "next" vma to the
      current vma while vma_merge still holds the rmap locks.
      
      The side effect of this race condition is pte corruption during migrate
      as remove_migration_ptes when run on a address of the "next" vma that
      got removed, used the vm_page_prot of the current vma.
      
        migrate   	      	        mprotect
        ------------			-------------
        migrating in "next" vma
      				vma_merge() # removes "next" vma and
      			        	    # extends "current" vma
      					    # current vma is not with
      					    # vm_page_prot updated
        remove_migration_ptes
        read vm_page_prot of current "vma"
        establish pte with wrong permissions
      				vm_set_page_prot(vma) # too late!
      				change_protection in the old vma range
      				only, next range is not updated
      
      This caused segmentation faults and potentially memory corruption in
      heavy mprotect loads with some light page migration caused by compaction
      in the background.
      
      Hugh Dickins pointed out the comment about the Odd case 8 in vma_merge
      which confirms the case 8 is only buggy one where the race can trigger,
      in all other vma_merge cases the above cannot happen.
      
      This fix removes the oddness factor from case 8 and it converts it from:
      
            AAAA
        PPPPNNNNXXXX -> PPPPNNNNNNNN
      
      to:
      
            AAAA
        PPPPNNNNXXXX -> PPPPXXXXXXXX
      
      XXXX has the right vma properties for the whole merged vma returned by
      vma_adjust, so it solves the problem fully.  It has the added benefits
      that the callers could stop updating vma properties when vma_merge
      succeeds however the callers are not updated by this patch (there are
      bits like VM_SOFTDIRTY that still need special care for the whole range,
      as the vma merging ignores them, but as long as they're not processed by
      rmap walks and instead they're accessed with the mmap_sem at least for
      reading, they are fine not to be updated within vma_adjust before
      releasing the rmap_locks).
      
      Link: http://lkml.kernel.org/r/1474309513-20313-1-git-send-email-aarcange@redhat.comSigned-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Reported-by: NAditya Mandaleeka <adityam@microsoft.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Jan Vorlicek <janvorli@microsoft.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e86f15ee
    • A
      mm: vm_page_prot: update with WRITE_ONCE/READ_ONCE · 6d2329f8
      Andrea Arcangeli 提交于
      vma->vm_page_prot is read lockless from the rmap_walk, it may be updated
      concurrently and this prevents the risk of reading intermediate values.
      
      Link: http://lkml.kernel.org/r/1474660305-19222-1-git-send-email-aarcange@redhat.comSigned-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Jan Vorlicek <janvorli@microsoft.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6d2329f8
    • G
      mm/hugetlb: check for reserved hugepages during memory offline · 082d5b6b
      Gerald Schaefer 提交于
      In dissolve_free_huge_pages(), free hugepages will be dissolved without
      making sure that there are enough of them left to satisfy hugepage
      reservations.
      
      Fix this by adding a return value to dissolve_free_huge_pages() and
      checking h->free_huge_pages vs.  h->resv_huge_pages.  Note that this may
      lead to the situation where dissolve_free_huge_page() returns an error
      and all free hugepages that were dissolved before that error are lost,
      while the memory block still cannot be set offline.
      
      Fixes: c8721bbb ("mm: memory-hotplug: enable memory hotplug to handle hugepage")
      Link: http://lkml.kernel.org/r/20160926172811.94033-3-gerald.schaefer@de.ibm.comSigned-off-by: NGerald Schaefer <gerald.schaefer@de.ibm.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Rui Teng <rui.teng@linux.vnet.ibm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      082d5b6b
    • J
      mm: memcontrol: consolidate cgroup socket tracking · 2d758073
      Johannes Weiner 提交于
      The cgroup core and the memory controller need to track socket ownership
      for different purposes, but the tracking sites being entirely different
      is kind of ugly.
      
      Be a better citizen and rename the memory controller callbacks to match
      the cgroup core callbacks, then move them to the same place.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Link: http://lkml.kernel.org/r/20160914194846.11153-3-hannes@cmpxchg.orgSigned-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NTejun Heo <tj@kernel.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2d758073
    • B
      mm: move phys_mem_access_prot_allowed() declaration to pgtable.h · 08ea8c07
      Baoyou Xie 提交于
      We get 1 warning when building kernel with W=1:
      
        drivers/char/mem.c:220:12: warning: no previous prototype for 'phys_mem_access_prot_allowed' [-Wmissing-prototypes]
         int __weak phys_mem_access_prot_allowed(struct file *file,
      
      In fact, its declaration is spreading to several header files in
      different architecture, but need to be declare in common header file.
      
      So this patch moves phys_mem_access_prot_allowed() to pgtable.h.
      
      Link: http://lkml.kernel.org/r/1473751597-12139-1-git-send-email-baoyou.xie@linaro.orgSigned-off-by: NBaoyou Xie <baoyou.xie@linaro.org>
      Acked-by: NThomas Gleixner <tglx@linutronix.de>
      Acked-by: NRalf Baechle <ralf@linux-mips.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      08ea8c07
    • V
      mm, compaction: restrict full priority to non-costly orders · c2033b00
      Vlastimil Babka 提交于
      The new ultimate compaction priority disables some heuristics, which may
      result in excessive cost.  This is fine for non-costly orders where we
      want to try hard before resulting for OOM, but might be disruptive for
      costly orders which do not trigger OOM and should generally have some
      fallback.  Thus, we disable the full priority for costly orders.
      Suggested-by: NMichal Hocko <mhocko@kernel.org>
      Link: http://lkml.kernel.org/r/20160906135258.18335-4-vbabka@suse.czSigned-off-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c2033b00
    • H
      mm: remove page_file_index · 8cd79788
      Huang Ying 提交于
      After using the offset of the swap entry as the key of the swap cache,
      the page_index() becomes exactly same as page_file_index().  So the
      page_file_index() is removed and the callers are changed to use
      page_index() instead.
      
      Link: http://lkml.kernel.org/r/1473270649-27229-2-git-send-email-ying.huang@intel.comSigned-off-by: N"Huang, Ying" <ying.huang@intel.com>
      Cc: Trond Myklebust <trond.myklebust@primarydata.com>
      Cc: Anna Schumaker <anna.schumaker@netapp.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8cd79788
    • H
      mm, swap: use offset of swap entry as key of swap cache · f6ab1f7f
      Huang Ying 提交于
      This patch is to improve the performance of swap cache operations when
      the type of the swap device is not 0.  Originally, the whole swap entry
      value is used as the key of the swap cache, even though there is one
      radix tree for each swap device.  If the type of the swap device is not
      0, the height of the radix tree of the swap cache will be increased
      unnecessary, especially on 64bit architecture.  For example, for a 1GB
      swap device on the x86_64 architecture, the height of the radix tree of
      the swap cache is 11.  But if the offset of the swap entry is used as
      the key of the swap cache, the height of the radix tree of the swap
      cache is 4.  The increased height causes unnecessary radix tree
      descending and increased cache footprint.
      
      This patch reduces the height of the radix tree of the swap cache via
      using the offset of the swap entry instead of the whole swap entry value
      as the key of the swap cache.  In 32 processes sequential swap out test
      case on a Xeon E5 v3 system with RAM disk as swap, the lock contention
      for the spinlock of the swap cache is reduced from 20.15% to 12.19%,
      when the type of the swap device is 1.
      
      Use the whole swap entry as key,
      
        perf-profile.calltrace.cycles-pp._raw_spin_lock_irq.__add_to_swap_cache.add_to_swap_cache.add_to_swap.shrink_page_list: 10.37,
        perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.__remove_mapping.shrink_page_list.shrink_inactive_list.shrink_node_memcg: 9.78,
      
      Use the swap offset as key,
      
        perf-profile.calltrace.cycles-pp._raw_spin_lock_irq.__add_to_swap_cache.add_to_swap_cache.add_to_swap.shrink_page_list: 6.25,
        perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.__remove_mapping.shrink_page_list.shrink_inactive_list.shrink_node_memcg: 5.94,
      
      Link: http://lkml.kernel.org/r/1473270649-27229-1-git-send-email-ying.huang@intel.comSigned-off-by: N"Huang, Ying" <ying.huang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Aaron Lu <aaron.lu@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f6ab1f7f
    • A
      thp: reduce usage of huge zero page's atomic counter · 6fcb52a5
      Aaron Lu 提交于
      The global zero page is used to satisfy an anonymous read fault.  If
      THP(Transparent HugePage) is enabled then the global huge zero page is
      used.  The global huge zero page uses an atomic counter for reference
      counting and is allocated/freed dynamically according to its counter
      value.
      
      CPU time spent on that counter will greatly increase if there are a lot
      of processes doing anonymous read faults.  This patch proposes a way to
      reduce the access to the global counter so that the CPU load can be
      reduced accordingly.
      
      To do this, a new flag of the mm_struct is introduced:
      MMF_USED_HUGE_ZERO_PAGE.  With this flag, the process only need to touch
      the global counter in two cases:
      
       1 The first time it uses the global huge zero page;
       2 The time when mm_user of its mm_struct reaches zero.
      
      Note that right now, the huge zero page is eligible to be freed as soon
      as its last use goes away.  With this patch, the page will not be
      eligible to be freed until the exit of the last process from which it
      was ever used.
      
      And with the use of mm_user, the kthread is not eligible to use huge
      zero page either.  Since no kthread is using huge zero page today, there
      is no difference after applying this patch.  But if that is not desired,
      I can change it to when mm_count reaches zero.
      
      Case used for test on Haswell EP:
      
        usemem -n 72 --readonly -j 0x200000 100G
      
      Which spawns 72 processes and each will mmap 100G anonymous space and
      then do read only access to that space sequentially with a step of 2MB.
      
        CPU cycles from perf report for base commit:
            54.03%  usemem   [kernel.kallsyms]   [k] get_huge_zero_page
        CPU cycles from perf report for this commit:
             0.11%  usemem   [kernel.kallsyms]   [k] mm_get_huge_zero_page
      
      Performance(throughput) of the workload for base commit: 1784430792
      Performance(throughput) of the workload for this commit: 4726928591
      164% increase.
      
      Runtime of the workload for base commit: 707592 us
      Runtime of the workload for this commit: 303970 us
      50% drop.
      
      Link: http://lkml.kernel.org/r/fe51a88f-446a-4622-1363-ad1282d71385@intel.comSigned-off-by: NAaron Lu <aaron.lu@intel.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Ebru Akagunduz <ebru.akagunduz@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6fcb52a5
    • T
      thp, dax: add thp_get_unmapped_area for pmd mappings · 74d2fad1
      Toshi Kani 提交于
      When CONFIG_FS_DAX_PMD is set, DAX supports mmap() using pmd page size.
      This feature relies on both mmap virtual address and FS block (i.e.
      physical address) to be aligned by the pmd page size.  Users can use
      mkfs options to specify FS to align block allocations.  However,
      aligning mmap address requires code changes to existing applications for
      providing a pmd-aligned address to mmap().
      
      For instance, fio with "ioengine=mmap" performs I/Os with mmap() [1].
      It calls mmap() with a NULL address, which needs to be changed to
      provide a pmd-aligned address for testing with DAX pmd mappings.
      Changing all applications that call mmap() with NULL is undesirable.
      
      Add thp_get_unmapped_area(), which can be called by filesystem's
      get_unmapped_area to align an mmap address by the pmd size for a DAX
      file.  It calls the default handler, mm->get_unmapped_area(), to find a
      range and then aligns it for a DAX file.
      
      The patch is based on Matthew Wilcox's change that allows adding support
      of the pud page size easily.
      
      [1]: https://github.com/axboe/fio/blob/master/engines/mmap.c
      Link: http://lkml.kernel.org/r/1472497881-9323-2-git-send-email-toshi.kani@hpe.comSigned-off-by: NToshi Kani <toshi.kani@hpe.com>
      Reviewed-by: NDan Williams <dan.j.williams@intel.com>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      74d2fad1
    • H
      mm: don't use radix tree writeback tags for pages in swap cache · 371a096e
      Huang Ying 提交于
      File pages use a set of radix tree tags (DIRTY, TOWRITE, WRITEBACK,
      etc.) to accelerate finding the pages with a specific tag in the radix
      tree during inode writeback.  But for anonymous pages in the swap cache,
      there is no inode writeback.  So there is no need to find the pages with
      some writeback tags in the radix tree.  It is not necessary to touch
      radix tree writeback tags for pages in the swap cache.
      
      Per Rik van Riel's suggestion, a new flag AS_NO_WRITEBACK_TAGS is
      introduced for address spaces which don't need to update the writeback
      tags.  The flag is set for swap caches.  It may be used for DAX file
      systems, etc.
      
      With this patch, the swap out bandwidth improved 22.3% (from ~1.2GB/s to
      ~1.48GBps) in the vm-scalability swap-w-seq test case with 8 processes.
      The test is done on a Xeon E5 v3 system.  The swap device used is a RAM
      simulated PMEM (persistent memory) device.  The improvement comes from
      the reduced contention on the swap cache radix tree lock.  To test
      sequential swapping out, the test case uses 8 processes, which
      sequentially allocate and write to the anonymous pages until RAM and
      part of the swap device is used up.
      
      Details of comparison is as follow,
      
      base             base+patch
      ---------------- --------------------------
               %stddev     %change         %stddev
                   \          |                \
         2506952 ±  2%     +28.1%    3212076 ±  7%  vm-scalability.throughput
         1207402 ±  7%     +22.3%    1476578 ±  6%  vmstat.swap.so
           10.86 ± 12%     -23.4%       8.31 ± 16%  perf-profile.cycles-pp._raw_spin_lock_irq.__add_to_swap_cache.add_to_swap_cache.add_to_swap.shrink_page_list
           10.82 ± 13%     -33.1%       7.24 ± 14%  perf-profile.cycles-pp._raw_spin_lock_irqsave.__remove_mapping.shrink_page_list.shrink_inactive_list.shrink_zone_memcg
           10.36 ± 11%    -100.0%       0.00 ± -1%  perf-profile.cycles-pp._raw_spin_lock_irqsave.__test_set_page_writeback.bdev_write_page.__swap_writepage.swap_writepage
           10.52 ± 12%    -100.0%       0.00 ± -1%  perf-profile.cycles-pp._raw_spin_lock_irqsave.test_clear_page_writeback.end_page_writeback.page_endio.pmem_rw_page
      
      Link: http://lkml.kernel.org/r/1472578089-5560-1-git-send-email-ying.huang@intel.comSigned-off-by: N"Huang, Ying" <ying.huang@intel.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      371a096e
    • Z
      mm/nobootmem.c: remove duplicate macro ARCH_LOW_ADDRESS_LIMIT statements · 2382705f
      zijun_hu 提交于
      Fix the following bugs:
      
       - the same ARCH_LOW_ADDRESS_LIMIT statements are duplicated between
         header and relevant source
      
       - don't ensure ARCH_LOW_ADDRESS_LIMIT perhaps defined by ARCH in
         asm/processor.h is preferred over default in linux/bootmem.h
         completely since the former header isn't included by the latter
      
      Link: http://lkml.kernel.org/r/e046aeaa-e160-6d9e-dc1b-e084c2fd999f@zoho.comSigned-off-by: Nzijun_hu <zijun_hu@htc.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2382705f
    • S
      mm/memblock.c: expose total reserved memory · 8907de5d
      Srikar Dronamraju 提交于
      The total reserved memory in a system is accounted but not available for
      use use outside mm/memblock.c.  By exposing the total reserved memory,
      systems can better calculate the size of large hashes.
      
      Link: http://lkml.kernel.org/r/1472476010-4709-3-git-send-email-srikar@linux.vnet.ibm.comSigned-off-by: NSrikar Dronamraju <srikar@linux.vnet.ibm.com>
      Suggested-by: NMel Gorman <mgorman@techsingularity.net>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
      Cc: Hari Bathini <hbathini@linux.vnet.ibm.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8907de5d
    • S
      mm: introduce arch_reserved_kernel_pages() · f6f34b43
      Srikar Dronamraju 提交于
      Currently arch specific code can reserve memory blocks but
      alloc_large_system_hash() may not take it into consideration when sizing
      the hashes.  This can lead to bigger hash than required and lead to no
      available memory for other purposes.  This is specifically true for
      systems with CONFIG_DEFERRED_STRUCT_PAGE_INIT enabled.
      
      One approach to solve this problem would be to walk through the memblock
      regions and calculate the available memory and base the size of hash
      system on the available memory.
      
      The other approach would be to depend on the architecture to provide the
      number of pages that are reserved.  This change provides hooks to allow
      the architecture to provide the required info.
      
      Link: http://lkml.kernel.org/r/1472476010-4709-2-git-send-email-srikar@linux.vnet.ibm.comSigned-off-by: NSrikar Dronamraju <srikar@linux.vnet.ibm.com>
      Suggested-by: NMel Gorman <mgorman@techsingularity.net>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
      Cc: Hari Bathini <hbathini@linux.vnet.ibm.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f6f34b43
    • M
      mm: make sure that kthreads will not refault oom reaped memory · 3f70dc38
      Michal Hocko 提交于
      There are only few use_mm() users in the kernel right now.  Most of them
      write to the target memory but vhost driver relies on
      copy_from_user/get_user from a kernel thread context.  This makes it
      impossible to reap the memory of an oom victim which shares the mm with
      the vhost kernel thread because it could see a zero page unexpectedly
      and theoretically make an incorrect decision visible outside of the
      killed task context.
      
      To quote Michael S. Tsirkin:
      : Getting an error from __get_user and friends is handled gracefully.
      : Getting zero instead of a real value will cause userspace
      : memory corruption.
      
      The vhost kernel thread is bound to an open fd of the vhost device which
      is not tight to the mm owner life cycle in general.  The device fd can
      be inherited or passed over to another process which means that we
      really have to be careful about unexpected memory corruption because
      unlike for normal oom victims the result will be visible outside of the
      oom victim context.
      
      Make sure that no kthread context (users of use_mm) can ever see
      corrupted data because of the oom reaper and hook into the page fault
      path by checking MMF_UNSTABLE mm flag.  __oom_reap_task_mm will set the
      flag before it starts unmapping the address space while the flag is
      checked after the page fault has been handled.  If the flag is set then
      SIGBUS is triggered so any g-u-p user will get a error code.
      
      Regular tasks do not need this protection because all which share the mm
      are killed when the mm is reaped and so the corruption will not outlive
      them.
      
      This patch shouldn't have any visible effect at this moment because the
      OOM killer doesn't invoke oom reaper for tasks with mm shared with
      kthreads yet.
      
      Link: http://lkml.kernel.org/r/1472119394-11342-9-git-send-email-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: N"Michael S. Tsirkin" <mst@redhat.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3f70dc38
    • T
      mm, oom: enforce exit_oom_victim on current task · 38531201
      Tetsuo Handa 提交于
      There are no users of exit_oom_victim on !current task anymore so enforce
      the API to always work on the current.
      
      Link: http://lkml.kernel.org/r/1472119394-11342-8-git-send-email-mhocko@kernel.orgSigned-off-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: NMichal Hocko <mhocko@suse.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      38531201
    • M
      oom, suspend: fix oom_killer_disable vs. pm suspend properly · 7d2e7a22
      Michal Hocko 提交于
      Commit 74070542 ("oom, suspend: fix oom_reaper vs.
      oom_killer_disable race") has workaround an existing race between
      oom_killer_disable and oom_reaper by adding another round of
      try_to_freeze_tasks after the oom killer was disabled.  This was the
      easiest thing to do for a late 4.7 fix.  Let's fix it properly now.
      
      After "oom: keep mm of the killed task available" we no longer have to
      call exit_oom_victim from the oom reaper because we have stable mm
      available and hide the oom_reaped mm by MMF_OOM_SKIP flag.  So let's
      remove exit_oom_victim and the race described in the above commit
      doesn't exist anymore if.
      
      Unfortunately this alone is not sufficient for the oom_killer_disable
      usecase because now we do not have any reliable way to reach
      exit_oom_victim (the victim might get stuck on a way to exit for an
      unbounded amount of time).  OOM killer can cope with that by checking mm
      flags and move on to another victim but we cannot do the same for
      oom_killer_disable as we would lose the guarantee of no further
      interference of the victim with the rest of the system.  What we can do
      instead is to cap the maximum time the oom_killer_disable waits for
      victims.  The only current user of this function (pm suspend) already
      has a concept of timeout for back off so we can reuse the same value
      there.
      
      Let's drop set_freezable for the oom_reaper kthread because it is no
      longer needed as the reaper doesn't wake or thaw any processes.
      
      Link: http://lkml.kernel.org/r/1472119394-11342-7-git-send-email-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7d2e7a22
    • M
      mm, oom: get rid of signal_struct::oom_victims · 862e3073
      Michal Hocko 提交于
      After "oom: keep mm of the killed task available" we can safely detect
      an oom victim by checking task->signal->oom_mm so we do not need the
      signal_struct counter anymore so let's get rid of it.
      
      This alone wouldn't be sufficient for nommu archs because
      exit_oom_victim doesn't hide the process from the oom killer anymore.
      We can, however, mark the mm with a MMF flag in __mmput.  We can reuse
      MMF_OOM_REAPED and rename it to a more generic MMF_OOM_SKIP.
      
      Link: http://lkml.kernel.org/r/1472119394-11342-6-git-send-email-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      862e3073
    • M
      kernel, oom: fix potential pgd_lock deadlock from __mmdrop · 7283094e
      Michal Hocko 提交于
      Lockdep complains that __mmdrop is not safe from the softirq context:
      
        =================================
        [ INFO: inconsistent lock state ]
        4.6.0-oomfortification2-00011-geeb3eadeab96-dirty #949 Tainted: G        W
        ---------------------------------
        inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
        swapper/1/0 [HC0[0]:SC1[1]:HE1:SE0] takes:
         (pgd_lock){+.?...}, at: pgd_free+0x19/0x6b
        {SOFTIRQ-ON-W} state was registered at:
           __lock_acquire+0xa06/0x196e
           lock_acquire+0x139/0x1e1
           _raw_spin_lock+0x32/0x41
           __change_page_attr_set_clr+0x2a5/0xacd
           change_page_attr_set_clr+0x16f/0x32c
           set_memory_nx+0x37/0x3a
           free_init_pages+0x9e/0xc7
           alternative_instructions+0xa2/0xb3
           check_bugs+0xe/0x2d
           start_kernel+0x3ce/0x3ea
           x86_64_start_reservations+0x2a/0x2c
           x86_64_start_kernel+0x17a/0x18d
        irq event stamp: 105916
        hardirqs last  enabled at (105916): free_hot_cold_page+0x37e/0x390
        hardirqs last disabled at (105915): free_hot_cold_page+0x2c1/0x390
        softirqs last  enabled at (105878): _local_bh_enable+0x42/0x44
        softirqs last disabled at (105879): irq_exit+0x6f/0xd1
      
        other info that might help us debug this:
         Possible unsafe locking scenario:
      
               CPU0
               ----
          lock(pgd_lock);
          <Interrupt>
            lock(pgd_lock);
      
         *** DEADLOCK ***
      
        1 lock held by swapper/1/0:
         #0:  (rcu_callback){......}, at: rcu_process_callbacks+0x390/0x800
      
        stack backtrace:
        CPU: 1 PID: 0 Comm: swapper/1 Tainted: G        W       4.6.0-oomfortification2-00011-geeb3eadeab96-dirty #949
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Debian-1.8.2-1 04/01/2014
        Call Trace:
         <IRQ>
          print_usage_bug.part.25+0x259/0x268
          mark_lock+0x381/0x567
          __lock_acquire+0x993/0x196e
          lock_acquire+0x139/0x1e1
          _raw_spin_lock+0x32/0x41
          pgd_free+0x19/0x6b
          __mmdrop+0x25/0xb9
          __put_task_struct+0x103/0x11e
          delayed_put_task_struct+0x157/0x15e
          rcu_process_callbacks+0x660/0x800
          __do_softirq+0x1ec/0x4d5
          irq_exit+0x6f/0xd1
          smp_apic_timer_interrupt+0x42/0x4d
          apic_timer_interrupt+0x8e/0xa0
         <EOI>
          arch_cpu_idle+0xf/0x11
          default_idle_call+0x32/0x34
          cpu_startup_entry+0x20c/0x399
          start_secondary+0xfe/0x101
      
      More over commit a79e53d8 ("x86/mm: Fix pgd_lock deadlock") was
      explicit about pgd_lock not to be called from the irq context.  This
      means that __mmdrop called from free_signal_struct has to be postponed
      to a user context.  We already have a similar mechanism for mmput_async
      so we can use it here as well.  This is safe because mm_count is pinned
      by mm_users.
      
      This fixes bug introduced by "oom: keep mm of the killed task available"
      
      Link: http://lkml.kernel.org/r/1472119394-11342-5-git-send-email-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7283094e
    • M
      oom: keep mm of the killed task available · 26db62f1
      Michal Hocko 提交于
      oom_reap_task has to call exit_oom_victim in order to make sure that the
      oom vicim will not block the oom killer for ever.  This is, however,
      opening new problems (e.g oom_killer_disable exclusion - see commit
      74070542 ("oom, suspend: fix oom_reaper vs.  oom_killer_disable
      race")).  exit_oom_victim should be only called from the victim's
      context ideally.
      
      One way to achieve this would be to rely on per mm_struct flags.  We
      already have MMF_OOM_REAPED to hide a task from the oom killer since
      "mm, oom: hide mm which is shared with kthread or global init". The
      problem is that the exit path:
      
        do_exit
          exit_mm
            tsk->mm = NULL;
            mmput
              __mmput
            exit_oom_victim
      
      doesn't guarantee that exit_oom_victim will get called in a bounded
      amount of time.  At least exit_aio depends on IO which might get blocked
      due to lack of memory and who knows what else is lurking there.
      
      This patch takes a different approach.  We remember tsk->mm into the
      signal_struct and bind it to the signal struct life time for all oom
      victims.  __oom_reap_task_mm as well as oom_scan_process_thread do not
      have to rely on find_lock_task_mm anymore and they will have a reliable
      reference to the mm struct.  As a result all the oom specific
      communication inside the OOM killer can be done via tsk->signal->oom_mm.
      
      Increasing the signal_struct for something as unlikely as the oom killer
      is far from ideal but this approach will make the code much more
      reasonable and long term we even might want to move task->mm into the
      signal_struct anyway.  In the next step we might want to make the oom
      killer exclusion and access to memory reserves completely independent
      which would be also nice.
      
      Link: http://lkml.kernel.org/r/1472119394-11342-4-git-send-email-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      26db62f1
    • T
      mm,oom_reaper: do not attempt to reap a task twice · 8496afab
      Tetsuo Handa 提交于
      "mm, oom_reaper: do not attempt to reap a task twice" tried to give the
      OOM reaper one more chance to retry using MMF_OOM_NOT_REAPABLE flag.
      But the usefulness of the flag is rather limited and actually never
      shown in practice.  If the flag is set, it means that the holder of
      mm->mmap_sem cannot call up_write() due to presumably being blocked at
      unkillable wait waiting for other thread's memory allocation.  But since
      one of threads sharing that mm will queue that mm immediately via
      task_will_free_mem() shortcut (otherwise, oom_badness() will select the
      same mm again due to oom_score_adj value unchanged), retrying
      MMF_OOM_NOT_REAPABLE mm is unlikely helpful.
      
      Let's always set MMF_OOM_REAPED.
      
      Link: http://lkml.kernel.org/r/1472119394-11342-3-git-send-email-mhocko@kernel.orgSigned-off-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: NMichal Hocko <mhocko@suse.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8496afab
    • H
      mm, swap: add swap_cluster_list · 6b534915
      Huang Ying 提交于
      This is a code clean up patch without functionality changes.  The
      swap_cluster_list data structure and its operations are introduced to
      provide some better encapsulation for the free cluster and discard
      cluster list operations.  This avoid some code duplication, improved the
      code readability, and reduced the total line number.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Link: http://lkml.kernel.org/r/1472067356-16004-1-git-send-email-ying.huang@intel.comSigned-off-by: N"Huang, Ying" <ying.huang@intel.com>
      Acked-by: NMinchan Kim <minchan@kernel.org>
      Acked-by: NRik van Riel <riel@redhat.com>
      Cc: Tim Chen <tim.c.chen@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Shaohua Li <shli@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6b534915
    • J
      mm: pagewalk: fix the comment for test_walk · f7e2355f
      James Morse 提交于
      Modify the comment describing struct mm_walk->test_walk()s behaviour to
      match the comment on walk_page_test() and the behaviour of
      walk_page_vma().
      
      Fixes: fafaa426 ("pagewalk: improve vma handling")
      Link: http://lkml.kernel.org/r/1471622518-21980-1-git-send-email-james.morse@arm.comSigned-off-by: NJames Morse <james.morse@arm.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f7e2355f
    • J
      mm/page_owner: don't define fields on struct page_ext by hard-coding · 9300d8df
      Joonsoo Kim 提交于
      There is a memory waste problem if we define field on struct page_ext by
      hard-coding.  Entry size of struct page_ext includes the size of those
      fields even if it is disabled at runtime.  Now, extra memory request at
      runtime is possible so page_owner don't need to define it's own fields
      by hard-coding.
      
      This patch removes hard-coded define and uses extra memory for storing
      page_owner information in page_owner.  Most of code are just mechanical
      changes.
      
      Link: http://lkml.kernel.org/r/1471315879-32294-7-git-send-email-iamjoonsoo.kim@lge.comSigned-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9300d8df
    • J
      mm/page_ext: support extra space allocation by page_ext user · 980ac167
      Joonsoo Kim 提交于
      Until now, if some page_ext users want to use it's own field on
      page_ext, it should be defined in struct page_ext by hard-coding.  It
      has a problem that wastes memory in following situation.
      
        struct page_ext {
         #ifdef CONFIG_A
        	int a;
         #endif
         #ifdef CONFIG_B
        	int b;
         #endif
        };
      
      Assume that kernel is built with both CONFIG_A and CONFIG_B.  Even if we
      enable feature A and doesn't enable feature B at runtime, each entry of
      struct page_ext takes two int rather than one int.  It's undesirable
      result so this patch tries to fix it.
      
      To solve above problem, this patch implements to support extra space
      allocation at runtime.  When need() callback returns true, it's extra
      memory requirement is summed to entry size of page_ext.  Also, offset
      for each user's extra memory space is returned.  With this offset, user
      can use this extra space and there is no need to define needed field on
      page_ext by hard-coding.
      
      This patch only implements an infrastructure.  Following patch will use
      it for page_owner which is only user having it's own fields on page_ext.
      
      Link: http://lkml.kernel.org/r/1471315879-32294-6-git-send-email-iamjoonsoo.kim@lge.comSigned-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      980ac167
    • J
      mm/page_owner: move page_owner specific function to page_owner.c · e2f612e6
      Joonsoo Kim 提交于
      There is no reason that page_owner specific function resides on
      vmstat.c.
      
      Link: http://lkml.kernel.org/r/1471315879-32294-4-git-send-email-iamjoonsoo.kim@lge.comSigned-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Reviewed-by: NSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e2f612e6
    • M
      mm, vmscan: get rid of throttle_vm_writeout · bf484383
      Michal Hocko 提交于
      throttle_vm_writeout() was introduced back in 2005 to fix OOMs caused by
      excessive pageout activity during the reclaim.  Too many pages could be
      put under writeback therefore LRUs would be full of unreclaimable pages
      until the IO completes and in turn the OOM killer could be invoked.
      
      There have been some important changes introduced since then in the
      reclaim path though.  Writers are throttled by balance_dirty_pages when
      initiating the buffered IO and later during the memory pressure, the
      direct reclaim is throttled by wait_iff_congested if the node is
      considered congested by dirty pages on LRUs and the underlying bdi is
      congested by the queued IO.  The kswapd is throttled as well if it
      encounters pages marked for immediate reclaim or under writeback which
      signals that that there are too many pages under writeback already.
      Finally should_reclaim_retry does congestion_wait if the reclaim cannot
      make any progress and there are too many dirty/writeback pages.
      
      Another important aspect is that we do not issue any IO from the direct
      reclaim context anymore.  In a heavy parallel load this could queue a
      lot of IO which would be very scattered and thus unefficient which would
      just make the problem worse.
      
      This three mechanisms should throttle and keep the amount of IO in a
      steady state even under heavy IO and memory pressure so yet another
      throttling point doesn't really seem helpful.  Quite contrary, Mikulas
      Patocka has reported that swap backed by dm-crypt doesn't work properly
      because the swapout IO cannot make sufficient progress as the writeout
      path depends on dm_crypt worker which has to allocate memory to perform
      the encryption.  In order to guarantee a forward progress it relies on
      the mempool allocator.  mempool_alloc(), however, prefers to use the
      underlying (usually page) allocator before it grabs objects from the
      pool.  Such an allocation can dive into the memory reclaim and
      consequently to throttle_vm_writeout.  If there are too many dirty or
      pages under writeback it will get throttled even though it is in fact a
      flusher to clear pending pages.
      
        kworker/u4:0    D ffff88003df7f438 10488     6      2	0x00000000
        Workqueue: kcryptd kcryptd_crypt [dm_crypt]
        Call Trace:
          schedule+0x3c/0x90
          schedule_timeout+0x1d8/0x360
          io_schedule_timeout+0xa4/0x110
          congestion_wait+0x86/0x1f0
          throttle_vm_writeout+0x44/0xd0
          shrink_zone_memcg+0x613/0x720
          shrink_zone+0xe0/0x300
          do_try_to_free_pages+0x1ad/0x450
          try_to_free_pages+0xef/0x300
          __alloc_pages_nodemask+0x879/0x1210
          alloc_pages_current+0xa1/0x1f0
          new_slab+0x2d7/0x6a0
          ___slab_alloc+0x3fb/0x5c0
          __slab_alloc+0x51/0x90
          kmem_cache_alloc+0x27b/0x310
          mempool_alloc_slab+0x1d/0x30
          mempool_alloc+0x91/0x230
          bio_alloc_bioset+0xbd/0x260
          kcryptd_crypt+0x114/0x3b0 [dm_crypt]
      
      Let's just drop throttle_vm_writeout altogether.  It is not very much
      helpful anymore.
      
      I have tried to test a potential writeback IO runaway similar to the one
      described in the original patch which has introduced that [1].  Small
      virtual machine (512MB RAM, 4 CPUs, 2G of swap space and disk image on a
      rather slow NFS in a sync mode on the host) with 8 parallel writers each
      writing 1G worth of data.  As soon as the pagecache fills up and the
      direct reclaim hits then I start anon memory consumer in a loop
      (allocating 300M and exiting after populating it) in the background to
      make the memory pressure even stronger as well as to disrupt the steady
      state for the IO.  The direct reclaim is throttled because of the
      congestion as well as kswapd hitting congestion_wait due to nr_immediate
      but throttle_vm_writeout doesn't ever trigger the sleep throughout the
      test.  Dirty+writeback are close to nr_dirty_threshold with some
      fluctuations caused by the anon consumer.
      
      [1] https://www2.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.9-rc1/2.6.9-rc1-mm3/broken-out/vm-pageout-throttling.patch
      Link: http://lkml.kernel.org/r/1471171473-21418-1-git-send-email-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Reported-by: NMikulas Patocka <mpatocka@redhat.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: NeilBrown <neilb@suse.com>
      Cc: Ondrej Kozina <okozina@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bf484383
    • V
      mm, compaction: create compact_gap wrapper · 9861a62c
      Vlastimil Babka 提交于
      Compaction uses a watermark gap of (2UL << order) pages at various
      places and it's not immediately obvious why.  Abstract it through a
      compact_gap() wrapper to create a single place with a thorough
      explanation.
      
      [vbabka@suse.cz: clarify the comment of compact_gap()]
       Link: http://lkml.kernel.org/r/7b6aed1f-fdf8-2063-9ff4-bbe4de712d37@suse.cz
      Link: http://lkml.kernel.org/r/20160810091226.6709-9-vbabka@suse.czSigned-off-by: NVlastimil Babka <vbabka@suse.cz>
      Tested-by: NLorenzo Stoakes <lstoakes@gmail.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9861a62c
    • V
      mm, compaction: add the ultimate direct compaction priority · a8e025e5
      Vlastimil Babka 提交于
      During reclaim/compaction loop, it's desirable to get a final answer
      from unsuccessful compaction so we can either fail the allocation or
      invoke the OOM killer.  However, heuristics such as deferred compaction
      or pageblock skip bits can cause compaction to skip parts or whole zones
      and lead to premature OOM's, failures or excessive reclaim/compaction
      retries.
      
      To remedy this, we introduce a new direct compaction priority called
      COMPACT_PRIO_SYNC_FULL, which instructs direct compaction to:
      
       - ignore deferred compaction status for a zone
       - ignore pageblock skip hints
       - ignore cached scanner positions and scan the whole zone
      
      The new priority should get eventually picked up by
      should_compact_retry() and this should improve success rates for costly
      allocations using __GFP_REPEAT, such as hugetlbfs allocations, and
      reduce some corner-case OOM's for non-costly allocations.
      
      Link: http://lkml.kernel.org/r/20160810091226.6709-6-vbabka@suse.cz
      [vbabka@suse.cz: use the MIN_COMPACT_PRIORITY alias]
        Link: http://lkml.kernel.org/r/d443b884-87e7-1c93-8684-3a3a35759fb1@suse.czSigned-off-by: NVlastimil Babka <vbabka@suse.cz>
      Tested-by: NLorenzo Stoakes <lstoakes@gmail.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a8e025e5
新手
引导
客服 返回
顶部