1. 10 Oct, 2014 40 commits
    •
      zsmalloc: move pages_allocated to zs_pool · 13de8933
      Minchan Kim committed
      Currently, zram has no feature to limit memory so theoretically zram can
      deplete system memory.  Users have asked for a limit several times as even
      without exhaustion zram makes it hard to control memory usage of the
      platform.  This patchset adds the feature.
      
      Patch 1 makes zs_get_total_size_bytes faster because it would be used
      frequently in later patches for the new feature.
      
      Patch 2 changes zs_get_total_size_bytes's return unit from bytes to pages
      so that zsmalloc doesn't need an unnecessary operation (i.e., << PAGE_SHIFT).
      
      Patch 3 adds the new feature.  I added the feature to the zram layer, not
      zsmalloc, because the limitation is zram's requirement, not zsmalloc's, so
      any other user of zsmalloc (e.g., zpool) shouldn't be affected by an
      unnecessary branch in zsmalloc.  In future, if every user of zsmalloc wants
      the feature, we could easily move it from the client side to zsmalloc, but
      the reverse would be painful.
      
      Patch 4 adds a new facility to report the maximum memory usage of zram so
      that users need not poll /sys/block/zram0/mem_used_total frequently and
      transient maxima are not missed.
      
      This patch (of 4):
      
      pages_allocated has been counted in the size_class structure, so when a
      user of zsmalloc wants to see total_size_bytes, it has to gather the count
      from each size_class and report the sum.
      
      That's not bad if the user doesn't read the value often, but if the user
      starts to read it frequently, it is not a good deal from a performance
      point of view.
      
      This patch moves the count from size_class to zs_pool so it could reduce
      memory footprint (from [255 * 8byte] to [sizeof(atomic_long_t)]).
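
      The change described above can be sketched in userspace C.  This is an
      illustration only, not the kernel code: the struct and function names
      are hypothetical, C11 atomic_long stands in for the kernel's
      atomic_long_t, and the 255 size classes are the figure quoted in the
      commit message.

```c
#include <stdatomic.h>
#include <stddef.h>

#define NR_SIZE_CLASSES 255 /* figure taken from the commit message */

/* Before: one counter per size class; a total needs a walk over all of them. */
struct pool_per_class {
    unsigned long pages_allocated[NR_SIZE_CLASSES];
};

static unsigned long total_pages_per_class(const struct pool_per_class *pool)
{
    unsigned long sum = 0;

    for (size_t i = 0; i < NR_SIZE_CLASSES; i++)
        sum += pool->pages_allocated[i];
    return sum;
}

/* After: a single atomic counter in the pool; a total is one load. */
struct pool_single {
    atomic_long pages_allocated;
};

/* Account nr_pages newly allocated (positive) or freed (negative) pages. */
static void pool_account(struct pool_single *pool, long nr_pages)
{
    atomic_fetch_add(&pool->pages_allocated, nr_pages);
}

static long total_pages_single(struct pool_single *pool)
{
    return atomic_load(&pool->pages_allocated);
}
```

      The single counter trades per-class detail for a constant-time read,
      which is what matters once the total is read frequently.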
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Reviewed-by: Dan Streetman <ddstreet@ieee.org>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: <juno.choi@lge.com>
      Cc: <seungho1.park@lge.com>
      Cc: Luigi Semenzato <semenzato@google.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Seth Jennings <sjennings@variantweb.net>
      Reviewed-by: David Horner <ds2horner@gmail.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    •
      m68k: call find_vma with the mmap_sem held in sys_cacheflush() · cd2567b6
      Davidlohr Bueso committed
      Performing vma lookups without taking the mm->mmap_sem is asking for
      trouble.  While doing the search, the vma in question can be modified or
      even removed before returning to the caller.  Take the lock (shared) in
      order to avoid races while iterating through the vmacache and/or rbtree.
      In addition, this guarantees that the address space will remain intact
      during the CPU flushing.
      Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    •
      vmstat: on-demand vmstat workers V8 · 7cc36bbd
      Christoph Lameter committed
      vmstat workers are used for folding counter differentials into the zone,
      per node and global counters at certain time intervals.  They currently
      run at defined intervals on all processors which will cause some holdoff
      for processors that need minimal intrusion by the OS.
      
      The current vmstat_update mechanism depends on a deferrable timer firing
      every other second by default which registers a work queue item that runs
      on the local CPU, with the result that we have one interrupt and one
      additional schedulable task on each CPU every 2 seconds.  If a workload
      indeed causes VM activity or multiple tasks are running on a CPU, then
      there are probably bigger issues to deal with.
      
      However, some workloads dedicate a CPU for a single CPU bound task.  This
      is done in high performance computing, in high frequency financial
      applications, in networking (Intel DPDK, EZchip NPS) and with the advent
      of systems with more and more CPUs over time, this may become more and
      more common to do since when one has enough CPUs one cares less about
      efficiently sharing a CPU with other tasks and more about efficiently
      monopolizing a CPU per task.
      
      The difference made by having this timer fire and the workqueue kernel
      thread scheduled per second can be enormous.  An artificial test measured
      the worst-case time to do a simple "i++" in an endless loop on a bare
      metal system and under Linux on an isolated CPU with dynticks, with and
      without this patch.  Linux matches the bare metal performance (~700
      cycles) with this patch and loses by a couple of orders of magnitude
      (~200k cycles) without it[*].  The loss occurs for something that just
      calculates statistics.  For networking applications, for example, this
      could be the difference between dropping packets and sustaining line rate.
      
      Statistics are important and useful, but it would be great if there were
      a way to keep statistics gathering from producing a huge performance
      difference.  This patch does just that.
      
      This patch creates a vmstat shepherd worker that monitors the per cpu
      differentials on all processors.  If there are differentials on a
      processor then a vmstat worker local to the processor with the
      differentials is created.  That worker will then start folding the diffs
      at regular intervals.  Should the worker find that there is no work to be
      done then it will make the shepherd worker monitor the differentials
      again.
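
      The shepherd/worker hand-off described above can be modelled in a few
      lines of single-threaded C.  This is a simulation under stated
      assumptions, not the kernel implementation: the names, the fixed CPU
      count and the plain arrays are all hypothetical, and real workers run
      from workqueues rather than being called directly.

```c
#include <stdbool.h>

#define NR_CPUS 4 /* hypothetical CPU count for the sketch */

static long cpu_diff[NR_CPUS];      /* per-CPU counter differentials */
static bool worker_active[NR_CPUS]; /* is a per-CPU worker running? */
static long global_count;           /* the folded global counter */

/* Shepherd: start a worker on each unwatched CPU that has pending diffs. */
static int shepherd_scan(void)
{
    int started = 0;

    for (int cpu = 0; cpu < NR_CPUS; cpu++) {
        if (!worker_active[cpu] && cpu_diff[cpu] != 0) {
            worker_active[cpu] = true;
            started++;
        }
    }
    return started;
}

/*
 * Per-CPU worker tick: fold the differential into the global counter.
 * If there was nothing to fold, the worker stops and hands the CPU
 * back to the shepherd.
 */
static void worker_tick(int cpu)
{
    if (!worker_active[cpu])
        return;
    if (cpu_diff[cpu] == 0) {
        worker_active[cpu] = false;
        return;
    }
    global_count += cpu_diff[cpu];
    cpu_diff[cpu] = 0;
}
```

      A CPU with no differentials never gets a worker in this model, which
      is the property that lets an isolated CPU run undisturbed.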
      
      With this patch it is possible then to have periods longer than
      2 seconds without any OS event on a "cpu" (hardware thread).
      
      The patch shows a very minor increase in system performance.
      
      hackbench -s 512 -l 2000 -g 15 -f 25 -P
      
      Results before the patch:
      
      Running in process mode with 15 groups using 50 file descriptors each (== 750 tasks)
      Each sender will pass 2000 messages of 512 bytes
      Time: 4.992
      Running in process mode with 15 groups using 50 file descriptors each (== 750 tasks)
      Each sender will pass 2000 messages of 512 bytes
      Time: 4.971
      Running in process mode with 15 groups using 50 file descriptors each (== 750 tasks)
      Each sender will pass 2000 messages of 512 bytes
      Time: 5.063
      
      Hackbench after the patch:
      
      Running in process mode with 15 groups using 50 file descriptors each (== 750 tasks)
      Each sender will pass 2000 messages of 512 bytes
      Time: 4.973
      Running in process mode with 15 groups using 50 file descriptors each (== 750 tasks)
      Each sender will pass 2000 messages of 512 bytes
      Time: 4.990
      Running in process mode with 15 groups using 50 file descriptors each (== 750 tasks)
      Each sender will pass 2000 messages of 512 bytes
      Time: 4.993
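
      To put a number on "very minor", the six timings above can be averaged.
      The helper below is a small sketch with hypothetical names; the sample
      values are copied from the hackbench output.

```c
#include <stddef.h>

/* Arithmetic mean of n timing samples (n must be > 0). */
static double mean(const double *samples, size_t n)
{
    double sum = 0.0;

    for (size_t i = 0; i < n; i++)
        sum += samples[i];
    return sum / (double)n;
}

/* Times in seconds, copied from the hackbench output above. */
static const double before_patch[] = { 4.992, 4.971, 5.063 };
static const double after_patch[]  = { 4.973, 4.990, 4.993 };

/* Relative improvement of the mean run time, in percent. */
static double improvement_percent(void)
{
    double b = mean(before_patch, 3);
    double a = mean(after_patch, 3);

    return (b - a) / b * 100.0;
}
```

      The means work out to roughly 5.009 s before and 4.985 s after, a
      difference of under half a percent.  With three runs per side and a
      spread of about 0.09 s among the "before" runs, that is best read as
      "no regression" rather than a measured speed-up.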
      
      [fengguang.wu@intel.com: cpu_stat_off can be static]
      Signed-off-by: Christoph Lameter <cl@linux.com>
      Reviewed-by: Gilad Ben-Yossef <gilad@benyossef.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Mike Frysinger <vapier@gentoo.org>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Hakan Akkan <hakanakkan@gmail.com>
      Cc: Max Krasnyansky <maxk@qti.qualcomm.com>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    •
      CMA: document cma=0 · f0d6d1f6
      Jean Delvare committed
      It isn't obvious that CMA can be disabled on the kernel's command line, so
      document it.
      Signed-off-by: Jean Delvare <jdelvare@suse.de>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Akinobu Mita <akinobu.mita@gmail.com>
      Cc: Chuck Ebbert <cebbert.lkml@gmail.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    •
      fs/buffer.c: increase the buffer-head per-CPU LRU size · 86cf78d7
      Sebastien Buisson committed
      Increase the buffer-head per-CPU LRU size to allow efficient filesystem
      operations that access many blocks for each transaction.  For example,
      creating a file in a large ext4 directory with quota enabled will access
      multiple buffer heads and will overflow the LRU at the default 8-block LRU
      size:
      
      * parent directory inode table block (ctime, nlinks for subdirs)
      * new inode bitmap
      * inode table block
      * 2 quota blocks
      * directory leaf block (not reused, but pollutes one cache entry)
      * 2 levels htree blocks (only one is reused, other pollutes cache)
      * 2 levels indirect/index blocks (only one is reused)
      
      The buffer-head per-CPU LRU size is raised to 16, which in metadata
      performance benchmarks shows up to 10% gain for create, 4% for lookup and 7% for
      destroy.
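
      The overflow argument is simple arithmetic over the list above.  The
      sketch below tallies the itemised blocks; the array and function names
      are hypothetical, and the per-item counts are read directly from the
      commit message.

```c
#include <stddef.h>

/* Buffer-head accesses per file create, as itemised in the commit message. */
static const unsigned int blocks_touched[] = {
    1, /* parent directory inode table block */
    1, /* new inode bitmap */
    1, /* inode table block */
    2, /* quota blocks */
    1, /* directory leaf block */
    2, /* htree levels */
    2, /* indirect/index levels */
};

static unsigned int total_blocks(void)
{
    unsigned int sum = 0;
    size_t n = sizeof(blocks_touched) / sizeof(blocks_touched[0]);

    for (size_t i = 0; i < n; i++)
        sum += blocks_touched[i];
    return sum;
}

/* True when one such transaction touches more blocks than the LRU holds. */
static int overflows_lru(unsigned int lru_size)
{
    return total_blocks() > lru_size;
}
```

      The tally comes to 10 blocks, which exceeds the old size of 8 and fits
      within the new size of 16.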
      Signed-off-by: Liang Zhen <liang.zhen@intel.com>
      Signed-off-by: Andreas Dilger <andreas.dilger@intel.com>
      Signed-off-by: Sebastien Buisson <sebastien.buisson@bull.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    •
      mm: mempolicy: skip inaccessible VMAs when setting MPOL_MF_LAZY · 2c0346a3
      Mel Gorman committed
      PROT_NUMA VMAs are skipped to avoid problems distinguishing between
      present, prot_none and special entries.  MPOL_MF_LAZY is not visible from
      userspace since commit a720094d ("mm: mempolicy: Hide MPOL_NOOP and
      MPOL_MF_LAZY from userspace for now") but it should still skip VMAs the
      same way task_numa_work does.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    •
      selftests/vm/transhuge-stress: stress test for memory compaction · 0085d61f
      Konstantin Khlebnikov committed
      This tool induces memory fragmentation via sequential allocation of
      transparent huge pages and splitting off everything except their last
      sub-pages.  It easily generates pressure on the memory compaction code.
      
      $ perf stat -e 'compaction:*' -e 'migrate:*' ./transhuge-stress
      transhuge-stress: allocate 7858 transhuge pages, using 15716 MiB virtual memory and 61 MiB of ram
      transhuge-stress: 1.653 s/loop, 0.210 ms/page,   9504.828 MiB/s	7858 succeed,    0 failed, 2439 different pages
      transhuge-stress: 1.537 s/loop, 0.196 ms/page,  10226.227 MiB/s	7858 succeed,    0 failed, 2364 different pages
      transhuge-stress: 1.658 s/loop, 0.211 ms/page,   9479.215 MiB/s	7858 succeed,    0 failed, 2179 different pages
      transhuge-stress: 1.617 s/loop, 0.206 ms/page,   9716.992 MiB/s	7858 succeed,    0 failed, 2421 different pages
      ^C./transhuge-stress: Interrupt
      
       Performance counter stats for './transhuge-stress':
      
               1.744.051      compaction:mm_compaction_isolate_migratepages
                   1.014      compaction:mm_compaction_isolate_freepages
               1.744.051      compaction:mm_compaction_migratepages
                   1.647      compaction:mm_compaction_begin
                   1.647      compaction:mm_compaction_end
               1.744.051      migrate:mm_migrate_pages
                       0      migrate:mm_numa_migrate_ratelimit
      
             7,964696835 seconds time elapsed
      Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Rafael Aquini <aquini@redhat.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Shuah Khan <shuahkh@osg.samsung.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    •
      mm/balloon_compaction: add vmstat counters and kpageflags bit · 09316c09
      Konstantin Khlebnikov committed
      Always mark pages with PageBalloon even if balloon compaction is disabled
      and expose this mark in /proc/kpageflags as KPF_BALLOON.
      
      Also this patch adds three counters into /proc/vmstat: "balloon_inflate",
      "balloon_deflate" and "balloon_migrate".  They accumulate balloon
      activity.  Current size of balloon is (balloon_inflate - balloon_deflate)
      pages.
      
      All generic balloon code is now gathered under the option CONFIG_MEMORY_BALLOON.
      It should be selected by any ballooning driver that wants to use this feature.
      Currently virtio-balloon is the only user.
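
      The balloon size formula given above (balloon_inflate minus
      balloon_deflate) can be applied to text in the "name value" layout that
      /proc/vmstat uses.  The parser below is a hedged sketch: the function
      names are hypothetical and it works on an in-memory string, so reading
      the file itself is left to the caller.

```c
#include <stdio.h>
#include <string.h>

/*
 * Scan text in "name value" per-line form and return the value of the
 * named counter, or -1 if it is absent.
 */
static long vmstat_counter(const char *text, const char *name)
{
    char key[64];
    long value;
    const char *line = text;

    while (line && *line) {
        if (sscanf(line, "%63s %ld", key, &value) == 2 &&
            strcmp(key, name) == 0)
            return value;
        line = strchr(line, '\n');
        if (line)
            line++; /* step past the newline */
    }
    return -1;
}

/* Current balloon size in pages: balloon_inflate - balloon_deflate. */
static long balloon_pages(const char *text)
{
    long inflate = vmstat_counter(text, "balloon_inflate");
    long deflate = vmstat_counter(text, "balloon_deflate");

    if (inflate < 0 || deflate < 0)
        return -1; /* counters not present */
    return inflate - deflate;
}
```

      Returning -1 when either counter is missing keeps the helper usable on
      kernels built without CONFIG_MEMORY_BALLOON, where the lines would
      presumably not appear.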
      Signed-off-by: Konstantin Khlebnikov <k.khlebnikov@samsung.com>
      Cc: Rafael Aquini <aquini@redhat.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    •
      mm/balloon_compaction: remove balloon mapping and flag AS_BALLOON_MAP · 9d1ba805
      Konstantin Khlebnikov committed
      Now ballooned pages are detected using PageBalloon().  Fake mapping is no
      longer required.  This patch links ballooned pages to balloon device using
      field page->private instead of page->mapping.  Also this patch embeds
      balloon_dev_info directly into struct virtio_balloon.
      Signed-off-by: Konstantin Khlebnikov <k.khlebnikov@samsung.com>
      Cc: Rafael Aquini <aquini@redhat.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    •
      mm/balloon_compaction: redesign ballooned pages management · d6d86c0a
      Konstantin Khlebnikov committed
      Sasha Levin reported a KASAN splash inside isolate_migratepages_range().
      The problem is in the function __is_movable_balloon_page(), which tests
      AS_BALLOON_MAP in page->mapping->flags.  This function has no protection
      against anonymous pages.  As a result, it tried to check address space
      flags inside struct anon_vma.
      
      Further investigation shows more problems in the current implementation:
      
      * The special branch in __unmap_and_move() never works:
        balloon_page_movable() checks page flags and page_count.  In
        __unmap_and_move() the page is locked and its reference counter is
        elevated, thus balloon_page_movable() always fails.  As a result
        execution goes to the normal migration path.
        virtballoon_migratepage() returns MIGRATEPAGE_BALLOON_SUCCESS instead
        of MIGRATEPAGE_SUCCESS, move_to_new_page() thinks this is an error
        code and assigns newpage->mapping to NULL.  The newly migrated page
        loses its connection to the balloon and all ability for further
        migration.
      
      * lru_lock is erroneously required in isolate_migratepages_range() for
        isolating a ballooned page.  This function releases lru_lock
        periodically, which makes migration mostly impossible for some pages.
      
      * balloon_page_dequeue has a tight race with balloon_page_isolate:
        balloon_page_isolate could be executed in parallel with dequeue between
        picking the page from the list and locking page_lock.  The race is rare
        because they use trylock_page() for locking.
      
      This patch fixes all of them.
      
      Instead of fake mapping with special flag this patch uses special state of
      page->_mapcount: PAGE_BALLOON_MAPCOUNT_VALUE = -256.  Buddy allocator uses
      PAGE_BUDDY_MAPCOUNT_VALUE = -128 for similar purpose.  Storing mark
      directly in struct page makes everything safer and easier.
      
      PagePrivate is used to mark pages present in page list (i.e.  not
      isolated, like PageLRU for normal pages).  It replaces special rules for
      reference counter and makes balloon migration similar to migration of
      normal pages.  This flag is protected by page_lock together with link to
      the balloon device.
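
      The idea of storing the mark in page->_mapcount can be illustrated
      with a reduced stand-in for struct page.  This is a sketch only: the
      struct and helper names are hypothetical, a plain int replaces the
      kernel's atomic field, and -1 is assumed as the "not mapped" resting
      value.  The two special values are the ones quoted in the commit
      message.

```c
#include <stdbool.h>

/* Values quoted in the commit message. */
#define PAGE_BUDDY_MAPCOUNT_VALUE   (-128)
#define PAGE_BALLOON_MAPCOUNT_VALUE (-256)

/* Hypothetical stand-in for struct page; only the field used here. */
struct page_sketch {
    int mapcount; /* assumed: -1 means "not mapped" */
};

static bool page_is_balloon(const struct page_sketch *page)
{
    return page->mapcount == PAGE_BALLOON_MAPCOUNT_VALUE;
}

static bool page_is_buddy(const struct page_sketch *page)
{
    return page->mapcount == PAGE_BUDDY_MAPCOUNT_VALUE;
}

/* Mark an unmapped page as ballooned; refuse if the field is in use. */
static bool set_page_balloon(struct page_sketch *page)
{
    if (page->mapcount != -1)
        return false;
    page->mapcount = PAGE_BALLOON_MAPCOUNT_VALUE;
    return true;
}

/* Drop the mark, returning the field to its unmapped state. */
static void clear_page_balloon(struct page_sketch *page)
{
    if (page_is_balloon(page))
        page->mapcount = -1;
}
```

      Because the test reads only the page itself, it cannot wander into an
      unrelated structure the way a check through page->mapping could.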
      Signed-off-by: Konstantin Khlebnikov <k.khlebnikov@samsung.com>
      Reported-by: Sasha Levin <sasha.levin@oracle.com>
      Link: http://lkml.kernel.org/p/53E6CEAA.9020105@oracle.com
      Cc: Rafael Aquini <aquini@redhat.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: <stable@vger.kernel.org>	[3.8+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    •
      arm64: mm: enable RCU fast_gup · 29e56940
      Steve Capper committed
      Activate the RCU fast_gup for ARM64.  We also need to force THP splits to
      broadcast an IPI so that we block in the fast_gup page walker.  As THP
      splits are comparatively rare, this should not lead to a noticeable
      performance degradation.
      
      Some pre-requisite functions pud_write and pud_page are also added.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Steve Capper <steve.capper@linaro.org>
      Tested-by: Dann Frazier <dann.frazier@canonical.com>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Christoffer Dall <christoffer.dall@linaro.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    •
      arm64: mm: enable HAVE_RCU_TABLE_FREE logic · 5e5f6dc1
      Steve Capper committed
      In order to implement fast_get_user_pages we need to ensure that the page
      table walker is protected from page table pages being freed from under it.
      
      This patch enables HAVE_RCU_TABLE_FREE, so any page table pages belonging
      to address spaces with multiple users will be freed via call_rcu_sched.
      This means that disabling interrupts will block the free and protect the
      fast gup page walker.
      Signed-off-by: Steve Capper <steve.capper@linaro.org>
      Tested-by: Dann Frazier <dann.frazier@canonical.com>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Christoffer Dall <christoffer.dall@linaro.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    •
      arm: mm: enable RCU fast_gup · b8cd51af
      Steve Capper committed
      Activate the RCU fast_gup for ARM.  We also need to force THP splits to
      broadcast an IPI so that we block in the fast_gup page walker.  As THP
      splits are comparatively rare, this should not lead to a noticeable
      performance degradation.
      
      Some pre-requisite functions pud_write and pud_page are also added.
      Signed-off-by: Steve Capper <steve.capper@linaro.org>
      Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dann Frazier <dann.frazier@canonical.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Christoffer Dall <christoffer.dall@linaro.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    •
      arm: mm: enable HAVE_RCU_TABLE_FREE logic · a0ad5496
      Steve Capper committed
      In order to implement fast_get_user_pages we need to ensure that the page
      table walker is protected from page table pages being freed from under it.
      
      This patch enables HAVE_RCU_TABLE_FREE, so any page table pages belonging
      to address spaces with multiple users will be freed via call_rcu_sched.
      This means that disabling interrupts will block the free and protect the
      fast gup page walker.
      Signed-off-by: Steve Capper <steve.capper@linaro.org>
      Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dann Frazier <dann.frazier@canonical.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Christoffer Dall <christoffer.dall@linaro.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    •
      arm: mm: introduce special ptes for LPAE · bd951303
      Steve Capper committed
      We need a mechanism to tag ptes as being special; this indicates that no
      attempt should be made to access the underlying struct page * associated
      with the pte.  This is used by the fast_gup when operating on ptes as it
      has no means to access VMAs (that also contain this information)
      locklessly.
      
      The L_PTE_SPECIAL bit is already allocated for LPAE; this patch modifies
      pte_special and pte_mkspecial to make use of it, and defines
      __HAVE_ARCH_PTE_SPECIAL.
      
      This patch also excludes special ptes from the icache/dcache sync logic.
      Signed-off-by: Steve Capper <steve.capper@linaro.org>
      Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dann Frazier <dann.frazier@canonical.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Christoffer Dall <christoffer.dall@linaro.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    •
      mm: introduce a general RCU get_user_pages_fast() · 2667f50e
      Steve Capper committed
      This series implements general forms of get_user_pages_fast and
      __get_user_pages_fast in core code and activates them for arm and arm64.
      
      These are required for Transparent HugePages to function correctly, as a
      futex on a THP tail will otherwise result in an infinite loop (due to the
      core implementation of __get_user_pages_fast always returning 0).
      
      Unfortunately, a futex on THP tail can be quite common for certain
      workloads; thus THP is unreliable without a __get_user_pages_fast
      implementation.
      
      This series may also be beneficial for direct-IO heavy workloads and
      certain KVM workloads.
      
      This patch (of 6):
      
      get_user_pages_fast() attempts to pin user pages by walking the page
      tables directly and avoids taking locks.  Thus the walker needs to be
      protected from page table pages being freed from under it, and needs to
      block any THP splits.
      
      One way to achieve this is to have the walker disable interrupts, and rely
      on IPIs from the TLB flushing code blocking before the page table pages
      are freed.
      
      On some platforms we have hardware broadcast of TLB invalidations, thus
      the TLB flushing code doesn't necessarily need to broadcast IPIs; and
      spuriously broadcasting IPIs can hurt system performance if done too
      often.
      
      This problem has been solved on PowerPC and Sparc by batching up page
      table pages belonging to more than one mm_user, then scheduling an
      rcu_sched callback to free the pages.  This RCU page table free logic has
      been promoted to core code and is activated when one enables
      HAVE_RCU_TABLE_FREE.  Unfortunately, these architectures implement their
      own get_user_pages_fast routines.
      
      The RCU page table free logic coupled with an IPI broadcast on THP split
      (which is a rare event), allows one to protect a page table walker by
      merely disabling the interrupts during the walk.
      
      This patch provides a general RCU implementation of get_user_pages_fast
      that can be used by architectures that perform hardware broadcast of TLB
      invalidations.
      
      It is based heavily on the PowerPC implementation by Nick Piggin.
      
      [akpm@linux-foundation.org: various comment fixes]
      Signed-off-by: Steve Capper <steve.capper@linaro.org>
      Tested-by: Dann Frazier <dann.frazier@canonical.com>
      Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Christoffer Dall <christoffer.dall@linaro.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    •
      mm/dmapool.c: fixed a brace coding style issue · baa2ef83
      Paul McQuade committed
      Remove 3 sets of braces that the coding style deems unnecessary for any
      arm of the statement.
      Signed-off-by: Paul McQuade <paulmcquad@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    •
      mm: ksm use pr_err instead of printk · 25acde31
      Paul McQuade committed
      WARNING: Prefer: pr_err(...  to printk(KERN_ERR ...
      
      [akpm@linux-foundation.org: remove KERN_ERR]
      Signed-off-by: Paul McQuade <paulmcquad@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    •
      drivers/firmware/memmap.c: don't create memmap sysfs of same firmware_map_entry · 22880ebe
      Yasuaki Ishimatsu committed
      The following commits prevent allocating a firmware_map_entry for the
      same memory range more than once:
        f0093ede: drivers/firmware/memmap.c: don't allocate firmware_map_entry
                  of same memory range
        49c8b24d: drivers/firmware/memmap.c: pass the correct argument to
                  firmware_map_find_entry_bootmem()
      
      But that's not enough.  When a PNP0C80 device is added by acpi_scan_init(),
      the memmap sysfs of the same firmware_map_entry is created twice, as follows:
      
        # cat /sys/firmware/memmap/*/start
        0x40000000000
        0x60000000000
        0x4a837000
        0x4a83a000
        0x4a8b5000
        ...
        0x40000000000
        0x60000000000
        ...
      
      The flow of the issue is as follows:
      
        1. e820_reserve_resources() allocates firmware_map_entrys for all
           memory ranges defined in e820, and these firmware_map_entrys
           are linked into the map_entries list.
      
           map_entries -> entry 1 -> ... -> entry N
      
        2. When a PNP0C80 device is limited by the mem= boot option,
           acpi_scan_init() adds the memory device.  In this case,
           firmware_map_add_hotplug() allocates a firmware_map_entry and
           creates its memmap sysfs.
      
           map_entries -> entry 1 -> ... -> entry N -> entry N+1
                                                       |
                                                       memmap 1
      
        3. firmware_memmap_init() creates memmap sysfses of firmware_map_entrys
           linked with map_entries.
      
           map_entries -> entry 1 -> ... -> entry N -> entry N+1
                           |                 |             |
                           memmap 2          memmap N+1    memmap 1
                                                           memmap N+2
      
      So while hot-removing the PNP0C80 device, a kernel panic occurs as follows:
      
           BUG: unable to handle kernel paging request at 00000001003e000b
            IP: sysfs_open_file+0x46/0x2b0
            PGD 203a89fe067 PUD 0
            Oops: 0000 [#1] SMP
            ...
            Call Trace:
              do_dentry_open+0x1ef/0x2a0
              finish_open+0x31/0x40
              do_last+0x57c/0x1220
              path_openat+0xc2/0x4c0
              do_filp_open+0x4b/0xb0
              do_sys_open+0xf3/0x1f0
              SyS_open+0x1e/0x20
              system_call_fastpath+0x16/0x1b
      
      The patch adds a check that confirms whether the memmap sysfs of a
      firmware_map_entry has already been created, and does not create memmap
      sysfs of same firmware_map_entry.
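
      The guard described above amounts to an idempotent registration.  The
      sketch below is a userspace illustration under stated assumptions: the
      struct, its in_sysfs flag and the function name are hypothetical
      stand-ins, and a counter replaces the real sysfs registration call.

```c
#include <stdbool.h>

/* Hypothetical stand-in for firmware_map_entry; only what the sketch needs. */
struct map_entry_sketch {
    unsigned long long start;
    unsigned long long end;
    bool in_sysfs; /* set once the memmap sysfs directory exists */
};

static int sysfs_dirs_created; /* counts simulated sysfs registrations */

/*
 * Create the memmap sysfs directory for an entry at most once.
 * Returns 0 on creation and -1 if the entry was already registered.
 */
static int add_sysfs_entry(struct map_entry_sketch *entry)
{
    if (entry->in_sysfs)
        return -1; /* already visible in sysfs: do nothing */

    sysfs_dirs_created++; /* stands in for the real registration call */
    entry->in_sysfs = true;
    return 0;
}
```

      With this check, the second pass over map_entries in step 3 would skip
      entry N+1 instead of registering it again, so hot removal has a single
      sysfs object to tear down.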
      Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Santosh Shilimkar <santosh.shilimkar@ti.com>
      Cc: Toshi Kani <toshi.kani@hp.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    •
      mm/bootmem.c: use include/linux/ headers · d85fbee8
      Paul McQuade committed
      Replace asm/ headers with linux/ headers:
      
      <linux/bug.h>
      <linux/io.h>
      Signed-off-by: Paul McQuade <paulmcquad@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    •
      mm/mremap.c: use linux headers · 2581d202
      Paul McQuade committed
      "WARNING: Use #include <linux/uaccess.h> instead of <asm/uaccess.h>"
      Signed-off-by: Paul McQuade <paulmcquad@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    •
      memcg: zap memcg_can_account_kmem · cf2b8fbf
      Vladimir Davydov committed
      memcg_can_account_kmem() returns true iff
      
          !mem_cgroup_disabled() && !mem_cgroup_is_root(memcg) &&
                                         memcg_kmem_is_active(memcg);
      
      To begin with the !mem_cgroup_is_root(memcg) check is useless, because one
      can't enable kmem accounting for the root cgroup (mem_cgroup_write()
      returns EINVAL on an attempt to set the limit on the root cgroup).
      
      Furthermore, the !mem_cgroup_disabled() check also seems to be redundant.
      The point is memcg_can_account_kmem() is called from three places:
      mem_cgroup_slabinfo_read(), __memcg_kmem_get_cache(), and
      __memcg_kmem_newpage_charge().  The latter two functions are only invoked
      if memcg_kmem_enabled() returns true, which implies that the memory cgroup
      subsystem is enabled.  And mem_cgroup_slabinfo_read() shows the output of
      memory.kmem.slabinfo, which won't exist if the memory cgroup is completely
      disabled.
      
      So let's substitute all the calls to memcg_can_account_kmem() with plain
      memcg_kmem_is_active(), and kill the former.
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cf2b8fbf
    • J
      mm: memcontrol: fix transparent huge page allocations under pressure · b70a2a21
      Johannes Weiner committed
      In a memcg with even just moderate cache pressure, success rates for
      transparent huge page allocations drop to zero, wasting a lot of effort
      that the allocator puts into assembling these pages.
      
      The reason for this is that the memcg reclaim code was never designed for
      higher-order charges.  It reclaims in small batches until there is room
      for at least one page.  Huge page charges only succeed when these batches
      add up over a series of huge faults, which is unlikely under any
      significant load involving order-0 allocations in the group.
      
      Remove that loop on the memcg side in favor of passing the actual reclaim
      goal to direct reclaim, which is already set up and optimized to meet
      higher-order goals efficiently.
      
      This brings memcg's THP policy in line with the system policy: if the
      allocator painstakingly assembles a hugepage, memcg will at least make an
      honest effort to charge it.  As a result, transparent hugepage allocation
      rates amid cache activity are drastically improved:
      
                                            vanilla                 patched
      pgalloc                 4717530.80 (  +0.00%)   4451376.40 (  -5.64%)
      pgfault                  491370.60 (  +0.00%)    225477.40 ( -54.11%)
      pgmajfault                    2.00 (  +0.00%)         1.80 (  -6.67%)
      thp_fault_alloc               0.00 (  +0.00%)       531.60 (+100.00%)
      thp_fault_fallback          749.00 (  +0.00%)       217.40 ( -70.88%)
      
      [ Note: this may in turn increase memory consumption from internal
        fragmentation, which is an inherent risk of transparent hugepages.
        Some setups may have to adjust the memcg limits accordingly to
        accommodate this - or, if the machine is already packed to capacity,
        disable the transparent huge page feature. ]
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b70a2a21
    • J
      mm: memcontrol: simplify detecting when the memory+swap limit is hit · 3fbe7244
      Johannes Weiner committed
      When attempting to charge pages, we first charge the memory counter and
      then the memory+swap counter.  If one of the counters is at its limit, we
      enter reclaim, but if it's the memory+swap counter, reclaim shouldn't swap
      because that wouldn't change the situation.  However, if the counters have
      the same limits, we never get to the memory+swap limit.  To know whether
      reclaim should swap or not, there is a state flag that indicates whether
      the limits are equal and whether hitting the memory limit implies hitting
      the memory+swap limit.
      
      Just try the memory+swap counter first.
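The charge order described above can be modelled in user space. This is only a minimal sketch, not the memcg code: `struct counter`, `counter_try_charge()`, `try_charge()` and the result names are hypothetical stand-ins for the kernel's counters and charge path.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a memcg counter: usage and limit, in pages. */
struct counter {
	unsigned long usage;
	unsigned long limit;
};

enum charge_result {
	CHARGE_OK,		/* both counters had room */
	RECLAIM_MAY_SWAP,	/* only the memory limit was hit */
	RECLAIM_NO_SWAP,	/* the memory+swap limit was hit */
};

/* Charge nr pages if they fit under the limit; report whether they did. */
static bool counter_try_charge(struct counter *c, unsigned long nr)
{
	if (c->usage + nr > c->limit)
		return false;
	c->usage += nr;
	return true;
}

/*
 * Try memory+swap first, then memory.  Which counter failed tells the
 * caller directly whether swapping could help, so no extra state flag
 * is needed for the "limits are equal" case.
 */
static enum charge_result try_charge(struct counter *mem,
				     struct counter *memsw,
				     unsigned long nr)
{
	if (!counter_try_charge(memsw, nr))
		return RECLAIM_NO_SWAP;
	if (!counter_try_charge(mem, nr)) {
		memsw->usage -= nr;	/* undo the memory+swap charge */
		return RECLAIM_MAY_SWAP;
	}
	return CHARGE_OK;
}
```

With equal limits on both counters, the first check already fails, so the model reports that reclaim should not swap without consulting any separate flag.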
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Vladimir Davydov <vdavydov@parallels.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3fbe7244
    • M
      mm: memcontrol: do not kill uncharge batching in free_pages_and_swap_cache · aabfb572
      Michal Hocko committed
      free_pages_and_swap_cache limits release_pages to PAGEVEC_SIZE chunks.
      This is not a big deal for the normal release path but it completely kills
      memcg uncharge batching which reduces res_counter spin_lock contention.
      Dave has noticed this with his page fault scalability test case on a large
      machine when the lock was basically dominating on all CPUs:
      
          80.18%    80.18%  [kernel]               [k] _raw_spin_lock
                        |
                        --- _raw_spin_lock
                           |
                           |--66.59%-- res_counter_uncharge_until
                           |          res_counter_uncharge
                           |          uncharge_batch
                           |          uncharge_list
                           |          mem_cgroup_uncharge_list
                           |          release_pages
                           |          free_pages_and_swap_cache
                           |          tlb_flush_mmu_free
                           |          |
                           |          |--90.12%-- unmap_single_vma
                           |          |          unmap_vmas
                           |          |          unmap_region
                           |          |          do_munmap
                           |          |          vm_munmap
                           |          |          sys_munmap
                           |          |          system_call_fastpath
                           |          |          __GI___munmap
                           |          |
                           |           --9.88%-- tlb_flush_mmu
                           |                     tlb_finish_mmu
                           |                     unmap_region
                           |                     do_munmap
                           |                     vm_munmap
                           |                     sys_munmap
                           |                     system_call_fastpath
                           |                     __GI___munmap
      
      In his case the load was running in the root memcg and that part has been
      handled by reverting 05b84301 ("mm: memcontrol: use root_mem_cgroup
      res_counter") because this is a clear regression, but the problem remains
      inside dedicated memcgs.
      
      There is no reason to limit release_pages to PAGEVEC_SIZE batches other
      than lru_lock held times.  This logic, however, can be moved inside the
      function.  mem_cgroup_uncharge_list and free_hot_cold_page_list do not
      hold any lock for the whole pages_to_free list so it is safe to call them
      in a single run.
      
      The release_pages() code previously dropped the lru_lock every
      PAGEVEC_SIZE pages (i.e., 14 pages).  However, this code does not use
      pagevecs, so switch to breaking the lock at least every SWAP_CLUSTER_MAX
      (32) pages.  This means that the lock acquisition frequency is
      approximately halved and the max hold times are approximately doubled.
      
      The now unneeded batching is removed from free_pages_and_swap_cache().
      
      Also update the grossly out-of-date release_pages documentation.
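The "approximately halved" estimate can be checked with a little arithmetic. The helper below is a hypothetical model, not kernel code: it assumes every page being released needs the lock and that the lock is dropped and re-taken after each full batch.

```c
/* Batch sizes quoted in the commit message. */
#define PAGEVEC_SIZE		14
#define SWAP_CLUSTER_MAX	32

/*
 * Hypothetical helper: number of lock acquisitions needed to release
 * nr_pages pages when the lock is re-taken after every `batch` pages.
 */
static unsigned long lock_acquisitions(unsigned long nr_pages,
				       unsigned long batch)
{
	return (nr_pages + batch - 1) / batch;	/* ceil(nr_pages / batch) */
}
```

For 448 pages this gives 32 acquisitions with a batch of 14 and 14 acquisitions with a batch of 32, a ratio of about 2.3.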
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reported-by: Dave Hansen <dave@sr71.net>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      aabfb572
    • S
      mm: dmapool: add/remove sysfs file outside of the pool lock · 01c2965f
      Sebastian Andrzej Siewior committed
      cat /sys/.../pools followed by removal of the device leads to:
      
      |======================================================
      |[ INFO: possible circular locking dependency detected ]
      |3.17.0-rc4+ #1498 Not tainted
      |-------------------------------------------------------
      |rmmod/2505 is trying to acquire lock:
      | (s_active#28){++++.+}, at: [<c017f754>] kernfs_remove_by_name_ns+0x3c/0x88
      |
      |but task is already holding lock:
      | (pools_lock){+.+.+.}, at: [<c011494c>] dma_pool_destroy+0x18/0x17c
      |
      |which lock already depends on the new lock.
      |the existing dependency chain (in reverse order) is:
      |
      |-> #1 (pools_lock){+.+.+.}:
      |   [<c0114ae8>] show_pools+0x30/0xf8
      |   [<c0313210>] dev_attr_show+0x1c/0x48
      |   [<c0180e84>] sysfs_kf_seq_show+0x88/0x10c
      |   [<c017f960>] kernfs_seq_show+0x24/0x28
      |   [<c013efc4>] seq_read+0x1b8/0x480
      |   [<c011e820>] vfs_read+0x8c/0x148
      |   [<c011ea10>] SyS_read+0x40/0x8c
      |   [<c000e960>] ret_fast_syscall+0x0/0x48
      |
      |-> #0 (s_active#28){++++.+}:
      |   [<c017e9ac>] __kernfs_remove+0x258/0x2ec
      |   [<c017f754>] kernfs_remove_by_name_ns+0x3c/0x88
      |   [<c0114a7c>] dma_pool_destroy+0x148/0x17c
      |   [<c03ad288>] hcd_buffer_destroy+0x20/0x34
      |   [<c03a4780>] usb_remove_hcd+0x110/0x1a4
      
      The problem is the lock order of pools_lock and kernfs_mutex in
      dma_pool_destroy() vs show_pools() call path.
      
      This patch moves the creation of the sysfs file outside of the
      pools_lock mutex.  The newly added pools_reg_lock ensures that the create
      and destroy code paths do not race over whether the sysfs file has to be
      deleted (and whether it was deleted before we try to create a new one),
      and over what to do if device_create_file() failed.
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      01c2965f
    • V
      memcg: move memcg_update_cache_size() to slab_common.c · 6f817f4c
      Vladimir Davydov committed
      While growing per memcg caches arrays, we jump between memcontrol.c and
      slab_common.c in a weird way:
      
        memcg_alloc_cache_id - memcontrol.c
          memcg_update_all_caches - slab_common.c
            memcg_update_cache_size - memcontrol.c
      
      There's absolutely no reason why memcg_update_cache_size can't live on the
      slab's side though.  So let's move it there and settle it comfortably amid
      per-memcg cache allocation functions.
      
      Besides, this patch cleans this function up a bit, removing all the
      useless comments from it, and renames it to memcg_update_cache_params to
      conform to memcg_alloc/free_cache_params, which we already have in
      slab_common.c.
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6f817f4c
    • V
      memcg: don't call memcg_update_all_caches if new cache id fits · f3bb3043
      Vladimir Davydov committed
      memcg_update_all_caches grows arrays of per-memcg caches, so we only need
      to call it when memcg_limited_groups_array_size is increased.  However,
      currently we invoke it each time a new kmem-active memory cgroup is
      created.  Then it just iterates over all slab_caches and does nothing
      (memcg_update_cache_size returns immediately).
      
      This patch fixes this insanity.  In the meantime it moves the code dealing
      with id allocations to separate functions, memcg_alloc_cache_id and
      memcg_free_cache_id.
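The idea of growing the arrays only when a new id does not fit can be sketched as follows. This is a user-space model under stated assumptions: the variable names echo the commit message, but `alloc_cache_id()`, `update_all_caches()`, the initial size of 4 and the doubling growth policy are all hypothetical.

```c
/* Hypothetical model state; the names echo the commit message. */
static int memcg_limited_groups_array_size = 4;
static int next_id;
static int update_all_caches_calls;

/* Stand-in for memcg_update_all_caches(): only counts invocations. */
static void update_all_caches(int new_size)
{
	update_all_caches_calls++;
	memcg_limited_groups_array_size = new_size;
}

/* Allocate a cache id; grow the arrays only if the id does not fit. */
static int alloc_cache_id(void)
{
	int id = next_id++;

	if (id >= memcg_limited_groups_array_size)
		update_all_caches(2 * (id + 1));	/* assumed growth policy */
	return id;
}
```

In this model the first four ids trigger no update at all, and the fifth triggers exactly one, after which ids 5 to 9 fit again.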
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f3bb3043
    • V
      memcg: move memcg_{alloc,free}_cache_params to slab_common.c · 33a690c4
      Vladimir Davydov committed
      The only reason why they live in memcontrol.c is that we get/put css
      reference to the owner memory cgroup in them.  However, we can do that in
      memcg_{un,}register_cache.  OTOH, there are several reasons to move them
      to slab_common.c.
      
      First, I think that the less public interface functions we have in
      memcontrol.h the better.  Since the functions I move don't depend on
      memcontrol, I think it's worth making them private to slab, especially
      taking into account that the arrays are defined on the slab's side too.
      
      Second, the way per-memcg arrays are updated looks rather awkward: it
      proceeds from memcontrol.c (__memcg_activate_kmem) to slab_common.c
      (memcg_update_all_caches) and back to memcontrol.c again
      (memcg_update_array_size).  In the following patches I move the function
      relocating the arrays (memcg_update_array_size) to slab_common.c and
      therefore get rid of this circular call path.  I think we should have the
      cache allocation stuff in the same place where we have relocation, because
      it's easier to follow the code then.  So I move arrays alloc/free
      functions to slab_common.c too.
      
      The third point isn't obvious.  I'm going to make the list_lru structure
      per-memcg to allow targeted kmem reclaim.  That means we will have
      per-memcg arrays in list_lrus too.  It turns out that it's much easier to
      update these arrays in list_lru.c rather than in memcontrol.c, because all
      the stuff we need is defined there.  This patch makes memcg caches arrays
      allocation path conform to that of the upcoming list_lru.
      
      So let's move these functions to slab_common.c and make them static.
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      33a690c4
    • A
      mm/debug.c: use pr_emerg() · 7a82ca0d
      Andrew Morton committed
      - s/KERN_ALERT/pr_emerg/: we're going to BUG so let's maximize the chances
        of getting the message out.
      
      - convert debug.c to pr_foo()
      
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7a82ca0d
    • S
      mm: use VM_BUG_ON_MM where possible · 96dad67f
      Sasha Levin committed
      Dump the contents of the relevant mm_struct when we hit the bug condition.
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      96dad67f
    • S
      mm: introduce VM_BUG_ON_MM · 31c9afa6
      Sasha Levin committed
      Very similar to VM_BUG_ON_PAGE and VM_BUG_ON_VMA, dump the mm_struct when
      the bug is hit.
      
      [akpm@linux-foundation.org: coding-style fixes]
      [mhocko@suse.cz: fix build]
      [mhocko@suse.cz: fix build some more]
      [akpm@linux-foundation.org: do strange things to avoid doing strange things for the comma separators]
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      Cc: Dave Jones <davej@redhat.com>
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      31c9afa6
    • S
      mm: move debug code out of page_alloc.c · 82742a3a
      Sasha Levin committed
      dump_page() and dump_vma() are not specific to page_alloc.c; move them out
      so page_alloc.c won't turn into the unofficial debug repository.
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      82742a3a
    • P
      mm: softdirty: unmapped addresses between VMAs are clean · 81d0fa62
      Peter Feiner committed
      If a /proc/pid/pagemap read spans a [VMA, an unmapped region, then a
      VM_SOFTDIRTY VMA], the virtual pages in the unmapped region are reported
      as softdirty.  Here's a program to demonstrate the bug:
      
      #include <assert.h>
      #include <fcntl.h>
      #include <stdint.h>
      #include <sys/mman.h>
      #include <unistd.h>
      
      int main(void) {
      	const uint64_t PAGEMAP_SOFTDIRTY = 1ull << 55;
      	uint64_t pme[3];
      	int fd = open("/proc/self/pagemap", O_RDONLY);
      	char *m = mmap(NULL, 3 * getpagesize(), PROT_READ,
      	               MAP_ANONYMOUS | MAP_SHARED, -1, 0);
      	munmap(m + getpagesize(), getpagesize());
      	pread(fd, pme, 24, (unsigned long) m / getpagesize() * 8);
      	assert(pme[0] & PAGEMAP_SOFTDIRTY);    /* passes */
      	assert(!(pme[1] & PAGEMAP_SOFTDIRTY)); /* fails without this patch */
      	assert(pme[2] & PAGEMAP_SOFTDIRTY);    /* passes */
      	return 0;
      }
      
      (Note that all pages in new VMAs are softdirty until cleared).
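The offset arithmetic and the bit test used by the program above can be isolated into two small helpers. This is a sketch only: `pagemap_offset()` and `pme_is_softdirty()` are hypothetical names, and the 8-byte entry size and bit 55 are taken from the demonstration program.

```c
#include <stdbool.h>
#include <stdint.h>

/* Soft-dirty flag of a pagemap entry, bit 55 as in the program above. */
#define PAGEMAP_SOFTDIRTY (UINT64_C(1) << 55)

/* Byte offset in /proc/pid/pagemap of the 8-byte entry covering vaddr. */
static uint64_t pagemap_offset(uintptr_t vaddr, uint64_t page_size)
{
	return (uint64_t)vaddr / page_size * sizeof(uint64_t);
}

/* True if the pagemap entry has the soft-dirty bit set. */
static bool pme_is_softdirty(uint64_t pme)
{
	return (pme & PAGEMAP_SOFTDIRTY) != 0;
}
```

For example, with 4096-byte pages, any address in the fourth virtual page (0x3000 to 0x3fff) maps to byte offset 24.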
      
      Tested:
      	Used the program given above. I'm going to include this code in
      	a selftest in the future.
      
      [n-horiguchi@ah.jp.nec.com: prevent pagemap_pte_range() from overrunning]
      Signed-off-by: Peter Feiner <pfeiner@google.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Jamie Liu <jamieliu@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      81d0fa62
    • M
      mm: page_alloc: default node-ordering on 64-bit NUMA, zone-ordering on 32-bit · 3193913c
      Mel Gorman committed
      Zones are allocated by the page allocator in either node or zone order.
      Node ordering is preferred in terms of locality and is applied
      automatically in one of three cases:
      
        1. If a node has only low memory
      
        2. If DMA/DMA32 is a high percentage of memory
      
        3. If low memory on a single node is greater than 70% of the node size
      
      Otherwise zone ordering is used to preserve low memory for devices that
      require it.  Unfortunately a consequence of this is that applications
      running on a machine with balanced NUMA nodes will experience different
      performance characteristics depending on which node they happen to start
      from.
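The difference between the two orderings can be illustrated with a toy fallback list. This is a simplified sketch, not the page allocator: it assumes two nodes that each have a DMA32 and a Normal zone, it treats `(local + i) % NR_NODES` as the node distance order, and `struct zoneref` here is a hypothetical two-field struct.

```c
#define NR_NODES 2
#define NR_ZONES 2	/* zone 0 = DMA32 (low), zone 1 = Normal (high) */

/* Hypothetical fallback-list entry: which zone on which node. */
struct zoneref {
	int node;
	int zone;
};

/* Node order: use every zone of the nearest node before going remote. */
static void build_node_order(int local, struct zoneref out[NR_NODES * NR_ZONES])
{
	int n = 0;

	for (int i = 0; i < NR_NODES; i++) {
		int node = (local + i) % NR_NODES;	/* assumed distance order */

		for (int zone = NR_ZONES - 1; zone >= 0; zone--)
			out[n++] = (struct zoneref){ node, zone };
	}
}

/* Zone order: use the highest zone on all nodes before any lower zone. */
static void build_zone_order(int local, struct zoneref out[NR_NODES * NR_ZONES])
{
	int n = 0;

	for (int zone = NR_ZONES - 1; zone >= 0; zone--)
		for (int i = 0; i < NR_NODES; i++)
			out[n++] = (struct zoneref){ (local + i) % NR_NODES, zone };
}
```

Starting from node 0, node order tries local DMA32 second, while zone order tries remote Normal second and keeps local DMA32 for later. That is the locality versus low-memory-protection trade-off the message describes.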
      
      The point of zone ordering is to protect lower zones for devices that
      require DMA/DMA32 memory.  When NUMA was first introduced, this was
      critical as 32-bit NUMA machines existed and exhausting low memory
      triggered OOMs easily as so many allocations required low memory.  On
      64-bit machines the primary concern is devices that are 32-bit only which
      is less severe than the low memory exhaustion problem on 32-bit NUMA.  It
      seems there are really few devices that depend on it.
      
      AGP -- I assume this is getting more rare but even then I think the allocations
      	happen early in boot time where lowmem pressure is less of a problem
      
      DRM -- If the device is 32-bit only then there may be low pressure. I didn't
      	evaluate these in detail but it looks like some of these are mobile
      	graphics card. Not many NUMA laptops out there. DRM folk should know
      	better though.
      
      Some TV cards -- Much demand for 32-bit capable TV cards on NUMA machines?
      
      B43 wireless card -- again not really a NUMA thing.
      
      I cannot find a good reason to incur a performance penalty on all 64-bit NUMA
      machines in case someone throws a brain damaged TV or graphics card in there.
      This patch defaults to node-ordering on 64-bit NUMA machines. I was tempted
      to make it default everywhere but I understand that some embedded arches may
      be using 32-bit NUMA where I cannot predict the consequences.
      
      The performance impact depends on the workload and the characteristics of the
      machine and the machine I tested on had a large Normal zone on node 0 so the
      impact is within the noise for the majority of tests. The allocation stats
      show more allocation requests were from DMA32 and local node. Running SpecJBB
      with multiple JVMs and automatic NUMA balancing disabled, the results were:
      
      specjbb
                           3.17.0-rc2            3.17.0-rc2
                              vanilla        nodeorder-v1r1
      Min    1      29534.00 (  0.00%)     30020.00 (  1.65%)
      Min    10    115717.00 (  0.00%)    134038.00 ( 15.83%)
      Min    19    109718.00 (  0.00%)    114186.00 (  4.07%)
      Min    28    104459.00 (  0.00%)    103639.00 ( -0.78%)
      Min    37     98245.00 (  0.00%)    103756.00 (  5.61%)
      Min    46     97198.00 (  0.00%)     96197.00 ( -1.03%)
      Mean   1      30953.25 (  0.00%)     31917.75 (  3.12%)
      Mean   10    124432.50 (  0.00%)    140904.00 ( 13.24%)
      Mean   19    116033.50 (  0.00%)    119294.75 (  2.81%)
      Mean   28    108365.25 (  0.00%)    106879.50 ( -1.37%)
      Mean   37    102984.75 (  0.00%)    106924.25 (  3.83%)
      Mean   46    100783.25 (  0.00%)    105368.50 (  4.55%)
      Stddev 1       1260.38 (  0.00%)      1109.66 ( 11.96%)
      Stddev 10      7434.03 (  0.00%)      5171.91 ( 30.43%)
      Stddev 19      8453.84 (  0.00%)      5309.59 ( 37.19%)
      Stddev 28      4184.55 (  0.00%)      2906.63 ( 30.54%)
      Stddev 37      5409.49 (  0.00%)      3192.12 ( 40.99%)
      Stddev 46      4521.95 (  0.00%)      7392.52 (-63.48%)
      Max    1      32738.00 (  0.00%)     32719.00 ( -0.06%)
      Max    10    136039.00 (  0.00%)    148614.00 (  9.24%)
      Max    19    130566.00 (  0.00%)    127418.00 ( -2.41%)
      Max    28    115404.00 (  0.00%)    111254.00 ( -3.60%)
      Max    37    112118.00 (  0.00%)    111732.00 ( -0.34%)
      Max    46    108541.00 (  0.00%)    116849.00 (  7.65%)
      TPut   1     123813.00 (  0.00%)    127671.00 (  3.12%)
      TPut   10    497730.00 (  0.00%)    563616.00 ( 13.24%)
      TPut   19    464134.00 (  0.00%)    477179.00 (  2.81%)
      TPut   28    433461.00 (  0.00%)    427518.00 ( -1.37%)
      TPut   37    411939.00 (  0.00%)    427697.00 (  3.83%)
      TPut   46    403133.00 (  0.00%)    421474.00 (  4.55%)
      
                                  3.17.0-rc2  3.17.0-rc2
                                     vanilla nodeorder-v1r1
      DMA allocs                           0           0
      DMA32 allocs                        57     1491992
      Normal allocs                 32543566    30026383
      Movable allocs                       0           0
      Direct pages scanned                 0           0
      Kswapd pages scanned                 0           0
      Kswapd pages reclaimed               0           0
      Direct pages reclaimed               0           0
      Kswapd efficiency                 100%        100%
      Kswapd velocity                  0.000       0.000
      Direct efficiency                 100%        100%
      Direct velocity                  0.000       0.000
      Percentage direct scans             0%          0%
      Zone normal velocity             0.000       0.000
      Zone dma32 velocity              0.000       0.000
      Zone dma velocity                0.000       0.000
      THP fault alloc                  55164       52987
      THP collapse alloc                 139         147
      THP splits                          26          21
      NUMA alloc hit                 4169066     4250692
      NUMA alloc miss                      0           0
      
      Note that there were more DMA32 allocations with the patch applied.  In this
      particular case there was no difference in numa_hit and numa_miss. The
      expectation is that DMA32 was being used at the low watermark instead of
      falling into the slow path. kswapd was not woken but it's not woken for
      THP allocations.
      
      On 32-bit, this patch defaults to zone-ordering as low memory depletion
      can be a serious problem on 32-bit large memory machines. If the default
      ordering was node then processes on node 0 will deplete the Normal zone
      due to normal activity.  The problem is worse if CONFIG_HIGHPTE is not
      set. If combined with large amounts of dirty/writeback pages in Normal
      zone then there is also a high risk of OOM. The heuristics are removed
      as it's not clear they were ever important on 32-bit. They were only
      relevant for setting node-ordering on 64-bit.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3193913c
    • M
      mm: page_alloc: Make paranoid check in move_freepages a VM_BUG_ON · 97ee4ba7
      Mel Gorman committed
      Since 2.6.24 there has been a paranoid check in move_freepages that looks
      up the zone of two pages.  This is a very slow path and the only time I've
      seen this bug trigger recently is when memory initialisation was broken
      during patch development.  Despite the fact it's a slow path, this patch
      converts the check to a VM_BUG_ON anyway as it has served its purpose by
      now.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: David Rientjes <rientjes@google.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      97ee4ba7
    • X
      ocfs2: fix a deadlock while o2net_wq doing direct memory reclaim · b246d3d1
      Xue jiufei committed
      Fix a deadlock problem caused by direct memory reclaim in o2net_wq.  The
      situation is as follows:
      
      1) A connect message is received from another node, and the node queues
         a work_struct, o2net_listen_work.
      
      2) o2net_wq processes this work and calls the following functions:
      
      o2net_wq
      -> o2net_accept_one
        -> sock_create_lite
          -> sock_alloc()
            -> kmem_cache_alloc with GFP_KERNEL
              -> ____cache_alloc_node
                -> __alloc_pages_nodemask
                  -> do_try_to_free_pages
                    -> shrink_slab
                      -> evict
                        -> ocfs2_evict_inode
                          -> ocfs2_drop_lock
                            -> dlmunlock
                              -> o2net_send_message_vec
      
         then o2net_wq waits for the unlock reply from the master.
      
      3) The TCP layer receives the reply, calls o2net_data_ready() and queues
         sc_rx_work, waiting for o2net_wq to process this work.
      
      4) o2net_wq is a single-threaded workqueue, so it processes work items
         one by one.  Right now it is still doing o2net_listen_work and cannot
         handle sc_rx_work, so we deadlock.
      
      Junxiao Bi's patch "mm: clear __GFP_FS when PF_MEMALLOC_NOIO is set"
      (http://ozlabs.org/~akpm/mmots/broken-out/mm-clear-__gfp_fs-when-pf_memalloc_noio-is-set.patch)
      clears __GFP_FS in memalloc_noio_flags() in addition to __GFP_IO.  We use
      memalloc_noio_save() to set process flag PF_MEMALLOC_NOIO so that all
      allocations done by this process are done as if GFP_NOIO was specified.
      This way we do not reenter the filesystem while doing memory reclaim.
      Signed-off-by: joyce.xue <xuejiufei@huawei.com>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b246d3d1
    • J
      mm: clear __GFP_FS when PF_MEMALLOC_NOIO is set · 934f3072
      Junxiao Bi committed
      Commit 21caf2fc ("mm: teach mm by current context info to not do I/O
      during memory allocation") introduced the PF_MEMALLOC_NOIO flag to avoid
      doing I/O inside memory allocation.  __GFP_IO is cleared when this flag is
      set, but __GFP_FS implies __GFP_IO, so it should also be cleared.
      Otherwise the allocation may still run into I/O, as in the superblock
      shrinker, and the kernel can hit the deadlock described in that commit.
      
      See Dave Chinner's comment about io in superblock shrinker:
      
      Filesystem shrinkers do indeed perform IO from the superblock shrinker and
      have for years.  Even clean inodes can require IO before they can be freed
      - e.g.  on an orphan list, need truncation of post-eof blocks, need to
      wait for ordered operations to complete before it can be freed, etc.
      
      IOWs, Ext4, btrfs and XFS all can issue and/or block on arbitrary amounts
      of IO in the superblock shrinker context.  XFS, in particular, has been
      doing transactions and IO from the VFS inode cache shrinker since it was
      first introduced....
      
      Fix this by clearing __GFP_FS in memalloc_noio_flags().  This function
      masks every gfp_mask that is passed into the fs for processes that set
      PF_MEMALLOC_NOIO in the direct reclaim path.
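The masking step can be modelled in user space as below. This is a sketch under loud assumptions: the bit values are illustrative only and are not the kernel's definitions, and `noio_flags()` takes the task flags as a parameter instead of reading them from the current task.

```c
/* Illustrative bit values only; the kernel's real definitions differ. */
#define MODEL_GFP_IO		0x40u
#define MODEL_GFP_FS		0x80u
#define MODEL_PF_MEMALLOC_NOIO	0x00080000u

/*
 * Model of the fixed behaviour: when the task has the NOIO flag set,
 * strip both the IO and the FS bits from the allocation mask.
 */
static unsigned int noio_flags(unsigned int gfp, unsigned int task_flags)
{
	if (task_flags & MODEL_PF_MEMALLOC_NOIO)
		gfp &= ~(MODEL_GFP_IO | MODEL_GFP_FS);
	return gfp;
}
```

A mask with both bits set comes back with neither for a NOIO task and is returned unchanged for any other task.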
      
      v1 thread at: https://lkml.org/lkml/2014/9/3/32
      Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: joyce.xue <xuejiufei@huawei.com>
      Cc: Ming Lei <ming.lei@canonical.com>
      Cc: Trond Myklebust <trond.myklebust@primarydata.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      934f3072
    • X
      mm/compaction.c: fix warning of 'flags' may be used uninitialized · b8b2d825
      Xiubo Li committed
      C      mm/compaction.o
      mm/compaction.c: In function isolate_freepages_block:
      mm/compaction.c:364:37: warning: flags may be used uninitialized in this function [-Wmaybe-uninitialized]
             && compact_unlock_should_abort(&cc->zone->lock, flags,
                                           ^
      Signed-off-by: Xiubo Li <Li.Xiubo@freescale.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b8b2d825