1. 21 Jan 2016, 9 commits
    • mm: memcontrol: give the kmem states more descriptive names · 567e9ab2
      Committed by Johannes Weiner
      On any given memcg, the kmem accounting feature has three separate
      states: not initialized, structures allocated, and actively accounting
      slab memory.  These are represented through a combination of the
      kmem_acct_activated and kmem_acct_active flags, which is confusing.
      
      Convert to a kmem_state enum with the states NONE, ALLOCATED, and
      ONLINE.  Then rename the functions to modify the state accordingly.
      This follows the nomenclature of css object states more closely.
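
      As a rough sketch (assuming the names follow the description above), the
      resulting state enum looks something like:

      	/* sketch: per-memcg kmem accounting state */
      	enum memcg_kmem_state {
      		KMEM_NONE,	/* not initialized */
      		KMEM_ALLOCATED,	/* structures allocated */
      		KMEM_ONLINE,	/* actively accounting slab memory */
      	};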
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Tejun Heo <tj@kernel.org>
      Acked-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      567e9ab2
    • mm: memcontrol: remove double kmem page_counter init · b15aac11
      Committed by Johannes Weiner
      The kmem page_counter's limit is initialized to PAGE_COUNTER_MAX inside
      mem_cgroup_css_online().  There is no need to repeat this from
      memcg_propagate_kmem().
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Tejun Heo <tj@kernel.org>
      Acked-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b15aac11
    • mm: memcontrol: drop unused @css argument in memcg_init_kmem · 6d378dac
      Committed by Johannes Weiner
      This series adds accounting of the historical "kmem" memory consumers to
      the cgroup2 memory controller.
      
      These consumers include the dentry cache, the inode cache, kernel stack
      pages, and a few others that are pointed out in patch 7/8.  The
      footprint of these consumers is directly tied to userspace activity in
      common workloads, and so they have to be part of the minimally viable
      configuration in order to present a complete feature to our users.
      
      The cgroup2 interface of the memory controller is far from complete, but
      this series, along with the socket memory accounting series, provides
      the final semantic changes for the existing memory knobs in the cgroup2
      interface, which is scheduled for initial release in the next merge
      window.
      
      This patch (of 8):
      
      Remove the unused css argument from memcg_init_kmem().
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Tejun Heo <tj@kernel.org>
      Acked-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6d378dac
    • proc read mm's {arg,env}_{start,end} with mmap semaphore taken. · a3b609ef
      Committed by Mateusz Guzik
      Only functions doing more than one read are modified.  Consumers
      happened to deal with possibly changing data, but it does not seem like
      a good thing to rely on.
      Signed-off-by: Mateusz Guzik <mguzik@redhat.com>
      Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Jarod Wilson <jarod@redhat.com>
      Cc: Jan Stancek <jstancek@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Anshuman Khandual <anshuman.linux@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a3b609ef
    • UBSAN: run-time undefined behavior sanity checker · c6d30853
      Committed by Andrey Ryabinin
      UBSAN uses compile-time instrumentation to catch undefined behavior
      (UB).  The compiler inserts code that performs certain kinds of checks
      before operations that could cause UB.  If a check fails (i.e. UB is
      detected), a __ubsan_handle_* function is called to print an error
      message.
      
      So most of the work is done by the compiler.  This patch just implements
      the ubsan handlers that print the errors.
      
      GCC has had this capability since 4.9.x [1] (see the -fsanitize=undefined
      option and its suboptions).
      However, GCC 5.x has more checkers implemented [2].
      Article [3] has a bit more detail about UBSAN in GCC.
      
      [1] - https://gcc.gnu.org/onlinedocs/gcc-4.9.0/gcc/Debugging-Options.html
      [2] - https://gcc.gnu.org/onlinedocs/gcc/Debugging-Options.html
      [3] - http://developerblog.redhat.com/2014/10/16/gcc-undefined-behavior-sanitizer-ubsan/
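
      As an illustration, a hypothetical userspace-style snippet showing the kind
      of undefined shift such a checker reports (not taken from the kernel):

      	#include <stdio.h>

      	int main(void)
      	{
      		unsigned int x = 1;
      		int shift = 32;

      		/*
      		 * Shifting a 32-bit value by 32 is undefined behavior; when built
      		 * with -fsanitize=undefined the compiler inserts a check here and
      		 * __ubsan_handle_shift_out_of_bounds() reports it at run time.
      		 */
      		printf("%u\n", x << shift);
      		return 0;
      	}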
      
      Issues which UBSAN has found thus far are:
      
      Found bugs:
      
       * out-of-bounds access - 97840cb6 ("netfilter: nfnetlink: fix
         insufficient validation in nfnetlink_bind")
      
      undefined shifts:
      
       * d48458d4 ("jbd2: use a better hash function for the revoke
         table")
      
       * 10632008 ("clockevents: Prevent shift out of bounds")
      
       * 'x << -1' shift in ext4 -
         http://lkml.kernel.org/r/<5444EF21.8020501@samsung.com>
      
       * undefined rol32(0) -
         http://lkml.kernel.org/r/<1449198241-20654-1-git-send-email-sasha.levin@oracle.com>
      
       * undefined dirty_ratelimit calculation -
         http://lkml.kernel.org/r/<566594E2.3050306@odin.com>
      
       * undefined rounddown_pow_of_two(0) -
         http://lkml.kernel.org/r/<1449156616-11474-1-git-send-email-sasha.levin@oracle.com>
      
       * [WONTFIX] undefined shift in __bpf_prog_run -
         http://lkml.kernel.org/r/<CACT4Y+ZxoR3UjLgcNdUm4fECLMx2VdtfrENMtRRCdgHB2n0bJA@mail.gmail.com>
      
         WONTFIX here because it should be fixed in bpf program, not in kernel.
      
      signed overflows:
      
       * 32a8df4e ("sched: Fix odd values in effective_load()
         calculations")
      
       * mul overflow in ntp -
         http://lkml.kernel.org/r/<1449175608-1146-1-git-send-email-sasha.levin@oracle.com>
      
       * incorrect conversion into rtc_time in rtc_time64_to_tm() -
         http://lkml.kernel.org/r/<1449187944-11730-1-git-send-email-sasha.levin@oracle.com>
      
       * unvalidated timespec in io_getevents() -
         http://lkml.kernel.org/r/<CACT4Y+bBxVYLQ6LtOKrKtnLthqLHcw-BMp3aqP3mjdAvr9FULQ@mail.gmail.com>
      
       * [NOTABUG] signed overflow in ktime_add_safe() -
         http://lkml.kernel.org/r/<CACT4Y+aJ4muRnWxsUe1CMnA6P8nooO33kwG-c8YZg=0Xc8rJqw@mail.gmail.com>
      
      [akpm@linux-foundation.org: fix unused local warning]
      [akpm@linux-foundation.org: fix __int128 build woes]
      Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Michal Marek <mmarek@suse.cz>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Yury Gribov <y.gribov@samsung.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Kostya Serebryany <kcc@google.com>
      Cc: Johannes Berg <johannes@sipsolutions.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c6d30853
    • ptrace: use fsuid, fsgid, effective creds for fs access checks · caaee623
      Committed by Jann Horn
      By checking the effective credentials instead of the real UID / permitted
      capabilities, ensure that the calling process actually intended to use its
      credentials.
      
      To ensure that all ptrace checks use the correct caller credentials (e.g.
      in case out-of-tree code or newly added code omits the PTRACE_MODE_*CREDS
      flag), use two new flags and require one of them to be set.
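
      As a hedged sketch of what a procfs-style caller looks like with the
      combined constants described above (assumed usage, not a literal hunk):

      	/* sketch: check the caller's fs creds, as /proc readers should */
      	if (!ptrace_may_access(task, PTRACE_MODE_READ_FSCREDS))
      		return -EACCES;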
      
      The problem was that when a privileged task had temporarily dropped its
      privileges, e.g.  by calling setreuid(0, user_uid), with the intent to
      perform following syscalls with the credentials of a user, it still passed
      ptrace access checks that the user would not be able to pass.
      
      While an attacker should not be able to convince the privileged task to
      perform a ptrace() syscall, this is a problem because the ptrace access
      check is reused for things in procfs.
      
      In particular, the following somewhat interesting procfs entries only rely
      on ptrace access checks:
      
       /proc/$pid/stat - uses the check for determining whether pointers
           should be visible, useful for bypassing ASLR
       /proc/$pid/maps - also useful for bypassing ASLR
       /proc/$pid/cwd - useful for gaining access to restricted
           directories that contain files with lax permissions, e.g. in
           this scenario:
           lrwxrwxrwx root root /proc/13020/cwd -> /root/foobar
           drwx------ root root /root
           drwxr-xr-x root root /root/foobar
           -rw-r--r-- root root /root/foobar/secret
      
      Therefore, on a system where a root-owned mode 6755 binary changes its
      effective credentials as described and then dumps a user-specified file,
      this could be used by an attacker to reveal the memory layout of root's
      processes or reveal the contents of files he is not allowed to access
      (through /proc/$pid/cwd).
      
      [akpm@linux-foundation.org: fix warning]
      Signed-off-by: Jann Horn <jann@thejh.net>
      Acked-by: Kees Cook <keescook@chromium.org>
      Cc: Casey Schaufler <casey@schaufler-ca.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Morris <james.l.morris@oracle.com>
      Cc: "Serge E. Hallyn" <serge.hallyn@ubuntu.com>
      Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Willy Tarreau <w@1wt.eu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      caaee623
    • zsmalloc: fix migrate_zspage-zs_free race condition · c102f07c
      Committed by Junil Lee
      record_obj() in migrate_zspage() does not preserve the handle's
      HANDLE_PIN_BIT, set by find_alloced_obj()->trypin_tag(), and implicitly
      (accidentally) un-pins the handle, while migrate_zspage() still performs
      an explicit unpin_tag() on that handle.  This additional explicit
      unpin_tag() introduces a race condition with zs_free(), which can have
      pinned that handle by this time, so the handle ends up un-pinned.
      
      Schematically, it goes like this:
      
        CPU0                                        CPU1
        migrate_zspage
          find_alloced_obj
            trypin_tag
              set HANDLE_PIN_BIT                    zs_free()
                                                      pin_tag()
        obj_malloc() -- new object, no tag
        record_obj() -- remove HANDLE_PIN_BIT           set HANDLE_PIN_BIT
        unpin_tag()  -- remove zs_free's HANDLE_PIN_BIT
      
      The race condition may result in a NULL pointer dereference:
      
        Unable to handle kernel NULL pointer dereference at virtual address 00000000
        CPU: 0 PID: 19001 Comm: CookieMonsterCl Tainted:
        PC is at get_zspage_mapping+0x0/0x24
        LR is at obj_free.isra.22+0x64/0x128
        Call trace:
           get_zspage_mapping+0x0/0x24
           zs_free+0x88/0x114
           zram_free_page+0x64/0xcc
           zram_slot_free_notify+0x90/0x108
           swap_entry_free+0x278/0x294
           free_swap_and_cache+0x38/0x11c
           unmap_single_vma+0x480/0x5c8
           unmap_vmas+0x44/0x60
           exit_mmap+0x50/0x110
           mmput+0x58/0xe0
           do_exit+0x320/0x8dc
           do_group_exit+0x44/0xa8
           get_signal+0x538/0x580
           do_signal+0x98/0x4b8
           do_notify_resume+0x14/0x5c
      
      This patch keeps the lock bit set in the migration path and updates the
      value atomically.
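
      Roughly, the migration path now publishes the new location with the pin
      bit still set (a sketch of the idea, not the exact hunk):

      	/* sketch: keep HANDLE_PIN_BIT while recording the new object */
      	free_obj |= BIT(HANDLE_PIN_BIT);
      	record_obj(handle, free_obj);	/* store the handle value atomically */
      	unpin_tag(handle);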
      Signed-off-by: Junil Lee <junil0814.lee@lge.com>
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
      Cc: <stable@vger.kernel.org> [4.1+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c102f07c
    • thp: fix interrupt unsafe locking in split_huge_page() · 0b9b6fff
      Committed by Kirill A. Shutemov
      split_queue_lock can be taken from interrupt context in some cases, but
      I forgot to convert locking in split_huge_page() to interrupt-safe
      primitives.
      
      Let's fix this.
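
      The conversion is the usual spinlock one, roughly (sketch):

      	unsigned long flags;

      	/* sketch: take split_queue_lock with interrupts disabled */
      	spin_lock_irqsave(&split_queue_lock, flags);
      	list_del_init(page_deferred_list(page));
      	spin_unlock_irqrestore(&split_queue_lock, flags);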
      
      lockdep output:
      
        ======================================================
        [ INFO: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected ]
        4.4.0+ #259 Tainted: G        W
        ------------------------------------------------------
        syz-executor/18183 [HC0[0]:SC0[2]:HE0:SE0] is trying to acquire:
         (split_queue_lock){+.+...}, at: free_transhuge_page+0x24/0x90 mm/huge_memory.c:3436
      
        and this task is already holding:
         (slock-AF_INET){+.-...}, at: spin_lock_bh include/linux/spinlock.h:307
         (slock-AF_INET){+.-...}, at: lock_sock_fast+0x45/0x120 net/core/sock.c:2462
        which would create a new lock dependency:
         (slock-AF_INET){+.-...} -> (split_queue_lock){+.+...}
      
        but this new dependency connects a SOFTIRQ-irq-safe lock:
         (slock-AF_INET){+.-...}
        ... which became SOFTIRQ-irq-safe at:
           mark_irqflags kernel/locking/lockdep.c:2799
           __lock_acquire+0xfd8/0x4700 kernel/locking/lockdep.c:3162
           lock_acquire+0x1dc/0x430 kernel/locking/lockdep.c:3585
           __raw_spin_lock include/linux/spinlock_api_smp.h:144
           _raw_spin_lock+0x33/0x50 kernel/locking/spinlock.c:151
           spin_lock include/linux/spinlock.h:302
           udp_queue_rcv_skb+0x781/0x1550 net/ipv4/udp.c:1680
           flush_stack+0x50/0x330 net/ipv6/udp.c:799
           __udp4_lib_mcast_deliver+0x694/0x7f0 net/ipv4/udp.c:1798
           __udp4_lib_rcv+0x17dc/0x23e0 net/ipv4/udp.c:1888
           udp_rcv+0x21/0x30 net/ipv4/udp.c:2108
           ip_local_deliver_finish+0x2b3/0xa50 net/ipv4/ip_input.c:216
           NF_HOOK_THRESH include/linux/netfilter.h:226
           NF_HOOK include/linux/netfilter.h:249
           ip_local_deliver+0x1c4/0x2f0 net/ipv4/ip_input.c:257
           dst_input include/net/dst.h:498
           ip_rcv_finish+0x5ec/0x1730 net/ipv4/ip_input.c:365
           NF_HOOK_THRESH include/linux/netfilter.h:226
           NF_HOOK include/linux/netfilter.h:249
           ip_rcv+0x963/0x1080 net/ipv4/ip_input.c:455
           __netif_receive_skb_core+0x1620/0x2f80 net/core/dev.c:4154
           __netif_receive_skb+0x2a/0x160 net/core/dev.c:4189
           netif_receive_skb_internal+0x1b5/0x390 net/core/dev.c:4217
           napi_skb_finish net/core/dev.c:4542
           napi_gro_receive+0x2bd/0x3c0 net/core/dev.c:4572
           e1000_clean_rx_irq+0x4e2/0x1100 drivers/net/ethernet/intel/e1000e/netdev.c:1038
           e1000_clean+0xa08/0x24a0 drivers/net/ethernet/intel/e1000/e1000_main.c:3819
           napi_poll net/core/dev.c:5074
           net_rx_action+0x7eb/0xdf0 net/core/dev.c:5139
           __do_softirq+0x26a/0x920 kernel/softirq.c:273
           invoke_softirq kernel/softirq.c:350
           irq_exit+0x18f/0x1d0 kernel/softirq.c:391
           exiting_irq ./arch/x86/include/asm/apic.h:659
           do_IRQ+0x86/0x1a0 arch/x86/kernel/irq.c:252
           ret_from_intr+0x0/0x20 arch/x86/entry/entry_64.S:520
           arch_safe_halt ./arch/x86/include/asm/paravirt.h:117
           default_idle+0x52/0x2e0 arch/x86/kernel/process.c:304
           arch_cpu_idle+0xa/0x10 arch/x86/kernel/process.c:295
           default_idle_call+0x48/0xa0 kernel/sched/idle.c:92
           cpuidle_idle_call kernel/sched/idle.c:156
           cpu_idle_loop kernel/sched/idle.c:252
           cpu_startup_entry+0x554/0x710 kernel/sched/idle.c:300
           rest_init+0x192/0x1a0 init/main.c:412
           start_kernel+0x678/0x69e init/main.c:683
           x86_64_start_reservations+0x2a/0x2c arch/x86/kernel/head64.c:195
           x86_64_start_kernel+0x158/0x167 arch/x86/kernel/head64.c:184
      
        to a SOFTIRQ-irq-unsafe lock:
         (split_queue_lock){+.+...}
         which became SOFTIRQ-irq-unsafe at:
           mark_irqflags kernel/locking/lockdep.c:2817
           __lock_acquire+0x146e/0x4700 kernel/locking/lockdep.c:3162
           lock_acquire+0x1dc/0x430 kernel/locking/lockdep.c:3585
           __raw_spin_lock include/linux/spinlock_api_smp.h:144
           _raw_spin_lock+0x33/0x50 kernel/locking/spinlock.c:151
           spin_lock include/linux/spinlock.h:302
           split_huge_page_to_list+0xcc0/0x1c50 mm/huge_memory.c:3399
           split_huge_page include/linux/huge_mm.h:99
           queue_pages_pte_range+0xa38/0xef0 mm/mempolicy.c:507
           walk_pmd_range mm/pagewalk.c:50
           walk_pud_range mm/pagewalk.c:90
           walk_pgd_range mm/pagewalk.c:116
           __walk_page_range+0x653/0xcd0 mm/pagewalk.c:204
           walk_page_range+0xfe/0x2b0 mm/pagewalk.c:281
           queue_pages_range+0xfb/0x130 mm/mempolicy.c:687
           migrate_to_node mm/mempolicy.c:1004
           do_migrate_pages+0x370/0x4e0 mm/mempolicy.c:1109
           SYSC_migrate_pages mm/mempolicy.c:1453
           SyS_migrate_pages+0x640/0x730 mm/mempolicy.c:1374
           entry_SYSCALL_64_fastpath+0x16/0x7a arch/x86/entry/entry_64.S:185
      
        other info that might help us debug this:
      
         Possible interrupt unsafe locking scenario:
      
               CPU0                    CPU1
               ----                    ----
          lock(split_queue_lock);
                                       local_irq_disable();
                                       lock(slock-AF_INET);
                                       lock(split_queue_lock);
          <Interrupt>
            lock(slock-AF_INET);
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0b9b6fff
    • mm: avoid uninitialized variable in tracepoint · 629d9d1c
      Committed by Arnd Bergmann
      A newly added tracepoint in the hugepage code uses a variable in the
      error handling that is not initialized at that point:
      
      include/trace/events/huge_memory.h:81:230: error: 'isolated' may be used uninitialized in this function [-Werror=maybe-uninitialized]
      
      The result is relatively harmless, as the trace data will in rare
      cases contain incorrect data.
      
      This works around the problem by adding an explicit initialization.
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Fixes: 7d2eba05 ("mm: add tracepoint for scanning pages")
      Reviewed-by: Ebru Akagunduz <ebru.akagunduz@gmail.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      629d9d1c
  2. 18 Jan 2016, 1 commit
  3. 16 Jan 2016, 30 commits
    • memcg: only free spare array when readers are done · 6611d8d7
      Committed by Martijn Coenen
      A spare array holding mem cgroup threshold events is kept around to make
      sure we can always safely deregister an event and have an array to store
      the new set of events in.
      
      In the scenario where we're going from 1 to 0 registered events, the
      pointer to the primary array containing 1 event is copied to the spare
      slot, and then the spare slot is freed because no events are left.
      However, it is freed before calling synchronize_rcu(), which means
      readers may still be accessing threshold->primary after it is freed.
      
      Fixed by only freeing after synchronize_rcu().
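
      In other words, the unregister path should release the old spare only after
      the grace period (a sketch of the intended ordering):

      	/* sketch: publish the new (possibly empty) primary array first */
      	rcu_assign_pointer(thresholds->primary, new);
      	/* wait for pre-existing readers before freeing the old storage */
      	synchronize_rcu();
      	kfree(thresholds->spare);
      	thresholds->spare = NULL;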
      Signed-off-by: Martijn Coenen <maco@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6611d8d7
    • mm: soft-offline: exit with failure for non anonymous thp · 98fd1ef4
      Committed by Naoya Horiguchi
      Currently memory_failure() doesn't handle the non-anonymous THP case,
      because we can hardly expect the error handling to be successful, and it
      can just hit some corner case which results in a BUG_ON or something
      severe like that.  This is also true for the soft offline code, so let's
      make it behave the same way.
      
      The original code has a MF_COUNT_INCREASED check before put_hwpoison_page(),
      but it's unnecessary because get_any_page() has already been called by the
      time we get here, and it takes a refcount of the target page regardless
      of the flag.  So this patch also removes it.
      
      [akpm@linux-foundation.org: fix build]
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      98fd1ef4
    • mm: soft-offline: clean up soft_offline_page() · acc14dc4
      Committed by Naoya Horiguchi
      soft_offline_page() has some deeply indented code, which is a sign that
      it needs cleanup.  So let's do that.  No functional change.
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      acc14dc4
    • mm: make swapoff more robust against soft dirty · 9f8bdb3f
      Committed by Hugh Dickins
      Both s390 and powerpc have hit the issue of swapoff hanging, when
      CONFIG_HAVE_ARCH_SOFT_DIRTY and CONFIG_MEM_SOFT_DIRTY ifdefs were not
      quite as x86_64 had them.  I think it would be much clearer if
      HAVE_ARCH_SOFT_DIRTY was just a Kconfig option set by architectures to
      determine whether the MEM_SOFT_DIRTY option should be offered, and the
      actual code depend upon CONFIG_MEM_SOFT_DIRTY alone.
      
      But I won't embark on that change myself: instead, make swapoff more
      robust by using pte_swp_clear_soft_dirty() on each pte it encounters,
      without an explicit #ifdef CONFIG_MEM_SOFT_DIRTY.  That is a no-op,
      whether the bit in question is defined as 0 or the asm-generic fallback
      is used, unless soft dirty is fully turned on.
      
      Why "maybe" in maybe_same_pte()?  Rename it pte_same_as_swp().
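
      A sketch of the renamed helper, per the description above (assumed form):

      	/*
      	 * Sketch: compare a pte against a swap entry, ignoring soft dirty.
      	 * pte_swp_clear_soft_dirty() is a no-op unless soft dirty is enabled.
      	 */
      	static inline int pte_same_as_swp(pte_t pte, pte_t swp_pte)
      	{
      		return pte_same(pte_swp_clear_soft_dirty(pte), swp_pte);
      	}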
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9f8bdb3f
    • mm: fix locking order in mm_take_all_locks() · 88f306b6
      Committed by Kirill A. Shutemov
      Dmitry Vyukov has reported[1] possible deadlock (triggered by his
      syzkaller fuzzer):
      
       Possible unsafe locking scenario:
      
             CPU0                    CPU1
             ----                    ----
        lock(&hugetlbfs_i_mmap_rwsem_key);
                                     lock(&mapping->i_mmap_rwsem);
                                     lock(&hugetlbfs_i_mmap_rwsem_key);
        lock(&mapping->i_mmap_rwsem);
      
      Both traces point to mm_take_all_locks() as the source of the problem.
      It doesn't take care of the ordering of hugetlbfs_i_mmap_rwsem_key (aka
      mapping->i_mmap_rwsem for hugetlb mappings) vs. i_mmap_rwsem.
      
      huge_pmd_share() does memory allocation under hugetlbfs_i_mmap_rwsem_key,
      and the allocator can take i_mmap_rwsem if it hits reclaim.  So we need to
      take i_mmap_rwsem from all hugetlb VMAs before taking i_mmap_rwsem from the
      rest of the VMAs.
      
      The patch also documents locking order for hugetlbfs_i_mmap_rwsem_key.
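
      Schematically, the locking loop becomes two passes over the VMA list,
      hugetlb mappings first (a sketch of the ordering, not the literal diff):

      	/* sketch: i_mmap_rwsem of hugetlb VMAs first ... */
      	for (vma = mm->mmap; vma; vma = vma->vm_next)
      		if (vma->vm_file && vma->vm_file->f_mapping &&
      		    is_vm_hugetlb_page(vma))
      			vm_lock_mapping(mm, vma->vm_file->f_mapping);

      	/* ... then i_mmap_rwsem of the remaining file-backed VMAs */
      	for (vma = mm->mmap; vma; vma = vma->vm_next)
      		if (vma->vm_file && vma->vm_file->f_mapping &&
      		    !is_vm_hugetlb_page(vma))
      			vm_lock_mapping(mm, vma->vm_file->f_mapping);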
      
      [1] http://lkml.kernel.org/r/CACT4Y+Zu95tBs-0EvdiAKzUOsb4tczRRfCRTpLr4bg_OP9HuVg@mail.gmail.com
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Reviewed-by: Michal Hocko <mhocko@suse.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      88f306b6
    • mm: mempolicy: skip non-migratable VMAs when setting MPOL_MF_LAZY · d645fc0e
      Committed by Liang Chen
      MPOL_MF_LAZY is not visible from userspace since a720094d ("mm:
      mempolicy: Hide MPOL_NOOP and MPOL_MF_LAZY from userspace for now"), but
      it should still skip non-migratable VMAs such as VM_IO, VM_PFNMAP, and
      VM_HUGETLB VMAs, and avoid useless overhead of minor faults.
      Signed-off-by: Liang Chen <liangchen.linux@gmail.com>
      Signed-off-by: Gavin Guo <gavin.guo@canonical.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d645fc0e
    • mm/page_alloc.c: remove unused struct zone *z variable · f16f091b
      Committed by Alexander Kuleshov
      Remove unused struct zone *z variable which appeared in 86051ca5
      ("mm: fix usemap initialization").
      Signed-off-by: Alexander Kuleshov <kuleshovmail@gmail.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f16f091b
    • mm/mlock.c: change can_do_mlock return value type to boolean · 7f43add4
      Committed by Wang Xiaoqiang
      Since can_do_mlock() only returns 1 or 0, make it boolean.
      
      No functional change.
      
      [akpm@linux-foundation.org: update declaration in mm.h]
      Signed-off-by: Wang Xiaoqiang <wangxq10@lzu.edu.cn>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7f43add4
    • mm/vmalloc.c: use macro IS_ALIGNED to judge the aligment · 61e16557
      Committed by Wang Xiaoqiang
      Just cleanup, no functional change.
      Signed-off-by: Wang Xiaoqiang <wangxq10@lzu.edu.cn>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      61e16557
    • cgroup, memcg, writeback: drop spurious rcu locking around mem_cgroup_css_from_page() · 654a0dd0
      Committed by Tejun Heo
      In earlier versions, mem_cgroup_css_from_page() could return non-root
      css on a legacy hierarchy which can go away and required rcu locking;
      however, the eventual version simply returns the root cgroup if memcg is
      on a legacy hierarchy and thus doesn't need rcu locking around or in it.
      Remove spurious rcu lockings.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      654a0dd0
    • mm/page_isolation: do some cleanup in "undo_isolate_page_range" · 6f8d2b8a
      Committed by Wang Xiaoqiang
      Use IS_ALIGNED to check the alignment, rather than open-coding the check.
      Signed-off-by: Wang Xiaoqiang <wang_xiaoq@126.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6f8d2b8a
    • mm: bring in additional flag for fixup_user_fault to signal unlock · 4a9e1cda
      Committed by Dominik Dingel
      During Jason's work with postcopy migration support for s390 a problem
      regarding gmap faults was discovered.
      
      The gmap code will call fixup_user_fault, which always ends up in
      handle_mm_fault.  Until now we never cared about retries, but since the
      userfaultfd code kind of relies on them, this needs a fix.
      
      This patchset does not take care of the futex code.  I will now look
      closer at this.
      
      This patch (of 2):
      
      With the introduction of userfaultfd, kvm on s390 needs fixup_user_fault
      to pass in FAULT_FLAG_ALLOW_RETRY and give feedback if during the
      faulting we ever unlocked mmap_sem.
      
      This patch brings in the logic to handle retries, and it also cleans up
      the current documentation.  fixup_user_fault did not have the same
      semantics as filemap_fault: it never indicated whether a retry happened,
      so a caller wasn't able to handle that case.  We now change the
      behaviour to always retry a locked mmap_sem.
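
      The resulting interface, roughly (assumed shape of the extended prototype):

      	/*
      	 * Sketch: fixup_user_fault() now takes fault_flags (e.g.
      	 * FAULT_FLAG_ALLOW_RETRY) and reports through *unlocked whether
      	 * mmap_sem was dropped and re-taken while handling the fault.
      	 */
      	extern int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
      				    unsigned long address, unsigned int fault_flags,
      				    bool *unlocked);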
      Signed-off-by: Dominik Dingel <dingel@linux.vnet.ibm.com>
      Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: "Jason J. Herne" <jjherne@linux.vnet.ibm.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Eric B Munson <emunson@akamai.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Dominik Dingel <dingel@linux.vnet.ibm.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4a9e1cda
    • mm, x86: get_user_pages() for dax mappings · 3565fce3
      Committed by Dan Williams
      A dax mapping establishes a pte with _PAGE_DEVMAP set when the driver
      has established a devm_memremap_pages() mapping, i.e.  when the pfn_t
      returned from ->direct_access() has PFN_DEV and PFN_MAP set.  Later, when
      encountering _PAGE_DEVMAP during a page table walk, we look up and pin a
      struct dev_pagemap instance to keep the result of pfn_to_page() valid
      until put_page().
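
      In gup terms this amounts to something like the following (sketch of the
      idea, simplified error handling):

      	struct dev_pagemap *pgmap = NULL;

      	/* sketch: pin the pagemap so pfn_to_page() stays valid until put_page() */
      	if (pte_devmap(pte)) {
      		pgmap = get_dev_pagemap(pte_pfn(pte), NULL);
      		if (!pgmap)
      			return NULL;	/* the device memory went away */
      	}
      	page = pte_page(pte);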
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Tested-by: Logan Gunthorpe <logang@deltatee.com>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3565fce3
    • mm, dax: dax-pmd vs thp-pmd vs hugetlbfs-pmd · 5c7fb56e
      Committed by Dan Williams
      A dax-huge-page mapping, while it uses some thp helpers, is ultimately
      not a transparent huge page.  The distinction is especially important in
      the get_user_pages() path.  pmd_devmap() is used to distinguish dax-pmds
      from pmd_huge() and pmd_trans_huge(), which have slightly different
      semantics.
      
      Explicitly mark the pmd_trans_huge() helpers that dax needs by adding
      pmd_devmap() checks.
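
      Concretely, call sites that used to test only pmd_trans_huge() grow a
      pmd_devmap() check, roughly:

      	/* sketch: treat dax pmds as huge without treating them as THP */
      	if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
      		ptl = pmd_lock(mm, pmd);
      		/* huge-pmd handling; splitting only applies to real THP */
      	}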
      
      [kirill.shutemov@linux.intel.com: fix regression in handling mlocked pages in  __split_huge_pmd()]
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5c7fb56e
    • mm, dax: convert vmf_insert_pfn_pmd() to pfn_t · f25748e3
      Committed by Dan Williams
      Similar to the conversion of vm_insert_mixed(), use pfn_t in
      vmf_insert_pfn_pmd() to tag the resulting pte with _PAGE_DEVMAP when the
      pfn is backed by a devm_memremap_pages() mapping.
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f25748e3
    • mm, dax, gpu: convert vm_insert_mixed to pfn_t · 01c8f1c4
      Committed by Dan Williams
      Convert the raw unsigned long 'pfn' argument to pfn_t for the purpose of
      evaluating the PFN_MAP and PFN_DEV flags.  When both are set it triggers
      _PAGE_DEVMAP to be set in the resulting pte.
      
      There are no functional changes to the gpu drivers as a result of this
      conversion.
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: David Airlie <airlied@linux.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      01c8f1c4
    • x86, mm: introduce vmem_altmap to augment vmemmap_populate() · 4b94ffdc
      Committed by Dan Williams
      In support of providing struct page for large persistent memory
      capacities, use struct vmem_altmap to change the default policy for
      allocating memory for the memmap array.  The default vmemmap_populate()
      allocates page table storage area from the page allocator.  Given
      persistent memory capacities relative to DRAM it may not be feasible to
      store the memmap in 'System Memory'.  Instead vmem_altmap represents
      pre-allocated "device pages" to satisfy vmemmap_alloc_block_buf()
      requests.
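
      The helper structure, roughly as described (sketch):

      	/*
      	 * Sketch: a pre-allocated pfn range that vmemmap_populate() can carve
      	 * the memmap from instead of using the page allocator.
      	 */
      	struct vmem_altmap {
      		const unsigned long base_pfn;	/* first pfn of the device range */
      		const unsigned long reserve;	/* pfns reserved at the start */
      		unsigned long free;		/* pfns available for the memmap */
      		unsigned long align;
      		unsigned long alloc;		/* pfns handed out so far */
      	};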
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Reported-by: kbuild test robot <lkp@intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4b94ffdc
    • mm, dax: fix livelock, allow dax pmd mappings to become writeable · 01871e59
      Committed by Ross Zwisler
      Prior to this change DAX PMD mappings that were made read-only were
      never able to be made writable again.  This is because the code in
      insert_pfn_pmd() that calls pmd_mkdirty() and pmd_mkwrite() would skip
      these calls if the PMD already existed in the page table.
      
      Instead, if we are doing a write always mark the PMD entry as dirty and
      writeable.  Without this code we can get into a condition where we mark
      the PMD as read-only, and then on a subsequent write fault we get into
      an infinite loop of PMD faults where we try unsuccessfully to make the
      PMD writeable.
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Reported-by: Jeff Moyer <jmoyer@redhat.com>
      Reported-by: Toshi Kani <toshi.kani@hpe.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      01871e59
    • thp: fix split_huge_page() after mremap() of THP · bd56086f
      Committed by Kirill A. Shutemov
      Sasha Levin has reported a KASAN out-of-bounds bug[1].  It points to "if
      (!is_swap_pte(pte[i]))" in unfreeze_page_vma() as the problematic access.
      
      The cause is that split_huge_page() doesn't handle a THP correctly if it
      is not aligned to a PMD boundary.  This can happen after mremap().
      
      Test-case (not always triggers the bug):
      
      	#define _GNU_SOURCE
      	#include <stdio.h>
      	#include <stdlib.h>
      	#include <sys/mman.h>
      
      	#define MB (1024UL*1024)
      	#define SIZE (2*MB)
      	#define BASE ((void *)0x400000000000)
      
      	int main()
      	{
      		char *p;
      
      		p = mmap(BASE, SIZE, PROT_READ | PROT_WRITE,
      				MAP_FIXED | MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE,
      				-1, 0);
      		if (p == MAP_FAILED)
      			perror("mmap"), exit(1);
      		p = mremap(BASE, SIZE, SIZE, MREMAP_FIXED | MREMAP_MAYMOVE,
      				BASE + SIZE + 8192);
      		if (p == MAP_FAILED)
      			perror("mremap"), exit(1);
      		system("echo 1 > /sys/kernel/debug/split_huge_pages");
      		return 0;
      	}
      
      The patch fixes freeze and unfreeze paths to handle page table boundary
      crossing.
      
      It also makes mapcount vs count check in split_huge_page_to_list()
      stricter:
       - after freeze we don't expect any subpage mapped as we remove them
         from rmap when setting up migration entries;
       - count must be 1, meaning only caller has reference to the page;
      
      [1] https://gist.github.com/sashalevin/c67fbea55e7c0576972a
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reported-by: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bd56086f
    • mm/huge_memory.c: don't split THP page when MADV_FREE syscall is called · b8d3c4c3
      Committed by Minchan Kim
      We don't need to split a THP page when the MADV_FREE syscall is called if
      [start, len] is aligned with the THP size.  The split can be done when the
      VM decides to free it in the reclaim path if memory pressure is heavy.
      With that, we avoid an unnecessary THP split.
      
      For this feature, the patch changes the pte dirtiness marking logic of THP.
      Currently, splitting marks every pte of the page dirty unconditionally,
      which makes MADV_FREE void.  So, instead, this patch propagates pmd
      dirtiness to all pages via PG_dirty and restores pte dirtiness from
      PG_dirty.  With this, if the pmd is clean (i.e., MADV_FREEed) when the split
      happens (e.g., in shrink_page_list), all of the pages are clean too, so we
      can discard them.
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: <yalin.wang2010@gmail.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chen Gang <gang.chen.5i5j@gmail.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Daniel Micay <danielmicay@gmail.com>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Jason Evans <je@fb.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mika Penttil <mika.penttila@nextfour.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Roland Dreier <roland@kernel.org>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b8d3c4c3
    • mm/ksm.c: mark stable page dirty · 337ed7eb
      Committed by Minchan Kim
      The MADV_FREE patchset changes page reclaim to simply free a clean
      anonymous page with no dirty ptes, instead of swapping it out; but KSM
      uses clean write-protected ptes to reference the stable ksm page.  So be
      sure to mark that page dirty, so it's never mistakenly discarded.
      
      [hughd@google.com: adjusted comments]
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: <yalin.wang2010@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chen Gang <gang.chen.5i5j@gmail.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Daniel Micay <danielmicay@gmail.com>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Jason Evans <je@fb.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mika Penttil <mika.penttila@nextfour.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Roland Dreier <roland@kernel.org>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      337ed7eb
    • mm: move lazily freed pages to inactive list · 10853a03
      Committed by Minchan Kim
      MADV_FREE is a hint that it's okay to discard pages if there is memory
      pressure, and we use reclaimers (i.e., kswapd and direct reclaim) to free
      them, so there is no value in keeping them on the active anonymous LRU;
      this patch moves them to the head of the inactive LRU list.
      
      This means that MADV_FREE-ed pages which were living on the inactive
      list are reclaimed first because they are more likely to be cold rather
      than recently active pages.
      
      An arguable issue with the approach is whether we should put the page at
      the head or the tail of the inactive list.  I chose the head because the
      kernel cannot make sure it's really cold or warm for every MADV_FREE
      usecase, but at least we know it's not *hot*, so landing at the inactive
      head is a compromise for various usecases.
      
      This fixes the suboptimal behavior of MADV_FREE where pages living on the
      active list would sit there for a long time, even under memory pressure,
      while the inactive list was reclaimed heavily.  That basically broke the
      whole purpose of using MADV_FREE to help the system free memory which
      might not be used.
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Hugh Dickins <hughd@google.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: <yalin.wang2010@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chen Gang <gang.chen.5i5j@gmail.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Daniel Micay <danielmicay@gmail.com>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Jason Evans <je@fb.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Mika Penttil <mika.penttila@nextfour.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Roland Dreier <roland@kernel.org>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      10853a03
    • mm/madvise.c: free swp_entry in madvise_free · 64b42bc1
      Committed by Minchan Kim
      When I test the below piece of code with 12 processes (i.e., 512M * 12 = 6G
      consumed) on my machine (3G ram + 12 cpu + 8G swap), madvise_free is
      significantly slower (i.e., about 2x) than madvise_dontneed.
      
           loop = 5;
           mmap(512M);
           while (loop--) {
                   memset(512M);
                   madvise(MADV_FREE or MADV_DONTNEED);
           }
      
      The reason is lots of swapin.
      
      1) dontneed: 1,612 swapin
      2) madvfree: 879,585 swapin
      
      If we find that hinted pages were already swapped out when the syscall is
      called, it's pointless to keep the swapped-out pages in the pte.  Instead,
      let's free the cold pages, because swapin is more expensive than
      (alloc page + zeroing).
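
      The change amounts to dropping such entries in the MADV_FREE pte walk
      instead of leaving them in place (a sketch, simplified from the real loop):

      	/* sketch: a hinted pte that was already swapped out */
      	if (!pte_present(ptent)) {
      		swp_entry_t entry = pte_to_swp_entry(ptent);

      		if (!non_swap_entry(entry)) {
      			/* drop the swap slot; reclaim would only swap it back in */
      			free_swap_and_cache(entry);
      			pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
      		}
      		continue;
      	}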
      
      With this patch, swapin is reduced from 879,585 to 1,878, and the elapsed times are:
      
      1) dontneed: 6.10user 233.50system 0:50.44elapsed
      2) madvfree: 6.03user 401.17system 1:30.67elapsed
      2) madvfree + below patch: 6.70user 339.14system 1:04.45elapsed
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: <yalin.wang2010@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chen Gang <gang.chen.5i5j@gmail.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Daniel Micay <danielmicay@gmail.com>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Jason Evans <je@fb.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Mika Penttil <mika.penttila@nextfour.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Roland Dreier <roland@kernel.org>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      64b42bc1
    • mm: support madvise(MADV_FREE) · 854e9ed0
      Committed by Minchan Kim
      Linux doesn't have the ability to free pages lazily, while other OSes
      have long supported this via madvise(MADV_FREE).
      
      The gain is clear: the kernel can discard freed pages rather than
      swapping them out or OOMing if memory pressure happens.
      
      Without memory pressure, freed pages would be reused by userspace
      without any additional overhead (e.g., page fault + allocation +
      zeroing).
      
      Jason Evans said:
      
      : Facebook has been using MAP_UNINITIALIZED
      : (https://lkml.org/lkml/2012/1/18/308) in some of its applications for
      : several years, but there are operational costs to maintaining this
      : out-of-tree in our kernel and in jemalloc, and we are anxious to retire it
      : in favor of MADV_FREE.  When we first enabled MAP_UNINITIALIZED it
      : increased throughput for much of our workload by ~5%, and although the
      : benefit has decreased using newer hardware and kernels, there is still
      : enough benefit that we cannot reasonably retire it without a replacement.
      :
      : Aside from Facebook operations, there are numerous broadly used
      : applications that would benefit from MADV_FREE.  The ones that immediately
      : come to mind are redis, varnish, and MariaDB.  I don't have much insight
      : into Android internals and development process, but I would hope to see
      : MADV_FREE support eventually end up there as well to benefit applications
      : linked with the integrated jemalloc.
      :
      : jemalloc will use MADV_FREE once it becomes available in the Linux kernel.
      : In fact, jemalloc already uses MADV_FREE or equivalent everywhere it's
      : available: *BSD, OS X, Windows, and Solaris -- every platform except Linux
      : (and AIX, but I'm not sure it even compiles on AIX).  The lack of
      : MADV_FREE on Linux forced me down a long series of increasingly
      : sophisticated heuristics for madvise() volume reduction, and even so this
      : remains a common performance issue for people using jemalloc on Linux.
      : Please integrate MADV_FREE; many people will benefit substantially.
      
      How it works:
      
      When the madvise syscall is called, the VM clears the dirty bit of the ptes
      of the range.  If memory pressure happens, the VM checks the dirty bit of
      the page table, and if it finds the pte still "clean", it means it's a
      "lazyfree page", so the VM can discard the page instead of swapping it out.
      Once there has been a store operation to the page before the VM picks it
      for reclaim, the dirty bit is set, so the VM swaps out the page instead of
      discarding it.
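
      In pte terms the hint boils down to something like this in the madvise
      walk (sketch):

      	/* sketch: mark the pte clean and old so reclaim may discard the page */
      	if (pte_dirty(ptent) || pte_young(ptent)) {
      		ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
      		ptent = pte_mkold(pte_mkclean(ptent));
      		set_pte_at(mm, addr, pte, ptent);
      		tlb_remove_tlb_entry(tlb, pte, addr);
      	}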
      
      One thing we should notice is that, basically, MADV_FREE relies on the
      dirty bit in the page table entry to decide whether the VM is allowed to
      discard the page or not.  IOW, if the page table entry has the dirty bit
      set, the VM shouldn't discard the page.
      
      However, as an example, if a swap-in by read fault happens, the page table
      entry doesn't have the dirty bit set, so MADV_FREE could wrongly discard
      the page.
      
      To avoid the problem, MADV_FREE does additional checks with PageDirty and
      PageSwapCache.  This works because a swapped-in page lives in the swap
      cache, and once it is evicted from the swap cache, the page gets the
      PG_dirty flag.  So both page flag checks effectively prevent wrong
      discarding by MADV_FREE.
      
      However, a problem with the above logic is that a swapped-in page still
      has PG_dirty after it is removed from the swap cache, so the VM cannot
      consider the page as freeable any more, even if madvise_free is called in
      the future.
      
      Look at below example for detail.
      
          ptr = malloc();
          memset(ptr);
          ..
          ..
          .. heavy memory pressure so all of pages are swapped out
          ..
          ..
          var = *ptr; -> a page swapped-in and could be removed from
                         swapcache. Then, page table doesn't mark
                         dirty bit and page descriptor includes PG_dirty
          ..
          ..
          madvise_free(ptr); -> It doesn't clear PG_dirty of the page.
          ..
          ..
          ..
          .. heavy memory pressure again.
          .. In this time, VM cannot discard the page because the page
          .. has *PG_dirty*
      
      To solve the problem, this patch clears PG_dirty only if the page is
      owned exclusively by the current process when madvise is called, because
      PG_dirty represents the ptes' dirtiness across several processes, so we
      can clear it only if we own the page exclusively.
      
      Initially, heavy users would be general allocators (e.g., jemalloc and
      tcmalloc, and hopefully glibc will support it too), and jemalloc/tcmalloc
      already support the feature on other OSes (e.g., FreeBSD).
      
        barrios@blaptop:~/benchmark/ebizzy$ lscpu
        Architecture:          x86_64
        CPU op-mode(s):        32-bit, 64-bit
        Byte Order:            Little Endian
        CPU(s):                12
        On-line CPU(s) list:   0-11
        Thread(s) per core:    1
        Core(s) per socket:    1
        Socket(s):             12
        NUMA node(s):          1
        Vendor ID:             GenuineIntel
        CPU family:            6
        Model:                 2
        Stepping:              3
        CPU MHz:               3200.185
        BogoMIPS:              6400.53
        Virtualization:        VT-x
        Hypervisor vendor:     KVM
        Virtualization type:   full
        L1d cache:             32K
        L1i cache:             32K
        L2 cache:              4096K
        NUMA node0 CPU(s):     0-11
        ebizzy benchmark(./ebizzy -S 10 -n 512)
      
        Higher avg is better.
      
         vanilla-jemalloc             MADV_free-jemalloc
      
        1 thread
        records: 10                   records: 10
        avg:   2961.90                avg:  12069.70
        std:     71.96(2.43%)         std:    186.68(1.55%)
        max:   3070.00                max:  12385.00
        min:   2796.00                min:  11746.00
      
        2 thread
        records: 10                   records: 10
        avg:   5020.00                avg:  17827.00
        std:    264.87(5.28%)         std:    358.52(2.01%)
        max:   5244.00                max:  18760.00
        min:   4251.00                min:  17382.00
      
        4 thread
        records: 10                   records: 10
        avg:   8988.80                avg:  27930.80
        std:   1175.33(13.08%)        std:   3317.33(11.88%)
        max:   9508.00                max:  30879.00
        min:   5477.00                min:  21024.00
      
        8 thread
        records: 10                   records: 10
        avg:  13036.50                avg:  33739.40
        std:    170.67(1.31%)         std:   5146.22(15.25%)
        max:  13371.00                max:  40572.00
        min:  12785.00                min:  24088.00
      
        16 thread
        records: 10                   records: 10
        avg:  11092.40                avg:  31424.20
        std:    710.60(6.41%)         std:   3763.89(11.98%)
        max:  12446.00                max:  36635.00
        min:   9949.00                min:  25669.00
      
        32 thread
        records: 10                   records: 10
        avg:  11067.00                avg:  34495.80
        std:    971.06(8.77%)         std:   2721.36(7.89%)
        max:  12010.00                max:  38598.00
        min:   9002.00                min:  30636.00
      
      In summary, MADV_FREE is much faster than MADV_DONTNEED.
      
      This patch (of 12):
      
      Add core MADV_FREE implementation.
      
      [akpm@linux-foundation.org: small cleanups]
      Signed-off-by: NMinchan Kim <minchan@kernel.org>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NHugh Dickins <hughd@google.com>
      Cc: Mika Penttilä <mika.penttila@nextfour.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Jason Evans <je@fb.com>
      Cc: Daniel Micay <danielmicay@gmail.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: <yalin.wang2010@gmail.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: "Shaohua Li" <shli@kernel.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chen Gang <gang.chen.5i5j@gmail.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Roland Dreier <roland@kernel.org>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      854e9ed0
    • V
      mm: add page_check_address_transhuge() helper · 8749cfea
      Vladimir Davydov committed
      page_referenced_one() and page_idle_clear_pte_refs_one() duplicate the
      code for looking up the pte of a (possibly transhuge) page.  Move this
      code to a new helper function, page_check_address_transhuge(), and make
      the above-mentioned functions use it.
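
      For illustration, the intended calling pattern looks roughly like the
      sketch below; the signature and surrounding code are approximations
      rather than the patch's text, assuming the helper returns the mapped
      pte or pmd together with its held page-table lock.

          pmd_t *pmd;
          pte_t *pte;
          spinlock_t *ptl;

          if (!page_check_address_transhuge(page, mm, address,
                                            &pmd, &pte, &ptl))
                  return SWAP_AGAIN;              /* not mapped here */

          if (pte) {
                  /* PTE-mapped (possibly a PTE-mapped THP) */
                  if (ptep_clear_flush_young_notify(vma, address, pte))
                          referenced++;
          } else {
                  /* PMD-mapped THP */
                  if (pmdp_clear_flush_young_notify(vma, address, pmd))
                          referenced++;
          }
          spin_unlock(ptl);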
      
      This is just a cleanup, no functional changes are intended.
      Signed-off-by: NVladimir Davydov <vdavydov@virtuozzo.com>
      Reviewed-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8749cfea
    • K
      thp: increase split_huge_page() success rate · d9654322
      Kirill A. Shutemov committed
      During freeze_page(), we remove the page from rmap.  This munlocks the
      page if it was mlocked.  clear_page_mlock() uses the lru cache, which
      temporarily pins the page.
      
      Let's drain the lru cache before checking the page's count vs.
      mapcount.  With this change, an mlocked page is split on the first
      attempt, provided it was not pinned by somebody else.
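
      A hypothetical sketch of the idea (names and exact placement are
      assumptions, not the patch's diff): flush the per-cpu lru caches so the
      temporary reference taken via clear_page_mlock() is dropped before the
      pin check compares the page's reference count and map count.

          lru_add_drain_all();            /* drop pagevec refs on all CPUs */

          /*
           * After the page has been unmapped, only our own reference
           * should remain; anything more means somebody still pins it.
           */
          if (total_mapcount(head) || page_count(head) != 1) {
                  ret = -EBUSY;           /* still pinned, give up the split */
                  goto out;
          }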
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d9654322
    • K
      thp: add debugfs handle to split all huge pages · 49071d43
      Kirill A. Shutemov committed
      Writing 1 into 'split_huge_pages' will try to find and split all huge
      pages in the system.  This is useful for debugging.
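
      As a quick illustration, the knob can be poked from a small helper
      program; the sketch below assumes debugfs is mounted at
      /sys/kernel/debug and that the kernel provides this handle.

          #include <stdio.h>

          int main(void)
          {
                  /* ask the kernel to try splitting every huge page */
                  FILE *f = fopen("/sys/kernel/debug/split_huge_pages", "w");

                  if (!f) {
                          perror("open split_huge_pages");
                          return 1;
                  }
                  if (fputs("1", f) == EOF)
                          perror("write split_huge_pages");
                  fclose(f);
                  return 0;
          }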
      
      [akpm@linux-foundation.org: fix printk text, per Vlastimil]
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      49071d43
    • K
      mm: prepare page_referenced() and page_idle to new THP refcounting · b20ce5e0
      Kirill A. Shutemov committed
      Both page_referenced() and page_idle_clear_pte_refs_one() assume that
      a THP can only be mapped with a PMD, so there is no reason to look at
      PTEs for PageTransHuge() pages.  That is no longer true: a THP can be
      mapped with PTEs too.
      
      The patch removes the PageTransHuge() test from these functions and
      open-codes the page table check.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b20ce5e0
    • K
      thp: allow mlocked THP again · e90309c9
      Kirill A. Shutemov committed
      Before THP refcounting rework, THP was not allowed to cross VMA
      boundary.  So, if we have THP and we split it, PG_mlocked can be safely
      transferred to small pages.
      
      With new THP refcounting and naive approach to mlocking we can end up
      with this scenario:
       1. we have an mlocked THP, which belongs to one VM_LOCKED VMA.
       2. the process does munlock() on *part* of the THP:
            - the VMA is split into two, one of them VM_LOCKED;
            - the huge PMD is split into a PTE table;
            - the THP is still mlocked;
       3. split_huge_page():
            - it transfers PG_mlocked to *all* small pages regardless of
              whether they belong to any VM_LOCKED VMA.
      
      We probably could munlock() all small pages in split_huge_page(), but I
      think we already have an accounting issue at step two.
      
      Instead of forbidding mlocked pages altogether, we just avoid mlocking
      PTE-mapped THPs and munlock THPs on split_huge_pmd().
      
      This means PTE-mapped THPs will be on the normal lru lists and will be
      split under memory pressure by vmscan.  After the split, vmscan will
      detect the unevictable small pages and mlock them.
      
      With this approach we shouldn't hit a situation like the one described
      above.
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Steve Capper <steve.capper@linaro.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e90309c9
    • K
      mm: re-enable THP · 61f5d698
      Kirill A. Shutemov committed
      All parts of THP with the new refcounting are now in place.  We can now
      re-enable THP.
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Tested-by: NSasha Levin <sasha.levin@oracle.com>
      Tested-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: NJerome Marchand <jmarchan@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Steve Capper <steve.capper@linaro.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      61f5d698