1. 11 Jan, 2017 (2 commits)
  2. 08 Jan, 2017 (1 commit)
    • mm: stop leaking PageTables · b0b9b3df
      Authored by Hugh Dickins
      4.10-rc loadtest (even on x86, and even without THPCache) fails with
      "fork: Cannot allocate memory" or some such; and /proc/meminfo shows
      PageTables growing.
      
      Commit 953c66c2 ("mm: THP page cache support for ppc64") that got
      merged in rc1 removed the freeing of an unused preallocated pagetable
      after do_fault_around() has called map_pages().
      
      This is usually a good optimization, so that the followup doesn't have
      to reallocate one; but it's not sufficient to shift the freeing into
      alloc_set_pte(), since there are failure cases (most commonly
      VM_FAULT_RETRY) which never reach finish_fault().
      
      Check and free it at the outer level in do_fault(), then we don't need
      to worry in alloc_set_pte(), and can restore that to how it was (I
      cannot find any reason to pte_free() under lock as it was doing).
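
      For illustration, a minimal sketch of the shape of that fix, assuming
      the 4.10-era struct vm_fault fields (paraphrased, not the verbatim
      patch):

          static int do_fault(struct vm_fault *vmf)
          {
                  struct vm_area_struct *vma = vmf->vma;
                  int ret;

                  if (!vma->vm_ops->fault)
                          ret = VM_FAULT_SIGBUS;
                  else if (!(vmf->flags & FAULT_FLAG_WRITE))
                          ret = do_read_fault(vmf);
                  else if (!(vma->vm_flags & VM_SHARED))
                          ret = do_cow_fault(vmf);
                  else
                          ret = do_shared_fault(vmf);

                  /*
                   * Free the unused preallocated pagetable here, at the
                   * outer level, so every exit path (including
                   * VM_FAULT_RETRY, which never reaches finish_fault())
                   * is covered.
                   */
                  if (vmf->prealloc_pte) {
                          pte_free(vma->vm_mm, vmf->prealloc_pte);
                          vmf->prealloc_pte = NULL;
                  }
                  return ret;
          }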
      
      And fix a separate pagetable leak, or crash, introduced by the same
      change, that could only show up on some ppc64: why does do_set_pmd()'s
      failure case attempt to withdraw a pagetable when it never deposited
      one, at the same time overwriting (so leaking) the vmf->prealloc_pte?
      Residue of an earlier implementation, perhaps? Delete it.
      
      Fixes: 953c66c2 ("mm: THP page cache support for ppc64")
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Michael Neuling <mikey@neuling.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 25 Dec, 2016 (1 commit)
  4. 15 Dec, 2016 (19 commits)
  5. 13 Dec, 2016 (4 commits)
  6. 23 Nov, 2016 (1 commit)
    • ptrace: Don't allow accessing an undumpable mm · 84d77d3f
      Authored by Eric W. Biederman
      It is a reasonable expectation that if an executable file is not
      readable, there will be no way for a user without special privileges to
      read the file.  This is enforced in ptrace_attach(), but if ptrace
      is already attached before exec there is no enforcement for read-only
      executables.
      
      As the only way to read such an mm is through access_process_vm,
      spin a variant called ptrace_access_vm that will fail if the
      target process is not being ptraced by the current process, or if
      the current process did not have sufficient privileges to read the
      target process's mm when ptracing began.
      
      In the ptrace implementations, replace access_process_vm with
      ptrace_access_vm.  There remain several ptrace sites that still use
      access_process_vm, as they are reading the target executable's
      instructions (for kernel consumption) or register stacks.  As such it
      does not appear necessary to add a permission check to those calls.
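
      A condensed sketch of what such a gate looks like (paraphrased, not
      the exact upstream function; ptracer_capable() comes from the same
      patch series):

          int ptrace_access_vm(struct task_struct *tsk, unsigned long addr,
                               void *buf, int len, unsigned int gup_flags)
          {
                  struct mm_struct *mm = get_task_mm(tsk);
                  int ret;

                  if (!mm)
                          return 0;

                  /*
                   * Fail unless the caller is the ptracer and had enough
                   * privilege at attach time to read a dumpable mm.
                   */
                  if (!tsk->ptrace || current != tsk->parent ||
                      ((get_dumpable(mm) != SUID_DUMP_USER) &&
                       !ptracer_capable(tsk, mm->user_ns))) {
                          mmput(mm);
                          return 0;
                  }

                  ret = __access_remote_vm(tsk, mm, addr, buf, len, gup_flags);
                  mmput(mm);
                  return ret;
          }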
      
      This bug has always existed in Linux.
      
      Fixes: v1.0
      Cc: stable@vger.kernel.org
      Reported-by: Andy Lutomirski <luto@amacapital.net>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
  7. 10 Nov, 2016 (2 commits)
  8. 19 Oct, 2016 (4 commits)
  9. 08 Oct, 2016 (2 commits)
    • mm: fix cache mode tracking in vm_insert_mixed() · 87744ab3
      Authored by Dan Williams
      vm_insert_mixed(), unlike vm_insert_pfn_prot() and vmf_insert_pfn_pmd(),
      fails to check the pgprot_t it uses for the mapping against the one
      recorded in the memtype tracking tree.  Add the missing call to
      track_pfn_insert() to preclude cases where incompatible aliased mappings
      are established for a given physical address range.
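
      A sketch of the shape of the fix (paraphrased; the tail of the
      function is elided):

          int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
                              pfn_t pfn)
          {
                  pgprot_t pgprot = vma->vm_page_prot;

                  BUG_ON(!(vma->vm_flags & VM_MIXEDMAP));

                  if (addr < vma->vm_start || addr >= vma->vm_end)
                          return -EFAULT;

                  /*
                   * The missing call: consult the memtype tracking tree so
                   * the cache mode in pgprot agrees with any existing
                   * mapping of this physical range.
                   */
                  track_pfn_insert(vma, &pgprot, pfn);

                  /* ... then insert the pfn/page using pgprot, as before ... */
          }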
      
      Link: http://lkml.kernel.org/r/147328717909.35069.14256589123570653697.stgit@dwillia2-desk3.amr.corp.intel.com
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Cc: David Airlie <airlied@linux.ie>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: make sure that kthreads will not refault oom reaped memory · 3f70dc38
      Authored by Michal Hocko
      There are only a few use_mm() users in the kernel right now.  Most of them
      write to the target memory but vhost driver relies on
      copy_from_user/get_user from a kernel thread context.  This makes it
      impossible to reap the memory of an oom victim which shares the mm with
      the vhost kernel thread because it could see a zero page unexpectedly
      and theoretically make an incorrect decision visible outside of the
      killed task context.
      
      To quote Michael S. Tsirkin:
      : Getting an error from __get_user and friends is handled gracefully.
      : Getting zero instead of a real value will cause userspace
      : memory corruption.
      
      The vhost kernel thread is bound to an open fd of the vhost device which
      is not tied to the mm owner's life cycle in general.  The device fd can
      be inherited or passed over to another process which means that we
      really have to be careful about unexpected memory corruption because
      unlike for normal oom victims the result will be visible outside of the
      oom victim context.
      
      Make sure that no kthread context (users of use_mm) can ever see
      corrupted data because of the oom reaper: hook into the page fault
      path and check the MMF_UNSTABLE mm flag.  __oom_reap_task_mm will set
      the flag before it starts unmapping the address space, while the flag
      is checked after the page fault has been handled.  If the flag is set
      then SIGBUS is triggered, so any g-u-p user will get an error code.
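
      A sketch of the two hooks, assuming the 4.8-era handle_mm_fault()
      locals (paraphrased):

          /* In __oom_reap_task_mm(), before unmapping begins: */
          set_bit(MMF_UNSTABLE, &mm->flags);

          /* In handle_mm_fault(), after the fault has been handled: */
          if (unlikely((current->flags & PF_KTHREAD) &&
                       !(ret & VM_FAULT_ERROR) &&
                       test_bit(MMF_UNSTABLE, &mm->flags)))
                  ret = VM_FAULT_SIGBUS;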
      
      Regular tasks do not need this protection because all which share the mm
      are killed when the mm is reaped and so the corruption will not outlive
      them.
      
      This patch shouldn't have any visible effect at this moment because the
      OOM killer doesn't invoke oom reaper for tasks with mm shared with
      kthreads yet.
      
      Link: http://lkml.kernel.org/r/1472119394-11342-9-git-send-email-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  10. 26 Sep, 2016 (1 commit)
    • mm: check VMA flags to avoid invalid PROT_NONE NUMA balancing · 38e08854
      Authored by Lorenzo Stoakes
      The NUMA balancing logic uses an arch-specific PROT_NONE page table flag
      defined by pte_protnone() or pmd_protnone() to mark PTEs or huge page
      PMDs respectively as requiring balancing upon a subsequent page fault.
      User-defined PROT_NONE memory regions which also have this flag set will
      not normally invoke the NUMA balancing code as do_page_fault() will send
      a segfault to the process before handle_mm_fault() is even called.
      
      However if access_remote_vm() is invoked to access a PROT_NONE region of
      memory, handle_mm_fault() is called via faultin_page() and
      __get_user_pages() without any access checks being performed, meaning
      the NUMA balancing logic is incorrectly invoked on a non-NUMA memory
      region.
      
      A simple means of triggering this problem is to access PROT_NONE mmap'd
      memory using /proc/self/mem which reliably results in the NUMA handling
      functions being invoked when CONFIG_NUMA_BALANCING is set.
      
      This issue was reported in bugzilla (issue 99101) which includes some
      simple repro code.
      
      There are BUG_ON() checks in do_numa_page() and do_huge_pmd_numa_page(),
      added in commit c0e7cad9 to avoid accidentally provoking strange
      behaviour by attempting to apply NUMA balancing to pages that are in
      fact PROT_NONE.  The BUG_ON()s are consistently triggered by the repro.
      
      This patch moves the PROT_NONE check into mm/memory.c rather than
      invoking BUG_ON(), as faulting in these pages via faultin_page() is a
      valid reason for reaching the NUMA check with the PROT_NONE page table
      flag set, and is therefore not always a bug.
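
      A sketch of the idea: only treat a protnone entry as a NUMA hint when
      the VMA itself is accessible (paraphrased; the helper and call site
      names are shown for illustration):

          /* A VMA the user could legitimately fault on at all: */
          static inline bool vma_is_accessible(struct vm_area_struct *vma)
          {
                  return vma->vm_flags & (VM_READ | VM_WRITE | VM_EXEC);
          }

          /* In handle_pte_fault() (and the pmd analogue): */
          if (pte_protnone(entry) && vma_is_accessible(vma))
                  return do_numa_page(vmf, entry);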
      
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=99101
      Reported-by: Trevor Saunders <tbsaunde@tbsaunde.org>
      Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  11. 14 Sep, 2016 (1 commit)
    • sched/numa, mm: Revert to checking pmd/pte_write instead of VMA flags · d59dc7bc
      Authored by Rik van Riel
      Commit:
      
        4d942466 ("mm: convert p[te|md]_mknonnuma and remaining page table manipulations")
      
      changed NUMA balancing from _PAGE_NUMA to using PROT_NONE, and was quickly
      found to introduce a regression with NUMA grouping.
      
      It was followed up by these commits:
      
       53da3bc2 ("mm: fix up numa read-only thread grouping logic")
       bea66fbd ("mm: numa: group related processes based on VMA flags instead of page table flags")
       b191f9b1 ("mm: numa: preserve PTE write permissions across a NUMA hinting fault")
      
      The first two of those commits try alternate approaches to NUMA
      grouping, which apparently do not work as well as looking at the PTE
      write permissions.
      
      The latter patch preserves the PTE write permissions across a NUMA
      protection fault. However, it forgets to revert the condition for
      whether or not to group tasks together back to what it was before
      v3.19, even though the information is now preserved in the page tables
      once again.
      
      This patch brings the NUMA grouping heuristic back to what it was
      before commit 4d942466, which the changelogs of subsequent
      commits suggest worked best.
      
      We have all the information again. We should probably use it.
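
      The revert amounts to one condition in do_numa_page() (and the pmd
      analogue), roughly:

          /* Before (VMA flags, since v3.19): */
          if (!(vma->vm_flags & VM_WRITE))
                  flags |= TNF_NO_GROUP;

          /* After (back to the page table write permission): */
          if (!pte_write(pte))
                  flags |= TNF_NO_GROUP;
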
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: aarcange@redhat.com
      Cc: linux-mm@kvack.org
      Cc: mgorman@suse.de
      Link: http://lkml.kernel.org/r/20160908213053.07c992a9@annuminas.surriel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  12. 03 Aug, 2016 (2 commits)
    • mm: move swap-in anonymous page into active list · 1a8018fb
      Authored by Minchan Kim
      Every swapped-in anonymous page starts from the inactive lru list's
      head.  It will be activated unconditionally when the VM decides to
      reclaim it, because the page table entry for the page usually has the
      accessed bit set.  Thus, its window for getting a new reference is
      2 * NR_inactive + NR_active, while that of other pages is
      NR_inactive + NR_active.

      It's not fair that it has a better chance of being referenced than a
      newly allocated page, which starts from the active lru list's head.
      
      Johannes:
      
      : The page can still have a valid copy on the swap device, so preferring to
      : reclaim that page over a fresh one could make sense.  But as you point
      : out, having it start inactive instead of active actually ends up giving it
      : *more* LRU time, and that seems to be without justification.
      
      Rik:
      
      : The reason newly read in swap cache pages start on the inactive list is
      : that we do some amount of read-around, and do not know which pages will
      : get used.
      :
      : However, immediately activating the ones that DO get used, like your patch
      : does, is the right thing to do.
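
      A sketch of the change in do_swap_page(): once a swapped-in page is
      actually used by a fault, send it straight to the active list
      (paraphrased; the surrounding calls follow the 4.7-era code):

          if (page == swapcache) {
                  do_page_add_anon_rmap(page, vma, address, exclusive);
                  mem_cgroup_commit_charge(page, memcg, true, false);
                  /*
                   * The fix: readahead pages that DO get used are
                   * activated immediately instead of starting inactive.
                   */
                  activate_page(page);
          }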
      
      Link: http://lkml.kernel.org/r/1469762740-17860-1-git-send-email-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: fail prefaulting if page table allocation fails · c5f88bd2
      Authored by Vegard Nossum
      I ran into this:
      
          BUG: sleeping function called from invalid context at mm/page_alloc.c:3784
          in_atomic(): 0, irqs_disabled(): 0, pid: 1434, name: trinity-c1
          2 locks held by trinity-c1/1434:
           #0:  (&mm->mmap_sem){......}, at: [<ffffffff810ce31e>] __do_page_fault+0x1ce/0x8f0
           #1:  (rcu_read_lock){......}, at: [<ffffffff81378f86>] filemap_map_pages+0xd6/0xdd0
      
          CPU: 0 PID: 1434 Comm: trinity-c1 Not tainted 4.7.0+ #58
          Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
          Call Trace:
            dump_stack+0x65/0x84
            panic+0x185/0x2dd
            ___might_sleep+0x51c/0x600
            __might_sleep+0x90/0x1a0
            __alloc_pages_nodemask+0x5b1/0x2160
            alloc_pages_current+0xcc/0x370
            pte_alloc_one+0x12/0x90
            __pte_alloc+0x1d/0x200
            alloc_set_pte+0xe3e/0x14a0
            filemap_map_pages+0x42b/0xdd0
            handle_mm_fault+0x17d5/0x28b0
            __do_page_fault+0x310/0x8f0
            trace_do_page_fault+0x18d/0x310
            do_async_page_fault+0x27/0xa0
            async_page_fault+0x28/0x30
      
      The important bit in the above is that filemap_map_pages() is calling
      into the page allocator while holding rcu_read_lock (sleeping is not
      allowed inside RCU read-side critical sections).
      
      According to Kirill Shutemov, the prefaulting code in do_fault_around()
      is supposed to take care of this, but missing error handling means that
      the allocation failure can go unnoticed.
      
      We don't need to return VM_FAULT_OOM (or any other error) here, since we
      can just let the normal fault path try again.
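
      The fix is a null check on the preallocation, roughly (field names
      follow the 4.8-era fault_env naming):

          /* In do_fault_around(): */
          if (pmd_none(*fe->pmd)) {
                  fe->prealloc_pte = pte_alloc_one(fe->vma->vm_mm,
                                                   fe->address);
                  if (!fe->prealloc_pte)
                          goto out;       /* let the normal fault path retry */
                  smp_wmb();      /* see comment in __pte_alloc() */
          }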
      
      Fixes: 7267ec00 ("mm: postpone page table allocation until we have page to map")
      Link: http://lkml.kernel.org/r/1469708107-11868-1-git-send-email-vegard.nossum@oracle.com
      Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: "Hillf Danton" <hillf.zj@alibaba-inc.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>