1. 15 Apr 2011, 1 commit
  2. 13 Apr 2011, 1 commit
  3. 14 Jan 2011, 3 commits
  4. 16 Dec 2010, 1 commit
    • install_special_mapping skips security_file_mmap check. · 462e635e
      Authored by Tavis Ormandy
      The install_special_mapping routine (used, for example, to setup the
      vdso) skips the security check before insert_vm_struct, allowing a local
      attacker to bypass the mmap_min_addr security restriction by limiting
      the available pages for special mappings.
      
      bprm_mm_init() also skips the check, and although I don't think this can
      be used to bypass any restrictions, I don't see any reason not to have
      the security check.
      
        $ uname -m
        x86_64
        $ cat /proc/sys/vm/mmap_min_addr
        65536
        $ cat install_special_mapping.s
        section .bss
            resb BSS_SIZE
        section .text
            global _start
            _start:
                mov     eax, __NR_pause
                int     0x80
        $ nasm -D__NR_pause=29 -DBSS_SIZE=0xfffed000 -f elf -o install_special_mapping.o install_special_mapping.s
        $ ld -m elf_i386 -Ttext=0x10000 -Tbss=0x11000 -o install_special_mapping install_special_mapping.o
        $ ./install_special_mapping &
        [1] 14303
        $ cat /proc/14303/maps
        0000f000-00010000 r-xp 00000000 00:00 0                                  [vdso]
        00010000-00011000 r-xp 00001000 00:19 2453665                            /home/taviso/install_special_mapping
        00011000-ffffe000 rwxp 00000000 00:00 0                                  [stack]
      
      It's worth noting that Red Hat are shipping with mmap_min_addr set to
      4096.
      Signed-off-by: Tavis Ormandy <taviso@google.com>
      Acked-by: Kees Cook <kees@ubuntu.com>
      Acked-by: Robert Swiecki <swiecki@google.com>
      [ Changed to not drop the error code - akpm ]
      Reviewed-by: James Morris <jmorris@namei.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      462e635e
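
      For reference, a minimal userspace sketch (not part of the patch; the fixed
      address and error handling are illustrative assumptions) of the restriction
      that mmap_min_addr normally enforces on an ordinary mapping:

        #include <stdio.h>
        #include <errno.h>
        #include <string.h>
        #include <sys/mman.h>

        int main(void)
        {
                /* 0x1000 is below a typical mmap_min_addr of 65536 (assumption). */
                void *p = mmap((void *)0x1000, 4096, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);

                if (p == MAP_FAILED)
                        printf("low mapping refused: %s\n", strerror(errno));
                else
                        printf("mapped at %p; mmap_min_addr is not being enforced\n", p);
                return 0;
        }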
  5. 30 Oct 2010, 1 commit
    • audit mmap · 120a795d
      Authored by Al Viro
      Normal syscall audit doesn't catch the 5th argument of a syscall.  It also
      doesn't catch the contents of userland structures pointed to by a
      syscall argument, so for both the old and new mmap(2) ABIs it doesn't
      record the descriptor we are mapping.  For the old one it also misses
      the flags.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      120a795d
  6. 23 Sep 2010, 1 commit
  7. 25 Aug 2010, 1 commit
  8. 21 Aug 2010, 1 commit
  9. 10 Aug 2010, 4 commits
  10. 09 Jun 2010, 1 commit
  11. 27 Apr 2010, 1 commit
  12. 13 Apr 2010, 2 commits
    • vma_adjust: fix the copying of anon_vma chains · 287d97ac
      Authored by Linus Torvalds
      When we move the boundaries between two vma's due to things like
      mprotect, we need to make sure that the anon_vma of the pages that got
      moved from one vma to another gets properly copied around.  And that was
      not always the case, in this rather hard-to-follow code sequence.
      
      Clarify the code, and fix it so that it copies the anon_vma from the
      right source.
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Tested-by: Borislav Petkov <bp@alien8.de> [ "Yeah, not so much this one either" ]
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      287d97ac
    • Simplify and comment on anon_vma re-use for anon_vma_prepare() · d0e9fe17
      Authored by Linus Torvalds
      This changes the anon_vma reuse case to require that we only reuse
      simple anon_vma's - ie the case when the vma only has a single anon_vma
      associated with it.
      
      This means that a reuse of an anon_vma from an adjacent vma will always
      guarantee that both vma's are associated not only with the same
      anon_vma, they will also have the same anon_vma chain (of just a single
      entry in this case).
      
      And since anon_vma re-use was the only case where the same anon_vma
      might be associated with different chains of anon_vma's, we now have the
      case that every vma that shares the same anon_vma will always also have
      the same chain.  That makes it much easier to think about merging vma's
      that share the same anon_vma's: you can always just drop the other
      anon_vma chain in anon_vma_merge() since you know that they are always
      identical.
      
      This also splits up the function to validate the anon_vma re-use, and
      adds a lot of commentary about the possible races.
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Tested-by: Borislav Petkov <bp@alien8.de> [ "That didn't fix it" ]
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d0e9fe17
  13. 13 Mar 2010, 1 commit
  14. 07 Mar 2010, 5 commits
    • mm: remove VM_LOCK_RMAP code · fc148a5f
      Authored by Rik van Riel
      When a VMA is in an inconsistent state during setup or teardown, the worst
      that can happen is that the rmap code will not be able to find the page.
      
      The mapping is in the process of being torn down (PTEs just got
      invalidated by munmap), or set up (no PTEs have been instantiated yet).
      
      It is also impossible for the rmap code to follow a pointer to an already
      freed VMA, because the rmap code holds the anon_vma->lock, which the VMA
      teardown code needs to take before the VMA is removed from the anon_vma
      chain.
      
      Hence, we should not need the VM_LOCK_RMAP locking at all.
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fc148a5f
    • mm: change anon_vma linking to fix multi-process server scalability issue · 5beb4930
      Authored by Rik van Riel
      The old anon_vma code can lead to scalability issues with heavily forking
      workloads.  Specifically, each anon_vma will be shared between the parent
      process and all its child processes.
      
      In a workload with 1000 child processes and a VMA with 1000 anonymous
      pages per process that get COWed, this leads to a system with a million
      anonymous pages in the same anon_vma, each of which is mapped in just one
      of the 1000 processes.  However, the current rmap code needs to walk them
      all, leading to O(N) scanning complexity for each page.
      
      This can result in systems where one CPU is walking the page tables of
      1000 processes in page_referenced_one, while all other CPUs are stuck on
      the anon_vma lock.  This leads to catastrophic failure for a benchmark
      like AIM7, where the total number of processes can reach in the tens of
      thousands.  Real workloads are still a factor 10 less process intensive
      than AIM7, but they are catching up.
      
      This patch changes the way anon_vmas and VMAs are linked, which allows us
      to associate multiple anon_vmas with a VMA.  At fork time, each child
      process gets its own anon_vmas, in which its COWed pages will be
      instantiated.  The parents' anon_vma is also linked to the VMA, because
      non-COWed pages could be present in any of the children.
      
      This reduces rmap scanning complexity to O(1) for the pages of the 1000
      child processes, with O(N) complexity for at most 1/N pages in the system.
       This reduces the average scanning cost in heavily forking workloads from
      O(N) to 2.
      
      The only real complexity in this patch stems from the fact that linking a
      VMA to anon_vmas now involves memory allocations.  This means vma_adjust
      can fail, if it needs to attach a VMA to anon_vma structures.  This in
      turn means error handling needs to be added to the calling functions.
      
      A second source of complexity is that, because there can be multiple
      anon_vmas, the anon_vma linking in vma_adjust can no longer be done under
      "the" anon_vma lock.  To prevent the rmap code from walking up an
      incomplete VMA, this patch introduces the VM_LOCK_RMAP VMA flag.  This bit
      flag uses the same slot as the NOMMU VM_MAPPED_COPY, with an ifdef in mm.h
      to make sure it is impossible to compile a kernel that needs both symbolic
      values for the same bitflag.
      
      Some test results:
      
      Without the anon_vma changes, when AIM7 hits around 9.7k users (on a test
      box with 16GB RAM and not quite enough IO), the system ends up running
      >99% in system time, with every CPU on the same anon_vma lock in the
      pageout code.
      
      With these changes, AIM7 hits the cross-over point around 29.7k users.
      This happens with ~99% IO wait time; there never seems to be any spike in
      system time.  The anon_vma lock contention appears to be resolved.
      
      [akpm@linux-foundation.org: cleanups]
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5beb4930
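
      To make the new linking easier to picture, here is a simplified,
      self-contained C sketch of the link object this patch introduces (the
      field names roughly follow the patch, but the definition is trimmed and
      the list type is stubbed out; this is not the exact kernel code):

        /* Minimal stand-in for the kernel's struct list_head. */
        struct list_head { struct list_head *next, *prev; };

        struct vm_area_struct;   /* opaque here */
        struct anon_vma;         /* opaque here */

        /*
         * One anon_vma_chain entry exists per (vma, anon_vma) association:
         * a VMA walks its own associations via same_vma, while rmap walks
         * all VMAs that may hold an anon_vma's pages via same_anon_vma.
         */
        struct anon_vma_chain {
                struct vm_area_struct *vma;
                struct anon_vma *anon_vma;
                struct list_head same_vma;       /* list of links on one VMA */
                struct list_head same_anon_vma;  /* list of links on one anon_vma */
        };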
    • mm: use rlimit helpers · 59e99e5b
      Authored by Jiri Slaby
      Make sure the compiler won't do weird things with limits; e.g. fetching
      them twice may return two different values after writable limits are
      implemented.
      
      That is, either use the rlimit helpers added in
      3e10e716 ("resource: add helpers for
      fetching rlimits") or ACCESS_ONCE where those don't apply.
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      59e99e5b
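
      As a rough userspace analogue of the reasoning above (a sketch, not kernel
      code: rlim_cur and would_exceed are made-up stand-ins, and ACCESS_ONCE is
      re-defined locally to mirror the kernel macro), reading a concurrently
      writable limit exactly once looks like this:

        #include <stdio.h>

        #define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))

        /* Stand-in for a limit another thread may rewrite at any time. */
        static unsigned long rlim_cur = 8192;

        static int would_exceed(unsigned long len)
        {
                /* Fetch the limit a single time; two separate reads could
                 * observe two different values if a writer races with us. */
                unsigned long lim = ACCESS_ONCE(rlim_cur);
                return len > lim;
        }

        int main(void)
        {
                printf("len 10000 exceeds limit: %d\n", would_exceed(10000));
                return 0;
        }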
    • mm: mlock_vma_pages_range() only return success or failure · 06f9d8c2
      Authored by KOSAKI Motohiro
      Currently, mlock_vma_pages_range() only returns len or 0, so the current
      error handling in mmap_region() is needlessly complex.
      
      This patch simplifies it and makes it consistent with the brk() code.
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamewzawa.hiroyu@jp.fujitsu.com>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      06f9d8c2
    • mm: mlock_vma_pages_range() never return negative value · c58267c3
      Authored by KOSAKI Motohiro
      Currently, mlock_vma_pages_range() never returns a negative value, so we
      can remove some pointless error checks.
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamewzawa.hiroyu@jp.fujitsu.com>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c58267c3
  15. 31 Dec 2009, 1 commit
  16. 16 Dec 2009, 2 commits
    • mm: uncached vma support with writenotify · c9d0bf24
      Authored by Magnus Damm
      Modify the generic mmap() code to keep the cache attribute in
      vma->vm_page_prot regardless of whether writenotify is enabled.  Without
      this patch the cache configuration selected by f_op->mmap() is overwritten
      if writenotify is enabled, making it impossible to keep the vma uncached.
      
      Needed by drivers such as drivers/video/sh_mobile_lcdcfb.c, which use
      deferred I/O together with uncached memory.
      Signed-off-by: Magnus Damm <damm@opensource.se>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: Jaya Kumar <jayakumar.lkml@gmail.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c9d0bf24
    • mmap: don't return ENOMEM when mapcount is temporarily exceeded in munmap() · 659ace58
      Authored by KOSAKI Motohiro
      On ia64, the following test program exits abnormally because the glibc
      thread library calls abort().
      
       ========================================================
       (gdb) bt
       #0  0xa000000000010620 in __kernel_syscall_via_break ()
       #1  0x20000000003208e0 in raise () from /lib/libc.so.6.1
       #2  0x2000000000324090 in abort () from /lib/libc.so.6.1
       #3  0x200000000027c3e0 in __deallocate_stack () from /lib/libpthread.so.0
       #4  0x200000000027f7c0 in start_thread () from /lib/libpthread.so.0
       #5  0x200000000047ef60 in __clone2 () from /lib/libc.so.6.1
       ========================================================
      
      The fact is, glibc calls munmap() at thread exit time to free the stack,
      and it assumes the call never fails.  However, munmap() often splits a
      vma, and with a high mapcount that can return -ENOMEM.
      
      That's unreasonable, because unmapping a stack never increases the
      mapcount for good; the limit is only exceeded temporarily, and a
      temporary internal excess shouldn't produce ENOMEM.
      
      This patch fixes that.
      
       test_max_mapcount.c
       ==================================================================
        #include<stdio.h>
        #include<stdlib.h>
        #include<string.h>
        #include<pthread.h>
        #include<errno.h>
        #include<unistd.h>
      
        #define THREAD_NUM 30000
        #define MAL_SIZE (8*1024*1024)
      
       void *wait_thread(void *args)
       {
       	void *addr;
      
       	addr = malloc(MAL_SIZE);
       	sleep(10);
      
       	return NULL;
       }
      
       void *wait_thread2(void *args)
       {
       	sleep(60);
      
       	return NULL;
       }
      
       int main(int argc, char *argv[])
       {
       	int i;
       	pthread_t thread[THREAD_NUM], th;
       	int ret, count = 0;
       	pthread_attr_t attr;
      
       	ret = pthread_attr_init(&attr);
       	if(ret) {
       		perror("pthread_attr_init");
       	}
      
       	ret = pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
       	if(ret) {
       		perror("pthread_attr_setdetachstate");
       	}
      
       	for (i = 0; i < THREAD_NUM; i++) {
       		ret = pthread_create(&th, &attr, wait_thread, NULL);
       		if(ret) {
       			fprintf(stderr, "[%d] ", count);
       			perror("pthread_create");
       		} else {
       			printf("[%d] create OK.\n", count);
       		}
       		count++;
      
       		ret = pthread_create(&thread[i], &attr, wait_thread2, NULL);
       		if(ret) {
       			fprintf(stderr, "[%d] ", count);
       			perror("pthread_create");
       		} else {
       			printf("[%d] create OK.\n", count);
       		}
       		count++;
       	}
      
       	sleep(3600);
       	return 0;
       }
       ==================================================================
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      659ace58
  17. 11 Dec 2009, 3 commits
  18. 25 Oct 2009, 1 commit
  19. 28 Sep 2009, 1 commit
  20. 22 Sep 2009, 8 commits
    • hugetlb: add MAP_HUGETLB for mmaping pseudo-anonymous huge page regions · 4e52780d
      Authored by Eric B Munson
      Add a flag for mmap that will be used to request a huge page region that
      will look like anonymous memory to userspace.  This is accomplished by
      using a file on the internal vfsmount.  MAP_HUGETLB is a modifier of
      MAP_ANONYMOUS and so must be specified with it.  The region will behave
      the same as a MAP_ANONYMOUS region using small pages.
      
      [akpm@linux-foundation.org: fix arch definitions of MAP_HUGETLB]
      Signed-off-by: Eric B Munson <ebmunson@us.ibm.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Adam Litke <agl@us.ibm.com>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4e52780d
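
      A minimal usage sketch of the new flag (assumptions: hugetlb support is
      enabled, huge pages are reserved, and the huge page size is 2 MiB; the
      fallback #define gives the x86 value and is only for older headers):

        #include <stdio.h>
        #include <stddef.h>
        #include <sys/mman.h>

        #ifndef MAP_HUGETLB
        #define MAP_HUGETLB 0x40000     /* x86 value; other architectures differ */
        #endif

        int main(void)
        {
                size_t len = 2 * 1024 * 1024;   /* one 2 MiB huge page (assumed size) */
                void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

                if (p == MAP_FAILED) {
                        perror("mmap(MAP_HUGETLB)");
                        return 1;
                }
                /* Behaves like ordinary anonymous memory, just huge-page backed. */
                ((char *)p)[0] = 1;
                munmap(p, len);
                return 0;
        }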
    • mmap: save some cycles for the shared anonymous mapping · f8dbf0a7
      Authored by Huang Shijie
      shmem_zero_setup() does not change vm_start, pgoff or vm_flags; only some
      drivers change them (such as drivers/video/bfin-t350mcqb-fb.c).
      
      Move this code to a more appropriate place to save cycles for the shared
      anonymous mapping case.
      Signed-off-by: Huang Shijie <shijie8@gmail.com>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Acked-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f8dbf0a7
    • mmap: avoid unnecessary anon_vma lock acquisition in vma_adjust() · 252c5f94
      Authored by Lee Schermerhorn
      We noticed very erratic behavior [throughput] with the AIM7 shared
      workload running on recent distro [SLES11] and mainline kernels on an
      8-socket, 32-core, 256GB x86_64 platform.  On the SLES11 kernel
      [2.6.27.19+] with Barcelona processors, as we increased the load [10s of
      thousands of tasks], the throughput would vary between two "plateaus"--one
      at ~65K jobs per minute and one at ~130K jpm.  The simple patch below
      causes the results to smooth out at the ~130k plateau.
      
      But wait, there's more:
      
      We do not see this behavior on smaller platforms--e.g., 4 socket/8 core.
      This could be the result of the larger number of cpus on the larger
      platform--a scalability issue--or it could be the result of the larger
      number of interconnect "hops" between some nodes in this platform and how
      the tasks for a given load end up distributed over the nodes' cpus and
      memories--a stochastic NUMA effect.
      
      The variability in the results is less pronounced [on the same platform]
      with Shanghai processors and with mainline kernels.  With 31-rc6 on
      Shanghai processors and 288 file systems on 288 fibre attached storage
      volumes, the curves [jpm vs load] are both quite flat with the patched
      kernel consistently producing ~3.9% better throughput [~80K jpm vs ~77K
      jpm] than the unpatched kernel.
      
      Profiling indicated that the "slow" runs were incurring high[er]
      contention on an anon_vma lock in vma_adjust(), apparently called from the
      sbrk() system call.
      
      The patch:
      
      A comment in mm/mmap.c:vma_adjust() suggests that we don't really need the
      anon_vma lock when we're only adjusting the end of a vma, as is the case
      for brk().  The comment questions whether it's worth while to optimize for
      this case.  Apparently, on the newer, larger x86_64 platforms, with
      interesting NUMA topologies, it is worth while--especially considering
      that the patch [if correct!] is quite simple.
      
      We can detect this condition--no overlap with next vma--by noting a NULL
      "importer".  The anon_vma pointer will also be NULL in this case, so
      simply avoid loading vma->anon_vma to avoid the lock.
      
      However, we DO need to take the anon_vma lock when we're inserting a vma
      ['insert' non-NULL] even when we have no overlap [NULL "importer"], so we
      need to check for 'insert', as well.  And Hugh points out that we should
      also take it when adjusting vm_start (so that rmap.c can rely upon
      vma_address() while it holds the anon_vma lock).
      
      akpm: Zhang Yanmin reports a 150% throughput improvement with aim7, so it
      might be -stable material even though this isn't a regression: "this
      issue is not clear on dual socket Nehalem machine (2*4*2 cpu), but is
      severe on large machine (4*8*2 cpu)"
      
      [hugh.dickins@tiscali.co.uk: test vma start too]
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Eric Whitney <eric.whitney@hp.com>
      Tested-by: "Zhang, Yanmin" <yanmin_zhang@linux.intel.com>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      252c5f94
    • mmap: remove unnecessary code · cdf7b341
      Authored by Huang Shijie
      If (flags & MAP_LOCKED) is true, vm_flags already contains the VM_LOCKED
      bit, which is set by calc_vm_flag_bits().
      
      So there is no need to set it again; just remove the redundant code.
      Signed-off-by: Huang Shijie <shijie8@gmail.com>
      Acked-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cdf7b341
    • ksm: clean up obsolete references · a913e182
      Authored by Hugh Dickins
      A few cleanups, given the munlock fix: the comment on ksm_test_exit() no
      longer applies, and it can be made private to ksm.c; there's no more
      reference to mmu_gather or tlb.h, and mmap.c doesn't need ksm.h.
      Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Acked-by: Izik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a913e182
    • ksm: remove VM_MERGEABLE_FLAGS · 8314c4f2
      Authored by Hugh Dickins
      KSM originally stood for Kernel Shared Memory: but the kernel has long
      supported shared memory, and VM_SHARED and VM_MAYSHARE vmas, and KSM is
      something else.  So we switched to saying "merge" instead of "share".
      
      But Chris Wright points out that this is confusing where mmap.c merges
      adjacent vmas: most especially in the name VM_MERGEABLE_FLAGS, used by
      is_mergeable_vma() to let vmas be merged despite flags being different.
      
      Call it VMA_MERGE_DESPITE_FLAGS?  Perhaps, but at present it consists
      only of VM_CAN_NONLINEAR: so for now it's clearer on all sides to use
      that directly, with a comment on it in is_mergeable_vma().
      Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Acked-by: Izik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8314c4f2
    • ksm: fix deadlock with munlock in exit_mmap · 1c2fb7a4
      Authored by Andrea Arcangeli
      Rawhide users have reported a hang at startup when cryptsetup is run: the
      same problem can be reproduced simply by running a program consisting of
      int main() { mlockall(MCL_CURRENT | MCL_FUTURE); return 0; }
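
      (For convenience, a complete, compile-ready form of that one-line
      reproducer:)

        #include <sys/mman.h>

        int main(void)
        {
                /* Lock all current and future mappings, then exit;
                 * exit_mmap() then has to munlock the VM_LOCKED areas. */
                mlockall(MCL_CURRENT | MCL_FUTURE);
                return 0;
        }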
      
      The problem is that exit_mmap() applies munlock_vma_pages_all() to
      clean up VM_LOCKED areas, and its current implementation (stupidly)
      tries to fault in absent pages, for example where PROT_NONE prevented
      them being faulted in when mlocking.  Whereas the "ksm: fix oom
      deadlock" patch, knowing there's a race by which KSM might try to fault
      in pages after exit_mmap() had finally zapped the range, backs out of
      such faults doing nothing when its ksm_test_exit() notices mm_users 0.
      
      So revert that part of "ksm: fix oom deadlock" which moved the
      ksm_exit() call from before exit_mmap() to the middle of exit_mmap();
      and remove those ksm_test_exit() checks from the page fault paths, so
      allowing the munlocking to proceed without interference.
      
      ksm_exit, if there are rmap_items still chained on this mm slot, takes
      mmap_sem write side: so preventing KSM from working on an mm while
      exit_mmap runs.  And KSM will bail out as soon as it notices that
      mm_users is already zero, thanks to its internal ksm_test_exit checks.
      So that when a task is killed by OOM killer or the user, KSM will not
      indefinitely prevent it from running exit_mmap to release its memory.
      
      This does break a part of what "ksm: fix oom deadlock" was trying to
      achieve.  When unmerging KSM (echo 2 >/sys/kernel/mm/ksm), and even
      when ksmd itself has to cancel a KSM page, it is possible that the
      first OOM-kill victim would be the KSM process being faulted: then its
      memory won't be freed until a second victim has been selected (freeing
      memory for the unmerging fault to complete).
      
      But the OOM killer is already liable to kill a second victim once the
      intended victim's p->mm goes to NULL: so there's not much point in
      rejecting this KSM patch before fixing that OOM behaviour.  It is very
      much more important to allow KSM users to boot up, than to haggle over
      an unlikely and poorly supported OOM case.
      
      We also intend to fix munlocking to not fault pages: at which point
      this patch _could_ be reverted; though that would be controversial, so
      we hope to find a better solution.
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: Justin M. Forbes <jforbes@redhat.com>
      Acked-for-now-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Izik Eidus <ieidus@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1c2fb7a4
    • ksm: fix oom deadlock · 9ba69294
      Authored by Hugh Dickins
      There's a now-obvious deadlock in KSM's out-of-memory handling:
      imagine ksmd or KSM_RUN_UNMERGE handling, holding ksm_thread_mutex,
      trying to allocate a page to break KSM in an mm which becomes the
      OOM victim (quite likely in the unmerge case): it's killed and goes
      to exit, and hangs there waiting to acquire ksm_thread_mutex.
      
      Clearly we must not require ksm_thread_mutex in __ksm_exit, simple
      though that made everything else: perhaps use mmap_sem somehow?
      And part of the answer lies in the comments on unmerge_ksm_pages:
      __ksm_exit should also leave all the rmap_item removal to ksmd.
      
      But there's a fundamental problem, that KSM relies upon mmap_sem to
      guarantee the consistency of the mm it's dealing with, yet exit_mmap
      tears down an mm without taking mmap_sem.  And bumping mm_users won't
      help at all, that just ensures that the pages the OOM killer assumes
      are on their way to being freed will not be freed.
      
      The best answer seems to be, to move the ksm_exit callout from just
      before exit_mmap, to the middle of exit_mmap: after the mm's pages
      have been freed (if the mmu_gather is flushed), but before its page
      tables and vma structures have been freed; and down_write,up_write
      mmap_sem there to serialize with KSM's own reliance on mmap_sem.
      
      But KSM then needs to be careful, whenever it downs mmap_sem, to
      check that the mm is not already exiting: there's a danger of using
      find_vma on a layout that's being torn apart, or writing into page
      tables which have been freed for reuse; and even do_anonymous_page
      and __do_fault need to check they're not being called by break_ksm
      to reinstate a pte after zap_pte_range has zapped that page table.
      
      Though it might be clearer to add an exiting flag, set while holding
      mmap_sem in __ksm_exit, that wouldn't cover the issue of reinstating
      a zapped pte.  All we need is to check whether mm_users is 0 - but
      must remember that ksmd may detect that before __ksm_exit is reached.
      So, ksm_test_exit(mm) added to comment such checks on mm->mm_users.
      
      __ksm_exit now has to leave clearing up the rmap_items to ksmd, since
      that needs ksm_thread_mutex; but it shifts the exiting mm to just after
      the ksm_scan cursor so that it will soon be dealt with.  __ksm_enter
      raises mm_count to hold the mm_struct, and ksmd's exit processing
      (exactly like its processing when it finds all VM_MERGEABLEs unmapped)
      mmdrops it, with a similar procedure for KSM_RUN_UNMERGE (which has
      stopped ksmd).
      
      But also give __ksm_exit a fast path: when there's no complication
      (no rmap_items attached to mm and it's not at the ksm_scan cursor),
      it can safely do all the exiting work itself.  This is not just an
      optimization: when ksmd is not running, the raised mm_count would
      otherwise leak mm_structs.
      Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Acked-by: Izik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9ba69294