1. 26 Jan, 2014 1 commit
  2. 15 Jan, 2014 1 commit
    • M
      mm: fix crash when using XFS on loopback · 03e5ac2f
      Mikulas Patocka committed
      Commit 8456a648 ("slab: use struct page for slab management") causes
      a crash in the LVM2 testsuite on PA-RISC (the crashing test is
      fsadm.sh).  The testsuite doesn't crash on 3.12, crashes on 3.13-rc1 and
      later.
      
       Bad Address (null pointer deref?): Code=15 regs=000000413edd89a0 (Addr=000006202224647d)
       CPU: 3 PID: 24008 Comm: loop0 Not tainted 3.13.0-rc6 #5
       task: 00000001bf3c0048 ti: 000000413edd8000 task.ti: 000000413edd8000
      
            YZrvWESTHLNXBCVMcbcbcbcbOGFRQPDI
       PSW: 00001000000001101111100100001110 Not tainted
       r00-03  000000ff0806f90e 00000000405c8de0 000000004013e6c0 000000413edd83f0
       r04-07  00000000405a95e0 0000000000000200 00000001414735f0 00000001bf349e40
       r08-11  0000000010fe3d10 0000000000000001 00000040829c7778 000000413efd9000
       r12-15  0000000000000000 000000004060d800 0000000010fe3000 0000000010fe3000
       r16-19  000000413edd82a0 00000041078ddbc0 0000000000000010 0000000000000001
       r20-23  0008f3d0d83a8000 0000000000000000 00000040829c7778 0000000000000080
       r24-27  00000001bf349e40 00000001bf349e40 202d66202224640d 00000000405a95e0
       r28-31  202d662022246465 000000413edd88f0 000000413edd89a0 0000000000000001
       sr00-03  000000000532c000 0000000000000000 0000000000000000 000000000532c000
       sr04-07  0000000000000000 0000000000000000 0000000000000000 0000000000000000
      
       IASQ: 0000000000000000 0000000000000000 IAOQ: 00000000401fe42c 00000000401fe430
        IIR: 539c0030    ISR: 00000000202d6000  IOR: 000006202224647d
        CPU:        3   CR30: 000000413edd8000 CR31: 0000000000000000
        ORIG_R28: 00000000405a95e0
        IAOQ[0]: vma_interval_tree_iter_first+0x14/0x48
        IAOQ[1]: vma_interval_tree_iter_first+0x18/0x48
        RP(r2): flush_dcache_page+0x128/0x388
       Backtrace:
         flush_dcache_page+0x128/0x388
         lo_splice_actor+0x90/0x148 [loop]
         splice_from_pipe_feed+0xc0/0x1d0
         __splice_from_pipe+0xac/0xc0
         lo_direct_splice_actor+0x1c/0x70 [loop]
         splice_direct_to_actor+0xec/0x228
         lo_receive+0xe4/0x298 [loop]
         loop_thread+0x478/0x640 [loop]
         kthread+0x134/0x168
         end_fault_vector+0x20/0x28
         xfs_setsize_buftarg+0x0/0x90 [xfs]
      
       Kernel panic - not syncing: Bad Address (null pointer deref?)
      
      Commit 8456a648 changes the page structure so that the slab
      subsystem reuses the page->mapping field.
      
      The crash happens in the following way:
       * XFS allocates some memory from slab and issues a bio to read data
         into it.
       * the bio is sent to the loopback device.
       * lo_receive creates an actor and calls splice_direct_to_actor.
       * lo_splice_actor copies data to the target page.
       * lo_splice_actor calls flush_dcache_page because the page may be
         mapped by userspace.  In that case we need to flush the kernel cache.
       * flush_dcache_page asks for the list of userspace mappings, however
         that page->mapping field is reused by the slab subsystem for a
         different purpose.  This causes the crash.
      
      Note that other architectures without coherent caches (sparc, arm, mips)
      also call page_mapping from flush_dcache_page, so they may crash in the
      same way.
      
      This patch fixes this bug by testing if the page is a slab page in
      page_mapping and returning NULL if it is.
      
      The patch also fixes VM_BUG_ON(PageSlab(page)) that could happen in
      earlier kernels in the same scenario on architectures without cache
      coherence when CONFIG_DEBUG_VM is enabled - so it should be backported
      to stable kernels.
      
      In the old kernels, the function page_mapping is placed in
      include/linux/mm.h, so you should modify the patch accordingly when
      backporting it.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Cc: John David Anglin <dave.anglin@bell.net>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Christoph Lameter <cl@linux.com>
      Acked-by: Pekka Enberg <penberg@kernel.org>
      Reviewed-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      03e5ac2f
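The check described in 03e5ac2f can be pictured with a small userspace sketch. This is not the kernel's code: `struct page_sketch`, its `slab` flag, and `page_mapping_sketch()` are hypothetical stand-ins, assuming only what the commit message says, namely that slab pages reuse the mapping field and that the lookup should return NULL for them.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins; not the kernel's real definitions. */
struct address_space_sketch {
	int dummy;
};

struct page_sketch {
	bool slab;     /* stands in for PageSlab(page) */
	void *mapping; /* address_space for file pages, allocator data for slab pages */
};

/* Return the page's address_space, or NULL when the field does not hold one. */
static struct address_space_sketch *page_mapping_sketch(const struct page_sketch *page)
{
	if (page->slab)
		return NULL; /* the field belongs to the slab allocator here */
	return (struct address_space_sketch *)page->mapping;
}
```

A caller in the position of flush_dcache_page would then treat a NULL result as "no userspace mappings to walk" instead of following a pointer that was never an address_space.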
  3. 12 Jan, 2014 1 commit
    • H
      thp: fix copy_page_rep GPF by testing is_huge_zero_pmd once only · eecc1e42
      Hugh Dickins committed
      We see General Protection Fault on RSI in copy_page_rep: that RSI is
      what you get from a NULL struct page pointer.
      
        RIP: 0010:[<ffffffff81154955>]  [<ffffffff81154955>] copy_page_rep+0x5/0x10
        RSP: 0000:ffff880136e15c00  EFLAGS: 00010286
        RAX: ffff880000000000 RBX: ffff880136e14000 RCX: 0000000000000200
        RDX: 6db6db6db6db6db7 RSI: db73880000000000 RDI: ffff880dd0c00000
        RBP: ffff880136e15c18 R08: 0000000000000200 R09: 000000000005987c
        R10: 000000000005987c R11: 0000000000000200 R12: 0000000000000001
        R13: ffffea00305aa000 R14: 0000000000000000 R15: 0000000000000000
        FS:  00007f195752f700(0000) GS:ffff880c7fc20000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000000093010000 CR3: 00000001458e1000 CR4: 00000000000027e0
        Call Trace:
          copy_user_huge_page+0x93/0xab
          do_huge_pmd_wp_page+0x710/0x815
          handle_mm_fault+0x15d8/0x1d70
          __do_page_fault+0x14d/0x840
          do_page_fault+0x2f/0x90
          page_fault+0x22/0x30
      
      do_huge_pmd_wp_page() tests is_huge_zero_pmd(orig_pmd) four times: but
      since shrink_huge_zero_page() can free the huge_zero_page, and we have
      no hold of our own on it here (except where the fourth test holds
      page_table_lock and has checked pmd_same), it's possible for it to
      answer yes the first time, but no to the second or third test.  Change
      all those last three to tests for NULL page.
      
      (Note: this is not the same issue as trinity's DEBUG_PAGEALLOC BUG
      in copy_page_rep with RSI: ffff88009c422000, reported by Sasha Levin
      in https://lkml.org/lkml/2013/3/29/103.  I believe that one is due
      to the source page being split, and a tail page freed, while copy
      is in progress; and not a problem without DEBUG_PAGEALLOC, since
      the pmd_same check will prevent a miscopy from being made visible.)
      
      Fixes: 97ae1749 ("thp: implement refcounting for huge zero page")
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: stable@vger.kernel.org # v3.10 v3.11 v3.12
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      eecc1e42
  4. 03 Jan, 2014 6 commits
    • N
      mm/memory-failure.c: transfer page count from head page to tail page after split thp · a3e0f9e4
      Naoya Horiguchi committed
      Memory failures on thp tail pages cause a kernel panic like the one below:
      
         mce: [Hardware Error]: Machine check events logged
         MCE exception done on CPU 7
         BUG: unable to handle kernel NULL pointer dereference at 0000000000000058
         IP: [<ffffffff811b7cd1>] dequeue_hwpoisoned_huge_page+0x131/0x1e0
         PGD bae42067 PUD ba47d067 PMD 0
         Oops: 0000 [#1] SMP
        ...
         CPU: 7 PID: 128 Comm: kworker/7:2 Tainted: G   M       O 3.13.0-rc4-131217-1558-00003-g83b7df08e462 #25
        ...
         Call Trace:
           me_huge_page+0x3e/0x50
           memory_failure+0x4bb/0xc20
           mce_process_work+0x3e/0x70
           process_one_work+0x171/0x420
           worker_thread+0x11b/0x3a0
           ? manage_workers.isra.25+0x2b0/0x2b0
           kthread+0xe4/0x100
           ? kthread_create_on_node+0x190/0x190
           ret_from_fork+0x7c/0xb0
           ? kthread_create_on_node+0x190/0x190
        ...
         RIP   dequeue_hwpoisoned_huge_page+0x131/0x1e0
         CR2: 0000000000000058
      
      The sequence leading to this problem is as follows:
       - When we have a memory error on a thp tail page, the memory error
         handler grabs a refcount on the head page to keep the thp under us.
       - Before unmapping the error page from processes, we split the thp;
         the page refcounts of the head and tail pages don't change.
       - Then we call try_to_unmap() on the error page (which was a tail
         page before).  Because we didn't pin the error page to handle the
         memory error, it is freed and removed from the LRU list.
       - The error page is no longer on the LRU list, so the first page state
         check returns "unknown page," and we move on to the second check
         with the saved page flag.
       - The saved page flag has PG_tail set, so the second page state check
         returns "hugepage."
       - We call me_huge_page() on the freed error page and hit the above panic.
      
      The root cause is that we didn't move the refcount from the head page to
      the tail page after splitting the thp.  This patch does so.
      
      This panic was introduced by commit 524fca1e ("HWPOISON: fix
      misjudgement of page_action() for errors on mlocked pages").  Note that we
      did have the same refcount problem before this commit, but it was just
      ignored because we had only first page state check which returned "unknown
      page." The commit changed the refcount problem from "doesn't work" to
      "kernel panic."
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: <stable@vger.kernel.org>	[3.9+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a3e0f9e4
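The refcount transfer that a3e0f9e4 describes can be sketched in isolation. The struct and function below are hypothetical userspace stand-ins, assuming only that the handler holds one pin on the former head page and wants that pin to end up on the error page once the thp has been split.

```c
/* Hypothetical stand-in for a page with a plain integer refcount. */
struct page_ref_sketch {
	int refcount;
};

/* Move the handler's pin from the former head page to the error page. */
static void transfer_pin_sketch(struct page_ref_sketch *head,
				struct page_ref_sketch *error_page)
{
	if (head == error_page)
		return; /* the error was on the head page; the pin is already right */

	error_page->refcount++; /* take the new pin before dropping the old one */
	head->refcount--;
}
```

Taking the new pin first means there is no window in which neither page is held by the handler.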
    • M
      mm: remove bogus warning in copy_huge_pmd() · d0319bd5
      Mel Gorman committed
      Sasha Levin reported the following warning being triggered:
      
        WARNING: CPU: 28 PID: 35287 at mm/huge_memory.c:887 copy_huge_pmd+0x145/0x3a0()
        Call Trace:
          copy_huge_pmd+0x145/0x3a0
          copy_page_range+0x3f2/0x560
          dup_mmap+0x2c9/0x3d0
          dup_mm+0xad/0x150
          copy_process+0xa68/0x12e0
          do_fork+0x96/0x270
          SyS_clone+0x16/0x20
          stub_clone+0x69/0x90
      
      This warning was introduced by "mm: numa: Avoid unnecessary disruption
      of NUMA hinting during migration" for paranoia reasons but the warning
      is bogus.  I was thinking of parallel races between NUMA hinting faults
      and forks but this warning would also be triggered by a parallel reclaim
      splitting a THP during a fork.  Remove the bogus warning.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reported-by: Sasha Levin <sasha.levin@oracle.com>
      Cc: Alex Thorlton <athorlton@sgi.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d0319bd5
    • V
      memcg: fix memcg_size() calculation · 695c6083
      Vladimir Davydov committed
      The mem_cgroup structure contains nr_node_ids pointers to
      mem_cgroup_per_node objects, not the objects themselves.
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Glauber Costa <glommer@openvz.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      695c6083
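The size calculation that 695c6083 corrects follows a common C pattern: a structure that ends in an array of pointers should be sized by the pointer size, not by the size of the pointed-to objects. The sketch below uses hypothetical stand-in structures (the field names and the per-node payload are assumptions) to show the corrected arithmetic.

```c
#include <stddef.h>

/* Hypothetical per-node object; its size is deliberately large. */
struct per_node_sketch {
	long counters[64];
};

/* Hypothetical container ending in a flexible array of pointers. */
struct memcg_sketch {
	long usage;
	struct per_node_sketch *nodeinfo[];
};

/* Bytes needed for the container plus nr_node_ids pointers. */
static size_t memcg_size_sketch(size_t nr_node_ids)
{
	return sizeof(struct memcg_sketch) +
	       nr_node_ids * sizeof(struct per_node_sketch *);
}
```

Multiplying by `sizeof(struct per_node_sketch)` instead would still be safe in this sketch, but it would over-allocate by the difference between an object and a pointer for every node.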
    • R
      mm: fix use-after-free in sys_remap_file_pages · 4eb91982
      Rik van Riel committed
      remap_file_pages calls mmap_region, which may merge the VMA with other
      existing VMAs, and free "vma".  This can lead to a use-after-free bug.
      Avoid the bug by remembering vm_flags before calling mmap_region, and
      not trying to dereference vma later.
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Cc: PaX Team <pageexec@freemail.hu>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4eb91982
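The approach in 4eb91982, remembering a field before a call that may free the object, can be shown with a small userspace sketch. Everything here is hypothetical: `region_merge_sketch()` stands in for a callee that may free its argument, and the flag value is made up for illustration.

```c
#include <stdlib.h>

#define VM_LOCKED_SKETCH 0x2000UL /* hypothetical flag value */

struct vma_sketch {
	unsigned long vm_flags;
};

/* Stand-in for a callee that may merge the VMA away and free it. */
static void region_merge_sketch(struct vma_sketch *vma)
{
	free(vma);
}

/* Returns 1 if the region was locked, using only the saved copy of the flags. */
static int remap_sketch(struct vma_sketch *vma)
{
	unsigned long vm_flags = vma->vm_flags; /* snapshot while vma is still valid */

	region_merge_sketch(vma);
	/* vma must not be dereferenced from here on */

	return (vm_flags & VM_LOCKED_SKETCH) != 0;
}
```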
    • V
      mm: munlock: fix deadlock in __munlock_pagevec() · 3b25df93
      Vlastimil Babka committed
      Commit 7225522b ("mm: munlock: batch non-THP page isolation and
      munlock+putback using pagevec") introduced __munlock_pagevec() to speed
      up munlock by holding lru_lock over multiple isolated pages.  Pages that
      fail to be isolated are put_page()d immediately, also within the lock.
      
      This can lead to deadlock when __munlock_pagevec() becomes the holder of
      the last page pin and put_page() leads to __page_cache_release() which
      also locks lru_lock.  The deadlock has been observed by Sasha Levin
      using trinity.
      
      This patch avoids the deadlock by deferring put_page() operations until
      lru_lock is released.  Another pagevec (which is also used by later
      phases of the function) is reused to gather the pages for the put_page()
      operation.
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Reported-by: Sasha Levin <sasha.levin@oracle.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3b25df93
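The deferral described in 3b25df93 can be modelled in userspace. This is a sketch under stated assumptions, not the kernel code: the lock is a plain boolean standing in for a non-recursive lru_lock, a counter records what would have been a self-deadlock, and all names ending in `_sketch` are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

#define BATCH_SKETCH 4

struct munlock_page_sketch {
	int refcount;
	bool isolated; /* did isolation from the LRU succeed? */
	bool released;
};

static bool lru_lock_held; /* models a non-recursive lock */
static int deadlocks;      /* counts attempts to retake the held lock */

/* Dropping the last pin needs the LRU lock, as the release path does. */
static void put_page_sketch(struct munlock_page_sketch *page)
{
	if (--page->refcount > 0)
		return;
	if (lru_lock_held) {
		deadlocks++; /* the real code would hang here */
		return;
	}
	lru_lock_held = true;
	page->released = true;
	lru_lock_held = false;
}

/* Collect pages that failed isolation and drop their pins after unlocking. */
static void munlock_batch_sketch(struct munlock_page_sketch **pages, size_t n)
{
	struct munlock_page_sketch *deferred[BATCH_SKETCH];
	size_t nr_deferred = 0;
	size_t i;

	lru_lock_held = true;
	for (i = 0; i < n && i < BATCH_SKETCH; i++)
		if (!pages[i]->isolated)
			deferred[nr_deferred++] = pages[i];
	lru_lock_held = false;

	for (i = 0; i < nr_deferred; i++)
		put_page_sketch(deferred[i]);
}
```

Calling `put_page_sketch()` inside the locked loop instead would increment `deadlocks` whenever the batch holds the last pin, which is the situation the commit message describes.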
    • V
      mm: munlock: fix a bug where THP tail page is encountered · c424be1c
      Vlastimil Babka committed
      Since commit ff6a6da6 ("mm: accelerate munlock() treatment of THP
      pages") munlock skips tail pages of a munlocked THP page.  However, when
      the head page already has PageMlocked unset, it will not skip the tail
      pages.
      
      Commit 7225522b ("mm: munlock: batch non-THP page isolation and
      munlock+putback using pagevec") has added a PageTransHuge() check which
      contains VM_BUG_ON(PageTail(page)).  Sasha Levin found this triggered
      using trinity, on the first tail page of a THP page without PageMlocked
      flag.
      
      This patch fixes the issue by skipping tail pages also in the case when
      PageMlocked flag is unset.  There is still a possibility of race with
      THP page split between clearing PageMlocked and determining how many
      pages to skip.  The race might result in former tail pages not being
      skipped, which is however no longer a bug, as during the skip the
      PageTail flags are cleared.
      
      However this race also affects correctness of NR_MLOCK accounting, which
      is to be fixed in a separate patch.
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Reported-by: Sasha Levin <sasha.levin@oracle.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Bob Liu <bob.liu@oracle.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c424be1c
  5. 22 Dec, 2013 1 commit
    • B
      aio/migratepages: make aio migrate pages sane · 8e321fef
      Benjamin LaHaise committed
      The arbitrary restriction on page counts offered by the core
      migrate_page_move_mapping() code results in rather suspicious looking
      fiddling with page reference counts in the aio_migratepage() operation.
      To fix this, make migrate_page_move_mapping() take an extra_count parameter
      that allows aio to tell the code about its own reference count on the page
      being migrated.
      
      While cleaning up aio_migratepage(), make it validate that the old page
      being passed in is actually what aio_migratepage() expects to prevent
      misbehaviour in the case of races.
      Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
      8e321fef
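The idea behind the extra_count parameter in 8e321fef can be sketched as a reference-count check in which the caller declares the references it holds itself. The function below is a hypothetical illustration, not the signature of migrate_page_move_mapping(): `base_count` stands for whatever references the core code expects on its own, which this sketch takes as an input rather than deriving.

```c
#include <errno.h>

/*
 * Hypothetical check: migration may proceed only if the observed count
 * equals the core's baseline plus the references the caller declares.
 */
static int check_migrate_refs_sketch(int page_count, int base_count,
				     int extra_count)
{
	int expected = base_count + extra_count;

	return page_count == expected ? 0 : -EAGAIN;
}
```

With the extra references declared up front, a subsystem such as aio would not need to adjust the page's count around the call just to satisfy the check.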
  6. 21 Dec, 2013 4 commits
  7. 19 Dec, 2013 18 commits
  8. 13 Dec, 2013 4 commits
    • J
      mm: memcg: do not allow task about to OOM kill to bypass the limit · 1f14c1ac
      Johannes Weiner committed
      Commit 49426420 ("mm: memcg: handle non-error OOM situations more
      gracefully") allowed tasks that already entered a memcg OOM condition to
      bypass the memcg limit on subsequent allocation attempts hoping this
      would expedite finishing the page fault and executing the kill.
      
      David Rientjes is worried that this breaks memcg isolation guarantees
      and since there is no evidence that the bypass actually speeds up fault
      processing just change it so that these subsequent charge attempts fail
      outright.  The notable exception being __GFP_NOFAIL charges which are
      required to bypass the limit regardless.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reported-by: David Rientjes <rientjes@google.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1f14c1ac
    • J
      mm: memcg: fix race condition between memcg teardown and swapin · 96f1c58d
      Johannes Weiner committed
      There is a race condition between a memcg being torn down and a swapin
      triggered from a different memcg of a page that was recorded to belong
      to the exiting memcg on swapout (with CONFIG_MEMCG_SWAP extension).  The
      result is unreclaimable pages pointing to dead memcgs, which can lead to
      anything from endless loops in later memcg teardown (the page is charged
      to all hierarchical parents but is not on any LRU list) or crashes from
      following the dangling memcg pointer.
      
      Memcgs with tasks in them can not be torn down and usually charges don't
      show up in memcgs without tasks.  Swapin with the CONFIG_MEMCG_SWAP
      extension is the notable exception because it charges the cgroup that
      was recorded as owner during swapout, which may be empty and in the
      process of being torn down when a task in another memcg triggers the
      swapin:
      
        teardown:                 swapin:
      
                                  lookup_swap_cgroup_id()
                                  rcu_read_lock()
                                  mem_cgroup_lookup()
                                  css_tryget()
                                  rcu_read_unlock()
        disable css_tryget()
        call_rcu()
          offline_css()
            reparent_charges()
                                  res_counter_charge() (hierarchical!)
                                  css_put()
                                    css_free()
                                  pc->mem_cgroup = dead memcg
                                  add page to dead lru
      
      Add a final reparenting step into css_free() to make sure any such raced
      charges are moved out of the memcg before it's finally freed.
      
      In the longer term it would be cleaner to have the css_tryget() and the
      res_counter charge under the same RCU lock section so that the charge
      reparenting is deferred until the last charge whose tryget succeeded is
      visible.  But this will require more invasive changes that will be
      harder to evaluate and backport into stable, so better defer them to a
      separate change set.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      96f1c58d
    • K
      thp: move preallocated PTE page table on move_huge_pmd() · 3592806c
      Kirill A. Shutemov committed
      Andrey Wagin reported a crash on VM_BUG_ON() in pgtable_pmd_page_dtor()
      with the following backtrace:
      
        free_pgd_range+0x2bf/0x410
        free_pgtables+0xce/0x120
        unmap_region+0xe0/0x120
        do_munmap+0x249/0x360
        move_vma+0x144/0x270
        SyS_mremap+0x3b9/0x510
        system_call_fastpath+0x16/0x1b
      
      The crash can be reproduced with this test case:
      
        #define _GNU_SOURCE
        #include <sys/mman.h>
        #include <stdio.h>
        #include <unistd.h>
      
        #define MB (1024 * 1024UL)
        #define GB (1024 * MB)
      
        int main(int argc, char **argv)
        {
      	char *p;
      	int i;
      
      	p = mmap((void *) GB, 10 * MB, PROT_READ | PROT_WRITE,
      			MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
      	for (i = 0; i < 10 * MB; i += 4096)
      		p[i] = 1;
      	mremap(p, 10 * MB, 10 * MB, MREMAP_FIXED | MREMAP_MAYMOVE, 2 * GB);
      	return 0;
        }
      
      Due to the split PMD lock, we now store preallocated PTE tables for THP
      pages per PMD table.  This means we need to move them to the other PMD
      table if a huge PMD is moved there.
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reported-by: Andrey Vagin <avagin@openvz.org>
      Tested-by: Andrey Vagin <avagin@openvz.org>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3592806c
    • J
      mm: memcg: do not declare OOM from __GFP_NOFAIL allocations · a0d8b00a
      Johannes Weiner committed
      Commit 84235de3 ("fs: buffer: move allocation failure loop into the
      allocator") started recognizing __GFP_NOFAIL in memory cgroups but
      forgot to disable the OOM killer.
      
      Any task that does not fail allocation will also not enter the OOM
      completion path.  So don't declare an OOM state in this case, or it'll be
      leaked and the task will be able to bypass the limit until the next
      userspace-triggered page fault cleans up the OOM state.
      Reported-by: William Dauchy <wdauchy@gmail.com>
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: <stable@vger.kernel.org>	[3.12.x]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a0d8b00a
  9. 02 Dec, 2013 1 commit
    • E
      security: shmem: implement kernel private shmem inodes · c7277090
      Eric Paris committed
      We have a problem where the big_key key storage implementation uses a
      shmem backed inode to hold the key contents.  Because of this detail of
      implementation LSM checks are being done between processes trying to
      read the keys and the tmpfs backed inode.  The LSM checks are already
      being handled on the key interface level and should not be enforced at
      the inode level (since the inode is an implementation detail, not a
      part of the security model).
      
      This patch implements a new function shmem_kernel_file_setup() which
      returns the equivalent to shmem_file_setup() only the underlying inode
      has S_PRIVATE set.  This means that all LSM checks for the inode in
      question are skipped.  It should only be used for kernel internal
      operations where the inode is not exposed to userspace without proper
      LSM checking.  It is possible that some other users of
      shmem_file_setup() should use the new interface, but this has not been
      explored.
      
      Reproducing this bug is a little bit difficult.  The steps I used on
      Fedora are:
      
       (1) Turn off selinux enforcing:
      
      	setenforce 0
      
       (2) Create a huge key
      
      	k=`dd if=/dev/zero bs=8192 count=1 | keyctl padd big_key test-key @s`
      
       (3) Access the key in another context:
      
      	runcon system_u:system_r:httpd_t:s0-s0:c0.c1023 keyctl print $k >/dev/null
      
       (4) Examine the audit logs:
      
      	ausearch -m AVC -i --subject httpd_t | audit2allow
      
      If the last command's output includes a line that looks like:
      
      	allow httpd_t user_tmpfs_t:file { open read };
      
      There was an inode check between httpd and the tmpfs filesystem.  With
      this patch no such denial will be seen.  (NOTE! you should clear your
      audit log if you have tested for this previously)
      
      (Please return your box to enforcing)
      Signed-off-by: Eric Paris <eparis@redhat.com>
      Signed-off-by: David Howells <dhowells@redhat.com>
      cc: Hugh Dickins <hughd@google.com>
      cc: linux-mm@kvack.org
      c7277090
  10. 22 Nov, 2013 3 commits
    • D
      mm, mempolicy: silence gcc warning · b7a9f420
      David Rientjes committed
      Fengguang Wu reports that compiling mm/mempolicy.c results in a warning:
      
        mm/mempolicy.c: In function 'mpol_to_str':
        mm/mempolicy.c:2878:2: error: format not a string literal and no format arguments
      
      Kees says this is because he is using -Wformat-security.
      
      Silence the warning.
      Signed-off-by: David Rientjes <rientjes@google.com>
      Reported-by: Fengguang Wu <fengguang.wu@intel.com>
      Suggested-by: Kees Cook <keescook@chromium.org>
      Acked-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b7a9f420
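The diagnostic quoted in b7a9f420 is what gcc's -Wformat-security emits when a non-literal string is passed as a printf-style format with no arguments. The commit message does not show the fix itself; a common way to silence this class of warning, sketched here with a hypothetical table and function name, is to pass the string as data through a literal "%s" format.

```c
#include <stdio.h>

/* Hypothetical name table for illustration only. */
static const char *const policy_names_sketch[] = {
	"default", "prefer", "bind", "interleave",
};

/* Write the name for mode into buf; returns the length, or -1 if unknown. */
static int mode_to_str_sketch(char *buf, size_t len, unsigned int mode)
{
	if (mode >= sizeof(policy_names_sketch) / sizeof(policy_names_sketch[0]))
		return -1;

	/* The format stays a literal; the table entry is only ever data. */
	return snprintf(buf, len, "%s", policy_names_sketch[mode]);
}
```

Writing `snprintf(buf, len, policy_names_sketch[mode])` would behave the same for these strings but is the form that triggers the warning, since the compiler cannot prove the string contains no conversion specifiers.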
    • A
      mm: hugetlbfs: fix hugetlbfs optimization · 27c73ae7
      Andrea Arcangeli committed
      Commit 7cb2ef56 ("mm: fix aio performance regression for database
      caused by THP") can cause dereference of a dangling pointer if
      split_huge_page runs during PageHuge() if there are updates to the
      tail_page->private field.
      
      Also it is repeating compound_head twice for hugetlbfs and it is running
      compound_head+compound_trans_head for THP when a single one is needed in
      both cases.
      
      The new code within the PageSlab() check doesn't need to verify that the
      THP page size is never bigger than the smallest hugetlbfs page size, to
      avoid memory corruption.
      
      A longstanding theoretical race condition was found while fixing the
      above (see the change right after the skip_unlock label, that is
      relevant for the compound_lock path too).
      
      By re-establishing the _mapcount tail refcounting for all compound
      pages, this also fixes the below problem:
      
        echo 0 >/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
      
        BUG: Bad page state in process bash  pfn:59a01
        page:ffffea000139b038 count:0 mapcount:10 mapping:          (null) index:0x0
        page flags: 0x1c00000000008000(tail)
        Modules linked in:
        CPU: 6 PID: 2018 Comm: bash Not tainted 3.12.0+ #25
        Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
        Call Trace:
          dump_stack+0x55/0x76
          bad_page+0xd5/0x130
          free_pages_prepare+0x213/0x280
          __free_pages+0x36/0x80
          update_and_free_page+0xc1/0xd0
          free_pool_huge_page+0xc2/0xe0
          set_max_huge_pages.part.58+0x14c/0x220
          nr_hugepages_store_common.isra.60+0xd0/0xf0
          nr_hugepages_store+0x13/0x20
          kobj_attr_store+0xf/0x20
          sysfs_write_file+0x189/0x1e0
          vfs_write+0xc5/0x1f0
          SyS_write+0x55/0xb0
          system_call_fastpath+0x16/0x1b
      Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Tested-by: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: Pravin Shelar <pshelar@nicira.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Ben Hutchings <bhutchings@solarflare.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      27c73ae7
    • D
      mm: thp: give transparent hugepage code a separate copy_page · 30b0a105
      Dave Hansen committed
      Right now, the migration code in migrate_page_copy() uses copy_huge_page()
      for hugetlbfs and thp pages:
      
             if (PageHuge(page) || PageTransHuge(page))
                      copy_huge_page(newpage, page);
      
      So, yay for code reuse.  But:
      
        void copy_huge_page(struct page *dst, struct page *src)
        {
              struct hstate *h = page_hstate(src);
      
      and a non-hugetlbfs page has no page_hstate().  This works 99% of the
      time because page_hstate() determines the hstate from the page order
      alone.  Since the page order of a THP page matches the default hugetlbfs
      page order, it works.
      
      But, if you change the default huge page size on the boot command-line
      (say default_hugepagesz=1G), then we might not even *have* a 2MB hstate
      so page_hstate() returns NULL and copy_huge_page() oopses pretty fast
      since copy_huge_page() dereferences the hstate:
      
        void copy_huge_page(struct page *dst, struct page *src)
        {
              struct hstate *h = page_hstate(src);
              if (unlikely(pages_per_huge_page(h) > MAX_ORDER_NR_PAGES)) {
        ...
      
      Mel noticed that the migration code is really the only user of these
      functions.  This moves all the copy code over to migrate.c and makes
      copy_huge_page() work for THP by checking for it explicitly.
      
      I believe the bug was introduced in commit b32967ff ("mm: numa: Add
      THP migration for the NUMA working set scanning fault case").
      
      [akpm@linux-foundation.org: fix coding-style and comment text, per Naoya Horiguchi]
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Tested-by: Dave Jiang <dave.jiang@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      30b0a105
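The explicit THP check that 30b0a105 describes can be sketched as choosing the sub-page count by page type, so that the hstate is only consulted for hugetlbfs pages. The types and the constant below are hypothetical userspace stand-ins; in particular, 512 sub-pages per THP is an assumption matching 2MB huge pages with 4KB base pages.

```c
#include <stdbool.h>
#include <stddef.h>

#define THP_NR_SUBPAGES_SKETCH 512UL /* assumed: 2MB THP over 4KB pages */

struct hstate_sketch {
	unsigned int order;
};

struct huge_src_page_sketch {
	bool hugetlb;                       /* stands in for PageHuge() */
	bool thp;                           /* stands in for PageTransHuge() */
	const struct hstate_sketch *hstate; /* NULL for anything but hugetlbfs */
};

/* Number of base pages to copy; never touches hstate for a THP. */
static unsigned long nr_subpages_sketch(const struct huge_src_page_sketch *src)
{
	if (src->hugetlb)
		return 1UL << src->hstate->order;
	if (src->thp)
		return THP_NR_SUBPAGES_SKETCH;
	return 1UL;
}
```

Because the THP branch never looks at `hstate`, a configuration with no matching hstate (the default_hugepagesz=1G case in the commit message) no longer leads to a NULL dereference in this model.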