1. 05 6月, 2014 25 次提交
  2. 24 5月, 2014 5 次提交
    • N
      mm/memory-failure.c: fix memory leak by race between poison and unpoison · 3e030ecc
      Naoya Horiguchi 提交于
      When a memory error happens on an in-use page or (free and in-use)
      hugepage, the victim page is isolated with its refcount set to one.
      
      When you try to unpoison it later, unpoison_memory() calls put_page()
      for it twice in order to bring the page back to free page pool (buddy or
      free hugepage list).  However, if another memory error occurs on the
      page which we are unpoisoning, memory_failure() returns without
      releasing the refcount which was incremented in the same call at first,
      which results in memory leak and unconsistent num_poisoned_pages
      statistics.  This patch fixes it.
      Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: <stable@vger.kernel.org>    [2.6.32+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3e030ecc
    • M
      memcg: fix swapcache charge from kernel thread context · 6f6acb00
      Michal Hocko 提交于
      Commit 284f39af ("mm: memcg: push !mm handling out to page cache
      charge function") explicitly checks for page cache charges without any
      mm context (from kernel thread context[1]).
      
      This seemed to be the only possible case where memory could be charged
      without mm context so commit 03583f1a ("memcg: remove unnecessary
      !mm check from try_get_mem_cgroup_from_mm()") removed the mm check from
      get_mem_cgroup_from_mm().  This however caused another NULL ptr
      dereference during early boot when loopback kernel thread splices to
      tmpfs as reported by Stephan Kulow:
      
        BUG: unable to handle kernel NULL pointer dereference at 0000000000000360
        IP: get_mem_cgroup_from_mm.isra.42+0x2b/0x60
        Oops: 0000 [#1] SMP
        Modules linked in: btrfs dm_multipath dm_mod scsi_dh multipath raid10 raid456 async_raid6_recov async_memcpy async_pq raid6_pq async_xor xor async_tx raid1 raid0 md_mod parport_pc parport nls_utf8 isofs usb_storage iscsi_ibft iscsi_boot_sysfs arc4 ecb fan thermal nfs lockd fscache nls_iso8859_1 nls_cp437 sg st hid_generic usbhid af_packet sunrpc sr_mod cdrom ata_generic uhci_hcd virtio_net virtio_blk ehci_hcd usbcore ata_piix floppy processor button usb_common virtio_pci virtio_ring virtio edd squashfs loop ppa]
        CPU: 0 PID: 97 Comm: loop1 Not tainted 3.15.0-rc5-5-default #1
        Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
        Call Trace:
          __mem_cgroup_try_charge_swapin+0x40/0xe0
          mem_cgroup_charge_file+0x8b/0xd0
          shmem_getpage_gfp+0x66b/0x7b0
          shmem_file_splice_read+0x18f/0x430
          splice_direct_to_actor+0xa2/0x1c0
          do_lo_receive+0x5a/0x60 [loop]
          loop_thread+0x298/0x720 [loop]
          kthread+0xc6/0xe0
          ret_from_fork+0x7c/0xb0
      
      Also Branimir Maksimovic reported the following oops which is tiggered
      for the swapcache charge path from the accounting code for kernel threads:
      
        CPU: 1 PID: 160 Comm: kworker/u8:5 Tainted: P           OE 3.15.0-rc5-core2-custom #159
        Hardware name: System manufacturer System Product Name/MAXIMUSV GENE, BIOS 1903 08/19/2013
        task: ffff880404e349b0 ti: ffff88040486a000 task.ti: ffff88040486a000
        RIP: get_mem_cgroup_from_mm.isra.42+0x2b/0x60
        Call Trace:
          __mem_cgroup_try_charge_swapin+0x45/0xf0
          mem_cgroup_charge_file+0x9c/0xe0
          shmem_getpage_gfp+0x62c/0x770
          shmem_write_begin+0x38/0x40
          generic_perform_write+0xc5/0x1c0
          __generic_file_aio_write+0x1d1/0x3f0
          generic_file_aio_write+0x4f/0xc0
          do_sync_write+0x5a/0x90
          do_acct_process+0x4b1/0x550
          acct_process+0x6d/0xa0
          do_exit+0x827/0xa70
          kthread+0xc3/0xf0
      
      This patch fixes the issue by reintroducing mm check into
      get_mem_cgroup_from_mm.  We could do the same trick in
      __mem_cgroup_try_charge_swapin as we do for the regular page cache path
      but it is not worth troubles.  The check is not that expensive and it is
      better to have get_mem_cgroup_from_mm more robust.
      
      [1] - http://marc.info/?l=linux-mm&m=139463617808941&w=2
      
      Fixes: 03583f1a ("memcg: remove unnecessary !mm check from try_get_mem_cgroup_from_mm()")
      Reported-and-tested-by: NStephan Kulow <coolo@suse.com>
      Reported-by: NBranimir Maksimovic <branimir.maksimovic@gmail.com>
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6f6acb00
    • J
      mm: madvise: fix MADV_WILLNEED on shmem swapouts · 55231e5c
      Johannes Weiner 提交于
      MADV_WILLNEED currently does not read swapped out shmem pages back in.
      
      Commit 0cd6144a ("mm + fs: prepare for non-page entries in page
      cache radix trees") made find_get_page() filter exceptional radix tree
      entries but failed to convert all find_get_page() callers that WANT
      exceptional entries over to find_get_entry().  One of them is shmem swap
      readahead in madvise, which now skips over any swap-out records.
      
      Convert it to find_get_entry().
      
      Fixes: 0cd6144a ("mm + fs: prepare for non-page entries in page cache radix trees")
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reported-by: NHugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      55231e5c
    • J
      mm/filemap.c: avoid always dirtying mapping->flags on O_DIRECT · 7fcbbaf1
      Jens Axboe 提交于
      In some testing I ran today (some fio jobs that spread over two nodes),
      we end up spending 40% of the time in filemap_check_errors().  That
      smells fishy.  Looking further, this is basically what happens:
      
      blkdev_aio_read()
          generic_file_aio_read()
              filemap_write_and_wait_range()
                  if (!mapping->nr_pages)
                      filemap_check_errors()
      
      and filemap_check_errors() always attempts two test_and_clear_bit() on
      the mapping flags, thus dirtying it for every single invocation.  The
      patch below tests each of these bits before clearing them, avoiding this
      issue.  In my test case (4-socket box), performance went from 1.7M IOPS
      to 4.0M IOPS.
      Signed-off-by: NJens Axboe <axboe@fb.com>
      Acked-by: NJeff Moyer <jmoyer@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7fcbbaf1
    • C
      hwpoison, hugetlb: lock_page/unlock_page does not match for handling a free hugepage · b985194c
      Chen Yucong 提交于
      For handling a free hugepage in memory failure, the race will happen if
      another thread hwpoisoned this hugepage concurrently.  So we need to
      check PageHWPoison instead of !PageHWPoison.
      
      If hwpoison_filter(p) returns true or a race happens, then we need to
      unlock_page(hpage).
      Signed-off-by: NChen Yucong <slaoub@gmail.com>
      Reviewed-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Tested-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reviewed-by: NAndi Kleen <ak@linux.intel.com>
      Cc: <stable@vger.kernel.org>	[2.6.36+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b985194c
  3. 20 5月, 2014 3 次提交
  4. 15 5月, 2014 1 次提交
    • H
      parisc,metag: Do not hardcode maximum userspace stack size · 042d27ac
      Helge Deller 提交于
      This patch affects only architectures where the stack grows upwards
      (currently parisc and metag only). On those do not hardcode the maximum
      initial stack size to 1GB for 32-bit processes, but make it configurable
      via a config option.
      
      The main problem with the hardcoded stack size is, that we have two
      memory regions which grow upwards: stack and heap. To keep most of the
      memory available for heap in a flexmap memory layout, it makes no sense
      to hard allocate up to 1GB of the memory for stack which can't be used
      as heap then.
      
      This patch makes the stack size for 32-bit processes configurable and
      uses 80MB as default value which has been in use during the last few
      years on parisc and which hasn't showed any problems yet.
      Signed-off-by: NHelge Deller <deller@gmx.de>
      Signed-off-by: NJames Hogan <james.hogan@imgtec.com>
      Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
      Cc: linux-parisc@vger.kernel.org
      Cc: linux-metag@vger.kernel.org
      Cc: John David Anglin <dave.anglin@bell.net>
      042d27ac
  5. 11 5月, 2014 2 次提交
  6. 07 5月, 2014 4 次提交
    • R
      mm/numa: Remove BUG_ON() in __handle_mm_fault() · 107437fe
      Rik van Riel 提交于
      Changing PTEs and PMDs to pte_numa & pmd_numa is done with the
      mmap_sem held for reading, which means a pmd can be instantiated
      and turned into a numa one while __handle_mm_fault() is examining
      the value of old_pmd.
      
      If that happens, __handle_mm_fault() should just return and let
      the page fault retry, instead of throwing an oops. This is
      handled by the test for pmd_trans_huge(*pmd) below.
      Signed-off-by: NRik van Riel <riel@redhat.com>
      Reviewed-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reported-by: NSunil Pandey <sunil.k.pandey@intel.com>
      Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: linux-mm@kvack.org
      Cc: lwoodman@redhat.com
      Cc: dave.hansen@intel.com
      Link: http://lkml.kernel.org/r/20140429153615.2d72098e@annuminas.surriel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      107437fe
    • C
      slub: use sysfs'es release mechanism for kmem_cache · 41a21285
      Christoph Lameter 提交于
      debugobjects warning during netfilter exit:
      
          ------------[ cut here ]------------
          WARNING: CPU: 6 PID: 4178 at lib/debugobjects.c:260 debug_print_object+0x8d/0xb0()
          ODEBUG: free active (active state 0) object type: timer_list hint: delayed_work_timer_fn+0x0/0x20
          Modules linked in:
          CPU: 6 PID: 4178 Comm: kworker/u16:2 Tainted: G        W 3.11.0-next-20130906-sasha #3984
          Workqueue: netns cleanup_net
          Call Trace:
            dump_stack+0x52/0x87
            warn_slowpath_common+0x8c/0xc0
            warn_slowpath_fmt+0x46/0x50
            debug_print_object+0x8d/0xb0
            __debug_check_no_obj_freed+0xa5/0x220
            debug_check_no_obj_freed+0x15/0x20
            kmem_cache_free+0x197/0x340
            kmem_cache_destroy+0x86/0xe0
            nf_conntrack_cleanup_net_list+0x131/0x170
            nf_conntrack_pernet_exit+0x5d/0x70
            ops_exit_list+0x5e/0x70
            cleanup_net+0xfb/0x1c0
            process_one_work+0x338/0x550
            worker_thread+0x215/0x350
            kthread+0xe7/0xf0
            ret_from_fork+0x7c/0xb0
      
      Also during dcookie cleanup:
      
          WARNING: CPU: 12 PID: 9725 at lib/debugobjects.c:260 debug_print_object+0x8c/0xb0()
          ODEBUG: free active (active state 0) object type: timer_list hint: delayed_work_timer_fn+0x0/0x20
          Modules linked in:
          CPU: 12 PID: 9725 Comm: trinity-c141 Not tainted 3.15.0-rc2-next-20140423-sasha-00018-gc4ff6c4 #408
          Call Trace:
            dump_stack (lib/dump_stack.c:52)
            warn_slowpath_common (kernel/panic.c:430)
            warn_slowpath_fmt (kernel/panic.c:445)
            debug_print_object (lib/debugobjects.c:262)
            __debug_check_no_obj_freed (lib/debugobjects.c:697)
            debug_check_no_obj_freed (lib/debugobjects.c:726)
            kmem_cache_free (mm/slub.c:2689 mm/slub.c:2717)
            kmem_cache_destroy (mm/slab_common.c:363)
            dcookie_unregister (fs/dcookies.c:302 fs/dcookies.c:343)
            event_buffer_release (arch/x86/oprofile/../../../drivers/oprofile/event_buffer.c:153)
            __fput (fs/file_table.c:217)
            ____fput (fs/file_table.c:253)
            task_work_run (kernel/task_work.c:125 (discriminator 1))
            do_notify_resume (include/linux/tracehook.h:196 arch/x86/kernel/signal.c:751)
            int_signal (arch/x86/kernel/entry_64.S:807)
      
      Sysfs has a release mechanism.  Use that to release the kmem_cache
      structure if CONFIG_SYSFS is enabled.
      
      Only slub is changed - slab currently only supports /proc/slabinfo and
      not /sys/kernel/slab/*.  We talked about adding that and someone was
      working on it.
      
      [akpm@linux-foundation.org: fix CONFIG_SYSFS=n build]
      [akpm@linux-foundation.org: fix CONFIG_SYSFS=n build even more]
      Signed-off-by: NChristoph Lameter <cl@linux.com>
      Reported-by: NSasha Levin <sasha.levin@oracle.com>
      Tested-by: NSasha Levin <sasha.levin@oracle.com>
      Acked-by: NGreg KH <greg@kroah.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Bart Van Assche <bvanassche@acm.org>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      41a21285
    • J
      revert "mm: vmscan: do not swap anon pages just because free+file is low" · 62376251
      Johannes Weiner 提交于
      This reverts commit 0bf1457f ("mm: vmscan: do not swap anon pages
      just because free+file is low") because it introduced a regression in
      mostly-anonymous workloads, where reclaim would become ineffective and
      trap every allocating task in direct reclaim.
      
      The problem is that there is a runaway feedback loop in the scan balance
      between file and anon, where the balance tips heavily towards a tiny
      thrashing file LRU and anonymous pages are no longer being looked at.
      The commit in question removed the safe guard that would detect such
      situations and respond with forced anonymous reclaim.
      
      This commit was part of a series to fix premature swapping in loads with
      relatively little cache, and while it made a small difference, the cure
      is obviously worse than the disease.  Revert it.
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reported-by: NChristian Borntraeger <borntraeger@de.ibm.com>
      Acked-by: NChristian Borntraeger <borntraeger@de.ibm.com>
      Acked-by: NRafael Aquini <aquini@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: <stable@kernel.org>		[3.12+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      62376251
    • J
      mm: filemap: update find_get_pages_tag() to deal with shadow entries · 139b6a6f
      Johannes Weiner 提交于
      Dave Jones reports the following crash when find_get_pages_tag() runs
      into an exceptional entry:
      
        kernel BUG at mm/filemap.c:1347!
        RIP: find_get_pages_tag+0x1cb/0x220
        Call Trace:
          find_get_pages_tag+0x36/0x220
          pagevec_lookup_tag+0x21/0x30
          filemap_fdatawait_range+0xbe/0x1e0
          filemap_fdatawait+0x27/0x30
          sync_inodes_sb+0x204/0x2a0
          sync_inodes_one_sb+0x19/0x20
          iterate_supers+0xb2/0x110
          sys_sync+0x44/0xb0
          ia32_do_call+0x13/0x13
      
        1343                         /*
        1344                          * This function is never used on a shmem/tmpfs
        1345                          * mapping, so a swap entry won't be found here.
        1346                          */
        1347                         BUG();
      
      After commit 0cd6144a ("mm + fs: prepare for non-page entries in
      page cache radix trees") this comment and BUG() are out of date because
      exceptional entries can now appear in all mappings - as shadows of
      recently evicted pages.
      
      However, as Hugh Dickins notes,
      
        "it is truly surprising for a PAGECACHE_TAG_WRITEBACK (and probably
         any other PAGECACHE_TAG_*) to appear on an exceptional entry.
      
         I expect it comes down to an occasional race in RCU lookup of the
         radix_tree: lacking absolute synchronization, we might sometimes
         catch an exceptional entry, with the tag which really belongs with
         the unexceptional entry which was there an instant before."
      
      And indeed, not only is the tree walk lockless, the tags are also read
      in chunks, one radix tree node at a time.  There is plenty of time for
      page reclaim to swoop in and replace a page that was already looked up
      as tagged with a shadow entry.
      
      Remove the BUG() and update the comment.  While reviewing all other
      lookup sites for whether they properly deal with shadow entries of
      evicted pages, update all the comments and fix memcg file charge moving
      to not miss shmem/tmpfs swapcache pages.
      
      Fixes: 0cd6144a ("mm + fs: prepare for non-page entries in page cache radix trees")
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reported-by: NDave Jones <davej@redhat.com>
      Acked-by: NHugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      139b6a6f