1. 04 Jul, 2013 16 commits
  2. 03 Jul, 2013 1 commit
    • vfs: export lseek_execute() to modules · 46a1c2c7
      Jie Liu committed
      File systems that support SEEK_DATA/SEEK_HOLE (btrfs, ext4, ocfs2,
      tmpfs) each duplicate the logic of lseek_execute(): they update the
      current file offset to the desired offset if it is valid. ceph does
      a similar thing in ceph_llseek().
      
      To reduce the duplication, this patch makes lseek_execute()
      publicly accessible so that it can be called directly from the
      underlying file systems.
      
      Thanks to Dave Chinner for this suggestion.
      
      [AV: call it vfs_setpos(), don't bring the removed 'inode' argument back]
      
      v1->v2:
      - Add kernel-doc comments for lseek_execute()
      - Call lseek_execute() in ceph->llseek()
      Signed-off-by: Jie Liu <jeff.liu@oracle.com>
      Cc: Dave Chinner <dchinner@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Chris Mason <chris.mason@fusionio.com>
      Cc: Josef Bacik <jbacik@fusionio.com>
      Cc: Ben Myers <bpm@sgi.com>
      Cc: Ted Tso <tytso@mit.edu>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Sage Weil <sage@inktank.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
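
The offset check this commit describes can be modelled in userspace. The following is a minimal sketch, not the kernel code: `demo_file` and `demo_setpos()` are hypothetical stand-ins for `struct file` and the exported helper, the `maxsize` parameter is an assumption, and files that permit negative offsets are ignored.

```c
#include <errno.h>

/* Hypothetical stand-in for the few file fields such a helper would touch. */
struct demo_file {
    long long pos;          /* current file offset */
    unsigned long version;  /* reset whenever the offset changes */
};

/*
 * Validate the requested offset and, if it is acceptable, make it the
 * current position. Returns the new offset or -EINVAL.
 */
long long demo_setpos(struct demo_file *f, long long offset, long long maxsize)
{
    if (offset < 0 || offset > maxsize)
        return -EINVAL;

    if (offset != f->pos) {
        f->pos = offset;
        f->version = 0;
    }
    return offset;
}
```

A file system's SEEK_DATA/SEEK_HOLE handler would compute the candidate offset itself and then hand it to one shared helper of this shape instead of repeating the validation.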
  3. 29 Jun, 2013 1 commit
  4. 26 Jun, 2013 1 commit
    • futex: Take hugepages into account when generating futex_key · 13d60f4b
      Zhang Yi committed
      The futex_keys of process shared futexes are generated from the page
      offset, the mapping host and the mapping index of the futex user space
      address. This should result in a unique identifier for each futex.
      
      However, this is not true when futexes are located in different
      subpages of a hugepage. The reason is that the mapping index for all
      those futexes evaluates to the index of the base page of the hugetlbfs
      mapping. So a futex at offset 0 of the hugepage mapping and another
      one at offset PAGE_SIZE of the same hugepage mapping have identical
      futex_keys. This happens because the futex code blindly uses
      page->index.
      
      Steps to reproduce the bug:
      
      1. Map a file from hugetlbfs. Initialize pthread_mutex1 at offset 0
         and pthread_mutex2 at offset PAGE_SIZE of the hugetlbfs
         mapping.
      
         The mutexes must be initialized as PTHREAD_PROCESS_SHARED because
         PTHREAD_PROCESS_PRIVATE mutexes are not affected by this issue as
         their keys solely depend on the user space address.
      
      2. Lock mutex1 and mutex2
      
      3. Create thread1 and in the thread function lock mutex1, which
         results in thread1 blocking on the locked mutex1.
      
      4. Create thread2 and in the thread function lock mutex2, which
         results in thread2 blocking on the locked mutex2.
      
      5. Unlock mutex2. Despite the fact that mutex2 got unlocked, thread2
         still blocks on mutex2 because the futex_key points to mutex1.
      
      To solve this issue we need to take the normal page index of the page
      which contains the futex into account, if the futex is in a hugetlbfs
      mapping. In other words, we calculate the normal page mapping index of
      the subpage in the hugetlbfs mapping.
      
      Mappings which are not based on hugetlbfs are not affected and still
      use page->index.
      
      Thanks to Mel Gorman who provided a patch for adding proper evaluation
      functions to the hugetlbfs code to avoid exposing hugetlbfs specific
      details to the futex code.
      
      [ tglx: Massaged changelog ]
      Signed-off-by: Zhang Yi <zhang.yi20@zte.com.cn>
      Reviewed-by: Jiang Biao <jiang.biao2@zte.com.cn>
      Tested-by: Ma Chenggong <ma.chenggong@zte.com.cn>
      Reviewed-by: 'Mel Gorman' <mgorman@suse.de>
      Acked-by: 'Darren Hart' <dvhart@linux.intel.com>
      Cc: 'Peter Zijlstra' <peterz@infradead.org>
      Cc: stable@vger.kernel.org
      Link: http://lkml.kernel.org/r/000101ce71a6%24a83c5880%24f8b50980%24@com
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
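
The "normal page mapping index of the subpage" can be illustrated with a little arithmetic. This is a sketch under stated assumptions, not the helper added by the patch: it assumes the hugetlbfs index counts huge pages, that a huge page spans `1 << huge_order` base pages, and `demo_basepage_index()` is a hypothetical name.

```c
/*
 * Convert a huge-page-granular mapping index into a base-page-granular
 * one for a given subpage.
 *
 * huge_index: index of the huge page within the mapping
 * huge_order: log2 of the number of base pages per huge page
 * subpage:    position of the base page inside the huge page
 */
unsigned long demo_basepage_index(unsigned long huge_index,
                                  unsigned int huge_order,
                                  unsigned long subpage)
{
    return (huge_index << huge_order) + subpage;
}
```

With order 9 (for example 2 MiB huge pages made of 4 KiB base pages), two futexes in neighbouring subpages of the same huge page now get different indexes, so their keys no longer collide.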
  5. 14 Jun, 2013 2 commits
  6. 13 Jun, 2013 7 commits
    • slab: prevent warnings when allocating with __GFP_NOWARN · 907985f4
      Sasha Levin committed
      Sasha Levin noticed that the warning introduced by commit 6286ae97
      ("slab: Return NULL for oversized allocations") is being triggered:
      
        WARNING: CPU: 15 PID: 21519 at mm/slab_common.c:376 kmalloc_slab+0x2f/0xb0()
        can: request_module (can-proto-4) failed.
        mpoa: proc_mpc_write: could not parse ''
        Modules linked in:
        CPU: 15 PID: 21519 Comm: trinity-child15 Tainted: G W    3.10.0-rc4-next-20130607-sasha-00011-gcd78395-dirty #2
         0000000000000009 ffff880020a95e30 ffffffff83ff4041 0000000000000000
         ffff880020a95e68 ffffffff8111fe12 fffffffffffffff0 00000000000082d0
         0000000000080000 0000000000080000 0000000001400000 ffff880020a95e78
        Call Trace:
         [<ffffffff83ff4041>] dump_stack+0x4e/0x82
         [<ffffffff8111fe12>] warn_slowpath_common+0x82/0xb0
         [<ffffffff8111fe55>] warn_slowpath_null+0x15/0x20
         [<ffffffff81243dcf>] kmalloc_slab+0x2f/0xb0
         [<ffffffff81278d54>] __kmalloc+0x24/0x4b0
         [<ffffffff8196ffe3>] ? security_capable+0x13/0x20
         [<ffffffff812a26b7>] ? pipe_fcntl+0x107/0x210
         [<ffffffff812a26b7>] pipe_fcntl+0x107/0x210
         [<ffffffff812b7ea0>] ? fget_raw_light+0x130/0x3f0
         [<ffffffff812aa5fb>] SyS_fcntl+0x60b/0x6a0
         [<ffffffff8403ca98>] tracesys+0xe1/0xe6
      
      Andrew Morton writes:
      
        __GFP_NOWARN is frequently used by kernel code to probe for "how big
        an allocation can I get".  That's a bit lame, but it's used on slow
        paths and is pretty simple.
      
      However, SLAB will still emit a warning for an oversized allocation
      when the __GFP_NOWARN flag is _not_ set, in order to expose kernel bugs.
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      [ penberg@kernel.org: improve changelog ]
      Signed-off-by: Pekka Enberg <penberg@kernel.org>
    • mm: memcontrol: fix lockless reclaim hierarchy iterator · 89dc991f
      Johannes Weiner committed
      The lockless reclaim hierarchy iterator currently has a misplaced
      barrier that can lead to use-after-free crashes.
      
      The reclaim hierarchy iterator consists of a sequence count and a
      position pointer that are read and written locklessly, with memory
      barriers enforcing ordering.
      
      The write side sets the position pointer first, then updates the
      sequence count to "publish" the new position.  Likewise, the read side
      must read the sequence count first, then the position.  If the sequence
      count is up to date, it's guaranteed that the position is up to date as
      well:
      
        writer:                         reader:
        iter->position = position       if iter->sequence == expected:
        smp_wmb()                           smp_rmb()
        iter->sequence = sequence           position = iter->position
      
      However, the read side barrier is currently misplaced, which can lead to
      dereferencing stale position pointers that no longer point to valid
      memory.  Fix this.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reported-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Tejun Heo <tj@kernel.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Glauber Costa <glommer@parallels.com>
      Cc: <stable@kernel.org>		[3.10+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
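
The writer/reader ordering from the diagram in this message can be approximated in portable C11. This is only a sketch: `demo_iter`, `demo_publish()` and `demo_read()` are hypothetical names, and C11 release/acquire fences are used as an analogue of `smp_wmb()`/`smp_rmb()`, which are not the same primitives.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical model of the iterator: a position guarded by a sequence count. */
struct demo_iter {
    void *_Atomic position;
    atomic_uint sequence;
};

/* Writer: store the position first, then publish it via the sequence count. */
void demo_publish(struct demo_iter *it, void *pos, unsigned int seq)
{
    atomic_store_explicit(&it->position, pos, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);  /* analogue of smp_wmb() */
    atomic_store_explicit(&it->sequence, seq, memory_order_relaxed);
}

/*
 * Reader: check the sequence count first and only then read the position.
 * The fence sits between the two loads, which is the placement the fix
 * restores. Returns NULL when the sequence does not match.
 */
void *demo_read(struct demo_iter *it, unsigned int expected)
{
    if (atomic_load_explicit(&it->sequence, memory_order_relaxed) != expected)
        return NULL;
    atomic_thread_fence(memory_order_acquire);  /* analogue of smp_rmb() */
    return atomic_load_explicit(&it->position, memory_order_relaxed);
}
```

If the reader's fence came before the sequence load instead, nothing would stop the position load from being satisfied earlier than the sequence check, which is the kind of stale read the message warns about.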
    • frontswap: fix incorrect zeroing and allocation size for frontswap_map · 7b57976d
      Akinobu Mita committed
      The bitmap accessed by bitops must have enough size to hold the required
      numbers of bits rounded up to a multiple of BITS_PER_LONG.  And the
      bitmap must not be zeroed by memset() if the number of bits cleared is
      not a multiple of BITS_PER_LONG.
      
      This fixes incorrect zeroing and allocation size for frontswap_map.  The
      incorrect zeroing part doesn't cause any problem because frontswap_map
      is freed just after zeroing.  But the wrongly calculated allocation size
      may cause problems.
      
      For 32bit systems, the allocation size of frontswap_map is about twice
      as large as the required size.  For 64bit systems, the allocation size is
      smaller than required if the number of bits is not a multiple of
      BITS_PER_LONG.
      Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
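
The sizing rule stated at the top of this message (round the bit count up to a multiple of BITS_PER_LONG) works out as follows. This is a userspace sketch: `DEMO_BITS_PER_LONG` and `demo_bitmap_bytes()` are hypothetical names, not kernel symbols.

```c
#include <limits.h>
#include <stddef.h>

/* Number of bits in one unsigned long on the platform being compiled for. */
#define DEMO_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/*
 * Bytes needed for a bitmap of nbits bits when it is accessed one
 * unsigned long at a time: round the bit count up to whole longs.
 */
size_t demo_bitmap_bytes(size_t nbits)
{
    size_t longs = (nbits + DEMO_BITS_PER_LONG - 1) / DEMO_BITS_PER_LONG;

    return longs * sizeof(unsigned long);
}
```

One bit past a long boundary costs a whole extra long, which is exactly the case an unrounded byte count gets wrong.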
    • mm: migration: add migration_entry_wait_huge() · 30dad309
      Naoya Horiguchi committed
      When a page fault occurs on an address that is backed by a hugepage
      under migration, the kernel cannot wait correctly and busy-loops on
      the hugepage fault until the migration finishes.  As a result, users
      who try to kick hugepage migration (via soft offlining, for example)
      occasionally experience long delays or soft lockups.
      
      This is because pte_offset_map_lock() can't get a correct migration entry
      or a correct page table lock for a hugepage.  This patch introduces
      migration_entry_wait_huge() to solve this.
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: <stable@vger.kernel.org>	[2.6.35+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/page_alloc.c: fix watermark check in __zone_watermark_ok() · 026b0814
      Tomasz Stanislawski committed
      The watermark check consists of two sub-checks.  The first one is:
      
      	if (free_pages <= min + lowmem_reserve)
      		return false;
      
      The check ensures that there is a minimal amount of RAM in the zone.  If
      CMA is used, free_pages is reduced by the number of free pages
      in CMA prior to the above-mentioned check.
      
      	if (!(alloc_flags & ALLOC_CMA))
      		free_pages -= zone_page_state(z, NR_FREE_CMA_PAGES);
      
      This prevents the zone from being drained of pages available for
      non-movable allocations.
      
      The second check prevents the zone from getting too fragmented.
      
      	for (o = 0; o < order; o++) {
      		free_pages -= z->free_area[o].nr_free << o;
      		min >>= 1;
      		if (free_pages <= min)
      			return false;
      	}
      
      The field z->free_area[o].nr_free is equal to the number of free pages
      including free CMA pages.  Therefore the CMA pages are subtracted twice.
      This may cause a false positive fail of __zone_watermark_ok() if the CMA
      area gets strongly fragmented.  In such a case there are many 0-order
      free pages located in CMA.  Those pages are subtracted twice therefore
      they will quickly drain free_pages during the check against
      fragmentation.  The test fails even though there are many free non-cma
      pages in the zone.
      
      This patch fixes this issue by subtracting CMA pages only for the
      purpose of the (free_pages <= min + lowmem_reserve) check.
      
      Laura said:
      
        We were observing allocation failures of higher order pages (order 5 =
        128K typically) under tight memory conditions resulting in driver
        failure.  The output from the page allocation failure showed plenty of
        free pages of the appropriate order/type/zone and mostly CMA pages in
        the lower orders.
      
        For full disclosure, we still observed some page allocation failures
        even after applying the patch but the number was drastically reduced and
        those failures were attributed to fragmentation/other system issues.
      Signed-off-by: Tomasz Stanislawski <t.stanislaws@samsung.com>
      Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
      Tested-by: Laura Abbott <lauraa@codeaurora.org>
      Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: <stable@vger.kernel.org>	[3.7+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
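
The two sub-checks and the fix described in this message can be put together in a small userspace model. This is a sketch, not the kernel function: `demo_zone`, `DEMO_MAX_ORDER` and `demo_watermark_ok()` are hypothetical, and the zone is reduced to per-order free block counts plus a free CMA page count.

```c
#include <stdbool.h>

#define DEMO_MAX_ORDER 11

/* Hypothetical, heavily reduced model of a memory zone. */
struct demo_zone {
    long nr_free[DEMO_MAX_ORDER]; /* free blocks per order, CMA blocks included */
    long free_cma_pages;          /* free pages that live in CMA areas */
};

bool demo_watermark_ok(const struct demo_zone *z, int order, long min,
                       long lowmem_reserve, long free_pages, bool alloc_cma)
{
    long free_cma = 0;

    /* CMA pages are discounted for the first check only. */
    if (!alloc_cma)
        free_cma = z->free_cma_pages;
    if (free_pages - free_cma <= min + lowmem_reserve)
        return false;

    /* The fragmentation check walks the orders with the undiscounted count. */
    for (int o = 0; o < order; o++) {
        free_pages -= z->nr_free[o] << o;
        min >>= 1;
        if (free_pages <= min)
            return false;
    }
    return true;
}
```

Take a zone with 600 free order-0 pages that are all CMA and 50 free order-3 blocks (1000 free pages in total), and ask for order 3 with min = 100. Subtracting the CMA pages from free_pages itself would leave 400 before the loop and -200 after the order-0 step, so the check would fail. In the sketch the loop starts from 1000 and the check passes.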
    • swap: avoid read_swap_cache_async() race to deadlock while waiting on discard I/O completion · cbab0e4e
      Rafael Aquini committed
      read_swap_cache_async() can race against get_swap_page(), and stumble
      across a SWAP_HAS_CACHE entry in the swap map whose page wasn't brought
      into the swapcache yet.
      
      This swap_map state is expected to be transitory, but the actual
      placement of discard at scan_swap_map() inserts a wait for I/O
      completion, which makes the thread at read_swap_cache_async() loop
      around its -EEXIST case while the other end at get_swap_page() is
      scheduled away at scan_swap_map().  This can leave the system deadlocked
      if the I/O completion happens to be waiting on the CPU waitqueue where
      read_swap_cache_async() is busy looping and !CONFIG_PREEMPT.
      
      This patch introduces a cond_resched() call so that the aforementioned
      read_swap_cache_async() busy loop yields the CPU when necessary,
      thus avoiding the subtle race window.
      Signed-off-by: Rafael Aquini <aquini@redhat.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: don't initialize kmem-cache destroying work for root caches · f101a946
      Andrey Vagin committed
      struct memcg_cache_params has a union.  Different parts of this union
      are used for root and non-root caches.  A part with destroying work is
      used only for non-root caches.
      
        BUG: unable to handle kernel paging request at 0000000fffffffe0
        IP: kmem_cache_alloc+0x41/0x1f0
        Modules linked in: netlink_diag af_packet_diag udp_diag tcp_diag inet_diag unix_diag ip6table_filter ip6_tables i2c_piix4 virtio_net virtio_balloon microcode i2c_core pcspkr floppy
        CPU: 0 PID: 1929 Comm: lt-vzctl Tainted: G      D      3.10.0-rc1+ #2
        Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
        RIP: kmem_cache_alloc+0x41/0x1f0
        Call Trace:
         getname_flags.part.34+0x30/0x140
         getname+0x38/0x60
         do_sys_open+0xc5/0x1e0
         SyS_open+0x22/0x30
         system_call_fastpath+0x16/0x1b
        Code: f4 53 48 83 ec 18 8b 05 8e 53 b7 00 4c 8b 4d 08 21 f0 a8 10 74 0d 4c 89 4d c0 e8 1b 76 4a 00 4c 8b 4d c0 e9 92 00 00 00 4d 89 f5 <4d> 8b 45 00 65 4c 03 04 25 48 cd 00 00 49 8b 50 08 4d 8b 38 49
        RIP  [<ffffffff8116b641>] kmem_cache_alloc+0x41/0x1f0
      Signed-off-by: Andrey Vagin <avagin@openvz.org>
      Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Glauber Costa <glommer@parallels.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: <stable@vger.kernel.org>	[3.9.x]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  7. 06 Jun, 2013 1 commit
    • arch, mm: Remove tlb_fast_mode() · 29eb7782
      Peter Zijlstra committed
      Since the introduction of preemptible mmu_gather, TLB fast mode has been
      broken. TLB fast mode relies on there being absolutely no concurrency;
      it frees pages first and invalidates TLBs later.
      
      However, now we can get concurrency and stuff goes *bang*.
      
      This patch removes all tlb_fast_mode() code; this was found to be the
      better option vs trying to patch the hole by entangling tlb invalidation
      with the scheduler.
      
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Tony Luck <tony.luck@intel.com>
      Reported-by: Max Filippov <jcmvbkbc@gmail.com>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  8. 04 Jun, 2013 1 commit
  9. 28 May, 2013 3 commits
    • mm, sched: Allow uaccess in atomic with pagefault_disable() · 662bbcb2
      Michael S. Tsirkin committed
      This changes might_fault() so that it does not
      trigger a false positive diagnostic for e.g. the following
      sequence:
      
      	spin_lock_irqsave()
      	pagefault_disable()
      	copy_to_user()
      	pagefault_enable()
      	spin_unlock_irqrestore()
      
      In particular vhost wants to do this, to call
      socket ops from under a lock.
      
      There are 3 cases to consider:
      
       - CONFIG_PROVE_LOCKING - might_fault is non-inline
         so it's easy to move the in_atomic test to fix
         up the false positive warning.
      
       - CONFIG_DEBUG_ATOMIC_SLEEP - might_fault
         is currently inline, but we are calling a
         non-inline __might_sleep anyway,
         so let's use the non-inline version of might_fault
         that does the right thing.
      
       - !CONFIG_DEBUG_ATOMIC_SLEEP && !CONFIG_PROVE_LOCKING
         __might_sleep is a nop so might_fault is a nop.
      
      Make this explicit.
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/1369577426-26721-11-git-send-email-mst@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • mm, sched: Drop voluntary schedule from might_fault() · 114276ac
      Michael S. Tsirkin committed
      might_fault() is called from functions like copy_to_user()
      which most callers expect to be very fast, like a couple of
      instructions.
      
      So functions like memcpy_toiovec() call them many times in a loop.
      
      But might_fault() calls might_sleep() and with CONFIG_PREEMPT_VOLUNTARY
      this results in a function call.
      
      Let's not do this - just call __might_sleep() that produces
      a diagnostic for sleep within atomic, but drop
      might_preempt().
      
      Here's a test sending traffic between the VM and the host,
      host is built with CONFIG_PREEMPT_VOLUNTARY:
      
       before:
      	incoming: 7122.77   Mb/s
      	outgoing: 8480.37   Mb/s
      
       after:
      	incoming: 8619.24   Mb/s
      	outgoing: 9455.42   Mb/s
      
      As a side effect, this fixes an issue pointed
      out by Ingo: might_fault might schedule differently
      depending on PROVE_LOCKING. Now there's no
      preemption point in either case, so it's consistent.
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/1369577426-26721-10-git-send-email-mst@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • mm: teach truncate_inode_pages_range() to handle non page aligned ranges · 5a720394
      Lukas Czerner committed
      This commit changes truncate_inode_pages_range() so it can handle non
      page aligned regions of the truncate. Currently we can hit BUG_ON when
      the end of the range is not page aligned, but we can handle unaligned
      start of the range.
      
      Being able to handle non page aligned regions of the page can help file
      system punch_hole implementations and save some work, because once we're
      holding the page we might as well deal with it right away.
      
      In previous commits we've changed the ->invalidatepage() prototype to
      accept a 'length' argument so that a range to invalidate can be
      specified. Now we can use that new ability in
      truncate_inode_pages_range().
      Signed-off-by: Lukas Czerner <lczerner@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
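
Handling a range that is unaligned at both ends comes down to splitting it into a partial first page, a run of whole pages and a partial last page. The arithmetic can be sketched in userspace as below. This is an illustration, not the kernel function: it assumes a 4096-byte page and an inclusive byte range, and `demo_trunc_plan` and `demo_plan_truncate()` are hypothetical names.

```c
#define DEMO_PAGE_SHIFT 12
#define DEMO_PAGE_SIZE  (1ULL << DEMO_PAGE_SHIFT)

/* How an inclusive byte range [lstart, lend] splits across pages. */
struct demo_trunc_plan {
    unsigned long long first_full_page; /* first page index lying wholly at or after lstart */
    unsigned long long end_page;        /* index of the page that holds byte lend + 1 */
    unsigned int partial_start;         /* offset in the first page where the range begins, 0 if aligned */
    unsigned int partial_end;           /* bytes of page end_page covered by the range, 0 if aligned */
};

struct demo_trunc_plan demo_plan_truncate(unsigned long long lstart,
                                          unsigned long long lend)
{
    struct demo_trunc_plan p;

    p.partial_start = (unsigned int)(lstart & (DEMO_PAGE_SIZE - 1));
    p.partial_end = (unsigned int)((lend + 1) & (DEMO_PAGE_SIZE - 1));
    /* Whole pages are those with first_full_page <= index < end_page (possibly none). */
    p.first_full_page = (lstart + DEMO_PAGE_SIZE - 1) >> DEMO_PAGE_SHIFT;
    p.end_page = (lend + 1) >> DEMO_PAGE_SHIFT;
    return p;
}
```

For bytes 100 through 10000, page 0 is partial from offset 100, page 1 is dropped whole, and the first 1809 bytes of page 2 are covered. A punch_hole implementation could deal with all three pieces in one pass.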
  10. 25 May, 2013 6 commits
    • mm/pagewalk.c: walk_page_range should avoid VM_PFNMAP areas · a9ff785e
      Cliff Wickman committed
      A panic can be caused by simply cat'ing /proc/<pid>/smaps while an
      application has a VM_PFNMAP range.  It happened in-house when a
      benchmarker was trying to decipher the memory layout of his program.
      
      /proc/<pid>/smaps and similar walks through a user page table should not
      be looking at VM_PFNMAP areas.
      
      Certain tests in walk_page_range() (specifically split_huge_page_pmd())
      assume that all the mapped PFNs are backed with page structures.  And
      this is not usually true for VM_PFNMAP areas.  This can result in panics
      on kernel page faults when attempting to address those page structures.
      
      There are a half dozen callers of walk_page_range() that walk through a
      task's entire page table (as N. Horiguchi pointed out).  So rather than
      change all of them, this patch changes just walk_page_range() to ignore
      VM_PFNMAP areas.
      
      The logic of hugetlb_vma() is moved back into walk_page_range(), as we
      want to test any vma in the range.
      
      VM_PFNMAP areas are used by:
      - graphics memory manager   gpu/drm/drm_gem.c
      - global reference unit     sgi-gru/grufile.c
      - sgi special memory        char/mspec.c
      - and probably several out-of-tree modules
      
      [akpm@linux-foundation.org: remove now-unused hugetlb_vma() stub]
      Signed-off-by: Cliff Wickman <cpw@sgi.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Sterba <dsterba@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memory_hotplug.c: fix printk format warnings · 348f9f05
      Randy Dunlap committed
      Fix printk format warnings in mm/memory_hotplug.c by using "%pa":
      
        mm/memory_hotplug.c: warning: format '%llx' expects argument of type 'long long unsigned int', but argument 2 has type 'resource_size_t' [-Wformat]
        mm/memory_hotplug.c: warning: format '%llx' expects argument of type 'long long unsigned int', but argument 3 has type 'resource_size_t' [-Wformat]
      Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
      Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/THP: use pmd_populate() to update the pmd with pgtable_t pointer · 7c342512
      Aneesh Kumar K.V committed
      We should not use set_pmd_at to update pmd_t with pgtable_t pointer.
      set_pmd_at is used to set pmd with huge pte entries, and architectures
      like ppc64 clear a few flags from the pte when saving a new entry.
      Without this change we observe bad pte errors like the one below on
      ppc64 with THP enabled.
      
        BUG: Bad page map in process ld mm=0xc000001ee39f4780 pte:7fc3f37848000001 pmd:c000001ec0000000
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm compaction: fix of improper cache flush in migration code · c2cc499c
      Leonid Yegoshin committed
      Page 'new' during MIGRATION can't be flushed with flush_cache_page().
      Using flush_cache_page(vma, addr, pfn) is justified only if the page is
      already placed in process page table, and that is done right after
      flush_cache_page().  But without it the arch function has no knowledge
      of process PTE and does nothing.
      
      Besides that, flush_cache_page() flushes an application cache page, but
      the kernel has a different page virtual address and dirtied it.
      
      Replace it with flush_dcache_page(new) which is the proper usage.
      
      The old page is flushed in try_to_unmap_one() before migration.
      
      This bug occurs on the Sead3 board with an M14Kc MIPS CPU without cache
      aliasing (but Harvard arch - separate I and D cache) in a tight memory
      environment (128MB) every 1-3 days on SOAK test.  It fails in cc1 during
      kernel build (SIGILL, SIGBUS, SIGSEGV) if CONFIG_COMPACTION is switched
      ON.
      Signed-off-by: Leonid Yegoshin <Leonid.Yegoshin@imgtec.com>
      Cc: Leonid Yegoshin <yegoshin@mips.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: David Miller <davem@davemloft.net>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcg: remove incorrect VM_BUG_ON for swap cache pages in uncharge · 28ccddf7
      Johannes Weiner committed
      Commit 0c59b89c ("mm: memcg: push down PageSwapCache check into
      uncharge entry functions") added a VM_BUG_ON() on PageSwapCache in the
      uncharge path after checking that page flag once, assuming that the
      state is stable in all paths, but this is not the case and the condition
      triggers in user environments.  An uncharge after the last page table
      reference to the page goes away can race with reclaim adding the page to
      swap cache.
      
      Swap cache pages are usually uncharged when they are freed after
      swapout, from a path that also handles swap usage accounting and memcg
      lifetime management.  However, since the last page table reference is
      gone and thus no references to the swap slot left, the swap slot will be
      freed shortly when reclaim attempts to write the page to disk.  The
      whole swap accounting is not even necessary.
      
      So while the race condition for which this VM_BUG_ON was added is real
      and actually existed all along, there are no negative effects.  Remove
      the VM_BUG_ON again.
      Reported-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Reported-by: Lingzhu Xiang <lxiang@redhat.com>
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Hugh Dickins <hughd@google.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: mmu_notifier: re-fix freed page still mapped in secondary MMU · d34883d4
      Xiao Guangrong committed
      Commit 751efd86 ("mmu_notifier_unregister NULL Pointer deref and
      multiple ->release()") breaks the fix 3ad3d901 ("mm: mmu_notifier:
      fix freed page still mapped in secondary MMU").
      
      Since hlist_for_each_entry_rcu() has changed since then, we cannot
      revert that patch directly, so this patch reverts the commit and simply
      fixes the bug spotted by that patch.
      
      The bug spotted by commit 751efd86 is:
      
          There is a race condition between mmu_notifier_unregister() and
          __mmu_notifier_release().
      
          Assume two tasks, one calling mmu_notifier_unregister() as a result
          of a filp_close() ->flush() callout (task A), and the other calling
          mmu_notifier_release() from an mmput() (task B).
      
                              A                               B
          t1                                            srcu_read_lock()
          t2            if (!hlist_unhashed())
          t3                                            srcu_read_unlock()
          t4            srcu_read_lock()
          t5                                            hlist_del_init_rcu()
          t6                                            synchronize_srcu()
          t7            srcu_read_unlock()
          t8            hlist_del_rcu()  <--- NULL pointer deref.
      
      This can be fixed by using hlist_del_init_rcu instead of hlist_del_rcu.
      
      The other issue spotted in the commit is "multiple ->release()
      callouts". We needn't worry about it too much because it is really rare
      (e.g., it cannot happen on kvm since mmu-notify is unregistered after
      exit_mmap()) and the later calls of ->release should be fast
      since all the pages have already been released by the first call.
      Anyway, this issue should be fixed in a separate patch.
      
      -stable suggestions: Any version that has commit 751efd86 needs this
      fix backported.  The oldest version I found with this commit is 3.0-stable.
      
      [akpm@linux-foundation.org: tweak comments]
      Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Tested-by: Robin Holt <holt@sgi.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
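
Why an "init" style delete tolerates being run twice can be seen in a plain, non-RCU model of a hashed list node. This is a userspace sketch with hypothetical `demo_` names; it imitates the shape of the kernel's hlist but has no RCU or SRCU semantics.

```c
#include <stddef.h>

/* Minimal singly linked list with back-pointers, in the style of an hlist. */
struct demo_hnode {
    struct demo_hnode *next;
    struct demo_hnode **pprev; /* NULL means "not on any list" */
};

struct demo_hhead {
    struct demo_hnode *first;
};

int demo_unhashed(const struct demo_hnode *n)
{
    return n->pprev == NULL;
}

void demo_add_head(struct demo_hnode *n, struct demo_hhead *h)
{
    n->next = h->first;
    if (h->first)
        h->first->pprev = &n->next;
    h->first = n;
    n->pprev = &h->first;
}

/*
 * Delete and reinitialise. A second call on the same node sees
 * pprev == NULL and returns instead of writing through it.
 */
void demo_del_init(struct demo_hnode *n)
{
    if (demo_unhashed(n))
        return;
    *n->pprev = n->next;
    if (n->next)
        n->next->pprev = n->pprev;
    n->next = NULL;
    n->pprev = NULL;
}
```

A delete without the unhashed check would write through `n->pprev` unconditionally, which is the NULL pointer dereference at step t8 in the timeline above when the other task has already removed and reinitialised the node.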
  11. 22 May, 2013 1 commit
    • mm: Fix virt_to_page() warning · bb3ec6b0
      Ralf Baechle committed
      virt_to_page() is typically implemented as a macro containing a cast so
      that it will accept both pointers and unsigned long without causing a
      warning.
      
      But MIPS virt_to_page() uses virt_to_phys which is a function so passing
      an unsigned long will cause a warning:
      
          CC      mm/page_alloc.o
        mm/page_alloc.c: In function ‘free_reserved_area’:
        mm/page_alloc.c:5161:3: warning: passing argument 1 of ‘virt_to_phys’ makes pointer from integer without a cast [enabled by default]
        arch/mips/include/asm/io.h:119:100: note: expected ‘const volatile void *’ but argument is of type ‘long unsigned int’
      
      All other users of virt_to_page() in mm/ are passing a void *.
      Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
      Reported-by: Eunbong Song <eunb.song@samsung.com>
      Cc: linux-kernel@vger.kernel.org
      Cc: linux-mm@kvack.org
      Cc: linux-mips@linux-mips.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>