1. 04 Feb 2016, 1 commit
  2. 22 Jan 2016, 2 commits
  3. 21 Jan 2016, 2 commits
    • thp: fix interrupt unsafe locking in split_huge_page() · 0b9b6fff
      Committed by Kirill A. Shutemov
      split_queue_lock can be taken from interrupt context in some cases, but
      I forgot to convert locking in split_huge_page() to interrupt-safe
      primitives.
      
      Let's fix this.
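
      The fix is the standard conversion to the irq-saving spinlock
      primitives. An illustrative sketch of the pattern (the queue
      manipulation shown is simplified, not the exact upstream hunk):

        unsigned long flags;

        /* Before: unsafe, since the lock is also taken from softirq
         * context via free_transhuge_page(). */
        spin_lock(&split_queue_lock);
        list_del(page_deferred_list(page));
        spin_unlock(&split_queue_lock);

        /* After: interrupts stay disabled while the lock is held. */
        spin_lock_irqsave(&split_queue_lock, flags);
        list_del(page_deferred_list(page));
        spin_unlock_irqrestore(&split_queue_lock, flags);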
      
      lockdep output:
      
        ======================================================
        [ INFO: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected ]
        4.4.0+ #259 Tainted: G        W
        ------------------------------------------------------
        syz-executor/18183 [HC0[0]:SC0[2]:HE0:SE0] is trying to acquire:
         (split_queue_lock){+.+...}, at: free_transhuge_page+0x24/0x90 mm/huge_memory.c:3436
      
        and this task is already holding:
         (slock-AF_INET){+.-...}, at: spin_lock_bh include/linux/spinlock.h:307
         (slock-AF_INET){+.-...}, at: lock_sock_fast+0x45/0x120 net/core/sock.c:2462
        which would create a new lock dependency:
         (slock-AF_INET){+.-...} -> (split_queue_lock){+.+...}
      
        but this new dependency connects a SOFTIRQ-irq-safe lock:
         (slock-AF_INET){+.-...}
        ... which became SOFTIRQ-irq-safe at:
           mark_irqflags kernel/locking/lockdep.c:2799
           __lock_acquire+0xfd8/0x4700 kernel/locking/lockdep.c:3162
           lock_acquire+0x1dc/0x430 kernel/locking/lockdep.c:3585
           __raw_spin_lock include/linux/spinlock_api_smp.h:144
           _raw_spin_lock+0x33/0x50 kernel/locking/spinlock.c:151
           spin_lock include/linux/spinlock.h:302
           udp_queue_rcv_skb+0x781/0x1550 net/ipv4/udp.c:1680
           flush_stack+0x50/0x330 net/ipv6/udp.c:799
           __udp4_lib_mcast_deliver+0x694/0x7f0 net/ipv4/udp.c:1798
           __udp4_lib_rcv+0x17dc/0x23e0 net/ipv4/udp.c:1888
           udp_rcv+0x21/0x30 net/ipv4/udp.c:2108
           ip_local_deliver_finish+0x2b3/0xa50 net/ipv4/ip_input.c:216
           NF_HOOK_THRESH include/linux/netfilter.h:226
           NF_HOOK include/linux/netfilter.h:249
           ip_local_deliver+0x1c4/0x2f0 net/ipv4/ip_input.c:257
           dst_input include/net/dst.h:498
           ip_rcv_finish+0x5ec/0x1730 net/ipv4/ip_input.c:365
           NF_HOOK_THRESH include/linux/netfilter.h:226
           NF_HOOK include/linux/netfilter.h:249
           ip_rcv+0x963/0x1080 net/ipv4/ip_input.c:455
           __netif_receive_skb_core+0x1620/0x2f80 net/core/dev.c:4154
           __netif_receive_skb+0x2a/0x160 net/core/dev.c:4189
           netif_receive_skb_internal+0x1b5/0x390 net/core/dev.c:4217
           napi_skb_finish net/core/dev.c:4542
           napi_gro_receive+0x2bd/0x3c0 net/core/dev.c:4572
           e1000_clean_rx_irq+0x4e2/0x1100 drivers/net/ethernet/intel/e1000e/netdev.c:1038
           e1000_clean+0xa08/0x24a0 drivers/net/ethernet/intel/e1000/e1000_main.c:3819
           napi_poll net/core/dev.c:5074
           net_rx_action+0x7eb/0xdf0 net/core/dev.c:5139
           __do_softirq+0x26a/0x920 kernel/softirq.c:273
           invoke_softirq kernel/softirq.c:350
           irq_exit+0x18f/0x1d0 kernel/softirq.c:391
           exiting_irq ./arch/x86/include/asm/apic.h:659
           do_IRQ+0x86/0x1a0 arch/x86/kernel/irq.c:252
           ret_from_intr+0x0/0x20 arch/x86/entry/entry_64.S:520
           arch_safe_halt ./arch/x86/include/asm/paravirt.h:117
           default_idle+0x52/0x2e0 arch/x86/kernel/process.c:304
           arch_cpu_idle+0xa/0x10 arch/x86/kernel/process.c:295
           default_idle_call+0x48/0xa0 kernel/sched/idle.c:92
           cpuidle_idle_call kernel/sched/idle.c:156
           cpu_idle_loop kernel/sched/idle.c:252
           cpu_startup_entry+0x554/0x710 kernel/sched/idle.c:300
           rest_init+0x192/0x1a0 init/main.c:412
           start_kernel+0x678/0x69e init/main.c:683
           x86_64_start_reservations+0x2a/0x2c arch/x86/kernel/head64.c:195
           x86_64_start_kernel+0x158/0x167 arch/x86/kernel/head64.c:184
      
        to a SOFTIRQ-irq-unsafe lock:
         (split_queue_lock){+.+...}
         which became SOFTIRQ-irq-unsafe at:
           mark_irqflags kernel/locking/lockdep.c:2817
           __lock_acquire+0x146e/0x4700 kernel/locking/lockdep.c:3162
           lock_acquire+0x1dc/0x430 kernel/locking/lockdep.c:3585
           __raw_spin_lock include/linux/spinlock_api_smp.h:144
           _raw_spin_lock+0x33/0x50 kernel/locking/spinlock.c:151
           spin_lock include/linux/spinlock.h:302
           split_huge_page_to_list+0xcc0/0x1c50 mm/huge_memory.c:3399
           split_huge_page include/linux/huge_mm.h:99
           queue_pages_pte_range+0xa38/0xef0 mm/mempolicy.c:507
           walk_pmd_range mm/pagewalk.c:50
           walk_pud_range mm/pagewalk.c:90
           walk_pgd_range mm/pagewalk.c:116
           __walk_page_range+0x653/0xcd0 mm/pagewalk.c:204
           walk_page_range+0xfe/0x2b0 mm/pagewalk.c:281
           queue_pages_range+0xfb/0x130 mm/mempolicy.c:687
           migrate_to_node mm/mempolicy.c:1004
           do_migrate_pages+0x370/0x4e0 mm/mempolicy.c:1109
           SYSC_migrate_pages mm/mempolicy.c:1453
           SyS_migrate_pages+0x640/0x730 mm/mempolicy.c:1374
           entry_SYSCALL_64_fastpath+0x16/0x7a arch/x86/entry/entry_64.S:185
      
        other info that might help us debug this:
      
         Possible interrupt unsafe locking scenario:
      
               CPU0                    CPU1
               ----                    ----
          lock(split_queue_lock);
                                       local_irq_disable();
                                       lock(slock-AF_INET);
                                       lock(split_queue_lock);
          <Interrupt>
            lock(slock-AF_INET);

      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: avoid uninitialized variable in tracepoint · 629d9d1c
      Committed by Arnd Bergmann
      A newly added tracepoint in the hugepage code uses a variable in the
      error handling that is not initialized at that point:
      
      include/trace/events/huge_memory.h:81:230: error: 'isolated' may be used uninitialized in this function [-Werror=maybe-uninitialized]
      
      The result is relatively harmless, as the trace data will in rare
      cases contain incorrect data.
      
      This works around the problem by adding an explicit initialization.
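
      A minimal sketch of the workaround (the variable name comes from the
      warning above; surrounding code omitted):

        /* Before: 'isolated' is only assigned on the success path, but the
         * error path still hands it to the tracepoint. */
        int isolated;

        /* After: an explicit initialization keeps the trace data sane on
         * the error paths as well. */
        int isolated = 0;
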
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Fixes: 7d2eba05 ("mm: add tracepoint for scanning pages")
      Reviewed-by: Ebru Akagunduz <ebru.akagunduz@gmail.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. 18 Jan 2016, 1 commit
  5. 16 Jan 2016, 26 commits
  6. 15 Jan 2016, 1 commit
    • mm: add tracepoint for scanning pages · 7d2eba05
      Committed by Ebru Akagunduz
      This patch series performs swapin readahead, up to a certain number of
      pages, to gain more THP performance, and adds tracepoints for
      khugepaged_scan_pmd, collapse_huge_page and __collapse_huge_page_isolate.
      
      This patch series was written to deal with programs that access most,
      but not all, of their memory after they get swapped out.  Currently
      these programs do not get their memory collapsed into THPs after the
      system swapped their memory out, while they would get THPs before
      swapping happened.
      
      This patch series was tested with a test program that allocates 400MB
      of memory, writes to it, and then sleeps.  I force the system to swap
      out all of it.  Afterwards, the test program touches part of the area
      by writing to it and leaves the rest untouched.  This shows how much
      swapin readahead the patch performs.
      
      Test results:
      
                              After swapped out
      -------------------------------------------------------------------
                    | Anonymous | AnonHugePages | Swap      | Fraction |
      -------------------------------------------------------------------
      With patch    | 90076 kB  | 88064 kB      | 309928 kB |   99%    |
      -------------------------------------------------------------------
      Without patch | 194068 kB | 192512 kB     | 205936 kB |   99%    |
      -------------------------------------------------------------------

                              After swapped in
      -------------------------------------------------------------------
                    | Anonymous | AnonHugePages | Swap      | Fraction |
      -------------------------------------------------------------------
      With patch    | 201408 kB | 198656 kB     | 198596 kB |   98%    |
      -------------------------------------------------------------------
      Without patch | 292624 kB | 192512 kB     | 107380 kB |   65%    |
      -------------------------------------------------------------------
      
      This patch (of 3):
      
      Using static tracepoints, data from these functions is recorded.  This
      makes it possible to automate debugging without many changes to the
      source code.

      This patch adds tracepoints for khugepaged_scan_pmd, collapse_huge_page
      and __collapse_huge_page_isolate.
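
      For reference, tracepoints of this kind are declared in
      include/trace/events/huge_memory.h; a simplified sketch (field list
      and format reduced, not the full upstream definition):

        TRACE_EVENT(mm_collapse_huge_page,

                TP_PROTO(struct mm_struct *mm, int isolated, int status),

                TP_ARGS(mm, isolated, status),

                TP_STRUCT__entry(
                        __field(struct mm_struct *, mm)
                        __field(int, isolated)
                        __field(int, status)
                ),

                TP_fast_assign(
                        __entry->mm = mm;
                        __entry->isolated = isolated;
                        __entry->status = status;
                ),

                TP_printk("mm=%p, isolated=%d, status=%d",
                          __entry->mm, __entry->isolated, __entry->status)
        );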
      
      [dan.carpenter@oracle.com: add a missing tab]
      Signed-off-by: Ebru Akagunduz <ebru.akagunduz@gmail.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Xie XiuQi <xiexiuqi@huawei.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  7. 21 Nov 2015, 1 commit
  8. 07 Nov 2015, 4 commits
    • mm: make compound_head() robust · 1d798ca3
      Committed by Kirill A. Shutemov
      Hugh has pointed out that a compound_head() call can be unsafe in some
      contexts.  Here's one example:
      
      	CPU0					CPU1
      
      isolate_migratepages_block()
        page_count()
          compound_head()
            !!PageTail() == true
      					put_page()
      					  tail->first_page = NULL
            head = tail->first_page
      					alloc_pages(__GFP_COMP)
      					   prep_compound_page()
      					     tail->first_page = head
      					     __SetPageTail(p);
            !!PageTail() == true
          <head == NULL dereferencing>
      
      The race is purely theoretical.  I don't think it's possible to trigger
      it in practice.  But who knows.
      
      We can fix the race by changing how we encode PageTail() and
      compound_head() within struct page, so that they can be updated in one
      shot.
      
      The patch introduces page->compound_head into the third double-word
      block, in front of compound_dtor and compound_order.  Bit 0 encodes
      PageTail(); if it is set, the remaining bits are a pointer to the head
      page.
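
      A sketch of the resulting helpers, reconstructed from the description
      above (close to, but not necessarily identical with, the upstream
      code):

        static inline struct page *compound_head(struct page *page)
        {
                unsigned long head = READ_ONCE(page->compound_head);

                /* Bit 0 set: this is a tail page and the rest of the word
                 * is the head page pointer.  Both are read in one shot. */
                if (unlikely(head & 1))
                        return (struct page *)(head - 1);
                return page;
        }

        static inline int PageTail(struct page *page)
        {
                return READ_ONCE(page->compound_head) & 1;
        }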
      
      The patch moves page->pmd_huge_pte out of that word, just in case an
      architecture defines pgtable_t as something that can have bit 0 set.
      
      hugetlb_cgroup uses page->lru.next in the second tail page to store a
      pointer to struct hugetlb_cgroup.  The patch switches it to use
      page->private in the second tail page instead.  That space is free
      since ->first_page is removed from the union.
      
      The patch also opens the possibility of removing the
      HUGETLB_CGROUP_MIN_ORDER limitation, since there's now space in the
      first tail page to store the struct hugetlb_cgroup pointer.  But that's
      out of scope for this patch.
      
      That means page->compound_head shares storage space with:
      
       - page->lru.next;
       - page->next;
       - page->rcu_head.next;
      
      That's too long a list to be absolutely sure, but it looks like nobody
      uses bit 0 of that word.
      
      page->rcu_head.next is guaranteed[1] to have bit 0 clear as long as we
      use call_rcu(), call_rcu_bh(), call_rcu_sched(), or call_srcu().  But a
      future call_rcu_lazy() is not allowed, as it makes use of the bit and
      we could get a false-positive PageTail().
      
      [1] http://lkml.kernel.org/g/20150827163634.GD4029@linux.vnet.ibm.com

      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • thp: remove unused vma parameter from khugepaged_alloc_page · d6669d68
      Committed by Aaron Tomlin
      The "vma" parameter to khugepaged_alloc_page() is unused.  It has to
      remain unused or the drop read lock 'map_sem' optimisation introduce by
      commit 8b164568 ("mm, THP: don't hold mmap_sem in khugepaged when
      allocating THP") wouldn't be safe.  So let's remove it.
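
      The change itself is a plain parameter removal; a hedged sketch (the
      surrounding parameter list is assumed for illustration, only the
      dropped 'vma' argument is taken from the commit):

        /* Before: 'vma' is accepted but never used. */
        static struct page *khugepaged_alloc_page(struct page **hpage,
                                                  gfp_t gfp,
                                                  struct vm_area_struct *vma,
                                                  int node);

        /* After: callers can no longer be tempted to dereference 'vma'
         * without holding mmap_sem. */
        static struct page *khugepaged_alloc_page(struct page **hpage,
                                                  gfp_t gfp, int node);
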
      Signed-off-by: Aaron Tomlin <atomlin@redhat.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, page_alloc: remove MIGRATE_RESERVE · 974a786e
      Committed by Mel Gorman
      MIGRATE_RESERVE preserves an old property of the buddy allocator that
      existed prior to fragmentation avoidance -- min_free_kbytes worth of pages
      tended to remain contiguous until the only alternative was to fail the
      allocation.  At the time, it was discovered that high-order atomic
      allocations relied on this property, so MIGRATE_RESERVE was introduced.
      A later patch will introduce an alternative, MIGRATE_HIGHATOMIC, so
      this patch deletes MIGRATE_RESERVE and its supporting code to make that
      later change easier to review.
      Note that this patch in isolation may look like a false regression if
      someone was bisecting high-order atomic allocation failures.
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vitaly Wool <vitalywool@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, page_alloc: rename __GFP_WAIT to __GFP_RECLAIM · 71baba4b
      Committed by Mel Gorman
      __GFP_WAIT was used to signal that the caller was in atomic context and
      could not sleep.  Now it is possible to distinguish between true atomic
      context and callers that are not willing to sleep.  The latter should
      clear __GFP_DIRECT_RECLAIM so kswapd will still wake.  As clearing
      __GFP_WAIT behaves differently, there is a risk that people will clear the
      wrong flags.  This patch renames __GFP_WAIT to __GFP_RECLAIM to clearly
      indicate what it does -- setting it allows all reclaim activity, clearing
      it prevents any reclaim at all.
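
      A sketch of the relationship after the rename (masks reconstructed
      from the description in this series; treat the exact definitions as
      assumptions):

        /* Setting __GFP_RECLAIM allows all reclaim activity: */
        #define __GFP_RECLAIM \
                ((__force gfp_t)(___GFP_DIRECT_RECLAIM | ___GFP_KSWAPD_RECLAIM))

        /* A caller that is unwilling to sleep but still wants kswapd woken
         * clears only the direct-reclaim bit, as described above: */
        gfp_t gfp = GFP_KERNEL & ~__GFP_DIRECT_RECLAIM;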
      
      [akpm@linux-foundation.org: fix build]
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Christoph Lameter <cl@linux.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Vitaly Wool <vitalywool@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  9. 06 Nov 2015, 1 commit
    • mm: introduce VM_LOCKONFAULT · de60f5f1
      Committed by Eric B Munson
      The cost of faulting in all memory to be locked can be very high when
      working with large mappings.  If only portions of the mapping will be
      used, this can incur a high penalty for locking.

      For the large-file example, this is the usage pattern of a large
      statistical language model (and probably applies to other statistical
      or graphical models as well).  For the security example, consider any
      application transacting in data that cannot be swapped out (credit
      card data, medical records, etc.).
      
      This patch introduces the ability to request that pages are not
      pre-faulted, but are placed on the unevictable LRU when they are finally
      faulted in.  The VM_LOCKONFAULT flag will be used together with VM_LOCKED
      and has no effect when set without VM_LOCKED.  Setting the VM_LOCKONFAULT
      flag for a VMA will cause pages faulted into that VMA to be added to the
      unevictable LRU when they are faulted or if they are already present, but
      will not cause any missing pages to be faulted in.
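
      From userspace this behaviour is requested through the new mlock2()
      syscall with MLOCK_ONFAULT, added in the same series.  A usage sketch
      (syscall() is used directly in case the libc wrapper is absent; the
      flag value is taken from the uapi headers of this era):

        #define _GNU_SOURCE
        #include <stdio.h>
        #include <sys/mman.h>
        #include <sys/syscall.h>
        #include <unistd.h>

        #ifndef MLOCK_ONFAULT
        #define MLOCK_ONFAULT 0x01
        #endif

        int main(void)
        {
                size_t len = 1UL << 30;  /* large, mostly untouched mapping */
                void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
                if (buf == MAP_FAILED)
                        return 1;

                /* Sets VM_LOCKED | VM_LOCKONFAULT: pages go onto the
                 * unevictable LRU as they are faulted in; nothing is
                 * pre-faulted up front. */
                if (syscall(SYS_mlock2, buf, len, MLOCK_ONFAULT))
                        perror("mlock2");
                return 0;
        }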
      
      Exposing this new lock state means that we cannot overload the meaning of
      the FOLL_POPULATE flag any longer.  Prior to this patch it was used to
      mean that the VMA for a fault was locked.  This means we need the new
      FOLL_MLOCK flag to communicate the locked state of a VMA.  FOLL_POPULATE
      will now only control if the VMA should be populated and in the case of
      VM_LOCKONFAULT, it will not be set.
      Signed-off-by: Eric B Munson <emunson@akamai.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Shuah Khan <shuahkh@osg.samsung.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  10. 23 Oct 2015, 1 commit