1. 22 3月, 2020 5 次提交
  2. 19 3月, 2020 2 次提交
    • L
      mm: slub: be more careful about the double cmpxchg of freelist · 5076190d
      Linus Torvalds 提交于
      This is just a cleanup addition to Jann's fix to properly update the
      transaction ID for the slub slowpath in commit fd4d9c7d ("mm: slub:
      add missing TID bump..").
      
      The transaction ID is what protects us against any concurrent accesses,
      but we should really also make sure to make the 'freelist' comparison
      itself always use the same freelist value that we then used as the new
      next free pointer.
      
      Jann points out that if we do all of this carefully, we could skip the
      transaction ID update for all the paths that only remove entries from
      the lists, and only update the TID when adding entries (to avoid the ABA
      issue with cmpxchg and list handling re-adding a previously seen value).
      
      But this patch just does the "make sure to cmpxchg the same value we
      used" rather than then try to be clever.
      Acked-by: NJann Horn <jannh@google.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5076190d
    • J
      mm: slub: add missing TID bump in kmem_cache_alloc_bulk() · fd4d9c7d
      Jann Horn 提交于
      When kmem_cache_alloc_bulk() attempts to allocate N objects from a percpu
      freelist of length M, and N > M > 0, it will first remove the M elements
      from the percpu freelist, then call ___slab_alloc() to allocate the next
      element and repopulate the percpu freelist. ___slab_alloc() can re-enable
      IRQs via allocate_slab(), so the TID must be bumped before ___slab_alloc()
      to properly commit the freelist head change.
      
      Fix it by unconditionally bumping c->tid when entering the slowpath.
      
      Cc: stable@vger.kernel.org
      Fixes: ebe909e0 ("slub: improve bulk alloc strategy")
      Signed-off-by: NJann Horn <jannh@google.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fd4d9c7d
  3. 11 3月, 2020 2 次提交
    • S
      net: memcg: late association of sock to memcg · d752a498
      Shakeel Butt 提交于
      If a TCP socket is allocated in IRQ context or cloned from unassociated
      (i.e. not associated to a memcg) in IRQ context then it will remain
      unassociated for its whole life. Almost half of the TCPs created on the
      system are created in IRQ context, so, memory used by such sockets will
      not be accounted by the memcg.
      
      This issue is more widespread in cgroup v1 where network memory
      accounting is opt-in but it can happen in cgroup v2 if the source socket
      for the cloning was created in root memcg.
      
      To fix the issue, just do the association of the sockets at the accept()
      time in the process context and then force charge the memory buffer
      already used and reserved by the socket.
      Signed-off-by: NShakeel Butt <shakeelb@google.com>
      Reviewed-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d752a498
    • S
      cgroup: memcg: net: do not associate sock with unrelated cgroup · e876ecc6
      Shakeel Butt 提交于
      We are testing network memory accounting in our setup and noticed
      inconsistent network memory usage and often unrelated cgroups network
      usage correlates with testing workload. On further inspection, it
      seems like mem_cgroup_sk_alloc() and cgroup_sk_alloc() are broken in
      irq context specially for cgroup v1.
      
      mem_cgroup_sk_alloc() and cgroup_sk_alloc() can be called in irq context
      and kind of assumes that this can only happen from sk_clone_lock()
      and the source sock object has already associated cgroup. However in
      cgroup v1, where network memory accounting is opt-in, the source sock
      can be unassociated with any cgroup and the new cloned sock can get
      associated with unrelated interrupted cgroup.
      
      Cgroup v2 can also suffer if the source sock object was created by
      process in the root cgroup or if sk_alloc() is called in irq context.
      The fix is to just do nothing in interrupt.
      
      WARNING: Please note that about half of the TCP sockets are allocated
      from the IRQ context, so, memory used by such sockets will not be
      accouted by the memcg.
      
      The stack trace of mem_cgroup_sk_alloc() from IRQ-context:
      
      CPU: 70 PID: 12720 Comm: ssh Tainted:  5.6.0-smp-DEV #1
      Hardware name: ...
      Call Trace:
       <IRQ>
       dump_stack+0x57/0x75
       mem_cgroup_sk_alloc+0xe9/0xf0
       sk_clone_lock+0x2a7/0x420
       inet_csk_clone_lock+0x1b/0x110
       tcp_create_openreq_child+0x23/0x3b0
       tcp_v6_syn_recv_sock+0x88/0x730
       tcp_check_req+0x429/0x560
       tcp_v6_rcv+0x72d/0xa40
       ip6_protocol_deliver_rcu+0xc9/0x400
       ip6_input+0x44/0xd0
       ? ip6_protocol_deliver_rcu+0x400/0x400
       ip6_rcv_finish+0x71/0x80
       ipv6_rcv+0x5b/0xe0
       ? ip6_sublist_rcv+0x2e0/0x2e0
       process_backlog+0x108/0x1e0
       net_rx_action+0x26b/0x460
       __do_softirq+0x104/0x2a6
       do_softirq_own_stack+0x2a/0x40
       </IRQ>
       do_softirq.part.19+0x40/0x50
       __local_bh_enable_ip+0x51/0x60
       ip6_finish_output2+0x23d/0x520
       ? ip6table_mangle_hook+0x55/0x160
       __ip6_finish_output+0xa1/0x100
       ip6_finish_output+0x30/0xd0
       ip6_output+0x73/0x120
       ? __ip6_finish_output+0x100/0x100
       ip6_xmit+0x2e3/0x600
       ? ipv6_anycast_cleanup+0x50/0x50
       ? inet6_csk_route_socket+0x136/0x1e0
       ? skb_free_head+0x1e/0x30
       inet6_csk_xmit+0x95/0xf0
       __tcp_transmit_skb+0x5b4/0xb20
       __tcp_send_ack.part.60+0xa3/0x110
       tcp_send_ack+0x1d/0x20
       tcp_rcv_state_process+0xe64/0xe80
       ? tcp_v6_connect+0x5d1/0x5f0
       tcp_v6_do_rcv+0x1b1/0x3f0
       ? tcp_v6_do_rcv+0x1b1/0x3f0
       __release_sock+0x7f/0xd0
       release_sock+0x30/0xa0
       __inet_stream_connect+0x1c3/0x3b0
       ? prepare_to_wait+0xb0/0xb0
       inet_stream_connect+0x3b/0x60
       __sys_connect+0x101/0x120
       ? __sys_getsockopt+0x11b/0x140
       __x64_sys_connect+0x1a/0x20
       do_syscall_64+0x51/0x200
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      The stack trace of mem_cgroup_sk_alloc() from IRQ-context:
      Fixes: 2d758073 ("mm: memcontrol: consolidate cgroup socket tracking")
      Fixes: d979a39d ("cgroup: duplicate cgroup reference when cloning sockets")
      Signed-off-by: NShakeel Butt <shakeelb@google.com>
      Reviewed-by: NRoman Gushchin <guro@fb.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e876ecc6
  4. 06 3月, 2020 5 次提交
    • V
      mm, hotplug: fix page online with DEBUG_PAGEALLOC compiled but not enabled · c87cbc1f
      Vlastimil Babka 提交于
      Commit cd02cf1a ("mm/hotplug: fix an imbalance with DEBUG_PAGEALLOC")
      fixed memory hotplug with debug_pagealloc enabled, where onlining a page
      goes through page freeing, which removes the direct mapping.  Some arches
      don't like when the page is not mapped in the first place, so
      generic_online_page() maps it first.  This is somewhat wasteful, but
      better than special casing page freeing fast paths.
      
      The commit however missed that DEBUG_PAGEALLOC configured doesn't mean
      it's actually enabled.  One has to test debug_pagealloc_enabled() since
      031bc574 ("mm/debug-pagealloc: make debug-pagealloc boottime
      configurable"), or alternatively debug_pagealloc_enabled_static() since
      8e57f8ac ("mm, debug_pagealloc: don't rely on static keys too early"),
      but this is not done.
      
      As a result, a s390 kernel with DEBUG_PAGEALLOC configured but not enabled
      will crash:
      
      Unable to handle kernel pointer dereference in virtual kernel address space
      Failing address: 0000000000000000 TEID: 0000000000000483
      Fault in home space mode while using kernel ASCE.
      AS:0000001ece13400b R2:000003fff7fd000b R3:000003fff7fcc007 S:000003fff7fd7000 P:000000000000013d
      Oops: 0004 ilc:2 [#1] SMP
      CPU: 1 PID: 26015 Comm: chmem Kdump: loaded Tainted: GX 5.3.18-5-default #1 SLE15-SP2 (unreleased)
      Krnl PSW : 0704e00180000000 0000001ecd281b9e (__kernel_map_pages+0x166/0x188)
      R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
      Krnl GPRS: 0000000000000000 0000000000000800 0000400b00000000 0000000000000100
      0000000000000001 0000000000000000 0000000000000002 0000000000000100
      0000001ece139230 0000001ecdd98d40 0000400b00000100 0000000000000000
      000003ffa17e4000 001fffe0114f7d08 0000001ecd4d93ea 001fffe0114f7b20
      Krnl Code: 0000001ecd281b8e: ec17ffff00d8 ahik %r1,%r7,-1
      0000001ecd281b94: ec111dbc0355 risbg %r1,%r1,29,188,3
      >0000001ecd281b9e: 94fb5006 ni 6(%r5),251
      0000001ecd281ba2: 41505008 la %r5,8(%r5)
      0000001ecd281ba6: ec51fffc6064 cgrj %r5,%r1,6,1ecd281b9e
      0000001ecd281bac: 1a07 ar %r0,%r7
      0000001ecd281bae: ec03ff584076 crj %r0,%r3,4,1ecd281a5e
      Call Trace:
      [<0000001ecd281b9e>] __kernel_map_pages+0x166/0x188
      [<0000001ecd4d9516>] online_pages_range+0xf6/0x128
      [<0000001ecd2a8186>] walk_system_ram_range+0x7e/0xd8
      [<0000001ecda28aae>] online_pages+0x2fe/0x3f0
      [<0000001ecd7d02a6>] memory_subsys_online+0x8e/0xc0
      [<0000001ecd7add42>] device_online+0x5a/0xc8
      [<0000001ecd7d0430>] state_store+0x88/0x118
      [<0000001ecd5b9f62>] kernfs_fop_write+0xc2/0x200
      [<0000001ecd5064b6>] vfs_write+0x176/0x1e0
      [<0000001ecd50676a>] ksys_write+0xa2/0x100
      [<0000001ecda315d4>] system_call+0xd8/0x2c8
      
      Fix this by checking debug_pagealloc_enabled_static() before calling
      kernel_map_pages(). Backports for kernel before 5.5 should use
      debug_pagealloc_enabled() instead. Also add comments.
      
      Fixes: cd02cf1a ("mm/hotplug: fix an imbalance with DEBUG_PAGEALLOC")
      Reported-by: NGerald Schaefer <gerald.schaefer@de.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Reviewed-by: NDavid Hildenbrand <david@redhat.com>
      Cc: <stable@vger.kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Qian Cai <cai@lca.pw>
      Link: http://lkml.kernel.org/r/20200224094651.18257-1-vbabka@suse.czSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c87cbc1f
    • S
      mm/z3fold.c: do not include rwlock.h directly · a8198fed
      Sebastian Andrzej Siewior 提交于
      rwlock.h should not be included directly. Instead linux/splinlock.h
      should be included. One thing it does is to break the RT build.
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NSebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Vitaly Wool <vitaly.wool@konsulko.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20200224133631.1510569-1-bigeasy@linutronix.deSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a8198fed
    • K
      mm: avoid data corruption on CoW fault into PFN-mapped VMA · c3e5ea6e
      Kirill A. Shutemov 提交于
      Jeff Moyer has reported that one of xfstests triggers a warning when run
      on DAX-enabled filesystem:
      
      	WARNING: CPU: 76 PID: 51024 at mm/memory.c:2317 wp_page_copy+0xc40/0xd50
      	...
      	wp_page_copy+0x98c/0xd50 (unreliable)
      	do_wp_page+0xd8/0xad0
      	__handle_mm_fault+0x748/0x1b90
      	handle_mm_fault+0x120/0x1f0
      	__do_page_fault+0x240/0xd70
      	do_page_fault+0x38/0xd0
      	handle_page_fault+0x10/0x30
      
      The warning happens on failed __copy_from_user_inatomic() which tries to
      copy data into a CoW page.
      
      This happens because of race between MADV_DONTNEED and CoW page fault:
      
      	CPU0					CPU1
       handle_mm_fault()
         do_wp_page()
           wp_page_copy()
             do_wp_page()
      					madvise(MADV_DONTNEED)
      					  zap_page_range()
      					    zap_pte_range()
      					      ptep_get_and_clear_full()
      					      <TLB flush>
      	 __copy_from_user_inatomic()
      	 sees empty PTE and fails
      	 WARN_ON_ONCE(1)
      	 clear_page()
      
      The solution is to re-try __copy_from_user_inatomic() under PTL after
      checking that PTE is matches the orig_pte.
      
      The second copy attempt can still fail, like due to non-readable PTE, but
      there's nothing reasonable we can do about, except clearing the CoW page.
      Reported-by: NJeff Moyer <jmoyer@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Tested-by: NJeff Moyer <jmoyer@redhat.com>
      Cc: <stable@vger.kernel.org>
      Cc: Justin He <Justin.He@arm.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Link: http://lkml.kernel.org/r/20200218154151.13349-1-kirill.shutemov@linux.intel.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c3e5ea6e
    • H
      mm: fix possible PMD dirty bit lost in set_pmd_migration_entry() · 8a8683ad
      Huang Ying 提交于
      In set_pmd_migration_entry(), pmdp_invalidate() is used to change PMD
      atomically.  But the PMD is read before that with an ordinary memory
      reading.  If the THP (transparent huge page) is written between the PMD
      reading and pmdp_invalidate(), the PMD dirty bit may be lost, and cause
      data corruption.  The race window is quite small, but still possible in
      theory, so need to be fixed.
      
      The race is fixed via using the return value of pmdp_invalidate() to get
      the original content of PMD, which is a read/modify/write atomic
      operation.  So no THP writing can occur in between.
      
      The race has been introduced when the THP migration support is added in
      the commit 616b8371 ("mm: thp: enable thp migration in generic path").
      But this fix depends on the commit d52605d7 ("mm: do not lose dirty
      and accessed bits in pmdp_invalidate()").  So it's easy to be backported
      after v4.16.  But the race window is really small, so it may be fine not
      to backport the fix at all.
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: N"Huang, Ying" <ying.huang@intel.com>
      Reviewed-by: NZi Yan <ziy@nvidia.com>
      Reviewed-by: NWilliam Kucharski <william.kucharski@oracle.com>
      Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: <stable@vger.kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Link: http://lkml.kernel.org/r/20200220075220.2327056-1-ying.huang@intel.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8a8683ad
    • M
      mm, numa: fix bad pmd by atomically check for pmd_trans_huge when marking page tables prot_numa · 8b272b3c
      Mel Gorman 提交于
      : A user reported a bug against a distribution kernel while running a
      : proprietary workload described as "memory intensive that is not swapping"
      : that is expected to apply to mainline kernels.  The workload is
      : read/write/modifying ranges of memory and checking the contents.  They
      : reported that within a few hours that a bad PMD would be reported followed
      : by a memory corruption where expected data was all zeros.  A partial
      : report of the bad PMD looked like
      :
      :   [ 5195.338482] ../mm/pgtable-generic.c:33: bad pmd ffff8888157ba008(000002e0396009e2)
      :   [ 5195.341184] ------------[ cut here ]------------
      :   [ 5195.356880] kernel BUG at ../mm/pgtable-generic.c:35!
      :   ....
      :   [ 5195.410033] Call Trace:
      :   [ 5195.410471]  [<ffffffff811bc75d>] change_protection_range+0x7dd/0x930
      :   [ 5195.410716]  [<ffffffff811d4be8>] change_prot_numa+0x18/0x30
      :   [ 5195.410918]  [<ffffffff810adefe>] task_numa_work+0x1fe/0x310
      :   [ 5195.411200]  [<ffffffff81098322>] task_work_run+0x72/0x90
      :   [ 5195.411246]  [<ffffffff81077139>] exit_to_usermode_loop+0x91/0xc2
      :   [ 5195.411494]  [<ffffffff81003a51>] prepare_exit_to_usermode+0x31/0x40
      :   [ 5195.411739]  [<ffffffff815e56af>] retint_user+0x8/0x10
      :
      : Decoding revealed that the PMD was a valid prot_numa PMD and the bad PMD
      : was a false detection.  The bug does not trigger if automatic NUMA
      : balancing or transparent huge pages is disabled.
      :
      : The bug is due a race in change_pmd_range between a pmd_trans_huge and
      : pmd_nond_or_clear_bad check without any locks held.  During the
      : pmd_trans_huge check, a parallel protection update under lock can have
      : cleared the PMD and filled it with a prot_numa entry between the transhuge
      : check and the pmd_none_or_clear_bad check.
      :
      : While this could be fixed with heavy locking, it's only necessary to make
      : a copy of the PMD on the stack during change_pmd_range and avoid races.  A
      : new helper is created for this as the check if quite subtle and the
      : existing similar helpful is not suitable.  This passed 154 hours of
      : testing (usually triggers between 20 minutes and 24 hours) without
      : detecting bad PMDs or corruption.  A basic test of an autonuma-intensive
      : workload showed no significant change in behaviour.
      
      Although Mel withdrew the patch on the face of LKML comment
      https://lkml.org/lkml/2017/4/10/922 the race window aforementioned is
      still open, and we have reports of Linpack test reporting bad residuals
      after the bad PMD warning is observed.  In addition to that, bad
      rss-counter and non-zero pgtables assertions are triggered on mm teardown
      for the task hitting the bad PMD.
      
       host kernel: mm/pgtable-generic.c:40: bad pmd 00000000b3152f68(8000000d2d2008e7)
       ....
       host kernel: BUG: Bad rss-counter state mm:00000000b583043d idx:1 val:512
       host kernel: BUG: non-zero pgtables_bytes on freeing mm: 4096
      
      The issue is observed on a v4.18-based distribution kernel, but the race
      window is expected to be applicable to mainline kernels, as well.
      
      [akpm@linux-foundation.org: fix comment typo, per Rafael]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NRafael Aquini <aquini@redhat.com>
      Signed-off-by: NMel Gorman <mgorman@techsingularity.net>
      Cc: <stable@vger.kernel.org>
      Cc: Zi Yan <zi.yan@cs.rutgers.edu>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@suse.com>
      Link: http://lkml.kernel.org/r/20200216191800.22423-1-aquini@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8b272b3c
  5. 22 2月, 2020 4 次提交
    • W
      mm/sparsemem: pfn_to_page is not valid yet on SPARSEMEM · 18e19f19
      Wei Yang 提交于
      When we use SPARSEMEM instead of SPARSEMEM_VMEMMAP, pfn_to_page()
      doesn't work before sparse_init_one_section() is called.
      
      This leads to a crash when hotplug memory:
      
          BUG: unable to handle page fault for address: 0000000006400000
          #PF: supervisor write access in kernel mode
          #PF: error_code(0x0002) - not-present page
          PGD 0 P4D 0
          Oops: 0002 [#1] SMP PTI
          CPU: 3 PID: 221 Comm: kworker/u16:1 Tainted: G        W         5.5.0-next-20200205+ #343
          Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
          Workqueue: kacpi_hotplug acpi_hotplug_work_fn
          RIP: 0010:__memset+0x24/0x30
          Code: cc cc cc cc cc cc 0f 1f 44 00 00 49 89 f9 48 89 d1 83 e2 07 48 c1 e9 03 40 0f b6 f6 48 b8 01 01 01 01 01 01 01 01 48 0f af c6 <f3> 48 ab 89 d1 f3 aa 4c 89 c8 c3 90 49 89 f9 40 88 f0 48 89 d1 f3
          RSP: 0018:ffffb43ac0373c80 EFLAGS: 00010a87
          RAX: ffffffffffffffff RBX: ffff8a1518800000 RCX: 0000000000050000
          RDX: 0000000000000000 RSI: 00000000000000ff RDI: 0000000006400000
          RBP: 0000000000140000 R08: 0000000000100000 R09: 0000000006400000
          R10: 0000000000000000 R11: 0000000000000002 R12: 0000000000000000
          R13: 0000000000000028 R14: 0000000000000000 R15: ffff8a153ffd9280
          FS:  0000000000000000(0000) GS:ffff8a153ab00000(0000) knlGS:0000000000000000
          CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
          CR2: 0000000006400000 CR3: 0000000136fca000 CR4: 00000000000006e0
          DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
          DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
          Call Trace:
           sparse_add_section+0x1c9/0x26a
           __add_pages+0xbf/0x150
           add_pages+0x12/0x60
           add_memory_resource+0xc8/0x210
           __add_memory+0x62/0xb0
           acpi_memory_device_add+0x13f/0x300
           acpi_bus_attach+0xf6/0x200
           acpi_bus_scan+0x43/0x90
           acpi_device_hotplug+0x275/0x3d0
           acpi_hotplug_work_fn+0x1a/0x30
           process_one_work+0x1a7/0x370
           worker_thread+0x30/0x380
           kthread+0x112/0x130
           ret_from_fork+0x35/0x40
      
      We should use memmap as it did.
      
      On x86 the impact is limited to x86_32 builds, or x86_64 configurations
      that override the default setting for SPARSEMEM_VMEMMAP.
      
      Other memory hotplug archs (arm64, ia64, and ppc) also default to
      SPARSEMEM_VMEMMAP=y.
      
      [dan.j.williams@intel.com: changelog update]
      {rppt@linux.ibm.com: changelog update]
      Link: http://lkml.kernel.org/r/20200219030454.4844-1-bhe@redhat.com
      Fixes: ba72b4c8 ("mm/sparsemem: support sub-section hotplug")
      Signed-off-by: NWei Yang <richardw.yang@linux.intel.com>
      Signed-off-by: NBaoquan He <bhe@redhat.com>
      Acked-by: NDavid Hildenbrand <david@redhat.com>
      Reviewed-by: NBaoquan He <bhe@redhat.com>
      Reviewed-by: NDan Williams <dan.j.williams@intel.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      18e19f19
    • G
      mm/vmscan.c: don't round up scan size for online memory cgroup · 76073c64
      Gavin Shan 提交于
      Commit 68600f62 ("mm: don't miss the last page because of round-off
      error") makes the scan size round up to @denominator regardless of the
      memory cgroup's state, online or offline.  This affects the overall
      reclaiming behavior: the corresponding LRU list is eligible for
      reclaiming only when its size logically right shifted by @sc->priority
      is bigger than zero in the former formula.
      
      For example, the inactive anonymous LRU list should have at least 0x4000
      pages to be eligible for reclaiming when we have 60/12 for
      swappiness/priority and without taking scan/rotation ratio into account.
      
      After the roundup is applied, the inactive anonymous LRU list becomes
      eligible for reclaiming when its size is bigger than or equal to 0x1000
      in the same condition.
      
          (0x4000 >> 12) * 60 / (60 + 140 + 1) = 1
          ((0x1000 >> 12) * 60) + 200) / (60 + 140 + 1) = 1
      
      aarch64 has 512MB huge page size when the base page size is 64KB.  The
      memory cgroup that has a huge page is always eligible for reclaiming in
      that case.
      
      The reclaiming is likely to stop after the huge page is reclaimed,
      meaing the further iteration on @sc->priority and the silbing and child
      memory cgroups will be skipped.  The overall behaviour has been changed.
      This fixes the issue by applying the roundup to offlined memory cgroups
      only, to give more preference to reclaim memory from offlined memory
      cgroup.  It sounds reasonable as those memory is unlikedly to be used by
      anyone.
      
      The issue was found by starting up 8 VMs on a Ampere Mustang machine,
      which has 8 CPUs and 16 GB memory.  Each VM is given with 2 vCPUs and
      2GB memory.  It took 264 seconds for all VMs to be completely up and
      784MB swap is consumed after that.  With this patch applied, it took 236
      seconds and 60MB swap to do same thing.  So there is 10% performance
      improvement for my case.  Note that KSM is disable while THP is enabled
      in the testing.
      
               total     used    free   shared  buff/cache   available
         Mem:  16196    10065    2049       16        4081        3749
         Swap:  8175      784    7391
               total     used    free   shared  buff/cache   available
         Mem:  16196    11324    3656       24        1215        2936
         Swap:  8175       60    8115
      
      Link: http://lkml.kernel.org/r/20200211024514.8730-1-gshan@redhat.com
      Fixes: 68600f62 ("mm: don't miss the last page because of round-off error")
      Signed-off-by: NGavin Shan <gshan@redhat.com>
      Acked-by: NRoman Gushchin <guro@fb.com>
      Cc: <stable@vger.kernel.org>	[4.20+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      76073c64
    • V
      mm/memcontrol.c: lost css_put in memcg_expand_shrinker_maps() · 75866af6
      Vasily Averin 提交于
      for_each_mem_cgroup() increases css reference counter for memory cgroup
      and requires to use mem_cgroup_iter_break() if the walk is cancelled.
      
      Link: http://lkml.kernel.org/r/c98414fb-7e1f-da0f-867a-9340ec4bd30b@virtuozzo.com
      Fixes: 0a4465d3 ("mm, memcg: assign memcg-aware shrinkers bitmap to memcg")
      Signed-off-by: NVasily Averin <vvs@virtuozzo.com>
      Acked-by: NKirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Reviewed-by: NRoman Gushchin <guro@fb.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      75866af6
    • C
  6. 20 2月, 2020 1 次提交
  7. 19 2月, 2020 1 次提交
  8. 08 2月, 2020 3 次提交
  9. 07 2月, 2020 2 次提交
  10. 04 2月, 2020 15 次提交
    • A
      proc: convert everything to "struct proc_ops" · 97a32539
      Alexey Dobriyan 提交于
      The most notable change is DEFINE_SHOW_ATTRIBUTE macro split in
      seq_file.h.
      
      Conversion rule is:
      
      	llseek		=> proc_lseek
      	unlocked_ioctl	=> proc_ioctl
      
      	xxx		=> proc_xxx
      
      	delete ".owner = THIS_MODULE" line
      
      [akpm@linux-foundation.org: fix drivers/isdn/capi/kcapi_proc.c]
      [sfr@canb.auug.org.au: fix kernel/sched/psi.c]
        Link: http://lkml.kernel.org/r/20200122180545.36222f50@canb.auug.org.au
      Link: http://lkml.kernel.org/r/20191225172546.GB13378@avx2Signed-off-by: NAlexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: NStephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      97a32539
    • P
      asm-generic/tlb: provide MMU_GATHER_TABLE_FREE · 0d6e24d4
      Peter Zijlstra 提交于
      As described in the comment, the correct order for freeing pages is:
      
       1) unhook page
       2) TLB invalidate page
       3) free page
      
      This order equally applies to page directories.
      
      Currently there are two correct options:
      
       - use tlb_remove_page(), when all page directores are full pages and
         there are no futher contraints placed by things like software
         walkers (HAVE_FAST_GUP).
      
       - use MMU_GATHER_RCU_TABLE_FREE and tlb_remove_table() when the
         architecture does not do IPI based TLB invalidate and has
         HAVE_FAST_GUP (or software TLB fill).
      
      This however leaves architectures that don't have page based directories
      but don't need RCU in a bind.  For those, provide MMU_GATHER_TABLE_FREE,
      which provides the independent batching for directories without the
      additional RCU freeing.
      
      Link: http://lkml.kernel.org/r/20200116064531.483522-10-aneesh.kumar@linux.ibm.comSigned-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0d6e24d4
    • P
      asm-generic/tlb: rename HAVE_MMU_GATHER_NO_GATHER · 580a586c
      Peter Zijlstra 提交于
      Towards a more consistent naming scheme.
      
      Link: http://lkml.kernel.org/r/20200116064531.483522-9-aneesh.kumar@linux.ibm.comSigned-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      580a586c
    • P
      asm-generic/tlb: rename HAVE_MMU_GATHER_PAGE_SIZE · 3af4bd03
      Peter Zijlstra 提交于
      Towards a more consistent naming scheme.
      
      Link: http://lkml.kernel.org/r/20200116064531.483522-8-aneesh.kumar@linux.ibm.comSigned-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3af4bd03
    • P
      asm-generic/tlb: rename HAVE_RCU_TABLE_FREE · ff2e6d72
      Peter Zijlstra 提交于
      Towards a more consistent naming scheme.
      
      [akpm@linux-foundation.org: fix sparc64 Kconfig]
      Link: http://lkml.kernel.org/r/20200116064531.483522-7-aneesh.kumar@linux.ibm.comSigned-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ff2e6d72
    • P
      mm/mmu_gather: invalidate TLB correctly on batch allocation failure and flush · 0ed13259
      Peter Zijlstra 提交于
      Architectures for which we have hardware walkers of Linux page table
      should flush TLB on mmu gather batch allocation failures and batch flush.
      Some architectures like POWER supports multiple translation modes (hash
      and radix) and in the case of POWER only radix translation mode needs the
      above TLBI.  This is because for hash translation mode kernel wants to
      avoid this extra flush since there are no hardware walkers of linux page
      table.  With radix translation, the hardware also walks linux page table
      and with that, kernel needs to make sure to TLB invalidate page walk cache
      before page table pages are freed.
      
      More details in commit d86564a2 ("mm/tlb, x86/mm: Support invalidating
      TLB caches for RCU_TABLE_FREE")
      
      The changes to sparc are to make sure we keep the old behavior since we
      are now removing HAVE_RCU_TABLE_NO_INVALIDATE.  The default value for
      tlb_needs_table_invalidate is to always force an invalidate and sparc can
      avoid the table invalidate.  Hence we define tlb_needs_table_invalidate to
      false for sparc architecture.
      
      Link: http://lkml.kernel.org/r/20200116064531.483522-3-aneesh.kumar@linux.ibm.com
      Fixes: a46cc7a9 ("powerpc/mm/radix: Improve TLB/PWC flushes")
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org
      Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Acked-by: Michael Ellerman <mpe@ellerman.id.au>	[powerpc]
      Cc: <stable@vger.kernel.org>	[4.14+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0ed13259
    • S
      x86: mm: avoid allocating struct mm_struct on the stack · e47690d7
      Steven Price 提交于
      struct mm_struct is quite large (~1664 bytes) and so allocating on the
      stack may cause problems as the kernel stack size is small.
      
      Since ptdump_walk_pgd_level_core() was only allocating the structure so
      that it could modify the pgd argument we can instead introduce a pgd
      override in struct mm_walk and pass this down the call stack to where it
      is needed.
      
      Since the correct mm_struct is now being passed down, it is now also
      unnecessary to take the mmap_sem semaphore because ptdump_walk_pgd() will
      now take the semaphore on the real mm.
      
      [steven.price@arm.com: restore missed arm64 changes]
        Link: http://lkml.kernel.org/r/20200108145710.34314-1-steven.price@arm.com
      Link: http://lkml.kernel.org/r/20200108145710.34314-1-steven.price@arm.comSigned-off-by: NSteven Price <steven.price@arm.com>
      Reported-by: NStephen Rothwell <sfr@canb.auug.org.au>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Alexandre Ghiti <alex@ghiti.fr>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Hogan <jhogan@kernel.org>
      Cc: James Morse <james.morse@arm.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: "Liang, Kan" <kan.liang@linux.intel.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Burton <paul.burton@mips.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Zong Li <zong.li@sifive.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e47690d7
    • S
      mm: ptdump: reduce level numbers by 1 in note_page() · f8f0d0b6
      Steven Price 提交于
      Rather than having to increment the 'depth' number by 1 in ptdump_hole(),
      let's change the meaning of 'level' in note_page() since that makes the
      code simplier.
      
      Note that for x86, the level numbers were previously increased by 1 in
      commit 45dcd209 ("x86/mm/dump_pagetables: Fix printout of p4d level")
      and the comment "Bit 7 has a different meaning" was not updated, so this
      change also makes the code match the comment again.
      
      Link: http://lkml.kernel.org/r/20191218162402.45610-24-steven.price@arm.comSigned-off-by: NSteven Price <steven.price@arm.com>
      Reviewed-by: NCatalin Marinas <catalin.marinas@arm.com>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Alexandre Ghiti <alex@ghiti.fr>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Hogan <jhogan@kernel.org>
      Cc: James Morse <james.morse@arm.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: "Liang, Kan" <kan.liang@linux.intel.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Burton <paul.burton@mips.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Zong Li <zong.li@sifive.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f8f0d0b6
    • S
      mm: add generic ptdump · 30d621f6
      Steven Price 提交于
      Add a generic version of page table dumping that architectures can opt-in
      to.
      
      Link: http://lkml.kernel.org/r/20191218162402.45610-20-steven.price@arm.comSigned-off-by: NSteven Price <steven.price@arm.com>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Alexandre Ghiti <alex@ghiti.fr>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Hogan <jhogan@kernel.org>
      Cc: James Morse <james.morse@arm.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: "Liang, Kan" <kan.liang@linux.intel.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Burton <paul.burton@mips.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Zong Li <zong.li@sifive.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      30d621f6
    • S
      mm: pagewalk: add 'depth' parameter to pte_hole · b7a16c7a
      Steven Price 提交于
      The pte_hole() callback is called at multiple levels of the page tables.
      Code dumping the kernel page tables needs to know what at what depth the
      missing entry is.  Add this is an extra parameter to pte_hole().  When the
      depth isn't know (e.g.  processing a vma) then -1 is passed.
      
      The depth that is reported is the actual level where the entry is missing
      (ignoring any folding that is in place), i.e.  any levels where
      PTRS_PER_P?D is set to 1 are ignored.
      
      Note that depth starts at 0 for a PGD so that PUD/PMD/PTE retain their
      natural numbers as levels 2/3/4.
      
      Link: http://lkml.kernel.org/r/20191218162402.45610-16-steven.price@arm.comSigned-off-by: NSteven Price <steven.price@arm.com>
      Tested-by: NZong Li <zong.li@sifive.com>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Alexandre Ghiti <alex@ghiti.fr>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Hogan <jhogan@kernel.org>
      Cc: James Morse <james.morse@arm.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: "Liang, Kan" <kan.liang@linux.intel.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Burton <paul.burton@mips.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b7a16c7a
    • S
      mm: pagewalk: fix termination condition in walk_pte_range() · c02a9875
      Steven Price 提交于
      If walk_pte_range() is called with a 'end' argument that is beyond the
      last page of memory (e.g.  ~0UL) then the comparison between 'addr' and
      'end' will always fail and the loop will be infinite.  Instead change the
      comparison to >= while accounting for overflow.
      
      Link: http://lkml.kernel.org/r/20191218162402.45610-15-steven.price@arm.comSigned-off-by: NSteven Price <steven.price@arm.com>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Alexandre Ghiti <alex@ghiti.fr>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Hogan <jhogan@kernel.org>
      Cc: James Morse <james.morse@arm.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: "Liang, Kan" <kan.liang@linux.intel.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Burton <paul.burton@mips.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Zong Li <zong.li@sifive.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c02a9875
    • S
      mm: pagewalk: don't lock PTEs for walk_page_range_novma() · fbf56346
      Steven Price 提交于
      walk_page_range_novma() can be used to walk page tables or the kernel or
      for firmware.  These page tables may contain entries that are not backed
      by a struct page and so it isn't (in general) possible to take the PTE
      lock for the pte_entry() callback.  So update walk_pte_range() to only
      take the lock when no_vma==false by splitting out the inner loop to a
      separate function and add a comment explaining the difference to
      walk_page_range_novma().
      
      Link: http://lkml.kernel.org/r/20191218162402.45610-14-steven.price@arm.comSigned-off-by: NSteven Price <steven.price@arm.com>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Alexandre Ghiti <alex@ghiti.fr>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Hogan <jhogan@kernel.org>
      Cc: James Morse <james.morse@arm.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: "Liang, Kan" <kan.liang@linux.intel.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Burton <paul.burton@mips.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Zong Li <zong.li@sifive.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fbf56346
    • S
      mm: pagewalk: allow walking without vma · 488ae6a2
      Steven Price 提交于
      Since 48684a65: "mm: pagewalk: fix misbehavior of walk_page_range for
      vma(VM_PFNMAP)", page_table_walk() will report any kernel area as a hole,
      because it lacks a vma.
      
      This means each arch has re-implemented page table walking when needed,
      for example in the per-arch ptdump walker.
      
      Remove the requirement to have a vma in the generic code and add a new
      function walk_page_range_novma() which ignores the VMAs and simply walks
      the page tables.
      
      Link: http://lkml.kernel.org/r/20191218162402.45610-13-steven.price@arm.comSigned-off-by: NSteven Price <steven.price@arm.com>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Alexandre Ghiti <alex@ghiti.fr>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Hogan <jhogan@kernel.org>
      Cc: James Morse <james.morse@arm.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: "Liang, Kan" <kan.liang@linux.intel.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Burton <paul.burton@mips.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Zong Li <zong.li@sifive.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      488ae6a2
    • S
      mm: pagewalk: add p4d_entry() and pgd_entry() · 3afc4236
      Steven Price 提交于
      pgd_entry() and pud_entry() were removed by commit 0b1fbfe5
      ("mm/pagewalk: remove pgd_entry() and pud_entry()") because there were no
      users.  We're about to add users so reintroduce them, along with
      p4d_entry() as we now have 5 levels of tables.
      
      Note that commit a00cc7d9 ("mm, x86: add support for PUD-sized
      transparent hugepages") already re-added pud_entry() but with different
      semantics to the other callbacks.  This commit reverts the semantics back
      to match the other callbacks.
      
      To support hmm.c which now uses the new semantics of pud_entry() a new
      member ('action') of struct mm_walk is added which allows the callbacks to
      either descend (ACTION_SUBTREE, the default), skip (ACTION_CONTINUE) or
      repeat the callback (ACTION_AGAIN).  hmm.c is then updated to call
      pud_trans_huge_lock() itself and make use of the splitting/retry logic of
      the core code.
      
      After this change pud_entry() is called for all entries, not just
      transparent huge pages.
      
      [arnd@arndb.de: fix unused variable warning]
       Link: http://lkml.kernel.org/r/20200107204607.1533842-1-arnd@arndb.de
      Link: http://lkml.kernel.org/r/20191218162402.45610-12-steven.price@arm.comSigned-off-by: NSteven Price <steven.price@arm.com>
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Alexandre Ghiti <alex@ghiti.fr>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Hogan <jhogan@kernel.org>
      Cc: James Morse <james.morse@arm.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: "Liang, Kan" <kan.liang@linux.intel.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Burton <paul.burton@mips.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Zong Li <zong.li@sifive.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3afc4236
    • F
      mm: remove __krealloc · 1c948715
      Florian Westphal 提交于
      Since 5.5-rc1 the last user of this function is gone, so remove the
      functionality.
      
      See commit
      2ad9d774 ("netfilter: conntrack: free extension area immediately")
      for details.
      
      Link: http://lkml.kernel.org/r/20191212223442.22141-1-fw@strlen.deSigned-off-by: NFlorian Westphal <fw@strlen.de>
      Acked-by: NAndrew Morton <akpm@linux-foundation.org>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Reviewed-by: NDavid Hildenbrand <david@redhat.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1c948715