1. 10 12月, 2021 1 次提交
  2. 03 12月, 2021 1 次提交
  3. 29 11月, 2021 2 次提交
  4. 11 11月, 2021 1 次提交
  5. 30 10月, 2021 1 次提交
  6. 12 10月, 2021 1 次提交
    • H
      mm/thp: unmap_mapping_page() to fix THP truncate_cleanup_page() · 1bfa3cc3
      Hugh Dickins 提交于
      stable inclusion
      from stable-5.10.47
      commit 0010275ca243e6260893207d41843bb8dc3846e4
      bugzilla: 172973 https://gitee.com/openeuler/kernel/issues/I4DAKB
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=0010275ca243e6260893207d41843bb8dc3846e4
      
      --------------------------------
      
      [ Upstream commit 22061a1f ]
      
      There is a race between THP unmapping and truncation, when truncate sees
      pmd_none() and skips the entry, after munmap's zap_huge_pmd() cleared
      it, but before its page_remove_rmap() gets to decrement
      compound_mapcount: generating false "BUG: Bad page cache" reports that
      the page is still mapped when deleted.  This commit fixes that, but not
      in the way I hoped.
      
      The first attempt used try_to_unmap(page, TTU_SYNC|TTU_IGNORE_MLOCK)
      instead of unmap_mapping_range() in truncate_cleanup_page(): it has
      often been an annoyance that we usually call unmap_mapping_range() with
      no pages locked, but there apply it to a single locked page.
      try_to_unmap() looks more suitable for a single locked page.
      
      However, try_to_unmap_one() contains a VM_BUG_ON_PAGE(!pvmw.pte,page):
      it is used to insert THP migration entries, but not used to unmap THPs.
      Copy zap_huge_pmd() and add THP handling now? Perhaps, but their TLB
      needs are different, I'm too ignorant of the DAX cases, and couldn't
      decide how far to go for anon+swap.  Set that aside.
      
      The second attempt took a different tack: make no change in truncate.c,
      but modify zap_huge_pmd() to insert an invalidated huge pmd instead of
      clearing it initially, then pmd_clear() between page_remove_rmap() and
      unlocking at the end.  Nice.  But powerpc blows that approach out of the
      water, with its serialize_against_pte_lookup(), and interesting pgtable
      usage.  It would need serious help to get working on powerpc (with a
      minor optimization issue on s390 too).  Set that aside.
      
      Just add an "if (page_mapped(page)) synchronize_rcu();" or other such
      delay, after unmapping in truncate_cleanup_page()? Perhaps, but though
      that's likely to reduce or eliminate the number of incidents, it would
      give less assurance of whether we had identified the problem correctly.
      
      This successful iteration introduces "unmap_mapping_page(page)" instead
      of try_to_unmap(), and goes the usual unmap_mapping_range_tree() route,
      with an addition to details.  Then zap_pmd_range() watches for this
      case, and does spin_unlock(pmd_lock) if so - just like
      page_vma_mapped_walk() now does in the PVMW_SYNC case.  Not pretty, but
      safe.
      
      Note that unmap_mapping_page() is doing a VM_BUG_ON(!PageLocked) to
      assert its interface; but currently that's only used to make sure that
      page->mapping is stable, and zap_pmd_range() doesn't care if the page is
      locked or not.  Along these lines, in invalidate_inode_pages2_range()
      move the initial unmap_mapping_range() out from under page lock, before
      then calling unmap_mapping_page() under page lock if still mapped.
      
      Link: https://lkml.kernel.org/r/a2a4a148-cdd8-942c-4ef8-51b77f643dbe@google.com
      Fixes: fc127da0 ("truncate: handle file thp")
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: NYang Shi <shy828301@gmail.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jue Wang <juew@google.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Wang Yugui <wangyugui@e16-tech.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      
      Note on stable backport: fixed up call to truncate_cleanup_page()
      in truncate_inode_pages_range().
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NChen Jun <chenjun102@huawei.com>
      Acked-by: NWeilong Chen <chenweilong@huawei.com>
      Signed-off-by: NChen Jun <chenjun102@huawei.com>
      
      Conflict:
      	mm/truncate.c
      [Backport from mainline 22061a1f]
      Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
      1bfa3cc3
  7. 19 7月, 2021 1 次提交
  8. 16 7月, 2021 1 次提交
  9. 14 7月, 2021 4 次提交
  10. 03 6月, 2021 1 次提交
  11. 19 4月, 2021 1 次提交
    • A
      kasan: fix per-page tags for non-page_alloc pages · f2756b96
      Andrey Konovalov 提交于
      stable inclusion
      from stable-5.10.27
      commit 6e63cc1fe2532d1aa851a540677e29ba802bf071
      bugzilla: 51493
      
      --------------------------------
      
      commit cf10bd4c upstream.
      
      To allow performing tag checks on page_alloc addresses obtained via
      page_address(), tag-based KASAN modes store tags for page_alloc
      allocations in page->flags.
      
      Currently, the default tag value stored in page->flags is 0x00.
      Therefore, page_address() returns a 0x00ffff...  address for pages that
      were not allocated via page_alloc.
      
      This might cause problems.  A particular case we encountered is a
      conflict with KFENCE.  If a KFENCE-allocated slab object is being freed
      via kfree(page_address(page) + offset), the address passed to kfree()
      will get tagged with 0x00 (as slab pages keep the default per-page
      tags).  This leads to is_kfence_address() check failing, and a KFENCE
      object ending up in normal slab freelist, which causes memory
      corruptions.
      
      This patch changes the way KASAN stores tag in page-flags: they are now
      stored xor'ed with 0xff.  This way, KASAN doesn't need to initialize
      per-page flags for every created page, which might be slow.
      
      With this change, page_address() returns natively-tagged (with 0xff)
      pointers for pages that didn't have tags set explicitly.
      
      This patch fixes the encountered conflict with KFENCE and prevents more
      similar issues that can occur in the future.
      
      Link: https://lkml.kernel.org/r/1a41abb11c51b264511d9e71c303bb16d5cb367b.1615475452.git.andreyknvl@google.com
      Fixes: 2813b9c0 ("kasan, mm, arm64: tag non slab memory allocated via pagealloc")
      Signed-off-by: NAndrey Konovalov <andreyknvl@google.com>
      Reviewed-by: NMarco Elver <elver@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Peter Collingbourne <pcc@google.com>
      Cc: Evgenii Stepanov <eugenis@google.com>
      Cc: Branislav Rankov <Branislav.Rankov@arm.com>
      Cc: Kevin Brodsky <kevin.brodsky@arm.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NChen Jun <chenjun102@huawei.com>
      Acked-by: N  Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
      f2756b96
  12. 09 4月, 2021 2 次提交
  13. 18 1月, 2021 1 次提交
    • B
      mm: memmap defer init doesn't work as expected · ad6a5557
      Baoquan He 提交于
      stable inclusion
      from stable-5.10.5
      commit 98b57685c26d8f41040ecf71e190250fb2eb2a0c
      bugzilla: 46931
      
      --------------------------------
      
      commit dc2da7b4 upstream.
      
      VMware observed a performance regression during memmap init on their
      platform, and bisected to commit 73a6e474 ("mm: memmap_init:
      iterate over memblock regions rather that check each PFN") causing it.
      
      Before the commit:
      
        [0.033176] Normal zone: 1445888 pages used for memmap
        [0.033176] Normal zone: 89391104 pages, LIFO batch:63
        [0.035851] ACPI: PM-Timer IO Port: 0x448
      
      With commit
      
        [0.026874] Normal zone: 1445888 pages used for memmap
        [0.026875] Normal zone: 89391104 pages, LIFO batch:63
        [2.028450] ACPI: PM-Timer IO Port: 0x448
      
      The root cause is the current memmap defer init doesn't work as expected.
      
      Before, memmap_init_zone() was used to do memmap init of one whole zone,
      to initialize all low zones of one numa node, but defer memmap init of
      the last zone in that numa node.  However, since commit 73a6e474,
      function memmap_init() is adapted to iterater over memblock regions
      inside one zone, then call memmap_init_zone() to do memmap init for each
      region.
      
      E.g, on VMware's system, the memory layout is as below, there are two
      memory regions in node 2.  The current code will mistakenly initialize the
      whole 1st region [mem 0xab00000000-0xfcffffffff], then do memmap defer to
      iniatialize only one memmory section on the 2nd region [mem
      0x10000000000-0x1033fffffff].  In fact, we only expect to see that there's
      only one memory section's memmap initialized.  That's why more time is
      costed at the time.
      
      [    0.008842] ACPI: SRAT: Node 0 PXM 0 [mem 0x00000000-0x0009ffff]
      [    0.008842] ACPI: SRAT: Node 0 PXM 0 [mem 0x00100000-0xbfffffff]
      [    0.008843] ACPI: SRAT: Node 0 PXM 0 [mem 0x100000000-0x55ffffffff]
      [    0.008844] ACPI: SRAT: Node 1 PXM 1 [mem 0x5600000000-0xaaffffffff]
      [    0.008844] ACPI: SRAT: Node 2 PXM 2 [mem 0xab00000000-0xfcffffffff]
      [    0.008845] ACPI: SRAT: Node 2 PXM 2 [mem 0x10000000000-0x1033fffffff]
      
      Now, let's add a parameter 'zone_end_pfn' to memmap_init_zone() to pass
      down the real zone end pfn so that defer_init() can use it to judge
      whether defer need be taken in zone wide.
      
      Link: https://lkml.kernel.org/r/20201223080811.16211-1-bhe@redhat.com
      Link: https://lkml.kernel.org/r/20201223080811.16211-2-bhe@redhat.com
      Fixes: commit 73a6e474 ("mm: memmap_init: iterate over memblock regions rather that check each PFN")
      Signed-off-by: NBaoquan He <bhe@redhat.com>
      Reported-by: NRahul Gopakumar <gopakumarr@vmware.com>
      Reviewed-by: NMike Rapoport <rppt@linux.ibm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NChen Jun <chenjun102@huawei.com>
      Acked-by: NXie XiuQi <xiexiuqi@huawei.com>
      ad6a5557
  14. 03 11月, 2020 1 次提交
    • J
      mm: always have io_remap_pfn_range() set pgprot_decrypted() · f8f6ae5d
      Jason Gunthorpe 提交于
      The purpose of io_remap_pfn_range() is to map IO memory, such as a
      memory mapped IO exposed through a PCI BAR.  IO devices do not
      understand encryption, so this memory must always be decrypted.
      Automatically call pgprot_decrypted() as part of the generic
      implementation.
      
      This fixes a bug where enabling AMD SME causes subsystems, such as RDMA,
      using io_remap_pfn_range() to expose BAR pages to user space to fail.
      The CPU will encrypt access to those BAR pages instead of passing
      unencrypted IO directly to the device.
      
      Places not mapping IO should use remap_pfn_range().
      
      Fixes: aca20d54 ("x86/mm: Add support to make use of Secure Memory Encryption")
      Signed-off-by: NJason Gunthorpe <jgg@nvidia.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brijesh Singh <brijesh.singh@amd.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: "Dave Young" <dyoung@redhat.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Toshimitsu Kani <toshi.kani@hpe.com>
      Cc: <stable@vger.kernel.org>
      Link: https://lkml.kernel.org/r/0-v1-025d64bdf6c4+e-amd_sme_fix_jgg@nvidia.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f8f6ae5d
  15. 19 10月, 2020 1 次提交
    • M
      mm/madvise: pass mm to do_madvise · 0726b01e
      Minchan Kim 提交于
      Patch series "introduce memory hinting API for external process", v9.
      
      Now, we have MADV_PAGEOUT and MADV_COLD as madvise hinting API.  With
      that, application could give hints to kernel what memory range are
      preferred to be reclaimed.  However, in some platform(e.g., Android), the
      information required to make the hinting decision is not known to the app.
      Instead, it is known to a centralized userspace daemon(e.g.,
      ActivityManagerService), and that daemon must be able to initiate reclaim
      on its own without any app involvement.
      
      To solve the concern, this patch introduces new syscall -
      process_madvise(2).  Bascially, it's same with madvise(2) syscall but it
      has some differences.
      
      1. It needs pidfd of target process to provide the hint
      
      2. It supports only MADV_{COLD|PAGEOUT|MERGEABLE|UNMEREABLE} at this
         moment.  Other hints in madvise will be opened when there are explicit
         requests from community to prevent unexpected bugs we couldn't support.
      
      3. Only privileged processes can do something for other process's
         address space.
      
      For more detail of the new API, please see "mm: introduce external memory
      hinting API" description in this patchset.
      
      This patch (of 3):
      
      In upcoming patches, do_madvise will be called from external process
      context so we shouldn't asssume "current" is always hinted process's
      task_struct.
      
      Furthermore, we must not access mm_struct via task->mm, but obtain it via
      access_mm() once (in the following patch) and only use that pointer [1],
      so pass it to do_madvise() as well.  Note the vma->vm_mm pointers are
      safe, so we can use them further down the call stack.
      
      And let's pass current->mm as arguments of do_madvise so it shouldn't
      change existing behavior but prepare next patch to make review easy.
      
      [vbabka@suse.cz: changelog tweak]
      [minchan@kernel.org: use current->mm for io_uring]
        Link: http://lkml.kernel.org/r/20200423145215.72666-1-minchan@kernel.org
      [akpm@linux-foundation.org: fix it for upstream changes]
      [akpm@linux-foundation.org: whoops]
      [rdunlap@infradead.org: add missing includes]
      Signed-off-by: NMinchan Kim <minchan@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NSuren Baghdasaryan <surenb@google.com>
      Reviewed-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jann Horn <jannh@google.com>
      Cc: Tim Murray <timmurray@google.com>
      Cc: Daniel Colascione <dancol@google.com>
      Cc: Sandeep Patil <sspatil@google.com>
      Cc: Sonny Rao <sonnyrao@google.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: John Dias <joaodias@google.com>
      Cc: Joel Fernandes <joel@joelfernandes.org>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: SeongJae Park <sj38.park@gmail.com>
      Cc: Christian Brauner <christian@brauner.io>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Oleksandr Natalenko <oleksandr@redhat.com>
      Cc: SeongJae Park <sjpark@amazon.de>
      Cc: Christian Brauner <christian.brauner@ubuntu.com>
      Cc: Florian Weimer <fw@deneb.enyo.de>
      Cc: <linux-man@vger.kernel.org>
      Link: https://lkml.kernel.org/r/20200901000633.1920247-1-minchan@kernel.org
      Link: http://lkml.kernel.org/r/20200622192900.22757-1-minchan@kernel.org
      Link: http://lkml.kernel.org/r/20200302193630.68771-2-minchan@kernel.org
      Link: http://lkml.kernel.org/r/20200622192900.22757-2-minchan@kernel.org
      Link: https://lkml.kernel.org/r/20200901000633.1920247-2-minchan@kernel.orgSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0726b01e
  16. 17 10月, 2020 4 次提交
  17. 14 10月, 2020 3 次提交
  18. 28 9月, 2020 1 次提交
  19. 27 9月, 2020 1 次提交
    • L
      mm: replace memmap_context by meminit_context · c1d0da83
      Laurent Dufour 提交于
      Patch series "mm: fix memory to node bad links in sysfs", v3.
      
      Sometimes, firmware may expose interleaved memory layout like this:
      
       Early memory node ranges
         node   1: [mem 0x0000000000000000-0x000000011fffffff]
         node   2: [mem 0x0000000120000000-0x000000014fffffff]
         node   1: [mem 0x0000000150000000-0x00000001ffffffff]
         node   0: [mem 0x0000000200000000-0x000000048fffffff]
         node   2: [mem 0x0000000490000000-0x00000007ffffffff]
      
      In that case, we can see memory blocks assigned to multiple nodes in
      sysfs:
      
        $ ls -l /sys/devices/system/memory/memory21
        total 0
        lrwxrwxrwx 1 root root     0 Aug 24 05:27 node1 -> ../../node/node1
        lrwxrwxrwx 1 root root     0 Aug 24 05:27 node2 -> ../../node/node2
        -rw-r--r-- 1 root root 65536 Aug 24 05:27 online
        -r--r--r-- 1 root root 65536 Aug 24 05:27 phys_device
        -r--r--r-- 1 root root 65536 Aug 24 05:27 phys_index
        drwxr-xr-x 2 root root     0 Aug 24 05:27 power
        -r--r--r-- 1 root root 65536 Aug 24 05:27 removable
        -rw-r--r-- 1 root root 65536 Aug 24 05:27 state
        lrwxrwxrwx 1 root root     0 Aug 24 05:25 subsystem -> ../../../../bus/memory
        -rw-r--r-- 1 root root 65536 Aug 24 05:25 uevent
        -r--r--r-- 1 root root 65536 Aug 24 05:27 valid_zones
      
      The same applies in the node's directory with a memory21 link in both
      the node1 and node2's directory.
      
      This is wrong but doesn't prevent the system to run.  However when
      later, one of these memory blocks is hot-unplugged and then hot-plugged,
      the system is detecting an inconsistency in the sysfs layout and a
      BUG_ON() is raised:
      
        kernel BUG at /Users/laurent/src/linux-ppc/mm/memory_hotplug.c:1084!
        LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
        Modules linked in: rpadlpar_io rpaphp pseries_rng rng_core vmx_crypto gf128mul binfmt_misc ip_tables x_tables xfs libcrc32c crc32c_vpmsum autofs4
        CPU: 8 PID: 10256 Comm: drmgr Not tainted 5.9.0-rc1+ #25
        Call Trace:
          add_memory_resource+0x23c/0x340 (unreliable)
          __add_memory+0x5c/0xf0
          dlpar_add_lmb+0x1b4/0x500
          dlpar_memory+0x1f8/0xb80
          handle_dlpar_errorlog+0xc0/0x190
          dlpar_store+0x198/0x4a0
          kobj_attr_store+0x30/0x50
          sysfs_kf_write+0x64/0x90
          kernfs_fop_write+0x1b0/0x290
          vfs_write+0xe8/0x290
          ksys_write+0xdc/0x130
          system_call_exception+0x160/0x270
          system_call_common+0xf0/0x27c
      
      This has been seen on PowerPC LPAR.
      
      The root cause of this issue is that when node's memory is registered,
      the range used can overlap another node's range, thus the memory block
      is registered to multiple nodes in sysfs.
      
      There are two issues here:
      
       (a) The sysfs memory and node's layouts are broken due to these
           multiple links
      
       (b) The link errors in link_mem_sections() should not lead to a system
           panic.
      
      To address (a) register_mem_sect_under_node should not rely on the
      system state to detect whether the link operation is triggered by a hot
      plug operation or not.  This is addressed by the patches 1 and 2 of this
      series.
      
      Issue (b) will be addressed separately.
      
      This patch (of 2):
      
      The memmap_context enum is used to detect whether a memory operation is
      due to a hot-add operation or happening at boot time.
      
      Make it general to the hotplug operation and rename it as
      meminit_context.
      
      There is no functional change introduced by this patch
      Suggested-by: NDavid Hildenbrand <david@redhat.com>
      Signed-off-by: NLaurent Dufour <ldufour@linux.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NDavid Hildenbrand <david@redhat.com>
      Reviewed-by: NOscar Salvador <osalvador@suse.de>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "Rafael J . Wysocki" <rafael@kernel.org>
      Cc: Nathan Lynch <nathanl@linux.ibm.com>
      Cc: Scott Cheloha <cheloha@linux.ibm.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: <stable@vger.kernel.org>
      Link: https://lkml.kernel.org/r/20200915094143.79181-1-ldufour@linux.ibm.com
      Link: https://lkml.kernel.org/r/20200915132624.9723-1-ldufour@linux.ibm.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c1d0da83
  20. 18 9月, 2020 1 次提交
    • L
      mm: allow a controlled amount of unfairness in the page lock · 5ef64cc8
      Linus Torvalds 提交于
      Commit 2a9127fc ("mm: rewrite wait_on_page_bit_common() logic") made
      the page locking entirely fair, in that if a waiter came in while the
      lock was held, the lock would be transferred to the lockers strictly in
      order.
      
      That was intended to finally get rid of the long-reported watchdog
      failures that involved the page lock under extreme load, where a process
      could end up waiting essentially forever, as other page lockers stole
      the lock from under it.
      
      It also improved some benchmarks, but it ended up causing huge
      performance regressions on others, simply because fair lock behavior
      doesn't end up giving out the lock as aggressively, causing better
      worst-case latency, but potentially much worse average latencies and
      throughput.
      
      Instead of reverting that change entirely, this introduces a controlled
      amount of unfairness, with a sysctl knob to tune it if somebody needs
      to.  But the default value should hopefully be good for any normal load,
      allowing a few rounds of lock stealing, but enforcing the strict
      ordering before the lock has been stolen too many times.
      
      There is also a hint from Matthieu Baerts that the fair page coloring
      may end up exposing an ABBA deadlock that is hidden by the usual
      optimistic lock stealing, and while the unfairness doesn't fix the
      fundamental issue (and I'm still looking at that), it avoids it in
      practice.
      
      The amount of unfairness can be modified by writing a new value to the
      'sysctl_page_lock_unfairness' variable (default value of 5, exposed
      through /proc/sys/vm/page_lock_unfairness), but that is hopefully
      something we'd use mainly for debugging rather than being necessary for
      any deep system tuning.
      
      This whole issue has exposed just how critical the page lock can be, and
      how contended it gets under certain locks.  And the main contention
      doesn't really seem to be anything related to IO (which was the origin
      of this lock), but for things like just verifying that the page file
      mapping is stable while faulting in the page into a page table.
      
      Link: https://lore.kernel.org/linux-fsdevel/ed8442fd-6f54-dd84-cd4a-941e8b7ee603@MichaelLarabel.com/
      Link: https://www.phoronix.com/scan.php?page=article&item=linux-50-59&num=1
      Link: https://lore.kernel.org/linux-fsdevel/c560a38d-8313-51fb-b1ec-e904bd8836bc@tessares.net/Reported-and-tested-by: NMichael Larabel <Michael@michaellarabel.com>
      Tested-by: NMatthieu Baerts <matthieu.baerts@tessares.net>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Chris Mason <clm@fb.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Amir Goldstein <amir73il@gmail.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5ef64cc8
  21. 04 9月, 2020 1 次提交
    • C
      arm64: mte: Add PROT_MTE support to mmap() and mprotect() · 9f341931
      Catalin Marinas 提交于
      To enable tagging on a memory range, the user must explicitly opt in via
      a new PROT_MTE flag passed to mmap() or mprotect(). Since this is a new
      memory type in the AttrIndx field of a pte, simplify the or'ing of these
      bits over the protection_map[] attributes by making MT_NORMAL index 0.
      
      There are two conditions for arch_vm_get_page_prot() to return the
      MT_NORMAL_TAGGED memory type: (1) the user requested it via PROT_MTE,
      registered as VM_MTE in the vm_flags, and (2) the vma supports MTE,
      decided during the mmap() call (only) and registered as VM_MTE_ALLOWED.
      
      arch_calc_vm_prot_bits() is responsible for registering the user request
      as VM_MTE. The newly introduced arch_calc_vm_flag_bits() sets
      VM_MTE_ALLOWED if the mapping is MAP_ANONYMOUS. An MTE-capable
      filesystem (RAM-based) may be able to set VM_MTE_ALLOWED during its
      mmap() file ops call.
      
      In addition, update VM_DATA_DEFAULT_FLAGS to allow mprotect(PROT_MTE) on
      stack or brk area.
      
      The Linux mmap() syscall currently ignores unknown PROT_* flags. In the
      presence of MTE, an mmap(PROT_MTE) on a file which does not support MTE
      will not report an error and the memory will not be mapped as Normal
      Tagged. For consistency, mprotect(PROT_MTE) will not report an error
      either if the memory range does not support MTE. Two subsequent patches
      in the series will propose tightening of this behaviour.
      Co-developed-by: NVincenzo Frascino <vincenzo.frascino@arm.com>
      Signed-off-by: NVincenzo Frascino <vincenzo.frascino@arm.com>
      Signed-off-by: NCatalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will@kernel.org>
      9f341931
  22. 24 8月, 2020 2 次提交
  23. 15 8月, 2020 4 次提交
  24. 13 8月, 2020 3 次提交
    • P
      mm/gup: remove task_struct pointer for all gup code · 64019a2e
      Peter Xu 提交于
      After the cleanup of page fault accounting, gup does not need to pass
      task_struct around any more.  Remove that parameter in the whole gup
      stack.
      Signed-off-by: NPeter Xu <peterx@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NJohn Hubbard <jhubbard@nvidia.com>
      Link: http://lkml.kernel.org/r/20200707225021.200906-26-peterx@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      64019a2e
    • P
      mm: do page fault accounting in handle_mm_fault · bce617ed
      Peter Xu 提交于
      Patch series "mm: Page fault accounting cleanups", v5.
      
      This is v5 of the pf accounting cleanup series.  It originates from Gerald
      Schaefer's report on an issue a week ago regarding to incorrect page fault
      accountings for retried page fault after commit 4064b982 ("mm: allow
      VM_FAULT_RETRY for multiple times"):
      
        https://lore.kernel.org/lkml/20200610174811.44b94525@thinkpad/
      
      What this series did:
      
        - Correct page fault accounting: we do accounting for a page fault
          (no matter whether it's from #PF handling, or gup, or anything else)
          only with the one that completed the fault.  For example, page fault
          retries should not be counted in page fault counters.  Same to the
          perf events.
      
        - Unify definition of PERF_COUNT_SW_PAGE_FAULTS: currently this perf
          event is used in an adhoc way across different archs.
      
          Case (1): for many archs it's done at the entry of a page fault
          handler, so that it will also cover e.g.  errornous faults.
      
          Case (2): for some other archs, it is only accounted when the page
          fault is resolved successfully.
      
          Case (3): there're still quite some archs that have not enabled
          this perf event.
      
          Since this series will touch merely all the archs, we unify this
          perf event to always follow case (1), which is the one that makes most
          sense.  And since we moved the accounting into handle_mm_fault, the
          other two MAJ/MIN perf events are well taken care of naturally.
      
        - Unify definition of "major faults": the definition of "major
          fault" is slightly changed when used in accounting (not
          VM_FAULT_MAJOR).  More information in patch 1.
      
        - Always account the page fault onto the one that triggered the page
          fault.  This does not matter much for #PF handlings, but mostly for
          gup.  More information on this in patch 25.
      
      Patchset layout:
      
      Patch 1:     Introduced the accounting in handle_mm_fault(), not enabled.
      Patch 2-23:  Enable the new accounting for arch #PF handlers one by one.
      Patch 24:    Enable the new accounting for the rest outliers (gup, iommu, etc.)
      Patch 25:    Cleanup GUP task_struct pointer since it's not needed any more
      
      This patch (of 25):
      
      This is a preparation patch to move page fault accountings into the
      general code in handle_mm_fault().  This includes both the per task
      flt_maj/flt_min counters, and the major/minor page fault perf events.  To
      do this, the pt_regs pointer is passed into handle_mm_fault().
      
      PERF_COUNT_SW_PAGE_FAULTS should still be kept in per-arch page fault
      handlers.
      
      So far, all the pt_regs pointer that passed into handle_mm_fault() is
      NULL, which means this patch should have no intented functional change.
      Suggested-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NPeter Xu <peterx@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Cain <bcain@codeaurora.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: Greentime Hu <green.hu@gmail.com>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Jonas Bonn <jonas@southpole.se>
      Cc: Ley Foon Tan <ley.foon.tan@intel.com>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Nick Hu <nickhu@andestech.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vincent Chen <deanbo422@gmail.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Link: http://lkml.kernel.org/r/20200707225021.200906-1-peterx@redhat.com
      Link: http://lkml.kernel.org/r/20200707225021.200906-2-peterx@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bce617ed
    • R