1. 30 June 2021, 1 commit
    • mm/page_alloc: fix memory map initialization for descending nodes · 122e093c
      Mike Rapoport authored
      On systems with memory nodes sorted in descending order, for instance Dell
      Precision WorkStation T5500, the struct pages for higher PFNs and
      respectively lower nodes, could be overwritten by the initialization of
      struct pages corresponding to the holes in the memory sections.
      
      For example for the below memory layout
      
      [    0.245624] Early memory node ranges
      [    0.248496]   node   1: [mem 0x0000000000001000-0x0000000000090fff]
      [    0.251376]   node   1: [mem 0x0000000000100000-0x00000000dbdf8fff]
      [    0.254256]   node   1: [mem 0x0000000100000000-0x0000001423ffffff]
      [    0.257144]   node   0: [mem 0x0000001424000000-0x0000002023ffffff]
      
      the range 0x1424000000 - 0x1428000000 in the beginning of node 0 starts in
      the middle of a section and will be considered as a hole during the
      initialization of the last section in node 1.
      
      The wrong initialization of the memory map causes panic on boot when
      CONFIG_DEBUG_VM is enabled.
      
      Reorder the loops of the memory map initialization so that the outer
      loop always iterates over populated memory regions in ascending order
      and the inner loop selects the zone corresponding to the PFN range.
      
      This way, initialization of the struct pages for the memory holes is
      always done for ranges that are actually not populated.
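      
      Below is a minimal sketch of the reordered loops, modelled on
      memmap_init() in mm/page_alloc.c after this change (helper names
      approximate the upstream code; the final hole handling is omitted):
      
        void __init memmap_init(void)
        {
                unsigned long start_pfn, end_pfn, hole_pfn = 0;
                int i, j, nid;
        
                /* Outer loop: populated memory regions, ascending PFN order. */
                for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid) {
                        struct pglist_data *node = NODE_DATA(nid);
        
                        /* Inner loop: pick the zone(s) covering this range. */
                        for (j = 0; j < MAX_NR_ZONES; j++) {
                                struct zone *zone = node->node_zones + j;
        
                                if (!populated_zone(zone))
                                        continue;
        
                                memmap_init_zone_range(zone, start_pfn, end_pfn,
                                                       &hole_pfn);
                        }
                }
        }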
      
      [akpm@linux-foundation.org: coding style fixes]
      
      Link: https://lkml.kernel.org/r/YNXlMqBbL+tBG7yq@kernel.org
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=213073
      Link: https://lkml.kernel.org/r/20210624062305.10940-1-rppt@kernel.org
      Fixes: 0740a50b ("mm/page_alloc.c: refactor initialization of struct page for holes in memory layout")
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Boris Petkov <bp@alien8.de>
      Cc: Robert Shteynfeld <robert.shteynfeld@gmail.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      122e093c
  2. 28 June 2021, 1 commit
    • Revert "signal: Allow tasks to cache one sigqueue struct" · b4b27b9e
      Linus Torvalds authored
      This reverts commits 4bad58eb (and
      399f8dd9, which tried to fix it).
      
      I do not believe these are correct, and I'm about to release 5.13, so am
      reverting them out of an abundance of caution.
      
      The locking is odd, and appears broken.
      
      On the allocation side (in __sigqueue_alloc()), the locking is somewhat
      straightforward: it depends on sighand->siglock.  Since one caller
      doesn't hold that lock, it further then tests 'sigqueue_flags' to avoid
      the case with no locks held.
      
      On the freeing side (in sigqueue_cache_or_free()), there is no locking
      at all, and the logic instead depends on 'current' being a single
      thread, and not able to race with itself.
      
      To make things more exciting, there's also the data race between freeing
      a signal and allocating one, which is handled by using WRITE_ONCE() and
      READ_ONCE(), and being mutually exclusive wrt the initial state (ie
      freeing will only free if the old state was NULL, while allocating will
      obviously only use the value if it was non-NULL, so only one or the
      other will actually act on the value).
      
      However, while the free->alloc paths do seem mutually exclusive thanks
      to just the data value dependency, it's not clear what the memory
      ordering constraints are on it.  Could writes from the previous
      allocation possibly be delayed and seen by the new allocation later,
      causing logical inconsistencies?
      
      So it's all very exciting and unusual.
      
      And in particular, it seems that the freeing side is incorrect in
      depending on "current" being single-threaded.  Yes, 'current' is a
      single thread, but in the presence of asynchronous events even a single
      thread can have data races.
      
      And such asynchronous events can and do happen, with interrupts causing
      signals to be flushed and thus free'd (for example - sending a
      SIGCONT/SIGSTOP can happen from interrupt context, and can flush
      previously queued process control signals).
      
      So regardless of all the other questions about the memory ordering and
      locking for this new cached allocation, the sigqueue_cache_or_free()
      assumptions seem to be fundamentally incorrect.
      
      It may be that people will show me the errors of my ways, and tell me
      why this is all safe after all.  We can reinstate it if so.  But my
      current belief is that the WRITE_ONCE() that sets the cached entry needs
      to be a smp_store_release(), and the READ_ONCE() that finds a cached
      entry needs to be a smp_load_acquire() to handle memory ordering
      correctly.
      
      And the sequence in sigqueue_cache_or_free() would need to either use a
      lock or at least be interrupt-safe some way (perhaps by using something
      like the percpu 'cmpxchg': it doesn't need to be SMP-safe, but like the
      percpu operations it needs to be interrupt-safe).
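      
      As a sketch of the ordering half only (hypothetical helper names,
      illustrative rather than the reverted code, and it still does not
      address the interrupt-safety of the free path discussed above):
      
        /* Free side: publish the cached entry with release semantics. */
        static void cache_sigqueue(struct task_struct *t, struct sigqueue *q)
        {
                /* Pairs with smp_load_acquire() in take_cached_sigqueue(). */
                smp_store_release(&t->sigqueue_cache, q);
        }
        
        /* Alloc side: consume the cached entry with acquire semantics. */
        static struct sigqueue *take_cached_sigqueue(struct task_struct *t)
        {
                struct sigqueue *q = smp_load_acquire(&t->sigqueue_cache);
        
                if (q)
                        WRITE_ONCE(t->sigqueue_cache, NULL);
                return q;
        }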
      
      Fixes: 399f8dd9 ("signal: Prevent sigqueue caching after task got released")
      Fixes: 4bad58eb ("signal: Allow tasks to cache one sigqueue struct")
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Christian Brauner <christian.brauner@ubuntu.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b4b27b9e
  3. 25 June 2021, 4 commits
  4. 22 June 2021, 1 commit
  5. 17 June 2021, 7 commits
    • net/mlx5e: Don't create devices during unload flow · a5ae8fc9
      Dmytro Linkin authored
      Running the devlink reload command for a port in switchdev mode causes
      resource corruption: the driver can't release the allocated EQ and
      reclaim memory pages, because the "rdma" auxiliary device has added CQs
      which block the EQ from deletion.
      The erroneous sequence happens during the reload-down phase and is as
      follows:
      
      1. detach device - suspends auxiliary devices which support it, destroys
         others. During this step "eth-rep" and "rdma-rep" are destroyed,
         "eth" - suspended.
      2. disable SRIOV - moves device to legacy mode; as part of disablement -
         rescans drivers. This step adds "rdma" auxiliary device.
      3. destroy EQ table - <failure>.
      
      The driver shouldn't create any devices during unload flows. To handle
      that, implement the MLX5_PRIV_FLAGS_DETACH flag: set it on device detach
      and clear it on device attach. If the flag is set, the drivers rescan is
      a no-op.
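      
      A condensed sketch of that guard (following the description above;
      surrounding code and exact call sites omitted):
      
        static void mlx5_detach_device(struct mlx5_core_dev *dev)
        {
                ...
                dev->priv.flags |= MLX5_PRIV_FLAGS_DETACH;
        }
        
        static void mlx5_attach_device(struct mlx5_core_dev *dev)
        {
                dev->priv.flags &= ~MLX5_PRIV_FLAGS_DETACH;
                ...
        }
        
        int mlx5_rescan_drivers_locked(struct mlx5_core_dev *dev)
        {
                /* No-op during unload flows: do not create aux devices. */
                if (dev->priv.flags & MLX5_PRIV_FLAGS_DETACH)
                        return 0;
                ...
        }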
      
      Fixes: a925b5e3 ("net/mlx5: Register mlx5 devices to auxiliary virtual bus")
      Signed-off-by: Dmytro Linkin <dlinkin@nvidia.com>
      Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
      Reviewed-by: Roi Dayan <roid@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      a5ae8fc9
    • mm/thp: unmap_mapping_page() to fix THP truncate_cleanup_page() · 22061a1f
      Hugh Dickins authored
      There is a race between THP unmapping and truncation, when truncate sees
      pmd_none() and skips the entry, after munmap's zap_huge_pmd() cleared
      it, but before its page_remove_rmap() gets to decrement
      compound_mapcount: generating false "BUG: Bad page cache" reports that
      the page is still mapped when deleted.  This commit fixes that, but not
      in the way I hoped.
      
      The first attempt used try_to_unmap(page, TTU_SYNC|TTU_IGNORE_MLOCK)
      instead of unmap_mapping_range() in truncate_cleanup_page(): it has
      often been an annoyance that we usually call unmap_mapping_range() with
      no pages locked, but here it would be applied to a single locked page.
      try_to_unmap() looks more suitable for a single locked page.
      
      However, try_to_unmap_one() contains a VM_BUG_ON_PAGE(!pvmw.pte,page):
      it is used to insert THP migration entries, but not used to unmap THPs.
      Copy zap_huge_pmd() and add THP handling now? Perhaps, but their TLB
      needs are different, I'm too ignorant of the DAX cases, and couldn't
      decide how far to go for anon+swap.  Set that aside.
      
      The second attempt took a different tack: make no change in truncate.c,
      but modify zap_huge_pmd() to insert an invalidated huge pmd instead of
      clearing it initially, then pmd_clear() between page_remove_rmap() and
      unlocking at the end.  Nice.  But powerpc blows that approach out of the
      water, with its serialize_against_pte_lookup(), and interesting pgtable
      usage.  It would need serious help to get working on powerpc (with a
      minor optimization issue on s390 too).  Set that aside.
      
      Just add an "if (page_mapped(page)) synchronize_rcu();" or other such
      delay, after unmapping in truncate_cleanup_page()? Perhaps, but though
      that's likely to reduce or eliminate the number of incidents, it would
      give less assurance of whether we had identified the problem correctly.
      
      This successful iteration introduces "unmap_mapping_page(page)" instead
      of try_to_unmap(), and goes the usual unmap_mapping_range_tree() route,
      with an addition to details.  Then zap_pmd_range() watches for this
      case, and does spin_unlock(pmd_lock) if so - just like
      page_vma_mapped_walk() now does in the PVMW_SYNC case.  Not pretty, but
      safe.
      
      Note that unmap_mapping_page() is doing a VM_BUG_ON(!PageLocked) to
      assert its interface; but currently that's only used to make sure that
      page->mapping is stable, and zap_pmd_range() doesn't care if the page is
      locked or not.  Along these lines, in invalidate_inode_pages2_range()
      move the initial unmap_mapping_range() out from under page lock, before
      then calling unmap_mapping_page() under page lock if still mapped.
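      
      A simplified sketch of the new helper's shape (zap_details field names
      approximate; assertions and tail-page handling trimmed):
      
        void unmap_mapping_page(struct page *page)
        {
                struct address_space *mapping = page->mapping;
                struct zap_details details = { };
        
                VM_BUG_ON(!PageLocked(page));   /* keeps page->mapping stable */
        
                details.check_mapping = mapping;
                details.single_page = page;     /* zap_pmd_range() watches for this */
        
                i_mmap_lock_write(mapping);
                if (!RB_EMPTY_ROOT(&mapping->i_mmap.rb_root))
                        unmap_mapping_range_tree(&mapping->i_mmap, &details);
                i_mmap_unlock_write(mapping);
        }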
      
      Link: https://lkml.kernel.org/r/a2a4a148-cdd8-942c-4ef8-51b77f643dbe@google.com
      Fixes: fc127da0 ("truncate: handle file thp")
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: Yang Shi <shy828301@gmail.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jue Wang <juew@google.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Wang Yugui <wangyugui@e16-tech.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      22061a1f
    • mm/thp: try_to_unmap() use TTU_SYNC for safe splitting · 732ed558
      Hugh Dickins authored
      Stressing huge tmpfs often crashed on unmap_page()'s VM_BUG_ON_PAGE
      (!unmap_success): with dump_page() showing mapcount:1, but then its raw
      struct page output showing _mapcount ffffffff i.e.  mapcount 0.
      
      And even if that particular VM_BUG_ON_PAGE(!unmap_success) is removed,
      it is immediately followed by a VM_BUG_ON_PAGE(compound_mapcount(head)),
      and further down an IS_ENABLED(CONFIG_DEBUG_VM) total_mapcount BUG():
      all indicative of some mapcount difficulty in development here perhaps.
      But the !CONFIG_DEBUG_VM path handles the failures correctly and
      silently.
      
      I believe the problem is that once a racing unmap has cleared pte or
      pmd, try_to_unmap_one() may skip taking the page table lock, and emerge
      from try_to_unmap() before the racing task has reached decrementing
      mapcount.
      
      Instead of abandoning the unsafe VM_BUG_ON_PAGE(), and the ones that
      follow, use PVMW_SYNC in try_to_unmap_one() in this case: adding
      TTU_SYNC to the options, and passing that from unmap_page().
      
      When CONFIG_DEBUG_VM, or for non-debug too? Consensus is to do the same
      for both: the slight overhead added should rarely matter, except perhaps
      if splitting sparsely-populated multiply-mapped shmem.  Once confident
      that bugs are fixed, TTU_SYNC here can be removed, and the race
      tolerated.
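      
      The plumbing is roughly (sketch; only the relevant lines shown):
      
        static void unmap_page(struct page *page)
        {
                enum ttu_flags ttu_flags = TTU_IGNORE_MLOCK | TTU_SPLIT_HUGE_PMD |
                                           TTU_SYNC;    /* new */
                ...
        }
        
        static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
                                     unsigned long address, void *arg)
        {
                struct page_vma_mapped_walk pvmw = {
                        .page = page,
                        .vma = vma,
                        .address = address,
                };
                enum ttu_flags flags = (enum ttu_flags)(long)arg;
        
                if (flags & TTU_SYNC)
                        pvmw.flags = PVMW_SYNC;  /* wait on a racing pte lock */
                ...
        }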
      
      Link: https://lkml.kernel.org/r/c1e95853-8bcd-d8fd-55fa-e7f2488e78f@google.com
      Fixes: fec89c10 ("thp: rewrite freeze_page()/unfreeze_page() with generic rmap walkers")
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jue Wang <juew@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Wang Yugui <wangyugui@e16-tech.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      732ed558
    • mm/thp: make is_huge_zero_pmd() safe and quicker · 3b77e8c8
      Hugh Dickins authored
      Most callers of is_huge_zero_pmd() supply a pmd already verified
      present; but a few (notably zap_huge_pmd()) do not - it might be a pmd
      migration entry, in which the pfn is encoded differently from a present
      pmd: which might pass the is_huge_zero_pmd() test (though not on x86,
      since L1TF forced us to protect against that); or perhaps even crash in
      pmd_page() applied to a swap-like entry.
      
      Make it safe by adding pmd_present() check into is_huge_zero_pmd()
      itself; and make it quicker by saving huge_zero_pfn, so that
      is_huge_zero_pmd() will not need to do that pmd_page() lookup each time.
      
      __split_huge_pmd_locked() checked pmd_trans_huge() before: that worked,
      but is unnecessary now that is_huge_zero_pmd() checks present.
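      
      The resulting check is roughly (sketch):
      
        static inline bool is_huge_zero_pmd(pmd_t pmd)
        {
                /* pmd_present() ensures a migration entry never matches;
                 * the cached huge_zero_pfn avoids pmd_page() on every call. */
                return pmd_present(pmd) &&
                       READ_ONCE(huge_zero_pfn) == pmd_pfn(pmd);
        }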
      
      Link: https://lkml.kernel.org/r/21ea9ca-a1f5-8b90-5e88-95fb1c49bbfa@google.com
      Fixes: e71769ae ("mm: enable thp migration for shmem thp")
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: Yang Shi <shy828301@gmail.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jue Wang <juew@google.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Wang Yugui <wangyugui@e16-tech.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3b77e8c8
    • mm/hugetlb: expand restore_reserve_on_error functionality · 846be085
      Mike Kravetz authored
      The routine restore_reserve_on_error is called to restore reservation
      information when an error occurs after page allocation.  The routine
      alloc_huge_page modifies the mapping reserve map and potentially the
      reserve count during allocation.  If code calling alloc_huge_page
      encounters an error after allocation and needs to free the page, the
      reservation information needs to be adjusted.
      
      Currently, restore_reserve_on_error only takes action on pages for which
      the reserve count was adjusted (HPageRestoreReserve flag).  There is
      nothing wrong with these adjustments.  However, alloc_huge_page ALWAYS
      modifies the reserve map during allocation even if the reserve count is
      not adjusted.  This can cause issues as observed during development of
      this patch [1].
      
      One specific series of operations causing an issue is:
      
       - Create a shared hugetlb mapping
         Reservations for all pages created by default
      
       - Fault in a page in the mapping
         Reservation exists so reservation count is decremented
      
       - Punch a hole in the file/mapping at index previously faulted
         Reservation and any associated pages will be removed
      
       - Allocate a page to fill the hole
         No reservation entry, so reserve count unmodified
         Reservation entry added to map by alloc_huge_page
      
       - Error after allocation and before instantiating the page
         Reservation entry remains in map
      
       - Allocate a page to fill the hole
         Reservation entry exists, so decrement reservation count
      
      This will cause a reservation count underflow as the reservation count
      was decremented twice for the same index.
      
      A user would observe a very large number for HugePages_Rsvd in
      /proc/meminfo.  This would also likely cause subsequent allocations of
      hugetlb pages to fail as it would 'appear' that all pages are reserved.
      
      This sequence of operations is unlikely to happen, however they were
      easily reproduced and observed using hacked up code as described in [1].
      
      Address the issue by having the routine restore_reserve_on_error take
      action on pages where HPageRestoreReserve is not set.  In this case, we
      need to remove any reserve map entry created by alloc_huge_page.  A new
      helper routine vma_del_reservation assists with this operation.
      
      There are three callers of alloc_huge_page which do not currently call
      restore_reserve_on_error before freeing a page on error paths.  Add
      those missing calls.
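      
      A rough sketch of the expanded routine (heavily simplified; only
      vma_del_reservation is named in the text, the rest is condensed):
      
        void restore_reserve_on_error(struct hstate *h, struct vm_area_struct *vma,
                                      unsigned long address, struct page *page)
        {
                if (HPageRestoreReserve(page)) {
                        /* Existing behaviour: give back the consumed reserve. */
                        ...
                } else {
                        /*
                         * New: alloc_huge_page() added a reserve map entry even
                         * though no reservation was consumed; delete it so a
                         * later allocation at this index is not double-counted.
                         */
                        vma_del_reservation(h, vma, address);
                }
        }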
      
      [1] https://lore.kernel.org/linux-mm/20210528005029.88088-1-almasrymina@google.com/
      
      Link: https://lkml.kernel.org/r/20210607204510.22617-1-mike.kravetz@oracle.com
      Fixes: 96b96a96 ("mm/hugetlb: fix huge page reservation leak in private mapping error paths")
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Mina Almasry <almasrymina@google.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      846be085
    • mm/swap: fix pte_same_as_swp() not removing uffd-wp bit when compare · 099dd687
      Peter Xu authored
      I found by pure code review that pte_same_as_swp() of unuse_vma() didn't
      take the uffd-wp bit into account when comparing ptes.
      pte_same_as_swp() returning a false negative could cause failure to
      swapoff swap ptes that were write-protected by userfaultfd.
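      
      The fix is along these lines (sketch): strip the swap pte bits that may
      legitimately differ, uffd-wp included, before comparing.
      
        static inline pte_t pte_swp_clear_flags(pte_t pte)
        {
                if (pte_swp_soft_dirty(pte))
                        pte = pte_swp_clear_soft_dirty(pte);
                if (pte_swp_uffd_wp(pte))
                        pte = pte_swp_clear_uffd_wp(pte);
                return pte;
        }
        
        static inline int pte_same_as_swp(pte_t pte, pte_t swp_pte)
        {
                return pte_same(pte_swp_clear_flags(pte),
                                pte_swp_clear_flags(swp_pte));
        }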
      
      Link: https://lkml.kernel.org/r/20210603180546.9083-1-peterx@redhat.com
      Fixes: f45ec5ff ("userfaultfd: wp: support swap and page migration")
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: <stable@vger.kernel.org>	[5.7+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      099dd687
    • mm,hwpoison: fix race with hugetlb page allocation · 25182f05
      Naoya Horiguchi authored
      When a hugetlb page fault (under an overcommit situation) races with
      memory_failure(), VM_BUG_ON_PAGE() is triggered by the following
      sequence:
      
          CPU0:                           CPU1:
      
                                          gather_surplus_pages()
                                            page = alloc_surplus_huge_page()
          memory_failure_hugetlb()
            get_hwpoison_page(page)
              __get_hwpoison_page(page)
                get_page_unless_zero(page)
                                            zero = put_page_testzero(page)
                                            VM_BUG_ON_PAGE(!zero, page)
                                            enqueue_huge_page(h, page)
            put_page(page)
      
      __get_hwpoison_page() only checks the page refcount before taking an
      additional one for memory error handling, which is not enough because
      there's a time window where compound pages have non-zero refcount during
      hugetlb page initialization.
      
      So make __get_hwpoison_page() check the page status a bit more for
      hugetlb pages with get_hwpoison_huge_page().  Checking hugetlb-specific
      flags under hugetlb_lock makes sure that the hugetlb page is not in a
      transient state.  It's notable that another new function,
      HWPoisonHandlable(), is helpful to prevent a race against other
      transient page states (like a generic compound page just before PageHuge
      becomes true).
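      
      A sketch of that extra check (simplified; hugetlb page flag helpers as
      of 5.13):
      
        int get_hwpoison_huge_page(struct page *page, bool *hugetlb)
        {
                int ret = 0;
        
                spin_lock_irq(&hugetlb_lock);
                if (PageHeadHuge(page)) {
                        *hugetlb = true;
                        /* Only pin pages in a stable hugetlb state. */
                        if (HPageFreed(page) || HPageMigratable(page))
                                ret = get_page_unless_zero(page);
                }
                spin_unlock_irq(&hugetlb_lock);
                return ret;
        }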
      
      Link: https://lkml.kernel.org/r/20210603233632.2964832-2-nao.horiguchi@gmail.com
      Fixes: ead07f6a ("mm/memory-failure: introduce get_hwpoison_page() for consistent refcount handling")
      Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Reported-by: Muchun Song <songmuchun@bytedance.com>
      Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: <stable@vger.kernel.org>	[5.12+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      25182f05
  6. 16 June 2021, 1 commit
    • ptp: improve max_adj check against unreasonable values · 475b92f9
      Jakub Kicinski authored
      Scaled PPM conversion to PPB may (on 64bit systems) result
      in a value larger than s32 can hold (freq/scaled_ppm is a long).
      This means the kernel will not correctly reject unreasonably
      high ->freq values (e.g. > 4294967295ppb, 281474976645 scaled PPM).
      
      The conversion is equivalent to a division by ~66 (65.536),
      so the value of ppb is always smaller than ppm, but not small
      enough to assume narrowing the type from long -> s32 is okay.
      
      Note that reasonable user space (e.g. ptp4l) will not use such high
      values anyway (4289046510 ppb ~= 4.3x), so the fix is somewhat pedantic.
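      
      For reference, the conversion and the check it feeds look roughly like
      this (sketch; the point is to keep the intermediate value 64-bit and
      reject it against max_adj before narrowing to s32):
      
        /*
         * Scaled PPM carries a 16-bit binary fractional part:
         * ppb = scaled_ppm * 1000 / 2^16 = scaled_ppm * 125 >> 13,
         * i.e. roughly a division by 65.536.
         */
        static inline s64 scaled_ppm_to_ppb(long ppm)
        {
                s64 ppb = 1 + ppm;
        
                ppb *= 125;
                ppb >>= 13;
                return ppb;
        }
        
        ...
        s64 ppb = scaled_ppm_to_ppb(tx->freq);
        
        if (ppb > ops->max_adj || ppb < -ops->max_adj)
                return -ERANGE;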
      
      Fixes: d39a7435 ("ptp: validate the requested frequency adjustment.")
      Fixes: d94ba80e ("ptp: Added a brand new class driver for ptp clocks.")
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Acked-by: Richard Cochran <richardcochran@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      475b92f9
  7. 14 June 2021, 1 commit
  8. 13 June 2021, 2 commits
    • mm: relocate 'write_protect_seq' in struct mm_struct · 2e302543
      Feng Tang authored
      0day robot reported a 9.2% regression for will-it-scale mmap1 test
      case[1], caused by commit 57efa1fe ("mm/gup: prevent gup_fast from
      racing with COW during fork").
      
      Further debugging shows the regression is due to that commit changing
      the offset of the hot field 'mmap_lock' inside 'struct mm_struct', and
      thus its cache alignment.
      
      From the perf data, the contention on 'mmap_lock' is very severe and
      takes around 95% of cpu cycles; it is a rw_semaphore
      
              struct rw_semaphore {
                      atomic_long_t count;	/* 8 bytes */
                      atomic_long_t owner;	/* 8 bytes */
                      struct optimistic_spin_queue osq; /* spinner MCS lock */
                      ...
      
      Before commit 57efa1fe added 'write_protect_seq', the structure happened
      to have a very good cache alignment layout, as Linus explained:
      
       "and before the addition of the 'write_protect_seq' field, the
        mmap_sem was at offset 120 in 'struct mm_struct'.
      
        Which meant that count and owner were in two different cachelines,
        and then when you have contention and spend time in
        rwsem_down_write_slowpath(), this is probably *exactly* the kind
        of layout you want.
      
        Because first the rwsem_write_trylock() will do a cmpxchg on the
        first cacheline (for the optimistic fast-path), and then in the
        case of contention, rwsem_down_write_slowpath() will just access
        the second cacheline.
      
        Which is probably just optimal for a load that spends a lot of
        time contended - new waiters touch that first cacheline, and then
        they queue themselves up on the second cacheline."
      
      After the commit, the rw_semaphore is at offset 128, which means the
      'count' and 'owner' fields are now in the same cacheline, causing more
      cache bouncing.
      
      Currently there are 3 "#ifdef CONFIG_XXX" before 'mmap_lock' which will
      affect its offset:
      
        CONFIG_MMU
        CONFIG_MEMBARRIER
        CONFIG_HAVE_ARCH_COMPAT_MMAP_BASES
      
      The layout above is on 64 bits system with 0day's default kernel config
      (similar to RHEL-8.3's config), in which all these 3 options are 'y'.
      And the layout can vary with different kernel configs.
      
      Re-laying out a structure is usually a double-edged sword: it can help
      one case but hurt others.  For this case, one solution is, as the newly
      added 'write_protect_seq' is a 4-byte seqcount_t (when
      CONFIG_DEBUG_LOCK_ALLOC=n), to place it into an existing 4-byte hole in
      'mm_struct', which leaves the other fields' alignment unchanged while
      fixing the regression.
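      
      Illustrative only (not the real mm_struct layout): a 4-byte field
      dropped into an existing alignment hole does not shift the members
      after it.
      
        struct example {
                spinlock_t lock;                /* 4 bytes without debug opts  */
                seqcount_t write_protect_seq;   /* 4 bytes, fills the hole the
                                                 * 8-byte-aligned rwsem would
                                                 * otherwise leave here        */
                struct rw_semaphore sem;        /* offset unchanged            */
        };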
      
      Link: https://lore.kernel.org/lkml/20210525031636.GB7744@xsang-OptiPlex-9020/ [1]
      Reported-by: kernel test robot <oliver.sang@intel.com>
      Signed-off-by: Feng Tang <feng.tang@intel.com>
      Reviewed-by: John Hubbard <jhubbard@nvidia.com>
      Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Peter Xu <peterx@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2e302543
    • net: make get_net_ns return error if NET_NS is disabled · ea6932d7
      Changbin Du authored
      There is a panic in the socket ioctl cmd SIOCGSKNS when NET_NS is not
      enabled.  The reason is that nsfs tries to access ns->ops but
      proc_ns_operations is not implemented in this case.
      
      [7.670023] Unable to handle kernel NULL pointer dereference at virtual address 00000010
      [7.670268] pgd = 32b54000
      [7.670544] [00000010] *pgd=00000000
      [7.671861] Internal error: Oops: 5 [#1] SMP ARM
      [7.672315] Modules linked in:
      [7.672918] CPU: 0 PID: 1 Comm: systemd Not tainted 5.13.0-rc3-00375-g6799d4f2 #16
      [7.673309] Hardware name: Generic DT based system
      [7.673642] PC is at nsfs_evict+0x24/0x30
      [7.674486] LR is at clear_inode+0x20/0x9c
      
      The same applies to the tun SIOCGSKNS command.
      
      To fix this problem, we make get_net_ns() return -EINVAL when NET_NS is
      disabled, and move it to its proper place, net/core/net_namespace.c.
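      
      A sketch of the resulting split (per the description above; placement of
      the CONFIG_NET_NS guard approximated):
      
        /* include/net/net_namespace.h */
        #ifdef CONFIG_NET_NS
        struct ns_common *get_net_ns(struct ns_common *ns);
        #else
        static inline struct ns_common *get_net_ns(struct ns_common *ns)
        {
                /* No netns support: return an error instead of letting nsfs
                 * dereference ns->ops, which is NULL in this configuration. */
                return ERR_PTR(-EINVAL);
        }
        #endif
        
        /* net/core/net_namespace.c */
        struct ns_common *get_net_ns(struct ns_common *ns)
        {
                return &get_net(container_of(ns, struct net, ns))->ns;
        }
      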
      Signed-off-by: Changbin Du <changbin.du@gmail.com>
      Fixes: c62cce2c ("net: add an ioctl to get a socket network namespace")
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: Jakub Kicinski <kuba@kernel.org>
      Cc: David Laight <David.Laight@ACULAB.COM>
      Cc: Christian Brauner <christian.brauner@ubuntu.com>
      Suggested-by: Jakub Kicinski <kuba@kernel.org>
      Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ea6932d7
  9. 10 June 2021, 2 commits
    • net/mlx5e: Fix page reclaim for dead peer hairpin · a3e5fd93
      Dima Chumak authored
      When adding a hairpin flow, a firmware-side send queue is created for
      the peer net device, which claims some host memory pages for its
      internal ring buffer. If the peer net device is removed/unbound before
      the hairpin flow is deleted, then the send queue is not destroyed which
      leads to a stack trace on pci device remove:
      
      [ 748.005230] mlx5_core 0000:08:00.2: wait_func:1094:(pid 12985): MANAGE_PAGES(0x108) timeout. Will cause a leak of a command resource
      [ 748.005231] mlx5_core 0000:08:00.2: reclaim_pages:514:(pid 12985): failed reclaiming pages: err -110
      [ 748.001835] mlx5_core 0000:08:00.2: mlx5_reclaim_root_pages:653:(pid 12985): failed reclaiming pages (-110) for func id 0x0
      [ 748.002171] ------------[ cut here ]------------
      [ 748.001177] FW pages counter is 4 after reclaiming all pages
      [ 748.001186] WARNING: CPU: 1 PID: 12985 at drivers/net/ethernet/mellanox/mlx5/core/pagealloc.c:685 mlx5_reclaim_startup_pages+0x34b/0x460 [mlx5_core]                      [  +0.002771] Modules linked in: cls_flower mlx5_ib mlx5_core ptp pps_core act_mirred sch_ingress openvswitch nsh xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xt_addrtype iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 br_netfilter rpcrdma rdma_ucm ib_iser libiscsi scsi_transport_iscsi rdma_cm ib_umad ib_ipoib iw_cm ib_cm ib_uverbs ib_core overlay fuse [last unloaded: pps_core]
      [ 748.007225] CPU: 1 PID: 12985 Comm: tee Not tainted 5.12.0+ #1
      [ 748.001376] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
      [ 748.002315] RIP: 0010:mlx5_reclaim_startup_pages+0x34b/0x460 [mlx5_core]
      [ 748.001679] Code: 28 00 00 00 0f 85 22 01 00 00 48 81 c4 b0 00 00 00 31 c0 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 c7 c7 40 cc 19 a1 e8 9f 71 0e e2 <0f> 0b e9 30 ff ff ff 48 c7 c7 a0 cc 19 a1 e8 8c 71 0e e2 0f 0b e9
      [ 748.003781] RSP: 0018:ffff88815220faf8 EFLAGS: 00010286
      [ 748.001149] RAX: 0000000000000000 RBX: ffff8881b4900280 RCX: 0000000000000000
      [ 748.001445] RDX: 0000000000000027 RSI: 0000000000000004 RDI: ffffed102a441f51
      [ 748.001614] RBP: 00000000000032b9 R08: 0000000000000001 R09: ffffed1054a15ee8
      [ 748.001446] R10: ffff8882a50af73b R11: ffffed1054a15ee7 R12: fffffbfff07c1e30
      [ 748.001447] R13: dffffc0000000000 R14: ffff8881b492cba8 R15: 0000000000000000
      [ 748.001429] FS:  00007f58bd08b580(0000) GS:ffff8882a5080000(0000) knlGS:0000000000000000
      [ 748.001695] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 748.001309] CR2: 000055a026351740 CR3: 00000001d3b48006 CR4: 0000000000370ea0
      [ 748.001506] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [ 748.001483] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [ 748.001654] Call Trace:
      [ 748.000576]  ? mlx5_satisfy_startup_pages+0x290/0x290 [mlx5_core]
      [ 748.001416]  ? mlx5_cmd_teardown_hca+0xa2/0xd0 [mlx5_core]
      [ 748.001354]  ? mlx5_cmd_init_hca+0x280/0x280 [mlx5_core]
      [ 748.001203]  mlx5_function_teardown+0x30/0x60 [mlx5_core]
      [ 748.001275]  mlx5_uninit_one+0xa7/0xc0 [mlx5_core]
      [ 748.001200]  remove_one+0x5f/0xc0 [mlx5_core]
      [ 748.001075]  pci_device_remove+0x9f/0x1d0
      [ 748.000833]  device_release_driver_internal+0x1e0/0x490
      [ 748.001207]  unbind_store+0x19f/0x200
      [ 748.000942]  ? sysfs_file_ops+0x170/0x170
      [ 748.001000]  kernfs_fop_write_iter+0x2bc/0x450
      [ 748.000970]  new_sync_write+0x373/0x610
      [ 748.001124]  ? new_sync_read+0x600/0x600
      [ 748.001057]  ? lock_acquire+0x4d6/0x700
      [ 748.000908]  ? lockdep_hardirqs_on_prepare+0x400/0x400
      [ 748.001126]  ? fd_install+0x1c9/0x4d0
      [ 748.000951]  vfs_write+0x4d0/0x800
      [ 748.000804]  ksys_write+0xf9/0x1d0
      [ 748.000868]  ? __x64_sys_read+0xb0/0xb0
      [ 748.000811]  ? filp_open+0x50/0x50
      [ 748.000919]  ? syscall_enter_from_user_mode+0x1d/0x50
      [ 748.001223]  do_syscall_64+0x3f/0x80
      [ 748.000892]  entry_SYSCALL_64_after_hwframe+0x44/0xae
      [ 748.001026] RIP: 0033:0x7f58bcfb22f7
      [ 748.000944] Code: 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
      [ 748.003925] RSP: 002b:00007fffd7f2aaa8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
      [ 748.001732] RAX: ffffffffffffffda RBX: 000000000000000d RCX: 00007f58bcfb22f7
      [ 748.001426] RDX: 000000000000000d RSI: 00007fffd7f2abc0 RDI: 0000000000000003
      [ 748.001746] RBP: 00007fffd7f2abc0 R08: 0000000000000000 R09: 0000000000000001
      [ 748.001631] R10: 00000000000001b6 R11: 0000000000000246 R12: 000000000000000d
      [ 748.001537] R13: 00005597ac2c24a0 R14: 000000000000000d R15: 00007f58bd084700
      [ 748.001564] irq event stamp: 0
      [ 748.000787] hardirqs last  enabled at (0): [<0000000000000000>] 0x0
      [ 748.001399] hardirqs last disabled at (0): [<ffffffff813132cf>] copy_process+0x146f/0x5eb0
      [ 748.001854] softirqs last  enabled at (0): [<ffffffff8131330e>] copy_process+0x14ae/0x5eb0
      [ 748.013431] softirqs last disabled at (0): [<0000000000000000>] 0x0
      [ 748.001492] ---[ end trace a6fabd773d1c51ae ]---
      
      Fix by destroying the send queue of a hairpin peer net device that is
      being removed/unbound, which returns the allocated ring buffer pages to
      the host.
      
      Fixes: 4d8fcf21 ("net/mlx5e: Avoid unbounded peer devices when unpairing TC hairpin rules")
      Signed-off-by: Dima Chumak <dchumak@nvidia.com>
      Reviewed-by: Roi Dayan <roid@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      a3e5fd93
    • misc: rtsx: separate aspm mode into MODE_REG and MODE_CFG · 3df4fce7
      Ricky Wu authored
      ASPM (Active State Power Management):
      rtsx_comm_set_aspm(): this function is for the driver to make sure the
      device does not enter power saving while processing init and
      card_detect.
      ASPM_MODE_CFG: 8411 5209 5227 5229 5249 5250
      Change back to using the original way to control ASPM.
      ASPM_MODE_REG: 5227A 524A 5250A 5260 5261 5228
      Keep the new way to control ASPM.
      
      Fixes: 121e9c6b ("misc: rtsx: modify and fix init_hw function")
      Reported-by: Chris Chiu <chris.chiu@canonical.com>
      Tested-by: Gordon Lack <gordon.lack@dsl.pipex.com>
      Cc: stable <stable@vger.kernel.org>
      Signed-off-by: Ricky Wu <ricky_wu@realtek.com>
      Link: https://lore.kernel.org/r/20210607101634.4948-1-ricky_wu@realtek.com
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      3df4fce7
  10. 09 June 2021, 2 commits
    • kvm: fix previous commit for 32-bit builds · 4422829e
      Paolo Bonzini authored
      array_index_nospec does not work for uint64_t on 32-bit builds.
      However, the size of a memory slot must be less than 20 bits wide
      on those systems, since the memory slot must fit in the user
      address space.  So just store it in an unsigned long.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      4422829e
    • kvm: avoid speculation-based attacks from out-of-range memslot accesses · da27a83f
      Paolo Bonzini authored
      KVM's mechanism for accessing guest memory translates a guest physical
      address (gpa) to a host virtual address using the right-shifted gpa
      (also known as gfn) and a struct kvm_memory_slot.  The translation is
      performed in __gfn_to_hva_memslot using the following formula:
      
            hva = slot->userspace_addr + (gfn - slot->base_gfn) * PAGE_SIZE
      
      It is expected that gfn falls within the boundaries of the guest's
      physical memory.  However, a guest can access invalid physical addresses
      in such a way that the gfn is invalid.
      
      __gfn_to_hva_memslot is called from kvm_vcpu_gfn_to_hva_prot, which first
      retrieves a memslot through __gfn_to_memslot.  While __gfn_to_memslot
      does check that the gfn falls within the boundaries of the guest's
      physical memory or not, a CPU can speculate the result of the check and
      continue execution speculatively using an illegal gfn. The speculation
      can result in calculating an out-of-bounds hva.  If the resulting host
      virtual address is used to load another guest physical address, this
      is effectively a Spectre gadget consisting of two consecutive reads,
      the second of which is data dependent on the first.
      
      Right now it's not clear if there are any cases in which this is
      exploitable.  One interesting case was reported by the original author
      of this patch, and involves visiting guest page tables on x86.  Right
      now these are not vulnerable because the hva read goes through get_user(),
      which contains an LFENCE speculation barrier.  However, there are
      patches in progress for x86 uaccess.h to mask kernel addresses instead of
      using LFENCE; once these land, a guest could use speculation to read
      from the VMM's ring 3 address space.  Other architectures such as ARM
      already use the address masking method, and would be susceptible to
      this same kind of data-dependent access gadgets.  Therefore, this patch
      proactively protects from these attacks by masking out-of-bounds gfns
      in __gfn_to_hva_memslot, which blocks speculation of invalid hvas.
      
      Sean Christopherson noted that this patch does not cover
      kvm_read_guest_offset_cached.  This however is limited to a few bytes
      past the end of the cache, and therefore it is unlikely to be useful in
      the context of building a chain of data dependent accesses.
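      
      The resulting masking looks roughly like this (sketch; combined with the
      32-bit follow-up above that stores the offset in an unsigned long):
      
        static inline unsigned long
        __gfn_to_hva_memslot(const struct kvm_memory_slot *slot, gfn_t gfn)
        {
                /*
                 * The bounds check was done by the caller; clamp the offset so
                 * that speculation with an out-of-range gfn cannot produce an
                 * out-of-bounds hva.
                 */
                unsigned long offset = gfn - slot->base_gfn;
        
                offset = array_index_nospec(offset, slot->npages);
                return slot->userspace_addr + offset * PAGE_SIZE;
        }
      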
      Reported-by: Artemiy Margaritov <artemiy.margaritov@gmail.com>
      Co-developed-by: Artemiy Margaritov <artemiy.margaritov@gmail.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      da27a83f
  11. 05 June 2021, 1 commit
  12. 04 June 2021, 3 commits
  13. 03 June 2021, 1 commit
    • sched/fair: Fix util_est UTIL_AVG_UNCHANGED handling · 68d7a190
      Dietmar Eggemann authored
      The util_est internal UTIL_AVG_UNCHANGED flag which is used to prevent
      unnecessary util_est updates uses the LSB of util_est.enqueued. It is
      exposed via _task_util_est() (and task_util_est()).
      
      Commit 92a801e5 ("sched/fair: Mask UTIL_AVG_UNCHANGED usages")
      mentions that the LSB is lost for util_est resolution but
      find_energy_efficient_cpu() checks if task_util_est() returns 0 to
      return prev_cpu early.
      
      _task_util_est() returns the max value of util_est.ewma and
      util_est.enqueued or'ed w/ UTIL_AVG_UNCHANGED.
      So task_util_est() returning the max of task_util() and
      _task_util_est() will never return 0 under the default
      SCHED_FEAT(UTIL_EST, true).
      
      To fix this use the MSB of util_est.enqueued instead and keep the flag
      util_est internal, i.e. don't export it via _task_util_est().
      
      The maximal possible util_avg value for a task is 1024 so the MSB of
      'unsigned int util_est.enqueued' isn't used to store a util value.
      
      As a caveat the code behind the util_est_se trace point has to filter
      UTIL_AVG_UNCHANGED to see the real util_est.enqueued value which should
      be easy to do.
      
      This also fixes an issue reported by Xuewen Yan that util_est_update()
      only used UTIL_AVG_UNCHANGED for the subtrahend of the equation:
      
        last_enqueued_diff = ue.enqueued - (task_util() | UTIL_AVG_UNCHANGED)
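      
      A sketch of the flag change (simplified; the flag moves from the LSB to
      the MSB and is masked out before the value is exposed):
      
        /* util_est internal: the MSB is never needed for a util value (<= 1024). */
        #define UTIL_AVG_UNCHANGED 0x80000000
        
        static inline unsigned long _task_util_est(struct task_struct *p)
        {
                struct util_est ue = READ_ONCE(p->se.avg.util_est);
        
                return max(ue.ewma, (ue.enqueued & ~UTIL_AVG_UNCHANGED));
        }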
      
      Fixes: b89997aa ("sched/pelt: Fix task util_est update filtering")
      Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Xuewen Yan <xuewen.yan@unisoc.com>
      Reviewed-by: Vincent Donnefort <vincent.donnefort@arm.com>
      Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
      Link: https://lore.kernel.org/r/20210602145808.1562603-1-dietmar.eggemann@arm.com
      68d7a190
  14. 02 June 2021, 2 commits
  15. 01 June 2021, 1 commit
  16. 31 May 2021, 1 commit
  17. 27 May 2021, 3 commits
  18. 26 May 2021, 2 commits
  19. 25 May 2021, 3 commits
  20. 24 May 2021, 1 commit