1. 09 Apr 2021, 13 commits
  2. 09 Mar 2021, 9 commits
  3. 08 Feb 2021, 10 commits
  4. 28 Jan 2021, 5 commits
  5. 27 Jan 2021, 1 commit
    • mm: make wait_on_page_writeback() wait for multiple pending writebacks · 7b73bf0a
      Committed by Linus Torvalds
      stable inclusion
      from stable-5.10.7
      commit 876195e1c8c6dcd580b648f0a691c93b86ec2042
      bugzilla: 47429
      
      --------------------------------
      
      commit c2407cf7 upstream.
      
      Ever since commit 2a9127fc ("mm: rewrite wait_on_page_bit_common()
      logic") we've had some very occasional reports of BUG_ON(PageWriteback)
      in write_cache_pages(), which we thought we already fixed in commit
      073861ed ("mm: fix VM_BUG_ON(PageTail) and BUG_ON(PageWriteback)").
      
      But syzbot just reported another one, even with that commit in place.
      
      And it turns out that there's a simpler way to trigger the BUG_ON() than
      the one Hugh found with page re-use.  It all boils down to the fact that
      the page writeback is ostensibly serialized by the page lock, but that
      isn't actually true.
      
      Yes, the people _setting_ writeback all do so under the page lock, but
      the actual clearing of the bit - and waking up any waiters - happens
      without any page lock.
      
      This gives us this fairly simple race condition:
      
        CPU1 = end previous writeback
        CPU2 = start new writeback under page lock
        CPU3 = write_cache_pages()
      
        CPU1          CPU2            CPU3
        ----          ----            ----
      
        end_page_writeback()
          test_clear_page_writeback(page)
          ... delayed...
      
                      lock_page();
                      set_page_writeback()
                      unlock_page()
      
                                      lock_page()
                                      wait_on_page_writeback();
      
          wake_up_page(page, PG_writeback);
          .. wakes up CPU3 ..
      
                                      BUG_ON(PageWriteback(page));
      
      where the BUG_ON() happens because we were woken up on the PG_writeback
      bit because of the _previous_ writeback, but a new one had already been
      started because the clearing of the bit wasn't actually atomic wrt the
      actual wakeup or serialized by the page lock.
      
      The reason this didn't use to happen was that the old logic in waiting
      on a page bit would just loop if it ever saw the bit set again.
      
      The nice proper fix would probably be to get rid of the whole "wait for
      writeback to clear, and then set it" logic in the writeback path, and
      replace it with an atomic "wait-to-set" (ie the same as we have for page
      locking: we set the page lock bit with a single "lock_page()", not with
      "wait for lock bit to clear and then set it").
      
      However, our current model for writeback is that the waiting for the
      writeback bit is done by the generic VFS code (ie write_cache_pages()),
      but the actual setting of the writeback bit is done much later by the
      filesystem ".writepages()" function.
      
      IOW, to make the writeback bit have that same kind of "wait-to-set"
      behavior as we have for page locking, we'd have to change roughly 50
      different writeback functions.  Painful.
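
      As a rough illustration of the difference (wait_to_set_page_writeback()
      is hypothetical and does not exist; only the two-step pattern is real):

        /* Today: waiting and setting are two separate steps done by
         * different layers, so they are not atomic with respect to
         * each other. */
        wait_on_page_writeback(page);   /* generic VFS code, e.g. write_cache_pages() */
        /* ... much later ... */
        set_page_writeback(page);       /* filesystem ->writepages() path */

        /* A lock_page()-style API would fold both into one atomic
         * "wait until clear, then set" operation: */
        wait_to_set_page_writeback(page);   /* hypothetical, does not exist */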
      
      Instead, just make "wait_on_page_writeback()" loop on the very unlikely
      situation that the PG_writeback bit is still set, basically re-instating
      the old behavior.  This is very non-optimal in case of contention, but
      since we only ever set the bit under the page lock, that situation is
      controlled.
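
      In code, the shape of the fix is just turning the one-shot wait into a
      re-check loop; a sketch of the mm/page-writeback.c change, omitting
      tracing and other details:

        void wait_on_page_writeback(struct page *page)
        {
                /*
                 * A wakeup only means PG_writeback was clear at some
                 * point; a new writeback may already have started
                 * under the page lock, so re-check the bit instead
                 * of returning after a single wakeup.
                 */
                while (PageWriteback(page))
                        wait_on_page_bit(page, PG_writeback);
        }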
      
      Reported-by: syzbot+2fc0712f8f8b8b8fa0ef@syzkaller.appspotmail.com
      Fixes: 2a9127fc ("mm: rewrite wait_on_page_bit_common() logic")
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: stable@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Xie XiuQi <xiexiuqi@huawei.com>
  6. 18 Jan 2021, 2 commits
    • mm: memmap defer init doesn't work as expected · ad6a5557
      Committed by Baoquan He
      stable inclusion
      from stable-5.10.5
      commit 98b57685c26d8f41040ecf71e190250fb2eb2a0c
      bugzilla: 46931
      
      --------------------------------
      
      commit dc2da7b4 upstream.
      
      VMware observed a performance regression during memmap init on their
      platform, and bisected it to commit 73a6e474 ("mm: memmap_init:
      iterate over memblock regions rather that check each PFN").
      
      Before the commit:
      
        [0.033176] Normal zone: 1445888 pages used for memmap
        [0.033176] Normal zone: 89391104 pages, LIFO batch:63
        [0.035851] ACPI: PM-Timer IO Port: 0x448
      
      With the commit:
      
        [0.026874] Normal zone: 1445888 pages used for memmap
        [0.026875] Normal zone: 89391104 pages, LIFO batch:63
        [2.028450] ACPI: PM-Timer IO Port: 0x448
      
      The root cause is that the current memmap defer init doesn't work as expected.
      
      Before, memmap_init_zone() was used to do memmap init of one whole
      zone: it initialized all low zones of a NUMA node, but deferred
      memmap init of the last zone in that node.  However, since commit
      73a6e474, memmap_init() iterates over the memblock regions inside
      one zone and calls memmap_init_zone() to do memmap init for each
      region.
      
      E.g., on VMware's system the memory layout is as below; there are
      two memory regions in node 2.  The current code mistakenly
      initializes the whole 1st region [mem 0xab00000000-0xfcffffffff],
      and then defers on the 2nd region [mem 0x10000000000-0x1033fffffff],
      initializing only one memory section there.  In fact, we expect only
      one memory section's memmap to be initialized for the whole zone.
      That's where the extra time is spent.
      
      [    0.008842] ACPI: SRAT: Node 0 PXM 0 [mem 0x00000000-0x0009ffff]
      [    0.008842] ACPI: SRAT: Node 0 PXM 0 [mem 0x00100000-0xbfffffff]
      [    0.008843] ACPI: SRAT: Node 0 PXM 0 [mem 0x100000000-0x55ffffffff]
      [    0.008844] ACPI: SRAT: Node 1 PXM 1 [mem 0x5600000000-0xaaffffffff]
      [    0.008844] ACPI: SRAT: Node 2 PXM 2 [mem 0xab00000000-0xfcffffffff]
      [    0.008845] ACPI: SRAT: Node 2 PXM 2 [mem 0x10000000000-0x1033fffffff]
      
      Now, let's add a parameter 'zone_end_pfn' to memmap_init_zone() to
      pass down the real zone end pfn, so that defer_init() can use it to
      judge whether deferral should apply zone-wide.
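
      A simplified sketch of the idea (the real memmap_init_zone() in
      mm/page_alloc.c takes more arguments; this only shows how the new
      parameter reaches the defer check):

        /* zone_end_pfn is the end of the whole zone, not of the
         * current memblock region being initialized. */
        void memmap_init_zone(unsigned long size, int nid, unsigned long zone,
                              unsigned long start_pfn, unsigned long zone_end_pfn)
        {
                unsigned long pfn;

                for (pfn = start_pfn; pfn < start_pfn + size; pfn++) {
                        /*
                         * Judge deferral zone-wide: only the tail of
                         * the node's last zone is deferred, no matter
                         * how many memblock regions the zone spans.
                         */
                        if (defer_init(nid, pfn, zone_end_pfn))
                                break;
                        __init_single_page(pfn_to_page(pfn), pfn, zone, nid);
                }
        }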
      
      Link: https://lkml.kernel.org/r/20201223080811.16211-1-bhe@redhat.com
      Link: https://lkml.kernel.org/r/20201223080811.16211-2-bhe@redhat.com
      Fixes: 73a6e474 ("mm: memmap_init: iterate over memblock regions rather that check each PFN")
      Signed-off-by: Baoquan He <bhe@redhat.com>
      Reported-by: Rahul Gopakumar <gopakumarr@vmware.com>
      Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Xie XiuQi <xiexiuqi@huawei.com>
    • mm/hugetlb: fix deadlock in hugetlb_cow error path · 69d4df11
      Committed by Mike Kravetz
      stable inclusion
      from stable-5.10.5
      commit df73c80338ef397d5fb2fe2631d24e2256bed9bd
      bugzilla: 46931
      
      --------------------------------
      
      commit e7dd91c4 upstream.
      
      syzbot reported the deadlock here [1].  The issue is in hugetlb cow
      error handling when there are not enough huge pages for the faulting
      task which took the original reservation.  It is possible that other
      (child) tasks could have consumed pages associated with the reservation.
      In this case, we want the task which took the original reservation to
      succeed.  So, we unmap any associated pages in children so that they can
      be used by the faulting task that owns the reservation.
      
      The unmapping code needs to hold i_mmap_rwsem in write mode.  However,
      due to commit c0d0381a ("hugetlbfs: use i_mmap_rwsem for more pmd
      sharing synchronization") we are already holding i_mmap_rwsem in read
      mode when hugetlb_cow is called.
      
      Technically, i_mmap_rwsem does not need to be held in read mode for COW
      mappings, as they cannot share PMDs.  Modifying the fault code to not
      take i_mmap_rwsem in read mode for COW (and other non-sharable) mappings
      is too involved for a stable fix.
      
      Instead, we simply drop the hugetlb_fault_mutex and i_mmap_rwsem
      before unmapping.  This is OK because the locks are technically not
      needed there.  They are reacquired after unmapping, as the calling
      code expects.  Since this is done in an uncommon error path, the
      overhead of dropping and reacquiring the locks is acceptable.
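
      A sketch of the reworked error path (simplified from the actual
      hugetlb_cow() code in mm/hugetlb.c):

        /*
         * unmap_ref_private() needs i_mmap_rwsem in write mode, but
         * hugetlb_cow() is entered holding it in read mode (plus the
         * fault mutex).  Drop both, unmap, then reacquire them in
         * the order the calling code expects.
         */
        hash = hugetlb_fault_mutex_hash(mapping, idx);
        mutex_unlock(&hugetlb_fault_mutex_table[hash]);
        i_mmap_unlock_read(mapping);

        unmap_ref_private(mm, vma, old_page, haddr);

        i_mmap_lock_read(mapping);
        mutex_lock(&hugetlb_fault_mutex_table[hash]);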
      
      While at it, remove the redundant BUG_ON() after unmap_ref_private().
      
      [1] https://lkml.kernel.org/r/000000000000b73ccc05b5cf8558@google.com
      
      Link: https://lkml.kernel.org/r/4c5781b8-3b00-761e-c0c7-c5edebb6ec1a@oracle.com
      Fixes: c0d0381a ("hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization")
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reported-by: syzbot+5eee4145df3c15e96625@syzkaller.appspotmail.com
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Xie XiuQi <xiexiuqi@huawei.com>