1. 25 Oct 2021 (3 commits)
    • net/802/garp: fix memleak in garp_request_join() · 96a28d8d
      Yang Yingliang authored
      stable inclusion
      from linux-4.19.200
      commit e954107513e5e984821591b9b0ee4b002fcb63c6
      
      --------------------------------
      
      [ Upstream commit 42ca63f9 ]
      
      I got a kmemleak report when doing a fuzz test:
      
      BUG: memory leak
      unreferenced object 0xffff88810c909b80 (size 64):
        comm "syz", pid 957, jiffies 4295220394 (age 399.090s)
        hex dump (first 32 bytes):
          01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
          00 00 00 00 00 00 00 00 08 00 00 00 01 02 00 04  ................
        backtrace:
          [<00000000ca1f2e2e>] garp_request_join+0x285/0x3d0
          [<00000000bf153351>] vlan_gvrp_request_join+0x15b/0x190
          [<0000000024005e72>] vlan_dev_open+0x706/0x980
          [<00000000dc20c4d4>] __dev_open+0x2bb/0x460
          [<0000000066573004>] __dev_change_flags+0x501/0x650
          [<0000000035b42f83>] rtnl_configure_link+0xee/0x280
          [<00000000a5e69de0>] __rtnl_newlink+0xed5/0x1550
          [<00000000a5258f4a>] rtnl_newlink+0x66/0x90
          [<00000000506568ee>] rtnetlink_rcv_msg+0x439/0xbd0
          [<00000000b7eaeae1>] netlink_rcv_skb+0x14d/0x420
          [<00000000c373ce66>] netlink_unicast+0x550/0x750
          [<00000000ec74ce74>] netlink_sendmsg+0x88b/0xda0
          [<00000000381ff246>] sock_sendmsg+0xc9/0x120
          [<000000008f6a2db3>] ____sys_sendmsg+0x6e8/0x820
          [<000000008d9c1735>] ___sys_sendmsg+0x145/0x1c0
          [<00000000aa39dd8b>] __sys_sendmsg+0xfe/0x1d0
      
      Calling garp_request_leave() after garp_request_join() sets attr->state
      to GARP_APPLICANT_VO, so garp_attr_destroy() won't be called in the last
      transmit event in garp_uninit_applicant() and the attr of the applicant
      will be leaked. To fix this leak, iterate over and free each attr of the
      applicant before returning from garp_uninit_applicant(), as sketched
      below.
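      The upstream change follows this idea; the helper name below is
      illustrative and details may differ from the actual patch:

        /* Sketch: destroy every attribute still tracked by the applicant. */
        static void garp_attr_destroy_all(struct garp_applicant *app)
        {
                struct rb_node *node, *next;
                struct garp_attr *attr;

                for (node = rb_first(&app->gid); node; node = next) {
                        next = rb_next(node);
                        attr = rb_entry(node, struct garp_attr, node);
                        garp_attr_destroy(app, attr);
                }
        }

      garp_uninit_applicant() would call this (under the applicant lock)
      right before freeing the applicant, so no attr can outlive it.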
      Reported-by: Hulk Robot <hulkci@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • net/802/mrp: fix memleak in mrp_request_join() · 4c2871d4
      Yang Yingliang authored
      stable inclusion
      from linux-4.19.200
      commit f9dd1e4e9d39e799fbe2be9ac7e6b43a9567ff8c
      
      --------------------------------
      
      [ Upstream commit 996af621 ]
      
      I got a kmemleak report when doing a fuzz test:
      
      BUG: memory leak
      unreferenced object 0xffff88810c239500 (size 64):
      comm "syz-executor940", pid 882, jiffies 4294712870 (age 14.631s)
      hex dump (first 32 bytes):
      01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
      00 00 00 00 00 00 00 00 01 00 00 00 01 02 00 04 ................
      backtrace:
      [<00000000a323afa4>] slab_alloc_node mm/slub.c:2972 [inline]
      [<00000000a323afa4>] slab_alloc mm/slub.c:2980 [inline]
      [<00000000a323afa4>] __kmalloc+0x167/0x340 mm/slub.c:4130
      [<000000005034ca11>] kmalloc include/linux/slab.h:595 [inline]
      [<000000005034ca11>] mrp_attr_create net/802/mrp.c:276 [inline]
      [<000000005034ca11>] mrp_request_join+0x265/0x550 net/802/mrp.c:530
      [<00000000fcfd81f3>] vlan_mvrp_request_join+0x145/0x170 net/8021q/vlan_mvrp.c:40
      [<000000009258546e>] vlan_dev_open+0x477/0x890 net/8021q/vlan_dev.c:292
      [<0000000059acd82b>] __dev_open+0x281/0x410 net/core/dev.c:1609
      [<000000004e6dc695>] __dev_change_flags+0x424/0x560 net/core/dev.c:8767
      [<00000000471a09af>] rtnl_configure_link+0xd9/0x210 net/core/rtnetlink.c:3122
      [<0000000037a4672b>] __rtnl_newlink+0xe08/0x13e0 net/core/rtnetlink.c:3448
      [<000000008d5d0fda>] rtnl_newlink+0x64/0xa0 net/core/rtnetlink.c:3488
      [<000000004882fe39>] rtnetlink_rcv_msg+0x369/0xa10 net/core/rtnetlink.c:5552
      [<00000000907e6c54>] netlink_rcv_skb+0x134/0x3d0 net/netlink/af_netlink.c:2504
      [<00000000e7d7a8c4>] netlink_unicast_kernel net/netlink/af_netlink.c:1314 [inline]
      [<00000000e7d7a8c4>] netlink_unicast+0x4a0/0x6a0 net/netlink/af_netlink.c:1340
      [<00000000e0645d50>] netlink_sendmsg+0x78e/0xc90 net/netlink/af_netlink.c:1929
      [<00000000c24559b7>] sock_sendmsg_nosec net/socket.c:654 [inline]
      [<00000000c24559b7>] sock_sendmsg+0x139/0x170 net/socket.c:674
      [<00000000fc210bc2>] ____sys_sendmsg+0x658/0x7d0 net/socket.c:2350
      [<00000000be4577b5>] ___sys_sendmsg+0xf8/0x170 net/socket.c:2404
      
      Calling mrp_request_leave() after mrp_request_join() sets attr->state
      to MRP_APPLICANT_VO, so mrp_attr_destroy() won't be called in the last
      TX event in mrp_uninit_applicant() and the attr of the applicant will be
      leaked. To fix this leak, iterate over and free each attr of the
      applicant before returning from mrp_uninit_applicant(), as sketched
      below.
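      The fix mirrors the GARP one above: walk the applicant's attribute
      rbtree and destroy every entry before mrp_uninit_applicant() returns.
      A sketch (helper name illustrative):

        static void mrp_attr_destroy_all(struct mrp_applicant *app)
        {
                struct rb_node *node, *next;

                for (node = rb_first(&app->mad); node; node = next) {
                        next = rb_next(node);
                        mrp_attr_destroy(app, rb_entry(node, struct mrp_attr, node));
                }
        }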
      Reported-by: Hulk Robot <hulkci@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • af_unix: fix garbage collect vs MSG_PEEK · faa64eb4
      Miklos Szeredi authored
      stable inclusion
      from linux-4.19.200
      commit 1dabafa9f61118b1377fde424d9a94bf8dbf2813
      
      --------------------------------
      
      commit cbcf0112 upstream.
      
      unix_gc() assumes that candidate sockets can never gain an external
      reference (i.e.  be installed into an fd) while the unix_gc_lock is
      held.  Except for MSG_PEEK this is guaranteed by modifying inflight
      count under the unix_gc_lock.
      
      MSG_PEEK does not touch any variable protected by unix_gc_lock (file
      count is not), yet it needs to be serialized with garbage collection.
      Do this by locking/unlocking unix_gc_lock:
      
       1) increment file count
      
       2) lock/unlock barrier to make sure incremented file count is visible
          to garbage collection
      
       3) install file into fd
      
      This is a lock barrier (unlike smp_mb()) that ensures that garbage
      collection is run completely before or completely after the barrier.
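      A sketch of the MSG_PEEK receive path after the change (the helper name
      follows the upstream patch; the long explanatory comment is abridged):

        static void unix_peek_fds(struct scm_cookie *scm, struct sk_buff *skb)
        {
                scm->fp = scm_fp_dup(UNIXCB(skb).fp);   /* 1) bump file counts */

                /*
                 * 2) lock/unlock of unix_gc_lock acts as a barrier: any gc
                 * that saw the old file count has finished, and any later gc
                 * sees the incremented count and won't treat the socket as
                 * garbage.
                 */
                spin_lock(&unix_gc_lock);
                spin_unlock(&unix_gc_lock);
        }

      Step 3), installing the files into fds, then happens as before in the
      scm_detach_fds() path, outside unix_gc_lock.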
      
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
  2. 22 Oct 2021 (12 commits)
  3. 21 Oct 2021 (1 commit)
  4. 20 Oct 2021 (9 commits)
    • uacce: misc fixes · de116c46
      Yu'an Wang authored
      driver inclusion
      category: bugfix
      bugzilla: NA
      CVE: NA
      
      1. Add an input parameter check to the uacce_unregister API (sketched
         below).
      2. Make uacce_qfrt_str an internal interface, because it is only used
         in uacce.c.
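      A minimal sketch of fix 1; it assumes the out-of-tree signature
      uacce_unregister(struct uacce *uacce), so treat it as illustrative
      rather than the exact driver code:

        void uacce_unregister(struct uacce *uacce)
        {
                if (!uacce)     /* reject a NULL argument instead of crashing */
                        return;

                /* ... existing teardown of queues, cdev and sysfs nodes ... */
        }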
      Signed-off-by: Yu'an Wang <wangyuan46@huawei.com>
      Reviewed-by: Longfang Liu <liulongfang@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • mm/page_alloc: place pages to tail in __free_pages_core() · 01702464
      David Hildenbrand authored
      mainline inclusion
      from mainline-v5.10-rc1
      commit 7fef431b
      category: feature
      bugzilla: 182882
      CVE: NA
      
      __free_pages_core() is used when exposing fresh memory to the buddy during
      system boot and when onlining memory in generic_online_page().
      
      generic_online_page() is used in two cases:
      
      1. Direct memory onlining in online_pages().
      2. Deferred memory onlining in memory-ballooning-like mechanisms (HyperV
         balloon and virtio-mem), when parts of a section are kept
         fake-offline to be fake-onlined later on.
      
      In 1, we already place pages to the tail of the freelist.  Pages will be
      freed to MIGRATE_ISOLATE lists first and moved to the tail of the
      freelists via undo_isolate_page_range().
      
      In 2, we currently don't implement a proper rule.  In case of virtio-mem,
      where we currently always online MAX_ORDER - 1 pages, the pages will be
      placed to the HEAD of the freelist - undesirable.  While the hyper-v
      balloon calls generic_online_page() with single pages, usually it will
      call it on successive single pages in a larger block.
      
      The pages are fresh, so place them to the tail of the freelist and avoid
      the PCP.  In __free_pages_core(), remove the now superfluous call to
      set_page_refcounted() and add a comment regarding page initialization and
      the refcount.
      
      Note: In 2.  we currently don't shuffle.  If ever relevant (page shuffling
      is usually of limited use in virtualized environments), we might want to
      shuffle after a sequence of generic_online_page() calls in the relevant
      callers.
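      The heart of the mainline change, as an abridged sketch (the 4.19
      backport may differ slightly after the noted conflict resolution):

        void __free_pages_core(struct page *page, unsigned int order)
        {
                /* ... clear PageReserved / reset refcounts of tail pages ... */

                /*
                 * Fresh pages: bypass the per-cpu lists and queue them at the
                 * tail of the freelist; set_page_refcounted() is no longer
                 * needed here.
                 */
                __free_pages_ok(page, order, FPI_TO_TAIL);
        }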
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: Wei Yang <richard.weiyang@linux.alibaba.com>
      Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Wei Liu <wei.liu@kernel.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Scott Cheloha <cheloha@linux.ibm.com>
      Link: https://lkml.kernel.org/r/20201005121534.15649-5-david@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Conflicts:
      	mm/page_alloc.c
      [Peng Liu: adjust context]
      Signed-off-by: Peng Liu <liupeng256@huawei.com>
      Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • mm/page_alloc: move pages to tail in move_to_free_list() · 5be3d693
      David Hildenbrand authored
      mainline inclusion
      from mainline-v5.10-rc1
      commit 293ffa5e
      category: feature
      bugzilla: 182882
      CVE: NA
      
      -----------------------------------------------
      
      Whenever we move pages between freelists via move_to_free_list()/
      move_freepages_block(), we don't actually touch the pages:
      1. Page isolation doesn't actually touch the pages, it simply isolates
         pageblocks and moves all free pages to the MIGRATE_ISOLATE freelist.
         When undoing isolation, we move the pages back to the target list.
      2. Page stealing (steal_suitable_fallback()) moves free pages directly
         between lists without touching them.
      3. reserve_highatomic_pageblock()/unreserve_highatomic_pageblock() moves
         free pages directly between freelists without touching them.
      
      We already place pages to the tail of the freelists when undoing isolation
      via __putback_isolated_page(), let's do it in any case (e.g., if order <=
      pageblock_order) and document the behavior. To simplify, let's move the
      pages to the tail for all move_to_free_list()/move_freepages_block() users.
      
      In 2., the target list is empty, so there should be no change.  In 3., we
      might observe a change, however, highatomic is more concerned about
      allocations succeeding than cache hotness - if we ever realize this change
      degrades a workload, we can special-case this instance and add a proper
      comment.
      
      This change results in all pages getting onlined via online_pages() to be
      placed to the tail of the freelist.
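      The change itself is small; a sketch based on the mainline patch:

        /* Used for pages which are on another list, e.g. when undoing isolation. */
        static inline void move_to_free_list(struct page *page, struct zone *zone,
                                             unsigned int order, int migratetype)
        {
                struct free_area *area = &zone->free_area[order];

                /* list_move_tail() instead of list_move(): keep moved pages cold. */
                list_move_tail(&page->lru, &area->free_list[migratetype]);
        }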
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: Wei Yang <richard.weiyang@linux.alibaba.com>
      Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Scott Cheloha <cheloha@linux.ibm.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Wei Liu <wei.liu@kernel.org>
      Link: https://lkml.kernel.org/r/20201005121534.15649-4-david@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Conflicts:
      	mm/page_alloc.c
      [Peng Liu: cherry-pick from 293ffa5e]
      Signed-off-by: Peng Liu <liupeng256@huawei.com>
      Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • mm/page_alloc: place pages to tail in __putback_isolated_page() · 69e15bb4
      David Hildenbrand authored
      mainline inclusion
      from mainline-v5.10-rc1
      commit 47b6a24a
      category: feature
      bugzilla: 182882
      CVE: NA
      
      -----------------------------------------------
      
      __putback_isolated_page() already documents that pages will be placed to
      the tail of the freelist - this is, however, not the case for "order >=
      MAX_ORDER - 2" (see buddy_merge_likely()) - which should be the case for
      all existing users.
      
      This change affects two users:
      - free page reporting
      - page isolation, when undoing the isolation (including memory onlining).
      
      This behavior is desirable for pages that haven't really been touched
      lately, so exactly the two users that don't actually read/write page
      content, but rather move untouched pages.
      
      The new behavior is especially desirable for memory onlining, where we
      allow allocation of newly onlined pages via undo_isolate_page_range() in
      online_pages().  Right now, we always place them to the head of the
      freelist, resulting in undesirable behavior: Assume we add individual
      memory chunks via add_memory() and online them right away to the NORMAL
      zone.  We create a dependency chain of unmovable allocations e.g., via the
      memmap.  The memmap of the next chunk will be placed onto previous chunks
      - if the last block cannot get offlined+removed, all dependent ones cannot
      get offlined+removed.  While this can already be observed with individual
      DIMMs, it's more of an issue for virtio-mem (and I suspect also ppc
      DLPAR).
      
      Document that this should only be used for optimizations, and no code
      should rely on this behavior for correctness (if the order of the freelists
      ever changes).
      
      We won't care about page shuffling: memory onlining already properly
      shuffles after onlining.  free page reporting doesn't care about
      physically contiguous ranges, and there are already cases where page
      isolation will simply move (physically close) free pages to (currently)
      the head of the freelists via move_freepages_block() instead of shuffling.
      If this becomes ever relevant, we should shuffle the whole zone when
      undoing isolation of larger ranges, and after free_contig_range().
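      A sketch of the resulting mainline function (the backport may differ
      slightly after the noted conflict resolution):

        void __putback_isolated_page(struct page *page, unsigned int order, int mt)
        {
                struct zone *zone = page_zone(page);

                /* zone lock should be held when this function is called */
                lockdep_assert_held(&zone->lock);

                /* Return isolated page to tail of freelist. */
                __free_one_page(page, page_to_pfn(page), zone, order, mt,
                                FPI_SKIP_REPORT_NOTIFY | FPI_TO_TAIL);
        }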
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: Wei Yang <richard.weiyang@linux.alibaba.com>
      Reviewed-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Scott Cheloha <cheloha@linux.ibm.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Wei Liu <wei.liu@kernel.org>
      Link: https://lkml.kernel.org/r/20201005121534.15649-3-david@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Conflicts:
      	mm/page_alloc.c
      [Peng Liu: cherry-pick from 47b6a24a]
      Signed-off-by: Peng Liu <liupeng256@huawei.com>
      Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • mm/page_alloc: convert "report" flag of __free_one_page() to a proper flag · 22d4ccfa
      David Hildenbrand authored
      mainline inclusion
      from mainline-v5.10-rc1
      commit f04a5d5d
      category: feature
      bugzilla: 182882
      CVE: NA
      
      -----------------------------------------------
      
      Patch series "mm: place pages to the freelist tail when onlining and undoing isolation", v2.
      
      When adding separate memory blocks via add_memory*() and onlining them
      immediately, the metadata (especially the memmap) of the next block will
      be placed onto one of the just added+onlined block.  This creates a chain
      of unmovable allocations: if the last memory block cannot get
      offlined+removed, neither can any of the dependent ones.  We directly
      have unmovable allocations all over the place.
      
      This can be observed quite easily using virtio-mem, however, it can also
      be observed when using DIMMs.  The freshly onlined pages will usually be
      placed to the head of the freelists, meaning they will be allocated next,
      turning the just-added memory usually immediately un-removable.  The fresh
      pages are cold, prefering to allocate others (that might be hot) also
      feels to be the natural thing to do.
      
      It also applies to the Hyper-V balloon, Xen balloon, and ppc64 dlpar: when
      adding separate, successive memory blocks, each memory block will have
      unmovable allocations on them - for example gigantic pages will fail to
      allocate.
      
      While the ZONE_NORMAL doesn't provide any guarantees that memory can get
      offlined+removed again (any kind of fragmentation with unmovable
      allocations is possible), there are many scenarios (hotplugging a lot of
      memory, running workload, hotunplug some memory/as much as possible) where
      we can offline+remove quite a lot with this patchset.
      
      a) To visualize the problem, a very simple example:
      
      Start a VM with 4GB and 8GB of virtio-mem memory:
      
       [root@localhost ~]# lsmem
       RANGE                                 SIZE  STATE REMOVABLE  BLOCK
       0x0000000000000000-0x00000000bfffffff   3G online       yes   0-23
       0x0000000100000000-0x000000033fffffff   9G online       yes 32-103
      
       Memory block size:       128M
       Total online memory:      12G
       Total offline memory:      0B
      
      Then try to unplug as much as possible using virtio-mem. Observe which
      memory blocks are still around. Without this patch set:
      
       [root@localhost ~]# lsmem
       RANGE                                  SIZE  STATE REMOVABLE   BLOCK
       0x0000000000000000-0x00000000bfffffff    3G online       yes    0-23
       0x0000000100000000-0x000000013fffffff    1G online       yes   32-39
       0x0000000148000000-0x000000014fffffff  128M online       yes      41
       0x0000000158000000-0x000000015fffffff  128M online       yes      43
       0x0000000168000000-0x000000016fffffff  128M online       yes      45
       0x0000000178000000-0x000000017fffffff  128M online       yes      47
       0x0000000188000000-0x0000000197ffffff  256M online       yes   49-50
       0x00000001a0000000-0x00000001a7ffffff  128M online       yes      52
       0x00000001b0000000-0x00000001b7ffffff  128M online       yes      54
       0x00000001c0000000-0x00000001c7ffffff  128M online       yes      56
       0x00000001d0000000-0x00000001d7ffffff  128M online       yes      58
       0x00000001e0000000-0x00000001e7ffffff  128M online       yes      60
       0x00000001f0000000-0x00000001f7ffffff  128M online       yes      62
       0x0000000200000000-0x0000000207ffffff  128M online       yes      64
       0x0000000210000000-0x0000000217ffffff  128M online       yes      66
       0x0000000220000000-0x0000000227ffffff  128M online       yes      68
       0x0000000230000000-0x0000000237ffffff  128M online       yes      70
       0x0000000240000000-0x0000000247ffffff  128M online       yes      72
       0x0000000250000000-0x0000000257ffffff  128M online       yes      74
       0x0000000260000000-0x0000000267ffffff  128M online       yes      76
       0x0000000270000000-0x0000000277ffffff  128M online       yes      78
       0x0000000280000000-0x0000000287ffffff  128M online       yes      80
       0x0000000290000000-0x0000000297ffffff  128M online       yes      82
       0x00000002a0000000-0x00000002a7ffffff  128M online       yes      84
       0x00000002b0000000-0x00000002b7ffffff  128M online       yes      86
       0x00000002c0000000-0x00000002c7ffffff  128M online       yes      88
       0x00000002d0000000-0x00000002d7ffffff  128M online       yes      90
       0x00000002e0000000-0x00000002e7ffffff  128M online       yes      92
       0x00000002f0000000-0x00000002f7ffffff  128M online       yes      94
       0x0000000300000000-0x0000000307ffffff  128M online       yes      96
       0x0000000310000000-0x0000000317ffffff  128M online       yes      98
       0x0000000320000000-0x0000000327ffffff  128M online       yes     100
       0x0000000330000000-0x000000033fffffff  256M online       yes 102-103
      
       Memory block size:       128M
       Total online memory:     8.1G
       Total offline memory:      0B
      
      With this patch set:
      
       [root@localhost ~]# lsmem
       RANGE                                 SIZE  STATE REMOVABLE BLOCK
       0x0000000000000000-0x00000000bfffffff   3G online       yes  0-23
       0x0000000100000000-0x000000013fffffff   1G online       yes 32-39
      
       Memory block size:       128M
       Total online memory:       4G
       Total offline memory:      0B
      
      All memory can get unplugged, all memory block can get removed.  Of
      course, no workload ran and the system was basically idle, but it
      highlights the issue - the fairly deterministic chain of unmovable
      allocations.  When a huge page for the 2MB memmap is needed, a
      just-onlined 4MB page will be split.  The remaining 2MB page will be used
      for the memmap of the next memory block.  So one memory block will hold
      the memmap of the two following memory blocks.  Finally the pages of the
      last-onlined memory block will get used for the next bigger allocations -
      if any allocation is unmovable, all dependent memory blocks cannot get
      unplugged and removed until that allocation is gone.
      
      Note that with bigger memory blocks (e.g., 256MB), *all* memory
      blocks are dependent and none can get unplugged again!
      
      b) Experiment with memory intensive workload
      
      I performed an experiment with an older version of this patch set (before
      we used undo_isolate_page_range() in online_pages()): Hotplug 56GB to a VM
      with an initial 4GB, onlining all memory to ZONE_NORMAL right from the
      kernel when adding it.  I then run various memory intensive workloads that
      consume most system memory for a total of 45 minutes.  Once finished, I
      try to unplug as much memory as possible.
      
      With this change, I am able to remove via virtio-mem (adding individual
      128MB memory blocks) 413 out of 448 added memory blocks.  Via individual
      (256MB) DIMMs 380 out of 448 added memory blocks.  (I don't have any
      numbers without this patchset, but looking at the above example, it's at
      most half of the 448 memory blocks for virtio-mem, and most probably none
      for DIMMs).
      
      Again, there are workloads that might behave very differently due to the
      nature of ZONE_NORMAL.
      
      This change also affects (besides memory onlining):
      - Other users of undo_isolate_page_range(): Pages are always placed to the
        tail.
      -- When memory offlining fails
      -- When memory isolation fails after having isolated some pageblocks
      -- When alloc_contig_range() either succeeds or fails
      - Other users of __putback_isolated_page(): Pages are always placed to the
        tail.
      -- Free page reporting
      - Other users of __free_pages_core()
      -- AFAIK, any memory that is getting exposed to the buddy during boot.
         IIUC we will now usually allocate memory from lower addresses within
         a zone first (especially during boot).
      - Other users of generic_online_page()
      -- Hyper-V balloon
      
      This patch (of 5):
      
      Let's prepare for additional flags and avoid long parameter lists of
      bools.  Follow-up patches will also make use of the flags in
      __free_pages_ok().
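      A sketch of the flag type this patch introduces in mm/page_alloc.c,
      based on the mainline commit (the backport may differ in detail):

        /*
         * Free Page Internal flags: for the internal, non-pcp variants of
         * free_pages().
         */
        typedef int __bitwise fpi_t;

        /* No special request */
        #define FPI_NONE                ((__force fpi_t)0)

        /*
         * Skip free page reporting notification for the (possibly merged)
         * page.  Replaces the old "bool report" parameter of __free_one_page().
         */
        #define FPI_SKIP_REPORT_NOTIFY  ((__force fpi_t)BIT(0))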
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: Wei Yang <richard.weiyang@linux.alibaba.com>
      Reviewed-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Scott Cheloha <cheloha@linux.ibm.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Wei Liu <wei.liu@kernel.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Link: https://lkml.kernel.org/r/20201005121534.15649-1-david@redhat.com
      Link: https://lkml.kernel.org/r/20201005121534.15649-2-david@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Conflicts:
      	mm/page_alloc.c
      [Peng Liu: cherry-pick from f04a5d5d]
      Signed-off-by: Peng Liu <liupeng256@huawei.com>
      Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • mm: add function __putback_isolated_page · 91bac231
      Alexander Duyck authored
      mainline inclusion
      from mainline-v5.7-rc1
      commit 624f58d8
      category: feature
      bugzilla: 182882
      CVE: NA
      
      -----------------------------------------------
      
      There are cases where we would benefit from avoiding having to go through
      the allocation and free cycle to return an isolated page.
      
      Examples for this might include page poisoning in which we isolate a page
      and then put it back in the free list without ever having actually
      allocated it.
      
      This will enable us to also avoid notifiers for the future free page
      reporting which will need to avoid retriggering page reporting when
      returning pages that have been reported on.
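      A sketch of the function as added by this patch (abridged; the exact
      __free_one_page() arguments depend on the surrounding series):

        void __putback_isolated_page(struct page *page, unsigned int order, int mt)
        {
                struct zone *zone = page_zone(page);

                /* zone lock should be held when this function is called */
                lockdep_assert_held(&zone->lock);

                /* Return isolated page to tail of freelist. */
                __free_one_page(page, page_to_pfn(page), zone, order, mt);
        }

      Callers such as the page-isolation undo path can then hand an isolated
      page straight back to the freelist instead of going through a full
      allocation/free cycle.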
      Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: David Hildenbrand <david@redhat.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Nitesh Narayan Lal <nitesh@redhat.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Pankaj Gupta <pagupta@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Wang <wei.w.wang@intel.com>
      Cc: Yang Zhang <yang.zhang.wz@gmail.com>
      Cc: wei qi <weiqi4@huawei.com>
      Link: http://lkml.kernel.org/r/20200211224624.29318.89287.stgit@localhost.localdomain
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Conflicts:
              mm/internal.h
      [Peng Liu: cherry-pick from 624f58d8]
      Signed-off-by: Peng Liu <liupeng256@huawei.com>
      Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • mm/page_alloc.c: memory hotplug: free pages as higher order · c479b04b
      Arun KS authored
      mainline inclusion
      from mainline-v5.1-rc1
      commit a9cd410a
      category: feature
      bugzilla: 182882
      CVE: NA
      
      -----------------------------------------------
      
      When pages are freed at a higher order, the time spent coalescing pages
      in the buddy allocator can be reduced.  With a section size of 256MB,
      the hot-add latency of a single section improves from 50-60 ms to less
      than 1 ms, improving the hot-add latency by about 60 times.  External
      providers of the online callback are modified to align with the change
      (see the sketch below).
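      An abridged sketch of the chunked onlining loop this patch introduces
      (simplified from the mainline change; alignment handling omitted):

        static unsigned long online_pages_blocks(unsigned long start,
                                                 unsigned long nr_pages)
        {
                unsigned long end = start + nr_pages, onlined = 0;
                int order;

                while (start < end) {
                        /* Free the largest buddy-sized chunk that still fits. */
                        order = min(MAX_ORDER - 1,
                                    get_order(PFN_PHYS(end) - PFN_PHYS(start)));
                        (*online_page_callback)(pfn_to_page(start), order);

                        onlined += 1UL << order;
                        start += 1UL << order;
                }
                return onlined;
        }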
      
      [arunks@codeaurora.org: v11]
        Link: http://lkml.kernel.org/r/1547792588-18032-1-git-send-email-arunks@codeaurora.org
      [akpm@linux-foundation.org: remove unused local, per Arun]
      [akpm@linux-foundation.org: avoid return of void-returning __free_pages_core(), per Oscar]
      [akpm@linux-foundation.org: fix it for mm-convert-totalram_pages-and-totalhigh_pages-variables-to-atomic.patch]
      [arunks@codeaurora.org: v8]
        Link: http://lkml.kernel.org/r/1547032395-24582-1-git-send-email-arunks@codeaurora.org
      [arunks@codeaurora.org: v9]
        Link: http://lkml.kernel.org/r/1547098543-26452-1-git-send-email-arunks@codeaurora.org
      Link: http://lkml.kernel.org/r/1538727006-5727-1-git-send-email-arunks@codeaurora.org
      Signed-off-by: Arun KS <arunks@codeaurora.org>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: K. Y. Srinivasan <kys@microsoft.com>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Mathieu Malaterre <malat@debian.org>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Souptick Joarder <jrdr.linux@gmail.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Aaron Lu <aaron.lu@intel.com>
      Cc: Srivatsa Vaddagiri <vatsa@codeaurora.org>
      Cc: Vinayak Menon <vinmenon@codeaurora.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Conflicts:
      	mm/page_alloc.c
      	mm/memory_hotplug.c
      [Peng Liu: adjust context]
      Signed-off-by: Peng Liu <liupeng256@huawei.com>
      Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • raid1: ensure write behind bio has less than BIO_MAX_VECS sectors · a2e9fa04
      Guoqing Jiang authored
      mainline inclusion
      from mainline-v5.15-rc1
      commit 6607cd31
      category: bugfix
      bugzilla: 182883, https://gitee.com/openeuler/kernel/issues/I4ENHY
      CVE: NA
      
      -------------------------------------------------
      
      We can't split a write-behind bio with more than BIO_MAX_VECS sectors,
      otherwise the call trace below is triggered, because an oversized
      write-behind bio could be allocated later.
      
      [ 8.097936] bvec_alloc+0x90/0xc0
      [ 8.098934] bio_alloc_bioset+0x1b3/0x260
      [ 8.099959] raid1_make_request+0x9ce/0xc50 [raid1]
      [ 8.100988] ? __bio_clone_fast+0xa8/0xe0
      [ 8.102008] md_handle_request+0x158/0x1d0 [md_mod]
      [ 8.103050] md_submit_bio+0xcd/0x110 [md_mod]
      [ 8.104084] submit_bio_noacct+0x139/0x530
      [ 8.105127] submit_bio+0x78/0x1d0
      [ 8.106163] ext4_io_submit+0x48/0x60 [ext4]
      [ 8.107242] ext4_writepages+0x652/0x1170 [ext4]
      [ 8.108300] ? do_writepages+0x41/0x100
      [ 8.109338] ? __ext4_mark_inode_dirty+0x240/0x240 [ext4]
      [ 8.110406] do_writepages+0x41/0x100
      [ 8.111450] __filemap_fdatawrite_range+0xc5/0x100
      [ 8.112513] file_write_and_wait_range+0x61/0xb0
      [ 8.113564] ext4_sync_file+0x73/0x370 [ext4]
      [ 8.114607] __x64_sys_fsync+0x33/0x60
      [ 8.115635] do_syscall_64+0x33/0x40
      [ 8.116670] entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      Thanks for the comment from Christoph.
      
      [1]. https://bugs.archlinux.org/task/70992
      
      Cc: stable@vger.kernel.org # v5.12+
      Reported-by: Jens Stutte <jens@chianterastutte.eu>
      Tested-by: Jens Stutte <jens@chianterastutte.eu>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Guoqing Jiang <jiangguoqing@kylinos.cn>
      Signed-off-by: Song Liu <songliubraving@fb.com>
      
      Conflict:
              drivers/md/raid1.c
              [ Mainline patch 6607cd31 ("raid1: ensure write behind bio has less
              than BIO_MAX_VECS sectors"), BIO_MAX_VECS is used directly, but the
              BIO_MAX_VECS was renamed previously and the corresponding patch
              a8affc03 ("block: rename BIO_MAX_PAGES to BIO_MAX_VECS") was not
              incorporated. So we modify BIO_MAX_VECS to the original BIO_MAX_PAGES.]
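      The core of the fix as applied here, as a sketch (per the conflict note
      the backport uses BIO_MAX_PAGES instead of BIO_MAX_VECS; write_behind is
      a local flag set earlier when any mirror is marked WriteMostly):

                /* In raid1_write_request(), before the bio is split: */
                if (write_behind && bitmap)
                        max_sectors = min_t(int, max_sectors,
                                            BIO_MAX_PAGES * (PAGE_SIZE >> 9));

      This caps a write-behind request at BIO_MAX_PAGES pages, so the behind
      bio allocated later can never exceed what bvec_alloc() supports.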
      Signed-off-by: Laibin Qiu <qiulaibin@huawei.com>
      Reviewed-by: Jason Yan <yanaijie@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • blk-wbt: fix IO hang due to negative inflight counter · 02512915
      Laibin Qiu authored
      hulk inclusion
      category: bugfix
      bugzilla: 182135, https://gitee.com/openeuler/kernel/issues/I4ENC8
      CVE: NA
      
      --------------------------
      
      A block test reported the following stack: some requests were waiting
      for a wakeup in wbt_wait(), and the vmcore showed that the wbt inflight
      counter was -1, so the requests could never be woken up.
      
      PID: 75416  TASK: ffff88836c098000  CPU: 2   COMMAND: "fsstress"
      [ffff8882e59a7608] __schedule at ffffffffb2d22a25
      [ffff8882e59a7720] schedule at ffffffffb2d2358f
      [ffff8882e59a7738] io_schedule at ffffffffb2d23bdc
      [ffff8882e59a7750] rq_qos_wait at ffffffffb2400fde
      [ffff8882e59a7878] wbt_wait at ffffffffb243a051
      [ffff8882e59a7910] __rq_qos_throttle at ffffffffb2400a20
      [ffff8882e59a7930] blk_mq_make_request at ffffffffb23de038
      [ffff8882e59a7a98] generic_make_request at ffffffffb23c393d
      [ffff8882e59a7b80] submit_bio at ffffffffb23c3db8
      [ffff8882e59a7c48] submit_bio_wait at ffffffffb23b3a5d
      [ffff8882e59a7cf0] blkdev_issue_flush at ffffffffb23c8f4c
      [ffff8882e59a7d20] ext4_sync_fs at ffffffffc06dd708 [ext4]
      [ffff8882e59a7dd0] sync_filesystem at ffffffffb21e8335
      [ffff8882e59a7df8] ovl_sync_fs at ffffffffc0fd853a [overlay]
      [ffff8882e59a7e10] sync_fs_one_sb at ffffffffb21e8221
      [ffff8882e59a7e30] iterate_supers at ffffffffb218401e
      [ffff8882e59a7e70] ksys_sync at ffffffffb21e8588
      [ffff8882e59a7f20] __x64_sys_sync at ffffffffb21e861f
      [ffff8882e59a7f28] do_syscall_64 at ffffffffb1c06bc8
      [ffff8882e59a7f50] entry_SYSCALL_64_after_hwframe at ffffffffb2e000ad
      RIP: 00007f479ab13347  RSP: 00007ffd4dda9fe8  RFLAGS: 00000202
      RAX: ffffffffffffffda  RBX: 0000000000000068  RCX: 00007f479ab13347
      RDX: 0000000000000000  RSI: 000000003e1b142d  RDI: 0000000000000068
      RBP: 0000000051eb851f   R8: 00007f479abd4034   R9: 00007f479abd40a0
      R10: 0000000000000000  R11: 0000000000000202  R12: 0000000000402c20
      R13: 0000000000000001  R14: 0000000000000000  R15: 7fffffffffffffff
      
      The ->inflight counter may become negative (-1) if:

      1) blk-wbt was disabled when the IO was issued, which will add to the
         inflight count.

      2) blk-wbt was enabled before this IO was tracked.

      3) the ->inflight counter is then decreased from 0 to -1 in endio().

      Fix the problem by freezing the queue while enabling wbt, so that no
      request is in flight at that point (see the sketch below).
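      The idea, roughly (a sketch only; this is an out-of-tree fix and the
      actual placement of the freeze may differ):

                /* When wbt is being (re)enabled, e.g. from the sysfs handler: */
                blk_mq_freeze_queue(q);     /* drain every in-flight request */
                wbt_enable_default(q);      /* enable accounting with inflight == 0 */
                blk_mq_unfreeze_queue(q);

      With the queue frozen, no request can be issued while wbt is disabled
      and complete after it is enabled, so the inflight counter can no longer
      be decremented below zero.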
      Signed-off-by: Laibin Qiu <qiulaibin@huawei.com>
      Reviewed-by: Hou Tao <houtao1@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
  5. 19 Oct 2021 (7 commits)
  6. 18 Oct 2021 (8 commits)