1. 08 Jul 2021, 2 commits
  2. 06 Jul 2021, 5 commits
  3. 03 Jul 2021, 4 commits
  4. 15 Jun 2021, 3 commits
  5. 04 Jun 2021, 2 commits
  6. 03 Jun 2021, 16 commits
  7. 26 Apr 2021, 2 commits
  8. 22 Apr 2021, 1 commit
    • mm: fix race by making init_zero_pfn() early_initcall · a202e050
      Authored by Ilya Lipnitskiy
      stable inclusion
      from stable-5.10.28
      commit ec3e06e06f763d8ef5ba3919e2c4e59366b5b92a
      bugzilla: 51779
      
      --------------------------------
      
      commit e720e7d0 upstream.
      
      There are code paths that rely on zero_pfn to be fully initialized
      before core_initcall.  For example, wq_sysfs_init() is a core_initcall
      function that eventually results in a call to kernel_execve, which
      causes a page fault with a subsequent mmput.  If zero_pfn is not
      initialized by then it may not get cleaned up properly and result in an
      error:
      
        BUG: Bad rss-counter state mm:(ptrval) type:MM_ANONPAGES val:1
      
      Here is an analysis of the race as seen on a MIPS device. On this
      particular MT7621 device (Ubiquiti ER-X), zero_pfn is PFN 0 until
      initialized, at which point it becomes PFN 5120:
      
        1. wq_sysfs_init calls into kobject_uevent_env at core_initcall:
             kobject_uevent_env+0x7e4/0x7ec
             kset_register+0x68/0x88
             bus_register+0xdc/0x34c
             subsys_virtual_register+0x34/0x78
             wq_sysfs_init+0x1c/0x4c
             do_one_initcall+0x50/0x1a8
             kernel_init_freeable+0x230/0x2c8
             kernel_init+0x10/0x100
             ret_from_kernel_thread+0x14/0x1c
      
        2. kobject_uevent_env() calls call_usermodehelper_exec() which executes
           kernel_execve asynchronously.
      
        3. Memory allocations in kernel_execve cause a page fault, bumping the
           MM reference counter:
             add_mm_counter_fast+0xb4/0xc0
             handle_mm_fault+0x6e4/0xea0
             __get_user_pages.part.78+0x190/0x37c
             __get_user_pages_remote+0x128/0x360
             get_arg_page+0x34/0xa0
             copy_string_kernel+0x194/0x2a4
             kernel_execve+0x11c/0x298
             call_usermodehelper_exec_async+0x114/0x194
      
        4. In case zero_pfn has not been initialized yet, zap_pte_range does
           not decrement the MM_ANONPAGES RSS counter and the BUG message is
           triggered shortly afterwards when __mmdrop checks the ref counters:
             __mmdrop+0x98/0x1d0
             free_bprm+0x44/0x118
             kernel_execve+0x160/0x1d8
             call_usermodehelper_exec_async+0x114/0x194
             ret_from_kernel_thread+0x14/0x1c
      
      To avoid races such as described above, initialize init_zero_pfn at
      early_initcall level.  Depending on the architecture, ZERO_PAGE is
      either constant or gets initialized even earlier, at paging_init, so
      there is no issue with initializing zero_pfn earlier.
      
      Link: https://lkml.kernel.org/r/CALCv0x2YqOXEAy2Q=hafjhHCtTHVodChv1qpM=niAXOpqEbt7w@mail.gmail.com
      Signed-off-by: Ilya Lipnitskiy <ilya.lipnitskiy@gmail.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: stable@vger.kernel.org
      Tested-by: 周琰杰 (Zhou Yanjie) <zhouyanjie@wanyeetech.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
  9. 19 Apr 2021, 5 commits
    • mm/memcg: fix 5.10 backport of splitting page memcg · 65dff3b2
      Authored by Hugh Dickins
      stable inclusion
      from stable-5.10.27
      commit 002ea848d7fd3bdcb6281e75bdde28095c2cd549
      bugzilla: 51493
      
      --------------------------------
      
      The straight backport of 5.12's e1baddf8 ("mm/memcg: set memcg when
      splitting page") works fine in 5.11, but turned out to be wrong for
      5.10, because 5.10 relies on a separate flag, which must also be set
      for the memcg to be recognized, uncharged, and cleared when freeing.
      Fix that.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
    • mm/mmu_notifiers: ensure range_end() is paired with range_start() · be2fd8ca
      Authored by Sean Christopherson
      stable inclusion
      from stable-5.10.27
      commit de2e6b4e32d6be7ed2218c1b20a9f81f8859ec2a
      bugzilla: 51493
      
      --------------------------------
      
      [ Upstream commit c2655835 ]
      
      If one or more notifiers fails .invalidate_range_start(), invoke
      .invalidate_range_end() for "all" notifiers.  If there are multiple
      notifiers, those that did not fail are expecting _start() and _end() to
      be paired, e.g.  KVM's mmu_notifier_count would become imbalanced.
      Disallow notifiers that can fail _start() from implementing _end() so
      that it's unnecessary to either track which notifiers rejected _start(),
      or had already succeeded prior to a failed _start().
      
      Note, the existing behavior of calling _start() on all notifiers even
      after a previous notifier failed _start() was an unintended "feature".
      Make it canon now that the behavior is depended on for correctness.
      
      As of today, the bug is likely benign:
      
        1. The only caller of the non-blocking notifier is OOM kill.
        2. The only notifiers that can fail _start() are the i915 and Nouveau
           drivers.
        3. The only notifiers that utilize _end() are the SGI UV GRU driver
           and KVM.
        4. The GRU driver will never coincide with the i915/Nouveau drivers.
        5. An imbalanced kvm->mmu_notifier_count only causes soft lockup in the
           _guest_, and the guest is already doomed due to being an OOM victim.
      
      Fix the bug now to play nice with future usage, e.g.  KVM has a
      potential use case for blocking memslot updates in KVM while an
      invalidation is in-progress, and failure to unblock would result in said
      updates being blocked indefinitely and hanging.
      
      Found by inspection.  Verified by adding a second notifier in KVM that
      periodically returns -EAGAIN on non-blockable ranges, triggering OOM,
      and observing that KVM exits with an elevated notifier count.
      
      Link: https://lkml.kernel.org/r/20210311180057.1582638-1-seanjc@google.com
      Fixes: 93065ac7 ("mm, oom: distinguish blockable mode for mmu notifiers")
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Suggested-by: Jason Gunthorpe <jgg@ziepe.ca>
      Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ben Gardon <bgardon@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: "Jérôme Glisse" <jglisse@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Dimitri Sivanich <dimitri.sivanich@hpe.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
    • hugetlb_cgroup: fix imbalanced css_get and css_put pair for shared mappings · c329e216
      Authored by Miaohe Lin
      stable inclusion
      from stable-5.10.27
      commit fe03ccc3ce906a31005637263fb82dd84d5d1dac
      bugzilla: 51493
      
      --------------------------------
      
      commit d85aecf2 upstream.
      
      The current implementation of hugetlb_cgroup for shared mappings could
      have different behavior.  Consider the following two scenarios:
      
       1. Assume initial css reference count of hugetlb_cgroup is 1:
        1.1 Call hugetlb_reserve_pages with from = 1, to = 2. So css reference
            count is 2 associated with 1 file_region.
        1.2 Call hugetlb_reserve_pages with from = 2, to = 3. So css reference
            count is 3 associated with 2 file_region.
        1.3 coalesce_file_region will coalesce these two file_regions into
            one. So css reference count is 3 associated with 1 file_region
            now.
      
       2. Assume initial css reference count of hugetlb_cgroup is 1 again:
        2.1 Call hugetlb_reserve_pages with from = 1, to = 3. So css reference
            count is 2 associated with 1 file_region.
      
      Therefore, we might have one file_region while holding one or more css
      references. This inconsistency could lead to imbalanced css_get() and
      css_put() pairs. If we do css_put one by one (e.g. the hole punch
      case), scenario 2 would put one css reference too many. If we do
      css_put all together (e.g. the truncate case), scenario 1 would leak
      one css reference.
      
      The imbalanced css_get() and css_put() pair would result in a non-zero
      reference count when we try to destroy the hugetlb cgroup. The hugetlb
      cgroup directory is removed, but the associated resources are not
      freed. This might ultimately result in OOM, or in being unable to
      create a new hugetlb cgroup in a busy workload.
      
      In order to fix this, we have to make sure that one file_region must
      hold exactly one css reference. So in coalesce_file_region case, we
      should release one css reference before coalescence. Also only put css
      reference when the entire file_region is removed.
      
      The last thing to note is that the caller of region_add() only holds
      one reference to h_cg->css for the whole contiguous reservation
      region. But this area might be scattered when there are already some
      file_regions residing in it. As a result, many file_regions may share
      only one h_cg->css reference. In order to ensure that one file_region
      holds exactly one css reference, we should do css_get() for each
      file_region and release the reference held by the caller when done.
      
      [linmiaohe@huawei.com: fix imbalanced css_get and css_put pair for shared mappings]
        Link: https://lkml.kernel.org/r/20210316023002.53921-1-linmiaohe@huawei.com
      
      Link: https://lkml.kernel.org/r/20210301120540.37076-1-linmiaohe@huawei.com
      Fixes: 075a61d0 ("hugetlb_cgroup: add accounting for shared mappings")
      Reported-by: kernel test robot <lkp@intel.com> (auto build test ERROR)
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Wanpeng Li <liwp.linux@gmail.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
    • z3fold: prevent reclaim/free race for headless pages · 49f10ff7
      Authored by Thomas Hebb
      stable inclusion
      from stable-5.10.27
      commit 1d215fcbc4ef305614871bbb2399f7b4670cb266
      bugzilla: 51493
      
      --------------------------------
      
      commit 6d679578 upstream.
      
      Commit ca0246bb ("z3fold: fix possible reclaim races") introduced
      the PAGE_CLAIMED flag "to avoid racing on a z3fold 'headless' page
      release." By atomically testing and setting the bit in each of
      z3fold_free() and z3fold_reclaim_page(), a double-free was avoided.
      
      However, commit dcf5aedb ("z3fold: stricter locking and more careful
      reclaim") appears to have unintentionally broken this behavior by moving
      the PAGE_CLAIMED check in z3fold_reclaim_page() to after the page lock
      gets taken, which only happens for non-headless pages.  For headless
      pages, the check is now skipped entirely and races can occur again.
      
      I have observed such a race on my system:
      
          page:00000000ffbd76b7 refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x165316
          flags: 0x2ffff0000000000()
          raw: 02ffff0000000000 ffffea0004535f48 ffff8881d553a170 0000000000000000
          raw: 0000000000000000 0000000000000011 00000000ffffffff 0000000000000000
          page dumped because: VM_BUG_ON_PAGE(page_ref_count(page) == 0)
          ------------[ cut here ]------------
          kernel BUG at include/linux/mm.h:707!
          invalid opcode: 0000 [#1] PREEMPT SMP KASAN PTI
          CPU: 2 PID: 291928 Comm: kworker/2:0 Tainted: G    B             5.10.7-arch1-1-kasan #1
          Hardware name: Gigabyte Technology Co., Ltd. H97N-WIFI/H97N-WIFI, BIOS F9b 03/03/2016
          Workqueue: zswap-shrink shrink_worker
          RIP: 0010:__free_pages+0x10a/0x130
          Code: c1 e7 06 48 01 ef 45 85 e4 74 d1 44 89 e6 31 d2 41 83 ec 01 e8 e7 b0 ff ff eb da 48 c7 c6 e0 32 91 88 48 89 ef e8 a6 89 f8 ff <0f> 0b 4c 89 e7 e8 fc 79 07 00 e9 33 ff ff ff 48 89 ef e8 ff 79 07
          RSP: 0000:ffff88819a2ffb98 EFLAGS: 00010296
          RAX: 0000000000000000 RBX: ffffea000594c5a8 RCX: 0000000000000000
          RDX: 1ffffd4000b298b7 RSI: 0000000000000000 RDI: ffffea000594c5b8
          RBP: ffffea000594c580 R08: 000000000000003e R09: ffff8881d5520bbb
          R10: ffffed103aaa4177 R11: 0000000000000001 R12: ffffea000594c5b4
          R13: 0000000000000000 R14: ffff888165316000 R15: ffffea000594c588
          FS:  0000000000000000(0000) GS:ffff8881d5500000(0000) knlGS:0000000000000000
          CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
          CR2: 00007f7c8c3654d8 CR3: 0000000103f42004 CR4: 00000000001706e0
          Call Trace:
           z3fold_zpool_shrink+0x9b6/0x1240
           shrink_worker+0x35/0x90
           process_one_work+0x70c/0x1210
           worker_thread+0x539/0x1200
           kthread+0x330/0x400
           ret_from_fork+0x22/0x30
          Modules linked in: rfcomm ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter ccm algif_aead des_generic libdes ecb algif_skcipher cmac bnep md4 algif_hash af_alg vfat fat intel_rapl_msr intel_rapl_common x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel iwlmvm hid_logitech_hidpp kvm at24 mac80211 snd_hda_codec_realtek iTCO_wdt snd_hda_codec_generic intel_pmc_bxt snd_hda_codec_hdmi ledtrig_audio iTCO_vendor_support mei_wdt mei_hdcp snd_hda_intel snd_intel_dspcfg libarc4 soundwire_intel irqbypass iwlwifi soundwire_generic_allocation rapl soundwire_cadence intel_cstate snd_hda_codec intel_uncore btusb joydev mousedev snd_usb_audio pcspkr btrtl uvcvideo nouveau btbcm i2c_i801 btintel snd_hda_core videobuf2_vmalloc i2c_smbus snd_usbmidi_lib videobuf2_memops bluetooth snd_hwdep soundwire_bus snd_soc_rt5640 videobuf2_v4l2 cfg80211 snd_soc_rl6231 videobuf2_common snd_rawmidi lpc_ich alx videodev mdio snd_seq_device snd_soc_core mc ecdh_generic mxm_wmi mei_me
           hid_logitech_dj wmi snd_compress e1000e ac97_bus mei ttm rfkill snd_pcm_dmaengine ecc snd_pcm snd_timer snd soundcore mac_hid acpi_pad pkcs8_key_parser it87 hwmon_vid crypto_user fuse ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 dm_crypt cbc encrypted_keys trusted tpm rng_core usbhid dm_mod crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel crypto_simd cryptd glue_helper xhci_pci xhci_pci_renesas i915 video intel_gtt i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops cec drm agpgart
          ---[ end trace 126d646fc3dc0ad8 ]---
      
      To fix the issue, re-add the earlier test and set in the case where we
      have a headless page.
      
      Link: https://lkml.kernel.org/r/c8106dbe6d8390b290cd1d7f873a2942e805349e.1615452048.git.tommyhebb@gmail.com
      Fixes: dcf5aedb ("z3fold: stricter locking and more careful reclaim")
      Signed-off-by: Thomas Hebb <tommyhebb@gmail.com>
      Reviewed-by: Vitaly Wool <vitaly.wool@konsulko.com>
      Cc: Jongseok Kim <ks77sj@gmail.com>
      Cc: Snild Dolkow <snild@sony.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
    • mm/memcg: set memcg when splitting page · 5e0ab4f2
      Authored by Zhou Guanghui
      stable inclusion
      from stable-5.10.27
      commit efb12c03fcd0ca9cca2a1bde790348c25485c5c0
      bugzilla: 51493
      
      --------------------------------
      
      commit e1baddf8 upstream.
      
      As described in the split_page() comment, for a non-compound
      high-order page, the sub-pages must be freed individually.  If only
      the memcg of the first page is valid, the tail pages cannot be
      uncharged when they are freed.
      
      For example, when alloc_pages_exact is used to allocate 1MB of
      contiguous physical memory, 2MB is charged (kmemcg is enabled and
      __GFP_ACCOUNT is set).  When make_alloc_exact frees the unused 1MB
      and free_pages_exact frees the allocated 1MB, only 4KB (one page) is
      actually uncharged.
      
      Therefore, the memcg of the tail page needs to be set when splitting a
      page.
      
      Michal:
      
      There are at least two explicit users of __GFP_ACCOUNT with
      alloc_exact_pages added recently.  See 7efe8ef2 ("KVM: arm64:
      Allocate stage-2 pgd pages with GFP_KERNEL_ACCOUNT") and c4196218
      ("KVM: s390: Add memcg accounting to KVM allocations"), so this is not
      just a theoretical issue.
      
      Link: https://lkml.kernel.org/r/20210304074053.65527-3-zhouguanghui1@huawei.com
      Signed-off-by: Zhou Guanghui <zhouguanghui1@huawei.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Zi Yan <ziy@nvidia.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Hanjun Guo <guohanjun@huawei.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Rui Xiang <rui.xiang@huawei.com>
      Cc: Tianhong Ding <dingtianhong@huawei.com>
      Cc: Weilong Chen <chenweilong@huawei.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>