1. 17 April 2021 — 1 commit
  2. 10 April 2021 — 2 commits
  3. 09 April 2021 — 1 commit
  4. 31 March 2021 — 1 commit
    • mm: fix race by making init_zero_pfn() early_initcall · e720e7d0
      Authored by Ilya Lipnitskiy
      There are code paths that rely on zero_pfn to be fully initialized
      before core_initcall.  For example, wq_sysfs_init() is a core_initcall
      function that eventually results in a call to kernel_execve, which
      causes a page fault with a subsequent mmput.  If zero_pfn is not
      initialized by then it may not get cleaned up properly and result in an
      error:
      
        BUG: Bad rss-counter state mm:(ptrval) type:MM_ANONPAGES val:1
      
      Here is an analysis of the race as seen on a MIPS device. On this
      particular MT7621 device (Ubiquiti ER-X), zero_pfn is PFN 0 until
      initialized, at which point it becomes PFN 5120:
      
        1. wq_sysfs_init calls into kobject_uevent_env at core_initcall:
             kobject_uevent_env+0x7e4/0x7ec
             kset_register+0x68/0x88
             bus_register+0xdc/0x34c
             subsys_virtual_register+0x34/0x78
             wq_sysfs_init+0x1c/0x4c
             do_one_initcall+0x50/0x1a8
             kernel_init_freeable+0x230/0x2c8
             kernel_init+0x10/0x100
             ret_from_kernel_thread+0x14/0x1c
      
        2. kobject_uevent_env() calls call_usermodehelper_exec() which executes
           kernel_execve asynchronously.
      
        3. Memory allocations in kernel_execve cause a page fault, bumping the
           MM reference counter:
             add_mm_counter_fast+0xb4/0xc0
             handle_mm_fault+0x6e4/0xea0
             __get_user_pages.part.78+0x190/0x37c
             __get_user_pages_remote+0x128/0x360
             get_arg_page+0x34/0xa0
             copy_string_kernel+0x194/0x2a4
             kernel_execve+0x11c/0x298
             call_usermodehelper_exec_async+0x114/0x194
      
        4. In case zero_pfn has not been initialized yet, zap_pte_range does
           not decrement the MM_ANONPAGES RSS counter and the BUG message is
           triggered shortly afterwards when __mmdrop checks the ref counters:
             __mmdrop+0x98/0x1d0
             free_bprm+0x44/0x118
             kernel_execve+0x160/0x1d8
             call_usermodehelper_exec_async+0x114/0x194
             ret_from_kernel_thread+0x14/0x1c
      
      To avoid races such as the one described above, run init_zero_pfn()
      at the early_initcall level.  Depending on the architecture, ZERO_PAGE
      is either constant or gets initialized even earlier, at paging_init(),
      so there is no issue with initializing zero_pfn earlier.
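      
      The essence of the fix is a one-line change of the initcall level (a
      sketch, assuming init_zero_pfn() is defined in mm/memory.c as in
      mainline):
      
        /* mm/memory.c */
        static int __init init_zero_pfn(void)
        {
                zero_pfn = page_to_pfn(ZERO_PAGE(0));
                return 0;
        }
        -core_initcall(init_zero_pfn);   /* too late for early kernel_execve users */
        +early_initcall(init_zero_pfn);  /* runs before any core_initcall */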
      
      Link: https://lkml.kernel.org/r/CALCv0x2YqOXEAy2Q=hafjhHCtTHVodChv1qpM=niAXOpqEbt7w@mail.gmail.com
      Signed-off-by: Ilya Lipnitskiy <ilya.lipnitskiy@gmail.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: stable@vger.kernel.org
      Tested-by: 周琰杰 (Zhou Yanjie) <zhouyanjie@wanyeetech.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  5. 26 March 2021 — 5 commits
    • mm/highmem: fix CONFIG_DEBUG_KMAP_LOCAL_FORCE_MAP · 487cfade
      Authored by Ira Weiny
      The kernel test robot found that __kmap_local_sched_out() was not
      correctly skipping the guard pages when DEBUG_KMAP_LOCAL_FORCE_MAP was
      set [1].  This was due to the DEBUG_HIGHMEM check being used instead.
      
      Change the configuration check to be correct.
      
      [1] https://lore.kernel.org/lkml/20210304083825.GB17830@xsang-OptiPlex-9020/
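      
      A sketch of the change, assuming the guard-page skip in mm/highmem.c
      (__kmap_local_sched_out() and __kmap_local_sched_in()) as in mainline
      at the time:
      
        /* With debug all even slots are unmapped and act as guard */
        -if (IS_ENABLED(CONFIG_DEBUG_HIGHMEM) && !(i & 0x01)) {
        +if (IS_ENABLED(CONFIG_DEBUG_KMAP_LOCAL_FORCE_MAP) && !(i & 0x01)) {
                WARN_ON_ONCE(!pte_none(pteval));
                continue;
        }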
      
      Link: https://lkml.kernel.org/r/20210318230657.1497881-1-ira.weiny@intel.com
      Fixes: 0e91a0c6 ("mm/highmem: Provide CONFIG_DEBUG_KMAP_LOCAL_FORCE_MAP")
      Signed-off-by: Ira Weiny <ira.weiny@intel.com>
      Reported-by: kernel test robot <oliver.sang@intel.com>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Oliver Sang <oliver.sang@intel.com>
      Cc: Chaitanya Kulkarni <Chaitanya.Kulkarni@wdc.com>
      Cc: David Sterba <dsterba@suse.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kfence: make compatible with kmemleak · 95511580
      Authored by Marco Elver
      Because memblock allocations are registered with kmemleak, the KFENCE
      pool was seen by kmemleak as one large object.  Later allocations
      through kfence_alloc() that were registered with kmemleak via
      slab_post_alloc_hook() would then overlap and trigger a warning.
      Therefore, once the pool is initialized, we can remove (free) it from
      kmemleak again, since it should be treated as allocator-internal and be
      seen as "free memory".
      
      The second problem is that kmemleak is passed the rounded size, and not
      the originally requested size, which is also the size of KFENCE objects.
      To avoid kmemleak scanning past the end of an object and triggering a
      KFENCE out-of-bounds error, fix the size if it is a KFENCE object.
      
      For simplicity, to avoid a call to kfence_ksize() in
      slab_post_alloc_hook() (and a new IS_ENABLED(CONFIG_DEBUG_KMEMLEAK)
      guard), just call kfence_ksize() in mm/kmemleak.c:create_object().
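      
      A sketch of the two changes, assuming the mainline kmemleak/KFENCE
      APIs of the time (kmemleak_free() and kfence_ksize()):
      
        /* mm/kfence/core.c -- once the pool is live, it is allocator-
         * internal, so remove it from the kmemleak object tree. */
        kmemleak_free(__kfence_pool);
      
        /* mm/kmemleak.c:create_object() -- for a KFENCE object, record the
         * originally requested size instead of the rounded slab size. */
        object->pointer = ptr;
        object->size = kfence_ksize((void *)ptr) ?: size;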
      
      Link: https://lkml.kernel.org/r/20210317084740.3099921-1-elver@google.com
      Signed-off-by: Marco Elver <elver@google.com>
      Reported-by: Luis Henriques <lhenriques@suse.de>
      Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
      Tested-by: Luis Henriques <lhenriques@suse.de>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Andrey Konovalov <andreyknvl@google.com>
      Cc: Jann Horn <jannh@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • z3fold: prevent reclaim/free race for headless pages · 6d679578
      Authored by Thomas Hebb
      Commit ca0246bb ("z3fold: fix possible reclaim races") introduced
      the PAGE_CLAIMED flag "to avoid racing on a z3fold 'headless' page
      release." By atomically testing and setting the bit in each of
      z3fold_free() and z3fold_reclaim_page(), a double-free was avoided.
      
      However, commit dcf5aedb ("z3fold: stricter locking and more careful
      reclaim") appears to have unintentionally broken this behavior by moving
      the PAGE_CLAIMED check in z3fold_reclaim_page() to after the page lock
      gets taken, which only happens for non-headless pages.  For headless
      pages, the check is now skipped entirely and races can occur again.
      
      I have observed such a race on my system:
      
          page:00000000ffbd76b7 refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x165316
          flags: 0x2ffff0000000000()
          raw: 02ffff0000000000 ffffea0004535f48 ffff8881d553a170 0000000000000000
          raw: 0000000000000000 0000000000000011 00000000ffffffff 0000000000000000
          page dumped because: VM_BUG_ON_PAGE(page_ref_count(page) == 0)
          ------------[ cut here ]------------
          kernel BUG at include/linux/mm.h:707!
          invalid opcode: 0000 [#1] PREEMPT SMP KASAN PTI
          CPU: 2 PID: 291928 Comm: kworker/2:0 Tainted: G    B             5.10.7-arch1-1-kasan #1
          Hardware name: Gigabyte Technology Co., Ltd. H97N-WIFI/H97N-WIFI, BIOS F9b 03/03/2016
          Workqueue: zswap-shrink shrink_worker
          RIP: 0010:__free_pages+0x10a/0x130
          Code: c1 e7 06 48 01 ef 45 85 e4 74 d1 44 89 e6 31 d2 41 83 ec 01 e8 e7 b0 ff ff eb da 48 c7 c6 e0 32 91 88 48 89 ef e8 a6 89 f8 ff <0f> 0b 4c 89 e7 e8 fc 79 07 00 e9 33 ff ff ff 48 89 ef e8 ff 79 07
          RSP: 0000:ffff88819a2ffb98 EFLAGS: 00010296
          RAX: 0000000000000000 RBX: ffffea000594c5a8 RCX: 0000000000000000
          RDX: 1ffffd4000b298b7 RSI: 0000000000000000 RDI: ffffea000594c5b8
          RBP: ffffea000594c580 R08: 000000000000003e R09: ffff8881d5520bbb
          R10: ffffed103aaa4177 R11: 0000000000000001 R12: ffffea000594c5b4
          R13: 0000000000000000 R14: ffff888165316000 R15: ffffea000594c588
          FS:  0000000000000000(0000) GS:ffff8881d5500000(0000) knlGS:0000000000000000
          CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
          CR2: 00007f7c8c3654d8 CR3: 0000000103f42004 CR4: 00000000001706e0
          Call Trace:
           z3fold_zpool_shrink+0x9b6/0x1240
           shrink_worker+0x35/0x90
           process_one_work+0x70c/0x1210
           worker_thread+0x539/0x1200
           kthread+0x330/0x400
           ret_from_fork+0x22/0x30
          Modules linked in: rfcomm ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter ccm algif_aead des_generic libdes ecb algif_skcipher cmac bnep md4 algif_hash af_alg vfat fat intel_rapl_msr intel_rapl_common x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel iwlmvm hid_logitech_hidpp kvm at24 mac80211 snd_hda_codec_realtek iTCO_wdt snd_hda_codec_generic intel_pmc_bxt snd_hda_codec_hdmi ledtrig_audio iTCO_vendor_support mei_wdt mei_hdcp snd_hda_intel snd_intel_dspcfg libarc4 soundwire_intel irqbypass iwlwifi soundwire_generic_allocation rapl soundwire_cadence intel_cstate snd_hda_codec intel_uncore btusb joydev mousedev snd_usb_audio pcspkr btrtl uvcvideo nouveau btbcm i2c_i801 btintel snd_hda_core videobuf2_vmalloc i2c_smbus snd_usbmidi_lib videobuf2_memops bluetooth snd_hwdep soundwire_bus snd_soc_rt5640 videobuf2_v4l2 cfg80211 snd_soc_rl6231 videobuf2_common snd_rawmidi lpc_ich alx videodev mdio snd_seq_device snd_soc_core mc ecdh_generic mxm_wmi mei_me
           hid_logitech_dj wmi snd_compress e1000e ac97_bus mei ttm rfkill snd_pcm_dmaengine ecc snd_pcm snd_timer snd soundcore mac_hid acpi_pad pkcs8_key_parser it87 hwmon_vid crypto_user fuse ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 dm_crypt cbc encrypted_keys trusted tpm rng_core usbhid dm_mod crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel crypto_simd cryptd glue_helper xhci_pci xhci_pci_renesas i915 video intel_gtt i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops cec drm agpgart
          ---[ end trace 126d646fc3dc0ad8 ]---
      
      To fix the issue, re-add the earlier test-and-set in the case where we
      have a headless page.
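      
      A sketch of the restored check, assuming the candidate-scanning loop
      in z3fold_reclaim_page() as in mainline:
      
        /* z3fold_reclaim_page(), scanning candidates under pool->lock */
        if (test_bit(PAGE_HEADLESS, &page->private)) {
                /*
                 * Headless pages have no page lock, so atomically claim
                 * the page here; otherwise a concurrent z3fold_free()
                 * could free it underneath us.
                 */
                if (test_and_set_bit(PAGE_CLAIMED, &page->private))
                        continue;       /* lost the race; skip this page */
        } else if (!z3fold_page_trylock(zhdr)) {
                continue;               /* non-headless path unchanged */
        }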
      
      Link: https://lkml.kernel.org/r/c8106dbe6d8390b290cd1d7f873a2942e805349e.1615452048.git.tommyhebb@gmail.com
      Fixes: dcf5aedb ("z3fold: stricter locking and more careful reclaim")
      Signed-off-by: Thomas Hebb <tommyhebb@gmail.com>
      Reviewed-by: Vitaly Wool <vitaly.wool@konsulko.com>
      Cc: Jongseok Kim <ks77sj@gmail.com>
      Cc: Snild Dolkow <snild@sony.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/mmu_notifiers: ensure range_end() is paired with range_start() · c2655835
      Authored by Sean Christopherson
      If one or more notifiers fails .invalidate_range_start(), invoke
      .invalidate_range_end() for "all" notifiers.  If there are multiple
      notifiers, those that did not fail are expecting _start() and _end() to
      be paired, e.g.  KVM's mmu_notifier_count would become imbalanced.
      Disallow notifiers that can fail _start() from implementing _end() so
      that it's unnecessary to either track which notifiers rejected _start(),
      or had already succeeded prior to a failed _start().
      
      Note, the existing behavior of calling _start() on all notifiers even
      after a previous notifier failed _start() was an unintended "feature".
      Make it canon now that the behavior is depended on for correctness.
      
      As of today, the bug is likely benign:
      
        1. The only caller of the non-blocking notifier is OOM kill.
        2. The only notifiers that can fail _start() are the i915 and Nouveau
           drivers.
        3. The only notifiers that utilize _end() are the SGI UV GRU driver
           and KVM.
        4. The GRU driver will never coincide with the i915/Nouveau drivers.
        5. An imbalanced kvm->mmu_notifier_count only causes soft lockup in the
           _guest_, and the guest is already doomed due to being an OOM victim.
      
      Fix the bug now to play nice with future usage, e.g.  KVM has a
      potential use case for blocking memslot updates in KVM while an
      invalidation is in-progress, and failure to unblock would result in said
      updates being blocked indefinitely and hanging.
      
      Found by inspection.  Verified by adding a second notifier in KVM that
      periodically returns -EAGAIN on non-blockable ranges, triggering OOM,
      and observing that KVM exits with an elevated notifier count.
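      
      A sketch of the failure path in mm/mmu_notifier.c, assuming the
      mainline subscription-list layout (mn_hlist_invalidate_range_start()
      walking the interval-notifier subscriptions):
      
        if (ret) {
                /*
                 * Only non-blockable ranges may fail _start().  Any
                 * notifier that succeeded _start() expects a paired
                 * _end(), so invoke _end() for all of them now.
                 */
                WARN_ON(mmu_notifier_range_blockable(range));
                hlist_for_each_entry_rcu(subscription, &subscriptions->list,
                                         hlist, srcu_read_lock_held(&srcu)) {
                        if (!subscription->ops->invalidate_range_end)
                                continue;
                        subscription->ops->invalidate_range_end(subscription,
                                                                range);
                }
        }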
      
      Link: https://lkml.kernel.org/r/20210311180057.1582638-1-seanjc@google.com
      Fixes: 93065ac7 ("mm, oom: distinguish blockable mode for mmu notifiers")
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Suggested-by: Jason Gunthorpe <jgg@ziepe.ca>
      Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ben Gardon <bgardon@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: "Jérôme Glisse" <jglisse@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Dimitri Sivanich <dimitri.sivanich@hpe.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • hugetlb_cgroup: fix imbalanced css_get and css_put pair for shared mappings · d85aecf2
      Authored by Miaohe Lin
      The current implementation of hugetlb_cgroup for shared mappings
      behaves inconsistently depending on how a reservation is built up.
      Consider the following two scenarios:
      
       1. Assume initial css reference count of hugetlb_cgroup is 1:
        1.1 Call hugetlb_reserve_pages with from = 1, to = 2. So css reference
            count is 2 associated with 1 file_region.
        1.2 Call hugetlb_reserve_pages with from = 2, to = 3. So css reference
            count is 3 associated with 2 file_region.
        1.3 coalesce_file_region will coalesce these two file_regions into
            one. So css reference count is 3 associated with 1 file_region
            now.
      
       2. Assume initial css reference count of hugetlb_cgroup is 1 again:
        2.1 Call hugetlb_reserve_pages with from = 1, to = 3. So css reference
            count is 2 associated with 1 file_region.
      
      Therefore, we might end up with one file_region while holding one or
      more css references.  This inconsistency leads to an imbalanced
      css_get() and css_put() pair.  If we do css_put() one by one (e.g. the
      hole-punch case), scenario 2 would put one reference too many.  If we
      do css_put() all together (e.g. the truncate case), scenario 1 will
      leak one css reference.
      
      The imbalanced css_get() and css_put() pair would result in a non-zero
      reference when we try to destroy the hugetlb cgroup: the hugetlb cgroup
      directory is removed, but the associated resource is not freed.  In a
      busy workload this might ultimately result in OOM, or in being unable
      to create a new hugetlb cgroup.
      
      In order to fix this, we have to make sure that one file_region must
      hold exactly one css reference. So in coalesce_file_region case, we
      should release one css reference before coalescence. Also only put css
      reference when the entire file_region is removed.
      
      The last thing to note is that the caller of region_add() only holds
      one reference to h_cg->css for the whole contiguous reservation region.
      But this area might be scattered when there are already some
      file_regions residing in it.  As a result, many file_regions may share
      only one h_cg->css reference.  In order to ensure that one file_region
      holds exactly one css reference, we should do css_get() for each
      file_region and release the reference held by the caller when they are
      done, as sketched below.
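      
      A sketch of the coalesce-side change; has_same_uncharge_info() and
      put_uncharge_info() are assumed helper names for comparing uncharge
      info and dropping a region's css pin:
      
        /* coalesce_file_region(), sketch: each file_region pins the css
         * exactly once, so the region being merged away must drop its pin. */
        static void coalesce_file_region(struct resv_map *resv,
                                         struct file_region *rg)
        {
                struct file_region *nrg, *prg;
      
                prg = list_prev_entry(rg, link);
                if (&prg->link != &resv->regions && prg->to == rg->from &&
                    has_same_uncharge_info(prg, rg)) {
                        prg->to = rg->to;
                        list_del(&rg->link);
                        put_uncharge_info(rg);  /* drop the merged region's pin */
                        kfree(rg);
                        rg = prg;
                }
      
                nrg = list_next_entry(rg, link);
                if (&nrg->link != &resv->regions && nrg->from == rg->to &&
                    has_same_uncharge_info(nrg, rg)) {
                        nrg->from = rg->from;
                        list_del(&rg->link);
                        put_uncharge_info(rg);
                        kfree(rg);
                }
        }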
      
      [linmiaohe@huawei.com: fix imbalanced css_get and css_put pair for shared mappings]
        Link: https://lkml.kernel.org/r/20210316023002.53921-1-linmiaohe@huawei.com
      
      Link: https://lkml.kernel.org/r/20210301120540.37076-1-linmiaohe@huawei.com
      Fixes: 075a61d0 ("hugetlb_cgroup: add accounting for shared mappings")
      Reported-by: kernel test robot <lkp@intel.com> (auto build test ERROR)
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Wanpeng Li <liwp.linux@gmail.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  6. 24 March 2021 — 1 commit
  7. 14 March 2021 — 15 commits
  8. 11 March 2021 — 1 commit
    • Revert "mm, slub: consider rest of partial list if acquire_slab() fails" · 9b1ea29b
      Authored by Linus Torvalds
      This reverts commit 8ff60eb0.
      
      The kernel test robot reports a huge performance regression due to the
      commit, and the reason seems fairly straightforward: when there is
      contention on the page list (which is what causes acquire_slab() to
      fail), we do _not_ want to just loop and try again, because that will
      transfer the contention to the 'n->list_lock' spinlock we hold, and
      just make things even worse.
      
      This is admittedly likely a problem only on big machines - the kernel
      test robot report comes from a 96-thread dual socket Intel Xeon Gold
      6252 setup, but the regression there really is quite noticeable:
      
         -47.9% regression of stress-ng.rawpkt.ops_per_sec
      
      and the commit that was marked as being fixed (7ced3719: "slub:
      Acquire_slab() avoid loop") exited the loop early very intentionally
      (the hint being the "avoid loop" part of that commit message), exactly
      to avoid this issue.
      
      The correct thing to do may be to pick some kind of reasonable middle
      ground: instead of breaking out of the loop on the very first sign of
      contention, or trying over and over and over again, the right thing may
      be to re-try _once_, and then give up on the second failure (or pick
      your favorite value for "once"..).
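      
      A hypothetical sketch of such a retry-once policy in get_partial_node()
      (illustrative only; the revert itself simply restores the original
      early exit):
      
        /* Tolerate one contended acquire_slab() failure, then give up,
         * instead of never retrying or retrying forever. */
        int contended = 0;
      
        list_for_each_entry_safe(page, t, &n->partial, slab_list) {
                void *t_obj;
      
                if (!pfmemalloc_match(page, flags))
                        continue;
      
                t_obj = acquire_slab(s, n, page, object == NULL, &objects);
                if (!t_obj) {
                        if (contended++)
                                break;          /* second failure: back off */
                        continue;               /* first failure: retry once */
                }
                /* ... proceed as before with the acquired slab ... */
        }
      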
      Reported-by: kernel test robot <oliver.sang@intel.com>
      Link: https://lore.kernel.org/lkml/20210301080404.GF12822@xsang-OptiPlex-9020/
      Cc: Jann Horn <jannh@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  9. 10 March 2021 — 1 commit
  10. 03 March 2021 — 1 commit
  11. 27 February 2021 — 11 commits