1. 04 Nov, 2020 40 commits
    • mm/khugepaged: fix filemap page_to_pgoff(page) != offset · f1e893f1
      Hugh Dickins authored
      stable inclusion
      from linux-4.19.151
      commit fbe96d5aab1ef3c992b1dd7a0a4a5aeb21093571
      
      --------------------------------
      
      commit 033b5d77 upstream.
      
      There have been elusive reports of filemap_fault() hitting its
      VM_BUG_ON_PAGE(page_to_pgoff(page) != offset, page) on kernels built
      with CONFIG_READ_ONLY_THP_FOR_FS=y.
      
      Suren has hit it on a kernel with CONFIG_READ_ONLY_THP_FOR_FS=y and
      CONFIG_NUMA not set, and he has traced it to how khugepaged without
      NUMA reuses the same huge page after collapse_file() failed (whereas
      NUMA targets its allocation to the respective node each time).
      Most of us were usually testing with CONFIG_NUMA=y kernels.
      
      collapse_file(old start)
        new_page = khugepaged_alloc_page(hpage)
        __SetPageLocked(new_page)
        new_page->index = start // hpage->index=old offset
        new_page->mapping = mapping
        xas_store(&xas, new_page)
      
                                filemap_fault
                                  page = find_get_page(mapping, offset)
                                  // if offset falls inside hpage then
                                  // compound_head(page) == hpage
                                  lock_page_maybe_drop_mmap()
                                    __lock_page(page)
      
        // collapse fails
        xas_store(&xas, old page)
        new_page->mapping = NULL
        unlock_page(new_page)
      
      collapse_file(new start)
        new_page = khugepaged_alloc_page(hpage)
        __SetPageLocked(new_page)
        new_page->index = start // hpage->index=new offset
        new_page->mapping = mapping // mapping becomes valid again
      
                                  // since compound_head(page) == hpage
                                  // page_to_pgoff(page) got changed
                                  VM_BUG_ON_PAGE(page_to_pgoff(page) != offset)
      
      An initial patch replaced __SetPageLocked() by lock_page(), which did
      fix the race which Suren illustrates above.  But testing showed that it's
      not good enough: if the racing task's __lock_page() gets delayed long
      after its find_get_page(), then it may follow collapse_file(new start)'s
      successful final unlock_page(), and crash on the same VM_BUG_ON_PAGE.
      
      It could be fixed by relaxing filemap_fault()'s VM_BUG_ON_PAGE to a
      check and retry (as is done for mapping), with similar relaxations in
      find_lock_entry() and pagecache_get_page(): but it's not obvious what
      else might get caught out; and khugepaged non-NUMA appears to be unique
      in exposing a page to page cache, then revoking, without going through
      a full cycle of freeing before reuse.
      
      Instead, make non-NUMA khugepaged_prealloc_page() release the old page
      if anyone else has a reference to it (1% of cases when I tested).
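The idea of releasing a cached page that someone else still references can be sketched as a small userspace model. This is only an illustration in Python, not kernel code: the `HugePage` class, its `refcount` field, and `prealloc_page()` are hypothetical names invented for this sketch.

```python
class HugePage:
    """Toy stand-in for a preallocated huge page (not the kernel's struct page)."""

    def __init__(self):
        self.refcount = 1  # the single reference held by the khugepaged model


def prealloc_page(cached):
    """Return a page that is safe to use for the next collapse attempt.

    If a racing lookup still holds a reference to the cached page
    (refcount > 1), give that page up and allocate a fresh one rather
    than reusing it at a different offset.
    """
    if cached is not None and cached.refcount > 1:
        cached.refcount -= 1  # drop our reference; the other holder frees it later
        cached = None
    if cached is None:
        cached = HugePage()
    return cached
```

In this model a page with no outside references is reused as before, so the extra allocation is only paid in the rare racing case.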
      
      Although never reported on huge tmpfs, I believe its find_lock_entry()
      has been at similar risk; but huge tmpfs does not rely on khugepaged
      for its normal working nearly so much as READ_ONLY_THP_FOR_FS does.
      
      Reported-by: Denis Lisov <dennis.lissov@gmail.com>
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=206569
      Link: https://lore.kernel.org/linux-mm/?q=20200219144635.3b7417145de19b65f258c943%40linux-foundation.org
      Reported-by: Qian Cai <cai@lca.pw>
      Link: https://lore.kernel.org/linux-xfs/?q=20200616013309.GB815%40lca.pw
      Reported-and-analyzed-by: Suren Baghdasaryan <surenb@google.com>
      Fixes: 87c460a0 ("mm/khugepaged: collapse_shmem() without freezing new_page")
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: stable@vger.kernel.org # v4.9+
      Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • nvme-core: put ctrl ref when module ref get fail · d397dacb
      Chaitanya Kulkarni authored
      stable inclusion
      from linux-4.19.151
      commit b6df5afc3d81e34d32f0b092d59b7fe8915d824b
      
      --------------------------------
      
      commit 4bab6909 upstream.
      
      When try_module_get() fails in nvme_dev_open(), the function returns
      without releasing the ctrl reference that was taken earlier.
      
      Put the ctrl reference, which is taken before calling
      try_module_get(), in the error return code path.
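The error-path pattern described above can be modeled in a few lines. This is a hedged Python sketch, not the driver code: the `Ctrl` class, its counters, and `dev_open()` are hypothetical stand-ins, and only the ordering (take the controller reference, try the module reference, undo on failure) mirrors the commit text.

```python
import errno


class Ctrl:
    """Toy controller with two independent reference counts."""

    def __init__(self, module_alive):
        self.refs = 1                    # reference held by the owner
        self.module_alive = module_alive
        self.module_refs = 0


def try_module_get(ctrl):
    """Model of a module reference that can fail while the module unloads."""
    if not ctrl.module_alive:
        return False
    ctrl.module_refs += 1
    return True


def dev_open(ctrl):
    ctrl.refs += 1                       # take the controller reference first
    if not try_module_get(ctrl):
        ctrl.refs -= 1                   # the fix: put it back on the error path
        return -errno.EINVAL
    return 0
```

Without the decrement on the failure branch, every failed open would leave `refs` one higher than it should be, which is the leak the commit describes.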
      
      Fixes: 52a3974f "nvme-core: get/put ctrl and transport module in nvme_dev_open/release()"
      Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
      Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • usermodehelper: reset umask to default before executing user process · 2dc6afbe
      Linus Torvalds authored
      stable inclusion
      from linux-4.19.151
      commit 33acb78c859f1a0bd3c6b67801fada16f99614f6
      
      --------------------------------
      
      commit 4013c149 upstream.
      
      Kernel threads intentionally do CLONE_FS in order to follow any changes
      that 'init' does to set up the root directory (or cwd).
      
      It is admittedly a bit odd, but it avoids the situation where 'init'
      does some extensive setup to initialize the system environment, and then
      we execute a usermode helper program, and it uses the original FS setup
      from boot time that may be very limited and incomplete.
      
      [ Both Al Viro and Eric Biederman point out that 'pivot_root()' will
        follow the root regardless, since it fixes up other users of root (see
        chroot_fs_refs() for details), but overmounting root and doing a
        chroot() would not. ]
      
      However, Vegard Nossum noticed that the CLONE_FS not only means that we
      follow the root and current working directories, it also means we share
      umask with whatever init changed it to. That wasn't intentional.
      
      Just reset umask to the original default (0022) before actually starting
      the usermode helper program.
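A userspace analogy may help: a child process inherits its parent's umask unless it is reset before exec. The sketch below is Python, not the kernel change itself; it assumes a POSIX system and Python 3.9 or later (for the `umask` argument of `subprocess.run`), and the helper names `child_umask()` and `demo()` are invented for illustration.

```python
import os
import subprocess
import sys

DEFAULT_UMASK = 0o022  # the default value named in the commit message


def child_umask(reset):
    """Spawn a child process and return the umask it observes."""
    probe = "import os; m = os.umask(0); print(m)"
    kwargs = {"umask": DEFAULT_UMASK} if reset else {}
    out = subprocess.run(
        [sys.executable, "-c", probe],
        capture_output=True, text=True, check=True, **kwargs,
    )
    return int(out.stdout)


def demo():
    """Return (inherited umask, reset umask) as seen by two children."""
    old = os.umask(0o077)  # stand-in for 'init' changing its umask
    try:
        return child_umask(reset=False), child_umask(reset=True)
    finally:
        os.umask(old)      # restore the caller's umask
```

The first child sees the parent's restrictive mask, the second sees the default, which is the behavior the commit restores for usermode helpers.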
      
      Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Acked-by: Eric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Conflicts:
        kernel/umh.c
      [yyl: adjust context]
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • netfilter: ctnetlink: add a range check for l3/l4 protonum · 4ffa797e
      Will McVicker authored
      stable inclusion
      from linux-4.19.150
      commit 289fe546ea16c2dcb57c5198c5a7b7387604530e
      
      --------------------------------
      
      commit 1cc5ef91 upstream.
      
      The indexes to the nf_nat_l[34]protos arrays come from userspace. So
      check the tuple's family, e.g. l3num, when creating the conntrack in
      order to prevent an OOB memory access during setup.  Here is an example
      kernel panic on 4.14.180 when userspace passes in an index greater than
      NFPROTO_NUMPROTO.
      
      Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
      Modules linked in:...
      Process poc (pid: 5614, stack limit = 0x00000000a3933121)
      CPU: 4 PID: 5614 Comm: poc Tainted: G S      W  O    4.14.180-g051355490483
      Hardware name: Qualcomm Technologies, Inc. SM8150 V2 PM8150 Google Inc. MSM
      task: 000000002a3dfffe task.stack: 00000000a3933121
      pc : __cfi_check_fail+0x1c/0x24
      lr : __cfi_check_fail+0x1c/0x24
      ...
      Call trace:
      __cfi_check_fail+0x1c/0x24
      name_to_dev_t+0x0/0x468
      nfnetlink_parse_nat_setup+0x234/0x258
      ctnetlink_parse_nat_setup+0x4c/0x228
      ctnetlink_new_conntrack+0x590/0xc40
      nfnetlink_rcv_msg+0x31c/0x4d4
      netlink_rcv_skb+0x100/0x184
      nfnetlink_rcv+0xf4/0x180
      netlink_unicast+0x360/0x770
      netlink_sendmsg+0x5a0/0x6a4
      ___sys_sendmsg+0x314/0x46c
      SyS_sendmsg+0xb4/0x108
      el0_svc_naked+0x34/0x38
      
      This crash no longer occurs on 5.4 and later; however, ctnetlink still
      allows creating entries with an unsupported layer 3 protocol number.
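The general shape of such a check, validating a userspace-supplied index before it is used to look up a per-protocol handler, can be sketched as follows. This is a simplified Python model under stated assumptions: the `handlers` table, the `lookup()` helper, and the numeric constants are illustrative and are not taken from the patch.

```python
import errno

# Assumed values for this sketch only.
NFPROTO_IPV4 = 2
NFPROTO_IPV6 = 10
NFPROTO_NUMPROTO = 13

# Hypothetical per-family handler table, indexed by l3num.
handlers = [None] * NFPROTO_NUMPROTO
handlers[NFPROTO_IPV4] = "ipv4"
handlers[NFPROTO_IPV6] = "ipv6"


def lookup(l3num):
    """Return the handler for l3num, or a negative errno if unsupported."""
    if not 0 <= l3num < NFPROTO_NUMPROTO:
        return -errno.EOPNOTSUPP   # out of range: never index the table
    handler = handlers[l3num]
    if handler is None:
        return -errno.EOPNOTSUPP   # in range but no handler registered
    return handler
```

Rejecting the value up front means the table is never indexed with a number the caller controls.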
      
      Fixes: c1d10adb ("[NETFILTER]: Add ctnetlink port for nf_conntrack")
      Signed-off-by: Will McVicker <willmcvicker@google.com>
      [pablo@netfilter.org: rebased original patch on top of nf.git]
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • ep_create_wakeup_source(): dentry name can change under you... · b3e6c207
      Al Viro authored
      stable inclusion
      from linux-4.19.150
      commit ced8ce5d2157142c469eccc5eef5ea8ad579fa5e
      
      --------------------------------
      
      commit 3701cb59 upstream.
      
      or get freed, for that matter, if it's a long (separately stored)
      name.
      
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • epoll: EPOLL_CTL_ADD: close the race in decision to take fast path · da5e9534
      Al Viro authored
      stable inclusion
      from linux-4.19.150
      commit 90ef231ba534d43033884b8560df26e608ca0a21
      
      --------------------------------
      
      commit fe0a916c upstream.
      
      Checking for the lack of epitems referring to the epoll we want to insert into
      is not enough; we might have an insertion of that epoll into another one that
      has already collected the set of files to recheck for excessive reverse paths,
      but hasn't gotten to creating/inserting the epitem for it.
      
      However, any such insertion in progress can be detected - it will update the
      generation count in our epoll when it's done looking through it for files
      to check.  That gets done under ->mtx of our epoll and that allows us to
      detect that safely.
      
      We are *not* holding epmutex here, so the generation count is not stable.
      However, since both the update of ep->gen by loop check and (later)
      insertion into ->f_ep_link are done with ep->mtx held, we are fine -
      the sequence is
      	grab epmutex
      	bump loop_check_gen
      	...
      	grab tep->mtx		// 1
      	tep->gen = loop_check_gen
      	...
      	drop tep->mtx		// 2
      	...
      	grab tep->mtx		// 3
      	...
      	insert into ->f_ep_link
      	...
      	drop tep->mtx		// 4
      	bump loop_check_gen
      	drop epmutex
      and if the fastpath check in another thread happens for that
      eventpoll, it can come
      	* before (1) - in that case fastpath is just fine
      	* after (4) - we'll see non-empty ->f_ep_link, slow path taken
      	* between (2) and (3) - loop_check_gen is stable, with ->mtx
      	  providing barriers, and we end up taking slow path.
      
      Note that ->f_ep_link emptiness check is slightly racy - we are protected
      against insertions into that list, but removals can happen right under us.
      Not a problem - in the worst case we'll end up taking a slow path for
      no good reason.
      
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • epoll: replace ->visited/visited_list with generation count · d6514225
      Al Viro authored
      stable inclusion
      from linux-4.19.150
      commit ff329915a5b1f6778344a6fc7b060c991376b095
      
      --------------------------------
      
      commit 18306c40 upstream.
      
      removes the need to clear it, along with the races.
      
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • epoll: do not insert into poll queues until all sanity checks are done · 36b8f7e2
      Al Viro authored
      stable inclusion
      from linux-4.19.150
      commit 3e3bbc4d23eeb90bf282e98c7dfeca7702df3169
      
      --------------------------------
      
      commit f8d4f44d upstream.
      
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • mm: don't rely on system state to detect hot-plug operations · f38a43c2
      Laurent Dufour authored
      stable inclusion
      from linux-4.19.150
      commit b6f69f72c15d7f973f5709c5351f378f235b3654
      
      --------------------------------
      
      commit f85086f9 upstream.
      
      In register_mem_sect_under_node() the system_state's value is checked to
      detect whether the call is made during boot time or during a hot-plug
      operation.  Unfortunately, that check against SYSTEM_BOOTING is wrong
      because regular memory is registered at SYSTEM_SCHEDULING state.  In
      addition, a memory hot-plug operation can be triggered at this system
      state by the ACPI [1].  So checking against the system state is not
      enough.
      
      The consequence shows up on systems with interleaved node ranges like this:
      
       Early memory node ranges
         node   1: [mem 0x0000000000000000-0x000000011fffffff]
         node   2: [mem 0x0000000120000000-0x000000014fffffff]
         node   1: [mem 0x0000000150000000-0x00000001ffffffff]
         node   0: [mem 0x0000000200000000-0x000000048fffffff]
         node   2: [mem 0x0000000490000000-0x00000007ffffffff]
      
      This can be seen on PowerPC LPAR after multiple memory hot-plug and
      hot-unplug operations are done.  At the next reboot the node's memory
      ranges can be interleaved and since the call to link_mem_sections() is
      made in topology_init() while the system is in the SYSTEM_SCHEDULING
      state, the node's id is not checked, and the sections are registered to
      multiple nodes:
      
        $ ls -l /sys/devices/system/memory/memory21/node*
        total 0
        lrwxrwxrwx 1 root root     0 Aug 24 05:27 node1 -> ../../node/node1
        lrwxrwxrwx 1 root root     0 Aug 24 05:27 node2 -> ../../node/node2
      
      In that case, the system is able to boot but if later one of these
      memory blocks is hot-unplugged and then hot-plugged, the sysfs
      inconsistency is detected, which triggers a BUG_ON():
      
        kernel BUG at /Users/laurent/src/linux-ppc/mm/memory_hotplug.c:1084!
        Oops: Exception in kernel mode, sig: 5 [#1]
        LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
        Modules linked in: rpadlpar_io rpaphp pseries_rng rng_core vmx_crypto gf128mul binfmt_misc ip_tables x_tables xfs libcrc32c crc32c_vpmsum autofs4
        CPU: 8 PID: 10256 Comm: drmgr Not tainted 5.9.0-rc1+ #25
        Call Trace:
          add_memory_resource+0x23c/0x340 (unreliable)
          __add_memory+0x5c/0xf0
          dlpar_add_lmb+0x1b4/0x500
          dlpar_memory+0x1f8/0xb80
          handle_dlpar_errorlog+0xc0/0x190
          dlpar_store+0x198/0x4a0
          kobj_attr_store+0x30/0x50
          sysfs_kf_write+0x64/0x90
          kernfs_fop_write+0x1b0/0x290
          vfs_write+0xe8/0x290
          ksys_write+0xdc/0x130
          system_call_exception+0x160/0x270
          system_call_common+0xf0/0x27c
      
      This patch addresses the root cause by not relying on the system_state
      value to detect whether the call is due to a hot-plug operation.  An
      extra parameter is added to link_mem_sections() detailing whether the
      operation is due to a hot-plug operation.
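The design choice of passing the calling context explicitly, rather than inferring it from global state, can be shown with a minimal Python sketch. The enum members `EARLY` and `HOTPLUG` and the `should_link()` helper are assumptions made for illustration and are not quoted from the patch.

```python
from enum import Enum


class MeminitContext(Enum):
    EARLY = 0     # memory registered while booting
    HOTPLUG = 1   # memory added by a hot-plug operation


def should_link(page_nid, nid, context):
    """Decide whether a memory section should be linked under node nid.

    At boot, node ranges may interleave, so the section's own node id
    must match. For hot-plug, the caller already knows the target node.
    """
    if context is MeminitContext.EARLY:
        return page_nid == nid
    return True
```

With the context supplied by the caller, the decision no longer depends on which system state happens to be current when the function runs.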
      
      [1] According to Oscar Salvador, using this qemu command line, ACPI
      memory hotplug operations are raised at SYSTEM_SCHEDULING state:
      
        $QEMU -enable-kvm -machine pc -smp 4,sockets=4,cores=1,threads=1 -cpu host -monitor pty \
              -m size=$MEM,slots=255,maxmem=4294967296k  \
              -numa node,nodeid=0,cpus=0-3,mem=512 -numa node,nodeid=1,mem=512 \
              -object memory-backend-ram,id=memdimm0,size=134217728 -device pc-dimm,node=0,memdev=memdimm0,id=dimm0,slot=0 \
              -object memory-backend-ram,id=memdimm1,size=134217728 -device pc-dimm,node=0,memdev=memdimm1,id=dimm1,slot=1 \
              -object memory-backend-ram,id=memdimm2,size=134217728 -device pc-dimm,node=0,memdev=memdimm2,id=dimm2,slot=2 \
              -object memory-backend-ram,id=memdimm3,size=134217728 -device pc-dimm,node=0,memdev=memdimm3,id=dimm3,slot=3 \
              -object memory-backend-ram,id=memdimm4,size=134217728 -device pc-dimm,node=1,memdev=memdimm4,id=dimm4,slot=4 \
              -object memory-backend-ram,id=memdimm5,size=134217728 -device pc-dimm,node=1,memdev=memdimm5,id=dimm5,slot=5 \
              -object memory-backend-ram,id=memdimm6,size=134217728 -device pc-dimm,node=1,memdev=memdimm6,id=dimm6,slot=6 \
      
      Fixes: 4fbce633 ("mm/memory_hotplug.c: make register_mem_sect_under_node() a callback of walk_memory_range()")
      Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Nathan Lynch <nathanl@linux.ibm.com>
      Cc: Scott Cheloha <cheloha@linux.ibm.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: <stable@vger.kernel.org>
      Link: https://lkml.kernel.org/r/20200915094143.79181-3-ldufour@linux.ibm.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • mm: replace memmap_context by meminit_context · fb9e4c0b
      Laurent Dufour authored
      stable inclusion
      from linux-4.19.150
      commit 25eaea1b33f2569f69a82dfddb3fb05384143bd0
      
      --------------------------------
      
      commit c1d0da83 upstream.
      
      Patch series "mm: fix memory to node bad links in sysfs", v3.
      
      Sometimes, firmware may expose interleaved memory layout like this:
      
       Early memory node ranges
         node   1: [mem 0x0000000000000000-0x000000011fffffff]
         node   2: [mem 0x0000000120000000-0x000000014fffffff]
         node   1: [mem 0x0000000150000000-0x00000001ffffffff]
         node   0: [mem 0x0000000200000000-0x000000048fffffff]
         node   2: [mem 0x0000000490000000-0x00000007ffffffff]
      
      In that case, we can see memory blocks assigned to multiple nodes in
      sysfs:
      
        $ ls -l /sys/devices/system/memory/memory21
        total 0
        lrwxrwxrwx 1 root root     0 Aug 24 05:27 node1 -> ../../node/node1
        lrwxrwxrwx 1 root root     0 Aug 24 05:27 node2 -> ../../node/node2
        -rw-r--r-- 1 root root 65536 Aug 24 05:27 online
        -r--r--r-- 1 root root 65536 Aug 24 05:27 phys_device
        -r--r--r-- 1 root root 65536 Aug 24 05:27 phys_index
        drwxr-xr-x 2 root root     0 Aug 24 05:27 power
        -r--r--r-- 1 root root 65536 Aug 24 05:27 removable
        -rw-r--r-- 1 root root 65536 Aug 24 05:27 state
        lrwxrwxrwx 1 root root     0 Aug 24 05:25 subsystem -> ../../../../bus/memory
        -rw-r--r-- 1 root root 65536 Aug 24 05:25 uevent
        -r--r--r-- 1 root root 65536 Aug 24 05:27 valid_zones
      
      The same applies in the node's directory with a memory21 link in both
      the node1 and node2's directory.
      
      This is wrong but doesn't prevent the system from running.  However,
      when one of these memory blocks is later hot-unplugged and then
      hot-plugged, the system detects an inconsistency in the sysfs layout
      and a BUG_ON() is raised:
      
        kernel BUG at /Users/laurent/src/linux-ppc/mm/memory_hotplug.c:1084!
        LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
        Modules linked in: rpadlpar_io rpaphp pseries_rng rng_core vmx_crypto gf128mul binfmt_misc ip_tables x_tables xfs libcrc32c crc32c_vpmsum autofs4
        CPU: 8 PID: 10256 Comm: drmgr Not tainted 5.9.0-rc1+ #25
        Call Trace:
          add_memory_resource+0x23c/0x340 (unreliable)
          __add_memory+0x5c/0xf0
          dlpar_add_lmb+0x1b4/0x500
          dlpar_memory+0x1f8/0xb80
          handle_dlpar_errorlog+0xc0/0x190
          dlpar_store+0x198/0x4a0
          kobj_attr_store+0x30/0x50
          sysfs_kf_write+0x64/0x90
          kernfs_fop_write+0x1b0/0x290
          vfs_write+0xe8/0x290
          ksys_write+0xdc/0x130
          system_call_exception+0x160/0x270
          system_call_common+0xf0/0x27c
      
      This has been seen on PowerPC LPAR.
      
      The root cause of this issue is that when node's memory is registered,
      the range used can overlap another node's range, thus the memory block
      is registered to multiple nodes in sysfs.
      
      There are two issues here:
      
       (a) The sysfs memory and node's layouts are broken due to these
           multiple links
      
       (b) The link errors in link_mem_sections() should not lead to a system
           panic.
      
      To address (a) register_mem_sect_under_node should not rely on the
      system state to detect whether the link operation is triggered by a hot
      plug operation or not.  This is addressed by the patches 1 and 2 of this
      series.
      
      Issue (b) will be addressed separately.
      
      This patch (of 2):
      
      The memmap_context enum is used to detect whether a memory operation is
      due to a hot-add operation or happening at boot time.
      
      Make it general to the hotplug operation and rename it as
      meminit_context.
      
      There is no functional change introduced by this patch.
      
      Suggested-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "Rafael J . Wysocki" <rafael@kernel.org>
      Cc: Nathan Lynch <nathanl@linux.ibm.com>
      Cc: Scott Cheloha <cheloha@linux.ibm.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: <stable@vger.kernel.org>
      Link: https://lkml.kernel.org/r/20200915094143.79181-1-ldufour@linux.ibm.com
      Link: https://lkml.kernel.org/r/20200915132624.9723-1-ldufour@linux.ibm.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • random32: Restore __latent_entropy attribute on net_rand_state · b800aa58
      Thibaut Sautereau authored
      stable inclusion
      from linux-4.19.150
      commit a4ebc2d6aa3ac2aa92cac8f6f53662df2c4904c9
      
      --------------------------------
      
      [ Upstream commit 09a6b0bc ]
      
      Commit f227e3ec ("random32: update the net random state on interrupt
      and activity") broke compilation and was temporarily fixed by Linus in
      83bdc727 ("random32: remove net_rand_state from the latent entropy
      gcc plugin") by entirely moving net_rand_state out of the things handled
      by the latent_entropy GCC plugin.
      
      From what I understand when reading the plugin code, using the
      __latent_entropy attribute on a declaration was the wrong part and
      simply keeping the __latent_entropy attribute on the variable definition
      was the correct fix.
      
      Fixes: 83bdc727 ("random32: remove net_rand_state from the latent entropy gcc plugin")
      Acked-by: Willy Tarreau <w@1wt.eu>
      Cc: Emese Revfy <re.emese@gmail.com>
      Signed-off-by: Thibaut Sautereau <thibaut.sautereau@ssi.gouv.fr>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • nfs: Fix security label length not being reset · ca70c18c
      Jeffrey Mitchell authored
      stable inclusion
      from linux-4.19.150
      commit 345c6f260c89e417de6e7d81f3366bd5079f48a3
      
      --------------------------------
      
      [ Upstream commit d33030e2 ]
      
      nfs_readdir_page_filler() iterates over entries in a directory, reusing
      the same security label buffer, but does not reset the buffer's length.
      This causes decode_attr_security_label() to return -ERANGE if an entry's
      security label is longer than the previous one's. This error, in
      nfs4_decode_dirent(), only gets passed up as -EAGAIN, which causes another
      failed attempt to copy into the buffer. The second error is ignored and
      the remaining entries do not show up in ls, specifically the getdents64()
      syscall.
      
      Reproduce by creating multiple files in NFS and giving one of the later
      files a longer security label. ls will not see that file nor any that are
      added afterwards, though they will exist on the backend.
      
      In nfs_readdir_page_filler(), reset the security label buffer length
      before every reuse.
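The failure mode, a length field that doubles as capacity on input and as decoded length on output, can be reproduced in a toy Python model. The `Label` class, `decode_label()`, and `fill()` are hypothetical names, and the 2048-byte capacity is an assumption made for this sketch.

```python
import errno

MAX_LABEL_LEN = 2048  # assumed buffer capacity for this sketch


class Label:
    def __init__(self):
        self.data = bytearray(MAX_LABEL_LEN)
        self.len = MAX_LABEL_LEN  # in: available space, out: decoded length


def decode_label(label, raw):
    """Copy raw into the label buffer, shrinking label.len to the new size."""
    if len(raw) > label.len:
        return -errno.ERANGE
    label.data[:len(raw)] = raw
    label.len = len(raw)
    return 0


def fill(entries, reset):
    """Decode every entry into one shared buffer; return per-entry results."""
    label = Label()
    results = []
    for raw in entries:
        if reset:
            label.len = MAX_LABEL_LEN  # the fix: restore capacity before reuse
        results.append(decode_label(label, raw))
    return results
```

Calling `fill([b"short", b"a-much-longer-label"], reset=False)` fails on the second entry because the first decode shrank the length field, while `reset=True` decodes both.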
      
      Signed-off-by: Jeffrey Mitchell <jeffrey.mitchell@starlab.io>
      Fixes: b4487b93 ("nfs: Fix getxattr kernel panic and memory overflow")
      Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • nvme-core: get/put ctrl and transport module in nvme_dev_open/release() · 8a27b13f
      Chaitanya Kulkarni authored
      stable inclusion
      from linux-4.19.150
      commit c2df194a0d50bc1370c6761f5b80d3a32f42bcd4
      
      --------------------------------
      
      [ Upstream commit 52a3974f ]
      
      Get and put the reference to the ctrl in the nvme_dev_open() and
      nvme_dev_release() before and after module get/put for ctrl in char
      device file operations.
      
      Introduce a char_dev release function and get/put the controller and
      module, which allows us to fix the potential Oops that can be easily
      reproduced with a passthru ctrl (although the problem also exists with
      pure user access):
      
      Entering kdb (current=0xffff8887f8290000, pid 3128) on processor 30 Oops: (null)
      due to oops @ 0xffffffffa01019ad
      CPU: 30 PID: 3128 Comm: bash Tainted: G        W  OE     5.8.0-rc4nvme-5.9+ #35
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.4
      RIP: 0010:nvme_free_ctrl+0x234/0x285 [nvme_core]
      Code: 57 10 a0 e8 73 bf 02 e1 ba 3d 11 00 00 48 c7 c6 98 33 10 a0 48 c7 c7 1d 57 10 a0 e8 5b bf 02 e1 8
      RSP: 0018:ffffc90001d63de0 EFLAGS: 00010246
      RAX: ffffffffa05c0440 RBX: ffff8888119e45a0 RCX: 0000000000000000
      RDX: 0000000000000000 RSI: ffff8888177e9550 RDI: ffff8888119e43b0
      RBP: ffff8887d4768000 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: ffffc90001d63c90 R12: ffff8888119e43b0
      R13: ffff8888119e5108 R14: dead000000000100 R15: ffff8888119e5108
      FS:  00007f1ef27b0740(0000) GS:ffff888817600000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: ffffffffa05c0470 CR3: 00000007f6bee000 CR4: 00000000003406e0
      Call Trace:
       device_release+0x27/0x80
       kobject_put+0x98/0x170
       nvmet_passthru_ctrl_disable+0x4a/0x70 [nvmet]
       nvmet_passthru_enable_store+0x4c/0x90 [nvmet]
       configfs_write_file+0xe6/0x150
       vfs_write+0xba/0x1e0
       ksys_write+0x5f/0xe0
       do_syscall_64+0x52/0xb0
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      RIP: 0033:0x7f1ef1eb2840
      Code: Bad RIP value.
      RSP: 002b:00007fffdbff0eb8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
      RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f1ef1eb2840
      RDX: 0000000000000002 RSI: 00007f1ef27d2000 RDI: 0000000000000001
      RBP: 00007f1ef27d2000 R08: 000000000000000a R09: 00007f1ef27b0740
      R10: 0000000000000001 R11: 0000000000000246 R12: 00007f1ef2186400
      R13: 0000000000000002 R14: 0000000000000001 R15: 0000000000000000
      
      With this patch we take the module ref count in nvme_dev_open() and
      release that ref count in the newly introduced nvme_dev_release().
      
      Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • ftrace: Move RCU is watching check after recursion check · 9a469f6a
      Steven Rostedt (VMware) authored
      stable inclusion
      from linux-4.19.150
      commit 2fd5a462eb7b39694ae013450dc47d84cdf7204a
      
      --------------------------------
      
      commit b40341fa upstream.
      
      The first thing that the ftrace function callback helper functions should do
      is to check for recursion. Peter Zijlstra found that when
      "rcu_is_watching()" had its notrace removed, it caused perf function tracing
      to crash. This is because rcu_is_watching() is called before function
      recursion is checked, and if it is traced, it will cause an infinite
      recursion loop.
      
      rcu_is_watching() should still stay notrace, but this should never have
      crashed in the first place. The recursion prevention must be the
      first thing done in callback functions.
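The ordering argument can be illustrated with a toy Python model in which the helper called by the callback is itself traced. The names `trace_callback()`, `in_callback`, and `recorded` are invented for this sketch; `rcu_is_watching()` here is only a stand-in that re-enters the callback, not the kernel function.

```python
in_callback = False  # recursion guard for this toy model
recorded = []        # names of functions the callback accepted


def rcu_is_watching():
    # Pretend this helper lost its notrace annotation, so calling it
    # re-enters the tracing callback.
    trace_callback("rcu_is_watching")
    return True


def trace_callback(func_name):
    global in_callback
    if in_callback:            # the recursion check comes first
        return
    in_callback = True
    try:
        if rcu_is_watching():  # safe: the nested callback returns early
            recorded.append(func_name)
    finally:
        in_callback = False


trace_callback("vfs_read")
```

If the `rcu_is_watching()` call were moved above the guard, each nested callback would call it again before checking for recursion and the model would never terminate.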
      
      Link: https://lore.kernel.org/r/20200929112541.GM2628@hirez.programming.kicks-ass.net
      
      Cc: stable@vger.kernel.org
      Cc: Paul McKenney <paulmck@kernel.org>
      Fixes: c68c0fa2 ("ftrace: Have ftrace_ops_get_func() handle RCU and PER_CPU flags too")
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reported-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • mm, THP, swap: fix allocating cluster for swapfile by mistake · 52fdd36b
      Gao Xiang authored
      stable inclusion
      from linux-4.19.149
      commit f3e8ed3d33fa963f1b6827977696235852cdd8d9
      
      --------------------------------
      
      commit 41663430 upstream.
      
      SWP_FS is used to make swap_{read,write}page() go through the
      filesystem, and it's only used for swap files over NFS.  So, for now,
      !SWP_FS means non-NFS: the swap area could be either file backed or
      device backed.  Something similar goes with legacy SWP_FILE.
      
      So in order to achieve the goal of the original patch, SWP_BLKDEV should
      be used instead.
      
      FS corruption can be observed with SSD device + XFS + fragmented
      swapfile due to CONFIG_THP_SWAP=y.
      
      I reproduced the issue with the following details:
      
      Environment:
      
        QEMU + upstream kernel + buildroot + NVMe (2 GB)
      
      Kernel config:
      
        CONFIG_BLK_DEV_NVME=y
        CONFIG_THP_SWAP=y
      
      Steps to reproduce:
      
        mkfs.xfs -f /dev/nvme0n1
        mkdir /tmp/mnt
        mount /dev/nvme0n1 /tmp/mnt
        bs="32k"
        sz="1024m"    # doesn't matter too much, I also tried 16m
        xfs_io -f -c "pwrite -R -b $bs 0 $sz" -c "fdatasync" /tmp/mnt/sw
        xfs_io -f -c "pwrite -R -b $bs 0 $sz" -c "fdatasync" /tmp/mnt/sw
        xfs_io -f -c "pwrite -R -b $bs 0 $sz" -c "fdatasync" /tmp/mnt/sw
        xfs_io -f -c "pwrite -F -S 0 -b $bs 0 $sz" -c "fdatasync" /tmp/mnt/sw
        xfs_io -f -c "pwrite -R -b $bs 0 $sz" -c "fsync" /tmp/mnt/sw
      
        mkswap /tmp/mnt/sw
        swapon /tmp/mnt/sw
      
        stress --vm 2 --vm-bytes 600M   # doesn't matter too much as well
      
      Symptoms:
       - FS corruption (e.g. checksum failure)
       - memory corruption at: 0xd2808010
       - segfault
      
      Fixes: f0eea189 ("mm, THP, swap: Don't allocate huge cluster for file backed swap device")
      Fixes: 38d8b4e6 ("mm, THP, swap: delay splitting THP during swap out")
      Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
      Reviewed-by: Yang Shi <shy828301@gmail.com>
      Acked-by: Rafael Aquini <aquini@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Carlos Maiolino <cmaiolino@redhat.com>
      Cc: Eric Sandeen <esandeen@redhat.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: <stable@vger.kernel.org>
      Link: https://lkml.kernel.org/r/20200820045323.7809-1-hsiangkao@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Conflicts:
        mm/swapfile.c
      [yyl: keep same as mainline]
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • M
      kprobes: Fix to check probe enabled before disarm_kprobe_ftrace() · f56c6487
      Masami Hiramatsu committed
      stable inclusion
      from linux-4.19.149
      commit ce7ff920092130f249b75f9fe177edb3362fefe8
      
      --------------------------------
      
      commit 3031313e upstream.
      
      Commit 0cb2f137 ("kprobes: Fix NULL pointer dereference at
      kprobe_ftrace_handler") fixed one bug, but the fix was incomplete.
      Running kprobe_module.tc from ftracetest makes the kernel show the
      warning below.
      
      # ./ftracetest test.d/kprobe/kprobe_module.tc
      
      === Ftrace unit tests ===
      [1] Kprobe dynamic event - probing module
      ...
      [   22.400215] ------------[ cut here ]------------
      [   22.400962] Failed to disarm kprobe-ftrace at trace_printk_irq_work+0x0/0x7e [trace_printk] (-2)
      [   22.402139] WARNING: CPU: 7 PID: 200 at kernel/kprobes.c:1091 __disarm_kprobe_ftrace.isra.0+0x7e/0xa0
      [   22.403358] Modules linked in: trace_printk(-)
      [   22.404028] CPU: 7 PID: 200 Comm: rmmod Not tainted 5.9.0-rc2+ #66
      [   22.404870] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1 04/01/2014
      [   22.406139] RIP: 0010:__disarm_kprobe_ftrace.isra.0+0x7e/0xa0
      [   22.406947] Code: 30 8b 03 eb c9 80 3d e5 09 1f 01 00 75 dc 49 8b 34 24 89 c2 48 c7 c7 a0 c2 05 82 89 45 e4 c6 05 cc 09 1f 01 01 e8 a9 c7 f0 ff <0f> 0b 8b 45 e4 eb b9 89 c6 48 c7 c7 70 c2 05 82 89 45 e4 e8 91 c7
      [   22.409544] RSP: 0018:ffffc90000237df0 EFLAGS: 00010286
      [   22.410385] RAX: 0000000000000000 RBX: ffffffff83066024 RCX: 0000000000000000
      [   22.411434] RDX: 0000000000000001 RSI: ffffffff810de8d3 RDI: ffffffff810de8d3
      [   22.412687] RBP: ffffc90000237e10 R08: 0000000000000001 R09: 0000000000000001
      [   22.413762] R10: 0000000000000000 R11: 0000000000000001 R12: ffff88807c478640
      [   22.414852] R13: ffffffff8235ebc0 R14: ffffffffa00060c0 R15: 0000000000000000
      [   22.415941] FS:  00000000019d48c0(0000) GS:ffff88807d7c0000(0000) knlGS:0000000000000000
      [   22.417264] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [   22.418176] CR2: 00000000005bb7e3 CR3: 0000000078f7a000 CR4: 00000000000006a0
      [   22.419309] Call Trace:
      [   22.419990]  kill_kprobe+0x94/0x160
      [   22.420652]  kprobes_module_callback+0x64/0x230
      [   22.421470]  notifier_call_chain+0x4f/0x70
      [   22.422184]  blocking_notifier_call_chain+0x49/0x70
      [   22.422979]  __x64_sys_delete_module+0x1ac/0x240
      [   22.423733]  do_syscall_64+0x38/0x50
      [   22.424366]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      [   22.425176] RIP: 0033:0x4bb81d
      [   22.425741] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 e0 ff ff ff f7 d8 64 89 01 48
      [   22.428726] RSP: 002b:00007ffc70fef008 EFLAGS: 00000246 ORIG_RAX: 00000000000000b0
      [   22.430169] RAX: ffffffffffffffda RBX: 00000000019d48a0 RCX: 00000000004bb81d
      [   22.431375] RDX: 0000000000000000 RSI: 0000000000000880 RDI: 00007ffc70fef028
      [   22.432543] RBP: 0000000000000880 R08: 00000000ffffffff R09: 00007ffc70fef320
      [   22.433692] R10: 0000000000656300 R11: 0000000000000246 R12: 00007ffc70fef028
      [   22.434635] R13: 0000000000000000 R14: 0000000000000002 R15: 0000000000000000
      [   22.435682] irq event stamp: 1169
      [   22.436240] hardirqs last  enabled at (1179): [<ffffffff810df542>] console_unlock+0x422/0x580
      [   22.437466] hardirqs last disabled at (1188): [<ffffffff810df19b>] console_unlock+0x7b/0x580
      [   22.438608] softirqs last  enabled at (866): [<ffffffff81c0038e>] __do_softirq+0x38e/0x490
      [   22.439637] softirqs last disabled at (859): [<ffffffff81a00f42>] asm_call_on_stack+0x12/0x20
      [   22.440690] ---[ end trace 1e7ce7e1e4567276 ]---
      [   22.472832] trace_kprobe: This probe might be able to register after target module is loaded. Continue.
      
      This is because kill_kprobe() calls disarm_kprobe_ftrace() even
      if the given probe is not enabled. In that case, ftrace_set_filter_ip()
      fails because the given probe point is not registered to ftrace.

      Fix this by checking that the given (going) probe is enabled before
      invoking disarm_kprobe_ftrace().
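
      A simplified user-space model of the check described above, as a sketch
      only: struct probe, enable_probe(), disarm_probe() and kill_probe() are
      hypothetical names, not the kernel's kprobe API. The model assumes that
      disarming a probe that was never registered fails with -ENOENT, which is
      the "(-2)" seen in the warning.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of a probe; not the kernel's struct kprobe. */
struct probe {
	bool enabled;     /* the probe was enabled by its user */
	bool registered;  /* the probe point is known to the (simulated) tracer */
	bool gone;        /* the probed module is going away */
};

/* Enabling a probe registers its probe point with the tracer. */
static void enable_probe(struct probe *p)
{
	p->registered = true;
	p->enabled = true;
}

/* Disarming an unregistered probe point fails, as in the report. */
static int disarm_probe(struct probe *p)
{
	if (!p->registered)
		return -ENOENT;
	p->registered = false;
	return 0;
}

/* Only disarm probes that were actually enabled. */
static int kill_probe(struct probe *p)
{
	int ret = 0;

	p->gone = true;
	if (p->enabled)
		ret = disarm_probe(p);
	p->enabled = false;
	return ret;
}
```

      In this model, killing a probe that was never enabled skips the disarm
      step entirely, so the failing call is never made.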
      
      Link: https://lkml.kernel.org/r/159888672694.1411785.5987998076694782591.stgit@devnote2
      
      Fixes: 0cb2f137 ("kprobes: Fix NULL pointer dereference at kprobe_ftrace_handler")
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: "Naveen N . Rao" <naveen.n.rao@linux.ibm.com>
      Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
      Cc: David Miller <davem@davemloft.net>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Chengming Zhou <zhouchengming@bytedance.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • T
      tracing: fix double free · 1d1d62d6
      Tom Rix committed
      stable inclusion
      from linux-4.19.149
      commit 240dd5118a9e0454f280ffeae63f22bd14735733
      
      --------------------------------
      
      commit 46bbe5c6 upstream.
      
      clang static analyzer reports this problem
      
      trace_events_hist.c:3824:3: warning: Attempt to free
        released memory
          kfree(hist_data->attrs->var_defs.name[i]);
      
      In parse_var_defs() if there is a problem allocating
      var_defs.expr, the earlier var_defs.name is freed.
      This free is duplicated by free_var_defs() which frees
      the rest of the list.
      
      Because free_var_defs() has to run anyway, remove the
      second free from parse_var_defs().
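
      The single-owner cleanup idea can be sketched in user space as follows.
      This is an illustration under assumptions, not the kernel code: the
      struct layout, add_var(), dup_str(), failing_dup() and the dup_fn
      parameter (used only to simulate an allocation failure) are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

#define MAX_VARS 4

/* Hypothetical miniature of a variable-definition table. */
struct var_defs {
	unsigned int n_vars;
	char *name[MAX_VARS];
	char *expr[MAX_VARS];
};

typedef char *(*dup_fn)(const char *);

/* Portable string duplication; returns NULL on allocation failure. */
static char *dup_str(const char *s)
{
	size_t len = strlen(s) + 1;
	char *copy = malloc(len);

	if (copy)
		memcpy(copy, s, len);
	return copy;
}

/* Simulates an allocation failure for the expression string. */
static char *failing_dup(const char *s)
{
	(void)s;
	return NULL;
}

/* The only place that frees table entries. */
static void free_var_defs(struct var_defs *d)
{
	for (unsigned int i = 0; i < d->n_vars; i++) {
		free(d->name[i]);
		free(d->expr[i]);   /* free(NULL) is a no-op */
		d->name[i] = NULL;
		d->expr[i] = NULL;
	}
	d->n_vars = 0;
}

/*
 * Adds one definition. On failure nothing is freed here: the slot is
 * already counted, so the caller's free_var_defs() releases it once.
 */
static int add_var(struct var_defs *d, const char *name, const char *expr,
		   dup_fn dup_expr)
{
	unsigned int i = d->n_vars;

	if (i >= MAX_VARS)
		return -1;
	d->name[i] = dup_str(name);
	if (!d->name[i])
		return -1;
	d->expr[i] = NULL;
	d->n_vars++;                 /* slot now belongs to free_var_defs() */

	d->expr[i] = dup_expr(expr);
	if (!d->expr[i])
		return -1;           /* no free here, to avoid a double free */
	return 0;
}
```

      A caller that sees add_var() fail simply runs free_var_defs(), which
      releases the half-filled slot exactly once.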
      
      Link: https://lkml.kernel.org/r/20200907135845.15804-1-trix@redhat.com
      
      Cc: stable@vger.kernel.org
      Fixes: 30350d65 ("tracing: Add variable support to hist triggers")
      Reviewed-by: Tom Zanussi <tom.zanussi@linux.intel.com>
      Signed-off-by: Tom Rix <trix@redhat.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • Y
      bpf: Fix a rcu warning for bpffs map pretty-print · 1f14d906
      Yonghong Song committed
      stable inclusion
      from linux-4.19.149
      commit e1a75e94a3acf78e6afdd548a5d504fc29cbc953
      
      --------------------------------
      
      [ Upstream commit ce880cb8 ]
      
      Running selftest
        ./btf_btf -p
      the kernel had the following warning:
        [   51.528185] WARNING: CPU: 3 PID: 1756 at kernel/bpf/hashtab.c:717 htab_map_get_next_key+0x2eb/0x300
        [   51.529217] Modules linked in:
        [   51.529583] CPU: 3 PID: 1756 Comm: test_btf Not tainted 5.9.0-rc1+ #878
        [   51.530346] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-1.el7.centos 04/01/2014
        [   51.531410] RIP: 0010:htab_map_get_next_key+0x2eb/0x300
        ...
        [   51.542826] Call Trace:
        [   51.543119]  map_seq_next+0x53/0x80
        [   51.543528]  seq_read+0x263/0x400
        [   51.543932]  vfs_read+0xad/0x1c0
        [   51.544311]  ksys_read+0x5f/0xe0
        [   51.544689]  do_syscall_64+0x33/0x40
        [   51.545116]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      The related source code in kernel/bpf/hashtab.c:
        709 static int htab_map_get_next_key(struct bpf_map *map, void *key, void *next_key)
        710 {
        711         struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
        712         struct hlist_nulls_head *head;
        713         struct htab_elem *l, *next_l;
        714         u32 hash, key_size;
        715         int i = 0;
        716
        717         WARN_ON_ONCE(!rcu_read_lock_held());
      
      In kernel/bpf/inode.c, bpffs map pretty print calls map->ops->map_get_next_key()
      without holding rcu_read_lock(), hence the above warning.
      To fix the issue, surround map->ops->map_get_next_key() with the RCU read lock.
      
      Fixes: a26ca7c9 ("bpf: btf: Add pretty print support to the basic arraymap")
      Reported-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Cc: Martin KaFai Lau <kafai@fb.com>
      Link: https://lore.kernel.org/bpf/20200916004401.146277-1-yhs@fb.com
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • S
      lockdep: fix order in trace_hardirqs_off_caller() · 0224fc92
      Sven Schnelle committed
      stable inclusion
      from linux-4.19.149
      commit aafa75ff39d05ad8011c1b8fa118c36acec9661a
      
      --------------------------------
      
      [ Upstream commit 73ac74c7 ]
      
      Switch order so that locking state is consistent even
      if the IRQ tracer calls into lockdep again.
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
      Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • A
      nvme: explicitly update mpath disk capacity on revalidation · df334213
      Anthony Iliopoulos committed
      stable inclusion
      from linux-4.19.149
      commit 906c9129787bf890f3f1b562ddac45c3ec0965a8
      
      --------------------------------
      
      [ Upstream commit 05b29021 ]
      
      Commit 3b4b1972 ("nvme: fix possible deadlock when I/O is
      blocked") reverted multipath head disk revalidation due to deadlocks
      caused by holding the bd_mutex during revalidate.
      
      Updating the multipath disk blockdev size is still required though for
      userspace to be able to observe any resizing while the device is
      mounted. Directly update the bdev inode size to avoid unnecessarily
      holding the bdev->bd_mutex.
      
      Fixes: 3b4b1972 ("nvme: fix possible deadlock when I/O is blocked")
      Signed-off-by: Anthony Iliopoulos <ailiop@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • J
      perf parse-events: Use strcmp() to compare the PMU name · b3584994
      Jin Yao committed
      stable inclusion
      from linux-4.19.149
      commit 31c5c44707d8eb6809100a512b0877da51f795c2
      
      --------------------------------
      
      [ Upstream commit 8510895b ]
      
      A big uncore event group is split into multiple small groups which only
      include the uncore events from the same PMU. This has been supported in
      the commit 3cdc5c2c ("perf parse-events: Handle uncore event
      aliases in small groups properly").
      
      If the event's PMU name starts to repeat, it must be a new event.
      That can be used to distinguish the leader from other members.
      But now it only compares the pointer of pmu_name
      (leader->pmu_name == evsel->pmu_name).
      
      If we use "perf stat -M LLC_MISSES.PCIE_WRITE -a" on cascadelakex,
      the event list is:
      
        evsel->name					evsel->pmu_name
        ---------------------------------------------------------------
        unc_iio_data_req_of_cpu.mem_write.part0		uncore_iio_4 (as leader)
        unc_iio_data_req_of_cpu.mem_write.part0		uncore_iio_2
        unc_iio_data_req_of_cpu.mem_write.part0		uncore_iio_0
        unc_iio_data_req_of_cpu.mem_write.part0		uncore_iio_5
        unc_iio_data_req_of_cpu.mem_write.part0		uncore_iio_3
        unc_iio_data_req_of_cpu.mem_write.part0		uncore_iio_1
        unc_iio_data_req_of_cpu.mem_write.part1		uncore_iio_4
        ......
      
      The event "unc_iio_data_req_of_cpu.mem_write.part1" with
      "uncore_iio_4" should be an event from PMU "uncore_iio_4".
      It's not a new leader for this PMU.

      But with "(leader->pmu_name == evsel->pmu_name)" the check
      fails, and the event is stored to leaders[] as a new
      PMU leader.

      So this patch uses strcmp() to compare the PMU names of the events.
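
      The difference between the two comparisons can be shown with a minimal
      sketch. struct event and both helper functions are hypothetical stand-ins
      (perf's real evsel has many more fields), and the names are assumed to be
      non-NULL.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for an event; only the field used here. */
struct event {
	const char *pmu_name;   /* assumed non-NULL in this sketch */
};

/* True only when both events point at the very same string object. */
static bool same_pmu_by_pointer(const struct event *a, const struct event *b)
{
	return a->pmu_name == b->pmu_name;
}

/* True whenever the two names have identical contents. */
static bool same_pmu_by_name(const struct event *a, const struct event *b)
{
	return strcmp(a->pmu_name, b->pmu_name) == 0;
}
```

      Two events whose names were allocated separately but both read
      "uncore_iio_4" compare unequal by pointer and equal by content, which is
      the case the leader detection has to handle.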
      
      Fixes: d4953f7e ("perf parse-events: Fix 3 use after frees found with clang ASAN")
      Signed-off-by: Jin Yao <yao.jin@linux.intel.com>
      Acked-by: Jiri Olsa <jolsa@redhat.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Jin Yao <yao.jin@intel.com>
      Cc: Kan Liang <kan.liang@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: http://lore.kernel.org/lkml/20200430003618.17002-1-yao.jin@linux.intel.com
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • Z
      vfio/pci: fix racy on error and request eventfd ctx · f2238e58
      Zeng Tao committed
      stable inclusion
      from linux-4.19.149
      commit 0d1682ca6d1314c27d07afacda4dd51baf5fcd94
      
      --------------------------------
      
      [ Upstream commit b872d064 ]
      
      vfio_pci_release() frees and clears the error and request
      eventfd ctx while these ctx could be in use at the same time in
      functions like vfio_pci_request(). They are expected to be protected by
      the vdev->igate mutex, which is missing in vfio_pci_release().

      This issue was introduced by commit 1518ac27 ("vfio/pci: fix memory
      leaks of eventfd ctx"), and since commit 5c5866c5 ("vfio/pci: Clear
      error and request eventfd ctx after releasing") it is very easy to
      trigger a kernel panic like this:
      
      [ 9513.904346] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000008
      [ 9513.913091] Mem abort info:
      [ 9513.915871]   ESR = 0x96000006
      [ 9513.918912]   EC = 0x25: DABT (current EL), IL = 32 bits
      [ 9513.924198]   SET = 0, FnV = 0
      [ 9513.927238]   EA = 0, S1PTW = 0
      [ 9513.930364] Data abort info:
      [ 9513.933231]   ISV = 0, ISS = 0x00000006
      [ 9513.937048]   CM = 0, WnR = 0
      [ 9513.940003] user pgtable: 4k pages, 48-bit VAs, pgdp=0000007ec7d12000
      [ 9513.946414] [0000000000000008] pgd=0000007ec7d13003, p4d=0000007ec7d13003, pud=0000007ec728c003, pmd=0000000000000000
      [ 9513.956975] Internal error: Oops: 96000006 [#1] PREEMPT SMP
      [ 9513.962521] Modules linked in: vfio_pci vfio_virqfd vfio_iommu_type1 vfio hclge hns3 hnae3 [last unloaded: vfio_pci]
      [ 9513.972998] CPU: 4 PID: 1327 Comm: bash Tainted: G        W         5.8.0-rc4+ #3
      [ 9513.980443] Hardware name: Huawei TaiShan 2280 V2/BC82AMDC, BIOS 2280-V2 CS V3.B270.01 05/08/2020
      [ 9513.989274] pstate: 80400089 (Nzcv daIf +PAN -UAO BTYPE=--)
      [ 9513.994827] pc : _raw_spin_lock_irqsave+0x48/0x88
      [ 9513.999515] lr : eventfd_signal+0x6c/0x1b0
      [ 9514.003591] sp : ffff800038a0b960
      [ 9514.006889] x29: ffff800038a0b960 x28: ffff007ef7f4da10
      [ 9514.012175] x27: ffff207eefbbfc80 x26: ffffbb7903457000
      [ 9514.017462] x25: ffffbb7912191000 x24: ffff007ef7f4d400
      [ 9514.022747] x23: ffff20be6e0e4c00 x22: 0000000000000008
      [ 9514.028033] x21: 0000000000000000 x20: 0000000000000000
      [ 9514.033321] x19: 0000000000000008 x18: 0000000000000000
      [ 9514.038606] x17: 0000000000000000 x16: ffffbb7910029328
      [ 9514.043893] x15: 0000000000000000 x14: 0000000000000001
      [ 9514.049179] x13: 0000000000000000 x12: 0000000000000002
      [ 9514.054466] x11: 0000000000000000 x10: 0000000000000a00
      [ 9514.059752] x9 : ffff800038a0b840 x8 : ffff007ef7f4de60
      [ 9514.065038] x7 : ffff007fffc96690 x6 : fffffe01faffb748
      [ 9514.070324] x5 : 0000000000000000 x4 : 0000000000000000
      [ 9514.075609] x3 : 0000000000000000 x2 : 0000000000000001
      [ 9514.080895] x1 : ffff007ef7f4d400 x0 : 0000000000000000
      [ 9514.086181] Call trace:
      [ 9514.088618]  _raw_spin_lock_irqsave+0x48/0x88
      [ 9514.092954]  eventfd_signal+0x6c/0x1b0
      [ 9514.096691]  vfio_pci_request+0x84/0xd0 [vfio_pci]
      [ 9514.101464]  vfio_del_group_dev+0x150/0x290 [vfio]
      [ 9514.106234]  vfio_pci_remove+0x30/0x128 [vfio_pci]
      [ 9514.111007]  pci_device_remove+0x48/0x108
      [ 9514.115001]  device_release_driver_internal+0x100/0x1b8
      [ 9514.120200]  device_release_driver+0x28/0x38
      [ 9514.124452]  pci_stop_bus_device+0x68/0xa8
      [ 9514.128528]  pci_stop_and_remove_bus_device+0x20/0x38
      [ 9514.133557]  pci_iov_remove_virtfn+0xb4/0x128
      [ 9514.137893]  sriov_disable+0x3c/0x108
      [ 9514.141538]  pci_disable_sriov+0x28/0x38
      [ 9514.145445]  hns3_pci_sriov_configure+0x48/0xb8 [hns3]
      [ 9514.150558]  sriov_numvfs_store+0x110/0x198
      [ 9514.154724]  dev_attr_store+0x44/0x60
      [ 9514.158373]  sysfs_kf_write+0x5c/0x78
      [ 9514.162018]  kernfs_fop_write+0x104/0x210
      [ 9514.166010]  __vfs_write+0x48/0x90
      [ 9514.169395]  vfs_write+0xbc/0x1c0
      [ 9514.172694]  ksys_write+0x74/0x100
      [ 9514.176079]  __arm64_sys_write+0x24/0x30
      [ 9514.179987]  el0_svc_common.constprop.4+0x110/0x200
      [ 9514.184842]  do_el0_svc+0x34/0x98
      [ 9514.188144]  el0_svc+0x14/0x40
      [ 9514.191185]  el0_sync_handler+0xb0/0x2d0
      [ 9514.195088]  el0_sync+0x140/0x180
      [ 9514.198389] Code: b9001020 d2800000 52800022 f9800271 (885ffe61)
      [ 9514.204455] ---[ end trace 648de00c8406465f ]---
      [ 9514.212308] note: bash[1327] exited with preempt_count 1
      
      Cc: Qian Cai <cai@lca.pw>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Fixes: 1518ac27 ("vfio/pci: fix memory leaks of eventfd ctx")
      Signed-off-by: Zeng Tao <prime.zeng@hisilicon.com>
      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • S
      nvme: fix possible deadlock when I/O is blocked · b3f87175
      Sagi Grimberg committed
      stable inclusion
      from linux-4.19.149
      commit 03dfb191acea76e6f92379abdbb5335139b28ffa
      
      --------------------------------
      
      [ Upstream commit 3b4b1972 ]
      
      Revert fab7772b ("nvme-multipath: revalidate nvme_ns_head gendisk
      in nvme_validate_ns")
      
      When adding a new namespace to the head disk (via nvme_mpath_set_live)
      we will see partition scan which triggers I/O on the mpath device node.
      This process will usually be triggered from the scan_work which holds
      the scan_lock. If I/O blocks (e.g. after an ANA change, when paths are
      still present but none of them is accessible) this can deadlock on the head
      disk bd_mutex as both partition scan I/O takes it, and head disk revalidation
      takes it to check for resize (also triggered from scan_work on a different
      path). See trace [1].
      
      The mpath disk revalidation was originally added to detect online disk
      size change, but this is no longer needed since commit cb224c3a
      ("nvme: Convert to use set_capacity_revalidate_and_notify") which already
      updates resize info without unnecessarily revalidating the disk (the
      mpath disk doesn't even implement .revalidate_disk fop).
      
      [1]:
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • Z
      cifs: Fix double add page to memcg when cifs_readpages · b309561e
      Zhang Xiaoxu committed
      stable inclusion
      from linux-4.19.149
      commit 5f7ca306c7db558fc81d9b1a45d59d5e1332a8a0
      
      --------------------------------
      
      [ Upstream commit 95a3d8f3 ]
      
      When running xfstests generic/451, there is a BUG at mm/memcontrol.c:
        page:ffffea000560f2c0 refcount:2 mapcount:0 mapping:000000008544e0ea
             index:0xf
        mapping->aops:cifs_addr_ops dentry name:"tst-aio-dio-cycle-write.451"
        flags: 0x2fffff80000001(locked)
        raw: 002fffff80000001 ffffc90002023c50 ffffea0005280088 ffff88815cda0210
        raw: 000000000000000f 0000000000000000 00000002ffffffff ffff88817287d000
        page dumped because: VM_BUG_ON_PAGE(page->mem_cgroup)
        page->mem_cgroup:ffff88817287d000
        ------------[ cut here ]------------
        kernel BUG at mm/memcontrol.c:2659!
        invalid opcode: 0000 [#1] SMP
        CPU: 2 PID: 2038 Comm: xfs_io Not tainted 5.8.0-rc1 #44
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_
          073836-buildvm-ppc64le-16.ppc.4
        RIP: 0010:commit_charge+0x35/0x50
        Code: 0d 48 83 05 54 b2 02 05 01 48 89 77 38 c3 48 c7
              c6 78 4a ea ba 48 83 05 38 b2 02 05 01 e8 63 0d9
        RSP: 0018:ffffc90002023a50 EFLAGS: 00010202
        RAX: 0000000000000000 RBX: ffff88817287d000 RCX: 0000000000000000
        RDX: 0000000000000000 RSI: ffff88817ac97ea0 RDI: ffff88817ac97ea0
        RBP: ffffea000560f2c0 R08: 0000000000000203 R09: 0000000000000005
        R10: 0000000000000030 R11: ffffc900020237a8 R12: 0000000000000000
        R13: 0000000000000001 R14: 0000000000000001 R15: ffff88815a1272c0
        FS:  00007f5071ab0800(0000) GS:ffff88817ac80000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 000055efcd5ca000 CR3: 000000015d312000 CR4: 00000000000006e0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        Call Trace:
         mem_cgroup_charge+0x166/0x4f0
         __add_to_page_cache_locked+0x4a9/0x710
         add_to_page_cache_locked+0x15/0x20
         cifs_readpages+0x217/0x1270
         read_pages+0x29a/0x670
         page_cache_readahead_unbounded+0x24f/0x390
         __do_page_cache_readahead+0x3f/0x60
         ondemand_readahead+0x1f1/0x470
         page_cache_async_readahead+0x14c/0x170
         generic_file_buffered_read+0x5df/0x1100
         generic_file_read_iter+0x10c/0x1d0
         cifs_strict_readv+0x139/0x170
         new_sync_read+0x164/0x250
         __vfs_read+0x39/0x60
         vfs_read+0xb5/0x1e0
         ksys_pread64+0x85/0xf0
         __x64_sys_pread64+0x22/0x30
         do_syscall_64+0x69/0x150
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        RIP: 0033:0x7f5071fcb1af
        Code: Bad RIP value.
        RSP: 002b:00007ffde2cdb8e0 EFLAGS: 00000293 ORIG_RAX: 0000000000000011
        RAX: ffffffffffffffda RBX: 00007ffde2cdb990 RCX: 00007f5071fcb1af
        RDX: 0000000000001000 RSI: 000055efcd5ca000 RDI: 0000000000000003
        RBP: 0000000000000003 R08: 0000000000000000 R09: 0000000000000000
        R10: 0000000000001000 R11: 0000000000000293 R12: 0000000000000001
        R13: 000000000009f000 R14: 0000000000000000 R15: 0000000000001000
        Modules linked in:
        ---[ end trace 725fa14a3e1af65c ]---
      
      Since commit 3fea5a49 ("mm: memcontrol: convert page cache to a new
      mem_cgroup_charge() API") the page charge is no longer cancelled, so a page
      may be added to the page cache twice:
      thread1                       | thread2
      cifs_readpages
      readpages_get_pages
       add_to_page_cache_locked(head,index=n)=0
                                    | readpages_get_pages
                                    | add_to_page_cache_locked(head,index=n+1)=0
       add_to_page_cache_locked(head, index=n+1)=-EEXIST
       then, will next loop with list head page's
       index=n+1 and the page->mapping not NULL
      readpages_get_pages
      add_to_page_cache_locked(head, index=n+1)
       commit_charge
        VM_BUG_ON_PAGE
      
      So we should not go on to the next loop iteration when adding any page
      to the page cache has failed.
      Reported-by: Hulk Robot <hulkci@huawei.com>
      Signed-off-by: Zhang Xiaoxu <zhangxiaoxu5@huawei.com>
      Signed-off-by: Steve French <stfrench@microsoft.com>
      Acked-by: Ronnie Sahlberg <lsahlber@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • A
      vfio/pci: Clear error and request eventfd ctx after releasing · f895b2a1
      Alex Williamson committed
      stable inclusion
      from linux-4.19.149
      commit 41a77298809e7be112f91972d794aa231fbe27aa
      
      --------------------------------
      
      [ Upstream commit 5c5866c5 ]
      
      The next use of the device will generate an underflow from the
      stale reference.
      
      Cc: Qian Cai <cai@lca.pw>
      Fixes: 1518ac27 ("vfio/pci: fix memory leaks of eventfd ctx")
      Reported-by: Daniel Wagner <dwagner@suse.de>
      Reviewed-by: Cornelia Huck <cohuck@redhat.com>
      Tested-by: Daniel Wagner <dwagner@suse.de>
      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • A
      perf kcore_copy: Fix module map when there are no modules loaded · ce72b2dd
      Adrian Hunter committed
      stable inclusion
      from linux-4.19.149
      commit a63689c06a6dd5c0cf2a9221927b9b1b2b2bb9c1
      
      --------------------------------
      
      [ Upstream commit 61f82e3f ]
      
      In the absence of any modules, no "modules" map is created, but there
      are other executable pages to map, due to eBPF JIT, kprobe or ftrace.
      Map them by recognizing that the first "module" symbol is not
      necessarily from a module, and adjust the map accordingly.
      Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Leo Yan <leo.yan@linaro.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Cc: x86@kernel.org
      Link: http://lore.kernel.org/lkml/20200512121922.8997-10-adrian.hunter@intel.com
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • I
      perf metricgroup: Free metric_events on error · 5aaea668
      Ian Rogers committed
      stable inclusion
      from linux-4.19.149
      commit cc6ae85020035734eb13597fd6e8b0074897b837
      
      --------------------------------
      
      [ Upstream commit a159e2fe ]
      
      Avoid a simple memory leak.
      Signed-off-by: Ian Rogers <irogers@google.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Andrii Nakryiko <andriin@fb.com>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Jin Yao <yao.jin@linux.intel.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: John Fastabend <john.fastabend@gmail.com>
      Cc: John Garry <john.garry@huawei.com>
      Cc: Kajol Jain <kjain@linux.ibm.com>
      Cc: Kan Liang <kan.liang@linux.intel.com>
      Cc: Kim Phillips <kim.phillips@amd.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Martin KaFai Lau <kafai@fb.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: Yonghong Song <yhs@fb.com>
      Cc: bpf@vger.kernel.org
      Cc: kp singh <kpsingh@chromium.org>
      Cc: netdev@vger.kernel.org
      Link: http://lore.kernel.org/lkml/20200508053629.210324-10-irogers@google.com
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • X
      perf util: Fix memory leak of prefix_if_not_in · 8b35377a
      Xie XiuQi committed
      stable inclusion
      from linux-4.19.149
      commit dd155a48a0c9b53404b30f6f92ccf9f8160378c1
      
      --------------------------------
      
      [ Upstream commit 07e9a6f5 ]
      
      "str" needs to be freed before returning when asprintf() fails, to avoid
      a memory leak.
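
      A portable sketch of the ownership rule described above. It is not the
      perf source: the function body is an illustration that uses malloc() and
      snprintf() in place of asprintf(), and it assumes the caller hands over a
      heap-allocated string (or NULL).

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Returns str unchanged when it already contains pre; otherwise returns
 * a newly allocated "pre,str". Takes ownership of str: it is freed both
 * on success and when the allocation of the new string fails.
 */
static char *prefix_if_not_in(const char *pre, char *str)
{
	char *n;
	size_t len;

	if (!str || strstr(str, pre))
		return str;

	len = strlen(pre) + 1 + strlen(str) + 1;   /* pre + ',' + str + NUL */
	n = malloc(len);
	if (n)
		snprintf(n, len, "%s,%s", pre, str);

	free(str);   /* released on both paths, so a failure cannot leak it */
	return n;
}
```

      The key point is that free(str) sits after the allocation attempt and is
      reached whether or not the new string could be built.
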
      Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Hongbo Yao <yaohongbo@huawei.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Li Bin <huawei.libin@huawei.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Link: http://lore.kernel.org/lkml/20200521133218.30150-4-liwei391@huawei.com
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      8b35377a
    • J
      perf stat: Fix duration_time value for higher intervals · 09e631e7
      Jiri Olsa committed
      stable inclusion
      from linux-4.19.149
      commit d911653688c588c22bdbc83459f87961c9d4399e
      
      --------------------------------
      
      [ Upstream commit ea9eb1f4 ]
      
      Joakim reported a wrong duration_time value for intervals bigger
      than 4000 [1].

      The problem is in the interval value we pass to the update_stats
      function, which is typed as 'unsigned int' and overflows when
      we get over 2^32 (happens between intervals 4000 and 5000).

      Fix this by retyping the passed value to unsigned long long.
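
      The truncation can be reproduced in isolation.  The sketch below
      assumes the interval is given in milliseconds and converted to
      nanoseconds before being passed on; record_narrow() and
      record_wide() are hypothetical stand-ins for a callee with a
      32-bit and a 64-bit parameter, not the perf functions themselves:

```c
#include <stdint.h>

#define NSEC_PER_MSEC 1000000ULL

/* Narrow parameter: values above UINT32_MAX are silently truncated. */
static uint64_t record_narrow(uint32_t interval_ns)
{
	return interval_ns;
}

/* Wide parameter: the full nanosecond value survives the call. */
static uint64_t record_wide(uint64_t interval_ns)
{
	return interval_ns;
}

/* Convert an interval in milliseconds to nanoseconds using 64-bit math. */
static uint64_t interval_ms_to_ns(unsigned int ms)
{
	return ms * NSEC_PER_MSEC;
}
```

      With these definitions, 4000 ms (4,000,000,000 ns) still fits in
      32 bits, while 5000 ms (5,000,000,000 ns) wraps modulo 2^32 to
      705,032,704 when it goes through record_narrow().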
      
      [1] https://www.spinics.net/lists/linux-perf-users/msg11777.html
      
      Fixes: b90f1333 ("perf stat: Update walltime_nsecs_stats in interval mode")
      Reported-by: Joakim Zhang <qiangqing.zhang@nxp.com>
      Signed-off-by: Jiri Olsa <jolsa@kernel.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Michael Petlan <mpetlan@redhat.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: http://lore.kernel.org/lkml/20200518131445.3745083-1-jolsa@kernel.org
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      09e631e7
    • I
      perf evsel: Fix 2 memory leaks · f9612c63
      Ian Rogers committed
      stable inclusion
      from linux-4.19.149
      commit 56540590ce7c316947d6740edc0403182a1e1ade
      
      --------------------------------
      
      [ Upstream commit 3efc899d ]
      
      If allocated, perf_pkg_mask and metric_events need freeing.
      Signed-off-by: Ian Rogers <irogers@google.com>
      Reviewed-by: Andi Kleen <ak@linux.intel.com>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Link: http://lore.kernel.org/lkml/20200512235918.10732-1-irogers@google.com
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      f9612c63
    • Q
      vfio/pci: fix memory leaks of eventfd ctx · 889371c3
      Qian Cai committed
      stable inclusion
      from linux-4.19.149
      commit b7e24664cc816717ca2a45b773d950a9188fb5c1
      
      --------------------------------
      
      [ Upstream commit 1518ac27 ]
      
      Finishing a qemu-kvm run (-device vfio-pci,host=0001:01:00.0)
      triggers a few memory leaks after a while, because
      vfio_pci_set_ctx_trigger_single() calls eventfd_ctx_fdget() without
      a matching eventfd_ctx_put() later.  Fix it by calling
      eventfd_ctx_put() for those eventfd contexts in vfio_pci_release()
      before vfio_device_release().
      
      unreferenced object 0xebff008981cc2b00 (size 128):
        comm "qemu-kvm", pid 4043, jiffies 4294994816 (age 9796.310s)
        hex dump (first 32 bytes):
          01 00 00 00 6b 6b 6b 6b 00 00 00 00 ad 4e ad de  ....kkkk.....N..
          ff ff ff ff 6b 6b 6b 6b ff ff ff ff ff ff ff ff  ....kkkk........
        backtrace:
          [<00000000917e8f8d>] slab_post_alloc_hook+0x74/0x9c
          [<00000000df0f2aa2>] kmem_cache_alloc_trace+0x2b4/0x3d4
          [<000000005fcec025>] do_eventfd+0x54/0x1ac
          [<0000000082791a69>] __arm64_sys_eventfd2+0x34/0x44
          [<00000000b819758c>] do_el0_svc+0x128/0x1dc
          [<00000000b244e810>] el0_sync_handler+0xd0/0x268
          [<00000000d495ef94>] el0_sync+0x164/0x180
      unreferenced object 0x29ff008981cc4180 (size 128):
        comm "qemu-kvm", pid 4043, jiffies 4294994818 (age 9796.290s)
        hex dump (first 32 bytes):
          01 00 00 00 6b 6b 6b 6b 00 00 00 00 ad 4e ad de  ....kkkk.....N..
          ff ff ff ff 6b 6b 6b 6b ff ff ff ff ff ff ff ff  ....kkkk........
        backtrace:
          [<00000000917e8f8d>] slab_post_alloc_hook+0x74/0x9c
          [<00000000df0f2aa2>] kmem_cache_alloc_trace+0x2b4/0x3d4
          [<000000005fcec025>] do_eventfd+0x54/0x1ac
          [<0000000082791a69>] __arm64_sys_eventfd2+0x34/0x44
          [<00000000b819758c>] do_el0_svc+0x128/0x1dc
          [<00000000b244e810>] el0_sync_handler+0xd0/0x268
          [<00000000d495ef94>] el0_sync+0x164/0x180
      Signed-off-by: Qian Cai <cai@lca.pw>
      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      889371c3
    • S
      printk: handle blank console arguments passed in. · 86592b2c
      Shreyas Joshi committed
      stable inclusion
      from linux-4.19.149
      commit c6a9585611a538466c8ad2421035c0ffa7fabc77
      
      --------------------------------
      
      [ Upstream commit 48021f98 ]
      
      If U-Boot passes a blank string to console_setup, memory gets
      trashed.  Ultimately, the kernel crashes while freeing up that
      memory.

      This fix checks whether a blank parameter is being passed to
      console_setup from U-Boot.  If it detects that the console
      parameter is blank, it doesn't set up the serial device and exits
      gracefully.
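
      The early-exit check can be sketched in user space as follows.
      The struct console_choice type, the NAME_MAX_LEN limit and the
      parse_console_arg() function are hypothetical and only illustrate
      the guard; they are not the printk code:

```c
#include <string.h>

#define NAME_MAX_LEN 16	/* hypothetical limit for this sketch */

struct console_choice {
	char name[NAME_MAX_LEN];
	int configured;
};

/* Returns 1 ("handled") in every case, like a setup-style handler. */
static int parse_console_arg(const char *arg, struct console_choice *out)
{
	out->configured = 0;
	out->name[0] = '\0';

	/* Blank argument: leave the console unconfigured and exit early. */
	if (arg == NULL || arg[0] == '\0')
		return 1;

	strncpy(out->name, arg, NAME_MAX_LEN - 1);
	out->name[NAME_MAX_LEN - 1] = '\0';
	out->configured = 1;
	return 1;
}
```

      A blank argument leaves the configured flag at 0, so later code
      never works on a half-initialized console entry.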
      
      Link: https://lore.kernel.org/r/20200522065306.83-1-shreyas.joshi@biamp.com
      Signed-off-by: Shreyas Joshi <shreyas.joshi@biamp.com>
      Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      [pmladek@suse.com: Better format the commit message and code, remove unnecessary brackets.]
      Signed-off-by: Petr Mladek <pmladek@suse.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      86592b2c
    • A
      arm64/cpufeature: Drop TraceFilt feature exposure from ID_DFR0 register · 827f5e68
      Anshuman Khandual committed
      stable inclusion
      from linux-4.19.149
      commit e682e0d53c390467100dadd0cebcf8f4f0b9498e
      
      --------------------------------
      
      [ Upstream commit 1ed1b90a ]
      
      The ID_DFR0-based TraceFilt feature should not be exposed to
      guests.  Hence let's drop it.
      
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Marc Zyngier <maz@kernel.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: James Morse <james.morse@arm.com>
      Cc: Suzuki K Poulose <suzuki.poulose@arm.com>
      Cc: linux-arm-kernel@lists.infradead.org
      Cc: linux-kernel@vger.kernel.org
      Suggested-by: Mark Rutland <mark.rutland@arm.com>
      Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com>
      Link: https://lore.kernel.org/r/1589881254-10082-3-git-send-email-anshuman.khandual@arm.com
      Signed-off-by: Will Deacon <will@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      827f5e68
    • M
      fuse: don't check refcount after stealing page · 6f09c7c7
      Miklos Szeredi committed
      stable inclusion
      from linux-4.19.149
      commit 59da76a1713f7fd82d9c18ec72be99085b557027
      
      --------------------------------
      
      [ Upstream commit 32f98877 ]
      
      page_count() is unstable.  Unless there has been an RCU grace period
      between when the page was removed from the page cache and now, a
      speculative reference may exist from the page cache.
      Reported-by: Matthew Wilcox <willy@infradead.org>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      6f09c7c7
    • I
      perf mem2node: Avoid double free related to realloc · d10e06ba
      Ian Rogers committed
      stable inclusion
      from linux-4.19.149
      commit 318af7241223eea9fc16413b04a6915518ab1e9c
      
      --------------------------------
      
      [ Upstream commit 266150c9 ]
      
      A realloc of size zero is a free, not an error; avoid this causing
      a double free. Caught by clang's address sanitizer:
      
      ==2634==ERROR: AddressSanitizer: attempting double-free on 0x6020000015f0 in thread T0:
          #0 0x5649659297fd in free llvm/llvm-project/compiler-rt/lib/asan/asan_malloc_linux.cpp:123:3
          #1 0x5649659e9251 in __zfree tools/lib/zalloc.c:13:2
          #2 0x564965c0f92c in mem2node__exit tools/perf/util/mem2node.c:114:2
          #3 0x564965a08b4c in perf_c2c__report tools/perf/builtin-c2c.c:2867:2
          #4 0x564965a0616a in cmd_c2c tools/perf/builtin-c2c.c:2989:10
          #5 0x564965944348 in run_builtin tools/perf/perf.c:312:11
          #6 0x564965943235 in handle_internal_command tools/perf/perf.c:364:8
          #7 0x5649659440c4 in run_argv tools/perf/perf.c:408:2
          #8 0x564965942e41 in main tools/perf/perf.c:538:3
      
      0x6020000015f0 is located 0 bytes inside of 1-byte region [0x6020000015f0,0x6020000015f1)
      freed by thread T0 here:
          #0 0x564965929da3 in realloc third_party/llvm/llvm-project/compiler-rt/lib/asan/asan_malloc_linux.cpp:164:3
          #1 0x564965c0f55e in mem2node__init tools/perf/util/mem2node.c:97:16
          #2 0x564965a08956 in perf_c2c__report tools/perf/builtin-c2c.c:2803:8
          #3 0x564965a0616a in cmd_c2c tools/perf/builtin-c2c.c:2989:10
          #4 0x564965944348 in run_builtin tools/perf/perf.c:312:11
          #5 0x564965943235 in handle_internal_command tools/perf/perf.c:364:8
          #6 0x5649659440c4 in run_argv tools/perf/perf.c:408:2
          #7 0x564965942e41 in main tools/perf/perf.c:538:3
      
      previously allocated by thread T0 here:
          #0 0x564965929c42 in calloc third_party/llvm/llvm-project/compiler-rt/lib/asan/asan_malloc_linux.cpp:154:3
          #1 0x5649659e9220 in zalloc tools/lib/zalloc.c:8:9
          #2 0x564965c0f32d in mem2node__init tools/perf/util/mem2node.c:61:12
          #3 0x564965a08956 in perf_c2c__report tools/perf/builtin-c2c.c:2803:8
          #4 0x564965a0616a in cmd_c2c tools/perf/builtin-c2c.c:2989:10
          #5 0x564965944348 in run_builtin tools/perf/perf.c:312:11
          #6 0x564965943235 in handle_internal_command tools/perf/perf.c:364:8
          #7 0x5649659440c4 in run_argv tools/perf/perf.c:408:2
          #8 0x564965942e41 in main tools/perf/perf.c:538:3
      
      v2: add a WARN_ON_ONCE when the free condition arises.
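
      One way to avoid the pattern, shown as a hedged sketch rather than
      the change made in mem2node.c: never hand a zero size to
      realloc(), so a NULL return always means "allocation failed, old
      pointer still valid".  The helper name shrink_entries() is
      hypothetical:

```c
#include <stdlib.h>

/*
 * Hypothetical helper: shrink *buf to n elements.  When n is 0 the
 * buffer is left alone, because realloc(ptr, 0) may free ptr and return
 * NULL, which a caller could mistake for a failure and free again.
 * Returns 0 on success, -1 on allocation failure.
 */
static int shrink_entries(int **buf, size_t n)
{
	int *tmp;

	if (n == 0)
		return 0;	/* keep the allocation; caller frees it once */

	tmp = realloc(*buf, n * sizeof(**buf));
	if (!tmp)
		return -1;	/* *buf is still valid and owned by the caller */

	*buf = tmp;
	return 0;
}
```

      With this guard the caller's single free() at teardown stays the
      only release of the buffer, whatever the final element count is.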
      Signed-off-by: Ian Rogers <irogers@google.com>
      Acked-by: Jiri Olsa <jolsa@redhat.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: clang-built-linux@googlegroups.com
      Link: http://lore.kernel.org/lkml/20200320182347.87675-1-irogers@google.com
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      d10e06ba
    • D
      bdev: Reduce time holding bd_mutex in sync in blkdev_close() · 465334c3
      Douglas Anderson committed
      stable inclusion
      from linux-4.19.149
      commit b6256c2966706c279f54bdd2c6582c7c370e9467
      
      --------------------------------
      
      [ Upstream commit b849dd84 ]
      
      While trying to "dd" to the block device for a USB stick, I
      encountered a hung task warning (blocked for > 120 seconds).  I
      managed to come up with an easy way to reproduce this on my system
      (where /dev/sdb is the block device for my USB stick) with:
      
        while true; do dd if=/dev/zero of=/dev/sdb bs=4M; done
      
      With my reproduction here are the relevant bits from the hung task
      detector:
      
       INFO: task udevd:294 blocked for more than 122 seconds.
       ...
       udevd           D    0   294      1 0x00400008
       Call trace:
        ...
        mutex_lock_nested+0x40/0x50
        __blkdev_get+0x7c/0x3d4
        blkdev_get+0x118/0x138
        blkdev_open+0x94/0xa8
        do_dentry_open+0x268/0x3a0
        vfs_open+0x34/0x40
        path_openat+0x39c/0xdf4
        do_filp_open+0x90/0x10c
        do_sys_open+0x150/0x3c8
        ...
      
       ...
       Showing all locks held in the system:
       ...
       1 lock held by dd/2798:
        #0: ffffff814ac1a3b8 (&bdev->bd_mutex){+.+.}, at: __blkdev_put+0x50/0x204
       ...
       dd              D    0  2798   2764 0x00400208
       Call trace:
        ...
        schedule+0x8c/0xbc
        io_schedule+0x1c/0x40
        wait_on_page_bit_common+0x238/0x338
        __lock_page+0x5c/0x68
        write_cache_pages+0x194/0x500
        generic_writepages+0x64/0xa4
        blkdev_writepages+0x24/0x30
        do_writepages+0x48/0xa8
        __filemap_fdatawrite_range+0xac/0xd8
        filemap_write_and_wait+0x30/0x84
        __blkdev_put+0x88/0x204
        blkdev_put+0xc4/0xe4
        blkdev_close+0x28/0x38
        __fput+0xe0/0x238
        ____fput+0x1c/0x28
        task_work_run+0xb0/0xe4
        do_notify_resume+0xfc0/0x14bc
        work_pending+0x8/0x14
      
      The problem appears related to the fact that my USB disk is terribly
      slow and that I have a lot of RAM in my system to cache things.
      Specifically my writes seem to be happening at ~15 MB/s and I've got
      ~4 GB of RAM in my system that can be used for buffering.  To write 4
      GB of buffer to disk thus takes ~4000 MB / ~15 MB/s = ~267 seconds.
      
      The 267 second number is a problem because in __blkdev_put() we call
      sync_blockdev() while holding the bd_mutex.  Any other callers who
      want the bd_mutex will be blocked for the whole time.
      
      The problem is made worse because I believe blkdev_put() specifically
      tells other tasks (namely udev) to go try to access the device at right
      around the same time we're going to hold the mutex for a long time.
      
      Putting some traces around this (after disabling the hung task detector),
      I could confirm:
       dd:    437.608600: __blkdev_put() right before sync_blockdev() for sdb
       udevd: 437.623901: blkdev_open() right before blkdev_get() for sdb
       dd:    661.468451: __blkdev_put() right after sync_blockdev() for sdb
       udevd: 663.820426: blkdev_open() right after blkdev_get() for sdb
      
      A simple fix for this is to realize that sync_blockdev() works fine if
      you're not holding the mutex.  Also, it's not the end of the world if
      you sync a little early (though it can have performance impacts).
      Thus we can make a guess that we're going to need to do the sync and
      then do it without holding the mutex.  We still do one last sync with
      the mutex but it should be much, much faster.
      
      With this, my hung task warnings for my test case are gone.
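
      The idea can be illustrated with a single-threaded toy model.
      Everything below is hypothetical (struct toy_dev, flush() and both
      close variants); it only counts how much dirty data is written
      back while the "mutex" is held, and it does not model when the
      kernel decides that the early sync is needed:

```c
/* Toy block device: amount of dirty data and a flag for the mutex. */
struct toy_dev {
	int dirty_mb;
	int locked;
	int flushed_under_lock_mb;
};

/* Write back all dirty data, accounting for work done under the lock. */
static void flush(struct toy_dev *d)
{
	if (d->locked)
		d->flushed_under_lock_mb += d->dirty_mb;
	d->dirty_mb = 0;
}

/* Old behaviour: the whole sync happens with the mutex held. */
static void close_sync_locked(struct toy_dev *d)
{
	d->locked = 1;
	flush(d);
	d->locked = 0;
}

/* New behaviour: sync early without the mutex, then once more with it. */
static void close_sync_early(struct toy_dev *d)
{
	flush(d);		/* slow part, nobody is blocked */
	d->locked = 1;
	flush(d);		/* usually little or nothing left to write */
	d->locked = 0;
}
```

      In this model the second flush finds no dirty data, so the time
      other tasks would spend waiting for the mutex drops to nearly
      nothing.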
      Signed-off-by: Douglas Anderson <dianders@chromium.org>
      Reviewed-by: Guenter Roeck <groeck@chromium.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      465334c3
    • J
      mm/mmap.c: initialize align_offset explicitly for vm_unmapped_area · 877b77e6
      Jaewon Kim committed
      stable inclusion
      from linux-4.19.149
      commit 6bee7991f63e6ae8faba0c704f4d98575bb0312f
      
      --------------------------------
      
      [ Upstream commit 09ef5283 ]
      
      When passing their requirements to vm_unmapped_area,
      arch_get_unmapped_area and arch_get_unmapped_area_topdown did not
      set align_offset.  Internally, in both unmapped_area and
      unmapped_area_topdown, info->align_offset is meaningless if
      info->align_mask is 0.
      
      But commit df529cab ("mm: mmap: add trace point of
      vm_unmapped_area") always prints info->align_offset even though it is
      uninitialized.
      
      Fix this uninitialized value issue by setting it to 0 explicitly.
      
      Before:
        vm_unmapped_area: addr=0x755b155000 err=0 total_vm=0x15aaf0 flags=0x1 len=0x109000 lo=0x8000 hi=0x75eed48000 mask=0x0 ofs=0x4022
      
      After:
        vm_unmapped_area: addr=0x74a4ca1000 err=0 total_vm=0x168ab1 flags=0x1 len=0x9000 lo=0x8000 hi=0x753d94b000 mask=0x0 ofs=0x0
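
      A small sketch of the fix pattern, using a hypothetical
      struct unmapped_area_request that only mirrors the field names
      printed by the trace point above.  Every field a later reader may
      print is assigned explicitly, including the ones that are unused
      when the mask is 0:

```c
/* Hypothetical request structure for this sketch. */
struct unmapped_area_request {
	unsigned long flags;
	unsigned long length;
	unsigned long low_limit;
	unsigned long high_limit;
	unsigned long align_mask;
	unsigned long align_offset;
};

/* Fill every field, so stale stack contents can never leak into a trace. */
static void fill_request(struct unmapped_area_request *info,
			 unsigned long len,
			 unsigned long lo, unsigned long hi)
{
	info->flags = 0;
	info->length = len;
	info->low_limit = lo;
	info->high_limit = hi;
	info->align_mask = 0;
	info->align_offset = 0;	/* unused when align_mask is 0, but printed */
}
```

      A stale value such as the ofs=0x4022 seen in the "Before" trace
      cannot survive a call to fill_request().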
      Signed-off-by: Jaewon Kim <jaewon31.kim@samsung.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Borislav Petkov <bp@suse.de>
      Link: http://lkml.kernel.org/r/20200409094035.19457-1-jaewon31.kim@samsung.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Conflict:
        mm/mmap.c
      [yyl: adjust context]
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      877b77e6
    • Q
      mm/vmscan.c: fix data races using kswapd_classzone_idx · e4783368
      Qian Cai committed
      stable inclusion
      from linux-4.19.149
      commit b73c744019721ea47340b37440a7f6a263beea54
      
      --------------------------------
      
      [ Upstream commit 5644e1fb ]
      
      pgdat->kswapd_classzone_idx could be accessed concurrently in
      wakeup_kswapd().  Plain writes and reads without any lock protection
      result in data races.  Fix them by adding a pair of READ|WRITE_ONCE() as
      well as saving a branch (compilers might well optimize the original code
      in an unintentional way anyway).  While at it, also take care of
      pgdat->kswapd_order and non-kswapd threads in allow_direct_reclaim().  The
      data races were reported by KCSAN,
      
       BUG: KCSAN: data-race in wakeup_kswapd / wakeup_kswapd
      
       write to 0xffff9f427ffff2dc of 4 bytes by task 7454 on cpu 13:
        wakeup_kswapd+0xf1/0x400
        wakeup_kswapd at mm/vmscan.c:3967
        wake_all_kswapds+0x59/0xc0
        wake_all_kswapds at mm/page_alloc.c:4241
        __alloc_pages_slowpath+0xdcc/0x1290
        __alloc_pages_slowpath at mm/page_alloc.c:4512
        __alloc_pages_nodemask+0x3bb/0x450
        alloc_pages_vma+0x8a/0x2c0
        do_anonymous_page+0x16e/0x6f0
        __handle_mm_fault+0xcd5/0xd40
        handle_mm_fault+0xfc/0x2f0
        do_page_fault+0x263/0x6f9
        page_fault+0x34/0x40
      
       1 lock held by mtest01/7454:
        #0: ffff9f425afe8808 (&mm->mmap_sem#2){++++}, at:
       do_page_fault+0x143/0x6f9
       do_user_addr_fault at arch/x86/mm/fault.c:1405
       (inlined by) do_page_fault at arch/x86/mm/fault.c:1539
       irq event stamp: 6944085
       count_memcg_event_mm+0x1a6/0x270
       count_memcg_event_mm+0x119/0x270
       __do_softirq+0x34c/0x57c
       irq_exit+0xa2/0xc0
      
       read to 0xffff9f427ffff2dc of 4 bytes by task 7472 on cpu 38:
        wakeup_kswapd+0xc8/0x400
        wake_all_kswapds+0x59/0xc0
        __alloc_pages_slowpath+0xdcc/0x1290
        __alloc_pages_nodemask+0x3bb/0x450
        alloc_pages_vma+0x8a/0x2c0
        do_anonymous_page+0x16e/0x6f0
        __handle_mm_fault+0xcd5/0xd40
        handle_mm_fault+0xfc/0x2f0
        do_page_fault+0x263/0x6f9
        page_fault+0x34/0x40
      
       1 lock held by mtest01/7472:
        #0: ffff9f425a9ac148 (&mm->mmap_sem#2){++++}, at:
       do_page_fault+0x143/0x6f9
       irq event stamp: 6793561
       count_memcg_event_mm+0x1a6/0x270
       count_memcg_event_mm+0x119/0x270
       __do_softirq+0x34c/0x57c
       irq_exit+0xa2/0xc0
      
       BUG: KCSAN: data-race in kswapd / wakeup_kswapd
      
       write to 0xffff90973ffff2dc of 4 bytes by task 820 on cpu 6:
        kswapd+0x27c/0x8d0
        kthread+0x1e0/0x200
        ret_from_fork+0x27/0x50
      
       read to 0xffff90973ffff2dc of 4 bytes by task 6299 on cpu 0:
        wakeup_kswapd+0xf3/0x450
        wake_all_kswapds+0x59/0xc0
        __alloc_pages_slowpath+0xdcc/0x1290
        __alloc_pages_nodemask+0x3bb/0x450
        alloc_pages_vma+0x8a/0x2c0
        do_anonymous_page+0x170/0x700
        __handle_mm_fault+0xc9f/0xd00
        handle_mm_fault+0xfc/0x2f0
        do_page_fault+0x263/0x6f9
        page_fault+0x34/0x40
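
      A user-space sketch of the READ|WRITE_ONCE() pattern described
      above.  The two macros are simplified stand-ins built on a
      volatile cast and the __typeof__ extension of GCC and Clang, not
      the kernel definitions, and struct node_state, MAX_NR_ZONES and
      request_wakeup() are hypothetical.  A volatile access only stops
      the compiler from tearing, repeating or eliding the load or store;
      it adds no ordering or atomic read-modify-write guarantee:

```c
/* Simplified stand-ins for the kernel macros (GCC/Clang __typeof__). */
#define READ_ONCE(x)		(*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, val)	(*(volatile __typeof__(x) *)&(x) = (val))

#define MAX_NR_ZONES 4	/* hypothetical "no request pending" marker */

struct node_state {
	int classzone_idx;
	int order;
};

/* Record the highest requested zone index and order, one load each. */
static void request_wakeup(struct node_state *s, int classzone_idx, int order)
{
	int cur = READ_ONCE(s->classzone_idx);

	if (cur == MAX_NR_ZONES || cur < classzone_idx)
		WRITE_ONCE(s->classzone_idx, classzone_idx);

	if (READ_ONCE(s->order) < order)
		WRITE_ONCE(s->order, order);
}
```

      Reading the shared field once into a local variable also means
      both halves of the comparison see the same value, which is the
      "saving a branch" side effect mentioned in the description.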
      Signed-off-by: Qian Cai <cai@lca.pw>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Marco Elver <elver@google.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Link: http://lkml.kernel.org/r/1582749472-5171-1-git-send-email-cai@lca.pw
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      e4783368
    • X
      mm/filemap.c: clear page error before actual read · cfdea429
      Xianting Tian committed
      stable inclusion
      from linux-4.19.149
      commit cebefe4f6fc0cf5721d443b91e8f43a66766fb06
      
      --------------------------------
      
      [ Upstream commit faffdfa0 ]
      
      A mount failure happens in the following scenario: an application
      forks dozens of threads to mount the same number of cramfs images
      separately in docker, but several mounts fail with high
      probability.  The mount fails because the page (read from the
      superblock of the loop dev) is not uptodate after
      wait_on_page_locked(page) returns in function cramfs_read:
      
         wait_on_page_locked(page);
         if (!PageUptodate(page)) {
            ...
         }
      
      The page is not uptodate for the following reason: systemd-udevd
      reads the loopX dev before the mount.  Because the status of loopX
      is Lo_unbound at this time, loop_make_request directly triggers the
      io_end handler end_buffer_async_read, which calls
      SetPageError(page).  As a result, the page can't be set to uptodate
      in function end_buffer_async_read:
      
         if(page_uptodate && !PageError(page)) {
            SetPageUptodate(page);
         }
      
      Then the mount operation is performed.  It uses the same page that
      was just accessed by systemd-udevd above.  Because this page is not
      uptodate, it launches an actual read via submit_bh, then waits on
      this page by calling wait_on_page_locked(page).  When the I/O of
      the page is done, the io_end handler end_buffer_async_read is
      called.  Because no one cleared the page error caused by the
      systemd-udevd read (during the whole read path of mount), the page
      is still in "PageError" status, so it can't be set to uptodate in
      function end_buffer_async_read, and the mount fails.
      
      But sometimes the mount succeeds even though systemd-udevd read the
      loopX dev just before.  The reason is that systemd-udevd launched
      another loopX read just between steps 3.1 and 3.2; the steps are as
      below:

      1, loopX dev default status is Lo_unbound;
      2, systemd-udevd reads loopX dev (page is set to PageError);
      3, mount operation
         1) set loopX status to Lo_bound;
         ==>systemd-udevd reads loopX dev<==
         2) read loopX dev (page has no error)
         3) mount succeeds
      
      As the loopX dev status is set to Lo_bound after step 3.1, the
      other loopX dev read by systemd-udevd goes through the whole I/O
      stack; part of the call trace is as below:
      
         SYS_read
            vfs_read
                do_sync_read
                    blkdev_aio_read
                       generic_file_aio_read
                           do_generic_file_read:
                              ClearPageError(page);
                              mapping->a_ops->readpage(filp, page);
      
      Here, mapping->a_ops->readpage() is blkdev_readpage.  In the latest
      kernel some function names have changed; the call trace is as
      below:
      
         blkdev_read_iter
            generic_file_read_iter
               generic_file_buffered_read:
                  /*
                   * A previous I/O error may have been due to temporary
                   * failures, eg. multipath errors.
                   * PG_error will be set again if readpage fails.
                   */
                  ClearPageError(page);
                  /* Start the actual read. The read will unlock the page. */
                  error = mapping->a_ops->readpage(filp, page);
      
      We can see ClearPageError(page) is called before the actual read,
      so the read in step 3.2 succeeds.

      This patch adds a call to ClearPageError just before the actual
      read in the read path of the cramfs mount.  Without the patch, the
      call trace when performing a cramfs mount is as below:
      
         do_mount
            cramfs_read
               cramfs_blkdev_read
                  read_cache_page
                     do_read_cache_page:
                        filler(data, page);
                        or
                        mapping->a_ops->readpage(data, page);
      
      With the patch, the call trace when performing the mount is as below:
      
         do_mount
            cramfs_read
               cramfs_blkdev_read
                  read_cache_page:
                     do_read_cache_page:
                        ClearPageError(page); <== new add
                        filler(data, page);
                        or
                        mapping->a_ops->readpage(data, page);
      
      With the patch, the mount operation triggers a call to
      ClearPageError(page) before the actual read, so the page has no
      error unless an additional page error happens when the I/O is done.
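
      The sticky-error behaviour can be modelled in a few lines.  The
      struct page_model type and both functions are hypothetical and
      only mirror the completion rule quoted earlier (a page becomes
      uptodate only if the read succeeded and no error flag is set):

```c
#include <stdbool.h>

/* Toy page with the two flags discussed above. */
struct page_model {
	bool error;
	bool uptodate;
};

/* Completion handler: a failed read sets the error flag, and a
 * successful read marks the page uptodate only if no error is set. */
static void end_read(struct page_model *p, bool io_ok)
{
	if (!io_ok) {
		p->error = true;
		return;
	}
	if (!p->error)
		p->uptodate = true;
}

/* Read path: optionally clear a stale error before starting the I/O. */
static bool read_page(struct page_model *p, bool io_ok, bool clear_error_first)
{
	if (clear_error_first)
		p->error = false;
	end_read(p, io_ok);
	return p->uptodate;
}
```

      In this model, a failed read followed by a successful read without
      clearing leaves the page not uptodate, while the same successful
      read with clear_error_first set marks it uptodate.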
      Signed-off-by: Xianting Tian <xianting_tian@126.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Jan Kara <jack@suse.cz>
      Cc: <yubin@h3c.com>
      Link: http://lkml.kernel.org/r/1583318844-22971-1-git-send-email-xianting_tian@126.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      cfdea429
    • N
      mm/kmemleak.c: use address-of operator on section symbols · 747872b0
      Nathan Chancellor committed
      stable inclusion
      from linux-4.19.149
      commit afe001488e7e8e1108a2d9fcac3757713ffae503
      
      --------------------------------
      
      [ Upstream commit b0d14fc4 ]
      
      Clang warns:
      
        mm/kmemleak.c:1955:28: warning: array comparison always evaluates to a constant [-Wtautological-compare]
              if (__start_ro_after_init < _sdata || __end_ro_after_init > _edata)
                                        ^
        mm/kmemleak.c:1955:60: warning: array comparison always evaluates to a constant [-Wtautological-compare]
              if (__start_ro_after_init < _sdata || __end_ro_after_init > _edata)
      
      These are not true arrays; they are linker-defined symbols, which
      are just addresses.  Using the address-of operator silences the
      warning and does not change the resulting assembly with either
      clang/ld.lld or gcc/ld (tested with diff + objdump -Dr).
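
      Why the address-of form yields the same comparison can be shown
      with ordinary arrays.  In this sketch the two members of the
      hypothetical "image" struct stand in for linker-defined section
      symbols; an array used in an expression decays to the address of
      its first element, which is the same address as that of the array
      itself:

```c
#include <stdint.h>

/* Two adjacent arrays standing in for section start/end symbols. */
static struct {
	char start[8];
	char end[8];
} image;

/* The decayed array and its address-of form name the same address. */
static int same_address(void)
{
	return (uintptr_t)image.start == (uintptr_t)&image.start;
}

/* Comparing with the address-of operator orders the two symbols. */
static int start_before_end(void)
{
	return &image.start < &image.end;
}
```

      Only the pointer type differs between the two spellings, which is
      consistent with the generated assembly staying unchanged.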
      Suggested-by: Nick Desaulniers <ndesaulniers@google.com>
      Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Link: https://github.com/ClangBuiltLinux/linux/issues/895
      Link: http://lkml.kernel.org/r/20200220051551.44000-1-natechancellor@gmail.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      747872b0