1. 02 9月, 2020 1 次提交
    • Y
      alinux: sched: Add cpu_stress to show system-wide task waiting · ab81d2d9
      Yihao Wu 提交于
      to #28739709
      
      /proc/loadavg can reflex the waiting tasks over a period of time
      to some extent. But to become a SLI requires better precision and
      quicker response. Furthermore, I/O block is not concerned here,
      and bandwidth control is excluded from cpu_stress.
      
      This patch adds a new interface /proc/cpu_stress. It's based on
      task runtime tracking so we don't need to deal with complex state
      transition. And because task runtime tracking is done in most
      scheduler events, the precision is quite enough.
      
      Like loadavg, cpu_stress has 3 average windows too (1,5,15 min)
      Signed-off-by: NYihao Wu <wuyihao@linux.alibaba.com>
      Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
      ab81d2d9
  2. 28 5月, 2020 2 次提交
  3. 24 4月, 2020 1 次提交
  4. 18 3月, 2020 1 次提交
    • X
      alinux: fs: record page or bio info while process is waitting on it · 2864376f
      Xiaoguang Wang 提交于
      If one process context is stucked in wait_on_buffer(), lock_buffer(),
      lock_page() and wait_on_page_writeback() and wait_on_bit_io(), it's
      hard to tell ture reason, for example, whether this page is under io,
      or this page is just locked too long by other process context.
      
      Normally io request has multiple bios, and every bio contains multiple
      pages which will hold data to be read from or written to device, so here
      we record page info or bio info in task_struct while process calls
      lock_page(), lock_buffer(), wait_on_page_writeback(), wait_on_buffer()
      and wait_on_bit_io(), we add a new proce interface:
      [lege@localhost linux]$ cat /proc/4516/wait_res
      1 ffffd0969f95d3c0 4295369599 4295381596
      
      Above info means that thread 4516 is waitting on a page, address is
      ffffd0969f95d3c0, and has waited for 11997ms.
      
      First field denotes the page address process is waitting on.
      Second field denotes the wait moment and the third denotes current moment.
      
      In practice, if we found a process waitting on one page for too long time,
      we can get page's address by reading /proc/$pid/wait_page, and search this
      page address in all block devices' /sys/kernel/debug/block/${devname}/rq_hang,
      if search operation hits one, we can get the request and know why this io
      request hangs that long.
      Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      2864376f
  5. 27 12月, 2019 2 次提交
  6. 18 12月, 2019 1 次提交
    • M
      mm, thp, proc: report THP eligibility for each vma · c76adee3
      Michal Hocko 提交于
      [ Upstream commit 7635d9cbe8327e131a1d3d8517dc186c2796ce2e ]
      
      Userspace falls short when trying to find out whether a specific memory
      range is eligible for THP.  There are usecases that would like to know
      that
      http://lkml.kernel.org/r/alpine.DEB.2.21.1809251248450.50347@chino.kir.corp.google.com
      : This is used to identify heap mappings that should be able to fault thp
      : but do not, and they normally point to a low-on-memory or fragmentation
      : issue.
      
      The only way to deduce this now is to query for hg resp.  nh flags and
      confronting the state with the global setting.  Except that there is also
      PR_SET_THP_DISABLE that might change the picture.  So the final logic is
      not trivial.  Moreover the eligibility of the vma depends on the type of
      VMA as well.  In the past we have supported only anononymous memory VMAs
      but things have changed and shmem based vmas are supported as well these
      days and the query logic gets even more complicated because the
      eligibility depends on the mount option and another global configuration
      knob.
      
      Simplify the current state and report the THP eligibility in
      /proc/<pid>/smaps for each existing vma.  Reuse
      transparent_hugepage_enabled for this purpose.  The original
      implementation of this function assumes that the caller knows that the vma
      itself is supported for THP so make the core checks into
      __transparent_hugepage_enabled and use it for existing callers.
      __show_smap just use the new transparent_hugepage_enabled which also
      checks the vma support status (please note that this one has to be out of
      line due to include dependency issues).
      
      [mhocko@kernel.org: fix oops with NULL ->f_mapping]
        Link: http://lkml.kernel.org/r/20181224185106.GC16738@dhcp22.suse.cz
      Link: http://lkml.kernel.org/r/20181211143641.3503-3-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Paul Oppenheimer <bepvte@gmail.com>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      c76adee3
  7. 24 11月, 2019 1 次提交
  8. 29 10月, 2019 1 次提交
    • D
      fs/proc/page.c: don't access uninitialized memmaps in fs/proc/page.c · 6ea856ef
      David Hildenbrand 提交于
      commit aad5f69bc161af489dbb5934868bd347282f0764 upstream.
      
      There are three places where we access uninitialized memmaps, namely:
      - /proc/kpagecount
      - /proc/kpageflags
      - /proc/kpagecgroup
      
      We have initialized memmaps either when the section is online or when the
      page was initialized to the ZONE_DEVICE.  Uninitialized memmaps contain
      garbage and in the worst case trigger kernel BUGs, especially with
      CONFIG_PAGE_POISONING.
      
      For example, not onlining a DIMM during boot and calling /proc/kpagecount
      with CONFIG_PAGE_POISONING:
      
        :/# cat /proc/kpagecount > tmp.test
        BUG: unable to handle page fault for address: fffffffffffffffe
        #PF: supervisor read access in kernel mode
        #PF: error_code(0x0000) - not-present page
        PGD 114616067 P4D 114616067 PUD 114618067 PMD 0
        Oops: 0000 [#1] SMP NOPTI
        CPU: 0 PID: 469 Comm: cat Not tainted 5.4.0-rc1-next-20191004+ #11
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.4
        RIP: 0010:kpagecount_read+0xce/0x1e0
        Code: e8 09 83 e0 3f 48 0f a3 02 73 2d 4c 89 e7 48 c1 e7 06 48 03 3d ab 51 01 01 74 1d 48 8b 57 08 480
        RSP: 0018:ffffa14e409b7e78 EFLAGS: 00010202
        RAX: fffffffffffffffe RBX: 0000000000020000 RCX: 0000000000000000
        RDX: 0000000000000001 RSI: 00007f76b5595000 RDI: fffff35645000000
        RBP: 00007f76b5595000 R08: 0000000000000001 R09: 0000000000000000
        R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000140000
        R13: 0000000000020000 R14: 00007f76b5595000 R15: ffffa14e409b7f08
        FS:  00007f76b577d580(0000) GS:ffff8f41bd400000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: fffffffffffffffe CR3: 0000000078960000 CR4: 00000000000006f0
        Call Trace:
         proc_reg_read+0x3c/0x60
         vfs_read+0xc5/0x180
         ksys_read+0x68/0xe0
         do_syscall_64+0x5c/0xa0
         entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      For now, let's drop support for ZONE_DEVICE from the three pseudo files
      in order to fix this.  To distinguish offline memory (with garbage
      memmap) from ZONE_DEVICE memory with properly initialized memmaps, we
      would have to check get_dev_pagemap() and pfn_zone_device_reserved()
      right now.  The usage of both (especially, special casing devmem) is
      frowned upon and needs to be reworked.
      
      The fundamental issue we have is:
      
      	if (pfn_to_online_page(pfn)) {
      		/* memmap initialized */
      	} else if (pfn_valid(pfn)) {
      		/*
      		 * ???
      		 * a) offline memory. memmap garbage.
      		 * b) devmem: memmap initialized to ZONE_DEVICE.
      		 * c) devmem: reserved for driver. memmap garbage.
      		 * (d) devmem: memmap currently initializing - garbage)
      		 */
      	}
      
      We'll leave the pfn_zone_device_reserved() check in stable_page_flags()
      in place as that function is also used from memory failure.  We now no
      longer dump information about pages that are not in use anymore -
      offline.
      
      Link: http://lkml.kernel.org/r/20191009142435.3975-2-david@redhat.com
      Fixes: f1dd2cd1 ("mm, memory_hotplug: do not associate hotadded memory to zones until online")	[visible after d0dc12e8]
      Signed-off-by: NDavid Hildenbrand <david@redhat.com>
      Reported-by: NQian Cai <cai@lca.pw>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Toshiki Fukasawa <t-fukasawa@vx.jp.nec.com>
      Cc: Pankaj gupta <pagupta@redhat.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Anthony Yznaga <anthony.yznaga@oracle.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
      Cc: <stable@vger.kernel.org>	[4.13+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      6ea856ef
  9. 04 8月, 2019 2 次提交
    • L
      /proc/<pid>/cmdline: add back the setproctitle() special case · 54695343
      Linus Torvalds 提交于
      commit d26d0cd97c88eb1a5704b42e41ab443406807810 upstream.
      
      This makes the setproctitle() special case very explicit indeed, and
      handles it with a separate helper function entirely.  In the process, it
      re-instates the original semantics of simply stopping at the first NUL
      character when the original last NUL character is no longer there.
      
      [ The original semantics can still be seen in mm/util.c: get_cmdline()
        that is limited to a fixed-size buffer ]
      
      This makes the logic about when we use the string lengths etc much more
      obvious, and makes it easier to see what we do and what the two very
      different cases are.
      
      Note that even when we allow walking past the end of the argument array
      (because the setproctitle() might have overwritten and overflowed the
      original argv[] strings), we only allow it when it overflows into the
      environment region if it is immediately adjacent.
      
      [ Fixed for missing 'count' checks noted by Alexey Izbyshev ]
      
      Link: https://lore.kernel.org/lkml/alpine.LNX.2.21.1904052326230.3249@kich.toxcorp.com/
      Fixes: 5ab82718 ("fs/proc: simplify and clarify get_mm_cmdline() function")
      Cc: Jakub Jankowski <shasta@toxcorp.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Alexey Izbyshev <izbyshev@ispras.ru>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      54695343
    • L
      /proc/<pid>/cmdline: remove all the special cases · 54ffaa53
      Linus Torvalds 提交于
      commit 3d712546d8ba9f25cdf080d79f90482aa4231ed4 upstream.
      
      Start off with a clean slate that only reads exactly from arg_start to
      arg_end, without any oddities.  This simplifies the code and in the
      process removes the case that caused us to potentially leak an
      uninitialized byte from the temporary kernel buffer.
      
      Note that in order to start from scratch with an understandable base,
      this simplifies things _too_ much, and removes all the legacy logic to
      handle setproctitle() having changed the argument strings.
      
      We'll add back those special cases very differently in the next commit.
      
      Link: https://lore.kernel.org/lkml/20190712160913.17727-1-izbyshev@ispras.ru/
      Fixes: f5b65348 ("proc: fix missing final NUL in get_mm_cmdline() rewrite")
      Cc: Alexey Izbyshev <izbyshev@ispras.ru>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      54ffaa53
  10. 31 7月, 2019 5 次提交
  11. 26 7月, 2019 1 次提交
    • R
      fs/proc/proc_sysctl.c: fix the default values of i_uid/i_gid on /proc/sys inodes. · 2c7b50c7
      Radoslaw Burny 提交于
      commit 5ec27ec735ba0477d48c80561cc5e856f0c5dfaf upstream.
      
      Normally, the inode's i_uid/i_gid are translated relative to s_user_ns,
      but this is not a correct behavior for proc.  Since sysctl permission
      check in test_perm is done against GLOBAL_ROOT_[UG]ID, it makes more
      sense to use these values in u_[ug]id of proc inodes.  In other words:
      although uid/gid in the inode is not read during test_perm, the inode
      logically belongs to the root of the namespace.  I have confirmed this
      with Eric Biederman at LPC and in this thread:
        https://lore.kernel.org/lkml/87k1kzjdff.fsf@xmission.com
      
      Consequences
      ============
      
      Since the i_[ug]id values of proc nodes are not used for permissions
      checks, this change usually makes no functional difference.  However, it
      causes an issue in a setup where:
      
       * a namespace container is created without root user in container -
         hence the i_[ug]id of proc nodes are set to INVALID_[UG]ID
      
       * container creator tries to configure it by writing /proc/sys files,
         e.g. writing /proc/sys/kernel/shmmax to configure shared memory limit
      
      Kernel does not allow to open an inode for writing if its i_[ug]id are
      invalid, making it impossible to write shmmax and thus - configure the
      container.
      
      Using a container with no root mapping is apparently rare, but we do use
      this configuration at Google.  Also, we use a generic tool to configure
      the container limits, and the inability to write any of them causes a
      failure.
      
      History
      =======
      
      The invalid uids/gids in inodes first appeared due to 81754357 (fs:
      Update i_[ug]id_(read|write) to translate relative to s_user_ns).
      However, AFAIK, this did not immediately cause any issues.  The
      inability to write to these "invalid" inodes was only caused by a later
      commit 0bd23d09 (vfs: Don't modify inodes with a uid or gid unknown
      to the vfs).
      
      Tested: Used a repro program that creates a user namespace without any
      mapping and stat'ed /proc/$PID/root/proc/sys/kernel/shmmax from outside.
      Before the change, it shows the overflow uid, with the change it's 0.
      The overflow uid indicates that the uid in the inode is not correct and
      thus it is not possible to open the file for writing.
      
      Link: http://lkml.kernel.org/r/20190708115130.250149-1-rburny@google.com
      Fixes: 0bd23d09 ("vfs: Don't modify inodes with a uid or gid unknown to the vfs")
      Signed-off-by: NRadoslaw Burny <rburny@google.com>
      Acked-by: NLuis Chamberlain <mcgrof@kernel.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: "Eric W . Biederman" <ebiederm@xmission.com>
      Cc: Seth Forshee <seth.forshee@canonical.com>
      Cc: John Sperbeck <jsperbeck@google.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: <stable@vger.kernel.org>	[4.8+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      2c7b50c7
  12. 03 7月, 2019 1 次提交
  13. 26 5月, 2019 1 次提交
  14. 02 5月, 2019 1 次提交
    • Y
      fs/proc/proc_sysctl.c: Fix a NULL pointer dereference · 4d476a00
      YueHaibing 提交于
      commit 89189557b47b35683a27c80ee78aef18248eefb4 upstream.
      
      Syzkaller report this:
      
        sysctl could not get directory: /net//bridge -12
        kasan: CONFIG_KASAN_INLINE enabled
        kasan: GPF could be caused by NULL-ptr deref or user memory access
        general protection fault: 0000 [#1] SMP KASAN PTI
        CPU: 1 PID: 7027 Comm: syz-executor.0 Tainted: G         C        5.1.0-rc3+ #8
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
        RIP: 0010:__write_once_size include/linux/compiler.h:220 [inline]
        RIP: 0010:__rb_change_child include/linux/rbtree_augmented.h:144 [inline]
        RIP: 0010:__rb_erase_augmented include/linux/rbtree_augmented.h:186 [inline]
        RIP: 0010:rb_erase+0x5f4/0x19f0 lib/rbtree.c:459
        Code: 00 0f 85 60 13 00 00 48 89 1a 48 83 c4 18 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 89 f2 48 b8 00 00 00 00 00 fc ff df 48 c1 ea 03 <80> 3c 02 00 0f 85 75 0c 00 00 4d 85 ed 4c 89 2e 74 ce 4c 89 ea 48
        RSP: 0018:ffff8881bb507778 EFLAGS: 00010206
        RAX: dffffc0000000000 RBX: ffff8881f224b5b8 RCX: ffffffff818f3f6a
        RDX: 000000000000000a RSI: 0000000000000050 RDI: ffff8881f224b568
        RBP: 0000000000000000 R08: ffffed10376a0ef4 R09: ffffed10376a0ef4
        R10: 0000000000000001 R11: ffffed10376a0ef4 R12: ffff8881f224b558
        R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
        FS:  00007f3e7ce13700(0000) GS:ffff8881f7300000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007fd60fbe9398 CR3: 00000001cb55c001 CR4: 00000000007606e0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        PKRU: 55555554
        Call Trace:
         erase_entry fs/proc/proc_sysctl.c:178 [inline]
         erase_header+0xe3/0x160 fs/proc/proc_sysctl.c:207
         start_unregistering fs/proc/proc_sysctl.c:331 [inline]
         drop_sysctl_table+0x558/0x880 fs/proc/proc_sysctl.c:1631
         get_subdir fs/proc/proc_sysctl.c:1022 [inline]
         __register_sysctl_table+0xd65/0x1090 fs/proc/proc_sysctl.c:1335
         br_netfilter_init+0x68/0x1000 [br_netfilter]
         do_one_initcall+0xbc/0x47d init/main.c:901
         do_init_module+0x1b5/0x547 kernel/module.c:3456
         load_module+0x6405/0x8c10 kernel/module.c:3804
         __do_sys_finit_module+0x162/0x190 kernel/module.c:3898
         do_syscall_64+0x9f/0x450 arch/x86/entry/common.c:290
         entry_SYSCALL_64_after_hwframe+0x49/0xbe
        Modules linked in: br_netfilter(+) backlight comedi(C) hid_sensor_hub max3100 ti_ads8688 udc_core fddi snd_mona leds_gpio rc_streamzap mtd pata_netcell nf_log_common rc_winfast udp_tunnel snd_usbmidi_lib snd_usb_toneport snd_usb_line6 snd_rawmidi snd_seq_device snd_hwdep videobuf2_v4l2 videobuf2_common videodev media videobuf2_vmalloc videobuf2_memops rc_gadmei_rm008z 8250_of smm665 hid_tmff hid_saitek hwmon_vid rc_ati_tv_wonder_hd_600 rc_core pata_pdc202xx_old dn_rtmsg as3722 ad714x_i2c ad714x snd_soc_cs4265 hid_kensington panel_ilitek_ili9322 drm drm_panel_orientation_quirks ipack cdc_phonet usbcore phonet hid_jabra hid extcon_arizona can_dev industrialio_triggered_buffer kfifo_buf industrialio adm1031 i2c_mux_ltc4306 i2c_mux ipmi_msghandler mlxsw_core snd_soc_cs35l34 snd_soc_core snd_pcm_dmaengine snd_pcm snd_timer ac97_bus snd_compress snd soundcore gpio_da9055 uio ecdh_generic mdio_thunder of_mdio fixed_phy libphy mdio_cavium iptable_security iptable_raw iptable_mangle
         iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_filter bpfilter ip6_vti ip_vti ip_gre ipip sit tunnel4 ip_tunnel hsr veth netdevsim vxcan batman_adv cfg80211 rfkill chnl_net caif nlmon dummy team bonding vcan bridge stp llc ip6_gre gre ip6_tunnel tunnel6 tun joydev mousedev ppdev tpm kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel ide_pci_generic piix aes_x86_64 crypto_simd cryptd ide_core glue_helper input_leds psmouse intel_agp intel_gtt serio_raw ata_generic i2c_piix4 agpgart pata_acpi parport_pc parport floppy rtc_cmos sch_fq_codel ip_tables x_tables sha1_ssse3 sha1_generic ipv6 [last unloaded: br_netfilter]
        Dumping ftrace buffer:
           (ftrace buffer empty)
        ---[ end trace 68741688d5fbfe85 ]---
      
      commit 23da9588037e ("fs/proc/proc_sysctl.c: fix NULL pointer
      dereference in put_links") forgot to handle start_unregistering() case,
      while header->parent is NULL, it calls erase_header() and as seen in the
      above syzkaller call trace, accessing &header->parent->root will trigger
      a NULL pointer dereference.
      
      As that commit explained, there is also no need to call
      start_unregistering() if header->parent is NULL.
      
      Link: http://lkml.kernel.org/r/20190409153622.28112-1-yuehaibing@huawei.com
      Fixes: 23da9588037e ("fs/proc/proc_sysctl.c: fix NULL pointer dereference in put_links")
      Fixes: 0e47c99d ("sysctl: Replace root_list with links between sysctl_table_sets")
      Signed-off-by: NYueHaibing <yuehaibing@huawei.com>
      Reported-by: NHulk Robot <hulkci@huawei.com>
      Reviewed-by: NKees Cook <keescook@chromium.org>
      Cc: Luis Chamberlain <mcgrof@kernel.org>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      4d476a00
  15. 27 4月, 2019 1 次提交
    • A
      coredump: fix race condition between mmget_not_zero()/get_task_mm() and core dumping · 6ff17bc5
      Andrea Arcangeli 提交于
      commit 04f5866e41fb70690e28397487d8bd8eea7d712a upstream.
      
      The core dumping code has always run without holding the mmap_sem for
      writing, despite that is the only way to ensure that the entire vma
      layout will not change from under it.  Only using some signal
      serialization on the processes belonging to the mm is not nearly enough.
      This was pointed out earlier.  For example in Hugh's post from Jul 2017:
      
        https://lkml.kernel.org/r/alpine.LSU.2.11.1707191716030.2055@eggly.anvils
      
        "Not strictly relevant here, but a related note: I was very surprised
         to discover, only quite recently, how handle_mm_fault() may be called
         without down_read(mmap_sem) - when core dumping. That seems a
         misguided optimization to me, which would also be nice to correct"
      
      In particular because the growsdown and growsup can move the
      vm_start/vm_end the various loops the core dump does around the vma will
      not be consistent if page faults can happen concurrently.
      
      Pretty much all users calling mmget_not_zero()/get_task_mm() and then
      taking the mmap_sem had the potential to introduce unexpected side
      effects in the core dumping code.
      
      Adding mmap_sem for writing around the ->core_dump invocation is a
      viable long term fix, but it requires removing all copy user and page
      faults and to replace them with get_dump_page() for all binary formats
      which is not suitable as a short term fix.
      
      For the time being this solution manually covers the places that can
      confuse the core dump either by altering the vma layout or the vma flags
      while it runs.  Once ->core_dump runs under mmap_sem for writing the
      function mmget_still_valid() can be dropped.
      
      Allowing mmap_sem protected sections to run in parallel with the
      coredump provides some minor parallelism advantage to the swapoff code
      (which seems to be safe enough by never mangling any vma field and can
      keep doing swapins in parallel to the core dumping) and to some other
      corner case.
      
      In order to facilitate the backporting I added "Fixes: 86039bd3"
      however the side effect of this same race condition in /proc/pid/mem
      should be reproducible since before 2.6.12-rc2 so I couldn't add any
      other "Fixes:" because there's no hash beyond the git genesis commit.
      
      Because find_extend_vma() is the only location outside of the process
      context that could modify the "mm" structures under mmap_sem for
      reading, by adding the mmget_still_valid() check to it, all other cases
      that take the mmap_sem for reading don't need the new check after
      mmget_not_zero()/get_task_mm().  The expand_stack() in page fault
      context also doesn't need the new check, because all tasks under core
      dumping are frozen.
      
      Link: http://lkml.kernel.org/r/20190325224949.11068-1-aarcange@redhat.com
      Fixes: 86039bd3 ("userfaultfd: add new syscall to provide memory externalization")
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Reported-by: NJann Horn <jannh@google.com>
      Suggested-by: NOleg Nesterov <oleg@redhat.com>
      Acked-by: NPeter Xu <peterx@redhat.com>
      Reviewed-by: NMike Rapoport <rppt@linux.ibm.com>
      Reviewed-by: NOleg Nesterov <oleg@redhat.com>
      Reviewed-by: NJann Horn <jannh@google.com>
      Acked-by: NJason Gunthorpe <jgg@mellanox.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      6ff17bc5
  16. 20 4月, 2019 1 次提交
    • K
      x86/gart: Exclude GART aperture from kcore · 83e3e89d
      Kairui Song 提交于
      [ Upstream commit ffc8599aa9763f39f6736a79da4d1575e7006f9a ]
      
      On machines where the GART aperture is mapped over physical RAM,
      /proc/kcore contains the GART aperture range. Accessing the GART range via
      /proc/kcore results in a kernel crash.
      
      vmcore used to have the same issue, until it was fixed with commit
      2a3e83c6 ("x86/gart: Exclude GART aperture from vmcore")', leveraging
      existing hook infrastructure in vmcore to let /proc/vmcore return zeroes
      when attempting to read the aperture region, and so it won't read from the
      actual memory.
      
      Apply the same workaround for kcore. First implement the same hook
      infrastructure for kcore, then reuse the hook functions introduced in the
      previous vmcore fix. Just with some minor adjustment, rename some functions
      for more general usage, and simplify the hook infrastructure a bit as there
      is no module usage yet.
      Suggested-by: NBaoquan He <bhe@redhat.com>
      Signed-off-by: NKairui Song <kasong@redhat.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: NJiri Bohac <jbohac@suse.cz>
      Acked-by: NBaoquan He <bhe@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Omar Sandoval <osandov@fb.com>
      Cc: Dave Young <dyoung@redhat.com>
      Link: https://lkml.kernel.org/r/20190308030508.13548-1-kasong@redhat.comSigned-off-by: NSasha Levin <sashal@kernel.org>
      83e3e89d
  17. 03 4月, 2019 1 次提交
    • Y
      fs/proc/proc_sysctl.c: fix NULL pointer dereference in put_links · 07d0d2bd
      YueHaibing 提交于
      commit 23da9588037ecdd4901db76a5b79a42b529c4ec3 upstream.
      
      Syzkaller reports:
      
      kasan: GPF could be caused by NULL-ptr deref or user memory access
      general protection fault: 0000 [#1] SMP KASAN PTI
      CPU: 1 PID: 5373 Comm: syz-executor.0 Not tainted 5.0.0-rc8+ #3
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
      RIP: 0010:put_links+0x101/0x440 fs/proc/proc_sysctl.c:1599
      Code: 00 0f 85 3a 03 00 00 48 8b 43 38 48 89 44 24 20 48 83 c0 38 48 89 c2 48 89 44 24 28 48 b8 00 00 00 00 00 fc ff df 48 c1 ea 03 <80> 3c 02 00 0f 85 fe 02 00 00 48 8b 74 24 20 48 c7 c7 60 2a 9d 91
      RSP: 0018:ffff8881d828f238 EFLAGS: 00010202
      RAX: dffffc0000000000 RBX: ffff8881e01b1140 RCX: ffffffff8ee98267
      RDX: 0000000000000007 RSI: ffffc90001479000 RDI: ffff8881e01b1178
      RBP: dffffc0000000000 R08: ffffed103ee27259 R09: ffffed103ee27259
      R10: 0000000000000001 R11: ffffed103ee27258 R12: fffffffffffffff4
      R13: 0000000000000006 R14: ffff8881f59838c0 R15: dffffc0000000000
      FS:  00007f072254f700(0000) GS:ffff8881f7100000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007fff8b286668 CR3: 00000001f0542002 CR4: 00000000007606e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      PKRU: 55555554
      Call Trace:
       drop_sysctl_table+0x152/0x9f0 fs/proc/proc_sysctl.c:1629
       get_subdir fs/proc/proc_sysctl.c:1022 [inline]
       __register_sysctl_table+0xd65/0x1090 fs/proc/proc_sysctl.c:1335
       br_netfilter_init+0xbc/0x1000 [br_netfilter]
       do_one_initcall+0xfa/0x5ca init/main.c:887
       do_init_module+0x204/0x5f6 kernel/module.c:3460
       load_module+0x66b2/0x8570 kernel/module.c:3808
       __do_sys_finit_module+0x238/0x2a0 kernel/module.c:3902
       do_syscall_64+0x147/0x600 arch/x86/entry/common.c:290
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      RIP: 0033:0x462e99
      Code: f7 d8 64 89 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 bc ff ff ff f7 d8 64 89 01 48
      RSP: 002b:00007f072254ec58 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
      RAX: ffffffffffffffda RBX: 000000000073bf00 RCX: 0000000000462e99
      RDX: 0000000000000000 RSI: 0000000020000280 RDI: 0000000000000003
      RBP: 00007f072254ec70 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000246 R12: 00007f072254f6bc
      R13: 00000000004bcefa R14: 00000000006f6fb0 R15: 0000000000000004
      Modules linked in: br_netfilter(+) dvb_usb_dibusb_mc_common dib3000mc dibx000_common dvb_usb_dibusb_common dvb_usb_dw2102 dvb_usb classmate_laptop palmas_regulator cn videobuf2_v4l2 v4l2_common snd_soc_bd28623 mptbase snd_usb_usx2y snd_usbmidi_lib snd_rawmidi wmi libnvdimm lockd sunrpc grace rc_kworld_pc150u rc_core rtc_da9063 sha1_ssse3 i2c_cros_ec_tunnel adxl34x_spi adxl34x nfnetlink lib80211 i5500_temp dvb_as102 dvb_core videobuf2_common videodev media videobuf2_vmalloc videobuf2_memops udc_core lnbp22 leds_lp3952 hid_roccat_ryos s1d13xxxfb mtd vport_geneve openvswitch nf_conncount nf_nat_ipv6 nsh geneve udp_tunnel ip6_udp_tunnel snd_soc_mt6351 sis_agp phylink snd_soc_adau1761_spi snd_soc_adau1761 snd_soc_adau17x1 snd_soc_core snd_pcm_dmaengine ac97_bus snd_compress snd_soc_adau_utils snd_soc_sigmadsp_regmap snd_soc_sigmadsp raid_class hid_roccat_konepure hid_roccat_common hid_roccat c2port_duramar2150 core mdio_bcm_unimac iptable_security iptable_raw iptable_mangle
       iptable_nat nf_nat_ipv4 nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_filter bpfilter ip6_vti ip_vti ip_gre ipip sit tunnel4 ip_tunnel hsr veth netdevsim devlink vxcan batman_adv cfg80211 rfkill chnl_net caif nlmon dummy team bonding vcan bridge stp llc ip6_gre gre ip6_tunnel tunnel6 tun crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel joydev mousedev ide_pci_generic piix aesni_intel aes_x86_64 ide_core crypto_simd atkbd cryptd glue_helper serio_raw ata_generic pata_acpi i2c_piix4 floppy sch_fq_codel ip_tables x_tables ipv6 [last unloaded: lm73]
      Dumping ftrace buffer:
         (ftrace buffer empty)
      ---[ end trace 770020de38961fd0 ]---
      
      A new dir entry can be created in get_subdir and its 'header->parent' is
      set to NULL.  Only after insert_header success, it will be set to 'dir',
      otherwise 'header->parent' is set to NULL and drop_sysctl_table is called.
      However in err handling path of get_subdir, drop_sysctl_table also be
      called on 'new->header' regardless its value of parent pointer.  Then
      put_links is called, which triggers NULL-ptr deref when access member of
      header->parent.
      
      In fact we have multiple error paths which call drop_sysctl_table() there,
      upon failure on insert_links() we also call drop_sysctl_table().And even
      in the successful case on __register_sysctl_table() we still always call
      drop_sysctl_table().This patch fix it.
      
      Link: http://lkml.kernel.org/r/20190314085527.13244-1-yuehaibing@huawei.com
      Fixes: 0e47c99d ("sysctl: Replace root_list with links between sysctl_table_sets")
      Signed-off-by: NYueHaibing <yuehaibing@huawei.com>
      Reported-by: NHulk Robot <hulkci@huawei.com>
      Acked-by: NLuis Chamberlain <mcgrof@kernel.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: <stable@vger.kernel.org>    [3.4+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      07d0d2bd
  18. 14 3月, 2019 1 次提交
  19. 27 2月, 2019 1 次提交
  20. 20 2月, 2019 1 次提交
  21. 29 12月, 2018 1 次提交
  22. 14 11月, 2018 1 次提交
    • V
      mm: /proc/pid/smaps_rollup: fix NULL pointer deref in smaps_pte_range() · 30391e41
      Vlastimil Babka 提交于
      commit fa76da461bb0be13c8339d984dcf179151167c8f upstream.
      
      Leonardo reports an apparent regression in 4.19-rc7:
      
       BUG: unable to handle kernel NULL pointer dereference at 00000000000000f0
       PGD 0 P4D 0
       Oops: 0000 [#1] PREEMPT SMP PTI
       CPU: 3 PID: 6032 Comm: python Not tainted 4.19.0-041900rc7-lowlatency #201810071631
       Hardware name: LENOVO 80UG/Toronto 4A2, BIOS 0XCN45WW 08/09/2018
       RIP: 0010:smaps_pte_range+0x32d/0x540
       Code: 80 00 00 00 00 74 a9 48 89 de 41 f6 40 52 40 0f 85 04 02 00 00 49 2b 30 48 c1 ee 0c 49 03 b0 98 00 00 00 49 8b 80 a0 00 00 00 <48> 8b b8 f0 00 00 00 e8 b7 ef ec ff 48 85 c0 0f 84 71 ff ff ff a8
       RSP: 0018:ffffb0cbc484fb88 EFLAGS: 00010202
       RAX: 0000000000000000 RBX: 0000560ddb9e9000 RCX: 0000000000000000
       RDX: 0000000000000000 RSI: 0000000560ddb9e9 RDI: 0000000000000001
       RBP: ffffb0cbc484fbc0 R08: ffff94a5a227a578 R09: ffff94a5a227a578
       R10: 0000000000000000 R11: 0000560ddbbe7000 R12: ffffe903098ba728
       R13: ffffb0cbc484fc78 R14: ffffb0cbc484fcf8 R15: ffff94a5a2e9cf48
       FS:  00007f6dfb683740(0000) GS:ffff94a5aaf80000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 00000000000000f0 CR3: 000000011c118001 CR4: 00000000003606e0
       DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       Call Trace:
        __walk_page_range+0x3c2/0x6f0
        walk_page_vma+0x42/0x60
        smap_gather_stats+0x79/0xe0
        ? gather_pte_stats+0x320/0x320
        ? gather_hugetlb_stats+0x70/0x70
        show_smaps_rollup+0xcd/0x1c0
        seq_read+0x157/0x400
        __vfs_read+0x3a/0x180
        ? security_file_permission+0x93/0xc0
        ? security_file_permission+0x93/0xc0
        vfs_read+0x8f/0x140
        ksys_read+0x55/0xc0
        __x64_sys_read+0x1a/0x20
        do_syscall_64+0x5a/0x110
        entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Decoded code matched to local compilation+disassembly points to
      smaps_pte_entry():
      
              } else if (unlikely(IS_ENABLED(CONFIG_SHMEM) && mss->check_shmem_swap
                                                              && pte_none(*pte))) {
                      page = find_get_entry(vma->vm_file->f_mapping,
                                                      linear_page_index(vma, addr));
      
      Here, vma->vm_file is NULL.  mss->check_shmem_swap should be false in that
      case, however for smaps_rollup, smap_gather_stats() can set the flag true
      for one vma and leave it true for subsequent vma's where it should be
      false.
      
      To fix, reset the check_shmem_swap flag to false.  There's also related
      bug which sets mss->swap to shmem_swapped, which in the context of
      smaps_rollup overwrites any value accumulated from previous vma's.  Fix
      that as well.
      
      Note that the report suggests a regression between 4.17.19 and 4.19-rc7,
      which makes the 4.19 series ending with commit 258f669e ("mm:
      /proc/pid/smaps_rollup: convert to single value seq_file") suspicious.
      But the mss was reused for rollup since 493b0e9d ("mm: add
      /proc/pid/smaps_rollup") so let's play it safe with the stable backport.
      
      Link: http://lkml.kernel.org/r/555fbd1f-4ac9-0b58-dcd4-5dc4380ff7ca@suse.cz
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=201377
      Fixes: 493b0e9d ("mm: add /proc/pid/smaps_rollup")
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Reported-by: NLeonardo Soares Müller <leozinho29_eu@hotmail.com>
      Tested-by: NLeonardo Soares Müller <leozinho29_eu@hotmail.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Daniel Colascione <dancol@google.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      30391e41
  23. 06 10月, 2018 1 次提交
    • J
      proc: restrict kernel stack dumps to root · f8a00cef
      Jann Horn 提交于
      Currently, you can use /proc/self/task/*/stack to cause a stack walk on
      a task you control while it is running on another CPU.  That means that
      the stack can change under the stack walker.  The stack walker does
      have guards against going completely off the rails and into random
      kernel memory, but it can interpret random data from your kernel stack
      as instruction pointers and stack pointers.  This can cause exposure of
      kernel stack contents to userspace.
      
      Restrict the ability to inspect kernel stacks of arbitrary tasks to root
      in order to prevent a local attacker from exploiting racy stack unwinding
      to leak kernel task stack contents.  See the added comment for a longer
      rationale.
      
      There don't seem to be any users of this userspace API that can't
      gracefully bail out if reading from the file fails.  Therefore, I believe
      that this change is unlikely to break things.  In the case that this patch
      does end up needing a revert, the next-best solution might be to fake a
      single-entry stack based on wchan.
      
      Link: http://lkml.kernel.org/r/20180927153316.200286-1-jannh@google.com
      Fixes: 2ec220e2 ("proc: add /proc/*/stack")
      Signed-off-by: NJann Horn <jannh@google.com>
      Acked-by: NKees Cook <keescook@chromium.org>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Ken Chen <kenchen@google.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Laura Abbott <labbott@redhat.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H . Peter Anvin" <hpa@zytor.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      f8a00cef
  24. 21 9月, 2018 1 次提交
  25. 24 8月, 2018 1 次提交
  26. 23 8月, 2018 8 次提交