1. 27 Dec 2019: 31 commits
    • ftrace/x86: Remove possible deadlock between register_kprobe() and ftrace_run_update_code() · 97fa53b7
      Committed by Petr Mladek
      commit d5b844a2cf507fc7642c9ae80a9d585db3065c28 upstream.
      
      The commit 9f255b632bf12c4dd7 ("module: Fix livepatch/ftrace module text
      permissions race") causes a possible deadlock between register_kprobe()
      and ftrace_run_update_code() when ftrace is using stop_machine().
      
      The existing dependency chain (in reverse order) is:
      
      -> #1 (text_mutex){+.+.}:
             validate_chain.isra.21+0xb32/0xd70
             __lock_acquire+0x4b8/0x928
             lock_acquire+0x102/0x230
             __mutex_lock+0x88/0x908
             mutex_lock_nested+0x32/0x40
             register_kprobe+0x254/0x658
             init_kprobes+0x11a/0x168
             do_one_initcall+0x70/0x318
             kernel_init_freeable+0x456/0x508
             kernel_init+0x22/0x150
             ret_from_fork+0x30/0x34
             kernel_thread_starter+0x0/0xc
      
      -> #0 (cpu_hotplug_lock.rw_sem){++++}:
             check_prev_add+0x90c/0xde0
             validate_chain.isra.21+0xb32/0xd70
             __lock_acquire+0x4b8/0x928
             lock_acquire+0x102/0x230
             cpus_read_lock+0x62/0xd0
             stop_machine+0x2e/0x60
             arch_ftrace_update_code+0x2e/0x40
             ftrace_run_update_code+0x40/0xa0
             ftrace_startup+0xb2/0x168
             register_ftrace_function+0x64/0x88
             klp_patch_object+0x1a2/0x290
             klp_enable_patch+0x554/0x980
             do_one_initcall+0x70/0x318
             do_init_module+0x6e/0x250
             load_module+0x1782/0x1990
             __s390x_sys_finit_module+0xaa/0xf0
             system_call+0xd8/0x2d0
      
       Possible unsafe locking scenario:
      
             CPU0                    CPU1
             ----                    ----
        lock(text_mutex);
                                     lock(cpu_hotplug_lock.rw_sem);
                                     lock(text_mutex);
        lock(cpu_hotplug_lock.rw_sem);
      
      It is a similar problem to the one solved by commit 2d1e38f5
      ("kprobes: Cure hotplug lock ordering issues"). Many locks are involved.
      To be on the safe side, text_mutex must become a low-level lock taken
      after cpu_hotplug_lock.rw_sem.
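The required ordering (cpu_hotplug_lock.rw_sem strictly before text_mutex) can be sketched with a toy, lockdep-style rank checker in userspace C. This is an illustration only: the ranks, array sizes, and helper names are made up and this is not kernel lockdep.

```c
#include <stdbool.h>

/* Toy lock-order validator: each lock gets a rank; acquiring a lock whose
 * rank is not strictly greater than every lock already held is an ordering
 * violation (the ABBA pattern in the report above). */
enum { RANK_CPU_HOTPLUG = 0, RANK_TEXT_MUTEX = 1 };

static int held[8];
static int nheld;

static bool take(int rank)
{
    for (int i = 0; i < nheld; i++)
        if (held[i] >= rank)   /* would invert the documented order */
            return false;
    held[nheld++] = rank;
    return true;
}

static void release_one(void)
{
    if (nheld > 0)
        nheld--;
}

/* Post-fix path: cpu_hotplug_lock first, text_mutex second. */
static bool good_path(void)
{
    bool ok = take(RANK_CPU_HOTPLUG) && take(RANK_TEXT_MUTEX);
    release_one();
    release_one();
    return ok;
}

/* Pre-fix register_kprobe() path: text_mutex taken first. */
static bool bad_path(void)
{
    bool ok = take(RANK_TEXT_MUTEX) && take(RANK_CPU_HOTPLUG);
    release_one();
    release_one();
    return ok;
}
```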
      
      This can't be achieved easily with the current ftrace design.
      For example, arm calls set_all_modules_text_rw() already in
      ftrace_arch_code_modify_prepare(), see arch/arm/kernel/ftrace.c.
      This function is called:
      
        + outside stop_machine() from ftrace_run_update_code()
        + without stop_machine() from ftrace_module_enable()
      
      Fortunately, the problematic fix is needed only on x86_64. It is
      the only architecture that calls set_all_modules_text_rw()
      in ftrace path and supports livepatching at the same time.
      
      Therefore it is enough to move text_mutex handling from the generic
      kernel/trace/ftrace.c into arch/x86/kernel/ftrace.c:
      
         ftrace_arch_code_modify_prepare()
         ftrace_arch_code_modify_post_process()
      
      This patch basically reverts the ftrace part of the problematic
      commit 9f255b632bf12c4dd7 ("module: Fix livepatch/ftrace module
      text permissions race") and provides an x86_64-specific fix.
      
      Some refactoring of the ftrace code will be needed when livepatching
      is implemented for arm or nds32. These architectures call
      set_all_modules_text_rw() and use stop_machine() at the same time.
      
      Link: http://lkml.kernel.org/r/20190627081334.12793-1-pmladek@suse.com
      
      Fixes: 9f255b632bf12c4dd7 ("module: Fix livepatch/ftrace module text permissions race")
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Reported-by: Miroslav Benes <mbenes@suse.cz>
      Reviewed-by: Miroslav Benes <mbenes@suse.cz>
      Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Signed-off-by: Petr Mladek <pmladek@suse.com>
      [
        As reviewed by Miroslav Benes <mbenes@suse.cz>, removed return value of
        ftrace_run_update_code() as it is a void function.
      ]
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • tracing/snapshot: Resize spare buffer if size changed · f927e496
      Committed by Eiichi Tsukata
      commit 46cc0b44428d0f0e81f11ea98217fc0edfbeab07 upstream.
      
      The current snapshot implementation swaps the two ring buffers even
      when their sizes differ, which can cause an inconsistency between the
      contents of the buffer_size_kb file and the actual buffer size.
      
      For example:
      
        # cat buffer_size_kb
        7 (expanded: 1408)
        # echo 1 > events/enable
        # grep bytes per_cpu/cpu0/stats
        bytes: 1441020
        # echo 1 > snapshot             // current:1408, spare:1408
        # echo 123 > buffer_size_kb     // current:123,  spare:1408
        # echo 1 > snapshot             // current:1408, spare:123
        # grep bytes per_cpu/cpu0/stats
        bytes: 1443700
        # cat buffer_size_kb
        123                             // != current:1408
      
      And also, a similar per-cpu case hits the following WARNING:
      
      Reproducer:
      
        # echo 1 > per_cpu/cpu0/snapshot
        # echo 123 > buffer_size_kb
        # echo 1 > per_cpu/cpu0/snapshot
      
      WARNING:
      
        WARNING: CPU: 0 PID: 1946 at kernel/trace/trace.c:1607 update_max_tr_single.part.0+0x2b8/0x380
        Modules linked in:
        CPU: 0 PID: 1946 Comm: bash Not tainted 5.2.0-rc6 #20
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-2.fc30 04/01/2014
        RIP: 0010:update_max_tr_single.part.0+0x2b8/0x380
        Code: ff e8 dc da f9 ff 0f 0b e9 88 fe ff ff e8 d0 da f9 ff 44 89 ee bf f5 ff ff ff e8 33 dc f9 ff 41 83 fd f5 74 96 e8 b8 da f9 ff <0f> 0b eb 8d e8 af da f9 ff 0f 0b e9 bf fd ff ff e8 a3 da f9 ff 48
        RSP: 0018:ffff888063e4fca0 EFLAGS: 00010093
        RAX: ffff888066214380 RBX: ffffffff99850fe0 RCX: ffffffff964298a8
        RDX: 0000000000000000 RSI: 00000000fffffff5 RDI: 0000000000000005
        RBP: 1ffff1100c7c9f96 R08: ffff888066214380 R09: ffffed100c7c9f9b
        R10: ffffed100c7c9f9a R11: 0000000000000003 R12: 0000000000000000
        R13: 00000000ffffffea R14: ffff888066214380 R15: ffffffff99851060
        FS:  00007f9f8173c700(0000) GS:ffff88806d000000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000000000714dc0 CR3: 0000000066fa6000 CR4: 00000000000006f0
        Call Trace:
         ? trace_array_printk_buf+0x140/0x140
         ? __mutex_lock_slowpath+0x10/0x10
         tracing_snapshot_write+0x4c8/0x7f0
         ? trace_printk_init_buffers+0x60/0x60
         ? selinux_file_permission+0x3b/0x540
         ? tracer_preempt_off+0x38/0x506
         ? trace_printk_init_buffers+0x60/0x60
         __vfs_write+0x81/0x100
         vfs_write+0x1e1/0x560
         ksys_write+0x126/0x250
         ? __ia32_sys_read+0xb0/0xb0
         ? do_syscall_64+0x1f/0x390
         do_syscall_64+0xc1/0x390
         entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      This patch adds resize_buffer_duplicate_size() to check if there is a
      difference between current/spare buffer sizes and resize a spare buffer
      if necessary.
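The shape of the fix can be sketched in userspace C: resize the spare to match the current buffer before swapping, so the sizes shown in buffer_size_kb stay consistent. `toy_buffer`, `toy_resize`, and `snapshot_swap` below are illustrative stand-ins, not the kernel ring-buffer API.

```c
#include <stdlib.h>

/* Toy stand-ins for the current and spare ring buffers. */
struct toy_buffer {
    size_t size;
    char *data;
};

static int toy_resize(struct toy_buffer *b, size_t size)
{
    char *p = realloc(b->data, size);
    if (!p && size)
        return -1;
    b->data = p;
    b->size = size;
    return 0;
}

/* Mirror of the fix: before swapping, bring the spare to the current
 * buffer's size so both agree after the swap. */
static int snapshot_swap(struct toy_buffer *cur, struct toy_buffer *spare)
{
    if (spare->size != cur->size && toy_resize(spare, cur->size) < 0)
        return -1;
    struct toy_buffer tmp = *cur;
    *cur = *spare;
    *spare = tmp;
    return 0;
}
```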
      
      Link: http://lkml.kernel.org/r/20190625012910.13109-1-devel@etsukata.com
      
      Cc: stable@vger.kernel.org
      Fixes: ad909e21 ("tracing: Add internal tracing_snapshot() functions")
      Signed-off-by: Eiichi Tsukata <devel@etsukata.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • module: Fix livepatch/ftrace module text permissions race · aa4d90fc
      Committed by Josh Poimboeuf
      [ Upstream commit 9f255b632bf12c4dd7fc31caee89aa991ef75176 ]
      
      It's possible for livepatch and ftrace to be toggling a module's text
      permissions at the same time, resulting in the following panic:
      
        BUG: unable to handle page fault for address: ffffffffc005b1d9
        #PF: supervisor write access in kernel mode
        #PF: error_code(0x0003) - permissions violation
        PGD 3ea0c067 P4D 3ea0c067 PUD 3ea0e067 PMD 3cc13067 PTE 3b8a1061
        Oops: 0003 [#1] PREEMPT SMP PTI
        CPU: 1 PID: 453 Comm: insmod Tainted: G           O  K   5.2.0-rc1-a188339ca5 #1
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-20181126_142135-anatol 04/01/2014
        RIP: 0010:apply_relocate_add+0xbe/0x14c
        Code: fa 0b 74 21 48 83 fa 18 74 38 48 83 fa 0a 75 40 eb 08 48 83 38 00 74 33 eb 53 83 38 00 75 4e 89 08 89 c8 eb 0a 83 38 00 75 43 <89> 08 48 63 c1 48 39 c8 74 2e eb 48 83 38 00 75 32 48 29 c1 89 08
        RSP: 0018:ffffb223c00dbb10 EFLAGS: 00010246
        RAX: ffffffffc005b1d9 RBX: 0000000000000000 RCX: ffffffff8b200060
        RDX: 000000000000000b RSI: 0000004b0000000b RDI: ffff96bdfcd33000
        RBP: ffffb223c00dbb38 R08: ffffffffc005d040 R09: ffffffffc005c1f0
        R10: ffff96bdfcd33c40 R11: ffff96bdfcd33b80 R12: 0000000000000018
        R13: ffffffffc005c1f0 R14: ffffffffc005e708 R15: ffffffff8b2fbc74
        FS:  00007f5f447beba8(0000) GS:ffff96bdff900000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: ffffffffc005b1d9 CR3: 000000003cedc002 CR4: 0000000000360ea0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        Call Trace:
         klp_init_object_loaded+0x10f/0x219
         ? preempt_latency_start+0x21/0x57
         klp_enable_patch+0x662/0x809
         ? virt_to_head_page+0x3a/0x3c
         ? kfree+0x8c/0x126
         patch_init+0x2ed/0x1000 [livepatch_test02]
         ? 0xffffffffc0060000
         do_one_initcall+0x9f/0x1c5
         ? kmem_cache_alloc_trace+0xc4/0xd4
         ? do_init_module+0x27/0x210
         do_init_module+0x5f/0x210
         load_module+0x1c41/0x2290
         ? fsnotify_path+0x3b/0x42
         ? strstarts+0x2b/0x2b
         ? kernel_read+0x58/0x65
         __do_sys_finit_module+0x9f/0xc3
         ? __do_sys_finit_module+0x9f/0xc3
         __x64_sys_finit_module+0x1a/0x1c
         do_syscall_64+0x52/0x61
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      The above panic occurs when loading two modules at the same time with
      ftrace enabled, where at least one of the modules is a livepatch module:
      
      CPU0                                 CPU1
      klp_enable_patch()
        klp_init_object_loaded()
          module_disable_ro()
                                           ftrace_module_enable()
                                             ftrace_arch_code_modify_post_process()
                                               set_all_modules_text_ro()
            klp_write_object_relocations()
              apply_relocate_add()
                *patches read-only code* - BOOM
      
      A similar race exists when toggling ftrace while loading a livepatch
      module.
      
      Fix it by ensuring that the livepatch and ftrace code patching
      operations -- and their respective permissions changes -- are protected
      by the text_mutex.
      
      Link: http://lkml.kernel.org/r/ab43d56ab909469ac5d2520c5d944ad6d4abd476.1560474114.git.jpoimboe@redhat.com
      Reported-by: Johannes Erdfelt <johannes@erdfelt.com>
      Fixes: 444d13ff ("modules: add ro_after_init support")
      Acked-by: Jessica Yu <jeyu@kernel.org>
      Reviewed-by: Petr Mladek <pmladek@suse.com>
      Reviewed-by: Miroslav Benes <mbenes@suse.cz>
      Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • tracing: avoid build warning with HAVE_NOP_MCOUNT · 0bb6e415
      Committed by Vasily Gorbik
      [ Upstream commit cbdaeaf050b730ea02e9ab4ff844ce54d85dbe1d ]
      
      Selecting HAVE_NOP_MCOUNT enables -mnop-mcount (if gcc supports it)
      and sets CC_USING_NOP_MCOUNT. Reuse __is_defined (which is suitable for
      testing CC_USING_* defines) to avoid conditional compilation and fix
      the following gcc 9 warning on s390:
      
      kernel/trace/ftrace.c:2514:1: warning: ‘ftrace_code_disable’ defined
      but not used [-Wunused-function]
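The `__is_defined()` machinery being reused here lives in include/linux/kconfig.h; a userspace copy shows how it turns a `CC_USING_*` define into a constant 1 or 0 without any `#ifdef`, which is what lets the code compile unconditionally. The `CC_USING_HYPOTHETICAL` name below is made up for the demonstration.

```c
/* Userspace copy of the kernel's __is_defined() chain from
 * include/linux/kconfig.h: expands to 1 if the argument is a macro
 * #defined to 1, and to 0 otherwise. */
#define __ARG_PLACEHOLDER_1 0,
#define __take_second_arg(__ignored, val, ...) val
#define ____is_defined(arg1_or_junk) __take_second_arg(arg1_or_junk 1, 0)
#define ___is_defined(val) ____is_defined(__ARG_PLACEHOLDER_##val)
#define __is_defined(x) ___is_defined(x)

#define CC_USING_NOP_MCOUNT 1
/* CC_USING_HYPOTHETICAL is deliberately left undefined. */

static const int nop_mcount = __is_defined(CC_USING_NOP_MCOUNT);
```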
      
      Link: http://lkml.kernel.org/r/patch.git-1a82d13f33ac.your-ad-here.call-01559732716-ext-6629@work.hours
      
      Fixes: 2f4df001 ("tracing: Add -mcount-nop option support")
      Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • bpf: fix nested bpf tracepoints with per-cpu data · 1cd29ed5
      Committed by Matt Mullins
      commit 9594dc3c7e71b9f52bee1d7852eb3d4e3aea9e99 upstream.
      
      BPF_PROG_TYPE_RAW_TRACEPOINTs can be executed nested on the same CPU, as
      they do not increment bpf_prog_active while executing.
      
      This enables three levels of nesting, to support
        - a kprobe or raw tp or perf event,
        - another one of the above that irq context happens to call, and
        - another one in nmi context
      (at most one of which may be a kprobe or perf event).
      
      Fixes: 20b9d7ac ("bpf: avoid excessive stack usage for perf_sample_data")
      Signed-off-by: Matt Mullins <mmullins@fb.com>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • Revert "x86/uaccess, ftrace: Fix ftrace_likely_update() vs. SMAP" · 1d6d11fc
      Committed by Sasha Levin
      This reverts commit 1a3188d737ceb922166d8fe78a5fc4f89907e31b, which was
      upstream commit 4a6c91fbdef846ec7250b82f2eeeb87ac5f18cf9.
      
      On Tue, Jun 25, 2019 at 09:39:45AM +0200, Sebastian Andrzej Siewior wrote:
      >Please backport commit e74deb11931ff682b59d5b9d387f7115f689698e to
      >stable _or_ revert the backport of commit 4a6c91fbdef84 ("x86/uaccess,
      >ftrace: Fix ftrace_likely_update() vs. SMAP"). It uses
      >user_access_{save|restore}() which has been introduced in the following
      >commit.
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • tracing: Silence GCC 9 array bounds warning · a66fb5c9
      Committed by Miguel Ojeda
      commit 0c97bf863efce63d6ab7971dad811601e6171d2f upstream.
      
      Starting with GCC 9, -Warray-bounds detects cases when memset is called
      starting on a member of a struct but the size to be cleared ends up
      writing over further members.
      
      Such a call happens in the trace code to clear, at once, all members
      after and including `seq` on struct trace_iterator:
      
          In function 'memset',
              inlined from 'ftrace_dump' at kernel/trace/trace.c:8914:3:
          ./include/linux/string.h:344:9: warning: '__builtin_memset' offset
          [8505, 8560] from the object at 'iter' is out of the bounds of
          referenced subobject 'seq' with type 'struct trace_seq' at offset
          4368 [-Warray-bounds]
            344 |  return __builtin_memset(p, c, size);
                |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
      
      In order to avoid GCC complaining about it, we compute the address
      ourselves by adding the offsetof distance instead of referring
      directly to the member.
      
      Since there are two places doing this clear (trace.c and trace_kdb.c),
      take the chance to move the workaround into a single place in
      the internal header.
      
      Link: http://lkml.kernel.org/r/20190523124535.GA12931@gmail.com
      Signed-off-by: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
      [ Removed unnecessary parenthesis around "iter" ]
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • tracing: Prevent hist_field_var_ref() from accessing NULL tracing_map_elts · 76e1af66
      Committed by Tom Zanussi
      [ Upstream commit 55267c88c003a3648567beae7c90512d3e2ab15e ]
      
      hist_field_var_ref() is an implementation of hist_field_fn_t(), which
      can be called with a null tracing_map_elt elt param when assembling a
      key in event_hist_trigger().
      
      In the case of hist_field_var_ref() this doesn't make sense, because a
      variable can only be resolved by looking it up using an already
      assembled key; i.e., a variable can't be used to assemble a key, since
      the key is required in order to access the variable.
      
      Upper layers should prevent the user from constructing a key using a
      variable in the first place, but in case one slips through, it
      shouldn't cause a NULL pointer dereference.  Also if one does slip
      through, we want to know about it, so emit a one-time warning in that
      case.
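The guard-plus-one-time-warning pattern described above can be sketched in userspace C, with WARN_ONCE replaced by a stderr message. All names below are illustrative, not the histogram-trigger code.

```c
#include <stdbool.h>
#include <stdio.h>

/* Toy stand-in for a tracing_map_elt. */
struct elt {
    long var;
};

static int warn_count;  /* observable for testing the once-only behavior */

/* NULL elt is a caller bug: warn once, return a neutral value instead of
 * dereferencing NULL. */
static long resolve_var(const struct elt *e)
{
    if (!e) {
        static bool warned;
        if (!warned) {
            warned = true;
            warn_count++;
            fprintf(stderr, "hist: variable ref with NULL elt\n");
        }
        return 0;
    }
    return e->var;
}
```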
      
      Link: http://lkml.kernel.org/r/64ec8dc15c14d305295b64cdfcc6b2b9dd14753f.1555597045.git.tom.zanussi@linux.intel.com
      Reported-by: Vincent Bernat <vincent@bernat.ch>
      Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • tracing: Avoid memory leak in predicate_parse() · 74792b82
      Committed by Tomas Bortoli
      commit dfb4a6f2191a80c8b790117d0ff592fd712d3296 upstream.
      
      In case of errors, predicate_parse() goes to the out_free label
      to free memory and to return an error code.
      
      However, predicate_parse() does not free the predicates of the
      temporary prog_stack array, thereby leaking them.
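The shape of the fix, including the protection around freeing each element mentioned in the bracketed note below, can be sketched in userspace C. The struct layout and names are illustrative, not the trace filter code.

```c
#include <stdlib.h>

/* Toy stand-in for a parsed predicate. */
struct pred {
    char *field;
};

/* Error-path cleanup: free every predicate that was allocated before the
 * failure, guarding against slots that were never filled in, then free
 * the array itself. Returns how many predicates were freed (for testing). */
static int free_prog_stack(struct pred **stack, int n)
{
    int freed = 0;

    for (int i = 0; i < n; i++) {
        if (stack[i]) {        /* guard, as in the applied fix */
            free(stack[i]->field);
            free(stack[i]);
            freed++;
        }
    }
    free(stack);
    return freed;
}
```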
      
      Link: http://lkml.kernel.org/r/20190528154338.29976-1-tomasbortoli@gmail.com
      
      Cc: stable@vger.kernel.org
      Fixes: 80765597 ("tracing: Rewrite filter logic to be simpler and faster")
      Reported-by: syzbot+6b8e0fb820e570c59e19@syzkaller.appspotmail.com
      Signed-off-by: Tomas Bortoli <tomasbortoli@gmail.com>
      [ Added protection around freeing prog_stack[i].pred ]
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • ftrace: fix NULL pointer dereference in free_ftrace_func_mapper() · 8d70a2b8
      Committed by Wei Li
      hulk inclusion
      category: bugfix
      bugzilla: 16533
      CVE: NA
      
      -------------------------------------------------
      
      The mapper may be NULL when called from register_ftrace_function_probe()
      with probe->data == NULL.
      
      This issue can be reproduced as follows (it may sometimes be hidden
      by compiler optimization):
      
      / # cat /sys/kernel/debug/tracing/set_ftrace_filter
      #### all functions enabled ####
      /sys/kernel/debug/tracing #  echo do_trap:dump > set_ftrace_filter
      [  559.030635] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
      [  559.034017] Mem abort info:
      [  559.034346]   ESR = 0x96000006
      [  559.035154]   Exception class = DABT (current EL), IL = 32 bits
      [  559.038036]   SET = 0, FnV = 0
      [  559.038403]   EA = 0, S1PTW = 0
      [  559.041292] Data abort info:
      [  559.041697]   ISV = 0, ISS = 0x00000006
      [  559.042081]   CM = 0, WnR = 0
      [  559.042655] user pgtable: 4k pages, 48-bit VAs, pgdp = (____ptrval____)
      [  559.043125] [0000000000000000] pgd=0000000429dc5003, pud=0000000429dc6003, pmd=0000000000000000
      [  559.046981] Internal error: Oops: 96000006 [#1] SMP
      [  559.050625] Dumping ftrace buffer:
      [  559.053384]    (ftrace buffer empty)
      
      Entering kdb (current=0xffff8003e8c457c0, pid 232) on processor 6 Oops: (null)
      due to oops @ 0xffff000008370f14
      CPU: 6 PID: 232 Comm: sh Not tainted 4.19.39+ #29
      Hardware name: linux,dummy-virt (DT)
      pstate: 60000005 (nZCv daif -PAN -UAO)
      pc : free_ftrace_func_mapper+0x2c/0x118
      lr : ftrace_count_free+0x68/0x80
      sp : ffff00000eceba30
      x29: ffff00000eceba30 x28: ffff8003e8d61480
      x27: ffff00000adeedb0 x26: 0000000000000001
      x25: ffff00000b5d0000 x24: 0000000000000000
      x23: ffff00000b1ea000 x22: ffff00000adee000
      x21: 0000000000000000 x20: ffff00000b5d05e8
      x19: ffff00000b5e2e90 x18: ffff8003e8ee0380
      x17: 0000000000000000 x16: 0000000000000000
      x15: 0000000000000000 x14: ffff000009f63400
      x13: 000000000026def4 x12: 0000000000002200
      x11: 0000000000000000 x10: 5a5a5a5a5a5a5a5a
      x9 : 0000000000000000 x8 : 0000000000000000
      x7 : 0000000000000000 x6 : ffff00000ad39748
      x5 : ffff0000083a4cf8 x4 : 0000000000000001
      x3 : 0000000000000001 x2 : ffff00000b5e2c88
      x1 : 0000000000000000 x0 : 0000000000000000
      Call trace:
       free_ftrace_func_mapper+0x2c/0x118
       ftrace_count_free+0x68/0x80
       release_probe+0xfc/0x1d0
       register_ftrace_function_probe+0x4b0/0x870
       ftrace_trace_probe_callback.isra.4+0xb8/0x180
       ftrace_dump_callback+0x50/0x70
       ftrace_regex_write.isra.30+0x290/0x3a8
       ftrace_filter_write+0x44/0x60
       __vfs_write+0x78/0x320
       vfs_write+0x14c/0x2d8
       ksys_write+0xa0/0x168
       __arm64_sys_write+0x3c/0x58
       el0_svc_common+0x41c/0x610
       el0_svc_handler+0x148/0x1d0
       el0_svc+0x8/0xc
      Signed-off-by: Wei Li <liwei391@huawei.com>
      Reviewed-by: Cheng Jian <cj.chengjian@huawei.com>
      Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • x86/uaccess, ftrace: Fix ftrace_likely_update() vs. SMAP · 6537a7d5
      Committed by Peter Zijlstra
      [ Upstream commit 4a6c91fbdef846ec7250b82f2eeeb87ac5f18cf9 ]
      
      For CONFIG_TRACE_BRANCH_PROFILING=y the likely/unlikely annotations
      are overloaded and generate calls out to this code, and thus also run
      when AC=1.
      
      Make it safe.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • tracing: Fix partial reading of trace event's id file · c82b702f
      Committed by Elazar Leibovich
      commit cbe08bcbbe787315c425dde284dcb715cfbf3f39 upstream.
      
      When reading only part of the id file, the ppos isn't tracked
      correctly. This is taken care of by simple_read_from_buffer().
      
      Reading a single byte and then the next byte would result in EOF.
      
      While this may seem minor, it breaks abstractions that read
      information from files unbuffered. See for example
      https://github.com/golang/go/issues/29399
      
      This code was mentioned as problematic in
      commit cd458ba9
      ("tracing: Do not (ab)use trace_seq in event_id_read()")
      
      An example C program that shows this bug:
      
        #include <stdio.h>
        #include <fcntl.h>
        #include <unistd.h>
      
        int main(int argc, char **argv) {
          if (argc < 2)
            return 1;
          int fd = open(argv[1], O_RDONLY);
          if (fd < 0)
            return 1;
          char c;
          if (read(fd, &c, 1) == 1)
            printf("First  %c\n", c);
          if (read(fd, &c, 1) == 1)
            printf("Second %c\n", c);
          close(fd);
          return 0;
        }
      
      Then run with, e.g.
      
        sudo ./a.out /sys/kernel/debug/tracing/events/tcp/tcp_set_state/id
      
      You'll notice you're getting the first character twice, instead of the
      first two characters in the id file.
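The behavior simple_read_from_buffer() provides can be modeled in userspace C: copy from the current offset and advance `*ppos`, so back-to-back one-byte reads make progress instead of repeating the first byte and then hitting EOF. `toy_read` below mimics that ppos handling and is not the kernel function.

```c
#include <string.h>

/* Userspace model of simple_read_from_buffer(): copy up to count bytes
 * starting at *ppos from a fixed in-memory buffer, then advance *ppos. */
static long toy_read(char *to, size_t count, long *ppos,
                     const char *from, size_t available)
{
    long pos = *ppos;

    if (pos < 0)
        return -1;
    if ((size_t)pos >= available || !count)
        return 0;              /* EOF only once the whole buffer is read */

    size_t n = available - (size_t)pos;
    if (n > count)
        n = count;
    memcpy(to, from + pos, n);
    *ppos = pos + (long)n;
    return (long)n;
}
```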
      
      Link: http://lkml.kernel.org/r/20181231115837.4932-1-elazar@lightbitslabs.com
      
      Cc: Orit Wasserman <orit.was@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: stable@vger.kernel.org
      Fixes: 23725aee ("ftrace: provide an id file for each event")
      Signed-off-by: Elazar Leibovich <elazar@lightbitslabs.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • trace: Fix preempt_enable_no_resched() abuse · 1402e48c
      Committed by Peter Zijlstra
      commit d6097c9e4454adf1f8f2c9547c2fa6060d55d952 upstream.
      
      Unless the very next line is schedule(), or implies it, one must not use
      preempt_enable_no_resched(). It can cause a preemption to go missing and
      thereby cause arbitrary delays, breaking the PREEMPT=y invariant.
      
      Link: http://lkml.kernel.org/r/20190423200318.GY14281@hirez.programming.kicks-ass.net
      
      Cc: Waiman Long <longman@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: the arch/x86 maintainers <x86@kernel.org>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: huang ying <huang.ying.caritas@gmail.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: stable@vger.kernel.org
      Fixes: 2c2d7329 ("tracing/ftrace: use preempt_enable_no_resched_notrace in ring_buffer_time_stamp()")
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • tracing: Fix buffer_ref pipe ops · d82ba528
      Committed by Jann Horn
      commit b987222654f84f7b4ca95b3a55eca784cb30235b upstream.
      
      This fixes multiple issues in buffer_pipe_buf_ops:
      
       - The ->steal() handler must not return zero unless the pipe buffer has
         the only reference to the page. But generic_pipe_buf_steal() assumes
         that every reference to the pipe is tracked by the page's refcount,
         which isn't true for these buffers - buffer_pipe_buf_get(), which
         duplicates a buffer, doesn't touch the page's refcount.
         Fix it by using generic_pipe_buf_nosteal(), which refuses every
         attempted theft. It should be easy to actually support ->steal, but the
         only current users of pipe_buf_steal() are the virtio console and FUSE,
         and they also only use it as an optimization. So it's probably not worth
         the effort.
       - The ->get() and ->release() handlers can be invoked concurrently on pipe
         buffers backed by the same struct buffer_ref. Make them safe against
         concurrency by using refcount_t.
       - The pointers stored in ->private were only zeroed out when the last
         reference to the buffer_ref was dropped. As far as I know, this
         shouldn't be necessary anyway, but if we do it, let's always do it.
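The refcount conversion in the second bullet can be sketched with C11 atomics in userspace: a shared count so concurrent ->get() and ->release() calls on buffers backed by the same ref stay balanced, with the ref freed exactly once. `buffer_ref` here is a stand-in, not the kernel struct, and plain `atomic_int` is used where the kernel uses refcount_t.

```c
#include <stdatomic.h>
#include <stdlib.h>

/* Toy stand-in for the trace buffer_ref. */
struct buffer_ref {
    atomic_int refcount;
    void *page;
};

static void ref_get(struct buffer_ref *ref)
{
    atomic_fetch_add(&ref->refcount, 1);
}

/* Returns 1 when the last reference was dropped and the ref was freed,
 * 0 otherwise; safe to call concurrently with ref_get(). */
static int ref_put(struct buffer_ref *ref)
{
    if (atomic_fetch_sub(&ref->refcount, 1) == 1) {
        free(ref);
        return 1;
    }
    return 0;
}
```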
      
      Link: http://lkml.kernel.org/r/20190404215925.253531-1-jannh@google.com
      
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: stable@vger.kernel.org
      Fixes: 73a757e6 ("ring-buffer: Return reader page back into existing ring buffer")
      Signed-off-by: Jann Horn <jannh@google.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      
      Conflicts:
        kernel/trace/trace.c
        include/linux/pipe_fs_i.h
      [yyl: adjust context]
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • tracing: Fix a memory leak by early error exit in trace_pid_write() · fa72b697
      Committed by Wenwen Wang
      commit 91862cc7867bba4ee5c8fcf0ca2f1d30427b6129 upstream.
      
      In trace_pid_write(), the buffer for trace parser is allocated through
      kmalloc() in trace_parser_get_init(). Later on, after the buffer is used,
      it is then freed through kfree() in trace_parser_put(). However, it is
      possible that trace_pid_write() is terminated due to unexpected errors,
      e.g., ENOMEM. In that case, the allocated buffer will not be freed, which
      is a memory leak bug.
      
      To fix this issue, free the allocated buffer when an error is encountered.
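The single-exit cleanup shape of the fix can be sketched in userspace C: every return path, including early errors, goes through the same label that releases the parser buffer. All names below are illustrative.

```c
#include <stdlib.h>

/* Toy stand-in for the trace parser. */
struct toy_parser {
    char *buffer;
};

static int parser_get_init(struct toy_parser *p, size_t size)
{
    p->buffer = malloc(size);
    return p->buffer ? 0 : -1;
}

static void parser_put(struct toy_parser *p)
{
    free(p->buffer);
    p->buffer = NULL;
}

static int toy_pid_write(int fail_early)
{
    struct toy_parser parser;
    int ret = 0;

    if (parser_get_init(&parser, 128) < 0)
        return -1;

    if (fail_early) {
        ret = -2;
        goto out;   /* before the fix, an early return here leaked buffer */
    }

out:
    parser_put(&parser);    /* reached on both success and error */
    return ret;
}
```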
      
      Link: http://lkml.kernel.org/r/1555726979-15633-1-git-send-email-wang6495@umn.edu
      
      Fixes: f4d34a87 ("tracing: Use pid bitmap instead of a pid array for set_event_pid")
      Cc: stable@vger.kernel.org
      Signed-off-by: Wenwen Wang <wang6495@umn.edu>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • kprobes: Mark ftrace mcount handler functions nokprobe · 853f1d81
      Committed by Masami Hiramatsu
      commit fabe38ab6b2bd9418350284c63825f13b8a6abba upstream.
      
      Mark the ftrace mcount handler functions nokprobe, since probing
      these functions with a kretprobe pushes the return address
      incorrectly onto the kretprobe shadow stack.
      Reported-by: Francis Deslauriers <francis.deslauriers@efficios.com>
      Tested-by: Andrea Righi <righi.andrea@gmail.com>
      Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
      Acked-by: Steven Rostedt <rostedt@goodmis.org>
      Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      Link: http://lkml.kernel.org/r/155094062044.6137.6419622920568680640.stgit@devbox
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • fs: prevent page refcount overflow in pipe_buf_get · 49ecd39f
      Committed by Matthew Wilcox
      mainline inclusion
      from mainline-5.1-rc5
      commit 15fab63e1e57be9fdb5eec1bbc5916e9825e9acb
      category: 13690
      bugzilla: NA
      CVE: CVE-2019-11487
      
      There are four commits to fix this CVE:
        fs: prevent page refcount overflow in pipe_buf_get
        mm: prevent get_user_pages() from overflowing page refcount
        mm: add 'try_get_page()' helper function
        mm: make page ref count overflow check tighter and more explicit
      
      -------------------------------------------------
      
      Change pipe_buf_get() to return a bool indicating whether it succeeded
      in raising the refcount of the page (if the thing in the pipe is a page).
      This removes another mechanism for overflowing the page refcount.  All
      callers converted to handle a failure.
      Reported-by: Jann Horn <jannh@google.com>
      Signed-off-by: Matthew Wilcox <willy@infradead.org>
      Cc: stable@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Reviewed-by: zhong jiang <zhongjiang@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • remove unused pid_max in trace.h · b2e4b8da
      luojiajun committed
      euler inclusion
      category: bugfix
      bugzilla: 14007
      CVE: NA
      
      -------------------------------------------------
      
      In commit b8f8307ae1e9 ("pid_ns: Make pid_max per namespace"), pid_max
      is made per namespace, so global pid_max in trace.h is unused now.
      Remove it.
      Signed-off-by: luojiajun <luojiajun3@huawei.com>
      Reviewed-by: Yang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • pid_ns: Make pid_max per namespace · 3775c424
      Li Zefan committed
      euler inclusion
      category: bugfix
      bugzilla: NA
      CVE: NA
      
      -------------------------------------------------
      Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
      Signed-off-by: Li Zefan <lizefan@huawei.com>
      Signed-off-by: luojiajun <luojiajun3@huawei.com>
      Reviewed-by: Li Zefan <lizefan@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • tracing: kdb: Fix ftdump to not sleep · ac4d0e56
      Douglas Anderson committed
      mainline inclusion
      from mainline-5.1-rc1
      commit 31b265b3baaf
      category: bugfix
      bugzilla: 12725
      CVE: NA
      
      -------------------------------------------------
      
      As reported back in 2016-11 [1], the "ftdump" kdb command triggers a
      BUG for "sleeping function called from invalid context".
      
      kdb's "ftdump" command wants to call ring_buffer_read_prepare() in
      atomic context.  A very simple solution for this is to add allocation
      flags to ring_buffer_read_prepare() so kdb can call it without
      triggering the allocation error.  This patch does that.
      
      Note that in the original email thread about this, it was suggested
      that perhaps the solution for kdb was to either preallocate the buffer
      ahead of time or create our own iterator.  I'm hoping that this
      alternative of adding allocation flags to ring_buffer_read_prepare()
      can be considered since it means I don't need to duplicate more of the
      core trace code into "trace_kdb.c" (for either creating my own
      iterator or re-preparing a ring allocator whose memory was already
      allocated).
      
      NOTE: another option for kdb is to actually figure out how to make it
      reuse the existing ftrace_dump() function and totally eliminate the
      duplication.  This sounds very appealing and actually works (the "sr
      z" command can be seen to properly dump the ftrace buffer).  The
      downside here is that ftrace_dump() fully consumes the trace buffer.
      Unless that is changed I'd rather not use it because it means "ftdump
      | grep xyz" won't be very useful to search the ftrace buffer since it
      will throw away the whole trace on the first grep.  A future patch to
      dump only the last few lines of the buffer will also be hard to
      implement.
      
      [1] https://lkml.kernel.org/r/20161117191605.GA21459@google.com
      
      Link: http://lkml.kernel.org/r/20190308193205.213659-1-dianders@chromium.org
      Reported-by: Brian Norris <briannorris@chromium.org>
      Signed-off-by: Douglas Anderson <dianders@chromium.org>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
      Reviewed-by: Wei Li <liwei391@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • tracing/perf: Use strndup_user() instead of buggy open-coded version · bec92bb4
      Jann Horn committed
      commit 83540fbc8812a580b6ad8f93f4c29e62e417687e upstream.
      
      The first version of this method was missing the check for
      `ret == PATH_MAX`; then such a check was added, but it didn't call kfree()
      on error, so there was still a small memory leak in the error case.
      Fix it by using strndup_user() instead of open-coding it.
      
      Link: http://lkml.kernel.org/r/20190220165443.152385-1-jannh@google.com
      
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: stable@vger.kernel.org
      Fixes: 0eadcc7a ("perf/core: Fix perf_uprobe_init()")
      Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
      Acked-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Jann Horn <jannh@google.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • tracing: Use strncpy instead of memcpy for string keys in hist triggers · 73325c7e
      Tom Zanussi committed
      commit 9f0bbf3115ca9f91f43b7c74e9ac7d79f47fc6c2 upstream.
      
      Because there may be random garbage beyond a string's null terminator,
      it's not correct to copy the complete character array for use as a
      hist trigger key.  This results in multiple histogram entries for the
      'same' string key.
      
      So, in the case of a string key, use strncpy instead of memcpy to
      avoid copying in the extra bytes.
      
      Before, using the gdbus entries in the following hist trigger as an
      example:
      
        # echo 'hist:key=comm' > /sys/kernel/debug/tracing/events/sched/sched_waking/trigger
        # cat /sys/kernel/debug/tracing/events/sched/sched_waking/hist
      
        ...
      
        { comm: ImgDecoder #4                      } hitcount:        203
        { comm: gmain                              } hitcount:        213
        { comm: gmain                              } hitcount:        216
        { comm: StreamTrans #73                    } hitcount:        221
        { comm: mozStorage #3                      } hitcount:        230
        { comm: gdbus                              } hitcount:        233
        { comm: StyleThread#5                      } hitcount:        253
        { comm: gdbus                              } hitcount:        256
        { comm: gdbus                              } hitcount:        260
        { comm: StyleThread#4                      } hitcount:        271
      
        ...
      
        # cat /sys/kernel/debug/tracing/events/sched/sched_waking/hist | egrep gdbus | wc -l
        51
      
      After:
      
        # cat /sys/kernel/debug/tracing/events/sched/sched_waking/hist | egrep gdbus | wc -l
        1
      
      Link: http://lkml.kernel.org/r/50c35ae1267d64eee975b8125e151e600071d4dc.1549309756.git.tom.zanussi@linux.intel.com
      
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: stable@vger.kernel.org
      Fixes: 79e577cb ("tracing: Support string type key properly")
      Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • tracing: Fix event filters and triggers to handle negative numbers · 96d5943c
      Pavel Tikhomirov committed
      commit 6a072128d262d2b98d31626906a96700d1fc11eb upstream.
      
      When tracing a syscall exit event it is extremely useful to filter exit
      codes equal to some negative value, to react only to the required errors.
      But negative numbers do not work:
      
      [root@snorch sys_exit_read]# echo "ret == -1" > filter
      bash: echo: write error: Invalid argument
      [root@snorch sys_exit_read]# cat filter
      ret == -1
              ^
      parse_error: Invalid value (did you forget quotes)?
      
      Similar thing happens when setting triggers.
      
      This is a regression in v4.17 introduced by the commit mentioned below;
      testing without that commit shows no problem with negative numbers.
      
      Link: http://lkml.kernel.org/r/20180823102534.7642-1-ptikhomirov@virtuozzo.com
      
      Cc: stable@vger.kernel.org
      Fixes: 80765597 ("tracing: Rewrite filter logic to be simpler and faster")
      Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • tracing: Fix number of entries in trace header · 2cc39ead
      Quentin Perret committed
      commit 9e7382153f80ba45a0bbcd540fb77d4b15f6e966 upstream.
      
      The following commit
      
        441dae8f ("tracing: Add support for display of tgid in trace output")
      
      removed the call to print_event_info() from print_func_help_header_irq()
      which results in the ftrace header not reporting the number of entries
      written in the buffer. As this wasn't the original intent of the patch,
      re-introduce the call to print_event_info() to restore the original
      behaviour.
      
      Link: http://lkml.kernel.org/r/20190214152950.4179-1-quentin.perret@arm.com
      Acked-by: Joel Fernandes <joelaf@google.com>
      Cc: stable@vger.kernel.org
      Fixes: 441dae8f ("tracing: Add support for display of tgid in trace output")
      Signed-off-by: Quentin Perret <quentin.perret@arm.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • tracing/uprobes: Fix output for multiple string arguments · 382322b4
      Andreas Ziegler committed
      commit 0722069a upstream.
      
      When printing multiple uprobe arguments as strings the output for the
      earlier arguments would also include all later string arguments.
      
      This is best explained in an example:
      
      Consider adding a uprobe to a function receiving two strings as
      parameters which is at offset 0xa0 in strlib.so and we want to print
      both parameters when the uprobe is hit (on x86_64):
      
      $ echo 'p:func /lib/strlib.so:0xa0 +0(%di):string +0(%si):string' > \
          /sys/kernel/debug/tracing/uprobe_events
      
      When the function is called as func("foo", "bar") and we hit the probe,
      the trace file shows a line like the following:
      
        [...] func: (0x7f7e683706a0) arg1="foobar" arg2="bar"
      
      Note the extra "bar" printed as part of arg1. This behaviour stacks up
      for additional string arguments.
      
      The strings are stored in a dynamically growing part of the uprobe
      buffer by fetch_store_string() after copying them from userspace via
      strncpy_from_user(). The return value of strncpy_from_user() is then
      directly used as the required size for the string. However, this does
      not take the terminating null byte into account, as the documentation
      for strncpy_from_user() clearly states that it "[...] returns the
      length of the string (not including the trailing NUL)" even though the
      null byte will be copied to the destination.
      
      Therefore, subsequent calls to fetch_store_string() will overwrite
      the terminating null byte of the most recently fetched string with
      the first character of the current string, leading to the
      "accumulation" of strings in earlier arguments in the output.
      
      Fix this by incrementing the return value of strncpy_from_user() by
      one if we did not hit the maximum buffer size.
      
      Link: http://lkml.kernel.org/r/20190116141629.5752-1-andreas.ziegler@fau.de
      
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: stable@vger.kernel.org
      Fixes: 5baaa59e ("tracing/probes: Implement 'memory' fetch method for uprobes")
      Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
      Signed-off-by: Andreas Ziegler <andreas.ziegler@fau.de>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • bpf: fix potential deadlock in bpf_prog_register · d1c1cb28
      Alexei Starovoitov committed
      mainline inclusion
      from mainline-5.0
      commit e16ec34039c7
      category: bugfix
      bugzilla: 9347
      CVE: NA
      
      -------------------------------------------------
      
      Lockdep found a potential deadlock between cpu_hotplug_lock, bpf_event_mutex, and cpuctx_mutex:
      [   13.007000] WARNING: possible circular locking dependency detected
      [   13.007587] 5.0.0-rc3-00018-g2fa53f892422-dirty #477 Not tainted
      [   13.008124] ------------------------------------------------------
      [   13.008624] test_progs/246 is trying to acquire lock:
      [   13.009030] 0000000094160d1d (tracepoints_mutex){+.+.}, at: tracepoint_probe_register_prio+0x2d/0x300
      [   13.009770]
      [   13.009770] but task is already holding lock:
      [   13.010239] 00000000d663ef86 (bpf_event_mutex){+.+.}, at: bpf_probe_register+0x1d/0x60
      [   13.010877]
      [   13.010877] which lock already depends on the new lock.
      [   13.010877]
      [   13.011532]
      [   13.011532] the existing dependency chain (in reverse order) is:
      [   13.012129]
      [   13.012129] -> #4 (bpf_event_mutex){+.+.}:
      [   13.012582]        perf_event_query_prog_array+0x9b/0x130
      [   13.013016]        _perf_ioctl+0x3aa/0x830
      [   13.013354]        perf_ioctl+0x2e/0x50
      [   13.013668]        do_vfs_ioctl+0x8f/0x6a0
      [   13.014003]        ksys_ioctl+0x70/0x80
      [   13.014320]        __x64_sys_ioctl+0x16/0x20
      [   13.014668]        do_syscall_64+0x4a/0x180
      [   13.015007]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      [   13.015469]
      [   13.015469] -> #3 (&cpuctx_mutex){+.+.}:
      [   13.015910]        perf_event_init_cpu+0x5a/0x90
      [   13.016291]        perf_event_init+0x1b2/0x1de
      [   13.016654]        start_kernel+0x2b8/0x42a
      [   13.016995]        secondary_startup_64+0xa4/0xb0
      [   13.017382]
      [   13.017382] -> #2 (pmus_lock){+.+.}:
      [   13.017794]        perf_event_init_cpu+0x21/0x90
      [   13.018172]        cpuhp_invoke_callback+0xb3/0x960
      [   13.018573]        _cpu_up+0xa7/0x140
      [   13.018871]        do_cpu_up+0xa4/0xc0
      [   13.019178]        smp_init+0xcd/0xd2
      [   13.019483]        kernel_init_freeable+0x123/0x24f
      [   13.019878]        kernel_init+0xa/0x110
      [   13.020201]        ret_from_fork+0x24/0x30
      [   13.020541]
      [   13.020541] -> #1 (cpu_hotplug_lock.rw_sem){++++}:
      [   13.021051]        static_key_slow_inc+0xe/0x20
      [   13.021424]        tracepoint_probe_register_prio+0x28c/0x300
      [   13.021891]        perf_trace_event_init+0x11f/0x250
      [   13.022297]        perf_trace_init+0x6b/0xa0
      [   13.022644]        perf_tp_event_init+0x25/0x40
      [   13.023011]        perf_try_init_event+0x6b/0x90
      [   13.023386]        perf_event_alloc+0x9a8/0xc40
      [   13.023754]        __do_sys_perf_event_open+0x1dd/0xd30
      [   13.024173]        do_syscall_64+0x4a/0x180
      [   13.024519]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      [   13.024968]
      [   13.024968] -> #0 (tracepoints_mutex){+.+.}:
      [   13.025434]        __mutex_lock+0x86/0x970
      [   13.025764]        tracepoint_probe_register_prio+0x2d/0x300
      [   13.026215]        bpf_probe_register+0x40/0x60
      [   13.026584]        bpf_raw_tracepoint_open.isra.34+0xa4/0x130
      [   13.027042]        __do_sys_bpf+0x94f/0x1a90
      [   13.027389]        do_syscall_64+0x4a/0x180
      [   13.027727]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      [   13.028171]
      [   13.028171] other info that might help us debug this:
      [   13.028171]
      [   13.028807] Chain exists of:
      [   13.028807]   tracepoints_mutex --> &cpuctx_mutex --> bpf_event_mutex
      [   13.028807]
      [   13.029666]  Possible unsafe locking scenario:
      [   13.029666]
      [   13.030140]        CPU0                    CPU1
      [   13.030510]        ----                    ----
      [   13.030875]   lock(bpf_event_mutex);
      [   13.031166]                                lock(&cpuctx_mutex);
      [   13.031645]                                lock(bpf_event_mutex);
      [   13.032135]   lock(tracepoints_mutex);
      [   13.032441]
      [   13.032441]  *** DEADLOCK ***
      [   13.032441]
      [   13.032911] 1 lock held by test_progs/246:
      [   13.033239]  #0: 00000000d663ef86 (bpf_event_mutex){+.+.}, at: bpf_probe_register+0x1d/0x60
      [   13.033909]
      [   13.033909] stack backtrace:
      [   13.034258] CPU: 1 PID: 246 Comm: test_progs Not tainted 5.0.0-rc3-00018-g2fa53f892422-dirty #477
      [   13.034964] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014
      [   13.035657] Call Trace:
      [   13.035859]  dump_stack+0x5f/0x8b
      [   13.036130]  print_circular_bug.isra.37+0x1ce/0x1db
      [   13.036526]  __lock_acquire+0x1158/0x1350
      [   13.036852]  ? lock_acquire+0x98/0x190
      [   13.037154]  lock_acquire+0x98/0x190
      [   13.037447]  ? tracepoint_probe_register_prio+0x2d/0x300
      [   13.037876]  __mutex_lock+0x86/0x970
      [   13.038167]  ? tracepoint_probe_register_prio+0x2d/0x300
      [   13.038600]  ? tracepoint_probe_register_prio+0x2d/0x300
      [   13.039028]  ? __mutex_lock+0x86/0x970
      [   13.039337]  ? __mutex_lock+0x24a/0x970
      [   13.039649]  ? bpf_probe_register+0x1d/0x60
      [   13.039992]  ? __bpf_trace_sched_wake_idle_without_ipi+0x10/0x10
      [   13.040478]  ? tracepoint_probe_register_prio+0x2d/0x300
      [   13.040906]  tracepoint_probe_register_prio+0x2d/0x300
      [   13.041325]  bpf_probe_register+0x40/0x60
      [   13.041649]  bpf_raw_tracepoint_open.isra.34+0xa4/0x130
      [   13.042068]  ? __might_fault+0x3e/0x90
      [   13.042374]  __do_sys_bpf+0x94f/0x1a90
      [   13.042678]  do_syscall_64+0x4a/0x180
      [   13.042975]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      [   13.043382] RIP: 0033:0x7f23b10a07f9
      [   13.045155] RSP: 002b:00007ffdef42fdd8 EFLAGS: 00000202 ORIG_RAX: 0000000000000141
      [   13.045759] RAX: ffffffffffffffda RBX: 00007ffdef42ff70 RCX: 00007f23b10a07f9
      [   13.046326] RDX: 0000000000000070 RSI: 00007ffdef42fe10 RDI: 0000000000000011
      [   13.046893] RBP: 00007ffdef42fdf0 R08: 0000000000000038 R09: 00007ffdef42fe10
      [   13.047462] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000000
      [   13.048029] R13: 0000000000000016 R14: 00007f23b1db4690 R15: 0000000000000000
      
      Since tracepoints_mutex will be taken in tracepoint_probe_register/unregister()
      there is no need to take bpf_event_mutex too.
      bpf_event_mutex is protecting modifications to prog array used in kprobe/perf bpf progs.
      bpf_raw_tracepoints don't need to take this mutex.
      
      Fixes: c4f6699d ("bpf: introduce BPF_RAW_TRACEPOINT")
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
      Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • bpf: support raw tracepoints in modules · 00d0f1b1
      Matt Mullins committed
      mainline inclusion
      from mainline-5.0
      commit a38d1107f937
      category: bugfix
      bugzilla: 9347
      CVE: NA
      
      -------------------------------------------------
      
      Distributions build drivers as modules, including network and filesystem
      drivers which export numerous tracepoints.  This enables
      bpf(BPF_RAW_TRACEPOINT_OPEN) to attach to those tracepoints.
      Signed-off-by: Matt Mullins <mmullins@fb.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
      Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • tracing: fix incorrect tracer freeing when opening tracing pipe · 2a2e4325
      zhangyi (F) committed
      euler inclusion
      category: bugfix
      bugzilla: 9292
      CVE: NA
      ---------------------------
      
      Commit d716ff71 ("tracing: Remove taking of trace_types_lock in
      pipe files") uses the current tracer instead of a copy in
      tracing_open_pipe(), but it forgot to remove the freeing statement in
      the error path.
      
      Fixes: d716ff71 ("tracing: Remove taking of trace_types_lock in pipe files")
      Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
      Reviewed-by: Li Bin <huawei.libin@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • tracing: uprobes: Fix typo in pr_fmt string · dde7af59
      Andreas Ziegler committed
      commit ea6eb5e7d15e1838de335609994b4546e2abcaaf upstream.
      
      The subsystem-specific message prefix for uprobes was also
      "trace_kprobe: " instead of "trace_uprobe: " as described in
      the original commit message.
      
      Link: http://lkml.kernel.org/r/20190117133023.19292-1-andreas.ziegler@fau.de
      
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: stable@vger.kernel.org
      Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
      Fixes: 72576341 ("tracing/probe: Show subsystem name in messages")
      Signed-off-by: Andreas Ziegler <andreas.ziegler@fau.de>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • tracing: Have trace_stack nr_entries compare not be so subtle · 7ec07e11
      Dan Carpenter committed
      mainline inclusion
      from mainline-5.0
      commit ca16b0fbb052
      category: bugfix
      bugzilla: 5788
      CVE: NA
      
      -------------------------------------------------
      
      Dan Carpenter reviewed the trace_stack.c code and figured he found an off by
      one bug.
      
       "From reviewing the code, it seems possible for
        stack_trace_max.nr_entries to be set to .max_entries and in that case we
        would be reading one element beyond the end of the stack_dump_trace[]
        array.  If it's not set to .max_entries then the bug doesn't affect
        runtime."
      
      Although it looks to be the case, it is not. Because we have:
      
       static unsigned long stack_dump_trace[STACK_TRACE_ENTRIES+1] =
      	 { [0 ... (STACK_TRACE_ENTRIES)] = ULONG_MAX };
      
       struct stack_trace stack_trace_max = {
      	.max_entries		= STACK_TRACE_ENTRIES - 1,
      	.entries		= &stack_dump_trace[0],
       };
      
      And:
      
      	stack_trace_max.nr_entries = x;
      	for (; x < i; x++)
      		stack_dump_trace[x] = ULONG_MAX;
      
      Even if nr_entries equals max_entries, indexing with it into the
      stack_dump_trace[] array will not overflow the array. But when that is
      the case, the second part of the conditional, which tests
      stack_dump_trace[nr_entries] against ULONG_MAX, will always be true.
      
      By applying Dan's patch, it removes the subtle aspect of it and makes the if
      conditional slightly more efficient.
      
      Link: http://lkml.kernel.org/r/20180620110758.crunhd5bfep7zuiz@kili.mountain
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
      Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • Remove 'type' argument from access_ok() function · 4983cb67
      Linus Torvalds committed
      mainline inclusion
      from mainline-5.0-rc1
      commit 96d4f267e40f9509e8a66e2b39e8b95655617693
      category: cleanup
      bugzilla: 9284
      CVE: NA
      
      This is a cleanup patch that prepares for applying the CVE-2018-20669
      patch 594cc251fdd0 ("make 'user_access_begin()' do 'access_ok()'").
      
      -------------------------------------------------
      
      Nobody has actually used the type (VERIFY_READ vs VERIFY_WRITE) argument
      of the user address range verification function since we got rid of the
      old racy i386-only code to walk page tables by hand.
      
      It existed because the original 80386 would not honor the write protect
      bit when in kernel mode, so you had to do COW by hand before doing any
      user access.  But we haven't supported that in a long time, and these
      days the 'type' argument is a purely historical artifact.
      
      A discussion about extending 'user_access_begin()' to do the range
      checking resulted in this patch, because there is no way we're going to
      move the old VERIFY_xyz interface to that model.  And it's best done at
      the end of the merge window when I've done most of my merges, so let's
      just get this done once and for all.
      
      This patch was mostly done with a sed-script, with manual fix-ups for
      the cases that weren't of the trivial 'access_ok(VERIFY_xyz' form.
      
      There were a couple of notable cases:
      
       - csky still had the old "verify_area()" name as an alias.
      
       - the iter_iov code had magical hardcoded knowledge of the actual
         values of VERIFY_{READ,WRITE} (not that they mattered, since nothing
         really used it)
      
       - microblaze used the type argument for a debug printout
      
      but other than those oddities this should be a total no-op patch.
      
      I tried to fix up all architectures, did fairly extensive grepping for
      access_ok() uses, and the changes are trivial, but I may have missed
      something.  Any missed conversion should be trivially fixable, though.
      
      Conflicts:
      	drivers/media/v4l2-core/v4l2-compat-ioctl32.c
      	drivers/infiniband/core/uverbs_main.c
      	drivers/platform/goldfish/goldfish_pipe.c
      	fs/namespace.c
      	fs/select.c
      	kernel/compat.c
      	arch/powerpc/include/asm/uaccess.h
      	arch/arm64/kernel/perf_callchain.c
      	arch/arm64/include/asm/uaccess.h
      	arch/ia64/kernel/signal.c
      	arch/x86/entry/vsyscall/vsyscall_64.c
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      [yyl: adjust context]
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
  2. 20 Dec 2018, 3 commits
  3. 17 Dec 2018, 1 commit
  4. 08 Dec 2018, 1 commit
    • tracing/fgraph: Fix set_graph_function from showing interrupts · 81f96623
      Steven Rostedt (VMware) committed
      commit 5cf99a0f3161bc3ae2391269d134d6bf7e26f00e upstream.
      
      The tracefs file set_graph_function is used to have the function graph
      tracer trace only functions that are listed in that file (or all
      functions if the file is empty). The way this is implemented is that
      the function graph tracer looks at every
      function, and if the current depth is zero and the function matches
      something in the file then it will trace that function. When other functions
      are called, the depth will be greater than zero (because the original
      function will be at depth zero), and all functions will be traced where the
      depth is greater than zero.
      
      The issue is that when a function is first entered, and the handler that
      checks this logic is called, the depth is set to zero. If an interrupt comes
      in and a function in the interrupt handler is traced, its depth will be
      greater than zero and it will automatically be traced, even if the original
      function was not. But because the logic only looks at the depth, it may
      trace interrupts when it should not.
      
      The recent design change of the function graph tracer to fix other bugs
      caused the depth to be zero while the function graph callback handler is
      being called for a longer time, widening the race of this happening. This
      bug was actually there for a longer time, but because the race window was so
      small it seldom happened. The Fixes tag below is for the commit that
      widened the race window, because that commit belongs to a series that
      will also help fix the original bug.
      
      Cc: stable@kernel.org
      Fixes: 39eb456dacb5 ("function_graph: Use new curr_ret_depth to manage depth instead of curr_ret_stack")
      Reported-by: Joe Lawrence <joe.lawrence@redhat.com>
      Tested-by: Joe Lawrence <joe.lawrence@redhat.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  5. 06 Dec 2018, 4 commits
    • function_graph: Reverse the order of pushing the ret_stack and the callback · a22ff9df
      Steven Rostedt (VMware) committed
      commit 7c6ea35ef50810aa12ab26f21cb858d980881576 upstream.
      
      The function graph profiler uses the ret_stack to store the "subtime" and
      reuse it by nested functions and also on the return. But the current logic
      has the profiler callback called before the ret_stack is updated, and it is
      just modifying the ret_stack that will later be allocated (it's just lucky
      that the "subtime" is not touched when it is allocated).
      
      This could also cause a crash if we are at the end of the ret_stack when
      this happens.
      
      By reversing the order, allocating the ret_stack entry before calling
      the callbacks attached to the function being traced, the ret_stack
      entry is no longer used before it is allocated.
      
      Cc: stable@kernel.org
      Fixes: 03274a3f ("tracing/fgraph: Adjust fgraph depth before calling trace return callback")
      Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • function_graph: Move return callback before update of curr_ret_stack · d2bcf809
      Steven Rostedt (VMware) committed
      commit 552701dd0fa7c3d448142e87210590ba424694a0 upstream.
      
      In the past, curr_ret_stack had two functions. One was to denote the
      depth of the call graph, the other to keep track of where on the
      ret_stack the data is used. Although they may be slightly related,
      there are two cases where they need to be used differently.
      
      The one case is that it keeps the ret_stack data from being corrupted by an
      interrupt coming in and overwriting the data still in use. The other is just
      to know where the depth of the stack currently is.
      
      The function profiler uses the ret_stack to save a "subtime" variable that
      is part of the data on the ret_stack. If curr_ret_stack is modified too
      early, then this variable can be corrupted.
      
      The "max_depth" option, when set to 1, records the first functions entering
      the kernel. To see all top-level functions (when dealing with timings), the
      depth variable needs to be lowered before calling the return hook. But
      lowering curr_ret_stack leaves the ret_stack data still in use by the
      return hook susceptible to being overwritten.
      
      Now that there are two variables to handle the two cases (curr_ret_depth
      for the depth, curr_ret_stack for the index), each can be updated at the
      location appropriate to its role.
      
      Cc: stable@kernel.org
      Fixes: 03274a3f ("tracing/fgraph: Adjust fgraph depth before calling trace return callback")
      Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      d2bcf809
    • S
      function_graph: Have profiler use curr_ret_stack and not depth · aec14c81
      Committed by Steven Rostedt (VMware)
      commit b1b35f2e218a5b57d03bbc3b0667d5064570dc60 upstream.
      
      The profiler uses trace->depth to find its entry on the ret_stack, but the
      depth may not match the actual location of the entry (if an interrupt
      preempts the profiler's processing for another function, the depth and
      curr_ret_stack will differ).
      
      Have it use the curr_ret_stack as the index to find its ret_stack entry
      instead of using the depth variable, as that is no longer guaranteed to be
      the same.
      
      Cc: stable@kernel.org
      Fixes: 03274a3f ("tracing/fgraph: Adjust fgraph depth before calling trace return callback")
      Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      aec14c81
    • S
      function_graph: Use new curr_ret_depth to manage depth instead of curr_ret_stack · 39237432
      Committed by Steven Rostedt (VMware)
      commit 39eb456d upstream.
      
      Currently, the depth of the ret_stack is determined by curr_ret_stack index.
      The issue is that there's a race between setting of the curr_ret_stack and
      calling of the callback attached to the return of the function.
      
      Commit 03274a3f ("tracing/fgraph: Adjust fgraph depth before calling
      trace return callback") moved the calling of the callback to after the
      setting of curr_ret_stack, even stating that it was safe to do so, when in
      fact that ordering was the reason a barrier() was there (yes, I should
      have commented that barrier()).
      
      Not only does the curr_ret_stack keep track of the current call graph depth,
      it also keeps the ret_stack content from being overwritten by new data.
      
      The function profiler uses the "subtime" variable of the ret_stack
      structure, and moving curr_ret_stack early allows interrupts to reuse the
      structure still in use, corrupting the data and breaking the profiler.
      
      To fix this, there need to be two variables: one for the call stack depth
      and one for the index into the ret_stack in use, as they need to change at
      two different locations.
      
      Cc: stable@kernel.org
      Fixes: 03274a3f ("tracing/fgraph: Adjust fgraph depth before calling trace return callback")
      Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      39237432