1. 27 December 2019, 40 commits
• ftrace/x86: Remove possible deadlock between register_kprobe() and ftrace_run_update_code() · 97fa53b7
  Petr Mladek authored
      commit d5b844a2cf507fc7642c9ae80a9d585db3065c28 upstream.
      
      The commit 9f255b632bf12c4dd7 ("module: Fix livepatch/ftrace module text
      permissions race") causes a possible deadlock between register_kprobe()
      and ftrace_run_update_code() when ftrace is using stop_machine().
      
      The existing dependency chain (in reverse order) is:
      
      -> #1 (text_mutex){+.+.}:
             validate_chain.isra.21+0xb32/0xd70
             __lock_acquire+0x4b8/0x928
             lock_acquire+0x102/0x230
             __mutex_lock+0x88/0x908
             mutex_lock_nested+0x32/0x40
             register_kprobe+0x254/0x658
             init_kprobes+0x11a/0x168
             do_one_initcall+0x70/0x318
             kernel_init_freeable+0x456/0x508
             kernel_init+0x22/0x150
             ret_from_fork+0x30/0x34
             kernel_thread_starter+0x0/0xc
      
      -> #0 (cpu_hotplug_lock.rw_sem){++++}:
             check_prev_add+0x90c/0xde0
             validate_chain.isra.21+0xb32/0xd70
             __lock_acquire+0x4b8/0x928
             lock_acquire+0x102/0x230
             cpus_read_lock+0x62/0xd0
             stop_machine+0x2e/0x60
             arch_ftrace_update_code+0x2e/0x40
             ftrace_run_update_code+0x40/0xa0
             ftrace_startup+0xb2/0x168
             register_ftrace_function+0x64/0x88
             klp_patch_object+0x1a2/0x290
             klp_enable_patch+0x554/0x980
             do_one_initcall+0x70/0x318
             do_init_module+0x6e/0x250
             load_module+0x1782/0x1990
             __s390x_sys_finit_module+0xaa/0xf0
             system_call+0xd8/0x2d0
      
       Possible unsafe locking scenario:
      
             CPU0                    CPU1
             ----                    ----
        lock(text_mutex);
                                     lock(cpu_hotplug_lock.rw_sem);
                                     lock(text_mutex);
        lock(cpu_hotplug_lock.rw_sem);
      
It is a similar problem to the one solved by commit 2d1e38f5
("kprobes: Cure hotplug lock ordering issues"). Many locks are involved.
To be on the safe side, text_mutex must become a low-level lock taken
after cpu_hotplug_lock.rw_sem.
      
      This can't be achieved easily with the current ftrace design.
      For example, arm calls set_all_modules_text_rw() already in
      ftrace_arch_code_modify_prepare(), see arch/arm/kernel/ftrace.c.
This function is called:
      
        + outside stop_machine() from ftrace_run_update_code()
        + without stop_machine() from ftrace_module_enable()
      
      Fortunately, the problematic fix is needed only on x86_64. It is
      the only architecture that calls set_all_modules_text_rw()
in the ftrace path and supports livepatching at the same time.
      
      Therefore it is enough to move text_mutex handling from the generic
      kernel/trace/ftrace.c into arch/x86/kernel/ftrace.c:
      
         ftrace_arch_code_modify_prepare()
         ftrace_arch_code_modify_post_process()
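
The resulting x86 callbacks then look roughly like this (a sketch of the
upstream change, not the literal diff):

   int ftrace_arch_code_modify_prepare(void)
   {
           /* text_mutex is now taken inside the arch ftrace path; x86 does
            * not use stop_machine() for ftrace updates, so this cannot
            * invert against cpu_hotplug_lock.rw_sem. */
           mutex_lock(&text_mutex);
           set_kernel_text_rw();
           set_all_modules_text_rw();
           return 0;
   }

   int ftrace_arch_code_modify_post_process(void)
   {
           set_all_modules_text_ro();
           set_kernel_text_ro();
           mutex_unlock(&text_mutex);
           return 0;
   }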
      
This patch basically reverts the ftrace part of the problematic
commit 9f255b632bf12c4dd7 ("module: Fix livepatch/ftrace module
text permissions race") and provides an x86_64-specific fix.
      
      Some refactoring of the ftrace code will be needed when livepatching
      is implemented for arm or nds32. These architectures call
      set_all_modules_text_rw() and use stop_machine() at the same time.
      
      Link: http://lkml.kernel.org/r/20190627081334.12793-1-pmladek@suse.com
      
      Fixes: 9f255b632bf12c4dd7 ("module: Fix livepatch/ftrace module text permissions race")
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Reported-by: Miroslav Benes <mbenes@suse.cz>
Reviewed-by: Miroslav Benes <mbenes@suse.cz>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
[
  As reviewed by Miroslav Benes <mbenes@suse.cz>, removed return value of
  ftrace_run_update_code() as it is a void function.
]
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
• tracing/snapshot: Resize spare buffer if size changed · f927e496
  Eiichi Tsukata authored
      commit 46cc0b44428d0f0e81f11ea98217fc0edfbeab07 upstream.
      
The current snapshot implementation swaps two ring buffers even when their
sizes differ, which can cause an inconsistency between the contents of the
buffer_size_kb file and the current buffer size.
      
      For example:
      
        # cat buffer_size_kb
        7 (expanded: 1408)
        # echo 1 > events/enable
        # grep bytes per_cpu/cpu0/stats
        bytes: 1441020
        # echo 1 > snapshot             // current:1408, spare:1408
        # echo 123 > buffer_size_kb     // current:123,  spare:1408
        # echo 1 > snapshot             // current:1408, spare:123
        # grep bytes per_cpu/cpu0/stats
        bytes: 1443700
        # cat buffer_size_kb
        123                             // != current:1408
      
      And also, a similar per-cpu case hits the following WARNING:
      
      Reproducer:
      
        # echo 1 > per_cpu/cpu0/snapshot
        # echo 123 > buffer_size_kb
        # echo 1 > per_cpu/cpu0/snapshot
      
      WARNING:
      
        WARNING: CPU: 0 PID: 1946 at kernel/trace/trace.c:1607 update_max_tr_single.part.0+0x2b8/0x380
        Modules linked in:
        CPU: 0 PID: 1946 Comm: bash Not tainted 5.2.0-rc6 #20
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-2.fc30 04/01/2014
        RIP: 0010:update_max_tr_single.part.0+0x2b8/0x380
        Code: ff e8 dc da f9 ff 0f 0b e9 88 fe ff ff e8 d0 da f9 ff 44 89 ee bf f5 ff ff ff e8 33 dc f9 ff 41 83 fd f5 74 96 e8 b8 da f9 ff <0f> 0b eb 8d e8 af da f9 ff 0f 0b e9 bf fd ff ff e8 a3 da f9 ff 48
        RSP: 0018:ffff888063e4fca0 EFLAGS: 00010093
        RAX: ffff888066214380 RBX: ffffffff99850fe0 RCX: ffffffff964298a8
        RDX: 0000000000000000 RSI: 00000000fffffff5 RDI: 0000000000000005
        RBP: 1ffff1100c7c9f96 R08: ffff888066214380 R09: ffffed100c7c9f9b
        R10: ffffed100c7c9f9a R11: 0000000000000003 R12: 0000000000000000
        R13: 00000000ffffffea R14: ffff888066214380 R15: ffffffff99851060
        FS:  00007f9f8173c700(0000) GS:ffff88806d000000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000000000714dc0 CR3: 0000000066fa6000 CR4: 00000000000006f0
        Call Trace:
         ? trace_array_printk_buf+0x140/0x140
         ? __mutex_lock_slowpath+0x10/0x10
         tracing_snapshot_write+0x4c8/0x7f0
         ? trace_printk_init_buffers+0x60/0x60
         ? selinux_file_permission+0x3b/0x540
         ? tracer_preempt_off+0x38/0x506
         ? trace_printk_init_buffers+0x60/0x60
         __vfs_write+0x81/0x100
         vfs_write+0x1e1/0x560
         ksys_write+0x126/0x250
         ? __ia32_sys_read+0xb0/0xb0
         ? do_syscall_64+0x1f/0x390
         do_syscall_64+0xc1/0x390
         entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      This patch adds resize_buffer_duplicate_size() to check if there is a
      difference between current/spare buffer sizes and resize a spare buffer
      if necessary.
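
The core of the change, roughly (resize_buffer_duplicate_size() already
exists in kernel/trace/trace.c; placement is simplified here):

  /* Before swapping, make the spare (max) buffer match the current size;
   * use iter->cpu_file instead of RING_BUFFER_ALL_CPUS in the per-cpu case. */
  if (tr->allocated_snapshot)
          ret = resize_buffer_duplicate_size(&tr->max_buffer,
                          &tr->trace_buffer, RING_BUFFER_ALL_CPUS);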
      
      Link: http://lkml.kernel.org/r/20190625012910.13109-1-devel@etsukata.com
      
      Cc: stable@vger.kernel.org
      Fixes: ad909e21 ("tracing: Add internal tracing_snapshot() functions")
Signed-off-by: Eiichi Tsukata <devel@etsukata.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
• ptrace: Fix ->ptracer_cred handling for PTRACE_TRACEME · 69a16f06
  Jann Horn authored
      commit 6994eefb0053799d2e07cd140df6c2ea106c41ee upstream.
      
      Fix two issues:
      
      When called for PTRACE_TRACEME, ptrace_link() would obtain an RCU
      reference to the parent's objective credentials, then give that pointer
      to get_cred().  However, the object lifetime rules for things like
      struct cred do not permit unconditionally turning an RCU reference into
      a stable reference.
      
      PTRACE_TRACEME records the parent's credentials as if the parent was
      acting as the subject, but that's not the case.  If a malicious
      unprivileged child uses PTRACE_TRACEME and the parent is privileged, and
      at a later point, the parent process becomes attacker-controlled
      (because it drops privileges and calls execve()), the attacker ends up
      with control over two processes with a privileged ptrace relationship,
      which can be abused to ptrace a suid binary and obtain root privileges.
      
      Fix both of these by always recording the credentials of the process
      that is requesting the creation of the ptrace relationship:
      current_cred() can't change under us, and current is the proper subject
      for access control.
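
The fix itself is small; ptrace_link() ends up as (per the upstream commit):

  static void ptrace_link(struct task_struct *child,
                          struct task_struct *new_parent)
  {
          /* Record the credentials of the requester (current), which are
           * stable here, instead of an RCU reference to the parent's. */
          __ptrace_link(child, new_parent, current_cred());
  }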
      
      This change is theoretically userspace-visible, but I am not aware of
      any code that it will actually break.
      
      Fixes: 64b875f7 ("ptrace: Capture the ptracer's creds not PT_PTRACE_CAP")
Signed-off-by: Jann Horn <jannh@google.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
• module: Fix livepatch/ftrace module text permissions race · aa4d90fc
  Josh Poimboeuf authored
      [ Upstream commit 9f255b632bf12c4dd7fc31caee89aa991ef75176 ]
      
      It's possible for livepatch and ftrace to be toggling a module's text
      permissions at the same time, resulting in the following panic:
      
        BUG: unable to handle page fault for address: ffffffffc005b1d9
        #PF: supervisor write access in kernel mode
        #PF: error_code(0x0003) - permissions violation
        PGD 3ea0c067 P4D 3ea0c067 PUD 3ea0e067 PMD 3cc13067 PTE 3b8a1061
        Oops: 0003 [#1] PREEMPT SMP PTI
        CPU: 1 PID: 453 Comm: insmod Tainted: G           O  K   5.2.0-rc1-a188339ca5 #1
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-20181126_142135-anatol 04/01/2014
        RIP: 0010:apply_relocate_add+0xbe/0x14c
        Code: fa 0b 74 21 48 83 fa 18 74 38 48 83 fa 0a 75 40 eb 08 48 83 38 00 74 33 eb 53 83 38 00 75 4e 89 08 89 c8 eb 0a 83 38 00 75 43 <89> 08 48 63 c1 48 39 c8 74 2e eb 48 83 38 00 75 32 48 29 c1 89 08
        RSP: 0018:ffffb223c00dbb10 EFLAGS: 00010246
        RAX: ffffffffc005b1d9 RBX: 0000000000000000 RCX: ffffffff8b200060
        RDX: 000000000000000b RSI: 0000004b0000000b RDI: ffff96bdfcd33000
        RBP: ffffb223c00dbb38 R08: ffffffffc005d040 R09: ffffffffc005c1f0
        R10: ffff96bdfcd33c40 R11: ffff96bdfcd33b80 R12: 0000000000000018
        R13: ffffffffc005c1f0 R14: ffffffffc005e708 R15: ffffffff8b2fbc74
        FS:  00007f5f447beba8(0000) GS:ffff96bdff900000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: ffffffffc005b1d9 CR3: 000000003cedc002 CR4: 0000000000360ea0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        Call Trace:
         klp_init_object_loaded+0x10f/0x219
         ? preempt_latency_start+0x21/0x57
         klp_enable_patch+0x662/0x809
         ? virt_to_head_page+0x3a/0x3c
         ? kfree+0x8c/0x126
         patch_init+0x2ed/0x1000 [livepatch_test02]
         ? 0xffffffffc0060000
         do_one_initcall+0x9f/0x1c5
         ? kmem_cache_alloc_trace+0xc4/0xd4
         ? do_init_module+0x27/0x210
         do_init_module+0x5f/0x210
         load_module+0x1c41/0x2290
         ? fsnotify_path+0x3b/0x42
         ? strstarts+0x2b/0x2b
         ? kernel_read+0x58/0x65
         __do_sys_finit_module+0x9f/0xc3
         ? __do_sys_finit_module+0x9f/0xc3
         __x64_sys_finit_module+0x1a/0x1c
         do_syscall_64+0x52/0x61
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      The above panic occurs when loading two modules at the same time with
      ftrace enabled, where at least one of the modules is a livepatch module:
      
      CPU0					CPU1
      klp_enable_patch()
        klp_init_object_loaded()
          module_disable_ro()
          					ftrace_module_enable()
      					  ftrace_arch_code_modify_post_process()
      				    	    set_all_modules_text_ro()
            klp_write_object_relocations()
              apply_relocate_add()
      	  *patches read-only code* - BOOM
      
      A similar race exists when toggling ftrace while loading a livepatch
      module.
      
      Fix it by ensuring that the livepatch and ftrace code patching
      operations -- and their respective permissions changes -- are protected
      by the text_mutex.
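
Sketch of the livepatch side (the ftrace side takes the same lock around
its permission flips; module_enable_ro()'s second argument is the
after_init flag):

  mutex_lock(&text_mutex);
  module_disable_ro(patch->mod);
  ret = klp_write_object_relocations(patch->mod, obj);
  module_enable_ro(patch->mod, true);
  mutex_unlock(&text_mutex);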
      
Link: http://lkml.kernel.org/r/ab43d56ab909469ac5d2520c5d944ad6d4abd476.1560474114.git.jpoimboe@redhat.com
Reported-by: Johannes Erdfelt <johannes@erdfelt.com>
      Fixes: 444d13ff ("modules: add ro_after_init support")
Acked-by: Jessica Yu <jeyu@kernel.org>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: Miroslav Benes <mbenes@suse.cz>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
• tracing: avoid build warning with HAVE_NOP_MCOUNT · 0bb6e415
  Vasily Gorbik authored
      [ Upstream commit cbdaeaf050b730ea02e9ab4ff844ce54d85dbe1d ]
      
      Selecting HAVE_NOP_MCOUNT enables -mnop-mcount (if gcc supports it)
      and sets CC_USING_NOP_MCOUNT. Reuse __is_defined (which is suitable for
      testing CC_USING_* defines) to avoid conditional compilation and fix
      the following gcc 9 warning on s390:
      
      kernel/trace/ftrace.c:2514:1: warning: ‘ftrace_code_disable’ defined
      but not used [-Wunused-function]
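
Simplified illustration of the pattern (not the literal diff):

  /* Before: the helper is compiled out entirely, so gcc 9 flags it as
   * defined but unused in the other configuration. */
  #ifndef CC_USING_NOP_MCOUNT
          ftrace_code_disable(mod, rec);
  #endif

  /* After: always compiled, and the branch folds away at compile time. */
  if (!__is_defined(CC_USING_NOP_MCOUNT))
          ftrace_code_disable(mod, rec);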
      
      Link: http://lkml.kernel.org/r/patch.git-1a82d13f33ac.your-ad-here.call-01559732716-ext-6629@work.hours
      
      Fixes: 2f4df001 ("tracing: Add -mcount-nop option support")
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
• cpuset: restore sanity to cpuset_cpus_allowed_fallback() · aa1c6046
  Joel Savitz authored
      [ Upstream commit d477f8c202d1f0d4791ab1263ca7657bbe5cf79e ]
      
      In the case that a process is constrained by taskset(1) (i.e.
      sched_setaffinity(2)) to a subset of available cpus, and all of those are
      subsequently offlined, the scheduler will set tsk->cpus_allowed to
      the current value of task_cs(tsk)->effective_cpus.
      
      This is done via a call to do_set_cpus_allowed() in the context of
      cpuset_cpus_allowed_fallback() made by the scheduler when this case is
      detected. This is the only call made to cpuset_cpus_allowed_fallback()
      in the latest mainline kernel.
      
      However, this is not sane behavior.
      
      I will demonstrate this on a system running the latest upstream kernel
      with the following initial configuration:
      
      	# grep -i cpu /proc/$$/status
Cpus_allowed:	ffffffff,ffffffff
      	Cpus_allowed_list:	0-63
      
      (Where cpus 32-63 are provided via smt.)
      
      If we limit our current shell process to cpu2 only and then offline it
      and reonline it:
      
      	# taskset -p 4 $$
      	pid 2272's current affinity mask: ffffffffffffffff
      	pid 2272's new affinity mask: 4
      
      	# echo off > /sys/devices/system/cpu/cpu2/online
      	# dmesg | tail -3
      	[ 2195.866089] process 2272 (bash) no longer affine to cpu2
      	[ 2195.872700] IRQ 114: no longer affine to CPU2
      	[ 2195.879128] smpboot: CPU 2 is now offline
      
      	# echo on > /sys/devices/system/cpu/cpu2/online
      	# dmesg | tail -1
      	[ 2617.043572] smpboot: Booting Node 0 Processor 2 APIC 0x4
      
      We see that our current process now has an affinity mask containing
      every cpu available on the system _except_ the one we originally
      constrained it to:
      
      	# grep -i cpu /proc/$$/status
      	Cpus_allowed:   ffffffff,fffffffb
      	Cpus_allowed_list:      0-1,3-63
      
      This is not sane behavior, as the scheduler can now not only place the
      process on previously forbidden cpus, it can't even schedule it on
      the cpu it was originally constrained to!
      
      Other cases result in even more exotic affinity masks. Take for instance
      a process with an affinity mask containing only cpus provided by smt at
      the moment that smt is toggled, in a configuration such as the following:
      
      	# taskset -p f000000000 $$
      	# grep -i cpu /proc/$$/status
      	Cpus_allowed:	000000f0,00000000
      	Cpus_allowed_list:	36-39
      
      A double toggle of smt results in the following behavior:
      
      	# echo off > /sys/devices/system/cpu/smt/control
      	# echo on > /sys/devices/system/cpu/smt/control
      	# grep -i cpus /proc/$$/status
      	Cpus_allowed:	ffffff00,ffffffff
      	Cpus_allowed_list:	0-31,40-63
      
      This is even less sane than the previous case, as the new affinity mask
      excludes all smt-provided cpus with ids less than those that were
      previously in the affinity mask, as well as those that were actually in
      the mask.
      
      With this patch applied, both of these cases end in the following state:
      
      	# grep -i cpu /proc/$$/status
      	Cpus_allowed:	ffffffff,ffffffff
      	Cpus_allowed_list:	0-63
      
      The original policy is discarded. Though not ideal, it is the simplest way
      to restore sanity to this fallback case without reinventing the cpuset
      wheel that rolls down the kernel just fine in cgroup v2. A user who wishes
      for the previous affinity mask to be restored in this fallback case can use
      that mechanism instead.
      
This patch modifies scheduler behavior by instead resetting the mask to
task_cs(tsk)->cpus_allowed by default, and to the cpu_possible mask in
legacy mode. I tested the cases above on both modes.
      
      Note that the scheduler uses this fallback mechanism if and only if
      _every_ other valid avenue has been traveled, and it is the last resort
      before calling BUG().
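
A sketch of the resulting fallback, matching the description above:

  void cpuset_cpus_allowed_fallback(struct task_struct *tsk)
  {
          rcu_read_lock();
          /* default (v2) mode: reset to the cpuset's cpus_allowed;
           * legacy mode: reset to all possible cpus */
          do_set_cpus_allowed(tsk, is_in_v2_mode() ?
                  task_cs(tsk)->cpus_allowed : cpu_possible_mask);
          rcu_read_unlock();
  }
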
Suggested-by: Waiman Long <longman@redhat.com>
Suggested-by: Phil Auld <pauld@redhat.com>
Signed-off-by: Joel Savitz <jsavitz@redhat.com>
Acked-by: Phil Auld <pauld@redhat.com>
Acked-by: Waiman Long <longman@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
• bpf: fix unconnected udp hooks · f3b609d8
  Daniel Borkmann authored
      commit 983695fa676568fc0fe5ddd995c7267aabc24632 upstream.
      
The intention of the cgroup bind/connect/sendmsg BPF hooks is to act
transparently to applications, as also stated in the original motivation
in 7828f20e ("Merge branch 'bpf-cgroup-bind-connect'"). When recently integrating the latter
      two hooks into Cilium to enable host based load-balancing with Kubernetes,
      I ran into the issue that pods couldn't start up as DNS got broken. Kubernetes
      typically sets up DNS as a service and is thus subject to load-balancing.
      
      Upon further debugging, it turns out that the cgroupv2 sendmsg BPF hooks API
      is currently insufficient and thus not usable as-is for standard applications
      shipped with most distros. To break down the issue we ran into with a simple
      example:
      
        # cat /etc/resolv.conf
        nameserver 147.75.207.207
        nameserver 147.75.207.208
      
      For the purpose of a simple test, we set up above IPs as service IPs and
      transparently redirect traffic to a different DNS backend server for that
      node:
      
        # cilium service list
        ID   Frontend            Backend
        1    147.75.207.207:53   1 => 8.8.8.8:53
        2    147.75.207.208:53   1 => 8.8.8.8:53
      
      The attached BPF program is basically selecting one of the backends if the
      service IP/port matches on the cgroup hook. DNS breaks here, because the
      hooks are not transparent enough to applications which have built-in msg_name
      address checks:
      
        # nslookup 1.1.1.1
        ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
        ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.208#53
        ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
        [...]
        ;; connection timed out; no servers could be reached
      
        # dig 1.1.1.1
        ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
        ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.208#53
        ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
        [...]
      
        ; <<>> DiG 9.11.3-1ubuntu1.7-Ubuntu <<>> 1.1.1.1
        ;; global options: +cmd
        ;; connection timed out; no servers could be reached
      
      For comparison, if none of the service IPs is used, and we tell nslookup
      to use 8.8.8.8 directly it works just fine, of course:
      
        # nslookup 1.1.1.1 8.8.8.8
        1.1.1.1.in-addr.arpa	name = one.one.one.one.
      
      In order to fix this and thus act more transparent to the application,
      this needs reverse translation on recvmsg() side. A minimal fix for this
      API is to add similar recvmsg() hooks behind the BPF cgroups static key
      such that the program can track state and replace the current sockaddr_in{,6}
      with the original service IP. From BPF side, this basically tracks the
      service tuple plus socket cookie in an LRU map where the reverse NAT can
      then be retrieved via map value as one example. Side-note: the BPF cgroups
      static key should be converted to a per-hook static key in future.
      
      Same example after this fix:
      
        # cilium service list
        ID   Frontend            Backend
        1    147.75.207.207:53   1 => 8.8.8.8:53
        2    147.75.207.208:53   1 => 8.8.8.8:53
      
      Lookups work fine now:
      
        # nslookup 1.1.1.1
        1.1.1.1.in-addr.arpa    name = one.one.one.one.
      
        Authoritative answers can be found from:
      
        # dig 1.1.1.1
      
        ; <<>> DiG 9.11.3-1ubuntu1.7-Ubuntu <<>> 1.1.1.1
        ;; global options: +cmd
        ;; Got answer:
        ;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 51550
        ;; flags: qr rd ra ad; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1
      
        ;; OPT PSEUDOSECTION:
        ; EDNS: version: 0, flags:; udp: 512
        ;; QUESTION SECTION:
        ;1.1.1.1.                       IN      A
      
        ;; AUTHORITY SECTION:
        .                       23426   IN      SOA     a.root-servers.net. nstld.verisign-grs.com. 2019052001 1800 900 604800 86400
      
        ;; Query time: 17 msec
        ;; SERVER: 147.75.207.207#53(147.75.207.207)
        ;; WHEN: Tue May 21 12:59:38 UTC 2019
        ;; MSG SIZE  rcvd: 111
      
      And from an actual packet level it shows that we're using the back end
      server when talking via 147.75.207.20{7,8} front end:
      
        # tcpdump -i any udp
        [...]
        12:59:52.698732 IP foo.42011 > google-public-dns-a.google.com.domain: 18803+ PTR? 1.1.1.1.in-addr.arpa. (38)
        12:59:52.698735 IP foo.42011 > google-public-dns-a.google.com.domain: 18803+ PTR? 1.1.1.1.in-addr.arpa. (38)
        12:59:52.701208 IP google-public-dns-a.google.com.domain > foo.42011: 18803 1/0/0 PTR one.one.one.one. (67)
        12:59:52.701208 IP google-public-dns-a.google.com.domain > foo.42011: 18803 1/0/0 PTR one.one.one.one. (67)
        [...]
      
      In order to be flexible and to have same semantics as in sendmsg BPF
      programs, we only allow return codes in [1,1] range. In the sendmsg case
      the program is called if msg->msg_name is present which can be the case
      in both, connected and unconnected UDP.
      
The former relies on the sockaddr_in{,6} passed via connect(2) only if
the passed msg->msg_name was NULL. Therefore, on the recvmsg side, we act in a similar
      way to call into the BPF program whenever a non-NULL msg->msg_name was
      passed independent of sk->sk_state being TCP_ESTABLISHED or not. Note
      that for TCP case, the msg->msg_name is ignored in the regular recvmsg
      path and therefore not relevant.
      
      For the case of ip{,v6}_recv_error() paths, picked up via MSG_ERRQUEUE,
      the hook is not called. This is intentional as it aligns with the same
      semantics as in case of TCP cgroup BPF hooks right now. This might be
      better addressed in future through a different bpf_attach_type such
      that this case can be distinguished from the regular recvmsg paths,
      for example.
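
On the kernel side this boils down to a hook call in udp{,v6}_recvmsg()
behind the static key, roughly (macro name as introduced by this commit):

  if (cgroup_bpf_enabled)
          BPF_CGROUP_RUN_PROG_UDP4_RECVMSG_LOCK(sk,
                          (struct sockaddr *)sin);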
      
      Fixes: 1cedee13 ("bpf: Hooks for sys_sendmsg")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrey Ignatov <rdna@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Martynas Pumputis <m@lambda.lt>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
• bpf: fix nested bpf tracepoints with per-cpu data · 1cd29ed5
  Matt Mullins authored
      commit 9594dc3c7e71b9f52bee1d7852eb3d4e3aea9e99 upstream.
      
      BPF_PROG_TYPE_RAW_TRACEPOINTs can be executed nested on the same CPU, as
      they do not increment bpf_prog_active while executing.
      
      This enables three levels of nesting, to support
        - a kprobe or raw tp or perf event,
        - another one of the above that irq context happens to call, and
        - another one in nmi context
      (at most one of which may be a kprobe or perf event).
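
Sketch of the per-cpu nesting that hands out pt_regs storage (close to the
upstream change):

  struct bpf_raw_tp_regs {
          struct pt_regs regs[3];         /* task, irq, nmi */
  };
  static DEFINE_PER_CPU(struct bpf_raw_tp_regs, bpf_raw_tp_regs);
  static DEFINE_PER_CPU(int, bpf_raw_tp_nest_level);

  static struct pt_regs *get_bpf_raw_tp_regs(void)
  {
          struct bpf_raw_tp_regs *tp_regs = this_cpu_ptr(&bpf_raw_tp_regs);
          int nest_level = this_cpu_inc_return(bpf_raw_tp_nest_level);

          if (WARN_ON_ONCE(nest_level > ARRAY_SIZE(tp_regs->regs))) {
                  this_cpu_dec(bpf_raw_tp_nest_level);
                  return ERR_PTR(-EBUSY);
          }
          return &tp_regs->regs[nest_level - 1];
  }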
      
      Fixes: 20b9d7ac ("bpf: avoid excessive stack usage for perf_sample_data")
Signed-off-by: Matt Mullins <mmullins@fb.com>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
• bpf: lpm_trie: check left child of last leftmost node for NULL · e3d4966f
  Jonathan Lemon authored
      commit da2577fdd0932ea4eefe73903f1130ee366767d2 upstream.
      
If the leftmost parent node of the tree does not have a child
      on the left side, then trie_get_next_key (and bpftool map dump) will
      not look at the child on the right.  This leads to the traversal
      missing elements.
      
      Lookup is not affected.
      
      Update selftest to handle this case.
      
      Reproducer:
      
       bpftool map create /sys/fs/bpf/lpm type lpm_trie key 6 \
           value 1 entries 256 name test_lpm flags 1
       bpftool map update pinned /sys/fs/bpf/lpm key  8 0 0 0  0   0 value 1
       bpftool map update pinned /sys/fs/bpf/lpm key 16 0 0 0  0 128 value 2
       bpftool map dump   pinned /sys/fs/bpf/lpm
      
      Returns only 1 element. (2 expected)
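
In rough pseudo-C (not the upstream diff), the leftmost descent must not
stop at a missing left child:

  /* descend to the leftmost leaf of search_root */
  for (node = search_root; node;) {
          node_stack[++stack_ptr] = node;
          next = rcu_dereference(node->child[0]);
          if (!next)      /* no left child: continue down the right side */
                  next = rcu_dereference(node->child[1]);
          node = next;
  }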
      
      Fixes: b471f2f1 ("bpf: implement MAP_GET_NEXT_KEY command for LPM_TRIE")
Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
• cpu/speculation: Warn on unsupported mitigations= parameter · 0579feb4
  Geert Uytterhoeven authored
      commit 1bf72720 upstream.
      
      Currently, if the user specifies an unsupported mitigation strategy on the
      kernel command line, it will be ignored silently.  The code will fall back
      to the default strategy, possibly leaving the system more vulnerable than
      expected.
      
      This may happen due to e.g. a simple typo, or, for a stable kernel release,
      because not all mitigation strategies have been backported.
      
      Inform the user by printing a message.
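
Sketch of the change in mitigations_parse_cmdline():

  static int __init mitigations_parse_cmdline(char *arg)
  {
          if (!strcmp(arg, "off"))
                  cpu_mitigations = CPU_MITIGATIONS_OFF;
          else if (!strcmp(arg, "auto"))
                  cpu_mitigations = CPU_MITIGATIONS_AUTO;
          else if (!strcmp(arg, "auto,nosmt"))
                  cpu_mitigations = CPU_MITIGATIONS_AUTO_NOSMT;
          else
                  /* new: don't fall back to the default strategy silently */
                  pr_crit("Unsupported mitigations=%s, system may still be vulnerable\n",
                          arg);

          return 0;
  }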
      
      Fixes: 98af8452 ("cpu/speculation: Add 'mitigations=' cmdline option")
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Ben Hutchings <ben@decadent.org.uk>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20190516070935.22546-1-geert@linux-m68k.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
• Revert "x86/uaccess, ftrace: Fix ftrace_likely_update() vs. SMAP" · 1d6d11fc
  Sasha Levin authored
      This reverts commit 1a3188d737ceb922166d8fe78a5fc4f89907e31b, which was
      upstream commit 4a6c91fbdef846ec7250b82f2eeeb87ac5f18cf9.
      
      On Tue, Jun 25, 2019 at 09:39:45AM +0200, Sebastian Andrzej Siewior wrote:
      >Please backport commit e74deb11931ff682b59d5b9d387f7115f689698e to
      >stable _or_ revert the backport of commit 4a6c91fbdef84 ("x86/uaccess,
      >ftrace: Fix ftrace_likely_update() vs. SMAP"). It uses
      >user_access_{save|restore}() which has been introduced in the following
      >commit.
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
• tracing: Silence GCC 9 array bounds warning · a66fb5c9
  Miguel Ojeda authored
      commit 0c97bf863efce63d6ab7971dad811601e6171d2f upstream.
      
      Starting with GCC 9, -Warray-bounds detects cases when memset is called
      starting on a member of a struct but the size to be cleared ends up
      writing over further members.
      
      Such a call happens in the trace code to clear, at once, all members
      after and including `seq` on struct trace_iterator:
      
          In function 'memset',
              inlined from 'ftrace_dump' at kernel/trace/trace.c:8914:3:
          ./include/linux/string.h:344:9: warning: '__builtin_memset' offset
          [8505, 8560] from the object at 'iter' is out of the bounds of
          referenced subobject 'seq' with type 'struct trace_seq' at offset
          4368 [-Warray-bounds]
            344 |  return __builtin_memset(p, c, size);
                |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
      
      In order to avoid GCC complaining about it, we compute the address
      ourselves by adding the offsetof distance instead of referring
      directly to the member.
      
      Since there are two places doing this clear (trace.c and trace_kdb.c),
      take the chance to move the workaround into a single place in
      the internal header.
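
The shared helper ends up looking roughly like this:

  /* Reset everything from the seq member onward; computing the address
   * via offsetof() instead of &iter->seq keeps -Warray-bounds quiet. */
  static __always_inline void trace_iterator_reset(struct trace_iterator *iter)
  {
          const size_t offset = offsetof(struct trace_iterator, seq);

          memset((char *)iter + offset, 0,
                 sizeof(struct trace_iterator) - offset);

          iter->pos = -1;
  }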
      
Link: http://lkml.kernel.org/r/20190523124535.GA12931@gmail.com
Signed-off-by: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
[ Removed unnecessary parenthesis around "iter" ]
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
• cgroup: check if cgroup root is alive in cgroupstats_show() · bc65f89f
  Bixuan Cui authored
      euler inclusion
      category: bugfix
      bugzilla: 8883
      CVE: NA
      
      -------------------------------------------------
      
      If a cgroup root is dying, show its hierarchy_id and num_cgroups
      as 0.
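
A minimal sketch of the idea (the internal patch is not quoted here, and
the exact liveness predicate is an assumption):

  /* in cgroupstats_show() (sketch): report zeros for a dying root
   * instead of stale hierarchy_id/num_cgroups values */
  if (cgroup_is_dead(&root->cgrp))
          seq_printf(m, "%s\t%d\t%d\t%d\n", name, 0, 0, enabled);
  else
          seq_printf(m, "%s\t%d\t%d\t%d\n", name, root->hierarchy_id,
                     atomic_read(&root->nr_cgrps), enabled);
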
Signed-off-by: Zefan Li <lizefan@huawei.com>
Tested-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Hanjun Guo <hanjun.guo@linaro.org>
Signed-off-by: Bixuan Cui <cuibixuan@huawei.com>
Reviewed-by: Miao Xie <miaoxie@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
• sched, rt: fix isolated CPUs leaving task_group indefinitely throttled · 1b91af3b
  Rui Xiang authored
      euler inclusion
category: bugfix
bugzilla: 7466
CVE: NA
      
      ----------------------------------------
      
Commit e221d028 ("sched,rt: fix isolated CPUs leaving root_task_group
indefinitely throttled") only fixes isolated CPUs leaving the
root_task_group; it does not fix other ordinary task_groups. In some
scenarios where we attach a task bound to isolated CPUs to a task_group,
the same problem occurs.

This commit fixes it.
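
The Euler patch is not quoted here; a plausible minimal form extends what
e221d028 did for root_task_group to every task_group in
do_sched_rt_period_timer():

  span = sched_rt_period_mask();
  #ifdef CONFIG_RT_GROUP_SCHED
          /* unthrottle against the online mask for every task_group,
           * not only root_task_group as e221d028 did */
          span = cpu_online_mask;
  #endif
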
Reported-and-tested-by: Likun <hw.likun@huawei.com>
Signed-off-by: Rui Xiang <rui.xiang@huawei.com>
Signed-off-by: Bixuan Cui <cuibixuan@huawei.com>
Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
• perf/ring-buffer: Always use {READ,WRITE}_ONCE() for rb->user_page data · 15fd91f5
  Peter Zijlstra authored
      [ Upstream commit 4d839dd9 ]
      
      We must use {READ,WRITE}_ONCE() on rb->user_page data such that
      concurrent usage will see whole values. A few key sites were missing
      this.
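
For example, the data_head publication becomes (sketch):

  /* store the whole value in one go; userspace must never see it torn */
  WRITE_ONCE(rb->user_page->data_head, head);
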
Suggested-by: Yabin Cui <yabinc@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: acme@kernel.org
      Cc: mark.rutland@arm.com
      Cc: namhyung@kernel.org
      Fixes: 7b732a75 ("perf_counter: new output ABI - part 1")
Link: http://lkml.kernel.org/r/20190517115418.394192145@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
• perf/ring_buffer: Add ordering to rb->nest increment · 4663f173
  Peter Zijlstra authored
      [ Upstream commit 3f9fbe9b ]
      
Similar to how decrementing rb->nest too early can cause data_head to
      (temporarily) be observed to go backward, so too can this happen when
      we increment too late.
      
      This barrier() ensures the rb->head load happens after the increment,
      both the one in the 'goto again' path, as the one from
      perf_output_get_handle() -- albeit very unlikely to matter for the
      latter.
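
Sketch of the ordering in perf_output_put_handle():

  again:
          /*
           * Make the rb->head load happen after the rb->nest increment;
           * otherwise we may use a head value from before an IRQ/NMI
           * that nested in between.
           */
          barrier();
          head = local_read(&rb->head);
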
Suggested-by: Yabin Cui <yabinc@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: acme@kernel.org
      Cc: mark.rutland@arm.com
      Cc: namhyung@kernel.org
      Fixes: ef60777c ("perf: Optimize the perf_output() path by removing IRQ-disables")
Link: http://lkml.kernel.org/r/20190517115418.309516009@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
• perf/ring_buffer: Fix exposing a temporarily decreased data_head · be10e62a
  Yabin Cui authored
      [ Upstream commit 1b038c6e ]
      
      In perf_output_put_handle(), an IRQ/NMI can happen in below location and
      write records to the same ring buffer:
      
      	...
      	local_dec_and_test(&rb->nest)
      	...                          <-- an IRQ/NMI can happen here
      	rb->user_page->data_head = head;
      	...
      
      In this case, a value A is written to data_head in the IRQ, then a value
      B is written to data_head after the IRQ. And A > B. As a result,
      data_head is temporarily decreased from A to B. And a reader may see
      data_head < data_tail if it read the buffer frequently enough, which
      creates unexpected behaviors.
      
      This can be fixed by moving dec(&rb->nest) to after updating data_head,
      which prevents the IRQ/NMI above from updating data_head.
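
Sketch of the resulting order in perf_output_put_handle():

  /* publish the new head first ... */
  WRITE_ONCE(rb->user_page->data_head, head);

  /* ... and only then drop the nest count: an IRQ/NMI from here on
   * updates data_head itself and can no longer be overwritten by our
   * (now stale) head value */
  barrier();
  local_set(&rb->nest, 0);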
      
      [ Split up by peterz. ]
Signed-off-by: Yabin Cui <yabinc@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: mark.rutland@arm.com
      Fixes: ef60777c ("perf: Optimize the perf_output() path by removing IRQ-disables")
Link: http://lkml.kernel.org/r/20190517115418.224478157@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
• timekeeping: Repair ktime_get_coarse*() granularity · fc6e0d14
  Thomas Gleixner authored
      commit e3ff9c36 upstream.
      
      Jason reported that the coarse ktime based time getters advance only once
      per second and not once per tick as advertised.
      
      The code reads only the monotonic base time, which advances once per
      second. The nanoseconds are accumulated on every tick in xtime_nsec up to
      a second and the regular time getters take this nanoseconds offset into
      account, but the ktime_get_coarse*() implementation fails to do so.
      
Add the accumulated xtime_nsec value to the monotonic base time to get the
proper per-tick advancing coarse time.
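
Sketch of the fix in ktime_get_coarse_with_offset():

  do {
          seq = read_seqcount_begin(&tk_core.seq);
          base = ktime_add(tk->tkr_mono.base, *offset);
          /* new: pick up the nanoseconds accumulated since the last
           * second boundary */
          nsecs = tk->tkr_mono.xtime_nsec >> tk->tkr_mono.shift;
  } while (read_seqcount_retry(&tk_core.seq, seq));

  return base + nsecs;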
      
      Fixes: b9ff604c ("timekeeping: Add ktime_get_coarse_with_offset")
Reported-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Jason A. Donenfeld <Jason@zx2c4.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Clemens Ladisch <clemens@ladisch.de>
      Cc: Sultan Alsawaf <sultan@kerneltoast.com>
      Cc: Waiman Long <longman@redhat.com>
      Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1906132136280.1791@nanos.tec.linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
• tracing: Prevent hist_field_var_ref() from accessing NULL tracing_map_elts · 76e1af66
  Tom Zanussi authored
      [ Upstream commit 55267c88c003a3648567beae7c90512d3e2ab15e ]
      
      hist_field_var_ref() is an implementation of hist_field_fn_t(), which
      can be called with a null tracing_map_elt elt param when assembling a
      key in event_hist_trigger().
      
      In the case of hist_field_var_ref() this doesn't make sense, because a
      variable can only be resolved by looking it up using an already
      assembled key i.e. a variable can't be used to assemble a key since
      the key is required in order to access the variable.
      
      Upper layers should prevent the user from constructing a key using a
      variable in the first place, but in case one slips through, it
      shouldn't cause a NULL pointer dereference.  Also if one does slip
      through, we want to know about it, so emit a one-time warning in that
      case.
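
The guard at the top of hist_field_var_ref() then reads:

  u64 var_val = 0;

  /* a variable cannot be resolved while the key is still being
   * assembled; warn once and bail out instead of dereferencing */
  if (WARN_ON_ONCE(!elt))
          return var_val;

  elt_data = elt->private_data;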
      
Link: http://lkml.kernel.org/r/64ec8dc15c14d305295b64cdfcc6b2b9dd14753f.1555597045.git.tom.zanussi@linux.intel.com
Reported-by: Vincent Bernat <vincent@bernat.ch>
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
• x86/uaccess, kcov: Disable stack protector · bb90ada4
  Peter Zijlstra authored
      [ Upstream commit 40ea97290b08be2e038b31cbb33097d1145e8169 ]
      
      New tooling noticed this mishap:
      
        kernel/kcov.o: warning: objtool: write_comp_data()+0x138: call to __stack_chk_fail() with UACCESS enabled
        kernel/kcov.o: warning: objtool: __sanitizer_cov_trace_pc()+0xd9: call to __stack_chk_fail() with UACCESS enabled
      
All the other instrumentation (KASAN, UBSAN) also has the stack protector
disabled.
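
The fix is a one-line kbuild change, roughly (sketch; exact flag set per
the upstream patch):

  # kernel/Makefile
  KCOV_INSTRUMENT_kcov.o := n
  CFLAGS_kcov.o := $(call cc-option, -fno-conserve-stack -fno-stack-protector)
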
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
• ptrace: restore smp_rmb() in __ptrace_may_access() · 4275c8f4
  Jann Horn authored
      commit f6581f5b55141a95657ef5742cf6a6bfa20a109f upstream.
      
      Restore the read memory barrier in __ptrace_may_access() that was deleted
      a couple years ago. Also add comments on this barrier and the one it pairs
      with to explain why they're there (as far as I understand).
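
Sketch of the restored barrier (the comment wording is paraphrased):

  /* Pairs with the write barrier on the credential-change side: do not
   * let the task->mm load be reordered before the cred checks above. */
  smp_rmb();
  mm = task->mm;
  if (mm &&
      ((get_dumpable(mm) != SUID_DUMP_USER) &&
       !ptrace_has_cap(mm->user_ns, mode)))
          return -EPERM;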
      
      Fixes: bfedb589 ("mm: Add a user_ns owner to mm_struct and fix ptrace permission checks")
      Cc: stable@vger.kernel.org
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
• signal/ptrace: Don't leak uninitialized kernel memory with PTRACE_PEEK_SIGINFO · db8c4150
  Eric W. Biederman authored
      [ Upstream commit f6e2aa91a46d2bc79fce9b93a988dbe7655c90c0 ]
      
      Recently syzbot in conjunction with KMSAN reported that
      ptrace_peek_siginfo can copy an uninitialized siginfo to userspace.
      Inspecting ptrace_peek_siginfo confirms this.
      
The problem is that off, when initialized from args.off, can be
initialized to a negative value.  At that point the "if (off >= 0)"
test to see if off became negative fails because off started out
negative.
      
      Prevent the core problem by adding a variable found that is only true
      if a siginfo is found and copied to a temporary in preparation for
      being copied to userspace.
      
      Prevent args.off from being truncated when being assigned to off by
      testing that off is <= the maximum possible value of off.  Convert off
      to an unsigned long so that we should not have to truncate args.off,
      we have well defined overflow behavior so if we add another check we
      won't risk fighting undefined compiler behavior, and so that we have a
      type whose maximum value is easy to test for.
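
Sketch of the resulting loop body in ptrace_peek_siginfo():

  unsigned long off = arg.off + i;        /* was: s32 off */
  bool found = false;

  spin_lock_irq(&child->sighand->siglock);
  list_for_each_entry(q, &pending->list, list) {
          if (!off--) {
                  found = true;
                  copy_siginfo(&info, &q->info);
                  break;
          }
  }
  spin_unlock_irq(&child->sighand->siglock);

  if (!found)     /* beyond the end of the list */
          break;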
      
      Cc: Andrei Vagin <avagin@gmail.com>
      Cc: stable@vger.kernel.org
      Reported-by: syzbot+0d602a1b0d8c95bdf299@syzkaller.appspotmail.com
      Fixes: 84c751bd ("ptrace: add ability to retrieve signals without removing from a queue (v4)")
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
• ntp: Allow TAI-UTC offset to be set to zero · b891236c
  Miroslav Lichvar authored
      [ Upstream commit fdc6bae940ee9eb869e493990540098b8c0fd6ab ]
      
      The ADJ_TAI adjtimex mode sets the TAI-UTC offset of the system clock.
      It is typically set by NTP/PTP implementations and it is automatically
      updated by the kernel on leap seconds. The initial value is zero (which
      applications may interpret as unknown), but this value cannot be set by
      adjtimex. This limitation seems to go back to the original "nanokernel"
      implementation by David Mills.
      
      Change the ADJ_TAI check to accept zero as a valid TAI-UTC offset in
      order to allow setting it back to the initial value.
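
The check itself becomes (sketch):

  if (txc->modes & ADJ_TAI &&
      txc->constant >= 0)     /* was: txc->constant > 0 */
          *time_tai = txc->constant;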
      
      Fixes: 153b5d05 ("ntp: support for TAI")
Suggested-by: Ondrej Mosnacek <omosnace@redhat.com>
Signed-off-by: Miroslav Lichvar <mlichvar@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Richard Cochran <richardcochran@gmail.com>
      Cc: Prarit Bhargava <prarit@redhat.com>
Link: https://lkml.kernel.org/r/20190417084833.7401-1-mlichvar@redhat.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
• bpf: fix undefined behavior in narrow load handling · 9818af73
  Krzesimir Nowak authored
      [ Upstream commit e2f7fc0ac6957cabff4cecf6c721979b571af208 ]
      
      Commit 31fd8581 ("bpf: permits narrower load from bpf program
      context fields") made the verifier add AND instructions to clear the
      unwanted bits with a mask when doing a narrow load. The mask is
      computed with
      
        (1 << size * 8) - 1
      
      where "size" is the size of the narrow load. When doing a 4 byte load
of an 8 byte field, the verifier shifts the literal 1 by 32 places to
      the left. This results in an overflow of a signed integer, which is an
      undefined behavior. Typically, the computed mask was zero, so the
      result of the narrow load ended up being zero too.
      
Cast the literal to long long to avoid overflows. Note that a narrow
load of a 4 byte field does not have the undefined behavior, because
the load size can only be either 1 or 2 bytes, so shifting 1 by 8 or
16 places will not overflow it. And reading 4 bytes would not be a
narrow load of a 4 byte field.
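
The fix at the 64-bit masking site (sketch; the 32-bit site gets the same
treatment):

  /* 1ULL keeps the shift defined even for size == 4 (shift by 32) */
  insn_buf[cnt++] = BPF_ALU64_IMM(BPF_AND, insn->dst_reg,
                                  (1ULL << size * 8) - 1);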
      
      Fixes: 31fd8581 ("bpf: permits narrower load from bpf program context fields")
Reviewed-by: Alban Crequy <alban@kinvolk.io>
Reviewed-by: Iago López Galeiras <iago@kinvolk.io>
Signed-off-by: Krzesimir Nowak <krzesimir@kinvolk.io>
Cc: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
• kernel/sys.c: prctl: fix false positive in validate_prctl_map() · 7986979c
  Cyrill Gorcunov authored
      [ Upstream commit a9e73998f9d705c94a8dca9687633adc0f24a19a ]
      
While validating a new map we require @start_data to be strictly less
than @end_data, which is fine for regular applications (this is why this
nit didn't trigger for so long).  These members are set by executable
loaders such as the ELF handlers, yet it is perfectly valid to have a
loadable data section with zero size in the file, in which case start_data
is equal to end_data once the kernel loader finishes.
      
      As a result when we're trying to restore such programs the procedure fails
      and the kernel returns -EINVAL.  From the image dump of a program:
      
       | "mm_start_code": "0x400000",
       | "mm_end_code": "0x8f5fb4",
       | "mm_start_data": "0xf1bfb0",
       | "mm_end_data": "0xf1bfb0",
      
Thus we need to change validate_prctl_map() to use the less-or-equal
operator instead of strictly-less.
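
Sketch of the change in validate_prctl_map():

  error  = __prctl_check_order(start_code, <, end_code);
  error |= __prctl_check_order(start_data, <=, end_data);  /* was: < */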
      
      Link: http://lkml.kernel.org/r/20190408143554.GY1421@uranus.lan
      Fixes: f606b77f ("prctl: PR_SET_MM -- introduce PR_SET_MM_MAP operation")
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: Andrey Vagin <avagin@gmail.com>
Cc: Dmitry Safonov <0x7f454c46@gmail.com>
Cc: Pavel Emelyanov <xemul@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
• sysctl: return -EINVAL if val violates minmax · d4efd873
  Christian Brauner authored
      [ Upstream commit e260ad01f0aa9e96b5386d5cd7184afd949dc457 ]
      
Currently, when userspace gives us values that overflow, e.g. for file-max
and other callers of __do_proc_doulongvec_minmax(), we simply ignore the
new value and leave the current value untouched.
      
This can be problematic as it gives the illusion that the limit has
indeed been bumped when in fact it failed.  This commit makes sure to
return EINVAL when an overflow is detected.  Please note that this is a
userspace-facing change.
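
Sketch of the change in __do_proc_doulongvec_minmax():

  if ((min && val < *min) || (max && val > *max)) {
          err = -EINVAL;          /* was: continue; i.e. silent ignore */
          break;
  }
  *i = val;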
      
Link: http://lkml.kernel.org/r/20190210203943.8227-4-christian@brauner.io
Signed-off-by: Christian Brauner <christian@brauner.io>
Acked-by: Luis Chamberlain <mcgrof@kernel.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Dominik Brodowski <linux@dominikbrodowski.net>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Joe Lawrence <joe.lawrence@redhat.com>
      Cc: Waiman Long <longman@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
• x86/power: Fix 'nosmt' vs hibernation triple fault during resume · 910ba4d7
  Jiri Kosina authored
      commit ec527c318036a65a083ef68d8ba95789d2212246 upstream.
      
      As explained in
      
      	0cc3cd21 ("cpu/hotplug: Boot HT siblings at least once")
      
      we always, no matter what, have to bring up x86 HT siblings during boot at
      least once in order to avoid first MCE bringing the system to its knees.
      
      That means that whenever 'nosmt' is supplied on the kernel command-line,
all the HT siblings are as a result sitting in mwait or cpuidle after
      going through the online-offline cycle at least once.
      
      This causes a serious issue though when a kernel, which saw 'nosmt' on its
      commandline, is going to perform resume from hibernation: if the resume
      from the hibernated image is successful, cr3 is flipped in order to point
      to the address space of the kernel that is being resumed, which in turn
      means that all the HT siblings are all of a sudden mwaiting on address
      which is no longer valid.
      
      That results in triple fault shortly after cr3 is switched, and machine
      reboots.
      
      Fix this by always waking up all the SMT siblings before initiating the
      'restore from hibernation' process; this guarantees that all the HT
      siblings will be properly carried over to the resumed kernel waiting in
      resume_play_dead(), and acted upon accordingly afterwards, based on the
      target kernel configuration.
      
Symmetrically, the resumed kernel has to push the SMT siblings to mwait
      again in case it has SMT disabled; this means it has to online all
      the siblings when resuming (so that they come out of hlt) and offline
      them again to let them reach mwait.
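
The resume-side handling, roughly (sketch of the upstream
arch_resume_nosmt()):

  int arch_resume_nosmt(void)
  {
          int ret = 0;

          /* If the resumed kernel runs with SMT disabled, bring the HT
           * siblings online (out of hlt) and offline them again so they
           * park in mwait proper. */
          if (cpu_smt_control == CPU_SMT_DISABLED ||
              cpu_smt_control == CPU_SMT_FORCE_DISABLED) {
                  enum cpuhp_smt_control old = cpu_smt_control;

                  ret = cpuhp_smt_enable();
                  if (ret)
                          return ret;
                  ret = cpuhp_smt_disable(old);
          }
          return ret;
  }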
      
      Cc: 4.19+ <stable@vger.kernel.org> # v4.19+
Debugged-by: Thomas Gleixner <tglx@linutronix.de>
Fixes: 0cc3cd21 ("cpu/hotplug: Boot HT siblings at least once")
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Acked-by: Pavel Machek <pavel@ucw.cz>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
• kernel/signal.c: trace_signal_deliver when signal_group_exit · f1bfcb97
  Zhenliang Wei authored
      commit 98af37d624ed8c83f1953b1b6b2f6866011fc064 upstream.
      
      In the fixes commit, removing SIGKILL from each thread signal mask and
      executing "goto fatal" directly will skip the call to
      "trace_signal_deliver".  At this point, the delivery tracking of the
      SIGKILL signal will be inaccurate.
      
      Therefore, we need to add trace_signal_deliver before "goto fatal" after
      executing sigdelset.
      
      Note: SEND_SIG_NOINFO matches the fact that SIGKILL doesn't have any info.
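
Sketch of the change in get_signal():

  if (signal_group_exit(signal)) {
          ksig->info.si_signo = signr = SIGKILL;
          sigdelset(&current->pending.signal, SIGKILL);
          /* new: keep SIGKILL delivery tracking accurate */
          trace_signal_deliver(SIGKILL, SEND_SIG_NOINFO,
                               &sighand->action[SIGKILL - 1]);
          recalc_sigpending();
          goto fatal;
  }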
      
      Link: http://lkml.kernel.org/r/20190425025812.91424-1-weizhenliang@huawei.com
      Fixes: cf43a757fd4944 ("signal: Restore the stop PTRACE_EVENT_EXIT")
Signed-off-by: Zhenliang Wei <weizhenliang@huawei.com>
Reviewed-by: Christian Brauner <christian@brauner.io>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Ivan Delalande <colona@arista.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Deepa Dinamani <deepa.kernel@gmail.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
• tracing: Avoid memory leak in predicate_parse() · 74792b82
  Tomas Bortoli authored
      commit dfb4a6f2191a80c8b790117d0ff592fd712d3296 upstream.
      
      In case of errors, predicate_parse() goes to the out_free label
      to free memory and to return an error code.
      
      However, predicate_parse() does not free the predicates of the
      temporary prog_stack array, thence leaking them.
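
Sketch of the fixed error path:

  out_free:
          kfree(op_stack);
          kfree(inverts);
          if (prog_stack) {
                  for (i = 0; prog_stack[i].pred; i++)
                          kfree(prog_stack[i].pred);
                  kfree(prog_stack);
          }
          return ERR_PTR(ret);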
      
      Link: http://lkml.kernel.org/r/20190528154338.29976-1-tomasbortoli@gmail.com
      
      Cc: stable@vger.kernel.org
      Fixes: 80765597 ("tracing: Rewrite filter logic to be simpler and faster")
      Reported-by: syzbot+6b8e0fb820e570c59e19@syzkaller.appspotmail.com
Signed-off-by: Tomas Bortoli <tomasbortoli@gmail.com>
[ Added protection around freeing prog_stack[i].pred ]
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • M
      jump_label: move 'asm goto' support test to Kconfig · a7928808
      Committed by Masahiro Yamada
      commit e9666d10a5677a494260d60d1fa0b73cc7646eb3 upstream.
      
      Currently, CONFIG_JUMP_LABEL just means "I _want_ to use jump label".
      
      The jump label is controlled by HAVE_JUMP_LABEL, which is defined
      like this:
      
        #if defined(CC_HAVE_ASM_GOTO) && defined(CONFIG_JUMP_LABEL)
        # define HAVE_JUMP_LABEL
        #endif
      
      We can improve this by testing 'asm goto' support in Kconfig and then
      making JUMP_LABEL depend on CC_HAS_ASM_GOTO.

      The ugly #ifdef HAVE_JUMP_LABEL goes away, and CONFIG_JUMP_LABEL
      matches the real kernel capability.
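
      Illustratively, the guard collapses from a derived macro to the Kconfig
      symbol itself (a sketch of the pattern, not the full diff):

        /* Before: compiler support tracked by a separate C macro */
        #if defined(CC_HAVE_ASM_GOTO) && defined(CONFIG_JUMP_LABEL)
        # define HAVE_JUMP_LABEL
        #endif

        #ifdef HAVE_JUMP_LABEL
        /* jump-label based fast path */
        #endif

        /* After: CONFIG_JUMP_LABEL already implies 'asm goto' support */
        #ifdef CONFIG_JUMP_LABEL
        /* jump-label based fast path */
        #endif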
      Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
      Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
      Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
      [nc: Fix trivial conflicts in 4.19
           arch/xtensa/kernel/jump_label.c doesn't exist yet
           Ensured CC_HAVE_ASM_GOTO and HAVE_JUMP_LABEL were sufficiently
           eliminated]
      Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Conflicts:
        kernel/module.c
        include/linux/module.h
      [yyl: adjust context]
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      a7928808
    • W
      ftrace: fix NULL pointer dereference in free_ftrace_func_mapper() · 8d70a2b8
      Committed by Wei Li
      hulk inclusion
      category: bugfix
      bugzilla: 16533
      CVE: NA
      
      -------------------------------------------------
      
      The mapper may be NULL when called from register_ftrace_function_probe()
      with probe->data == NULL.
      
      This issue can be reproduced as follows (it may sometimes be masked
      by compiler optimization):
      
      / # cat /sys/kernel/debug/tracing/set_ftrace_filter
      #### all functions enabled ####
      /sys/kernel/debug/tracing #  echo do_trap:dump > set_ftrace_filter
      [  559.030635] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
      [  559.034017] Mem abort info:
      [  559.034346]   ESR = 0x96000006
      [  559.035154]   Exception class = DABT (current EL), IL = 32 bits
      [  559.038036]   SET = 0, FnV = 0
      [  559.038403]   EA = 0, S1PTW = 0
      [  559.041292] Data abort info:
      [  559.041697]   ISV = 0, ISS = 0x00000006
      [  559.042081]   CM = 0, WnR = 0
      [  559.042655] user pgtable: 4k pages, 48-bit VAs, pgdp = (____ptrval____)
      [  559.043125] [0000000000000000] pgd=0000000429dc5003, pud=0000000429dc6003, pmd=0000000000000000
      [  559.046981] Internal error: Oops: 96000006 [#1] SMP
      [  559.050625] Dumping ftrace buffer:
      [  559.053384]    (ftrace buffer empty)
      
      Entering kdb (current=0xffff8003e8c457c0, pid 232) on processor 6 Oops: (null)
      due to oops @ 0xffff000008370f14
      CPU: 6 PID: 232 Comm: sh Not tainted 4.19.39+ #29
      Hardware name: linux,dummy-virt (DT)
      pstate: 60000005 (nZCv daif -PAN -UAO)
      pc : free_ftrace_func_mapper+0x2c/0x118
      lr : ftrace_count_free+0x68/0x80
      sp : ffff00000eceba30
      x29: ffff00000eceba30 x28: ffff8003e8d61480
      x27: ffff00000adeedb0 x26: 0000000000000001
      x25: ffff00000b5d0000 x24: 0000000000000000
      x23: ffff00000b1ea000 x22: ffff00000adee000
      x21: 0000000000000000 x20: ffff00000b5d05e8
      x19: ffff00000b5e2e90 x18: ffff8003e8ee0380
      x17: 0000000000000000 x16: 0000000000000000
      x15: 0000000000000000 x14: ffff000009f63400
      x13: 000000000026def4 x12: 0000000000002200
      x11: 0000000000000000 x10: 5a5a5a5a5a5a5a5a
      x9 : 0000000000000000 x8 : 0000000000000000
      x7 : 0000000000000000 x6 : ffff00000ad39748
      x5 : ffff0000083a4cf8 x4 : 0000000000000001
      x3 : 0000000000000001 x2 : ffff00000b5e2c88
      x1 : 0000000000000000 x0 : 0000000000000000
      Call trace:
       free_ftrace_func_mapper+0x2c/0x118
       ftrace_count_free+0x68/0x80
       release_probe+0xfc/0x1d0
       register_ftrace_function_probe+0x4b0/0x870
       ftrace_trace_probe_callback.isra.4+0xb8/0x180
       ftrace_dump_callback+0x50/0x70
       ftrace_regex_write.isra.30+0x290/0x3a8
       ftrace_filter_write+0x44/0x60
       __vfs_write+0x78/0x320
       vfs_write+0x14c/0x2d8
       ksys_write+0xa0/0x168
       __arm64_sys_write+0x3c/0x58
       el0_svc_common+0x41c/0x610
       el0_svc_handler+0x148/0x1d0
       el0_svc+0x8/0xc
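
      A minimal sketch of the guard this patch adds (placement assumed):

        void free_ftrace_func_mapper(struct ftrace_func_mapper *mapper,
                                     ftrace_mapper_func free_func)
        {
            /* probe->data may be NULL, so no mapper was ever allocated */
            if (!mapper)
                return;

            /* existing teardown of the hash entries continues below */
        }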
      Signed-off-by: Wei Li <liwei391@huawei.com>
      Reviewed-by: Cheng Jian <cj.chengjian@huawei.com>
      Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      8d70a2b8
    • P
      rcuperf: Fix cleanup path for invalid perf_type strings · 65dfa2a7
      Committed by Paul E. McKenney
      [ Upstream commit ad092c027713a68a34168942a5ef422e42e039f4 ]
      
      If the specified rcuperf.perf_type is not in the rcu_perf_init()
      function's perf_ops[] array, rcuperf prints some console messages and
      then invokes rcu_perf_cleanup() to set state so that a future torture
      test can run.  However, rcu_perf_cleanup() also attempts to end the
      test that didn't actually start, and in doing so relies on the value
      of cur_ops, a value that is not particularly relevant in this case.
      This can result in confusing output or even follow-on failures due to
      attempts to use facilities that have not been properly initialized.
      
      This commit therefore sets the value of cur_ops to NULL in this case and
      inserts a check near the beginning of rcu_perf_cleanup(), thus avoiding
      relying on an irrelevant cur_ops value.
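
      A sketch of the cleanup-side check (assuming torture_cleanup_end()
      must still run to reset the shared torture state):

        static void rcu_perf_cleanup(void)
        {
            /* If rcu_perf_init() never matched perf_type, nothing started. */
            if (!cur_ops) {
                torture_cleanup_end();
                return;
            }

            /* normal teardown continues below */
        }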
      Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      65dfa2a7
    • P
      rcutorture: Fix cleanup path for invalid torture_type strings · 31354cc0
      Committed by Paul E. McKenney
      [ Upstream commit b813afae7ab6a5e91b4e16cc567331d9c2ae1f04 ]
      
      If the specified rcutorture.torture_type is not in the rcu_torture_init()
      function's torture_ops[] array, rcutorture prints some console messages
      and then invokes rcu_torture_cleanup() to set state so that a future
      torture test can run.  However, rcu_torture_cleanup() also attempts to
      end the test that didn't actually start, and in doing so relies on the
      value of cur_ops, a value that is not particularly relevant in this case.
      This can result in confusing output or even follow-on failures due to
      attempts to use facilities that have not been properly initialized.
      
      This commit therefore sets the value of cur_ops to NULL in this case
      and inserts a check near the beginning of rcu_torture_cleanup(),
      thus avoiding relying on an irrelevant cur_ops value.
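
      The cleanup-side guard mirrors the rcuperf sketch above; the init-side
      half is roughly (exact placement assumed):

        /* in rcu_torture_init(), when torture_type matches no torture_ops[] */
        cur_ops = NULL;          /* let cleanup know nothing started */
        firsterr = -EINVAL;
        goto unwind;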
      Reported-by: kernel test robot <rong.a.chen@intel.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      31354cc0
    • T
      timekeeping: Force upper bound for setting CLOCK_REALTIME · 30db00a9
      Committed by Thomas Gleixner
      [ Upstream commit 7a8e61f8478639072d402a26789055a4a4de8f77 ]
      
      Several people reported testing failures after setting CLOCK_REALTIME close
      to the limits of the kernel internal representation in nanoseconds,
      i.e. year 2262.
      
      The failures are exposed in subsequent operations, i.e. when arming timers
      or when the advancing CLOCK_MONOTONIC makes the calculation of
      CLOCK_REALTIME overflow into negative space.
      
      Now people start to paper over the underlying problem by clamping
      calculations to the valid range, but that's just wrong because such
      workarounds will prevent detection of real issues as well.
      
      It is reasonable to force an upper bound for the various methods of setting
      CLOCK_REALTIME. Year 2262 is the absolute upper bound. Assume a maximum
      uptime of 30 years which is plenty enough even for esoteric embedded
      systems. That results in an upper bound of year 2232 for setting the time.
      
      Once that limit is reached in reality, this limit will be only a small
      part of the problem space. But until then it stops people from trying
      to paper over the problem in the wrong places.
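
      A sketch of the bound, with assumed constant and helper names (the
      identifiers in the actual patch may differ):

        #define UPTIME_SEC_MAX    (30LL * 365 * 24 * 3600)   /* ~30 years */
        #define SETTOD_SEC_MAX    (KTIME_SEC_MAX - UPTIME_SEC_MAX)

        /* used by do_settimeofday64()-style setters */
        static inline bool timespec64_valid_settod(const struct timespec64 *ts)
        {
            return timespec64_valid(ts) && ts->tv_sec <= SETTOD_SEC_MAX;
        }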
      Reported-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
      Reported-by: Hongbo Yao <yaohongbo@huawei.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Stephen Boyd <sboyd@kernel.org>
      Cc: Miroslav Lichvar <mlichvar@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Richard Cochran <richardcochran@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1903231125480.2157@nanos.tec.linutronix.de
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      30db00a9
    • P
      x86/uaccess, ftrace: Fix ftrace_likely_update() vs. SMAP · 6537a7d5
      Committed by Peter Zijlstra
      [ Upstream commit 4a6c91fbdef846ec7250b82f2eeeb87ac5f18cf9 ]
      
      For CONFIG_TRACE_BRANCH_PROFILING=y the likely()/unlikely() annotations
      get overloaded and generate callouts to this code, and thus are also
      invoked while AC=1 (user access enabled).
      
      Make it safe.
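
      A sketch of the fix: save and restore the user-access (AC) state around
      the tracer body so it is SMAP-safe:

        void ftrace_likely_update(struct ftrace_likely_data *f, int val,
                                  int expect, int is_constant)
        {
            unsigned long flags = user_access_save();

            /* branch-profiling bookkeeping runs with AC state saved */

            user_access_restore(flags);
        }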
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      6537a7d5
    • N
      irq_work: Do not raise an IPI when queueing work on the local CPU · b20a00cb
      Committed by Nicholas Piggin
      [ Upstream commit 471ba0e686cb13752bc1ff3216c54b69a2d250ea ]
      
      The QEMU PowerPC/PSeries machine model was not expecting a self-IPI,
      and a self-IPI may be a surprising thing to do in any case, so have
      irq_work_queue_on() do local queueing when the target is the current
      CPU.
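
      A simplified sketch of the dispatch (preemption guards elided; the
      local-queueing helper name is assumed):

        bool irq_work_queue_on(struct irq_work *work, int cpu)
        {
            /* only claim work that is not already pending */
            if (!irq_work_claim(work))
                return false;

            if (cpu != smp_processor_id()) {
                /* remote CPU: queue there and kick it with an IPI */
                if (llist_add(&work->llnode, &per_cpu(raised_list, cpu)))
                    arch_send_call_function_single_ipi(cpu);
            } else {
                /* local CPU: plain queueing, no self-IPI */
                __irq_work_queue_local(work);
            }
            return true;
        }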
      Suggested-by: Steven Rostedt <rostedt@goodmis.org>
      Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Cédric Le Goater <clg@kaod.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lkml.kernel.org/r/20190409093403.20994-1-npiggin@gmail.com
      [ Simplified the preprocessor comments.
        Fixed unbalanced curly brackets pointed out by Thomas. ]
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      b20a00cb
    • K
      sched/core: Handle overflow in cpu_shares_write_u64 · ffc10fd4
      Committed by Konstantin Khlebnikov
      [ Upstream commit 5b61d50ab4ef590f5e1d4df15cd2cea5f5715308 ]
      
      The bit shift in scale_load() could overflow shares. This patch
      saturates the value to MAX_SHARES, as sched_group_set_shares() does.
      
      Example:
      
       # echo 9223372036854776832 > cpu.shares
       # cat cpu.shares
      
      Before patch: 1024
      After patch: 262144
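
      The write handler then clamps before scaling (a sketch matching the
      description above):

        static int cpu_shares_write_u64(struct cgroup_subsys_state *css,
                                        struct cftype *cftype, u64 shareval)
        {
            /* saturate instead of letting scale_load() shift bits away */
            if (shareval > scale_load_down(ULONG_MAX))
                shareval = MAX_SHARES;
            return sched_group_set_shares(css_tg(css), scale_load(shareval));
        }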
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/155125501891.293431.3345233332801109696.stgit@buzz
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      ffc10fd4
    • K
      sched/rt: Check integer overflow at usec to nsec conversion · 6896f520
      Committed by Konstantin Khlebnikov
      [ Upstream commit 1a010e29cfa00fee2888fd2fd4983f848cbafb58 ]
      
      Example of unhandled overflows:
      
       # echo 18446744073709651 > cpu.rt_runtime_us
       # cat cpu.rt_runtime_us
       99
      
       # echo 18446744073709900 > cpu.rt_period_us
       # cat cpu.rt_period_us
       348
      
      After this patch they will fail with -EINVAL.
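
      A sketch of the guard at the usec-to-nsec conversion (exact placement
      assumed):

        /* reject values whose scaling to nanoseconds would wrap u64 */
        if (rt_period_us > U64_MAX / NSEC_PER_USEC)
            return -EINVAL;
        rt_period = rt_period_us * NSEC_PER_USEC;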
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/155125501739.293431.5252197504404771496.stgit@buzz
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      6896f520
    • K
      sched/core: Check quota and period overflow at usec to nsec conversion · e046bf8d
      Committed by Konstantin Khlebnikov
      [ Upstream commit 1a8b4540db732ca16c9e43ac7c08b1b8f0b252d8 ]
      
      Large values could overflow u64 and pass following sanity checks.
      
       # echo 18446744073750000 > cpu.cfs_period_us
       # cat cpu.cfs_period_us
       40448
      
       # echo 18446744073750000 > cpu.cfs_quota_us
       # cat cpu.cfs_quota_us
       40448
      
      After this patch they will fail with -EINVAL.
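
      The quota path gets the same treatment (a sketch; the period path is
      analogous):

        if (cfs_quota_us < 0)
            quota = RUNTIME_INF;
        else if ((u64)cfs_quota_us <= U64_MAX / NSEC_PER_USEC)
            quota = (u64)cfs_quota_us * NSEC_PER_USEC;
        else
            return -EINVAL;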
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/155125502079.293431.3947497929372138600.stgit@buzz
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      e046bf8d
    • R
      cgroup: protect cgroup->nr_(dying_)descendants by css_set_lock · 6b86cecb
      Committed by Roman Gushchin
      [ Upstream commit 4dcabece4c3a9f9522127be12cc12cc120399b2f ]
      
      The number of descendant cgroups and the number of dying
      descendant cgroups are currently synchronized using the cgroup_mutex.
      
      The number of descendant cgroups will be required by the cgroup v2
      freezer, which will use it to determine if a cgroup is frozen
      (depending on total number of descendants and number of frozen
      descendants). It's not always acceptable to grab the cgroup_mutex,
      especially from quite hot paths (e.g. exit()).
      
      To avoid this, let's additionally synchronize these counters using
      the css_set_lock.
      
      So it's safe to read these counters with either cgroup_mutex or
      css_set_lock held, and both locks must be acquired to change them.
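
      A sketch of the resulting locking rule (the counter update shown is
      illustrative):

        /* writers take both locks... */
        lockdep_assert_held(&cgroup_mutex);
        spin_lock_irq(&css_set_lock);
        cgrp->nr_descendants++;
        spin_unlock_irq(&css_set_lock);

        /* ...so a reader holding either lock sees a stable value */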
      Signed-off-by: NRoman Gushchin <guro@fb.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: kernel-team@fb.com
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      6b86cecb