1. 31 May 2019, 12 commits
    • x86/uaccess, ftrace: Fix ftrace_likely_update() vs. SMAP · 1a3188d7
      Peter Zijlstra authored
      [ Upstream commit 4a6c91fbdef846ec7250b82f2eeeb87ac5f18cf9 ]
      
      For CONFIG_TRACE_BRANCH_PROFILING=y the likely/unlikely macros are
      overloaded and generate calls out to this code, and thus can also run
      when AC=1.
      
      Make it safe.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      1a3188d7
    • irq_work: Do not raise an IPI when queueing work on the local CPU · afee27f3
      Nicholas Piggin authored
      [ Upstream commit 471ba0e686cb13752bc1ff3216c54b69a2d250ea ]
      
      The QEMU PowerPC/PSeries machine model was not expecting a self-IPI,
      and it is arguably a surprising thing to do, so have irq_work_queue_on()
      do local queueing when the target is the current CPU.
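
      A condensed sketch of the resulting behaviour (the queueing helpers
      are internal to kernel/irq_work.c and are simplified here):

        bool irq_work_queue_on(struct irq_work *work, int cpu)
        {
                /* Only queue if the work is not already pending */
                if (!irq_work_claim(work))
                        return false;

                preempt_disable();
                if (cpu == smp_processor_id()) {
                        /* Local CPU: take the ordinary local queueing
                         * path, no self-IPI is raised. */
                        irq_work_queue(work);
                } else {
                        if (llist_add(&work->llnode, &per_cpu(raised_list, cpu)))
                                arch_send_call_function_single_ipi(cpu);
                }
                preempt_enable();

                return true;
        }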
      Suggested-by: Steven Rostedt <rostedt@goodmis.org>
      Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Cédric Le Goater <clg@kaod.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lkml.kernel.org/r/20190409093403.20994-1-npiggin@gmail.com
      [ Simplified the preprocessor comments.
        Fixed unbalanced curly brackets pointed out by Thomas. ]
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      afee27f3
    • sched/core: Handle overflow in cpu_shares_write_u64 · 355673f8
      Konstantin Khlebnikov authored
      [ Upstream commit 5b61d50ab4ef590f5e1d4df15cd2cea5f5715308 ]
      
      The bit shift in scale_load() can overflow the shares value. Saturate
      it to MAX_SHARES instead, following what sched_group_set_shares() does.
      
      Example:
      
       # echo 9223372036854776832 > cpu.shares
       # cat cpu.shares
      
      Before patch: 1024
      After patch:  262144
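
      The arithmetic is easy to reproduce in plain C; the constants below
      are assumed to match the 64-bit kernel definitions behind
      scale_load() and MAX_SHARES:

        #include <stdio.h>
        #include <stdint.h>

        #define SCHED_FIXEDPOINT_SHIFT 10            /* scale_load() shift */
        #define MAX_SHARES             (1UL << 18)   /* saturation value   */

        int main(void)
        {
                uint64_t shareval = 9223372036854776832ULL;  /* 2^63 + 1024 */

                /* Unchecked scale_load(): the shift wraps around u64, so
                 * scaling back down yields a bogus small value. */
                uint64_t scaled = shareval << SCHED_FIXEDPOINT_SHIFT;
                printf("before patch: %llu\n",
                       (unsigned long long)(scaled >> SCHED_FIXEDPOINT_SHIFT));

                /* Saturating first, as the patch does, keeps the value sane */
                if (shareval > (UINT64_MAX >> SCHED_FIXEDPOINT_SHIFT))
                        shareval = MAX_SHARES;
                printf("after patch:  %llu\n", (unsigned long long)shareval);
                return 0;
        }

      Compiled as-is, this prints 1024 and 262144, matching the two cases
      above.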
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/155125501891.293431.3345233332801109696.stgit@buzz
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      355673f8
    • sched/rt: Check integer overflow at usec to nsec conversion · 7053046e
      Konstantin Khlebnikov authored
      [ Upstream commit 1a010e29cfa00fee2888fd2fd4983f848cbafb58 ]
      
      Example of unhandled overflows:
      
       # echo 18446744073709651 > cpu.rt_runtime_us
       # cat cpu.rt_runtime_us
       99
      
       # echo 18446744073709900 > cpu.rt_period_us
       # cat cpu.rt_period_us
       348
      
      After this patch they will fail with -EINVAL.
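
      The wraparound is visible in plain C (NSEC_PER_USEC as in the
      kernel; the guard mirrors the idea of the patch):

        #include <stdio.h>
        #include <stdint.h>

        #define NSEC_PER_USEC 1000ULL

        int main(void)
        {
                uint64_t us = 18446744073709651ULL;  /* value from above */
                uint64_t ns = us * NSEC_PER_USEC;    /* wraps around u64 */

                /* What the kernel effectively stored before the patch */
                printf("stored: %llu us\n",
                       (unsigned long long)(ns / NSEC_PER_USEC));  /* 99 */

                /* Overflow check in the spirit of the patch */
                if (us > UINT64_MAX / NSEC_PER_USEC)
                        printf("would now be rejected with -EINVAL\n");
                return 0;
        }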
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/155125501739.293431.5252197504404771496.stgit@buzz
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      7053046e
    • sched/core: Check quota and period overflow at usec to nsec conversion · 925275d0
      Konstantin Khlebnikov authored
      [ Upstream commit 1a8b4540db732ca16c9e43ac7c08b1b8f0b252d8 ]
      
      Large values can overflow u64 and pass the following sanity checks.
      
       # echo 18446744073750000 > cpu.cfs_period_us
       # cat cpu.cfs_period_us
       40448
      
       # echo 18446744073750000 > cpu.cfs_quota_us
       # cat cpu.cfs_quota_us
       40448
      
      After this patch they will fail with -EINVAL.
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/155125502079.293431.3947497929372138600.stgit@buzz
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      925275d0
    • cgroup: protect cgroup->nr_(dying_)descendants by css_set_lock · 4e4d5cea
      Roman Gushchin authored
      [ Upstream commit 4dcabece4c3a9f9522127be12cc12cc120399b2f ]
      
      The number of descendant cgroups and the number of dying
      descendant cgroups are currently synchronized using the cgroup_mutex.
      
      The number of descendant cgroups will be required by the cgroup v2
      freezer, which will use it to determine if a cgroup is frozen
      (depending on total number of descendants and number of frozen
      descendants). It's not always acceptable to grab the cgroup_mutex,
      especially from quite hot paths (e.g. exit()).
      
      To avoid this, let's additionally synchronize these counters using
      the css_set_lock.
      
      So, it's safe to read these counters with either cgroup_mutex or
      css_set_lock locked, and for changing them both locks must be acquired.
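
      A sketch of the resulting locking rule (the counter and lock names
      are from the commit; the functions themselves are illustrative):

        /* Writers take both locks, so a reader holding either one of
         * them sees a stable value. */
        static void cgroup_inc_dying_descendants(struct cgroup *cgrp)
        {
                lockdep_assert_held(&cgroup_mutex);

                spin_lock_irq(&css_set_lock);
                cgrp->nr_dying_descendants++;
                spin_unlock_irq(&css_set_lock);
        }

        /* Readers on hot paths (e.g. exit()) can now get away with
         * css_set_lock alone instead of the heavier cgroup_mutex. */
        static bool cgroup_has_descendants(struct cgroup *cgrp)
        {
                bool ret;

                spin_lock_irq(&css_set_lock);
                ret = cgrp->nr_descendants > 0;
                spin_unlock_irq(&css_set_lock);
                return ret;
        }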
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: kernel-team@fb.com
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      4e4d5cea
    • audit: fix a memory leak bug · 6c21fa84
      Wenwen Wang authored
      [ Upstream commit 70c4cf17e445264453bc5323db3e50aa0ac9e81f ]
      
      In audit_rule_change(), audit_data_to_entry() is first invoked to
      translate the payload data to the kernel's rule representation. In
      audit_data_to_entry(), depending on the audit field type, an audit tree
      may be created in audit_make_tree(), which eventually invokes kmalloc()
      to allocate the tree. Since this tree is a temporary tree, it is then
      freed later in the execution path, e.g., by audit_add_rule() if the
      message type is AUDIT_ADD_RULE, or by audit_del_rule() if the message
      type is AUDIT_DEL_RULE. However, if the message type is neither
      AUDIT_ADD_RULE nor AUDIT_DEL_RULE, i.e., the default case of the switch
      statement, this temporary tree is not freed.
      
      To fix this issue, only allocate the tree when the type is AUDIT_ADD_RULE
      or AUDIT_DEL_RULE.
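
      A sketch of the fixed control flow in audit_rule_change() (condensed;
      error handling abbreviated):

        switch (type) {
        case AUDIT_ADD_RULE:
                entry = audit_data_to_entry(data, datasz); /* may allocate a tree */
                if (IS_ERR(entry))
                        return PTR_ERR(entry);
                err = audit_add_rule(entry);   /* consumes or frees the tree */
                break;
        case AUDIT_DEL_RULE:
                entry = audit_data_to_entry(data, datasz);
                if (IS_ERR(entry))
                        return PTR_ERR(entry);
                err = audit_del_rule(entry);
                break;
        default:
                /* Nothing was allocated for unknown types,
                 * so nothing can leak here anymore. */
                WARN_ON(1);
                err = -EINVAL;
        }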
      Signed-off-by: Wenwen Wang <wang6495@umn.edu>
      Reviewed-by: Richard Guy Briggs <rgb@redhat.com>
      Signed-off-by: Paul Moore <paul@paul-moore.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      6c21fa84
    • sched/nohz: Run NOHZ idle load balancer on HK_FLAG_MISC CPUs · 07da741d
      Nicholas Piggin authored
      [ Upstream commit 9b019acb72e4b5741d88e8936d6f200ed44b66b2 ]
      
      The NOHZ idle balancer runs on the lowest idle CPU. This can
      interfere with isolated CPUs, so confine it to HK_FLAG_MISC
      housekeeping CPUs.
      
      HK_FLAG_SCHED is not used for this because it is not set anywhere
      at the moment. This could be folded into HK_FLAG_SCHED once that
      option is fixed.
      
      The problem was observed with increased jitter on an application
      running on CPU0, caused by NOHZ idle load balancing being run on
      CPU1 (an SMT sibling).
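
      The selection change boils down to intersecting the idle-CPU mask with
      the housekeeping mask; a condensed sketch of find_new_ilb():

        static inline int find_new_ilb(void)
        {
                int ilb;

                /* Only consider HK_FLAG_MISC housekeeping CPUs, so the
                 * NOHZ idle balancer never lands on an isolated CPU. */
                for_each_cpu_and(ilb, nohz.idle_cpus_mask,
                                 housekeeping_cpumask(HK_FLAG_MISC)) {
                        if (idle_cpu(ilb))
                                return ilb;
                }

                return nr_cpu_ids;
        }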
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lkml.kernel.org/r/20190412042613.28930-1-npiggin@gmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      07da741d
    • x86/modules: Avoid breaking W^X while loading modules · 8715ce03
      Nadav Amit authored
      [ Upstream commit f2c65fb3221adc6b73b0549fc7ba892022db9797 ]
      
      When modules and BPF filters are loaded, there is a time window in
      which some memory is both writable and executable. An attacker that has
      already found another vulnerability (e.g., a dangling pointer) might be
      able to exploit this behavior to overwrite kernel code. Prevent having
      writable executable PTEs in this stage.
      
      In addition, avoiding W+X mappings also slightly simplifies the
      patching of module code at initialization (e.g., by alternatives and
      static keys), as done in the next patch. This was actually the
      main motivation for this patch.

      To avoid W+X mappings, set module memory initially as RW (NX), and only
      after it has been set RO, set it as X as well. Making it executable is
      done as a separate step to avoid a window in which one core still has
      the old PTE cached (hence writable) while another already sees the
      updated PTE (executable), which would break the W^X protection.
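
      A sketch of the ordering, using the real set_memory_*() helpers (the
      surrounding function is condensed from the module-loading path and
      its name here is illustrative):

        static void module_protect_text(unsigned long base, unsigned int pages)
        {
                /* Text was mapped RW+NX while it was being written. */
                set_memory_ro(base, pages);   /* step 1: drop write    */
                set_memory_x(base, pages);    /* step 2: grant execute */

                /* Because execute is only granted after write is gone,
                 * no core can ever observe a writable+executable PTE. */
        }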
      Suggested-by: Thomas Gleixner <tglx@linutronix.de>
      Suggested-by: Andy Lutomirski <luto@amacapital.net>
      Signed-off-by: Nadav Amit <namit@vmware.com>
      Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: <akpm@linux-foundation.org>
      Cc: <ard.biesheuvel@linaro.org>
      Cc: <deneen.t.dock@intel.com>
      Cc: <kernel-hardening@lists.openwall.com>
      Cc: <kristen@linux.intel.com>
      Cc: <linux_dti@icloud.com>
      Cc: <will.deacon@arm.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jessica Yu <jeyu@kernel.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Link: https://lkml.kernel.org/r/20190426001143.4983-12-namit@vmware.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      8715ce03
    • acct_on(): don't mess with freeze protection · 7c2bcb3c
      Al Viro authored
      commit 9419a3191dcb27f24478d288abaab697228d28e6 upstream.
      
      What happens there is that we are replacing file->path.mnt of
      a file we'd just opened with a clone and we need the write
      count contribution to be transferred from original mount to
      new one.  That's it.  We do *NOT* want any kind of freeze
      protection for the duration of switchover.
      
      IOW, we should just use __mnt_{want,drop}_write() for that
      switchover; no need to bother with mnt_{want,drop}_write()
      there.
      Tested-by: Amir Goldstein <amir73il@gmail.com>
      Reported-by: syzbot+2a73a6ea9507b7112141@syzkaller.appspotmail.com
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      7c2bcb3c
    • bpf: devmap: fix use-after-free Read in __dev_map_entry_free · 003e2d74
      Eric Dumazet authored
      commit 2baae3545327632167c0180e9ca1d467416f1919 upstream.
      
      synchronize_rcu() is fine when the RCU callbacks only need to free
      memory (kfree_rcu(), or RCU callbacks that directly call kfree()).

      __dev_map_entry_free() is a bit more complex, so we need to make
      sure that all queued __dev_map_entry_free() callbacks have completed.
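
      In other words, the teardown path needs rcu_barrier(), which waits for
      already-queued callbacks to finish, not just synchronize_rcu(), which
      only waits for readers. A condensed sketch of the fix in dev_map_free():

        static void dev_map_free(struct bpf_map *map)
        {
                struct bpf_dtab *dtab = container_of(map, struct bpf_dtab, map);

                /* Wait until every queued __dev_map_entry_free() callback
                 * has completed, so freeing dtab cannot race with them. */
                rcu_barrier();

                /* ... release the entries, then free dtab ... */
                kfree(dtab);
        }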
      
      syzbot report:
      
      BUG: KASAN: use-after-free in dev_map_flush_old kernel/bpf/devmap.c:365
      [inline]
      BUG: KASAN: use-after-free in __dev_map_entry_free+0x2a8/0x300
      kernel/bpf/devmap.c:379
      Read of size 8 at addr ffff8801b8da38c8 by task ksoftirqd/1/18
      
      CPU: 1 PID: 18 Comm: ksoftirqd/1 Not tainted 4.17.0+ #39
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
      Google 01/01/2011
      Call Trace:
        __dump_stack lib/dump_stack.c:77 [inline]
        dump_stack+0x1b9/0x294 lib/dump_stack.c:113
        print_address_description+0x6c/0x20b mm/kasan/report.c:256
        kasan_report_error mm/kasan/report.c:354 [inline]
        kasan_report.cold.7+0x242/0x2fe mm/kasan/report.c:412
        __asan_report_load8_noabort+0x14/0x20 mm/kasan/report.c:433
        dev_map_flush_old kernel/bpf/devmap.c:365 [inline]
        __dev_map_entry_free+0x2a8/0x300 kernel/bpf/devmap.c:379
        __rcu_reclaim kernel/rcu/rcu.h:178 [inline]
        rcu_do_batch kernel/rcu/tree.c:2558 [inline]
        invoke_rcu_callbacks kernel/rcu/tree.c:2818 [inline]
        __rcu_process_callbacks kernel/rcu/tree.c:2785 [inline]
        rcu_process_callbacks+0xe9d/0x1760 kernel/rcu/tree.c:2802
        __do_softirq+0x2e0/0xaf5 kernel/softirq.c:284
        run_ksoftirqd+0x86/0x100 kernel/softirq.c:645
        smpboot_thread_fn+0x417/0x870 kernel/smpboot.c:164
        kthread+0x345/0x410 kernel/kthread.c:240
        ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:412
      
      Allocated by task 6675:
        save_stack+0x43/0xd0 mm/kasan/kasan.c:448
        set_track mm/kasan/kasan.c:460 [inline]
        kasan_kmalloc+0xc4/0xe0 mm/kasan/kasan.c:553
        kmem_cache_alloc_trace+0x152/0x780 mm/slab.c:3620
        kmalloc include/linux/slab.h:513 [inline]
        kzalloc include/linux/slab.h:706 [inline]
        dev_map_alloc+0x208/0x7f0 kernel/bpf/devmap.c:102
        find_and_alloc_map kernel/bpf/syscall.c:129 [inline]
        map_create+0x393/0x1010 kernel/bpf/syscall.c:453
        __do_sys_bpf kernel/bpf/syscall.c:2351 [inline]
        __se_sys_bpf kernel/bpf/syscall.c:2328 [inline]
        __x64_sys_bpf+0x303/0x510 kernel/bpf/syscall.c:2328
        do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:290
        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Freed by task 26:
        save_stack+0x43/0xd0 mm/kasan/kasan.c:448
        set_track mm/kasan/kasan.c:460 [inline]
        __kasan_slab_free+0x11a/0x170 mm/kasan/kasan.c:521
        kasan_slab_free+0xe/0x10 mm/kasan/kasan.c:528
        __cache_free mm/slab.c:3498 [inline]
        kfree+0xd9/0x260 mm/slab.c:3813
        dev_map_free+0x4fa/0x670 kernel/bpf/devmap.c:191
        bpf_map_free_deferred+0xba/0xf0 kernel/bpf/syscall.c:262
        process_one_work+0xc64/0x1b70 kernel/workqueue.c:2153
        worker_thread+0x181/0x13a0 kernel/workqueue.c:2296
        kthread+0x345/0x410 kernel/kthread.c:240
        ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:412
      
      The buggy address belongs to the object at ffff8801b8da37c0
        which belongs to the cache kmalloc-512 of size 512
      The buggy address is located 264 bytes inside of
        512-byte region [ffff8801b8da37c0, ffff8801b8da39c0)
      The buggy address belongs to the page:
      page:ffffea0006e368c0 count:1 mapcount:0 mapping:ffff8801da800940
      index:0xffff8801b8da3540
      flags: 0x2fffc0000000100(slab)
      raw: 02fffc0000000100 ffffea0007217b88 ffffea0006e30cc8 ffff8801da800940
      raw: ffff8801b8da3540 ffff8801b8da3040 0000000100000004 0000000000000000
      page dumped because: kasan: bad access detected
      
      Memory state around the buggy address:
        ffff8801b8da3780: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
        ffff8801b8da3800: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      > ffff8801b8da3880: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                                                     ^
        ffff8801b8da3900: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
        ffff8801b8da3980: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
      
      Fixes: 546ac1ff ("bpf: add devmap, a map for storing net device references")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot+457d3e2ffbcf31aee5c0@syzkaller.appspotmail.com
      Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      003e2d74
    • bpf: add bpf_jit_limit knob to restrict unpriv allocations · 43caa29c
      Daniel Borkmann authored
      commit ede95a63b5e84ddeea6b0c473b36ab8bfd8c6ce3 upstream.
      
      Rick reported that the BPF JIT could potentially fill the entire module
      space with BPF programs from unprivileged users which would prevent later
      attempts to load normal kernel modules or privileged BPF programs, for
      example. If the JIT was enabled but failed to generate the image, then
      before commit 290af866 ("bpf: introduce BPF_JIT_ALWAYS_ON config")
      we would always fall back to the BPF interpreter. Nowadays, when
      CONFIG_BPF_JIT_ALWAYS_ON is set, the load instead aborts with a failure
      since the BPF interpreter has been compiled out.
      
      Add a global limit and enforce it for unprivileged users, such that
      with the BPF interpreter compiled out we fail once the limit has been
      reached, or we fall back to the BPF interpreter earlier without using
      module memory if the latter was compiled in. In a next step, fair
      sharing among unprivileged users can be addressed, in particular for
      the case where we would otherwise fail hard once the limit is reached.
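
      A sketch of the charging scheme (very close in shape to the upstream
      helper; bpf_jit_limit is the new knob):

        static atomic_long_t bpf_jit_current;   /* pages currently charged */

        static int bpf_jit_charge_modmem(u32 pages)
        {
                if (atomic_long_add_return(pages, &bpf_jit_current) >
                    (bpf_jit_limit >> PAGE_SHIFT)) {
                        /* Privileged users may exceed the limit */
                        if (!capable(CAP_SYS_ADMIN)) {
                                atomic_long_sub(pages, &bpf_jit_current);
                                return -EPERM;
                        }
                }

                return 0;
        }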
      
      Fixes: 290af866 ("bpf: introduce BPF_JIT_ALWAYS_ON config")
      Fixes: 0a14842f ("net: filter: Just In Time compiler for x86-64")
      Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: LKML <linux-kernel@vger.kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Cc: Ben Hutchings <ben.hutchings@codethink.co.uk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      43caa29c
  2. 26 May 2019, 5 commits
    • bpf, lru: avoid messing with eviction heuristics upon syscall lookup · 107e215c
      Daniel Borkmann authored
      commit 50b045a8c0ccf44f76640ac3eea8d80ca53979a3 upstream.
      
      One of the biggest issues we face right now with picking an LRU map
      over a regular hash table is that a map walk out of user space, for
      example, to just dump the existing entries or to remove certain ones,
      will completely mess up the LRU eviction heuristics, and wrong entries
      such as just-created ones will get evicted instead. The reason for this
      is that we mark an entry as "in use" via bpf_lru_node_set_ref() from
      the system call lookup side as well. Thus upon walk, all entries are
      being marked, so the information about the actual least recently used
      ones is "lost".
      
      In case of Cilium where it can be used (besides others) as a BPF
      based connection tracker, this current behavior causes disruption
      upon control plane changes that need to walk the map from user space
      to evict certain entries. Discussion result from bpfconf [0] was that
      we should simply just remove marking from system call side as no
      good use case could be found where it's actually needed there.
      Therefore this patch removes marking for regular LRU and per-CPU
      flavor. If there ever should be a need in future, the behavior could
      be selected via map creation flag, but due to mentioned reason we
      avoid this here.
      
        [0] http://vger.kernel.org/bpfconf.html
      
      Fixes: 29ba732a ("bpf: Add BPF_MAP_TYPE_LRU_HASH")
      Fixes: 8f844938 ("bpf: Add BPF_MAP_TYPE_LRU_PERCPU_HASH")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      107e215c
    • bpf: add map_lookup_elem_sys_only for lookups from syscall side · 2bb3c547
      Daniel Borkmann authored
      commit c6110222c6f49ea68169f353565eb865488a8619 upstream.
      
      Add a callback map_lookup_elem_sys_only() that map implementations can
      use instead of map_lookup_elem() on the system call side, in case the
      map implementation needs to handle lookups there differently than from
      the BPF data path. If map_lookup_elem_sys_only() is set, it will be the
      preferred pick for map lookups out of user space. This hook is used in
      a follow-up fix for the LRU map, but once the development window opens,
      we can convert other map types from map_lookup_elem() (here, the one
      called upon the BPF_MAP_LOOKUP_ELEM cmd is meant) over to the callback
      to simplify and clean up the latter.
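
      At the BPF_MAP_LOOKUP_ELEM syscall entry point, the preference can be
      sketched as follows (the wrapper function name is illustrative):

        /* Prefer the sys-only variant when the map implementation
         * provides one; fall back to the regular lookup otherwise. */
        static void *map_do_sys_lookup(struct bpf_map *map, void *key)
        {
                if (map->ops->map_lookup_elem_sys_only)
                        return map->ops->map_lookup_elem_sys_only(map, key);

                return map->ops->map_lookup_elem(map, key);
        }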
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      
      2bb3c547
    • bpf: relax inode permission check for retrieving bpf program · 3ded3aaa
      Chenbo Feng authored
      commit e547ff3f803e779a3898f1f48447b29f43c54085 upstream.
      
      For the iptables module to load a BPF program from a pinned location,
      it only retrieves the loaded program and cannot change its content, so
      requiring write permission for it might not be necessary.
      Also, when adding or removing an unrelated iptables rule, the xt_bpf
      related rules may need to be flushed and reloaded as well, which
      triggers the inode permission check. It is better to remove the write
      permission check for the inode so we won't need to grant write access
      to all the processes that flush and restore iptables rules.
      Signed-off-by: Chenbo Feng <fengc@google.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      3ded3aaa
    • sched/cpufreq: Fix kobject memleak · 290da8e7
      Tobin C. Harding authored
      [ Upstream commit 9a4f26cc98d81b67ecc23b890c28e2df324e29f3 ]
      
      Currently the error return path from kobject_init_and_add() is not
      followed by a call to kobject_put() - which means we are leaking
      the kobject.
      
      Fix it by adding a call to kobject_put() in the error path of
      kobject_init_and_add().
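
      The pattern in question, as a minimal sketch (generic names;
      kobject_init_and_add() initializes the refcount even when it fails,
      so the error path must drop it):

        ret = kobject_init_and_add(kobj, ktype, parent, "%s", name);
        if (ret) {
                /* Drop the reference taken by kobject_init_and_add(),
                 * otherwise the kobject (and its allocated name) leaks. */
                kobject_put(kobj);
                return ret;
        }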
      Signed-off-by: Tobin C. Harding <tobin@kernel.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tobin C. Harding <tobin@kernel.org>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Link: http://lkml.kernel.org/r/20190430001144.24890-1-tobin@kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      290da8e7
    • tracing: Fix partial reading of trace event's id file · fb8c9c90
      Elazar Leibovich authored
      commit cbe08bcbbe787315c425dde284dcb715cfbf3f39 upstream.
      
      When reading only part of the id file, ppos isn't tracked correctly.
      This is taken care of by simple_read_from_buffer().

      Reading a single byte, and then the next byte, would result in EOF.

      While this seems like not a big deal, it breaks abstractions that
      read information from files unbuffered. See for example
      https://github.com/golang/go/issues/29399
      
      This code was mentioned as problematic in
      commit cd458ba9
      ("tracing: Do not (ab)use trace_seq in event_id_read()")
      
      An example C program that shows this bug:
      
        #include <stdio.h>
        #include <stdint.h>
      
        #include <sys/types.h>
        #include <sys/stat.h>
        #include <fcntl.h>
        #include <unistd.h>
      
        int main(int argc, char **argv) {
          if (argc < 2)
            return 1;
          int fd = open(argv[1], O_RDONLY);
          if (fd < 0)
            return 1;
          char c;
          /* Two unbuffered one-byte reads: the second read should return
             the second character of the file, but with the buggy handler
             it hits a premature EOF and c keeps its previous value. */
          read(fd, &c, 1);
          printf("First  %c\n", c);
          read(fd, &c, 1);
          printf("Second %c\n", c);
          return 0;
        }
      
      Then run with, e.g.
      
        sudo ./a.out /sys/kernel/debug/tracing/events/tcp/tcp_set_state/id
      
      You'll notice you're getting the first character twice, instead of the
      first two characters in the id file.
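
      The fix lets simple_read_from_buffer() maintain ppos; the fixed read
      handler essentially becomes (condensed from event_id_read()):

        static ssize_t
        event_id_read(struct file *filp, char __user *ubuf, size_t cnt, loff_t *ppos)
        {
                int id = (long)event_file_data(filp);
                char buf[32];
                int len;

                if (unlikely(!id))
                        return -ENODEV;

                len = sprintf(buf, "%d\n", id);

                /* Handles partial reads and advances *ppos correctly */
                return simple_read_from_buffer(ubuf, cnt, ppos, buf, len);
        }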
      
      Link: http://lkml.kernel.org/r/20181231115837.4932-1-elazar@lightbitslabs.com
      
      Cc: Orit Wasserman <orit.was@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: stable@vger.kernel.org
      Fixes: 23725aee ("ftrace: provide an id file for each event")
      Signed-off-by: Elazar Leibovich <elazar@lightbitslabs.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      fb8c9c90
  3. 22 May 2019, 2 commits
    • userfaultfd: use RCU to free the task struct when fork fails · 8bae4398
      Andrea Arcangeli authored
      commit c3f3ce049f7d97cc7ec9c01cb51d9ec74e0f37c2 upstream.
      
      The task structure is freed while get_mem_cgroup_from_mm() holds
      rcu_read_lock() and dereferences mm->owner.
      
        get_mem_cgroup_from_mm()                failing fork()
        ----                                    ---
        task = mm->owner
                                                mm->owner = NULL;
                                                free(task)
        if (task) *task; /* use after free */
      
      The fix consists in freeing the task with RCU also in the fork failure
      case, exactly like it always happens for the regular exit(2) path.  That
      is enough to make the rcu_read_lock hold in get_mem_cgroup_from_mm()
      (left side above) effective to avoid a use after free when dereferencing
      the task structure.
      
      An alternate possible fix would be to defer the delivery of the
      userfaultfd contexts to the monitor until after fork() is guaranteed to
      succeed.  Such a change would be more invasive because it would
      create a strict ordering dependency where the uffd methods would need
      to be called beyond the last potentially failing branch in order to be
      safe.  By contrast, this solution only adds the dependency to common
      code to set mm->owner to NULL and to free the task struct that was
      pointed to by mm->owner with RCU, if fork ends up failing.  The
      userfaultfd methods can still be called anywhere during the fork
      runtime, and the monitor will keep discarding orphaned "mm" instances
      coming from failed forks in userland.
      
      This race condition cannot trigger if CONFIG_MEMCG is set to =n at
      build time.
      
      [aarcange@redhat.com: improve changelog, reduce #ifdefs per Michal]
        Link: http://lkml.kernel.org/r/20190429035752.4508-1-aarcange@redhat.com
      Link: http://lkml.kernel.org/r/20190325225636.11635-2-aarcange@redhat.com
      Fixes: 893e26e6 ("userfaultfd: non-cooperative: Add fork() event")
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Tested-by: zhong jiang <zhongjiang@huawei.com>
      Reported-by: syzbot+cbb52e396df3e565ab02@syzkaller.appspotmail.com
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: zhong jiang <zhongjiang@huawei.com>
      Cc: syzbot+cbb52e396df3e565ab02@syzkaller.appspotmail.com
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      8bae4398
    • locking/rwsem: Prevent decrement of reader count before increment · 7761dbf5
      Waiman Long authored
      [ Upstream commit a9e9bcb45b1525ba7aea26ed9441e8632aeeda58 ]
      
      During my rwsem testing, it was found that after a down_read(), the
      reader count may occasionally become 0 or even negative. Consequently,
      a writer may steal the lock at that time and execute with the reader
      in parallel thus breaking the mutual exclusion guarantee of the write
      lock. In other words, both readers and writer can become rwsem owners
      simultaneously.
      
      The current reader wakeup code does it in one pass to clear waiter->task
      and put them into wake_q before fully incrementing the reader count.
      Once waiter->task is cleared, the corresponding reader may see it,
      finish the critical section and do unlock to decrement the count before
      the count is incremented. This is not a problem if there is only one
      reader to wake up, as the count has been pre-incremented by 1.  It is
      a problem if there is more than one reader to be woken up, since the
      writer can steal the lock in that window.
      
      The wakeup was actually done in 2 passes before the following v4.9 commit:
      
        70800c3c ("locking/rwsem: Scan the wait_list for readers only once")
      
      To fix this problem, the wakeup is now done in two passes
      again. In the first pass, we collect the readers and count them.
      The reader count is then fully incremented. In the second pass, the
      waiter->task is then cleared and they are put into wake_q to be woken
      up later.
      Signed-off-by: Waiman Long <longman@redhat.com>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: huang ying <huang.ying.caritas@gmail.com>
      Fixes: 70800c3c ("locking/rwsem: Scan the wait_list for readers only once")
      Link: http://lkml.kernel.org/r/20190428212557.13482-2-longman@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      7761dbf5
  4. 15 May 2019, 1 commit
    • cpu/speculation: Add 'mitigations=' cmdline option · 8cb932ac
      Josh Poimboeuf authored
      commit 98af8452945c55652de68536afdde3b520fec429 upstream
      
      Keeping track of the number of mitigations for all the CPU speculation
      bugs has become overwhelming for many users.  It's getting more and more
      complicated to decide which mitigations are needed for a given
      architecture.  Complicating matters is the fact that each arch tends to
      have its own custom way to mitigate the same vulnerability.
      
      Most users fall into a few basic categories:
      
      a) they want all mitigations off;
      
      b) they want all reasonable mitigations on, with SMT enabled even if
         it's vulnerable; or
      
      c) they want all reasonable mitigations on, with SMT disabled if
         vulnerable.
      
      Define a set of curated, arch-independent options, each of which is an
      aggregation of existing options:
      
      - mitigations=off: Disable all mitigations.
      
      - mitigations=auto: [default] Enable all the default mitigations, but
        leave SMT enabled, even if it's vulnerable.
      
      - mitigations=auto,nosmt: Enable all the default mitigations, disabling
        SMT if needed by a mitigation.
      
      Currently, these options are placeholders which don't actually do
      anything.  They will be fleshed out in upcoming patches.
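
      The cmdline hookup follows the usual early_param() shape (a sketch
      matching the upstream helper; arch code later queries accessors such
      as cpu_mitigations_off()):

        enum cpu_mitigations {
                CPU_MITIGATIONS_OFF,
                CPU_MITIGATIONS_AUTO,
                CPU_MITIGATIONS_AUTO_NOSMT,
        };

        static enum cpu_mitigations cpu_mitigations __ro_after_init =
                CPU_MITIGATIONS_AUTO;

        static int __init mitigations_parse_cmdline(char *arg)
        {
                if (!strcmp(arg, "off"))
                        cpu_mitigations = CPU_MITIGATIONS_OFF;
                else if (!strcmp(arg, "auto"))
                        cpu_mitigations = CPU_MITIGATIONS_AUTO;
                else if (!strcmp(arg, "auto,nosmt"))
                        cpu_mitigations = CPU_MITIGATIONS_AUTO_NOSMT;

                return 0;
        }
        early_param("mitigations", mitigations_parse_cmdline);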
      Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Tested-by: Jiri Kosina <jkosina@suse.cz> (on x86)
      Reviewed-by: Jiri Kosina <jkosina@suse.cz>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H . Peter Anvin" <hpa@zytor.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Jiri Kosina <jikos@kernel.org>
      Cc: Waiman Long <longman@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Jon Masters <jcm@redhat.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: linuxppc-dev@lists.ozlabs.org
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: linux-s390@vger.kernel.org
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: linux-arm-kernel@lists.infradead.org
      Cc: linux-arch@vger.kernel.org
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Tyler Hicks <tyhicks@canonical.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Phil Auld <pauld@redhat.com>
      Link: https://lkml.kernel.org/r/b07a8ef9b7c5055c3a4637c87d07c296d5016fe0.1555085500.git.jpoimboe@redhat.com
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      8cb932ac
  5. 10 May 2019, 3 commits
    • locking/futex: Allow low-level atomic operations to return -EAGAIN · 0f4ef8fb
      Will Deacon authored
      commit 6b4f4bc9cb22875f97023984a625386f0c7cc1c0 upstream.
      
      Some futex() operations, including FUTEX_WAKE_OP, require the kernel to
      perform an atomic read-modify-write of the futex word via the userspace
      mapping. These operations are implemented by each architecture in
      arch_futex_atomic_op_inuser() and futex_atomic_cmpxchg_inatomic(), which
      are called in atomic context with the relevant hash bucket locks held.
      
      Although these routines may return -EFAULT in response to a page fault
      generated when accessing userspace, they are expected to succeed (i.e.
      return 0) in all other cases. This poses a problem for architectures
      that do not provide bounded forward progress guarantees or fairness of
      contended atomic operations and can lead to starvation in some cases.
      
      In these problematic scenarios, we must return back to the core futex
      code so that we can drop the hash bucket locks and reschedule if
      necessary, much like we do in the case of a page fault.
      
      Allow architectures to return -EAGAIN from their implementations of
      arch_futex_atomic_op_inuser() and futex_atomic_cmpxchg_inatomic(), which
      will cause the core futex code to reschedule if necessary and return
      back to the architecture code later on.
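
      Condensed from the futex_wake_op() retry logic, the core now treats
      -EAGAIN as a transient failure (the hash bucket locks have already
      been dropped at this point; labels follow the existing code):

        if (op_ret == -EFAULT) {
                ret = fault_in_user_writeable(uaddr2);
                if (ret)
                        goto out_put_keys;
        }

        /* -EAGAIN: the architecture could not make forward progress on
         * the atomic op; give other CPUs a chance and then retry. */
        cond_resched();
        goto retry;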
      
      Cc: <stable@kernel.org>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      0f4ef8fb
    • genirq: Prevent use-after-free and work list corruption · 33f2aa87
      Prasad Sodagudi authored
      [ Upstream commit 59c39840f5abf4a71e1810a8da71aaccd6c17d26 ]
      
      When irq_set_affinity_notifier() replaces the notifier, the reference
      count on the old notifier is dropped, which causes it to be freed. But
      nothing ensures that the old notifier is no longer queued on the work
      list. If it is still queued, this results in a use after free and
      possibly in work list corruption.
      
      Ensure that the work is canceled before the reference is dropped.
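
      The shape of the fix in irq_set_affinity_notifier() (condensed; the
      key line is the cancel before the final kref_put()):

        if (old_notify) {
                /* Make sure the old notifier is no longer queued on the
                 * work list before its last reference can be dropped. */
                cancel_work_sync(&old_notify->work);
                kref_put(&old_notify->kref, old_notify->release);
        }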
      Signed-off-by: Prasad Sodagudi <psodagud@codeaurora.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: marc.zyngier@arm.com
      Link: https://lkml.kernel.org/r/1553439424-6529-1-git-send-email-psodagud@codeaurora.org
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      33f2aa87
    • perf/core: Fix perf_event_disable_inatomic() race · 42638d6a
      Peter Zijlstra authored
      [ Upstream commit 1d54ad944074010609562da5c89e4f5df2f4e5db ]
      
      Thomas-Mich Richter reported he triggered a WARN()ing from event_function_local()
      on his s390. The problem boils down to:
      
      	CPU-A				CPU-B
      
      	perf_event_overflow()
      	  perf_event_disable_inatomic()
      	    @pending_disable = 1
      	    irq_work_queue();
      
      	sched-out
      	  event_sched_out()
      	    @pending_disable = 0
      
      					sched-in
      					perf_event_overflow()
      					  perf_event_disable_inatomic()
      					    @pending_disable = 1;
      					    irq_work_queue(); // FAILS
      
      	irq_work_run()
      	  perf_pending_event()
      	    if (@pending_disable)
      	      perf_event_disable_local(); // WHOOPS
      
      The problem exists in generic code, but s390 is particularly sensitive
      because it doesn't implement arch_irq_work_raise(), nor does it call
      irq_work_run() from its PMU interrupt handler (nor would that be
      sufficient in this case, because s390 also generates
      perf_event_overflow() from pmu::stop). Add to that the fact that s390
      is a virtual architecture and (virtual) CPU-A can stall long enough
      for the above race to happen, even if it would self-IPI.

      Adding an irq_work_sync() to event_sched_in() would work for all
      hardware PMUs that properly use irq_work_run(), but fails for software PMUs.
      
      Instead encode the CPU number in @pending_disable, such that we can
      tell which CPU requested the disable. This then allows us to detect
      the above scenario and even redirect the IPI to make up for the failed
      queue.
      Reported-by: Thomas-Mich Richter <tmricht@linux.ibm.com>
      Tested-by: Thomas Richter <tmricht@linux.ibm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Mark Rutland <mark.rutland@arm.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Hendrik Brueckner <brueckner@linux.ibm.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      42638d6a
  6. 04 May 2019, 2 commits
  7. 02 May 2019, 6 commits
  8. 27 Apr 2019, 6 commits
    • kernel/sysctl.c: fix out-of-bounds access when setting file-max · cdd369fe
      Will Deacon authored
      commit 9002b21465fa4d829edfc94a5a441005cffaa972 upstream.
      
      Commit 32a5ad9c2285 ("sysctl: handle overflow for file-max") hooked up
      min/max values for the file-max sysctl parameter via the .extra1 and
      .extra2 fields in the corresponding struct ctl_table entry.
      
      Unfortunately, the minimum value points at the global 'zero' variable,
      which is an int.  This results in a KASAN splat when accessed as a long
      by proc_doulongvec_minmax on 64-bit architectures:
      
        | BUG: KASAN: global-out-of-bounds in __do_proc_doulongvec_minmax+0x5d8/0x6a0
        | Read of size 8 at addr ffff2000133d1c20 by task systemd/1
        |
        | CPU: 0 PID: 1 Comm: systemd Not tainted 5.1.0-rc3-00012-g40b114779944 #2
        | Hardware name: linux,dummy-virt (DT)
        | Call trace:
        |  dump_backtrace+0x0/0x228
        |  show_stack+0x14/0x20
        |  dump_stack+0xe8/0x124
        |  print_address_description+0x60/0x258
        |  kasan_report+0x140/0x1a0
        |  __asan_report_load8_noabort+0x18/0x20
        |  __do_proc_doulongvec_minmax+0x5d8/0x6a0
        |  proc_doulongvec_minmax+0x4c/0x78
        |  proc_sys_call_handler.isra.19+0x144/0x1d8
        |  proc_sys_write+0x34/0x58
        |  __vfs_write+0x54/0xe8
        |  vfs_write+0x124/0x3c0
        |  ksys_write+0xbc/0x168
        |  __arm64_sys_write+0x68/0x98
        |  el0_svc_common+0x100/0x258
        |  el0_svc_handler+0x48/0xc0
        |  el0_svc+0x8/0xc
        |
        | The buggy address belongs to the variable:
        |  zero+0x0/0x40
        |
        | Memory state around the buggy address:
        |  ffff2000133d1b00: 00 00 00 00 00 00 00 00 fa fa fa fa 04 fa fa fa
        |  ffff2000133d1b80: fa fa fa fa 04 fa fa fa fa fa fa fa 04 fa fa fa
        | >ffff2000133d1c00: fa fa fa fa 04 fa fa fa fa fa fa fa 00 00 00 00
        |                                ^
        |  ffff2000133d1c80: fa fa fa fa 00 fa fa fa fa fa fa fa 00 00 00 00
        |  ffff2000133d1d00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
      
      Fix the splat by introducing an unsigned long 'zero_ul' and using that
      instead.
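
      Condensed, the fix looks like this (a long-sized zero so that
      proc_doulongvec_minmax() reads 8 valid bytes on 64-bit):

        static unsigned long zero_ul;
        static unsigned long long_max = LONG_MAX;

        static struct ctl_table fs_table[] = {
                {
                        .procname     = "file-max",
                        .data         = &files_stat.max_files,
                        .maxlen       = sizeof(files_stat.max_files),
                        .mode         = 0644,
                        .proc_handler = proc_doulongvec_minmax,
                        .extra1       = &zero_ul,   /* was &zero, an int */
                        .extra2       = &long_max,
                },
                { }
        };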
      
      Link: http://lkml.kernel.org/r/20190403153409.17307-1-will.deacon@arm.com
      Fixes: 32a5ad9c2285 ("sysctl: handle overflow for file-max")
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Acked-by: Christian Brauner <christian@brauner.io>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Matteo Croce <mcroce@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      cdd369fe
    • Revert "locking/lockdep: Add debug_locks check in __lock_downgrade()" · ac54bc12
      Greg Kroah-Hartman authored
      This reverts commit 0e0f7b30 which was
      commit 71492580571467fb7177aade19c18ce7486267f5 upstream.
      
      Tetsuo rightly points out that the backport here is incorrect, as it
      touches the __lock_set_class function instead of the intended
      __lock_downgrade function.
      Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Cc: Waiman Long <longman@redhat.com>
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      ac54bc12
    • sched/fair: Limit sched_cfs_period_timer() loop to avoid hard lockup · c3edd427
      Phil Auld authored
      [ Upstream commit 2e8e19226398db8265a8e675fcc0118b9e80c9e8 ]
      
      With an extremely short cfs_period_us setting on a parent task group
      with a large number of children, the for loop in
      sched_cfs_period_timer() can run until the watchdog fires. There is no
      guarantee that the call to hrtimer_forward_now() will ever return 0.
      The large number of children can make do_sched_cfs_period_timer() take
      longer than the period.
      
       NMI watchdog: Watchdog detected hard LOCKUP on cpu 24
       RIP: 0010:tg_nop+0x0/0x10
        <IRQ>
        walk_tg_tree_from+0x29/0xb0
        unthrottle_cfs_rq+0xe0/0x1a0
        distribute_cfs_runtime+0xd3/0xf0
        sched_cfs_period_timer+0xcb/0x160
        ? sched_cfs_slack_timer+0xd0/0xd0
        __hrtimer_run_queues+0xfb/0x270
        hrtimer_interrupt+0x122/0x270
        smp_apic_timer_interrupt+0x6a/0x140
        apic_timer_interrupt+0xf/0x20
        </IRQ>
      
      To prevent this, add protection to the loop that detects when it has
      run too many times, and scale the period and quota up proportionally
      so that the timer can complete before the next period expires. This
      preserves the relative runtime quota while preventing the hard lockup.
      
      A warning is issued reporting this state and the new values.
      Signed-off-by: Phil Auld <pauld@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: <stable@vger.kernel.org>
      Cc: Anton Blanchard <anton@ozlabs.org>
      Cc: Ben Segall <bsegall@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lkml.kernel.org/r/20190319130005.25492-1-pauld@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      c3edd427
    • timers/sched_clock: Prevent generic sched_clock wrap caused by tick_freeze() · cd37fd46
      Chang-An Chen authored
      commit 3f2552f7e9c5abef2775c53f7af66532f8bf65bc upstream.
      
      tick_freeze() introduced by suspend-to-idle in commit 124cf911 ("PM /
      sleep: Make it possible to quiesce timers during suspend-to-idle") uses
      timekeeping_suspend() instead of syscore_suspend() during
      suspend-to-idle. As a consequence generic sched_clock will keep going
      because sched_clock_suspend() and sched_clock_resume() are not invoked
      during suspend-to-idle which can result in a generic sched_clock wrap.
      
      On a ARM system with suspend-to-idle enabled, sched_clock is registered
      as "56 bits at 13MHz, resolution 76ns, wraps every 4398046511101ns", which
      means the real wrapping duration is 8796093022202ns.
      
      [  134.551779] suspend-to-idle suspend (timekeeping_suspend())
      [ 1204.912239] suspend-to-idle resume (timekeeping_resume())
      ......
      [ 1206.912239] suspend-to-idle suspend (timekeeping_suspend())
      [ 5880.502807] suspend-to-idle resume (timekeeping_resume())
      ......
      [ 6000.403724] suspend-to-idle suspend (timekeeping_suspend())
      [ 8035.753167] suspend-to-idle resume  (timekeeping_resume())
      ......
      [ 8795.786684] (2)[321:charger_thread]......
      [ 8795.788387] (2)[321:charger_thread]......
      [    0.057226] (0)[0:swapper/0]......
      [    0.061447] (2)[0:swapper/2]......
      
      sched_clock was not stopped during suspend-to-idle, and the
      sched_clock_poll hrtimer did not expire because timekeeping_suspend()
      was invoked during suspend-to-idle. This makes sched_clock wrap at
      kernel time 8796s.
      
      To prevent this, invoke sched_clock_suspend() and sched_clock_resume() in
      tick_freeze() together with timekeeping_suspend() and timekeeping_resume().
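
      A condensed sketch of the resulting pairing in tick_freeze() (the
      unfreeze side resumes in the opposite order):

        static void tick_freeze(void)
        {
                raw_spin_lock(&tick_freeze_lock);

                tick_freeze_depth++;
                if (tick_freeze_depth == num_online_cpus()) {
                        /* Last CPU: suspend sched_clock together with
                         * timekeeping so it cannot silently wrap. */
                        sched_clock_suspend();
                        timekeeping_suspend();
                } else {
                        tick_suspend_local();
                }

                raw_spin_unlock(&tick_freeze_lock);
        }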
      
      Fixes: 124cf911 (PM / sleep: Make it possible to quiesce timers during suspend-to-idle)
      Signed-off-by: Chang-An Chen <chang-an.chen@mediatek.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Matthias Brugger <matthias.bgg@gmail.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Corey Minyard <cminyard@mvista.com>
      Cc: <linux-mediatek@lists.infradead.org>
      Cc: <linux-arm-kernel@lists.infradead.org>
      Cc: Stanley Chu <stanley.chu@mediatek.com>
      Cc: <kuohong.wang@mediatek.com>
      Cc: <freddy.hsin@mediatek.com>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/1553828349-8914-1-git-send-email-chang-an.chen@mediatek.com
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      cd37fd46
    • kprobes: Fix error check when reusing optimized probes · 23a926e5
      Masami Hiramatsu authored
      commit 5f843ed415581cfad4ef8fefe31c138a8346ca8a upstream.
      
      The following commit introduced a bug in one of our error paths:
      
        819319fc9346 ("kprobes: Return error if we fail to reuse kprobe instead of BUG_ON()")
      
      it failed to handle the return value of kprobe_optready() as an
      error value. In reality, kprobe_optready() returns a bool result, so
      the "true" case must be passed through instead of 0.
      
      This causes some errors on kprobe boot-time selftests on ARM:
      
       [   ] Beginning kprobe tests...
       [   ] Probe ARM code
       [   ]     kprobe
       [   ]     kretprobe
       [   ] ARM instruction simulation
       [   ]     Check decoding tables
       [   ]     Run test cases
       [   ] FAIL: test_case_handler not run
       [   ] FAIL: Test andge	r10, r11, r14, asr r7
       [   ] FAIL: Scenario 11
       ...
       [   ] FAIL: Scenario 7
       [   ] Total instruction simulation tests=1631, pass=1433 fail=198
       [   ] kprobe tests failed
      
      This can happen if an optimized probe is unregistered and the next
      kprobe is registered on the same address before the previous probe
      has been reclaimed.

      If this happens, a hidden aggregated probe may be kept in memory,
      and no new kprobe can probe the same address. Also, in that case
      register_kprobe() will return "1" instead of a negative error value,
      which can mislead caller logic.
      Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
      Cc: David S . Miller <davem@davemloft.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Naveen N . Rao <naveen.n.rao@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org # v5.0+
      Fixes: 819319fc9346 ("kprobes: Return error if we fail to reuse kprobe instead of BUG_ON()")
      Link: http://lkml.kernel.org/r/155530808559.32517.539898325433642204.stgit@devnote2
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      23a926e5
    • kprobes: Mark ftrace mcount handler functions nokprobe · 426e2a80
      Masami Hiramatsu authored
      commit fabe38ab6b2bd9418350284c63825f13b8a6abba upstream.
      
      Mark the ftrace mcount handler functions nokprobe, since probing
      these functions with a kretprobe pushes the return address
      incorrectly onto the kretprobe shadow stack.
      Reported-by: Francis Deslauriers <francis.deslauriers@efficios.com>
      Tested-by: Andrea Righi <righi.andrea@gmail.com>
      Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
      Acked-by: Steven Rostedt <rostedt@goodmis.org>
      Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      Link: http://lkml.kernel.org/r/155094062044.6137.6419622920568680640.stgit@devbox
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      426e2a80
  9. 20 Apr 2019, 3 commits
    • bpf: fix use after free in bpf_evict_inode · e8eef7ad
      Daniel Borkmann authored
      [ Upstream commit 1da6c4d9140cb7c13e87667dc4e1488d6c8fc10f ]
      
      syzkaller was able to generate the following UAF in bpf:
      
        BUG: KASAN: use-after-free in lookup_last fs/namei.c:2269 [inline]
        BUG: KASAN: use-after-free in path_lookupat.isra.43+0x9f8/0xc00 fs/namei.c:2318
        Read of size 1 at addr ffff8801c4865c47 by task syz-executor2/9423
      
        CPU: 0 PID: 9423 Comm: syz-executor2 Not tainted 4.20.0-rc1-next-20181109+
        #110
        Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
        Google 01/01/2011
        Call Trace:
          __dump_stack lib/dump_stack.c:77 [inline]
          dump_stack+0x244/0x39d lib/dump_stack.c:113
          print_address_description.cold.7+0x9/0x1ff mm/kasan/report.c:256
          kasan_report_error mm/kasan/report.c:354 [inline]
          kasan_report.cold.8+0x242/0x309 mm/kasan/report.c:412
          __asan_report_load1_noabort+0x14/0x20 mm/kasan/report.c:430
          lookup_last fs/namei.c:2269 [inline]
          path_lookupat.isra.43+0x9f8/0xc00 fs/namei.c:2318
          filename_lookup+0x26a/0x520 fs/namei.c:2348
          user_path_at_empty+0x40/0x50 fs/namei.c:2608
          user_path include/linux/namei.h:62 [inline]
          do_mount+0x180/0x1ff0 fs/namespace.c:2980
          ksys_mount+0x12d/0x140 fs/namespace.c:3258
          __do_sys_mount fs/namespace.c:3272 [inline]
          __se_sys_mount fs/namespace.c:3269 [inline]
          __x64_sys_mount+0xbe/0x150 fs/namespace.c:3269
          do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
          entry_SYSCALL_64_after_hwframe+0x49/0xbe
        RIP: 0033:0x457569
        Code: fd b3 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7
        48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff
        ff 0f 83 cb b3 fb ff c3 66 2e 0f 1f 84 00 00 00 00
        RSP: 002b:00007fde6ed96c78 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
        RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 0000000000457569
        RDX: 0000000020000040 RSI: 0000000020000000 RDI: 0000000000000000
        RBP: 000000000072bf00 R08: 0000000020000340 R09: 0000000000000000
        R10: 0000000000200000 R11: 0000000000000246 R12: 00007fde6ed976d4
        R13: 00000000004c2c24 R14: 00000000004d4990 R15: 00000000ffffffff
      
        Allocated by task 9424:
          save_stack+0x43/0xd0 mm/kasan/kasan.c:448
          set_track mm/kasan/kasan.c:460 [inline]
          kasan_kmalloc+0xc7/0xe0 mm/kasan/kasan.c:553
          __do_kmalloc mm/slab.c:3722 [inline]
          __kmalloc_track_caller+0x157/0x760 mm/slab.c:3737
          kstrdup+0x39/0x70 mm/util.c:49
          bpf_symlink+0x26/0x140 kernel/bpf/inode.c:356
          vfs_symlink+0x37a/0x5d0 fs/namei.c:4127
          do_symlinkat+0x242/0x2d0 fs/namei.c:4154
          __do_sys_symlink fs/namei.c:4173 [inline]
          __se_sys_symlink fs/namei.c:4171 [inline]
          __x64_sys_symlink+0x59/0x80 fs/namei.c:4171
          do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
          entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
        Freed by task 9425:
          save_stack+0x43/0xd0 mm/kasan/kasan.c:448
          set_track mm/kasan/kasan.c:460 [inline]
          __kasan_slab_free+0x102/0x150 mm/kasan/kasan.c:521
          kasan_slab_free+0xe/0x10 mm/kasan/kasan.c:528
          __cache_free mm/slab.c:3498 [inline]
          kfree+0xcf/0x230 mm/slab.c:3817
          bpf_evict_inode+0x11f/0x150 kernel/bpf/inode.c:565
          evict+0x4b9/0x980 fs/inode.c:558
          iput_final fs/inode.c:1550 [inline]
          iput+0x674/0xa90 fs/inode.c:1576
          do_unlinkat+0x733/0xa30 fs/namei.c:4069
          __do_sys_unlink fs/namei.c:4110 [inline]
          __se_sys_unlink fs/namei.c:4108 [inline]
          __x64_sys_unlink+0x42/0x50 fs/namei.c:4108
          do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
          entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      In this scenario path lookup under RCU is racing with the final
      unlink in case of symlinks. As Linus puts it in his analysis:
      
        [...] We actually RCU-delay the inode freeing itself, but
        when we do the final iput(), the "evict()" function is called
        synchronously. Now, the simple fix would seem to just RCU-delay
        the kfree() of the symlink data in bpf_evict_inode(). Maybe
        that's the right thing to do. [...]
      
      Al suggested to piggy-back on the ->destroy_inode() callback in
      order to implement RCU deferral there which can then kfree() the
      inode->i_link eventually right before putting inode back into
      inode cache. By reusing free_inode_nonrcu() from there we can
      avoid the need for our own inode cache and just reuse generic
      one as we currently do.
      
      And in fact, on top of all this, we should just get rid of the
      bpf_evict_inode() entirely. This means truncate_inode_pages_final()
      and clear_inode() will then simply be called by the fs core via
      evict(). Dropping the reference should really only be done when
      inode is unhashed and nothing reachable anymore, so it's better
      also moved into the final ->destroy_inode() callback.
      
      Fixes: 0f98621b ("bpf, inode: add support for symlinks and fix mtime/ctime")
      Reported-by: syzbot+fb731ca573367b7f6564@syzkaller.appspotmail.com
      Reported-by: syzbot+a13e5ead792d6df37818@syzkaller.appspotmail.com
      Reported-by: syzbot+7a8ba368b47fdefca61e@syzkaller.appspotmail.com
      Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
      Analyzed-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Acked-by: Al Viro <viro@zeniv.linux.org.uk>
      Link: https://lore.kernel.org/lkml/0000000000006946d2057bbd0eef@google.com/T/
      Signed-off-by: Sasha Levin (Microsoft) <sashal@kernel.org>
      e8eef7ad
    • kernel: hung_task.c: disable on suspend · 491dee74
      Vitaly Kuznetsov authored
      [ Upstream commit a1c6ca3c6de763459a6e93b644ec6518c890ba1c ]
      
      It is possible to observe hung_task complaints when the system goes
      into the suspend-to-idle state:
      
       # echo freeze > /sys/power/state
      
       PM: Syncing filesystems ... done.
       Freezing user space processes ... (elapsed 0.001 seconds) done.
       OOM killer disabled.
       Freezing remaining freezable tasks ... (elapsed 0.002 seconds) done.
       sd 0:0:0:0: [sda] Synchronizing SCSI cache
       INFO: task bash:1569 blocked for more than 120 seconds.
             Not tainted 4.19.0-rc3_+ #687
       "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
       bash            D    0  1569    604 0x00000000
       Call Trace:
        ? __schedule+0x1fe/0x7e0
        schedule+0x28/0x80
        suspend_devices_and_enter+0x4ac/0x750
        pm_suspend+0x2c0/0x310
      
      Register a PM notifier to disable the detector on suspend and
      re-enable it on wakeup.
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      491dee74
    • sched/core: Fix buffer overflow in cgroup2 property cpu.max · 52466ab2
      Konstantin Khlebnikov authored
      [ Upstream commit 4c47acd824aaaa8fc6dc519fb4e08d1522105b7a ]
      
      Add a field-width limit to the sscanf() format string used with the
      on-stack buffer.
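
      The fix is classic sscanf() hardening; a sketch with the on-stack
      buffer (sizes as in the upstream parser, surrounding names condensed):

        u64 period, quota;
        char tok[21];   /* room for a u64 in decimal plus NUL */

        /* Before: "%s" let a long cpu.max write overrun tok[] */
        /* sscanf(buf, "%s %llu", tok, &period); */

        /* After: the conversion stores at most 20 bytes plus NUL */
        if (sscanf(buf, "%20s %llu", tok, &period) < 1)
                return -EINVAL;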
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Tejun Heo <tj@kernel.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: 0d593634 ("sched: Implement interface for cgroup unified hierarchy")
      Link: https://lkml.kernel.org/r/155189230232.2620.13120481613524200065.stgit@buzz
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      52466ab2