1. 11 Jun, 2019 (17 commits)
  2. 04 Jun, 2019 (3 commits)
  3. 31 May, 2019 (15 commits)
    • rcuperf: Fix cleanup path for invalid perf_type strings · 506b28fb
      Paul E. McKenney committed
      [ Upstream commit ad092c027713a68a34168942a5ef422e42e039f4 ]
      
      If the specified rcuperf.perf_type is not in the rcu_perf_init()
      function's perf_ops[] array, rcuperf prints some console messages and
      then invokes rcu_perf_cleanup() to set state so that a future torture
      test can run.  However, rcu_perf_cleanup() also attempts to end the
      test that didn't actually start, and in doing so relies on the value
      of cur_ops, a value that is not particularly relevant in this case.
      This can result in confusing output or even follow-on failures due to
      attempts to use facilities that have not been properly initialized.
      
      This commit therefore sets the value of cur_ops to NULL in this case and
      inserts a check near the beginning of rcu_perf_cleanup(), thus avoiding
      relying on an irrelevant cur_ops value.
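
      As a hedged sketch (the same pattern is applied to rcutorture in the
      next entry), the guard looks roughly like this; the function bodies
      are illustrative, not the verbatim patch:

        /* in rcu_perf_init(), when perf_type matches no perf_ops[] entry: */
        cur_ops = NULL;                 /* mark "test never started" */
        firsterr = -EINVAL;
        goto unwind;                    /* path ends in rcu_perf_cleanup() */

        /* in rcu_perf_cleanup(): */
        if (!cur_ops) {
                torture_cleanup_end();  /* reset torture state only */
                return;                 /* nothing else to tear down */
        }
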
      Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      506b28fb
    • rcutorture: Fix cleanup path for invalid torture_type strings · aa7919e3
      Paul E. McKenney committed
      [ Upstream commit b813afae7ab6a5e91b4e16cc567331d9c2ae1f04 ]
      
      If the specified rcutorture.torture_type is not in the rcu_torture_init()
      function's torture_ops[] array, rcutorture prints some console messages
      and then invokes rcu_torture_cleanup() to set state so that a future
      torture test can run.  However, rcu_torture_cleanup() also attempts to
      end the test that didn't actually start, and in doing so relies on the
      value of cur_ops, a value that is not particularly relevant in this case.
      This can result in confusing output or even follow-on failures due to
      attempts to use facilities that have not been properly initialized.
      
      This commit therefore sets the value of cur_ops to NULL in this case
      and inserts a check near the beginning of rcu_torture_cleanup(),
      thus avoiding relying on an irrelevant cur_ops value.
      Reported-by: kernel test robot <rong.a.chen@intel.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      aa7919e3
    • timekeeping: Force upper bound for setting CLOCK_REALTIME · dc0f37b7
      Thomas Gleixner committed
      [ Upstream commit 7a8e61f8478639072d402a26789055a4a4de8f77 ]
      
      Several people reported testing failures after setting CLOCK_REALTIME close
      to the limits of the kernel internal representation in nanoseconds,
      i.e. year 2262.
      
      The failures are exposed in subsequent operations, i.e. when arming timers
      or when the advancing CLOCK_MONOTONIC makes the calculation of
      CLOCK_REALTIME overflow into negative space.
      
      Now people start to paper over the underlying problem by clamping
      calculations to the valid range, but that's just wrong because such
      workarounds will prevent detection of real issues as well.
      
      It is reasonable to force an upper bound for the various methods of setting
      CLOCK_REALTIME. Year 2262 is the absolute upper bound. Assume a maximum
      uptime of 30 years which is plenty enough even for esoteric embedded
      systems. That results in an upper bound of year 2232 for setting the time.
      
      Once that limit is reached in reality, this limit will only be a small
      part of the problem space. But until then it stops people from trying
      to paper over the problem in the wrong places.
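
      A hedged sketch of the resulting validity check; the constant and
      helper names follow the description above but should be treated as
      illustrative:

        /* KTIME_MAX in seconds lands in year 2262; reserve 30 years of
         * uptime, which bounds settable time around year 2232. */
        #define KTIME_SEC_MAX        (KTIME_MAX / NSEC_PER_SEC)
        #define TIME_UPTIME_SEC_MAX  (30LL * 365 * 24 * 3600)
        #define TIME_SETTOD_SEC_MAX  (KTIME_SEC_MAX - TIME_UPTIME_SEC_MAX)

        static bool timespec64_valid_settod(const struct timespec64 *ts)
        {
                if (!timespec64_valid(ts))
                        return false;
                /* Reject values past the settable upper bound. */
                return (unsigned long long)ts->tv_sec < TIME_SETTOD_SEC_MAX;
        }
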
      Reported-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
      Reported-by: Hongbo Yao <yaohongbo@huawei.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Stephen Boyd <sboyd@kernel.org>
      Cc: Miroslav Lichvar <mlichvar@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Richard Cochran <richardcochran@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1903231125480.2157@nanos.tec.linutronix.de
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      dc0f37b7
    • x86/uaccess, ftrace: Fix ftrace_likely_update() vs. SMAP · 1a3188d7
      Peter Zijlstra committed
      [ Upstream commit 4a6c91fbdef846ec7250b82f2eeeb87ac5f18cf9 ]
      
      With CONFIG_TRACE_BRANCH_PROFILING=y, the likely/unlikely macros are
      overloaded to generate callouts into this code, and thus they also
      fire while AC=1 (i.e. while SMAP protection is temporarily open for
      user accesses).
      
      Make it safe.
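
      A hedged sketch of what "safe" means here: save and disable the
      user-access (AC) state around the update so the callout never runs
      with SMAP protection open. The body is elided; treat this as a
      sketch rather than the exact diff:

        void ftrace_likely_update(struct ftrace_likely_data *f, int val,
                                  int expect, int is_constant)
        {
                unsigned long flags = user_access_save();

                /* ... update the branch-profiling counters ... */

                user_access_restore(flags);
        }
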
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      1a3188d7
    • irq_work: Do not raise an IPI when queueing work on the local CPU · afee27f3
      Nicholas Piggin committed
      [ Upstream commit 471ba0e686cb13752bc1ff3216c54b69a2d250ea ]
      
      The QEMU PowerPC/PSeries machine model was not expecting a self-IPI,
      and raising one may be a surprising thing to do in any case, so have
      irq_work_queue_on() do plain local queueing when the target is the
      current CPU.
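
      A hedged sketch of the resulting shape (helper names are
      illustrative):

        bool irq_work_queue_on(struct irq_work *work, int cpu)
        {
                /* Only queue if the work is not already pending. */
                if (!irq_work_claim(work))
                        return false;

                preempt_disable();
                if (cpu != smp_processor_id()) {
                        /* Remote target: queue there and kick it with an IPI. */
                        if (llist_add(&work->llnode, &per_cpu(raised_list, cpu)))
                                arch_send_call_function_single_ipi(cpu);
                } else {
                        /* Local target: ordinary local queueing, no self-IPI. */
                        __irq_work_queue_local(work);
                }
                preempt_enable();

                return true;
        }
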
      Suggested-by: Steven Rostedt <rostedt@goodmis.org>
      Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Cédric Le Goater <clg@kaod.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lkml.kernel.org/r/20190409093403.20994-1-npiggin@gmail.com
      [ Simplified the preprocessor comments.
        Fixed unbalanced curly brackets pointed out by Thomas. ]
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      afee27f3
    • sched/core: Handle overflow in cpu_shares_write_u64 · 355673f8
      Konstantin Khlebnikov committed
      [ Upstream commit 5b61d50ab4ef590f5e1d4df15cd2cea5f5715308 ]
      
      The bit shift in scale_load() could overflow shares. This patch
      saturates it to MAX_SHARES, as sched_group_set_shares() already does.
      
      Example:
      
       # echo 9223372036854776832 > cpu.shares
       # cat cpu.shares
      
      Before patch: 1024
      After patch: 262144
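
      The fix, roughly (this follows the description above; treat it as a
      sketch rather than the verbatim diff):

        static int cpu_shares_write_u64(struct cgroup_subsys_state *css,
                                        struct cftype *cftype, u64 shareval)
        {
                /*
                 * scale_load() shifts left; clamp first so the shift
                 * cannot overflow, mirroring the MAX_SHARES cap in
                 * sched_group_set_shares().
                 */
                if (shareval > scale_load_down(ULONG_MAX))
                        shareval = MAX_SHARES;
                return sched_group_set_shares(css_tg(css), scale_load(shareval));
        }
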
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/155125501891.293431.3345233332801109696.stgit@buzz
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      355673f8
    • sched/rt: Check integer overflow at usec to nsec conversion · 7053046e
      Konstantin Khlebnikov committed
      [ Upstream commit 1a010e29cfa00fee2888fd2fd4983f848cbafb58 ]
      
      Example of unhandled overflows:
      
       # echo 18446744073709651 > cpu.rt_runtime_us
       # cat cpu.rt_runtime_us
       99
      
       # echo 18446744073709900 > cpu.rt_period_us
       # cat cpu.rt_period_us
       348
      
      After this patch they will fail with -EINVAL.
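
      A hedged sketch of the guard, written as a hypothetical helper
      (NSEC_PER_USEC == 1000, so the check rejects any value whose
      conversion would wrap u64):

        /* Hypothetical helper: convert microseconds to nanoseconds,
         * failing instead of silently wrapping. */
        static int us_to_ns_checked(u64 us, u64 *ns)
        {
                if (us > U64_MAX / NSEC_PER_USEC)
                        return -EINVAL;
                *ns = us * NSEC_PER_USEC;
                return 0;
        }
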
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/155125501739.293431.5252197504404771496.stgit@buzz
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      7053046e
    • sched/core: Check quota and period overflow at usec to nsec conversion · 925275d0
      Konstantin Khlebnikov committed
      [ Upstream commit 1a8b4540db732ca16c9e43ac7c08b1b8f0b252d8 ]
      
      Large values could overflow u64 and pass the subsequent sanity checks.
      
       # echo 18446744073750000 > cpu.cfs_period_us
       # cat cpu.cfs_period_us
       40448
      
       # echo 18446744073750000 > cpu.cfs_quota_us
       # cat cpu.cfs_quota_us
       40448
      
      After this patch they will fail with -EINVAL.
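
      The same division-based guard as in the rt patch above, applied to
      both CFS knobs (a sketch, not the verbatim diff):

        if (cfs_quota_us < 0)
                quota = RUNTIME_INF;
        else if ((u64)cfs_quota_us <= U64_MAX / NSEC_PER_USEC)
                quota = (u64)cfs_quota_us * NSEC_PER_USEC;
        else
                return -EINVAL;

        if ((u64)cfs_period_us > U64_MAX / NSEC_PER_USEC)
                return -EINVAL;
        period = (u64)cfs_period_us * NSEC_PER_USEC;
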
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/155125502079.293431.3947497929372138600.stgit@buzz
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      925275d0
    • cgroup: protect cgroup->nr_(dying_)descendants by css_set_lock · 4e4d5cea
      Roman Gushchin committed
      [ Upstream commit 4dcabece4c3a9f9522127be12cc12cc120399b2f ]
      
      The number of descendant cgroups and the number of dying
      descendant cgroups are currently synchronized using the cgroup_mutex.
      
      The number of descendant cgroups will be required by the cgroup v2
      freezer, which will use it to determine if a cgroup is frozen
      (depending on total number of descendants and number of frozen
      descendants). It's not always acceptable to grab the cgroup_mutex,
      especially from quite hot paths (e.g. exit()).
      
      To avoid this, let's additionally synchronize these counters using
      the css_set_lock.
      
      So, it's safe to read these counters with either cgroup_mutex or
      css_set_lock held, and both locks must be acquired to change them.
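
      A hedged sketch of the resulting locking rule (the helper is
      hypothetical; the real updates happen in the cgroup create/release
      paths):

        /* Writers hold both locks; readers may hold either one. */
        static void cgroup_adjust_descendants(struct cgroup *cgrp, int delta)
        {
                lockdep_assert_held(&cgroup_mutex);

                spin_lock_irq(&css_set_lock);
                for (; cgrp; cgrp = cgroup_parent(cgrp))
                        cgrp->nr_descendants += delta;
                spin_unlock_irq(&css_set_lock);
        }
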
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: kernel-team@fb.com
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      4e4d5cea
    • audit: fix a memory leak bug · 6c21fa84
      Wenwen Wang committed
      [ Upstream commit 70c4cf17e445264453bc5323db3e50aa0ac9e81f ]
      
      In audit_rule_change(), audit_data_to_entry() is first invoked to
      translate the payload data into the kernel's rule representation. In
      audit_data_to_entry(), depending on the audit field type, an audit tree may
      be created in audit_make_tree(), which eventually invokes kmalloc() to
      allocate the tree.  Since this tree is a temporary tree, it will be then
      freed in the following execution, e.g., audit_add_rule() if the message
      type is AUDIT_ADD_RULE or audit_del_rule() if the message type is
      AUDIT_DEL_RULE. However, if the message type is neither AUDIT_ADD_RULE nor
      AUDIT_DEL_RULE, i.e., the default case of the switch statement, this
      temporary tree is not freed.
      
      To fix this issue, only allocate the tree when the type is AUDIT_ADD_RULE
      or AUDIT_DEL_RULE.
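
      A hedged sketch of the shape of the fix (the call signature here is
      illustrative, not the exact diff):

        /* Only allocate the temporary tree for message types that will
         * actually consume (and thus later free) it. */
        if (msg_type == AUDIT_ADD_RULE || msg_type == AUDIT_DEL_RULE) {
                err = audit_make_tree(&entry->rule, str, f->op);
                if (err)
                        goto exit_free;
        }
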
      Signed-off-by: Wenwen Wang <wang6495@umn.edu>
      Reviewed-by: Richard Guy Briggs <rgb@redhat.com>
      Signed-off-by: Paul Moore <paul@paul-moore.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      6c21fa84
    • sched/nohz: Run NOHZ idle load balancer on HK_FLAG_MISC CPUs · 07da741d
      Nicholas Piggin committed
      [ Upstream commit 9b019acb72e4b5741d88e8936d6f200ed44b66b2 ]
      
      The NOHZ idle balancer runs on the lowest idle CPU. This can
      interfere with isolated CPUs, so confine it to HK_FLAG_MISC
      housekeeping CPUs.
      
      HK_FLAG_SCHED is not used for this because it is not set anywhere
      at the moment. This could be folded into HK_FLAG_SCHED once that
      option is fixed.
      
      The problem was observed with increased jitter on an application
      running on CPU0, caused by NOHZ idle load balancing being run on
      CPU1 (an SMT sibling).
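
      A hedged sketch of the selection loop after the change (close to the
      described fix, but treat it as illustrative):

        static inline int find_new_ilb(void)
        {
                int ilb;

                /* Only idle CPUs that are also HK_FLAG_MISC housekeeping
                 * CPUs are eligible to run the NOHZ idle balancer. */
                for_each_cpu_and(ilb, nohz.idle_cpus_mask,
                                 housekeeping_cpumask(HK_FLAG_MISC)) {
                        if (idle_cpu(ilb))
                                return ilb;
                }
                return nr_cpu_ids;
        }
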
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lkml.kernel.org/r/20190412042613.28930-1-npiggin@gmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      07da741d
    • x86/modules: Avoid breaking W^X while loading modules · 8715ce03
      Nadav Amit committed
      [ Upstream commit f2c65fb3221adc6b73b0549fc7ba892022db9797 ]
      
      When modules and BPF filters are loaded, there is a time window in
      which some memory is both writable and executable. An attacker that has
      already found another vulnerability (e.g., a dangling pointer) might be
      able to exploit this behavior to overwrite kernel code. Prevent having
      writable executable PTEs in this stage.
      
      In addition, avoiding W+X mappings also slightly simplifies the
      patching of module code at initialization time (e.g. by alternatives
      and static keys), as will be done in the next patch. This was
      actually the main motivation for this patch.
      
      To avoid W+X mappings, map the pages initially as RW (NX), and only
      after they have been set RO mark them X as well. Making them
      executable is a separate step so that there is never a window in
      which one core still has the old PTE cached (hence writable) while
      another already sees the updated PTE (executable), which would break
      the W^X protection.
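
      A hedged sketch of the two-step flip (helper names follow the
      module-loader conventions but are illustrative here):

        /* Pages come from module_alloc() as RW + NX. After relocations
         * and alternatives are applied, flip to RO first, then X, so no
         * PTE is ever writable and executable at the same time. */
        module_enable_ro(mod, true);
        module_enable_x(mod);
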
      Suggested-by: Thomas Gleixner <tglx@linutronix.de>
      Suggested-by: Andy Lutomirski <luto@amacapital.net>
      Signed-off-by: Nadav Amit <namit@vmware.com>
      Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: <akpm@linux-foundation.org>
      Cc: <ard.biesheuvel@linaro.org>
      Cc: <deneen.t.dock@intel.com>
      Cc: <kernel-hardening@lists.openwall.com>
      Cc: <kristen@linux.intel.com>
      Cc: <linux_dti@icloud.com>
      Cc: <will.deacon@arm.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jessica Yu <jeyu@kernel.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Link: https://lkml.kernel.org/r/20190426001143.4983-12-namit@vmware.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      8715ce03
    • acct_on(): don't mess with freeze protection · 7c2bcb3c
      Al Viro committed
      commit 9419a3191dcb27f24478d288abaab697228d28e6 upstream.
      
      What happens there is that we are replacing file->path.mnt of
      a file we'd just opened with a clone and we need the write
      count contribution to be transferred from original mount to
      new one.  That's it.  We do *NOT* want any kind of freeze
      protection for the duration of switchover.
      
      IOW, we should just use __mnt_{want,drop}_write() for that
      switchover; no need to bother with mnt_{want,drop}_write()
      there.
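
      A hedged sketch of the switchover described above (names and error
      handling abbreviated for illustration):

        /* Transfer the write-count contribution from the original mount
         * to the clone; __mnt_want_write()/__mnt_drop_write() touch only
         * the mount write count and take no freeze protection. */
        err = __mnt_want_write(new_mnt);
        if (!err)
                __mnt_drop_write(file->f_path.mnt);
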
      Tested-by: Amir Goldstein <amir73il@gmail.com>
      Reported-by: syzbot+2a73a6ea9507b7112141@syzkaller.appspotmail.com
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      7c2bcb3c
    • bpf: devmap: fix use-after-free Read in __dev_map_entry_free · 003e2d74
      Eric Dumazet committed
      commit 2baae3545327632167c0180e9ca1d467416f1919 upstream.
      
      synchronize_rcu() is fine when the RCU callbacks only need to free
      memory (kfree_rcu(), or callbacks that directly call kfree()).
      
      __dev_map_entry_free() is a bit more complex, so we need to make
      sure that queued __dev_map_entry_free() callbacks have completed.
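
      A hedged sketch of the distinction (the elided body stands in for
      the rest of the teardown):

        static void dev_map_free(struct bpf_map *map)
        {
                /* A grace period alone (synchronize_rcu()) is not enough
                 * here: __dev_map_entry_free() callbacks queued through
                 * call_rcu() must have finished running before the map
                 * memory is released, so wait for the callbacks too. */
                rcu_barrier();

                /* ... drop flush bits, free entries, free the map ... */
        }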
      
      syzbot report:
      
      BUG: KASAN: use-after-free in dev_map_flush_old kernel/bpf/devmap.c:365
      [inline]
      BUG: KASAN: use-after-free in __dev_map_entry_free+0x2a8/0x300
      kernel/bpf/devmap.c:379
      Read of size 8 at addr ffff8801b8da38c8 by task ksoftirqd/1/18
      
      CPU: 1 PID: 18 Comm: ksoftirqd/1 Not tainted 4.17.0+ #39
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
      Google 01/01/2011
      Call Trace:
        __dump_stack lib/dump_stack.c:77 [inline]
        dump_stack+0x1b9/0x294 lib/dump_stack.c:113
        print_address_description+0x6c/0x20b mm/kasan/report.c:256
        kasan_report_error mm/kasan/report.c:354 [inline]
        kasan_report.cold.7+0x242/0x2fe mm/kasan/report.c:412
        __asan_report_load8_noabort+0x14/0x20 mm/kasan/report.c:433
        dev_map_flush_old kernel/bpf/devmap.c:365 [inline]
        __dev_map_entry_free+0x2a8/0x300 kernel/bpf/devmap.c:379
        __rcu_reclaim kernel/rcu/rcu.h:178 [inline]
        rcu_do_batch kernel/rcu/tree.c:2558 [inline]
        invoke_rcu_callbacks kernel/rcu/tree.c:2818 [inline]
        __rcu_process_callbacks kernel/rcu/tree.c:2785 [inline]
        rcu_process_callbacks+0xe9d/0x1760 kernel/rcu/tree.c:2802
        __do_softirq+0x2e0/0xaf5 kernel/softirq.c:284
        run_ksoftirqd+0x86/0x100 kernel/softirq.c:645
        smpboot_thread_fn+0x417/0x870 kernel/smpboot.c:164
        kthread+0x345/0x410 kernel/kthread.c:240
        ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:412
      
      Allocated by task 6675:
        save_stack+0x43/0xd0 mm/kasan/kasan.c:448
        set_track mm/kasan/kasan.c:460 [inline]
        kasan_kmalloc+0xc4/0xe0 mm/kasan/kasan.c:553
        kmem_cache_alloc_trace+0x152/0x780 mm/slab.c:3620
        kmalloc include/linux/slab.h:513 [inline]
        kzalloc include/linux/slab.h:706 [inline]
        dev_map_alloc+0x208/0x7f0 kernel/bpf/devmap.c:102
        find_and_alloc_map kernel/bpf/syscall.c:129 [inline]
        map_create+0x393/0x1010 kernel/bpf/syscall.c:453
        __do_sys_bpf kernel/bpf/syscall.c:2351 [inline]
        __se_sys_bpf kernel/bpf/syscall.c:2328 [inline]
        __x64_sys_bpf+0x303/0x510 kernel/bpf/syscall.c:2328
        do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:290
        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Freed by task 26:
        save_stack+0x43/0xd0 mm/kasan/kasan.c:448
        set_track mm/kasan/kasan.c:460 [inline]
        __kasan_slab_free+0x11a/0x170 mm/kasan/kasan.c:521
        kasan_slab_free+0xe/0x10 mm/kasan/kasan.c:528
        __cache_free mm/slab.c:3498 [inline]
        kfree+0xd9/0x260 mm/slab.c:3813
        dev_map_free+0x4fa/0x670 kernel/bpf/devmap.c:191
        bpf_map_free_deferred+0xba/0xf0 kernel/bpf/syscall.c:262
        process_one_work+0xc64/0x1b70 kernel/workqueue.c:2153
        worker_thread+0x181/0x13a0 kernel/workqueue.c:2296
        kthread+0x345/0x410 kernel/kthread.c:240
        ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:412
      
      The buggy address belongs to the object at ffff8801b8da37c0
        which belongs to the cache kmalloc-512 of size 512
      The buggy address is located 264 bytes inside of
        512-byte region [ffff8801b8da37c0, ffff8801b8da39c0)
      The buggy address belongs to the page:
      page:ffffea0006e368c0 count:1 mapcount:0 mapping:ffff8801da800940
      index:0xffff8801b8da3540
      flags: 0x2fffc0000000100(slab)
      raw: 02fffc0000000100 ffffea0007217b88 ffffea0006e30cc8 ffff8801da800940
      raw: ffff8801b8da3540 ffff8801b8da3040 0000000100000004 0000000000000000
      page dumped because: kasan: bad access detected
      
      Memory state around the buggy address:
        ffff8801b8da3780: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
        ffff8801b8da3800: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      > ffff8801b8da3880: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                                                     ^
        ffff8801b8da3900: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
        ffff8801b8da3980: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
      
      Fixes: 546ac1ff ("bpf: add devmap, a map for storing net device references")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot+457d3e2ffbcf31aee5c0@syzkaller.appspotmail.com
      Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      003e2d74
    • bpf: add bpf_jit_limit knob to restrict unpriv allocations · 43caa29c
      Daniel Borkmann committed
      commit ede95a63b5e84ddeea6b0c473b36ab8bfd8c6ce3 upstream.
      
      Rick reported that the BPF JIT could potentially fill the entire module
      space with BPF programs from unprivileged users which would prevent later
      attempts to load normal kernel modules or privileged BPF programs, for
      example. If the JIT was enabled but failed to generate the image,
      then before commit 290af866 ("bpf: introduce BPF_JIT_ALWAYS_ON
      config") we would always fall back to the BPF interpreter. Nowadays,
      when CONFIG_BPF_JIT_ALWAYS_ON is set, the load instead aborts with a
      failure since the BPF interpreter was compiled out.
      
      Add a global limit and enforce it for unprivileged users: if the BPF
      interpreter is compiled out, we fail once the limit has been
      reached; if the interpreter is compiled in, we fall back to it
      earlier, without using module memory. In a next step, fair sharing
      among unprivileged users can be resolved, in particular for the case
      where we would otherwise fail hard once the limit is reached.
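
      A hedged sketch of the charge/uncharge pair around JIT image
      allocation (close to the description above; details such as the
      sysctl wiring are omitted):

        static atomic_long_t bpf_jit_current;

        static int bpf_jit_charge_modmem(u32 pages)
        {
                if (atomic_long_add_return(pages, &bpf_jit_current) >
                    (bpf_jit_limit >> PAGE_SHIFT)) {
                        if (!capable(CAP_SYS_ADMIN)) {
                                atomic_long_sub(pages, &bpf_jit_current);
                                return -EPERM;
                        }
                }
                return 0;
        }

        static void bpf_jit_uncharge_modmem(u32 pages)
        {
                atomic_long_sub(pages, &bpf_jit_current);
        }
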
      
      Fixes: 290af866 ("bpf: introduce BPF_JIT_ALWAYS_ON config")
      Fixes: 0a14842f ("net: filter: Just In Time compiler for x86-64")
      Co-Developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: LKML <linux-kernel@vger.kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Cc: Ben Hutchings <ben.hutchings@codethink.co.uk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      43caa29c
  4. 26 May, 2019 (5 commits)
    • bpf, lru: avoid messing with eviction heuristics upon syscall lookup · 107e215c
      Daniel Borkmann committed
      commit 50b045a8c0ccf44f76640ac3eea8d80ca53979a3 upstream.
      
      One of the biggest issues we face right now with picking LRU map over
      regular hash table is that a map walk out of user space, for example,
      to just dump the existing entries or to remove certain ones, will
      completely mess up LRU eviction heuristics and wrong entries such
      as just created ones will get evicted instead. The reason for this
      is that we mark an entry as "in use" via bpf_lru_node_set_ref() from
      the system call lookup side as well. Thus upon a walk, all entries
      get marked, so information about the actual least recently used ones
      is "lost".
      
      In case of Cilium where it can be used (besides others) as a BPF
      based connection tracker, this current behavior causes disruption
      upon control plane changes that need to walk the map from user space
      to evict certain entries. Discussion result from bpfconf [0] was that
      we should simply just remove marking from system call side as no
      good use case could be found where it's actually needed there.
      Therefore this patch removes marking for regular LRU and per-CPU
      flavor. If there ever should be a need in future, the behavior could
      be selected via map creation flag, but due to mentioned reason we
      avoid this here.
      
        [0] http://vger.kernel.org/bpfconf.html
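
      A hedged sketch of the two lookup flavors after the change (shapes
      follow the hashtab conventions but are illustrative):

        /* Data-path lookup: marks the node referenced, feeding the LRU
         * eviction heuristic. */
        static void *htab_lru_map_lookup_elem(struct bpf_map *map, void *key)
        {
                struct htab_elem *l = __htab_map_lookup_elem(map, key);

                if (l) {
                        bpf_lru_node_set_ref(&l->lru_node);
                        return l->key + round_up(map->key_size, 8);
                }
                return NULL;
        }

        /* Syscall-side lookup: identical, minus the ref mark, so a
         * user-space walk no longer perturbs eviction state. */
        static void *htab_lru_map_lookup_elem_sys(struct bpf_map *map, void *key)
        {
                struct htab_elem *l = __htab_map_lookup_elem(map, key);

                return l ? l->key + round_up(map->key_size, 8) : NULL;
        }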
      
      Fixes: 29ba732a ("bpf: Add BPF_MAP_TYPE_LRU_HASH")
      Fixes: 8f844938 ("bpf: Add BPF_MAP_TYPE_LRU_PERCPU_HASH")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      107e215c
    • bpf: add map_lookup_elem_sys_only for lookups from syscall side · 2bb3c547
      Daniel Borkmann committed
      commit c6110222c6f49ea68169f353565eb865488a8619 upstream.
      
      Add a callback map_lookup_elem_sys_only() that map implementations
      can use instead of map_lookup_elem() on the system call side, in
      case the map implementation needs to handle the latter differently
      than the BPF data path. If map_lookup_elem_sys_only() is set, it is
      the preferred pick for map lookups out of user space. This hook is
      used in a follow-up fix for the LRU map; once the development window
      opens, we can also convert other map types from map_lookup_elem()
      (meaning here the one called upon the BPF_MAP_LOOKUP_ELEM command)
      over to the callback, to simplify and clean up the latter.
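
      A hedged sketch of how the syscall lookup path would prefer the new
      hook when a map implementation provides it:

        /* In the BPF_MAP_LOOKUP_ELEM handler (illustrative): */
        if (map->ops->map_lookup_elem_sys_only)
                ptr = map->ops->map_lookup_elem_sys_only(map, key);
        else
                ptr = map->ops->map_lookup_elem(map, key);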
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      
      2bb3c547
    • bpf: relax inode permission check for retrieving bpf program · 3ded3aaa
      Chenbo Feng committed
      commit e547ff3f803e779a3898f1f48447b29f43c54085 upstream.
      
      For the iptables module to load a bpf program from a pinned
      location, it only retrieves an already loaded program and cannot
      change the program content, so requiring write permission for it
      might not be necessary.
      Also, when adding or removing an unrelated iptables rule, it might
      need to flush and reload the xt_bpf related rules as well, which
      triggers the inode permission check. It might be better to remove
      the write permission check for the inode so we won't need to grant
      write access to all the processes that flush and restore iptables
      rules.
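
      A hedged sketch of the relaxed check on the pinned object's inode
      (the surrounding code is elided; inode_permission() is the standard
      VFS helper):

        /* Fetching a pinned program is read-only: MAY_WRITE dropped. */
        ret = inode_permission(inode, MAY_READ);
        if (ret)
                return ERR_PTR(ret);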
      Signed-off-by: Chenbo Feng <fengc@google.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      3ded3aaa
    • sched/cpufreq: Fix kobject memleak · 290da8e7
      Tobin C. Harding committed
      [ Upstream commit 9a4f26cc98d81b67ecc23b890c28e2df324e29f3 ]
      
      Currently the error return path from kobject_init_and_add() is not
      followed by a call to kobject_put() - which means we are leaking
      the kobject.
      
      Fix it by adding a call to kobject_put() in the error path of
      kobject_init_and_add().
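
      The general pattern (names illustrative): kobject_init_and_add()
      takes a reference via kobject_init() even when it fails, so the
      error path must drop it with kobject_put() rather than freeing the
      object directly:

        ret = kobject_init_and_add(kobj, ktype, parent, "%s", name);
        if (ret) {
                kobject_put(kobj);  /* release the ref; ktype->release() frees */
                return ret;
        }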
      Signed-off-by: Tobin C. Harding <tobin@kernel.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tobin C. Harding <tobin@kernel.org>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Link: http://lkml.kernel.org/r/20190430001144.24890-1-tobin@kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      290da8e7
    • tracing: Fix partial reading of trace event's id file · fb8c9c90
      Elazar Leibovich committed
      commit cbe08bcbbe787315c425dde284dcb715cfbf3f39 upstream.
      
      When reading only part of the id file, ppos isn't tracked correctly.
      This is taken care of by simple_read_from_buffer().
      
      Reading a single byte and then the next byte would result in EOF.
      
      While this seems like not a big deal, it breaks abstractions that
      read information from files unbuffered. See for example
      https://github.com/golang/go/issues/29399
      
      This code was mentioned as problematic in
      commit cd458ba9
      ("tracing: Do not (ab)use trace_seq in event_id_read()")
      
      Example C code that shows this bug:
      
        #include <stdio.h>
        #include <stdint.h>
      
        #include <sys/types.h>
        #include <sys/stat.h>
        #include <fcntl.h>
        #include <unistd.h>
      
        int main(int argc, char **argv) {
          if (argc < 2)
            return 1;
          int fd = open(argv[1], O_RDONLY);
          char c;
          /* With the bug, the second 1-byte read repeats the first
             character instead of returning the next one. */
          read(fd, &c, 1);
          printf("First  %c\n", c);
          read(fd, &c, 1);
          printf("Second %c\n", c);
          return 0;
        }
      
      Then run with, e.g.
      
        sudo ./a.out /sys/kernel/debug/tracing/events/tcp/tcp_set_state/id
      
      You'll notice you're getting the first character twice, instead of the
      first two characters in the id file.
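
      On the kernel side, a hedged sketch of the fixed read handler (the
      id extraction is illustrative; the point is delegating ppos handling
      to simple_read_from_buffer()):

        static ssize_t
        event_id_read(struct file *filp, char __user *ubuf, size_t cnt,
                      loff_t *ppos)
        {
                int id = (long)filp->private_data;  /* illustrative */
                char buf[32];
                int len;

                len = sprintf(buf, "%d\n", id);
                /* Tracks *ppos and handles partial reads correctly. */
                return simple_read_from_buffer(ubuf, cnt, ppos, buf, len);
        }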
      
      Link: http://lkml.kernel.org/r/20181231115837.4932-1-elazar@lightbitslabs.com
      
      Cc: Orit Wasserman <orit.was@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: stable@vger.kernel.org
      Fixes: 23725aee ("ftrace: provide an id file for each event")
      Signed-off-by: Elazar Leibovich <elazar@lightbitslabs.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      fb8c9c90