1. 25 April 2017, 1 commit
  2. 22 April 2017, 2 commits
  3. 20 April 2017, 2 commits
    • ring-buffer: Have ring_buffer_iter_empty() return true when empty · 78f7a45d
      Committed by Steven Rostedt (VMware)
      I noticed that reading the snapshot file when it is empty no longer gives a
      status. It is supposed to show the status of the snapshot buffer as well as how
      to allocate and use it. For example:
      
       ># cat snapshot
       # tracer: nop
       #
       #
       # * Snapshot is allocated *
       #
       # Snapshot commands:
       # echo 0 > snapshot : Clears and frees snapshot buffer
       # echo 1 > snapshot : Allocates snapshot buffer, if not already allocated.
       #                      Takes a snapshot of the main buffer.
       # echo 2 > snapshot : Clears snapshot buffer (but does not allocate or free)
       #                      (Doesn't have to be '2' works with any number that
       #                       is not a '0' or '1')
      
      But instead it just showed an empty buffer:
      
       ># cat snapshot
       # tracer: nop
       #
       # entries-in-buffer/entries-written: 0/0   #P:4
       #
       #                              _-----=> irqs-off
       #                             / _----=> need-resched
       #                            | / _---=> hardirq/softirq
       #                            || / _--=> preempt-depth
       #                            ||| /     delay
       #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
       #              | |       |   ||||       |         |
      
      What happened was that the code used the ring_buffer_iter_empty() function to
      see if the buffer was empty, and if it was, it showed the status. But that
      function returned false even when the buffer was empty. The reason was that
      the iter header page was on the reader page, and the reader page was empty,
      but so was the rest of the buffer. The check only tested whether the iter was
      on the commit page, and the commit page no longer pointed to the reader page;
      yet since all pages were empty, the buffer as a whole was empty too.
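
      In simplified, self-contained form (hypothetical stand-in structures, not the
      kernel's), the old and fixed checks compare roughly like this:

       #include <stdbool.h>

       /* Hypothetical, simplified stand-ins for the ring-buffer structures. */
       struct page_model { int commit; };              /* bytes committed on the page */
       struct iter_model {
               struct page_model *page;                /* page the iterator is on */
               int head;                               /* read position on that page */
               struct page_model *reader_page;         /* buffer state ... */
               struct page_model *commit_page;
       };

       /* Old check: empty only if the iterator caught up with the commit page. */
       static bool iter_empty_old(const struct iter_model *it)
       {
               return it->page == it->commit_page &&
                      it->head == it->commit_page->commit;
       }

       /* Fixed idea: an iterator parked on a fully read reader page also means
        * "empty" when nothing was committed anywhere else, even though the
        * commit page no longer points at the reader page. */
       static bool iter_empty_fixed(const struct iter_model *it)
       {
               if (iter_empty_old(it))
                       return true;
               return it->page == it->reader_page &&
                      it->head == it->reader_page->commit &&
                      it->commit_page->commit == 0;
       }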
      
      Cc: stable@vger.kernel.org
      Fixes: 651e22f2 ("ring-buffer: Always reset iterator to reader page")
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
    • tracing: Allocate the snapshot buffer before enabling probe · df62db5b
      Committed by Steven Rostedt (VMware)
      Currently the snapshot trigger enables the probe and then allocates the
      snapshot. If the probe triggers before the allocation, it could cause the
      snapshot to fail and turn tracing off. It's best to allocate the snapshot
      buffer first, and then enable the trigger. If something goes wrong in the
      enabling of the trigger, the snapshot buffer is still allocated, but it can
      also be freed by the user by writing zero into the snapshot buffer file.
      
      Also add a check of the return status of alloc_snapshot().
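
      A minimal sketch of that ordering (the helper names below are hypothetical
      stand-ins, not the actual tracing functions):

       /* Assumed stand-ins for the real tracing helpers (alloc_snapshot() and
        * the trigger registration); not their actual signatures. */
       static int alloc_snapshot_buffer(void)   { return 0; }
       static int register_snapshot_probe(void) { return 0; }

       static int snapshot_trigger_enable(void)
       {
               int ret = alloc_snapshot_buffer();   /* allocate first, check status */

               if (ret < 0)
                       return ret;

               ret = register_snapshot_probe();     /* only now can the probe fire */
               if (ret < 0)
                       return ret;   /* buffer stays allocated; the user can free
                                      * it by writing 0 into the snapshot file */
               return 0;
       }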
      
      Cc: stable@vger.kernel.org
      Fixes: 77fd5c15 ("tracing: Add snapshot trigger to function probes")
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
  4. 19 April 2017, 1 commit
    • sparc64: Use LOCKDEP_SMALL, not PROVE_LOCKING_SMALL · 395102db
      Committed by Daniel Jordan
      CONFIG_PROVE_LOCKING_SMALL shrinks the memory usage of lockdep so the
      kernel text, data, and bss fit in the required 32MB limit, but this
      option is not set for every config that enables lockdep.
      
      A 4.10 kernel fails to boot with the console output
      
          Kernel: Using 8 locked TLB entries for main kernel image.
          hypervisor_tlb_lock[2000000:0:8000000071c007c3:1]: errors with f
          Program terminated
      
      with these config options
      
          CONFIG_LOCKDEP=y
          CONFIG_LOCK_STAT=y
          CONFIG_PROVE_LOCKING=n
      
      To fix, rename CONFIG_PROVE_LOCKING_SMALL to CONFIG_LOCKDEP_SMALL, and
      enable this option with CONFIG_LOCKDEP=y so we get the reduced memory
      usage every time lockdep is turned on.
      
      Tested that CONFIG_LOCKDEP_SMALL is set to 'y' if and only if
      CONFIG_LOCKDEP is set to 'y'.  When other lockdep-related config options
      that select CONFIG_LOCKDEP are enabled (e.g. CONFIG_LOCK_STAT or
      CONFIG_PROVE_LOCKING), verified that CONFIG_LOCKDEP_SMALL is also
      enabled.
      
      Fixes: e6b5f1be ("config: Adding the new config parameter CONFIG_PROVE_LOCKING_SMALL for sparc")
      Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Reviewed-by: Babu Moger <babu.moger@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  5. 18 April 2017, 4 commits
    • ftrace: Fix function pid filter on instances · d879d0b8
      Committed by Namhyung Kim
      When the function tracer has a pid filter, it adds a probe to sched_switch
      to track whether the current task should be ignored.  The probe checks
      ftrace_ignore_pid from the current tr to filter tasks.  But it fails to
      delete the probe when an instance is removed, which can cause a crash
      due to the invalid tr pointer (use-after-free).
      
      This is easily reproducible with the following:
      
        # cd /sys/kernel/debug/tracing
        # mkdir instances/buggy
        # echo $$ > instances/buggy/set_ftrace_pid
        # rmdir instances/buggy
      
        ============================================================================
        BUG: KASAN: use-after-free in ftrace_filter_pid_sched_switch_probe+0x3d/0x90
        Read of size 8 by task kworker/0:1/17
        CPU: 0 PID: 17 Comm: kworker/0:1 Tainted: G    B           4.11.0-rc3  #198
        Call Trace:
         dump_stack+0x68/0x9f
         kasan_object_err+0x21/0x70
         kasan_report.part.1+0x22b/0x500
         ? ftrace_filter_pid_sched_switch_probe+0x3d/0x90
         kasan_report+0x25/0x30
         __asan_load8+0x5e/0x70
         ftrace_filter_pid_sched_switch_probe+0x3d/0x90
         ? fpid_start+0x130/0x130
         __schedule+0x571/0xce0
         ...
      
      To fix it, use ftrace_clear_pids() to unregister the probe.  As
      instance_rmdir() has already updated the ftrace code, it can just free the
      filter safely.
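
      A minimal, self-contained sketch of why the ordering matters (hypothetical
      stand-ins for the real ftrace objects):

       #include <stdlib.h>

       /* Hypothetical stand-ins for the real trace_array/probe objects. */
       struct trace_array_model { int *ignore_pid; };

       static void clear_pids_model(struct trace_array_model *tr)
       {
               /* stands in for ftrace_clear_pids(): unregister the sched_switch
                * probe that captured 'tr', then drop the pid filter */
               free(tr->ignore_pid);
               tr->ignore_pid = NULL;
       }

       static void instance_rmdir_model(struct trace_array_model *tr)
       {
               clear_pids_model(tr);   /* must run before tr is freed ... */
               free(tr);               /* ... or the probe keeps a dangling tr */
       }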
      
      Link: http://lkml.kernel.org/r/20170417024430.21194-2-namhyung@kernel.org
      
      Fixes: 0c8916c3 ("tracing: Add rmdir to remove multibuffer instances")
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: stable@vger.kernel.org
      Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
      Signed-off-by: Namhyung Kim <namhyung@kernel.org>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
    • bpf: fix checking xdp_adjust_head on tail calls · c2002f98
      Committed by Daniel Borkmann
      Commit 17bedab2 ("bpf: xdp: Allow head adjustment in XDP prog")
      added the xdp_adjust_head bit to the BPF prog in order to tell drivers
      that the program that is to be attached requires support for the XDP
      bpf_xdp_adjust_head() helper such that drivers not supporting this
      helper can reject the program. There are also drivers that do support
      the helper, but need to check the xdp_adjust_head bit in order to move
      packet metadata prepended by the firmware out of the way to make headroom.
      
      For these cases, the current check for xdp_adjust_head bit is insufficient
      since there can be cases where the program itself does not use the
      bpf_xdp_adjust_head() helper, but tail calls into another program that
      uses bpf_xdp_adjust_head(). As such, the xdp_adjust_head bit is still
      set to 0. Since the first program has no control over which program it
      calls into, we need to assume that bpf_xdp_adjust_head() helper is used
      upon tail calls. Thus, for the very same reasons as for cb_access, set the
      xdp_adjust_head bit to 1 when the main program uses tail calls.
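
      A simplified model of the resulting rule (hypothetical struct; the real flag
      fixup happens in the verifier):

       #include <stdbool.h>

       /* Simplified model of the flag fixup (hypothetical struct, not the real
        * struct bpf_prog). */
       struct xdp_prog_model {
               bool uses_tail_call;
               bool xdp_adjust_head;   /* "may move packet start" hint for drivers */
       };

       /* The tail-called program is unknown at verification time, so a program
        * that tail calls must conservatively advertise bpf_xdp_adjust_head() use. */
       static void finalize_xdp_flags(struct xdp_prog_model *prog)
       {
               if (prog->uses_tail_call)
                       prog->xdp_adjust_head = true;
       }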
      
      Fixes: 17bedab2 ("bpf: xdp: Allow head adjustment in XDP prog")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Cc: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: fix cb access in socket filter programs on tail calls · 6b1bb01b
      Committed by Daniel Borkmann
      Commit ff936a04 ("bpf: fix cb access in socket filter programs")
      added a fix for socket filter programs such that in i) AF_PACKET the
      20 bytes of skb->cb[] area gets zeroed before use in order to not leak
      data, and ii) socket filter programs attached to TCP/UDP sockets need
      to save/restore these 20 bytes since they are also used by protocol
      layers at that time.
      
      The problem is that bpf_prog_run_save_cb() and bpf_prog_run_clear_cb()
      only look at the actual attached program to determine whether to zero
      or save/restore the skb->cb[] parts. There can be cases where the
      actual attached program does not access the skb->cb[], but the program
      tail calls into another program which does access this area. In such
      a case, the zero or save/restore is currently not performed.
      
      Since the programs we tail call into are unknown at verification time
      and can dynamically change, we need to assume that whenever the attached
      program performs a tail call, that later programs could access the
      skb->cb[], and therefore we need to always set cb_access to 1.
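
      A simplified model of the reasoning (hypothetical types; the real work
      happens in the cb save/restore wrappers and in the verifier setting
      cb_access):

       #include <stdbool.h>
       #include <string.h>

       #define CB_LEN 20                   /* bytes of skb->cb[] visible to filters */

       /* Hypothetical model of the decision the runner has to make. */
       struct sock_filter_model { bool accesses_cb; bool uses_tail_call; };

       static bool prog_may_touch_cb(const struct sock_filter_model *p)
       {
               /* the tail-called program is unknown, so assume it touches cb[] */
               return p->accesses_cb || p->uses_tail_call;
       }

       /* TCP/UDP attach point: save/restore the 20 bytes around the run.
        * (The AF_PACKET attach point would zero them instead.) */
       static void run_with_cb_protection(const struct sock_filter_model *p,
                                          char *skb_cb)
       {
               char saved[CB_LEN];
               bool protect = prog_may_touch_cb(p);

               if (protect)
                       memcpy(saved, skb_cb, CB_LEN);   /* save before the run */
               /* ... run the BPF program here ... */
               if (protect)
                       memcpy(skb_cb, saved, CB_LEN);   /* restore afterwards */
       }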
      
      Fixes: ff936a04 ("bpf: fix cb access in socket filter programs")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: lru: Lower the PERCPU_NR_SCANS from 16 to 4 · 695ba265
      Committed by Martin KaFai Lau
      After doing map_perf_test with a much bigger
      BPF_F_NO_COMMON_LRU map, the perf report shows a
      lot of time spent in rotating the inactive list (i.e.
      __bpf_lru_list_rotate_inactive):
      > map_perf_test 32 8 10000 1000000 | awk '{sum += $3}END{print sum}'
      19644783 (19M/s)
      > map_perf_test 32 8 10000000 10000000 |  awk '{sum += $3}END{print sum}'
      6283930 (6.28M/s)
      
      "Inactive" usually means the element is not in the cache.  Hence,
      there is a need to tune the PERCPU_NR_SCANS value.
      
      This patch finds a better number of elements to
      scan during each list rotation.  The PERCPU_NR_SCANS (which
      is defined the same as PERCPU_FREE_TARGET) decreases
      from 16 elements to 4 elements.  This change only
      affects the BPF_F_NO_COMMON_LRU map.
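
      In code form the change is just the constant (names as described above;
      a sketch, not the full header):

       /* elements examined per inactive-list rotation (per-CPU LRU only);
        * was 16, now 4 */
       #define PERCPU_FREE_TARGET  4
       #define PERCPU_NR_SCANS     PERCPU_FREE_TARGET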
      
      The test_lru_dist does not show meaningful difference
      between 16 and 4.  Our production L4 load balancer which uses
      the LRU map for conntrack-ing also shows little change in cache
      hit rate.  Since both benchmark and production data show no
      cache-hit difference, PERCPU_NR_SCANS is lowered from 16 to 4.
      We can consider making it configurable if we find a use case
      later that shows another value works better and/or use
      a different rotation strategy.
      
      After this change:
      > map_perf_test 32 8 10000000 10000000 |  awk '{sum += $3}END{print sum}'
      9240324 (9.2M/s)
      
      i.e. 6.28M/s -> 9.2M/s
      
      The test_lru_dist has not shown meaningful difference:
      > test_lru_dist zipf.100k.a1_01.out 4000 1:
      nr_misses: 31575 (Before) vs 31566 (After)
      
      > test_lru_dist zipf.100k.a0_01.out 40000 1
      nr_misses: 67036 (Before) vs 67031 (After)
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  6. 16 April 2017, 1 commit
  7. 15 April 2017, 1 commit
    • ftrace: Fix removing of second function probe · 82cc4fc2
      Committed by Steven Rostedt (VMware)
      When two function probes are added to set_ftrace_filter, and then one of
      them is removed, the update to the function locations is not performed and
      the record keeping of the function states is corrupted, which causes an
      ftrace_bug() to occur.
      
      This is easily reproducible by adding two probes, removing one, and then
      adding it back again.
      
       # cd /sys/kernel/debug/tracing
       # echo schedule:traceoff > set_ftrace_filter
       # echo do_IRQ:traceoff > set_ftrace_filter
       # echo \!do_IRQ:traceoff > /debug/tracing/set_ftrace_filter
       # echo do_IRQ:traceoff > set_ftrace_filter
      
      Causes:
       ------------[ cut here ]------------
       WARNING: CPU: 2 PID: 1098 at kernel/trace/ftrace.c:2369 ftrace_get_addr_curr+0x143/0x220
       Modules linked in: [...]
       CPU: 2 PID: 1098 Comm: bash Not tainted 4.10.0-test+ #405
       Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v02.05 05/07/2012
       Call Trace:
        dump_stack+0x68/0x9f
        __warn+0x111/0x130
        ? trace_irq_work_interrupt+0xa0/0xa0
        warn_slowpath_null+0x1d/0x20
        ftrace_get_addr_curr+0x143/0x220
        ? __fentry__+0x10/0x10
        ftrace_replace_code+0xe3/0x4f0
        ? ftrace_int3_handler+0x90/0x90
        ? printk+0x99/0xb5
        ? 0xffffffff81000000
        ftrace_modify_all_code+0x97/0x110
        arch_ftrace_update_code+0x10/0x20
        ftrace_run_update_code+0x1c/0x60
        ftrace_run_modify_code.isra.48.constprop.62+0x8e/0xd0
        register_ftrace_function_probe+0x4b6/0x590
        ? ftrace_startup+0x310/0x310
        ? debug_lockdep_rcu_enabled.part.4+0x1a/0x30
        ? update_stack_state+0x88/0x110
        ? ftrace_regex_write.isra.43.part.44+0x1d3/0x320
        ? preempt_count_sub+0x18/0xd0
        ? mutex_lock_nested+0x104/0x800
        ? ftrace_regex_write.isra.43.part.44+0x1d3/0x320
        ? __unwind_start+0x1c0/0x1c0
        ? _mutex_lock_nest_lock+0x800/0x800
        ftrace_trace_probe_callback.isra.3+0xc0/0x130
        ? func_set_flag+0xe0/0xe0
        ? __lock_acquire+0x642/0x1790
        ? __might_fault+0x1e/0x20
        ? trace_get_user+0x398/0x470
        ? strcmp+0x35/0x60
        ftrace_trace_onoff_callback+0x48/0x70
        ftrace_regex_write.isra.43.part.44+0x251/0x320
        ? match_records+0x420/0x420
        ftrace_filter_write+0x2b/0x30
        __vfs_write+0xd7/0x330
        ? do_loop_readv_writev+0x120/0x120
        ? locks_remove_posix+0x90/0x2f0
        ? do_lock_file_wait+0x160/0x160
        ? __lock_is_held+0x93/0x100
        ? rcu_read_lock_sched_held+0x5c/0xb0
        ? preempt_count_sub+0x18/0xd0
        ? __sb_start_write+0x10a/0x230
        ? vfs_write+0x222/0x240
        vfs_write+0xef/0x240
        SyS_write+0xab/0x130
        ? SyS_read+0x130/0x130
        ? trace_hardirqs_on_caller+0x182/0x280
        ? trace_hardirqs_on_thunk+0x1a/0x1c
        entry_SYSCALL_64_fastpath+0x18/0xad
       RIP: 0033:0x7fe61c157c30
       RSP: 002b:00007ffe87890258 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
       RAX: ffffffffffffffda RBX: ffffffff8114a410 RCX: 00007fe61c157c30
       RDX: 0000000000000010 RSI: 000055814798f5e0 RDI: 0000000000000001
       RBP: ffff8800c9027f98 R08: 00007fe61c422740 R09: 00007fe61ca53700
       R10: 0000000000000073 R11: 0000000000000246 R12: 0000558147a36400
       R13: 00007ffe8788f160 R14: 0000000000000024 R15: 00007ffe8788f15c
        ? trace_hardirqs_off_caller+0xc0/0x110
       ---[ end trace 99fa09b3d9869c2c ]---
       Bad trampoline accounting at: ffffffff81cc3b00 (do_IRQ+0x0/0x150)
      
      Cc: stable@vger.kernel.org
      Fixes: 59df055f ("ftrace: trace different functions with a different tracer")
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
  8. 14 April 2017, 2 commits
  9. 12 April 2017, 3 commits
  10. 11 April 2017, 2 commits
    • bpf: reference may_access_skb() from __bpf_prog_run() · 96a94cc5
      Committed by Johannes Berg
      It took me quite some time to figure out how this was linked, so in order
      to save the next person the effort of finding it, add a comment in
      __bpf_prog_run() that indicates what exactly determines that a program
      can access the ctx == skb.
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • cgroup: avoid attaching a cgroup root to two different superblocks · bfb0b80d
      Committed by Zefan Li
      Run this:
      
          touch file0
          for ((; ;))
          {
              mount -t cpuset xxx file0
          }
      
      And this concurrently:
      
          touch file1
          for ((; ;))
          {
              mount -t cpuset xxx file1
          }
      
      We'll trigger a warning like this:
      
       ------------[ cut here ]------------
       WARNING: CPU: 1 PID: 4675 at lib/percpu-refcount.c:317 percpu_ref_kill_and_confirm+0x92/0xb0
       percpu_ref_kill_and_confirm called more than once on css_release!
       CPU: 1 PID: 4675 Comm: mount Not tainted 4.11.0-rc5+ #5
       Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007
       Call Trace:
        dump_stack+0x63/0x84
        __warn+0xd1/0xf0
        warn_slowpath_fmt+0x5f/0x80
        percpu_ref_kill_and_confirm+0x92/0xb0
        cgroup_kill_sb+0x95/0xb0
        deactivate_locked_super+0x43/0x70
        deactivate_super+0x46/0x60
       ...
       ---[ end trace a79f61c2a2633700 ]---
      
      Here's a race:
      
        Thread A				Thread B
      
        cgroup1_mount()
          # alloc a new cgroup root
          cgroup_setup_root()
      					cgroup1_mount()
      					  # no sb yet, returns NULL
      					  kernfs_pin_sb()
      
      					  # but succeeds in getting the refcnt,
      					  # so re-use cgroup root
      					  percpu_ref_tryget_live()
          # alloc sb with cgroup root
          cgroup_do_mount()
      
        cgroup_kill_sb()
      					  # alloc another sb with same root
      					  cgroup_do_mount()
      
      					cgroup_kill_sb()
      
      We end up using the same cgroup root for two different superblocks,
      so percpu_ref_kill() will be called twice on the same root when the
      two superblocks are destroyed.
      
      The fix is to make sure the superblock pinning has really succeeded.
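
      A minimal sketch of the decision the mount path has to make (hypothetical
      types; the real check involves kernfs_pin_sb() and percpu_ref_tryget_live()):

       #include <stdbool.h>
       #include <stddef.h>

       /* Simplified model of the mount-side decision (hypothetical types).
        * 'pinned_sb' models what kernfs_pin_sb() hands back: an existing
        * superblock, NULL when none exists yet, or an error. */
       struct sb_model { int dummy; };

       static bool may_reuse_root(struct sb_model *pinned_sb, bool pin_failed,
                                  bool got_live_ref)
       {
               /* The buggy path rejected only errors, so a NULL superblock
                * ("no sb yet") still led to re-using the root and a second
                * superblock ended up attached to it.  Re-use is safe only when
                * the pinning really succeeded and the refcount was taken. */
               if (pin_failed || pinned_sb == NULL || !got_live_ref)
                       return false;   /* set up a fresh root instead */
               return true;
       }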
      
      Cc: stable@vger.kernel.org # 3.16+
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: Zefan Li <lizefan@huawei.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
  11. 10 April 2017, 1 commit
    • audit: make sure we don't let the retry queue grow without bounds · 264d5096
      Committed by Paul Moore
      The retry queue is intended to provide a temporary buffer in the case
      of transient errors when communicating with auditd; it is not meant
      as a long-lived queue, as that functionality is provided by the hold
      queue.
      
      This patch fixes a problem identified by Seth where the retry queue
      could grow uncontrollably if an auditd instance did not connect to
      the kernel to drain the queues.  It does so by doing the
      following:
      
      * Make sure we always call auditd_reset() if we decide the connection
      with auditd is really dead.  There were some cases in
      kauditd_hold_skb() where we did not reset the connection, this patch
      relocates the reset calls to kauditd_thread() so all the error
      conditions are caught and the connection reset.  As a side effect,
      this means we could move auditd_reset() and get rid of the forward
      definition at the top of kernel/audit.c.
      
      * We never checked the status of the auditd connection when
      processing the main audit queue which meant that the retry queue
      could grow unchecked.  This patch adds a call to auditd_reset()
      after the main queue has been processed if auditd is not connected,
      the auditd_reset() call will make sure the retry and hold queues are
      correctly managed/flushed so that the retry queue remains reasonable.
      
      Cc: <stable@vger.kernel.org> # 4.10.x-: 5b52330b
      Reported-by: Seth Forshee <seth.forshee@canonical.com>
      Signed-off-by: Paul Moore <paul@paul-moore.com>
  12. 09 April 2017, 1 commit
  13. 08 April 2017, 2 commits
  14. 05 April 2017, 1 commit
  15. 04 April 2017, 1 commit
  16. 02 April 2017, 3 commits
    • bpf: introduce BPF_PROG_TEST_RUN command · 1cf1cae9
      Committed by Alexei Starovoitov
      Development and testing of networking bpf programs is quite cumbersome.
      Despite the availability of user space bpf interpreters, the kernel is
      the ultimate authority and execution environment.
      Current test frameworks for TC include creation of netns, veth,
      qdiscs and use of various packet generators just to test functionality
      of a bpf program. XDP testing is even more complicated, since
      qemu needs to be started with gro/gso disabled and precise queue
      configuration, transferring of xdp program from host into guest,
      attaching to virtio/eth0 and generating traffic from the host
      while capturing the results from the guest.
      
      Moreover, analyzing performance bottlenecks in an XDP program is
      impossible in a virtio environment, since the cost of running the program
      is tiny compared to the overhead of virtio packet processing,
      so performance testing can only be done on a physical nic
      with another server generating traffic.
      
      Furthermore ongoing changes to user space control plane of production
      applications cannot be run on the test servers leaving bpf programs
      stubbed out for testing.
      
      Last but not least, the upstream llvm changes are validated by the bpf
      backend testsuite which has no ability to test the code generated.
      
      To improve this situation introduce BPF_PROG_TEST_RUN command
      to test and performance benchmark bpf programs.
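
      A hedged user-space usage sketch (assumes headers that already define
      BPF_PROG_TEST_RUN and the bpf_attr 'test' fields; prog_fd is a program
      loaded elsewhere, and error handling is trimmed):

       #include <linux/bpf.h>
       #include <string.h>
       #include <sys/syscall.h>
       #include <unistd.h>

       /* Push one packet through an already-loaded program 'repeat' times and
        * read back its return value and average duration in nanoseconds. */
       static int test_run(int prog_fd, void *pkt_in, unsigned int len_in,
                           void *pkt_out, unsigned int *len_out,
                           unsigned int *retval, unsigned int *duration_ns)
       {
               union bpf_attr attr;
               int err;

               memset(&attr, 0, sizeof(attr));
               attr.test.prog_fd      = prog_fd;
               attr.test.data_in      = (unsigned long)pkt_in;
               attr.test.data_size_in = len_in;
               attr.test.data_out     = (unsigned long)pkt_out;
               attr.test.repeat       = 1000000;   /* benchmark-style repeat count */

               err = syscall(__NR_bpf, BPF_PROG_TEST_RUN, &attr, sizeof(attr));
               if (err)
                       return err;

               *len_out     = attr.test.data_size_out;
               *retval      = attr.test.retval;
               *duration_ns = attr.test.duration;
               return 0;
       }
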
      
      Joint work with Daniel Borkmann.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf, verifier: fix rejection of unaligned access checks for map_value_adj · 79adffcd
      Committed by Daniel Borkmann
      Currently, the verifier doesn't reject unaligned access for map_value_adj
      register types. Commit 48461135 ("bpf: allow access into map value
      arrays") added logic to check_ptr_alignment() extending it from PTR_TO_PACKET
      to also PTR_TO_MAP_VALUE_ADJ, but for PTR_TO_MAP_VALUE_ADJ no enforcement
      is in place, because reg->id for PTR_TO_MAP_VALUE_ADJ reg types is never
      non-zero, meaning, we can cause BPF_H/_W/_DW-based unaligned access for
      architectures not supporting efficient unaligned access, and thus in the
      worst case could raise exceptions on archs that are unable to correct the
      unaligned access, or that perform a different memory access than the one
      actually requested.
      
      i) Unaligned load with !CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
         on r0 (map_value_adj):
      
         0: (bf) r2 = r10
         1: (07) r2 += -8
         2: (7a) *(u64 *)(r2 +0) = 0
         3: (18) r1 = 0x42533a00
         5: (85) call bpf_map_lookup_elem#1
         6: (15) if r0 == 0x0 goto pc+11
          R0=map_value(ks=8,vs=48,id=0),min_value=0,max_value=0 R10=fp
         7: (61) r1 = *(u32 *)(r0 +0)
         8: (35) if r1 >= 0xb goto pc+9
          R0=map_value(ks=8,vs=48,id=0),min_value=0,max_value=0 R1=inv,min_value=0,max_value=10 R10=fp
         9: (07) r0 += 3
        10: (79) r7 = *(u64 *)(r0 +0)
          R0=map_value_adj(ks=8,vs=48,id=0),min_value=3,max_value=3 R1=inv,min_value=0,max_value=10 R10=fp
        11: (79) r7 = *(u64 *)(r0 +2)
          R0=map_value_adj(ks=8,vs=48,id=0),min_value=3,max_value=3 R1=inv,min_value=0,max_value=10 R7=inv R10=fp
        [...]
      
      ii) Unaligned store with !CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
          on r0 (map_value_adj):
      
         0: (bf) r2 = r10
         1: (07) r2 += -8
         2: (7a) *(u64 *)(r2 +0) = 0
         3: (18) r1 = 0x4df16a00
         5: (85) call bpf_map_lookup_elem#1
         6: (15) if r0 == 0x0 goto pc+19
          R0=map_value(ks=8,vs=48,id=0),min_value=0,max_value=0 R10=fp
         7: (07) r0 += 3
         8: (7a) *(u64 *)(r0 +0) = 42
          R0=map_value_adj(ks=8,vs=48,id=0),min_value=3,max_value=3 R10=fp
         9: (7a) *(u64 *)(r0 +2) = 43
          R0=map_value_adj(ks=8,vs=48,id=0),min_value=3,max_value=3 R10=fp
        10: (7a) *(u64 *)(r0 -2) = 44
          R0=map_value_adj(ks=8,vs=48,id=0),min_value=3,max_value=3 R10=fp
        [...]
      
      For the PTR_TO_PACKET type, reg->id is initially zero when skb->data
      was fetched; it later receives a reg->id from the env->id_gen generator
      once another register with UNKNOWN_VALUE type was added to it via
      check_packet_ptr_add(). The purpose of this reg->id is twofold: i) it
      is used in find_good_pkt_pointers() for setting the allowed access
      range for regs with PTR_TO_PACKET of same id once verifier matched
      on data/data_end tests, and ii) for check_ptr_alignment() to determine
      that when not having efficient unaligned access and register with
      UNKNOWN_VALUE was added to PTR_TO_PACKET, that we're only allowed
      to access the content bytewise due to unknown unalignment. reg->id
      was never intended for PTR_TO_MAP_VALUE{,_ADJ} types and thus is
      always zero, the only marking is in PTR_TO_MAP_VALUE_OR_NULL that
      was added after 48461135 via 57a09bf0 ("bpf: Detect identical
      PTR_TO_MAP_VALUE_OR_NULL registers"). Above tests will fail for
      non-root environment due to prohibited pointer arithmetic.
      
      The fix splits register-type specific checks into their own helper
      instead of keeping them combined, so we don't run into a similar
      issue in future once we extend check_ptr_alignment() further and
      forget to add reg->type checks for some of the checks.
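
      A simplified model of that split (hypothetical names and types, not the
      verifier's own):

       #include <stdbool.h>

       enum reg_kind { REG_PKT_PTR, REG_MAP_VALUE, REG_MAP_VALUE_ADJ, REG_OTHER };

       struct reg_model { enum reg_kind kind; unsigned int id; };

       static bool efficient_unaligned;   /* CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS */

       static int check_pkt_ptr_align(const struct reg_model *reg, int size)
       {
               /* packet pointer with an unknown offset: byte access only */
               return (reg->id && size != 1) ? -1 : 0;
       }

       static int check_val_ptr_align(const struct reg_model *reg, int size)
       {
               /* adjusted map value pointer: offset is unknown, byte access only */
               (void)reg;
               return (size != 1) ? -1 : 0;
       }

       static int check_ptr_align(const struct reg_model *reg, int off, int size)
       {
               (void)off;
               if (efficient_unaligned)
                       return 0;
               switch (reg->kind) {       /* one helper per register type ... */
               case REG_PKT_PTR:
                       return check_pkt_ptr_align(reg, size);
               case REG_MAP_VALUE_ADJ:
                       return check_val_ptr_align(reg, size);
               default:                   /* ... so a new type cannot silently
                                           * skip its own check */
                       return 0;
               }
       }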
      
      Fixes: 48461135 ("bpf: allow access into map value arrays")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: Josef Bacik <jbacik@fb.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf, verifier: fix alu ops against map_value{, _adj} register types · fce366a9
      Committed by Daniel Borkmann
      While looking into map_value_adj, I noticed that alu operations
      directly on the map_value() resp. map_value_adj() register (any
      alu operation on a map_value() register will turn it into a
      map_value_adj() typed register) are not sufficiently protected
      against some of the operations. Two non-exhaustive examples are
      provided that the verifier needs to reject:
      
       i) BPF_AND on r0 (map_value_adj):
      
        0: (bf) r2 = r10
        1: (07) r2 += -8
        2: (7a) *(u64 *)(r2 +0) = 0
        3: (18) r1 = 0xbf842a00
        5: (85) call bpf_map_lookup_elem#1
        6: (15) if r0 == 0x0 goto pc+2
         R0=map_value(ks=8,vs=48,id=0),min_value=0,max_value=0 R10=fp
        7: (57) r0 &= 8
        8: (7a) *(u64 *)(r0 +0) = 22
         R0=map_value_adj(ks=8,vs=48,id=0),min_value=0,max_value=8 R10=fp
        9: (95) exit
      
        from 6 to 9: R0=inv,min_value=0,max_value=0 R10=fp
        9: (95) exit
        processed 10 insns
      
      ii) BPF_ADD in 32 bit mode on r0 (map_value_adj):
      
        0: (bf) r2 = r10
        1: (07) r2 += -8
        2: (7a) *(u64 *)(r2 +0) = 0
        3: (18) r1 = 0xc24eee00
        5: (85) call bpf_map_lookup_elem#1
        6: (15) if r0 == 0x0 goto pc+2
         R0=map_value(ks=8,vs=48,id=0),min_value=0,max_value=0 R10=fp
        7: (04) (u32) r0 += (u32) 0
        8: (7a) *(u64 *)(r0 +0) = 22
         R0=map_value_adj(ks=8,vs=48,id=0),min_value=0,max_value=0 R10=fp
        9: (95) exit
      
        from 6 to 9: R0=inv,min_value=0,max_value=0 R10=fp
        9: (95) exit
        processed 10 insns
      
      The issue is that, while min_value / max_value boundaries for the access
      are adjusted appropriately, we change the pointer value in a way
      that cannot be sufficiently tracked anymore from its origin.
      Operations like BPF_{AND,OR,DIV,MUL,etc} on a destination register
      that is PTR_TO_MAP_VALUE{,_ADJ} was probably unintended, in fact,
      all the test cases coming with 48461135 ("bpf: allow access
      into map value arrays") perform BPF_ADD only on the destination
      register that is PTR_TO_MAP_VALUE_ADJ.
      
      Only for UNKNOWN_VALUE register types such operations make sense,
      f.e. with unknown memory content fetched initially from a constant
      offset from the map value memory into a register. That register is
      then later tested against lower / upper bounds, so that the verifier
      can then do the tracking of min_value / max_value, and properly
      check once that UNKNOWN_VALUE register is added to the destination
      register with type PTR_TO_MAP_VALUE{,_ADJ}. This is also what the
      original use-case is solving. Note, tracking on what is being
      added is done through adjust_reg_min_max_vals() and later access
      to the map value enforced with these boundaries and the given offset
      from the insn through check_map_access_adj().
      
      The tests will fail in a non-root environment due to prohibited pointer
      arithmetic; in particular, in check_alu_op() we bail out on the
      is_pointer_value() check on the dst_reg (which is false in root
      case as we allow for pointer arithmetic via env->allow_ptr_leaks).
      
      Similarly to PTR_TO_PACKET, one way to fix it is to restrict the
      allowed operations on PTR_TO_MAP_VALUE{,_ADJ} registers to 64 bit
      mode BPF_ADD. The test_verifier suite runs fine after the patch
      and it also rejects mentioned test cases.
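
      A simplified model of the restriction (hypothetical names; the real check
      lives in the verifier's ALU handling):

       #include <stdbool.h>

       enum reg_kind2 { R_MAP_VALUE, R_MAP_VALUE_ADJ, R_UNKNOWN, R_OTHER };
       enum alu_op    { ALU_ADD, ALU_SUB, ALU_AND, ALU_OR, ALU_MUL, ALU_DIV };

       /* Reject every ALU op on a map-value pointer except 64-bit ADD, mirroring
        * what is already done for packet pointers; bounds tracking stays valid. */
       static int check_map_value_alu(enum reg_kind2 dst, enum alu_op op, bool alu64)
       {
               if (dst != R_MAP_VALUE && dst != R_MAP_VALUE_ADJ)
                       return 0;             /* not a map-value pointer */
               if (op == ALU_ADD && alu64)
                       return 0;             /* tracked via min/max adjustment */
               return -1;                    /* EACCES: pointer becomes untrackable */
       }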
      
      Fixes: 48461135 ("bpf: allow access into map value arrays")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: Josef Bacik <jbacik@fb.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  17. 28 March 2017, 1 commit
  18. 27 March 2017, 1 commit
  19. 25 March 2017, 1 commit
  20. 24 March 2017, 1 commit
    • padata: avoid race in reordering · de5540d0
      Committed by Jason A. Donenfeld
      Under extremely heavy uses of padata, crashes occur, and with list
      debugging turned on, this happens instead:
      
      [87487.298728] WARNING: CPU: 1 PID: 882 at lib/list_debug.c:33
      __list_add+0xae/0x130
      [87487.301868] list_add corruption. prev->next should be next
      (ffffb17abfc043d0), but was ffff8dba70872c80. (prev=ffff8dba70872b00).
      [87487.339011]  [<ffffffff9a53d075>] dump_stack+0x68/0xa3
      [87487.342198]  [<ffffffff99e119a1>] ? console_unlock+0x281/0x6d0
      [87487.345364]  [<ffffffff99d6b91f>] __warn+0xff/0x140
      [87487.348513]  [<ffffffff99d6b9aa>] warn_slowpath_fmt+0x4a/0x50
      [87487.351659]  [<ffffffff9a58b5de>] __list_add+0xae/0x130
      [87487.354772]  [<ffffffff9add5094>] ? _raw_spin_lock+0x64/0x70
      [87487.357915]  [<ffffffff99eefd66>] padata_reorder+0x1e6/0x420
      [87487.361084]  [<ffffffff99ef0055>] padata_do_serial+0xa5/0x120
      
      padata_reorder calls list_add_tail with the list to which it's adding
      locked, which seems correct:
      
      spin_lock(&squeue->serial.lock);
      list_add_tail(&padata->list, &squeue->serial.list);
      spin_unlock(&squeue->serial.lock);
      
      This therefore leaves only one place where such an inconsistency could
      occur: if padata->list is added at the same time on two different threads.
      This padata pointer comes from the function call to
      padata_get_next(pd), which has in it the following block:
      
      next_queue = per_cpu_ptr(pd->pqueue, cpu);
      padata = NULL;
      reorder = &next_queue->reorder;
      if (!list_empty(&reorder->list)) {
             padata = list_entry(reorder->list.next,
                                 struct padata_priv, list);
             spin_lock(&reorder->lock);
             list_del_init(&padata->list);
             atomic_dec(&pd->reorder_objects);
             spin_unlock(&reorder->lock);
      
             pd->processed++;
      
             goto out;
      }
      out:
      return padata;
      
      I strongly suspect that the problem here is that two threads can race
      on the reorder list. Even though the deletion is locked, the call to
      list_entry is not locked, which means it's feasible that two threads
      pick up the same padata object and subsequently call list_add_tail on
      it at the same time. The fix is thus to hoist that lock outside of
      that block.
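
      With the lock hoisted, the block quoted above would look roughly like
      this (same identifiers as the quoted snippet; the actual patch may differ
      in detail):

      next_queue = per_cpu_ptr(pd->pqueue, cpu);
      padata = NULL;
      reorder = &next_queue->reorder;

      spin_lock(&reorder->lock);
      if (!list_empty(&reorder->list)) {
             padata = list_entry(reorder->list.next,
                                 struct padata_priv, list);

             /* list_entry and list_del_init now happen under the same lock,
              * so two threads can no longer pick up the same object */
             list_del_init(&padata->list);
             atomic_dec(&pd->reorder_objects);

             pd->processed++;

             spin_unlock(&reorder->lock);
             goto out;
      }
      spin_unlock(&reorder->lock);
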
      Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
      Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
  21. 23 March 2017, 6 commits
    • sched/clock, x86/perf: Fix "perf test tsc" · 698eff63
      Committed by Peter Zijlstra
      People reported that commit:
      
        5680d809 ("sched/clock: Provide better clock continuity")
      
      broke "perf test tsc".
      
      That commit added another offset to the reported clock value; so
      take that into account when computing the provided offset values.
      Reported-by: Adrian Hunter <adrian.hunter@intel.com>
      Reported-by: Arnaldo Carvalho de Melo <acme@kernel.org>
      Tested-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: 5680d809 ("sched/clock: Provide better clock continuity")
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/clock: Fix clear_sched_clock_stable() preempt wobbly · 71fdb70e
      Committed by Peter Zijlstra
      Paul reported a problem with clear_sched_clock_stable(). Since we run
      all of __clear_sched_clock_stable() from workqueue context, there's a
      preempt problem.
      
      Solve it by only running the static_key_disable() from workqueue.
      Reported-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: fweisbec@gmail.com
      Link: http://lkml.kernel.org/r/20170313124621.GA3328@twins.programming.kicks-ass.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • bpf: Add hash of maps support · bcc6b1b7
      Committed by Martin KaFai Lau
      This patch adds hash of maps support (hashmap->bpf_map).
      BPF_MAP_TYPE_HASH_OF_MAPS is added.
      
      A map-in-map contains a pointer to another map; let's call
      this pointer 'inner_map_ptr'.
      
      Notes on deleting inner_map_ptr from a hash map:
      
      1. For BPF_F_NO_PREALLOC map-in-map, when deleting
         an inner_map_ptr, the htab_elem itself will go through
         a rcu grace period and the inner_map_ptr resides
         in the htab_elem.
      
      2. For pre-allocated htab_elem (!BPF_F_NO_PREALLOC),
         when deleting an inner_map_ptr, the htab_elem may
         get reused immediately.  This situation is similar
         to the existing preallocated use cases.
      
         However, the bpf_map_fd_put_ptr() calls bpf_map_put() which calls
         inner_map->ops->map_free(inner_map) which will go
         through a rcu grace period (i.e. all bpf_map's map_free
         currently goes through a rcu grace period).  Hence,
         the inner_map_ptr is still safe for the rcu reader side.
      
      This patch also includes BPF_MAP_TYPE_HASH_OF_MAPS to the
      check_map_prealloc() in the verifier.  Preallocation is a
      must for BPF_PROG_TYPE_PERF_EVENT.  Hence, even though we don't expect
      heavy updates to map-in-map, enforcing BPF_F_NO_PREALLOC for map-in-map
      is impossible without disallowing BPF_PROG_TYPE_PERF_EVENT from using
      map-in-map first.
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: Add array of maps support · 56f668df
      Committed by Martin KaFai Lau
      This patch adds a few helper funcs to enable map-in-map
      support (i.e. outer_map->inner_map).  The first outer_map type
      BPF_MAP_TYPE_ARRAY_OF_MAPS is also added in this patch.
      The next patch will introduce a hash of maps type.
      
      Any bpf map type can act as an inner_map.  The exception
      is BPF_MAP_TYPE_PROG_ARRAY because the extra level of
      indirection makes it harder to verify the owner_prog_type
      and owner_jited.
      
      Multi-level map-in-map is not supported (i.e. map->map is ok
      but not map->map->map).
      
      When adding an inner_map to an outer_map, it currently checks the
      map_type, key_size, value_size, map_flags, max_entries and ops.
      The verifier also uses those map's properties to do static analysis.
      map_flags is needed because we need to ensure BPF_PROG_TYPE_PERF_EVENT
      is using a preallocated hashtab for the inner_hash also.  ops and
      max_entries are needed to generate inlined map-lookup instructions.
      For simplicity, a plain '==' test is used for both map_flags
      and max_entries.  The equality of ops is implied by the equality of
      map_type.
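
      A simplified model of that compatibility check (hypothetical struct, not
      the real bpf_map):

       #include <stdbool.h>

       /* Simplified model of comparing the outer map's stored inner_map_meta
        * against a map being inserted (hypothetical struct). */
       struct map_model {
               int type, key_size, value_size, map_flags, max_entries;
       };

       static bool inner_map_compatible(const struct map_model *meta,
                                        const struct map_model *map)
       {
               /* plain '==' on flags and max_entries keeps the verifier's
                * static analysis and inlined lookups valid; equal ops is
                * implied by equal type */
               return meta->type        == map->type &&
                      meta->key_size    == map->key_size &&
                      meta->value_size  == map->value_size &&
                      meta->map_flags   == map->map_flags &&
                      meta->max_entries == map->max_entries;
       }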
      
      During outer_map creation time, an inner_map_fd is needed to create an
      outer_map.  However, the inner_map_fd's life time does not depend on the
      outer_map.  The inner_map_fd is merely used to initialize
      the inner_map_meta of the outer_map.
      
      Also, for the outer_map:
      
      * It allows element update and delete from syscall
      * It allows element lookup from bpf_prog
      
      The above is similar to the current fd_array pattern.
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: Fix and simplifications on inline map lookup · fad73a1a
      Committed by Martin KaFai Lau
      Fix in verifier:
      For the same bpf_map_lookup_elem() instruction (i.e. "call 1"),
      a broken case is "a different type of map could be used for the
      same lookup instruction". For example, an array in one case and a
      hashmap in another.  We have to resort to the old dynamic call behavior
      in this case.  The fix is to check for collision on insn_aux->map_ptr.
      If there is a collision, don't inline the map lookup.
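
      A simplified model of that bookkeeping (hypothetical types; the sentinel
      value is illustrative):

       /* A sentinel marks "more than one map seen for this lookup call site". */
       #define LOOKUP_MAP_POISON ((void *)-1L)

       struct insn_aux_model { void *map_ptr; };

       static void record_lookup_map(struct insn_aux_model *aux, void *map)
       {
               if (!aux->map_ptr)
                       aux->map_ptr = map;               /* first map: inline candidate */
               else if (aux->map_ptr != map)
                       aux->map_ptr = LOOKUP_MAP_POISON; /* collision: keep dynamic call */
       }

       static int can_inline_lookup(const struct insn_aux_model *aux)
       {
               return aux->map_ptr && aux->map_ptr != LOOKUP_MAP_POISON;
       }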
      
      Please see the "do_reg_lookup()" in test_map_in_map_kern.c in the later
      patch for how-to trigger the above case.
      
      Simplifications on array_map_gen_lookup():
      1. Calculate elem_size from map->value_size.  It removes the
         need for 'struct bpf_array' which makes the later map-in-map
         implementation easier.
      2. Remove the 'elem_size == 1' test
      
      Fixes: 81ed18ab ("bpf: add helper inlining infra and optimize map_array lookup")
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: fix hashmap extra_elems logic · 8c290e60
      Committed by Alexei Starovoitov
      In both kmalloc and prealloc mode the bpf_map_update_elem() is using
      per-cpu extra_elems to do atomic update when the map is full.
      There are two issues with it. The logic can be misused, since it allows
      max_entries+num_cpus elements to be present in the map. And alloc_extra_elems()
      at map creation time can fail percpu alloc for large map values with a warn:
      WARNING: CPU: 3 PID: 2752 at ../mm/percpu.c:892 pcpu_alloc+0x119/0xa60
      illegal size (32824) or align (8) for percpu allocation
      
      The fixes for both of these issues are different for kmalloc and prealloc modes.
      For prealloc mode, allocate extra num_possible_cpus elements and store
      their pointers into the extra_elems array instead of actual elements.
      Hence we can use these hidden (spare) elements not only when the map is full
      but also during a bpf_map_update_elem() that replaces an existing element.
      That also improves performance, since pcpu_freelist_pop/push is avoided.
      Unfortunately this approach cannot be used for kmalloc mode, which needs
      to kfree elements after an rcu grace period. Therefore switch it back to
      normal kmalloc even when the map is full and an old element exists, like it
      was prior to commit 6c905981 ("bpf: pre-allocate hash map elements").
      
      Add tests to check for over max_entries and large map values.
      Reported-by: Dave Jones <davej@codemonkey.org.uk>
      Fixes: 6c905981 ("bpf: pre-allocate hash map elements")
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  22. 21 March 2017, 2 commits
    • audit: fix auditd/kernel connection state tracking · 5b52330b
      Committed by Paul Moore
      What started as a rather straightforward race condition reported by
      Dmitry using the syzkaller fuzzer ended up revealing some major
      problems with how the audit subsystem managed its netlink sockets and
      its connection with the userspace audit daemon.  Fixing this properly
      had quite the cascading effect and what we are left with is this rather
      large and complicated patch.  My initial goal was to try and decompose
      this patch into multiple smaller patches, but the way these changes
      are intertwined makes it difficult to split these changes into
      meaningful pieces that don't break or somehow make things worse for
      the intermediate states.
      
      The patch makes a number of changes, but the most significant are
      highlighted below:
      
      * The auditd tracking variables, e.g. audit_sock, are now gone and
      replaced by a RCU/spin_lock protected variable auditd_conn which is
      a structure containing all of the auditd tracking information.
      
      * We no longer track the auditd sock directly, instead we track it
      via the network namespace in which it resides and we use the audit
      socket associated with that namespace.  In spirit, this is what the
      code was trying to do prior to this patch (at least I think that is
      what the original authors intended), but it was done rather poorly
      and added a layer of obfuscation that only masked the underlying
      problems.
      
      * Big backlog queue cleanup, again.  In v4.10 we made some pretty big
      changes to how the audit backlog queues work, here we haven't changed
      the queue design so much as cleaned up the implementation.  Brought
      about by the locking changes, we've simplified kauditd_thread() quite
      a bit by consolidating the queue handling into a new helper function,
      kauditd_send_queue(), which allows us to eliminate a lot of very
      similar code and makes the looping logic in kauditd_thread() clearer.
      
      * All netlink messages sent to auditd are now sent via
      auditd_send_unicast_skb().  Other than just making sense, this makes
      the lock handling easier.
      
      * Change the audit_log_start() sleep behavior so that we never sleep
      on auditd events (unchanged) or if the caller is holding the
      audit_cmd_mutex (changed).  Previously we didn't sleep if the caller
      was auditd or if the message type fell between a certain range; the
      type check was a poor effort of doing what the cmd_mutex check now
      does.  Richard Guy Briggs originally proposed not sleeping the
      cmd_mutex owner several years ago but his patch wasn't acceptable
      at the time.  At least the idea lives on here.
      
      * A problem with the lost record counter has been resolved.  Steve
      Grubb and I both happened to notice this problem and according to
      some quick testing by Steve, this problem goes back quite some time.
      It's largely a harmless problem, although it may have left some
      careful sysadmins quite puzzled.
      
      Cc: <stable@vger.kernel.org> # 4.10.x-
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: Paul Moore <paul@paul-moore.com>
    • cpufreq: schedutil: Fix per-CPU structure initialization in sugov_start() · 4296f23e
      Committed by Rafael J. Wysocki
      sugov_start() only initializes struct sugov_cpu per-CPU structures
      for shared policies, but it should do that for single-CPU policies too.
      
      That in particular makes the IO-wait boost mechanism work in the
      cases when cpufreq policies correspond to individual CPUs.
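
      A simplified sketch of that initialization (hypothetical types, not the
      real schedutil structures):

       /* Simplified model (hypothetical types, not the real schedutil structs). */
       struct sugov_cpu_model { void *sg_policy; unsigned long iowait_boost; };

       #define NR_CPUS_MODEL 8
       static struct sugov_cpu_model sg_cpu[NR_CPUS_MODEL];

       static void sugov_start_model(void *sg_policy, const int *cpus, int nr_cpus)
       {
               int i;

               /* Run the per-CPU init for every CPU of the policy, shared or
                * not; previously single-CPU policies skipped it, so the
                * IO-wait boost never worked there. */
               for (i = 0; i < nr_cpus; i++) {
                       sg_cpu[cpus[i]].sg_policy = sg_policy;
                       sg_cpu[cpus[i]].iowait_boost = 0;
               }
       }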
      
      Fixes: 21ca6d2c (cpufreq: schedutil: Add iowait boosting)
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
      Cc: 4.9+ <stable@vger.kernel.org> # 4.9+