1. 11 6月, 2017 3 次提交
  2. 07 6月, 2017 7 次提交
  3. 05 6月, 2017 1 次提交
  4. 03 6月, 2017 2 次提交
  5. 01 6月, 2017 5 次提交
    • A
      bpf: use different interpreter depending on required stack size · b870aa90
      Alexei Starovoitov 提交于
      16 __bpf_prog_run() interpreters for various stack sizes add .text
      but not a lot comparing to run-time stack savings
      
         text	   data	    bss	    dec	    hex	filename
        26350   10328     624   37302    91b6 kernel/bpf/core.o.before_split
        25777   10328     624   36729    8f79 kernel/bpf/core.o.after_split
        26970	  10328	    624	  37922	   9422	kernel/bpf/core.o.now
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b870aa90
    • A
      bpf: reconcile bpf_tail_call and stack_depth · 80a58d02
      Alexei Starovoitov 提交于
      The next set of patches will take advantage of stack_depth tracking,
      so make sure that the program that does bpf_tail_call() has
      stack depth large enough for the callee.
      We could have tracked the stack depth of the prog_array owner program
      and only allow insertion of the programs with stack depth less
      than the owner, but it will break existing applications.
      Some of them have trivial root bpf program that only does
      multiple bpf_tail_calls and at init time the prog array is empty.
      In the future we may add a flag to do such tracking optionally,
      but for now play simple and safe.
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      80a58d02
    • A
      bpf: teach verifier to track stack depth · 8726679a
      Alexei Starovoitov 提交于
      teach verifier to track bpf program stack depth
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8726679a
    • A
      bpf: split bpf core interpreter · f696b8f4
      Alexei Starovoitov 提交于
      split __bpf_prog_run() interpreter into stack allocation and execution parts.
      The code section shrinks which helps interpreter performance in some cases.
         text	   data	    bss	    dec	    hex	filename
        26350	  10328	    624	  37302	   91b6	kernel/bpf/core.o.before
        25777	  10328	    624	  36729	   8f79	kernel/bpf/core.o.after
      
      Very short programs got slower (due to extra function call):
      Before:
      test_bpf: #89 ALU64_ADD_K: 1 + 2 = 3 jited:0 7 PASS
      test_bpf: #90 ALU64_ADD_K: 3 + 0 = 3 jited:0 8 PASS
      test_bpf: #91 ALU64_ADD_K: 1 + 2147483646 = 2147483647 jited:0 7 PASS
      test_bpf: #92 ALU64_ADD_K: 4294967294 + 2 = 4294967296 jited:0 11 PASS
      test_bpf: #93 ALU64_ADD_K: 2147483646 + -2147483647 = -1 jited:0 7 PASS
      After:
      test_bpf: #89 ALU64_ADD_K: 1 + 2 = 3 jited:0 11 PASS
      test_bpf: #90 ALU64_ADD_K: 3 + 0 = 3 jited:0 11 PASS
      test_bpf: #91 ALU64_ADD_K: 1 + 2147483646 = 2147483647 jited:0 11 PASS
      test_bpf: #92 ALU64_ADD_K: 4294967294 + 2 = 4294967296 jited:0 14 PASS
      test_bpf: #93 ALU64_ADD_K: 2147483646 + -2147483647 = -1 jited:0 10 PASS
      
      Longer programs got faster:
      Before:
      test_bpf: #266 BPF_MAXINSNS: Ctx heavy transformations jited:0 20286 20513 PASS
      test_bpf: #267 BPF_MAXINSNS: Call heavy transformations jited:0 31853 31768 PASS
      test_bpf: #268 BPF_MAXINSNS: Jump heavy test jited:0 9815 PASS
      test_bpf: #269 BPF_MAXINSNS: Very long jump backwards jited:0 6 PASS
      test_bpf: #270 BPF_MAXINSNS: Edge hopping nuthouse jited:0 13959 PASS
      test_bpf: #271 BPF_MAXINSNS: Jump, gap, jump, ... jited:0 210 PASS
      test_bpf: #272 BPF_MAXINSNS: ld_abs+get_processor_id jited:0 21724 PASS
      test_bpf: #273 BPF_MAXINSNS: ld_abs+vlan_push/pop jited:0 19118 PASS
      After:
      test_bpf: #266 BPF_MAXINSNS: Ctx heavy transformations jited:0 19008 18827 PASS
      test_bpf: #267 BPF_MAXINSNS: Call heavy transformations jited:0 29238 28450 PASS
      test_bpf: #268 BPF_MAXINSNS: Jump heavy test jited:0 9485 PASS
      test_bpf: #269 BPF_MAXINSNS: Very long jump backwards jited:0 12 PASS
      test_bpf: #270 BPF_MAXINSNS: Edge hopping nuthouse jited:0 13257 PASS
      test_bpf: #271 BPF_MAXINSNS: Jump, gap, jump, ... jited:0 213 PASS
      test_bpf: #272 BPF_MAXINSNS: ld_abs+get_processor_id jited:0 19389 PASS
      test_bpf: #273 BPF_MAXINSNS: ld_abs+vlan_push/pop jited:0 19583 PASS
      
      For real world production programs the difference is noise.
      
      This patch is first step towards reducing interpreter stack consumption.
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f696b8f4
    • A
      bpf: free up BPF_JMP | BPF_CALL | BPF_X opcode · 71189fa9
      Alexei Starovoitov 提交于
      free up BPF_JMP | BPF_CALL | BPF_X opcode to be used by actual
      indirect call by register and use kernel internal opcode to
      mark call instruction into bpf_tail_call() helper.
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      71189fa9
  6. 27 5月, 2017 3 次提交
  7. 26 5月, 2017 3 次提交
    • D
      bpf: fix wrong exposure of map_flags into fdinfo for lpm · a316338c
      Daniel Borkmann 提交于
      trie_alloc() always needs to have BPF_F_NO_PREALLOC passed in via
      attr->map_flags, since it does not support preallocation yet. We
      check the flag, but we never copy the flag into trie->map.map_flags,
      which is later on exposed into fdinfo and used by loaders such as
      iproute2. Latter uses this in bpf_map_selfcheck_pinned() to test
      whether a pinned map has the same spec as the one from the BPF obj
      file and if not, bails out, which is currently the case for lpm
      since it exposes always 0 as flags.
      
      Also copy over flags in array_map_alloc() and stack_map_alloc().
      They always have to be 0 right now, but we should make sure to not
      miss to copy them over at a later point in time when we add actual
      flags for them to use.
      
      Fixes: b95a5c4d ("bpf: add a longest prefix match trie map implementation")
      Reported-by: NJarno Rajahalme <jarno@covalent.io>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a316338c
    • D
      bpf: properly reset caller saved regs after helper call and ld_abs/ind · a9789ef9
      Daniel Borkmann 提交于
      Currently, after performing helper calls, we clear all caller saved
      registers, that is r0 - r5 and fill r0 depending on struct bpf_func_proto
      specification. The way we reset these regs can affect pruning decisions
      in later paths, since we only reset register's imm to 0 and type to
      NOT_INIT. However, we leave out clearing of other variables such as id,
      min_value, max_value, etc, which can later on lead to pruning mismatches
      due to stale data.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a9789ef9
    • D
      bpf: fix incorrect pruning decision when alignment must be tracked · 1ad2f583
      Daniel Borkmann 提交于
      Currently, when we enforce alignment tracking on direct packet access,
      the verifier lets the following program pass despite doing a packet
      write with unaligned access:
      
        0: (61) r2 = *(u32 *)(r1 +76)
        1: (61) r3 = *(u32 *)(r1 +80)
        2: (61) r7 = *(u32 *)(r1 +8)
        3: (bf) r0 = r2
        4: (07) r0 += 14
        5: (25) if r7 > 0x1 goto pc+4
         R0=pkt(id=0,off=14,r=0) R1=ctx R2=pkt(id=0,off=0,r=0)
         R3=pkt_end R7=inv,min_value=0,max_value=1 R10=fp
        6: (2d) if r0 > r3 goto pc+1
         R0=pkt(id=0,off=14,r=14) R1=ctx R2=pkt(id=0,off=0,r=14)
         R3=pkt_end R7=inv,min_value=0,max_value=1 R10=fp
        7: (63) *(u32 *)(r0 -4) = r0
        8: (b7) r0 = 0
        9: (95) exit
      
        from 6 to 8:
         R0=pkt(id=0,off=14,r=0) R1=ctx R2=pkt(id=0,off=0,r=0)
         R3=pkt_end R7=inv,min_value=0,max_value=1 R10=fp
        8: (b7) r0 = 0
        9: (95) exit
      
        from 5 to 10:
         R0=pkt(id=0,off=14,r=0) R1=ctx R2=pkt(id=0,off=0,r=0)
         R3=pkt_end R7=inv,min_value=2 R10=fp
        10: (07) r0 += 1
        11: (05) goto pc-6
        6: safe                           <----- here, wrongly found safe
        processed 15 insns
      
      However, if we enforce a pruning mismatch by adding state into r8
      which is then being mismatched in states_equal(), we find that for
      the otherwise same program, the verifier detects a misaligned packet
      access when actually walking that path:
      
        0: (61) r2 = *(u32 *)(r1 +76)
        1: (61) r3 = *(u32 *)(r1 +80)
        2: (61) r7 = *(u32 *)(r1 +8)
        3: (b7) r8 = 1
        4: (bf) r0 = r2
        5: (07) r0 += 14
        6: (25) if r7 > 0x1 goto pc+4
         R0=pkt(id=0,off=14,r=0) R1=ctx R2=pkt(id=0,off=0,r=0)
         R3=pkt_end R7=inv,min_value=0,max_value=1
         R8=imm1,min_value=1,max_value=1,min_align=1 R10=fp
        7: (2d) if r0 > r3 goto pc+1
         R0=pkt(id=0,off=14,r=14) R1=ctx R2=pkt(id=0,off=0,r=14)
         R3=pkt_end R7=inv,min_value=0,max_value=1
         R8=imm1,min_value=1,max_value=1,min_align=1 R10=fp
        8: (63) *(u32 *)(r0 -4) = r0
        9: (b7) r0 = 0
        10: (95) exit
      
        from 7 to 9:
         R0=pkt(id=0,off=14,r=0) R1=ctx R2=pkt(id=0,off=0,r=0)
         R3=pkt_end R7=inv,min_value=0,max_value=1
         R8=imm1,min_value=1,max_value=1,min_align=1 R10=fp
        9: (b7) r0 = 0
        10: (95) exit
      
        from 6 to 11:
         R0=pkt(id=0,off=14,r=0) R1=ctx R2=pkt(id=0,off=0,r=0)
         R3=pkt_end R7=inv,min_value=2
         R8=imm1,min_value=1,max_value=1,min_align=1 R10=fp
        11: (07) r0 += 1
        12: (b7) r8 = 0
        13: (05) goto pc-7                <----- mismatch due to r8
        7: (2d) if r0 > r3 goto pc+1
         R0=pkt(id=0,off=15,r=15) R1=ctx R2=pkt(id=0,off=0,r=15)
         R3=pkt_end R7=inv,min_value=2
         R8=imm0,min_value=0,max_value=0,min_align=2147483648 R10=fp
        8: (63) *(u32 *)(r0 -4) = r0
        misaligned packet access off 2+15+-4 size 4
      
      The reason why we fail to see it in states_equal() is that the
      third test in compare_ptrs_to_packet() ...
      
        if (old->off <= cur->off &&
            old->off >= old->range && cur->off >= cur->range)
                return true;
      
      ... will let the above pass. The situation we run into is that
      old->off <= cur->off (14 <= 15), meaning that prior walked paths
      went with smaller offset, which was later used in the packet
      access after successful packet range check and found to be safe
      already.
      
      For example: Given is R0=pkt(id=0,off=0,r=0). Adding offset 14
      as in above program to it, results in R0=pkt(id=0,off=14,r=0)
      before the packet range test. Now, testing this against R3=pkt_end
      with 'if r0 > r3 goto out' will transform R0 into R0=pkt(id=0,off=14,r=14)
      for the case when we're within bounds. A write into the packet
      at offset *(u32 *)(r0 -4), that is, 2 + 14 -4, is valid and
      aligned (2 is for NET_IP_ALIGN). After processing this with
      all fall-through paths, we later on check paths from branches.
      When the above skb->mark test is true, then we jump near the
      end of the program, perform r0 += 1, and jump back to the
      'if r0 > r3 goto out' test we've visited earlier already. This
      time, R0 is of type R0=pkt(id=0,off=15,r=0), and we'll prune
      that part because this time we'll have a larger safe packet
      range, and we already found that with off=14 all further insn
      were already safe, so it's safe as well with a larger off.
      However, the problem is that the subsequent write into the packet
      with 2 + 15 -4 is then unaligned, and not caught by the alignment
      tracking. Note that min_align, aux_off, and aux_off_align were
      all 0 in this example.
      
      Since we cannot tell at this time what kind of packet access was
      performed in the prior walk and what minimal requirements it has
      (we might do so in the future, but that requires more complexity),
      fix it to disable this pruning case for strict alignment for now,
      and let the verifier do check such paths instead. With that applied,
      the test cases pass and reject the program due to misalignment.
      
      Fixes: d1174416 ("bpf: Track alignment of register values in the verifier.")
      Reference: http://patchwork.ozlabs.org/patch/761909/Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1ad2f583
  8. 25 5月, 2017 1 次提交
    • T
      cpuset: consider dying css as offline · 41c25707
      Tejun Heo 提交于
      In most cases, a cgroup controller don't care about the liftimes of
      cgroups.  For the controller, a css becomes online when ->css_online()
      is called on it and offline when ->css_offline() is called.
      
      However, cpuset is special in that the user interface it exposes cares
      whether certain cgroups exist or not.  Combined with the RCU delay
      between cgroup removal and css offlining, this can lead to user
      visible behavior oddities where operations which should succeed after
      cgroup removals fail for some time period.  The effects of cgroup
      removals are delayed when seen from userland.
      
      This patch adds css_is_dying() which tests whether offline is pending
      and updates is_cpuset_online() so that the function returns false also
      while offline is pending.  This gets rid of the userland visible
      delays.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reported-by: NDaniel Jordan <daniel.m.jordan@oracle.com>
      Link: http://lkml.kernel.org/r/327ca1f5-7957-fbb9-9e5f-9ba149d40ba2@oracle.com
      Cc: stable@vger.kernel.org
      Signed-off-by: NTejun Heo <tj@kernel.org>
      41c25707
  9. 24 5月, 2017 1 次提交
  10. 23 5月, 2017 4 次提交
    • E
      ptrace: Properly initialize ptracer_cred on fork · c70d9d80
      Eric W. Biederman 提交于
      When I introduced ptracer_cred I failed to consider the weirdness of
      fork where the task_struct copies the old value by default.  This
      winds up leaving ptracer_cred set even when a process forks and
      the child process does not wind up being ptraced.
      
      Because ptracer_cred is not set on non-ptraced processes whose
      parents were ptraced this has broken the ability of the enlightenment
      window manager to start setuid children.
      
      Fix this by properly initializing ptracer_cred in ptrace_init_task
      
      This must be done with a little bit of care to preserve the current value
      of ptracer_cred when ptrace carries through fork.  Re-reading the
      ptracer_cred from the ptracing process at this point is inconsistent
      with how PT_PTRACE_CAP has been maintained all of these years.
      Tested-by: NTakashi Iwai <tiwai@suse.de>
      Fixes: 64b875f7 ("ptrace: Capture the ptracer's creds not PT_PTRACE_CAP")
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      c70d9d80
    • V
      kthread: Fix use-after-free if kthread fork fails · 4d6501dc
      Vegard Nossum 提交于
      If a kthread forks (e.g. usermodehelper since commit 1da5c46f) but
      fails in copy_process() between calling dup_task_struct() and setting
      p->set_child_tid, then the value of p->set_child_tid will be inherited
      from the parent and get prematurely freed by free_kthread_struct().
      
          kthread()
           - worker_thread()
              - process_one_work()
              |  - call_usermodehelper_exec_work()
              |     - kernel_thread()
              |        - _do_fork()
              |           - copy_process()
              |              - dup_task_struct()
              |                 - arch_dup_task_struct()
              |                    - tsk->set_child_tid = current->set_child_tid // implied
              |              - ...
              |              - goto bad_fork_*
              |              - ...
              |              - free_task(tsk)
              |                 - free_kthread_struct(tsk)
              |                    - kfree(tsk->set_child_tid)
              - ...
              - schedule()
                 - __schedule()
                    - wq_worker_sleeping()
                       - kthread_data(task)->flags // UAF
      
      The problem started showing up with commit 1da5c46f since it reused
      ->set_child_tid for the kthread worker data.
      
      A better long-term solution might be to get rid of the ->set_child_tid
      abuse. The comment in set_kthread_struct() also looks slightly wrong.
      Debugged-by: NJamie Iles <jamie.iles@oracle.com>
      Fixes: 1da5c46f ("kthread: Make struct kthread kmalloc'ed")
      Signed-off-by: NVegard Nossum <vegard.nossum@oracle.com>
      Acked-by: NOleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Jamie Iles <jamie.iles@oracle.com>
      Cc: stable@vger.kernel.org
      Link: http://lkml.kernel.org/r/20170509073959.17858-1-vegard.nossum@oracle.comSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      4d6501dc
    • P
      futex,rt_mutex: Fix rt_mutex_cleanup_proxy_lock() · 04dc1b2f
      Peter Zijlstra 提交于
      Markus reported that the glibc/nptl/tst-robustpi8 test was failing after
      commit:
      
        cfafcd11 ("futex: Rework futex_lock_pi() to use rt_mutex_*_proxy_lock()")
      
      The following trace shows the problem:
      
       ld-linux-x86-64-2161  [019] ....   410.760971: SyS_futex: 00007ffbeb76b028: 80000875  op=FUTEX_LOCK_PI
       ld-linux-x86-64-2161  [019] ...1   410.760972: lock_pi_update_atomic: 00007ffbeb76b028: curval=80000875 uval=80000875 newval=80000875 ret=0
       ld-linux-x86-64-2165  [011] ....   410.760978: SyS_futex: 00007ffbeb76b028: 80000875  op=FUTEX_UNLOCK_PI
       ld-linux-x86-64-2165  [011] d..1   410.760979: do_futex: 00007ffbeb76b028: curval=80000875 uval=80000875 newval=80000871 ret=0
       ld-linux-x86-64-2165  [011] ....   410.760980: SyS_futex: 00007ffbeb76b028: 80000871 ret=0000
       ld-linux-x86-64-2161  [019] ....   410.760980: SyS_futex: 00007ffbeb76b028: 80000871 ret=ETIMEDOUT
      
      Task 2165 does an UNLOCK_PI, assigning the lock to the waiter task 2161
      which then returns with -ETIMEDOUT. That wrecks the lock state, because now
      the owner isn't aware it acquired the lock and removes the pending robust
      list entry.
      
      If 2161 is killed, the robust list will not clear out this futex and the
      subsequent acquire on this futex will then (correctly) result in -ESRCH
      which is unexpected by glibc, triggers an internal assertion and dies.
      
      Task 2161			Task 2165
      
      rt_mutex_wait_proxy_lock()
         timeout();
         /* T2161 is still queued in  the waiter list */
         return -ETIMEDOUT;
      
      				futex_unlock_pi()
      				spin_lock(hb->lock);
      				rtmutex_unlock()
      				  remove_rtmutex_waiter(T2161);
      				   mark_lock_available();
      				/* Make the next waiter owner of the user space side */
      				futex_uval = 2161;
      				spin_unlock(hb->lock);
      spin_lock(hb->lock);
      rt_mutex_cleanup_proxy_lock()
        if (rtmutex_owner() !== current)
           ...
           return FAIL;
      ....
      return -ETIMEOUT;
      
      This means that rt_mutex_cleanup_proxy_lock() needs to call
      try_to_take_rt_mutex() so it can take over the rtmutex correctly which was
      assigned by the waker. If the rtmutex is owned by some other task then this
      call is harmless and just confirmes that the waiter is not able to acquire
      it.
      
      While there, fix what looks like a merge error which resulted in
      rt_mutex_cleanup_proxy_lock() having two calls to
      fixup_rt_mutex_waiters() and rt_mutex_wait_proxy_lock() not having any.
      Both should have one, since both potentially touch the waiter list.
      
      Fixes: 38d589f2 ("futex,rt_mutex: Restructure rt_mutex_finish_proxy_lock()")
      Reported-by: NMarkus Trippelsdorf <markus@trippelsdorf.de>
      Bug-Spotted-by: NThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Florian Weimer <fweimer@redhat.com>
      Cc: Darren Hart <dvhart@infradead.org>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Markus Trippelsdorf <markus@trippelsdorf.de>
      Link: http://lkml.kernel.org/r/20170519154850.mlomgdsd26drq5j6@hirez.programming.kicks-ass.netSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      04dc1b2f
    • D
      net: Make IP alignment calulations clearer. · e4eda884
      David S. Miller 提交于
      The assignmnet:
      
      	ip_align = strict ? 2 : NET_IP_ALIGN;
      
      in compare_pkt_ptr_alignment() trips up Coverity because we can only
      get to this code when strict is true, therefore ip_align will always
      be 2 regardless of NET_IP_ALIGN's value.
      
      So just assign directly to '2' and explain the situation in the
      comment above.
      Reported-by: N"Gustavo A. R. Silva" <garsilva@embeddedor.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e4eda884
  11. 19 5月, 2017 2 次提交
  12. 18 5月, 2017 8 次提交
    • D
      bpf: adjust verifier heuristics · 3c2ce60b
      Daniel Borkmann 提交于
      Current limits with regards to processing program paths do not
      really reflect today's needs anymore due to programs becoming
      more complex and verifier smarter, keeping track of more data
      such as const ALU operations, alignment tracking, spilling of
      PTR_TO_MAP_VALUE_ADJ registers, and other features allowing for
      smarter matching of what LLVM generates.
      
      This also comes with the side-effect that we result in fewer
      opportunities to prune search states and thus often need to do
      more work to prove safety than in the past due to different
      register states and stack layout where we mismatch. Generally,
      it's quite hard to determine what caused a sudden increase in
      complexity, it could be caused by something as trivial as a
      single branch somewhere at the beginning of the program where
      LLVM assigned a stack slot that is marked differently throughout
      other branches and thus causing a mismatch, where verifier
      then needs to prove safety for the whole rest of the program.
      Subsequently, programs with even less than half the insn size
      limit can get rejected. We noticed that while some programs
      load fine under pre 4.11, they get rejected due to hitting
      limits on more recent kernels. We saw that in the vast majority
      of cases (90+%) pruning failed due to register mismatches. In
      case of stack mismatches, majority of cases failed due to
      different stack slot types (invalid, spill, misc) rather than
      differences in spilled registers.
      
      This patch makes pruning more aggressive by also adding markers
      that sit at conditional jumps as well. Currently, we only mark
      jump targets for pruning. For example in direct packet access,
      these are usually error paths where we bail out. We found that
      adding these markers, it can reduce number of processed insns
      by up to 30%. Another option is to ignore reg->id in probing
      PTR_TO_MAP_VALUE_OR_NULL registers, which can help pruning
      slightly as well by up to 7% observed complexity reduction as
      stand-alone. Meaning, if a previous path with register type
      PTR_TO_MAP_VALUE_OR_NULL for map X was found to be safe, then
      in the current state a PTR_TO_MAP_VALUE_OR_NULL register for
      the same map X must be safe as well. Last but not least the
      patch also adds a scheduling point and bumps the current limit
      for instructions to be processed to a more adequate value.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3c2ce60b
    • S
      kprobes: Document how optimized kprobes are removed from module unload · 545a0281
      Steven Rostedt (VMware) 提交于
      Thomas discovered a bug where the kprobe trace tests had a race
      condition where the kprobe_optimizer called from a delayed work queue
      that does the optimizing and "unoptimizing" of a kprobe, can try to
      modify the text after it has been freed by the init code.
      
      The kprobe trace selftest is a special case, and Thomas and myself
      investigated to see if there's a chance that this could also be a bug
      with module unloading, as the code is not obvious to how it handles
      this. After adding lots of printks, I figured it out. Thomas suggested
      that this should be commented so that others will not have to go
      through this exercise again.
      
      Link: http://lkml.kernel.org/r/20170516145835.3827d3aa@gandalf.local.homeAcked-by: NMasami Hiramatsu <mhiramat@kernel.org>
      Suggested-by: NThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
      545a0281
    • S
      ftrace: Remove #ifdef from code and add clear_ftrace_function_probes() stub · 8a49f3e0
      Steven Rostedt (VMware) 提交于
      No need to add ugly #ifdefs in the code. Having a standard stub file is much
      prettier.
      Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
      8a49f3e0
    • N
      ftrace/instances: Clear function triggers when removing instances · a0e6369e
      Naveen N. Rao 提交于
      If instance directories are deleted while there are registered function
      triggers:
      
        # cd /sys/kernel/debug/tracing/instances
        # mkdir test
        # echo "schedule:enable_event:sched:sched_switch" > test/set_ftrace_filter
        # rmdir test
        Unable to handle kernel paging request for data at address 0x00000008
        Unable to handle kernel paging request for data at address 0x00000008
        Faulting instruction address: 0xc0000000021edde8
        Oops: Kernel access of bad area, sig: 11 [#1]
        SMP NR_CPUS=2048
        NUMA
        pSeries
        Modules linked in: iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp tun bridge stp llc kvm iptable_filter fuse binfmt_misc pseries_rng rng_core vmx_crypto ib_iser rdma_cm iw_cm ib_cm ib_core libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c multipath virtio_net virtio_blk virtio_pci crc32c_vpmsum virtio_ring virtio
        CPU: 8 PID: 8694 Comm: rmdir Not tainted 4.11.0-nnr+ #113
        task: c0000000bab52800 task.stack: c0000000baba0000
        NIP: c0000000021edde8 LR: c0000000021f0590 CTR: c000000002119620
        REGS: c0000000baba3870 TRAP: 0300   Not tainted  (4.11.0-nnr+)
        MSR: 8000000000009033 <SF,EE,ME,IR,DR,RI,LE>
          CR: 22002422  XER: 20000000
        CFAR: 00007fffabb725a8 DAR: 0000000000000008 DSISR: 40000000 SOFTE: 0
        GPR00: c00000000220f750 c0000000baba3af0 c000000003157e00 0000000000000000
        GPR04: 0000000000000040 00000000000000eb 0000000000000040 0000000000000000
        GPR08: 0000000000000000 0000000000000113 0000000000000000 c00000000305db98
        GPR12: c000000002119620 c00000000fd42c00 0000000000000000 0000000000000000
        GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
        GPR20: 0000000000000000 0000000000000000 c0000000bab52e90 0000000000000000
        GPR24: 0000000000000000 00000000000000eb 0000000000000040 c0000000baba3bb0
        GPR28: c00000009cb06eb0 c0000000bab52800 c00000009cb06eb0 c0000000baba3bb0
        NIP [c0000000021edde8] ring_buffer_lock_reserve+0x8/0x4e0
        LR [c0000000021f0590] trace_event_buffer_lock_reserve+0xe0/0x1a0
        Call Trace:
        [c0000000baba3af0] [c0000000021f96c8] trace_event_buffer_commit+0x1b8/0x280 (unreliable)
        [c0000000baba3b60] [c00000000220f750] trace_event_buffer_reserve+0x80/0xd0
        [c0000000baba3b90] [c0000000021196b8] trace_event_raw_event_sched_switch+0x98/0x180
        [c0000000baba3c10] [c0000000029d9980] __schedule+0x6e0/0xab0
        [c0000000baba3ce0] [c000000002122230] do_task_dead+0x70/0xc0
        [c0000000baba3d10] [c0000000020ea9c8] do_exit+0x828/0xd00
        [c0000000baba3dd0] [c0000000020eaf70] do_group_exit+0x60/0x100
        [c0000000baba3e10] [c0000000020eb034] SyS_exit_group+0x24/0x30
        [c0000000baba3e30] [c00000000200bcec] system_call+0x38/0x54
        Instruction dump:
        60000000 60420000 7d244b78 7f63db78 4bffaa09 393efff8 793e0020 39200000
        4bfffecc 60420000 3c4c00f7 3842a020 <81230008> 2f890000 409e02f0 a14d0008
        ---[ end trace b917b8985d0e650b ]---
        Unable to handle kernel paging request for data at address 0x00000008
        Faulting instruction address: 0xc0000000021edde8
        Unable to handle kernel paging request for data at address 0x00000008
        Faulting instruction address: 0xc0000000021edde8
        Faulting instruction address: 0xc0000000021edde8
      
      To address this, let's clear all registered function probes before
      deleting the ftrace instance.
      
      Link: http://lkml.kernel.org/r/c5f1ca624043690bd94642bb6bffd3f2fc504035.1494956770.git.naveen.n.rao@linux.vnet.ibm.comReported-by: NMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: NNaveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
      Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
      a0e6369e
    • N
    • T
      tracing/kprobes: Enforce kprobes teardown after testing · 30e7d894
      Thomas Gleixner 提交于
      Enabling the tracer selftest triggers occasionally the warning in
      text_poke(), which warns when the to be modified page is not marked
      reserved.
      
      The reason is that the tracer selftest installs kprobes on functions marked
      __init for testing. These probes are removed after the tests, but that
      removal schedules the delayed kprobes_optimizer work, which will do the
      actual text poke. If the work is executed after the init text is freed,
      then the warning triggers. The bug can be reproduced reliably when the work
      delay is increased.
      
      Flush the optimizer work and wait for the optimizing/unoptimizing lists to
      become empty before returning from the kprobes tracer selftest. That
      ensures that all operations which were queued due to the probes removal
      have completed.
      
      Link: http://lkml.kernel.org/r/20170516094802.76a468bb@gandalf.local.homeSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      Acked-by: NMasami Hiramatsu <mhiramat@kernel.org>
      Cc: stable@vger.kernel.org
      Fixes: 6274de49 ("kprobes: Support delayed unoptimizing")
      Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
      30e7d894
    • S
      tracing: Move postpone selftests to core from early_initcall · b9ef0326
      Steven Rostedt 提交于
      I hit the following lockdep splat when booting with ftrace selftests
      enabled, as well as CONFIG_PREEMPT and LOCKDEP.
      
       Testing dynamic ftrace ops #1:
       (1 0 1 0 0)
       (1 1 2 0 0)
       (2 1 3 0 169)
       (2 2 4 0 50066)
       ------------[ cut here ]------------
       WARNING: CPU: 0 PID: 13 at kernel/rcu/srcutree.c:202 check_init_srcu_struct+0x60/0x70
       Modules linked in:
       CPU: 0 PID: 13 Comm: rcu_tasks_kthre Not tainted 4.12.0-rc1-test+ #587
       Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v02.05 05/07/2012
       task: ffff880119628040 task.stack: ffffc900006a4000
       RIP: 0010:check_init_srcu_struct+0x60/0x70
       RSP: 0000:ffffc900006a7d98 EFLAGS: 00010246
       RAX: 0000000000000246 RBX: 0000000000000000 RCX: 0000000000000000
       RDX: ffff880119628040 RSI: 00000000ffffffff RDI: ffffffff81e5fb40
       RBP: ffffc900006a7e20 R08: 00000023b403c000 R09: 0000000000000001
       R10: ffffc900006a7e40 R11: 0000000000000000 R12: ffffffff81e5fb40
       R13: 0000000000000286 R14: ffff880119628040 R15: ffffc900006a7e98
       FS:  0000000000000000(0000) GS:ffff88011ea00000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: ffff88011edff000 CR3: 0000000001e0f000 CR4: 00000000001406f0
       Call Trace:
        ? __synchronize_srcu+0x6e/0x140
        ? lock_acquire+0xdc/0x1d0
        ? ktime_get_mono_fast_ns+0x5d/0xb0
        synchronize_srcu+0x6f/0x110
        ? synchronize_srcu+0x6f/0x110
        rcu_tasks_kthread+0x20a/0x540
        kthread+0x114/0x150
        ? __rcu_read_unlock+0x70/0x70
        ? kthread_create_on_node+0x40/0x40
        ret_from_fork+0x2e/0x40
       Code: f6 83 70 06 00 00 03 49 89 c5 74 0d be 01 00 00 00 48 89 df e8 42 fa ff ff 4c 89 ee 4c 89 e7 e8 b7 42 75 00 5b 41 5c 41 5d 5d c3 <0f> ff eb aa 66 90 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00
       ---[ end trace 5c3f4206ce50f6ac ]---
      
      What happens is that the selftests include a creating of a dynamically
      allocated ftrace_ops, which requires the use of synchronize_rcu_tasks()
      which uses srcu, and triggers the above warning.
      
      It appears that synchronize_rcu_tasks() is not set up at early_initcall(),
      but it is at core_initcall(). By moving the tests down to that location
      works out properly.
      
      Link: http://lkml.kernel.org/r/20170517111435.7388c033@gandalf.local.homeAcked-by: N"Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
      b9ef0326
    • W
      cgroup: Prevent kill_css() from being called more than once · 33c35aa4
      Waiman Long 提交于
      The kill_css() function may be called more than once under the condition
      that the css was killed but not physically removed yet followed by the
      removal of the cgroup that is hosting the css. This patch prevents any
      harmm from being done when that happens.
      Signed-off-by: NWaiman Long <longman@redhat.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: stable@vger.kernel.org # v4.5+
      33c35aa4