1. 17 December 2016, 1 commit
    • bpf: fix regression on verifier pruning wrt map lookups · a08dd0da
      Committed by Daniel Borkmann
      Commit 57a09bf0 ("bpf: Detect identical PTR_TO_MAP_VALUE_OR_NULL
      registers") introduced a regression where existing programs stopped
      loading due to reaching the verifier's maximum complexity limit,
      whereas prior to this commit they were loading just fine; the affected
      program has roughly 2k instructions.
      
      What was found is that state pruning couldn't be performed effectively
      anymore due to mismatches of the verifier's register state, in particular
      in the id tracking. It doesn't mean that 57a09bf0 is incorrect per
      se, but rather that the verifier needs to perform a lot more work
      for the same program with regard to the involved map lookups.
      
      Since commit 57a09bf0 is only about tracking registers with type
      PTR_TO_MAP_VALUE_OR_NULL, the id is only needed to follow registers
      until they are promoted through pattern matching with a NULL check to
      either PTR_TO_MAP_VALUE or UNKNOWN_VALUE type. After that point, the
      id becomes irrelevant for the transitioned types.
      
      For UNKNOWN_VALUE, id is already reset to 0 via mark_reg_unknown_value(),
      but not so for PTR_TO_MAP_VALUE where id is becoming stale. It's even
      transferred further into other types that don't make use of it. Among
      others, one example is where UNKNOWN_VALUE is set on function call
      return with RET_INTEGER return type.
      
      states_equal() will then fall through the memcmp() on register state;
      note that the second memcmp() uses offsetofend(), so the id is part of
      that since d2a4dd37 ("bpf: fix state equivalence"). But the bisect
      already pointed to 57a09bf0, where we really go beyond the
      complexity limit. What I found was that states_equal() often failed
      in this case due to id mismatches in spilled regs with registers of
      type PTR_TO_MAP_VALUE. Unlike non-spilled regs, spilled regs just
      perform a memcmp() on their reg state and don't have any other
      optimizations in place, therefore the id was also relevant in this
      case for making a pruning decision.
      
      We can safely reset id to 0 as well when converting to PTR_TO_MAP_VALUE.
      For the affected program, it resulted in a ~17 fold reduction of
      complexity and let the program load fine again. Selftest suite also
      runs fine. The only other place where env->id_gen is used currently is
      through direct packet access, but for these cases id is long living, thus
      a different scenario.
      
      Also, the current logic in mark_map_regs() is not fully correct when
      marking NULL branch with UNKNOWN_VALUE. We need to cache the destination
      reg's id in any case. Otherwise, once we have marked that reg as
      UNKNOWN_VALUE, its id is reset and any subsequent registers that
      hold the original id and are of type PTR_TO_MAP_VALUE_OR_NULL won't
      be marked UNKNOWN_VALUE anymore, since mark_map_reg() reuses the
      uncached regs[regno].id that
      was just overridden. Note, we don't need to cache it outside of
      mark_map_regs(), since it's called once on this_branch and the other
      time on other_branch, which are both two independent verifier states.
      A test case for this is added here, too.
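      As a rough sketch, the two changes above amount to something like the
      following (structure and helper names follow the 4.9-era verifier and
      are simplified here, not the verbatim patch):

      /* Promote or invalidate registers matching the lookup's id, then
       * drop the id so a stale value cannot spoil pruning through the
       * memcmp() done on spilled regs.
       */
      static void mark_map_reg(struct bpf_reg_state *regs, u32 regno,
                               u32 id, bool null_branch)
      {
              struct bpf_reg_state *reg = &regs[regno];

              if (reg->type == PTR_TO_MAP_VALUE_OR_NULL && reg->id == id) {
                      if (null_branch)
                              mark_reg_unknown_value(regs, regno);
                      else
                              reg->type = PTR_TO_MAP_VALUE;
                      reg->id = 0;
              }
      }

      static void mark_map_regs(struct bpf_verifier_state *state, u32 regno,
                                bool null_branch)
      {
              struct bpf_reg_state *regs = state->regs;
              /* Cache the id first: marking regs[regno] resets it, but the
               * remaining registers still need to be matched against it.
               */
              u32 id = regs[regno].id;
              int i;

              for (i = 0; i < MAX_BPF_REG; i++)
                      mark_map_reg(regs, i, id, null_branch);
      }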
      
      Fixes: 57a09bf0 ("bpf: Detect identical PTR_TO_MAP_VALUE_OR_NULL registers")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Thomas Graf <tgraf@suug.ch>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 09 December 2016, 2 commits
  3. 08 December 2016, 3 commits
  4. 06 December 2016, 4 commits
    • lockdep: Fix report formatting · f943fe0f
      Committed by Dmitry Vyukov
      Since commit:
      
        4bcc595c ("printk: reinstate KERN_CONT for printing continuation lines")
      
      printk() requires KERN_CONT to continue log messages. Lots of printk()
      calls in lockdep.c and print_ip_sym() don't have it. As a result,
      lockdep reports are completely messed up.
      
      Add missing KERN_CONT and inline print_ip_sym() where necessary.
      
      Example of a messed up report:
      
        0-rc5+ #41 Not tainted
        -------------------------------------------------------
        syz-executor0/5036 is trying to acquire lock:
         (
        rtnl_mutex
        ){+.+.+.}
        , at:
        [<ffffffff86b3d6ac>] rtnl_lock+0x1c/0x20
        but task is already holding lock:
         (
        &net->packet.sklist_lock
        ){+.+...}
        , at:
        [<ffffffff873541a6>] packet_diag_dump+0x1a6/0x1920
        which lock already depends on the new lock.
        the existing dependency chain (in reverse order) is:
        -> #3
         (
        &net->packet.sklist_lock
        +.+...}
        ...
      
      Without this patch all scripts that parse kernel bug reports are broken.
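      For illustration, the pattern the patch applies looks roughly like
      this (a sketch, not the exact lockdep hunks):

      /* Since 4bcc595c, every printk() without KERN_CONT starts a new
       * line, so continuation fragments must say so explicitly and
       * print_ip_sym() is open-coded where the address has to stay on
       * the same line.
       */
      printk(" (");
      printk(KERN_CONT "%s", lock_name);
      printk(KERN_CONT "){%s}, at: [<%p>] %pS\n", usage, (void *)ip, (void *)ip);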
      Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: andreyknvl@google.com
      Cc: aryabinin@virtuozzo.com
      Cc: joe@perches.com
      Cc: syzkaller@googlegroups.com
      Link: http://lkml.kernel.org/r/1480343083-48731-1-git-send-email-dvyukov@google.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf/core: Remove invalid warning from list_update_cgroup_event() · 8fc31ce8
      Committed by David Carrillo-Cisneros
      The warning introduced in commit:
      
        864c2357 ("perf/core: Do not set cpuctx->cgrp for unscheduled cgroups")
      
      assumed that a cgroup switch always precedes list_del_event. This is
      not the case. Remove the warning.

      Make sure that cpuctx->cgrp is NULL until a cgroup event is scheduled
      in or ctx->nr_cgroups == 0.
      Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Kan Liang <kan.liang@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Nilay Vaish <nilayvaish@gmail.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ravi V Shankar <ravi.v.shankar@intel.com>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vegard Nossum <vegard.nossum@gmail.com>
      Cc: Vikas Shivappa <vikas.shivappa@linux.intel.com>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Link: http://lkml.kernel.org/r/1480841177-27299-1-git-send-email-davidcc@google.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • bpf: add prog_digest and expose it via fdinfo/netlink · 7bd509e3
      Committed by Daniel Borkmann
      When loading a BPF program via bpf(2), calculate the digest over
      the program's instruction stream and store it in struct bpf_prog's
      digest member. This is done at a point in time before any instructions
      are rewritten by the verifier. Any map file descriptor numbers in the
      imm field, which are unstable across loads, are zeroed for the hash.
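      Roughly, the calculation amounts to the sketch below (sha1_digest()
      is a stand-in helper here, not a real kernel API; the in-kernel
      implementation differs in detail):

      static void prog_calc_digest(struct bpf_prog *fp)
      {
              u32 insn_size = fp->len * sizeof(struct bpf_insn);
              struct bpf_insn *insns;
              u32 i;

              insns = kmemdup(fp->insnsi, insn_size, GFP_KERNEL);
              if (!insns)
                      return;

              for (i = 0; i < fp->len; i++) {
                      /* Map fds in BPF_LD_IMM64 pseudo loads are not stable
                       * across program loads, so zero them before hashing.
                       */
                      if (insns[i].code == (BPF_LD | BPF_IMM | BPF_DW) &&
                          insns[i].src_reg == BPF_PSEUDO_MAP_FD)
                              insns[i].imm = 0;
              }

              sha1_digest(fp->digest, insns, insn_size);  /* stand-in */
              kfree(insns);
      }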
      
      fdinfo example output for progs:
      
        # cat /proc/1590/fdinfo/5
        pos:          0
        flags:        02000002
        mnt_id:       11
        prog_type:    1
        prog_jited:   1
        prog_digest:  b27e8b06da22707513aa97363dfb11c7c3675d28
        memlock:      4096
      
      When programs are pinned and retrieved by an ELF loader, the loader
      can check the program's digest through fdinfo and compare it against
      one that was generated over the ELF file's program section to see
      if the program needs to be reloaded. Furthermore, this can also be
      exposed through other means such as netlink in case of a tc cls/act
      dump (or xdp in future), but also through tracepoints or other
      facilities to identify the program. Other than that, the digest can
      also serve as a base name for the work in progress kallsyms support
      of programs. The digest doesn't depend/select the crypto layer, since
      we need to keep dependencies to a minimum. iproute2 will get support
      for this facility.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: Preserve const register type on const OR alu ops · 3c839744
      Committed by Gianluca Borello
      Occasionally, clang (e.g. version 3.8.1) translates a sum between two
      constant operands using a BPF_OR instead of a BPF_ADD. The verifier
      currently does not handle this scenario, and the destination register
      type becomes UNKNOWN_VALUE even if it's still storing a constant. As a
      result, the destination register cannot be used as an argument to a
      helper function expecting an ARG_CONST_STACK_*, limiting some use cases.
      
      Modify the verifier to handle this case, and add a few tests to make sure
      all combinations are supported, and stack boundaries are still verified
      even with BPF_OR.
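      A sketch of the verifier change (4.9-era register types, simplified):

      /* Returns true if the BPF_OR could be folded into a known constant,
       * keeping dst_reg as CONST_IMM instead of degrading it to
       * UNKNOWN_VALUE.
       */
      static bool evaluate_const_or(struct bpf_reg_state *dst_reg,
                                    struct bpf_reg_state *src_reg,
                                    struct bpf_insn *insn)
      {
              if (dst_reg->type != CONST_IMM)
                      return false;

              if (BPF_SRC(insn->code) == BPF_K) {
                      dst_reg->imm |= insn->imm;      /* reg |= constant imm */
                      return true;
              }
              if (src_reg->type == CONST_IMM) {
                      dst_reg->imm |= src_reg->imm;   /* reg |= constant reg */
                      return true;
              }
              return false;   /* caller falls back to UNKNOWN_VALUE */
      }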
      Signed-off-by: Gianluca Borello <g.borello@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  5. 03 December 2016, 2 commits
  6. 02 December 2016, 4 commits
    • bpf: BPF for lightweight tunnel infrastructure · 3a0af8fd
      Committed by Thomas Graf
      Registers new BPF program types which correspond to the LWT hooks:
        - BPF_PROG_TYPE_LWT_IN   => dst_input()
        - BPF_PROG_TYPE_LWT_OUT  => dst_output()
        - BPF_PROG_TYPE_LWT_XMIT => lwtunnel_xmit()
      
      The separate program types are required to differentiate between the
      capabilities each LWT hook allows:
      
       * Programs attached to dst_input() or dst_output() are restricted and
         may only read the data of an skb. This prevents modification and
         possible invalidation of already validated packet headers on receive
         and the construction of illegal headers while the IP headers are
         still being assembled.
      
       * Programs attached to lwtunnel_xmit() are allowed to modify packet
         content as well as prepending an L2 header via a newly introduced
         helper bpf_skb_change_head(). This is safe as lwtunnel_xmit() is
         invoked after the IP header has been assembled completely.
      
      All BPF programs receive an skb with L3 headers attached and may return
      one of the following error codes:
      
       BPF_OK - Continue routing as per nexthop
       BPF_DROP - Drop skb and return EPERM
       BPF_REDIRECT - Redirect skb to device as per redirect() helper.
                      (Only valid in lwtunnel_xmit() context)
      
      The return codes are binary compatible with their TC_ACT_
      relatives to ease compatibility.
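      For illustration, a minimal LWT input program could look like the
      sketch below (the "lwt_in" section name is an iproute2 loader
      convention and an assumption here, not part of the kernel ABI):

      #include <linux/bpf.h>

      /* Attached at dst_input(): read-only access to the skb, returning
       * one of the codes listed above.
       */
      __attribute__((section("lwt_in"), used))
      int lwt_in_filter(struct __sk_buff *skb)
      {
              if (skb->len > 1500)            /* may only read skb data */
                      return BPF_DROP;        /* drop skb, return EPERM */

              return BPF_OK;                  /* continue routing as per nexthop */
      }

      char _license[] __attribute__((section("license"), used)) = "GPL";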
      Signed-off-by: Thomas Graf <tgraf@suug.ch>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • locking/rtmutex: Use READ_ONCE() in rt_mutex_owner() · 1be5d4fa
      Committed by Thomas Gleixner
      While debugging the rtmutex unlock vs. dequeue race, Will suggested
      using READ_ONCE() in rt_mutex_owner() as it might race against the
      cmpxchg_release() in unlock_rt_mutex_safe().
      
      Will: "It's a minor thing which will most likely not matter in practice"
      
      Careful search did not unearth an actual problem in today's code, but
      it's better to be safe than surprised.
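      The change itself is small; roughly (a sketch of the helper, not the
      verbatim hunk):

      static inline struct task_struct *rt_mutex_owner(struct rt_mutex *lock)
      {
              unsigned long owner, *p = (unsigned long *)&lock->owner;

              /* Load ->owner exactly once and without tearing, so the mask
               * below cannot race oddly against the cmpxchg_release() in
               * unlock_rt_mutex_safe().
               */
              owner = READ_ONCE(*p);
              return (struct task_struct *)(owner & ~RT_MUTEX_HAS_WAITERS);
      }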
      Suggested-by: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: David Daney <ddaney@caviumnetworks.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Sebastian Siewior <bigeasy@linutronix.de>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/20161130210030.431379999@linutronix.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • locking/rtmutex: Prevent dequeue vs. unlock race · dbb26055
      Committed by Thomas Gleixner
      David reported a futex/rtmutex state corruption. It's caused by the
      following problem:
      
      CPU0		CPU1		CPU2
      
      l->owner=T1
      		rt_mutex_lock(l)
      		lock(l->wait_lock)
      		l->owner = T1 | HAS_WAITERS;
      		enqueue(T2)
      		boost()
      		  unlock(l->wait_lock)
      		schedule()
      
      				rt_mutex_lock(l)
      				lock(l->wait_lock)
      				l->owner = T1 | HAS_WAITERS;
      				enqueue(T3)
      				boost()
      				  unlock(l->wait_lock)
      				schedule()
      		signal(->T2)	signal(->T3)
      		lock(l->wait_lock)
      		dequeue(T2)
      		deboost()
      		  unlock(l->wait_lock)
      				lock(l->wait_lock)
      				dequeue(T3)
      				  ===> wait list is now empty
      				deboost()
      				 unlock(l->wait_lock)
      		lock(l->wait_lock)
      		fixup_rt_mutex_waiters()
      		  if (wait_list_empty(l)) {
      		    owner = l->owner & ~HAS_WAITERS;
      		    l->owner = owner
      		     ==> l->owner = T1
      		  }
      
      				lock(l->wait_lock)
      rt_mutex_unlock(l)		fixup_rt_mutex_waiters()
      				  if (wait_list_empty(l)) {
      				    owner = l->owner & ~HAS_WAITERS;
      cmpxchg(l->owner, T1, NULL)
       ===> Success (l->owner = NULL)
      				    l->owner = owner
      				     ==> l->owner = T1
      				  }
      
      That means the problem is caused by fixup_rt_mutex_waiters(), which
      does the RMW to clear the waiters bit unconditionally when there are
      no waiters in the rtmutex's rbtree.
      
      This can be fatal: A concurrent unlock can release the rtmutex in the
      fastpath because the waiters bit is not set. If the cmpxchg() gets in
      the middle of the RMW operation then the previous owner, which just
      unlocked the rtmutex, is set as the owner again when the write takes
      place after the successful cmpxchg().
      
      The solution is rather trivial: verify that the owner member of the rtmutex
      has the waiters bit set before clearing it. This does not require a
      cmpxchg() or other atomic operations because the waiters bit can only be
      set and cleared with the rtmutex wait_lock held. It's also safe against the
      fast path unlock attempt. The unlock attempt via cmpxchg() will either see
      the bit set and take the slowpath or see the bit cleared and release it
      atomically in the fastpath.
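      A sketch of the resulting fixup (simplified, relying on ->wait_lock
      being held as described above):

      static void fixup_rt_mutex_waiters(struct rt_mutex *lock)
      {
              unsigned long owner, *p = (unsigned long *)&lock->owner;

              if (rt_mutex_has_waiters(lock))
                      return;                 /* waiters queued, keep the bit */

              /* Only clear the bit if it is actually set; plain loads and
               * stores suffice since the bit is only modified under
               * ->wait_lock.
               */
              owner = READ_ONCE(*p);
              if (owner & RT_MUTEX_HAS_WAITERS)
                      WRITE_ONCE(*p, owner & ~RT_MUTEX_HAS_WAITERS);
      }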
      
      It's remarkable that the test program provided by David triggers on
      ARM64 and MIPS64 really quickly, but it refuses to reproduce on
      x86-64, while the problem exists there as well. That refusal might
      explain why this was not discovered earlier, despite the bug having
      existed from day one of the rtmutex implementation more than 10 years
      ago.

      Thanks to David for meticulously instrumenting the code and providing
      the information which allowed us to decode this subtle problem.
      Reported-by: David Daney <ddaney@caviumnetworks.com>
      Tested-by: David Daney <david.daney@cavium.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Sebastian Siewior <bigeasy@linutronix.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: stable@vger.kernel.org
      Fixes: 23f78d4a ("[PATCH] pi-futex: rt mutex core")
      Link: http://lkml.kernel.org/r/20161130210030.351136722@linutronix.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • audit: remove useless synchronize_net() · 60602982
      Committed by WANG Cong
      The netlink kernel socket is protected by a refcount, not RCU, and its
      rcv path is not protected by RCU either. So the synchronize_net()
      is just pointless.
      
      Cc: Richard Guy Briggs <rgb@redhat.com>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  7. 01 December 2016, 1 commit
  8. 30 November 2016, 2 commits
  9. 28 November 2016, 3 commits
  10. 26 November 2016, 2 commits
    • bpf: add BPF_PROG_ATTACH and BPF_PROG_DETACH commands · f4324551
      Committed by Daniel Mack
      Extend the bpf(2) syscall by two new commands, BPF_PROG_ATTACH and
      BPF_PROG_DETACH which allow attaching and detaching eBPF programs
      to a target.
      
      On the API level, the target could be anything that has an fd in
      userspace, hence the field in union bpf_attr is named
      'target_fd'.
      
      When called with BPF_ATTACH_TYPE_CGROUP_INET_{E,IN}GRESS, the target is
      expected to be a valid file descriptor of a cgroup v2 directory which
      has the bpf controller enabled. These are the only use-cases
      implemented by this patch at this point, but more can be added.
      
      If a program of the given type already exists in the given cgroup,
      the program is swapped atomically, so userspace does not have to drop
      an existing program first before installing a new one, which would
      otherwise leave a gap in which no program is attached.
      
      For more information on the propagation logic to subcgroups, please
      refer to the bpf cgroup controller implementation.
      
      The API is guarded by CAP_NET_ADMIN.
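      A hedged userspace example of the new command (the attach type enum
      ended up as BPF_CGROUP_INET_INGRESS/EGRESS in the uapi header; error
      handling trimmed):

      #include <fcntl.h>
      #include <string.h>
      #include <unistd.h>
      #include <sys/syscall.h>
      #include <linux/bpf.h>

      /* Attach an already-loaded program (prog_fd) to a cgroup v2
       * directory for ingress filtering. Requires CAP_NET_ADMIN.
       */
      static int attach_to_cgroup(int prog_fd, const char *cgroup_path)
      {
              union bpf_attr attr;
              int cg_fd = open(cgroup_path, O_DIRECTORY | O_RDONLY);

              if (cg_fd < 0)
                      return -1;

              memset(&attr, 0, sizeof(attr));
              attr.target_fd = cg_fd;         /* "anything that has an fd" */
              attr.attach_bpf_fd = prog_fd;
              attr.attach_type = BPF_CGROUP_INET_INGRESS;

              return syscall(__NR_bpf, BPF_PROG_ATTACH, &attr, sizeof(attr));
      }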
      Signed-off-by: Daniel Mack <daniel@zonque.org>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • cgroup: add support for eBPF programs · 30070984
      Committed by Daniel Mack
      This patch adds two sets of eBPF program pointers to struct cgroup:
      one for programs that are directly pinned to a cgroup, and one for
      programs that are effective for it.
      
      To illustrate the logic behind that, assume the following example
      cgroup hierarchy.
      
        A - B - C
              \ D - E
      
      If only B has a program attached, it will be effective for B, C, D
      and E. If D then attaches a program itself, that will be effective for
      both D and E, and the program in B will only affect B and C. Only one
      program of a given type is effective for a cgroup.
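      Conceptually, the effective program for a cgroup resolves to the
      nearest ancestor (or the cgroup itself) that has a program of that
      type attached; a sketch of that rule (the kernel actually caches the
      result in an 'effective' array at attach time, and these names are
      simplified):

      static struct bpf_prog *effective_prog(struct cgroup *cgrp,
                                             enum bpf_attach_type type)
      {
              for (; cgrp; cgrp = cgroup_parent(cgrp)) {
                      if (cgrp->bpf.prog[type])
                              return cgrp->bpf.prog[type];
              }
              return NULL;    /* nothing attached anywhere up the chain */
      }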
      
      Attaching and detaching programs will be done through the bpf(2)
      syscall. For now, ingress and egress inet socket filtering are the
      only supported use-cases.
      Signed-off-by: Daniel Mack <daniel@zonque.org>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  11. 24 November 2016, 1 commit
    • sched/autogroup: Fix 64-bit kernel nice level adjustment · 83929cce
      Committed by Mike Galbraith
      Michael Kerrisk reported:
      
      > Regarding the previous paragraph...  My tests indicate
      > that writing *any* value to the autogroup [nice priority level]
      > file causes the task group to get a lower priority.
      
      Because autogroup didn't call the then meaningless scale_load()...
      
      Autogroup nice level adjustment has been broken ever since load
      resolution was increased for 64-bit kernels.  Use scale_load() to
      scale group weight.
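      A sketch of the fix (simplified from proc_sched_autogroup_set_nice();
      the function name below is illustrative, not the actual symbol):

      static int autogroup_set_shares_from_nice(struct autogroup *ag, int nice)
      {
              unsigned long shares;

              if (nice < MIN_NICE || nice > MAX_NICE)
                      return -EINVAL;

              /* Convert nice to a load weight, then scale it for the
               * increased 64-bit load resolution.
               */
              shares = scale_load(sched_prio_to_weight[nice + 20]);
              return sched_group_set_shares(ag->tg, shares);
      }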
      
      Michael Kerrisk tested this patch to fix the problem:
      
      > Applied and tested against 4.9-rc6 on an Intel u7 (4 cores).
      > Test setup:
      >
      > Terminal window 1: running 40 CPU burner jobs
      > Terminal window 2: running 40 CPU burner jobs
      > Terminal window 3: running  1 CPU burner job
      >
      > Demonstrated that:
      > * Writing "0" to the autogroup file for TW1 now causes no change
      >   to the rate at which the process on the terminal consume CPU.
      > * Writing -20 to the autogroup file for TW1 caused those processes
      >   to get the lion's share of CPU while TW2 TW3 get a tiny amount.
      > * Writing -20 to the autogroup files for TW1 and TW3 allowed the
      >   process on TW3 to get as much CPU as it was getting as when
      >   the autogroup nice values for both terminals were 0.
      Reported-by: Michael Kerrisk <mtk.manpages@gmail.com>
      Tested-by: Michael Kerrisk <mtk.manpages@gmail.com>
      Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-man <linux-man@vger.kernel.org>
      Cc: stable@vger.kernel.org
      Link: http://lkml.kernel.org/r/1479897217.4306.6.camel@gmx.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  12. 22 November 2016, 3 commits
  13. 21 November 2016, 1 commit
  14. 19 November 2016, 1 commit
  15. 18 November 2016, 1 commit
    • netns: make struct pernet_operations::id unsigned int · c7d03a00
      Committed by Alexey Dobriyan
      Make struct pernet_operations::id unsigned.
      
      There are 2 reasons to do so:
      
      1)
      This field is really an index into a zero-based array and
      thus is an unsigned entity. Using a negative value is an out-of-bounds
      access by definition.
      
      2)
      On x86_64, unsigned 32-bit data which is mixed with pointers
      via array indexing or offsets added to or subtracted from pointers
      is preferred to signed 32-bit data.
      
      "int" being used as an array index needs to be sign-extended
      to 64-bit before being used.
      
      	void f(long *p, int i)
      	{
      		g(p[i]);
      	}
      
        roughly translates to
      
      	movsx	rsi, esi
      	mov	rdi, [rsi+...]
      	call 	g
      
      MOVSX is a 3-byte instruction which isn't necessary if the variable is
      unsigned, because x86_64 zero-extends by default.
      
      Now, there is net_generic() function which, you guessed it right, uses
      "int" as an array index:
      
      	static inline void *net_generic(const struct net *net, int id)
      	{
      		...
      		ptr = ng->ptr[id - 1];
      		...
      	}
      
      And this function is used a lot, so those sign extensions add up.
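      For comparison, the unsigned variant of the same function needs no
      sign extension, since 32-bit operations on x86_64 zero-extend
      implicitly (exact code generation is compiler-dependent; sketch only):

              void f(long *p, unsigned int i)
              {
                      g(p[i]);
              }

        roughly translates to

              mov     rdi, [rsi+...]
              call    g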
      
      Patch snipes ~1730 bytes on allyesconfig kernel (without all junk
      messing with code generation):
      
      	add/remove: 0/0 grow/shrink: 70/598 up/down: 396/-2126 (-1730)
      
      Unfortunately some functions actually grow bigger.
      This is a seemingly random artefact of code generation, with the
      register allocator being used differently. gcc decides that some
      variable needs to live in the new r8+ registers and every access now
      requires a REX prefix. Or it is shifted into r12, so the [r12+0]
      addressing mode has to be used, which is longer than [r8].
      
      However, overall balance is in negative direction:
      
      	add/remove: 0/0 grow/shrink: 70/598 up/down: 396/-2126 (-1730)
      	function                                     old     new   delta
      	nfsd4_lock                                  3886    3959     +73
      	tipc_link_build_proto_msg                   1096    1140     +44
      	mac80211_hwsim_new_radio                    2776    2808     +32
      	tipc_mon_rcv                                1032    1058     +26
      	svcauth_gss_legacy_init                     1413    1429     +16
      	tipc_bcbase_select_primary                   379     392     +13
      	nfsd4_exchange_id                           1247    1260     +13
      	nfsd4_setclientid_confirm                    782     793     +11
      		...
      	put_client_renew_locked                      494     480     -14
      	ip_set_sockfn_get                            730     716     -14
      	geneve_sock_add                              829     813     -16
      	nfsd4_sequence_done                          721     703     -18
      	nlmclnt_lookup_host                          708     686     -22
      	nfsd4_lockt                                 1085    1063     -22
      	nfs_get_client                              1077    1050     -27
      	tcf_bpf_init                                1106    1076     -30
      	nfsd4_encode_fattr                          5997    5930     -67
      	Total: Before=154856051, After=154854321, chg -0.00%
      Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  16. 17 November 2016, 2 commits
  17. 16 November 2016, 5 commits
    • bpf: Add BPF_MAP_TYPE_LRU_PERCPU_HASH · 8f844938
      Committed by Martin KaFai Lau
      Provide an LRU version of the existing BPF_MAP_TYPE_PERCPU_HASH.
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: Add BPF_MAP_TYPE_LRU_HASH · 29ba732a
      Committed by Martin KaFai Lau
      Provide an LRU version of the existing BPF_MAP_TYPE_HASH.
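      A hedged userspace example of creating such a map directly via
      bpf(2) (error handling trimmed):

      #include <string.h>
      #include <unistd.h>
      #include <sys/syscall.h>
      #include <linux/bpf.h>

      /* Once the map is full, further updates reclaim the least recently
       * used entries instead of failing.
       */
      int create_lru_map(void)
      {
              union bpf_attr attr;

              memset(&attr, 0, sizeof(attr));
              attr.map_type    = BPF_MAP_TYPE_LRU_HASH;
              attr.key_size    = sizeof(__u32);
              attr.value_size  = sizeof(__u64);
              attr.max_entries = 1024;

              return syscall(__NR_bpf, BPF_MAP_CREATE, &attr, sizeof(attr));
      }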
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: Refactor codes handling percpu map · fd91de7b
      Committed by Martin KaFai Lau
      Refactor the code that populates the value
      of a htab_elem in a BPF_MAP_TYPE_PERCPU_HASH
      typed bpf_map.
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: Add percpu LRU list · 961578b6
      Committed by Martin KaFai Lau
      Instead of having a common LRU list, this patch allows a
      percpu LRU list which can be selected by specifying a map
      attribute.  The map attribute will be added in the later
      patch.
      
      While the common use case for LRU is #reads >> #updates,
      a percpu LRU list allows a bpf prog to absorb an unusual number of
      #updates under pathological cases (e.g. an external traffic-facing
      machine which could be under attack).

      Each percpu LRU list is isolated from the others.  The LRU nodes
      (including free nodes) cannot be moved across different LRU lists.
      
      Here are the update performance comparison between
      common LRU list and percpu LRU list (the test code is
      at the last patch):
      
      [root@kerneltest003.31.prn1 ~]# for i in 1 4 8; do echo -n "$i cpus: "; \
      ./map_perf_test 16 $i | awk '{r += $3}END{print r " updates"}'; done
       1 cpus: 2934082 updates
       4 cpus: 7391434 updates
       8 cpus: 6500576 updates
      
      [root@kerneltest003.31.prn1 ~]# for i in 1 4 8; do echo -n "$i cpus: "; \
      ./map_perf_test 32 $i | awk '{r += $3}END{print r " updates"}'; done
        1 cpus: 2896553 updates
        4 cpus: 9766395 updates
        8 cpus: 17460553 updates
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: LRU List · 3a08c2fd
      Committed by Martin KaFai Lau
      Introduce bpf_lru_list which will provide LRU capability to
      the bpf_htab in the later patch.
      
      * General Thoughts:
      1. Target use case.  Reads are more frequent than updates
         (i.e. bpf_lookup_elem() is more often than bpf_update_elem()).
         If a bpf_prog does a bpf_lookup_elem() first and then an in-place
         update, it still counts as a read operation as far as the LRU
         list is concerned.
      2. It may be useful to think of it as an LRU cache
      3. Optimize the read case
         3.1 No lock in read case
         3.2 The LRU maintenance is only done during bpf_update_elem()
      4. If there is a percpu LRU list, it will lose the system-wide LRU
         property.  A completely isolated percpu LRU list has the best
         performance but the memory utilization is not ideal, considering
         the workload may be imbalanced.
      5. Hence, this patch starts the LRU implementation with a global LRU
         list with batched operations before accessing the global LRU list.
         As an LRU cache, where #reads >> #update/#insert operations, it
         will work well.
      6. There is a local list (for each cpu) which is named
         'struct bpf_lru_locallist'.  This local list is not used to sort
         the LRU property.  Instead, the local list is to batch enough
         operations before acquiring the lock of the global LRU list.  More
         details on this later.
      7. In the later patch, it allows a percpu LRU list by specifying a
         map attribute for scalability reasons and for use cases that need
         to prepare for the worst (and pathological) case like a DoS attack.
         The percpu LRU lists are completely isolated from each other and
         the LRU nodes (including free nodes) cannot be moved across lists.
         The following description is for the global LRU list but is mostly
         applicable to the percpu LRU list also.
      
      * Global LRU List:
      1. It has three sub-lists: active-list, inactive-list and free-list.
      2. The two list idea, active and inactive, is borrowed from the
         page cache.
      3. All nodes are pre-allocated and all sit at the free-list (of the
         global LRU list) at the beginning.  The pre-allocation reasoning
         is similar to the existing BPF_MAP_TYPE_HASH.  However,
         opting-out prealloc (BPF_F_NO_PREALLOC) is not supported in
         the LRU map.
      
      * Active/Inactive List (of the global LRU list):
      1. The active list, as its name says it, maintains the active set of
         the nodes.  We can think of it as the working set or more frequently
         accessed nodes.  The access frequency is approximated by a ref-bit.
         The ref-bit is set during the bpf_lookup_elem().
      2. The inactive list, as its name also says it, maintains a less
         active set of nodes.  They are the candidates to be removed
         from the bpf_htab when we are running out of free nodes.
      3. The ordering of these two lists acts as a rough clock.
         The tail of the inactive list holds the older nodes which
         should be released first if the bpf_htab needs a free element.
      
      * Rotating the Active/Inactive List (of the global LRU list):
      1. It is the basic operation to maintain the LRU property of
         the global list.
      2. The active list is only rotated when the inactive list is running
         low.  This idea is similar to the current page cache.
         Inactive running low is currently defined as
         "# of inactive < # of active".
      3. The active list rotation always starts from the tail.  It moves
         nodes without the ref-bit set to the head of the inactive list.
         It moves nodes with the ref-bit set back to the head of the active
         list and then clears their ref-bit.
      4. The inactive rotation is pretty simple.
         It walks the inactive list and moves nodes back to the head of the
         active list if their ref-bit is set. The ref-bit is cleared after
         moving to the active list.
         If a node does not have the ref-bit set, it is just left as it is
         because it is already in the inactive list.
      
      * Shrinking the Inactive List (of the global LRU list):
      1. Shrinking is the operation to get free nodes when the bpf_htab is
         full.
      2. It usually only shrinks the inactive list to get free nodes.
      3. During shrinking, it will walk the inactive list from the tail and
         delete the nodes without the ref-bit set from the bpf_htab.
      4. If no free node is found after step (3), it will forcefully take
         one node from the tail of the inactive or active list.  Forcefully
         is in the sense that it ignores the ref-bit.
      
      * Local List:
      1. Each CPU has a 'struct bpf_lru_locallist'.  The purpose is to
         batch enough operations before acquiring the lock of the
         global LRU.
      2. A local list has two sub-lists, free-list and pending-list.
      3. During bpf_update_elem(), it will try to get a node from the
         free-list of the current CPU's local list.
      4. If the local free-list is empty, it will acquire nodes from the
         global LRU list.  The global LRU list can either satisfy it
         from its global free-list or by shrinking the global inactive
         list.  Since we have acquired the global LRU list lock,
         it will try to move at most LOCAL_FREE_TARGET elements
         to the local free-list (see the sketch after this list).
      5. When a new element is added to the bpf_htab, it first sits on
         the pending-list (of the local list).  The pending-list is
         flushed to the global LRU list the next time free nodes need
         to be acquired from the global list.
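      A pseudocode sketch of steps 3-5 above (all helper names are made up
      for illustration and are not the real bpf_lru_list.c API):

      static struct bpf_lru_node *lru_pop_free(struct bpf_lru *lru, int cpu)
      {
              struct bpf_lru_locallist *loc = per_cpu_ptr(lru->locallist, cpu);
              struct bpf_lru_node *node;

              node = local_free_pop(loc);             /* step 3 */
              if (node)
                      return node;

              raw_spin_lock(&lru->lock);              /* lru_lock only, never
                                                         nested in buck_lock */
              local_pending_flush(loc, lru);          /* step 5 */
              /* step 4: refill up to LOCAL_FREE_TARGET nodes from the global
               * free-list, shrinking the inactive list if necessary.
               */
              global_refill(lru, loc, LOCAL_FREE_TARGET);
              raw_spin_unlock(&lru->lock);

              return local_free_pop(loc);
      }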
      
      * Lock Consideration:
      The LRU list has a lock (lru_lock).  Each bucket of htab has a
      lock (buck_lock).  If both locks need to be acquired together,
      the lock order is always lru_lock -> buck_lock and this only
      happens in the bpf_lru_list.c logic.
      
      In hashtab.c, both locks are not acquired together (i.e. one
      lock is always released first before acquiring another lock).
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  18. 15 November 2016, 2 commits
    • perf/core: Do not set cpuctx->cgrp for unscheduled cgroups · 864c2357
      Committed by David Carrillo-Cisneros
      Commit:
      
        db4a8356 ("perf/core: Set cgroup in CPU contexts for new cgroup events")
      
      failed to verify that event->cgrp is actually the scheduled cgroup
      in a CPU before setting cpuctx->cgrp. This patch fixes that.
      
      Now that there is a different path for scheduled and unscheduled
      cgroups, add a warning to catch when cpuctx->cgrp is still set after
      the last cgroup event has been unscheduled.
      
      To verify the bug:
      
        # Create 2 cgroups.
        mkdir /dev/cgroups/devices/g1
        mkdir /dev/cgroups/devices/g2
      
        # launch a task, bind it to a cpu and move it to g1
        CPU=2
        while :; do : ; done &
        P=$!
      
        taskset -pc $CPU $P
        echo $P > /dev/cgroups/devices/g1/tasks
      
        # monitor g2 (it runs no tasks) and observe output
        perf stat -e cycles -I 1000 -C $CPU -G g2
      
        #           time             counts unit events
           1.000091408          7,579,527      cycles                    g2
           2.000350111      <not counted>      cycles                    g2
           3.000589181      <not counted>      cycles                    g2
           4.000771428      <not counted>      cycles                    g2
      
        # note the first line, which shows that a task ran in g2 despite
        # g2 having no tasks. This is because cpuctx->cgrp was wrongly
        # set when the context of the new event was installed.
        # After applying the fix we obtain the right output:
      
        perf stat -e cycles -I 1000 -C $CPU -G g2
        #           time             counts unit events
           1.000119615      <not counted>      cycles                    g2
           2.000389430      <not counted>      cycles                    g2
           3.000590962      <not counted>      cycles                    g2
      Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Kan Liang <kan.liang@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Nilay Vaish <nilayvaish@gmail.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vegard Nossum <vegard.nossum@gmail.com>
      Link: http://lkml.kernel.org/r/1478026378-86083-1-git-send-email-davidcc@google.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • ftrace: Add more checks for FTRACE_FL_DISABLED in processing ip records · 546fece4
      Committed by Steven Rostedt (Red Hat)
      When a module is first loaded and its function ip records are added to
      the ftrace list of functions to modify, they are set to DISABLED, as
      their text is still in a read-only state. When the module is fully
      loaded and can be updated, the flag is cleared, and if there are any
      functions that should be tracing them, they are updated at that moment.

      But there are several locations that do record accounting and should
      ignore records that are marked as disabled, or they can cause issues.
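      The pattern added throughout the record-walking loops is roughly (a
      sketch):

      do_for_each_ftrace_rec(pg, rec) {
              /* Skip records whose module text cannot be touched yet. */
              if (rec->flags & FTRACE_FL_DISABLED)
                      continue;
              /* ... account for / update the record as before ... */
      } while_for_each_ftrace_rec();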
      
      Alexei already fixed one location, but others need to be addressed.
      
      Cc: stable@vger.kernel.org
      Fixes: b7ffffbb "ftrace: Add infrastructure for delayed enabling of module functions"
      Reported-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>