1. 11 Jun, 2017 3 commits
  2. 07 Jun, 2017 7 commits
  3. 05 Jun, 2017 1 commit
  4. 03 Jun, 2017 2 commits
  5. 01 Jun, 2017 5 commits
    • A
      bpf: use different interpreter depending on required stack size · b870aa90
      Committed by Alexei Starovoitov
      Adding 16 __bpf_prog_run() interpreters for various stack sizes grows .text,
      but only slightly compared to the run-time stack savings:
      
         text	   data	    bss	    dec	    hex	filename
        26350	  10328	    624	  37302	   91b6	kernel/bpf/core.o.before_split
        25777	  10328	    624	  36729	   8f79	kernel/bpf/core.o.after_split
        26970	  10328	    624	  37922	   9422	kernel/bpf/core.o.now
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b870aa90
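To make the "one interpreter per stack size" idea concrete, here is a minimal sketch of how a variant could be picked from a program's tracked stack depth. The 512-byte maximum, the 32-byte step, and all names below are assumptions for illustration; the commit message only states that there are 16 variants.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_BPF_STACK   512	/* assumed upper bound on program stack, in bytes */
#define STACK_STEP      32	/* assumed granularity: 512 bytes / 16 variants */
#define NR_INTERPRETERS (MAX_BPF_STACK / STACK_STEP)

/* Return the index (0..15) of the smallest interpreter variant whose
 * stack buffer can hold stack_depth bytes. */
static size_t interpreter_index(unsigned int stack_depth)
{
	assert(stack_depth <= MAX_BPF_STACK);
	if (stack_depth == 0)
		stack_depth = 1;	/* a program using no stack still maps to variant 0 */
	return (stack_depth + STACK_STEP - 1) / STACK_STEP - 1;
}

/* Stack bytes reserved by the variant at a given index. */
static unsigned int interpreter_stack_size(size_t idx)
{
	assert(idx < NR_INTERPRETERS);
	return (unsigned int)(idx + 1) * STACK_STEP;
}
```

Under these assumptions a program that uses 40 bytes of stack runs on the 64-byte variant instead of always reserving the full 512 bytes.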
    • A
      bpf: reconcile bpf_tail_call and stack_depth · 80a58d02
      Committed by Alexei Starovoitov
      The next set of patches will take advantage of stack_depth tracking,
      so make sure that a program that does bpf_tail_call() has a
      stack depth large enough for the callee.
      We could have tracked the stack depth of the prog_array owner program
      and only allowed insertion of programs with a stack depth less
      than the owner's, but that would break existing applications.
      Some of them have a trivial root BPF program that only does
      multiple bpf_tail_calls, and at init time the prog array is empty.
      In the future we may add a flag to do such tracking optionally,
      but for now play simple and safe.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      80a58d02
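The "simple and safe" choice described above can be sketched as follows: since the callee of a tail call is unknown at load time, the caller is budgeted for the worst case. The struct, the helper name, and the 512-byte maximum are hypothetical and are not taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_BPF_STACK 512	/* assumed maximum stack size in bytes */

/* Hypothetical per-program bookkeeping kept by a verifier. */
struct prog_info {
	unsigned int stack_depth;	/* deepest stack use found in the program */
	bool does_tail_call;		/* program contains a bpf_tail_call() */
};

/* Stack depth to budget for when running the program. A tail-calling
 * program may jump into any callee, so it assumes the maximum. */
static unsigned int effective_stack_depth(const struct prog_info *p)
{
	assert(p->stack_depth <= MAX_BPF_STACK);
	return p->does_tail_call ? MAX_BPF_STACK : p->stack_depth;
}
```

A trivial root program that only dispatches through tail calls would therefore still be treated as needing the full stack, which matches the concern about empty prog arrays at init time.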
    • A
      bpf: teach verifier to track stack depth · 8726679a
      Committed by Alexei Starovoitov
      Teach the verifier to track BPF program stack depth.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8726679a
    • A
      bpf: split bpf core interpreter · f696b8f4
      Committed by Alexei Starovoitov
      Split the __bpf_prog_run() interpreter into stack allocation and execution parts.
      The code section shrinks, which helps interpreter performance in some cases.
         text	   data	    bss	    dec	    hex	filename
        26350	  10328	    624	  37302	   91b6	kernel/bpf/core.o.before
        25777	  10328	    624	  36729	   8f79	kernel/bpf/core.o.after
      
      Very short programs got slower (due to extra function call):
      Before:
      test_bpf: #89 ALU64_ADD_K: 1 + 2 = 3 jited:0 7 PASS
      test_bpf: #90 ALU64_ADD_K: 3 + 0 = 3 jited:0 8 PASS
      test_bpf: #91 ALU64_ADD_K: 1 + 2147483646 = 2147483647 jited:0 7 PASS
      test_bpf: #92 ALU64_ADD_K: 4294967294 + 2 = 4294967296 jited:0 11 PASS
      test_bpf: #93 ALU64_ADD_K: 2147483646 + -2147483647 = -1 jited:0 7 PASS
      After:
      test_bpf: #89 ALU64_ADD_K: 1 + 2 = 3 jited:0 11 PASS
      test_bpf: #90 ALU64_ADD_K: 3 + 0 = 3 jited:0 11 PASS
      test_bpf: #91 ALU64_ADD_K: 1 + 2147483646 = 2147483647 jited:0 11 PASS
      test_bpf: #92 ALU64_ADD_K: 4294967294 + 2 = 4294967296 jited:0 14 PASS
      test_bpf: #93 ALU64_ADD_K: 2147483646 + -2147483647 = -1 jited:0 10 PASS
      
      Longer programs got faster:
      Before:
      test_bpf: #266 BPF_MAXINSNS: Ctx heavy transformations jited:0 20286 20513 PASS
      test_bpf: #267 BPF_MAXINSNS: Call heavy transformations jited:0 31853 31768 PASS
      test_bpf: #268 BPF_MAXINSNS: Jump heavy test jited:0 9815 PASS
      test_bpf: #269 BPF_MAXINSNS: Very long jump backwards jited:0 6 PASS
      test_bpf: #270 BPF_MAXINSNS: Edge hopping nuthouse jited:0 13959 PASS
      test_bpf: #271 BPF_MAXINSNS: Jump, gap, jump, ... jited:0 210 PASS
      test_bpf: #272 BPF_MAXINSNS: ld_abs+get_processor_id jited:0 21724 PASS
      test_bpf: #273 BPF_MAXINSNS: ld_abs+vlan_push/pop jited:0 19118 PASS
      After:
      test_bpf: #266 BPF_MAXINSNS: Ctx heavy transformations jited:0 19008 18827 PASS
      test_bpf: #267 BPF_MAXINSNS: Call heavy transformations jited:0 29238 28450 PASS
      test_bpf: #268 BPF_MAXINSNS: Jump heavy test jited:0 9485 PASS
      test_bpf: #269 BPF_MAXINSNS: Very long jump backwards jited:0 12 PASS
      test_bpf: #270 BPF_MAXINSNS: Edge hopping nuthouse jited:0 13257 PASS
      test_bpf: #271 BPF_MAXINSNS: Jump, gap, jump, ... jited:0 213 PASS
      test_bpf: #272 BPF_MAXINSNS: ld_abs+get_processor_id jited:0 19389 PASS
      test_bpf: #273 BPF_MAXINSNS: ld_abs+vlan_push/pop jited:0 19583 PASS
      
      For real world production programs the difference is noise.
      
      This patch is the first step towards reducing interpreter stack consumption.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f696b8f4
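The split into an allocation part and an execution part can be illustrated with a toy interpreter. Everything here (the opcodes, the struct layout, the function names) is invented for the sketch and is far simpler than the kernel's interpreter; it only shows the shape of the change, where a thin wrapper owns the stack and registers and the core loop has no large locals.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_STACK_SIZE 512	/* assumed stack size in bytes */
#define TOY_NR_REGS    11	/* r0..r10, with r10 used as frame pointer */

enum toy_op { TOY_MOV_K, TOY_ADD_K, TOY_EXIT };	/* hypothetical opcodes */

struct toy_insn {
	enum toy_op op;
	uint8_t dst;
	int32_t imm;
};

/* Execution part: operates only on caller-provided state. */
static uint64_t toy_run_core(uint64_t *regs, const struct toy_insn *insn)
{
	for (;; insn++) {
		assert(insn->dst < TOY_NR_REGS);
		switch (insn->op) {
		case TOY_MOV_K:
			regs[insn->dst] = (uint64_t)(int64_t)insn->imm;
			break;
		case TOY_ADD_K:
			regs[insn->dst] += (uint64_t)(int64_t)insn->imm;
			break;
		case TOY_EXIT:
			return regs[0];
		}
	}
}

/* Allocation part: sets up stack and registers, then calls the core.
 * This extra call is the overhead visible in very short programs. */
static uint64_t toy_run(const struct toy_insn *insn)
{
	uint64_t stack[TOY_STACK_SIZE / sizeof(uint64_t)];
	uint64_t regs[TOY_NR_REGS];

	memset(regs, 0, sizeof(regs));
	/* frame pointer points just past the end of the stack buffer */
	regs[10] = (uint64_t)(uintptr_t)&stack[TOY_STACK_SIZE / sizeof(uint64_t)];
	return toy_run_core(regs, insn);
}
```

With this shape, offering several wrappers that differ only in the size of `stack` becomes cheap, because the core loop is shared.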
    • A
      bpf: free up BPF_JMP | BPF_CALL | BPF_X opcode · 71189fa9
      Committed by Alexei Starovoitov
      Free up the BPF_JMP | BPF_CALL | BPF_X opcode so it can be used for an actual
      indirect call by register, and use a kernel-internal opcode to
      mark the call instruction into the bpf_tail_call() helper.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      71189fa9
  6. 26 May, 2017 3 commits
    • D
      bpf: fix wrong exposure of map_flags into fdinfo for lpm · a316338c
      Committed by Daniel Borkmann
      trie_alloc() always needs to have BPF_F_NO_PREALLOC passed in via
      attr->map_flags, since it does not support preallocation yet. We
      check the flag, but we never copy the flag into trie->map.map_flags,
      which is later on exposed into fdinfo and used by loaders such as
      iproute2. The latter uses this in bpf_map_selfcheck_pinned() to test
      whether a pinned map has the same spec as the one from the BPF obj
      file and, if not, bails out, which is currently the case for lpm
      since it always exposes 0 as flags.
      
      Also copy over flags in array_map_alloc() and stack_map_alloc().
      They always have to be 0 right now, but we should make sure not to
      forget to copy them over at a later point in time when we add actual
      flags for them to use.
      
      Fixes: b95a5c4d ("bpf: add a longest prefix match trie map implementation")
      Reported-by: Jarno Rajahalme <jarno@covalent.io>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a316338c
    • D
      bpf: properly reset caller saved regs after helper call and ld_abs/ind · a9789ef9
      Committed by Daniel Borkmann
      Currently, after performing helper calls, we clear all caller saved
      registers, that is r0 - r5, and fill r0 depending on the struct bpf_func_proto
      specification. The way we reset these regs can affect pruning decisions
      in later paths, since we only reset the register's imm to 0 and type to
      NOT_INIT. However, we leave out clearing of other variables such as id,
      min_value, max_value, etc., which can later on lead to pruning mismatches
      due to stale data.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a9789ef9
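The stale-data problem described above can be sketched with a simplified register state. The struct and function names are hypothetical, and zeroing every field stands in for whatever "unknown" defaults the real verifier assigns; the point is only that a full reset leaves nothing behind for a later state comparison to trip over.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified register state. */
enum toy_reg_type { TOY_NOT_INIT = 0, TOY_UNKNOWN_VALUE, TOY_PTR_TO_MAP_VALUE };

struct toy_reg_state {
	enum toy_reg_type type;
	int64_t imm;
	uint32_t id;
	int64_t min_value;
	uint64_t max_value;
};

#define TOY_CALLER_SAVED_REGS 6	/* r0..r5 */

/* Reset every tracked field, not only type and imm. */
static void toy_mark_reg_not_init(struct toy_reg_state *reg)
{
	assert(reg);
	memset(reg, 0, sizeof(*reg));
	reg->type = TOY_NOT_INIT;
}

/* Called after a helper call: r0..r5 no longer hold known values. */
static void toy_reset_caller_saved(struct toy_reg_state *regs)
{
	for (int i = 0; i < TOY_CALLER_SAVED_REGS; i++)
		toy_mark_reg_not_init(&regs[i]);
}

/* Field-by-field comparison, as a state-pruning check might do. */
static bool toy_reg_equal(const struct toy_reg_state *a,
			  const struct toy_reg_state *b)
{
	return a->type == b->type && a->imm == b->imm && a->id == b->id &&
	       a->min_value == b->min_value && a->max_value == b->max_value;
}
```

If only `type` and `imm` were reset, two paths that differ merely in a leftover `id` or `max_value` would compare as unequal and could not be pruned against each other.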
    • D
      bpf: fix incorrect pruning decision when alignment must be tracked · 1ad2f583
      Committed by Daniel Borkmann
      Currently, when we enforce alignment tracking on direct packet access,
      the verifier lets the following program pass despite doing a packet
      write with unaligned access:
      
        0: (61) r2 = *(u32 *)(r1 +76)
        1: (61) r3 = *(u32 *)(r1 +80)
        2: (61) r7 = *(u32 *)(r1 +8)
        3: (bf) r0 = r2
        4: (07) r0 += 14
        5: (25) if r7 > 0x1 goto pc+4
         R0=pkt(id=0,off=14,r=0) R1=ctx R2=pkt(id=0,off=0,r=0)
         R3=pkt_end R7=inv,min_value=0,max_value=1 R10=fp
        6: (2d) if r0 > r3 goto pc+1
         R0=pkt(id=0,off=14,r=14) R1=ctx R2=pkt(id=0,off=0,r=14)
         R3=pkt_end R7=inv,min_value=0,max_value=1 R10=fp
        7: (63) *(u32 *)(r0 -4) = r0
        8: (b7) r0 = 0
        9: (95) exit
      
        from 6 to 8:
         R0=pkt(id=0,off=14,r=0) R1=ctx R2=pkt(id=0,off=0,r=0)
         R3=pkt_end R7=inv,min_value=0,max_value=1 R10=fp
        8: (b7) r0 = 0
        9: (95) exit
      
        from 5 to 10:
         R0=pkt(id=0,off=14,r=0) R1=ctx R2=pkt(id=0,off=0,r=0)
         R3=pkt_end R7=inv,min_value=2 R10=fp
        10: (07) r0 += 1
        11: (05) goto pc-6
        6: safe                           <----- here, wrongly found safe
        processed 15 insns
      
      However, if we enforce a pruning mismatch by adding state into r8
      which is then being mismatched in states_equal(), we find that for
      the otherwise same program, the verifier detects a misaligned packet
      access when actually walking that path:
      
        0: (61) r2 = *(u32 *)(r1 +76)
        1: (61) r3 = *(u32 *)(r1 +80)
        2: (61) r7 = *(u32 *)(r1 +8)
        3: (b7) r8 = 1
        4: (bf) r0 = r2
        5: (07) r0 += 14
        6: (25) if r7 > 0x1 goto pc+4
         R0=pkt(id=0,off=14,r=0) R1=ctx R2=pkt(id=0,off=0,r=0)
         R3=pkt_end R7=inv,min_value=0,max_value=1
         R8=imm1,min_value=1,max_value=1,min_align=1 R10=fp
        7: (2d) if r0 > r3 goto pc+1
         R0=pkt(id=0,off=14,r=14) R1=ctx R2=pkt(id=0,off=0,r=14)
         R3=pkt_end R7=inv,min_value=0,max_value=1
         R8=imm1,min_value=1,max_value=1,min_align=1 R10=fp
        8: (63) *(u32 *)(r0 -4) = r0
        9: (b7) r0 = 0
        10: (95) exit
      
        from 7 to 9:
         R0=pkt(id=0,off=14,r=0) R1=ctx R2=pkt(id=0,off=0,r=0)
         R3=pkt_end R7=inv,min_value=0,max_value=1
         R8=imm1,min_value=1,max_value=1,min_align=1 R10=fp
        9: (b7) r0 = 0
        10: (95) exit
      
        from 6 to 11:
         R0=pkt(id=0,off=14,r=0) R1=ctx R2=pkt(id=0,off=0,r=0)
         R3=pkt_end R7=inv,min_value=2
         R8=imm1,min_value=1,max_value=1,min_align=1 R10=fp
        11: (07) r0 += 1
        12: (b7) r8 = 0
        13: (05) goto pc-7                <----- mismatch due to r8
        7: (2d) if r0 > r3 goto pc+1
         R0=pkt(id=0,off=15,r=15) R1=ctx R2=pkt(id=0,off=0,r=15)
         R3=pkt_end R7=inv,min_value=2
         R8=imm0,min_value=0,max_value=0,min_align=2147483648 R10=fp
        8: (63) *(u32 *)(r0 -4) = r0
        misaligned packet access off 2+15+-4 size 4
      
      The reason why we fail to see it in states_equal() is that the
      third test in compare_ptrs_to_packet() ...
      
        if (old->off <= cur->off &&
            old->off >= old->range && cur->off >= cur->range)
                return true;
      
      ... will let the above pass. The situation we run into is that
      old->off <= cur->off (14 <= 15), meaning that prior walked paths
      went with smaller offset, which was later used in the packet
      access after successful packet range check and found to be safe
      already.
      
      For example: Given is R0=pkt(id=0,off=0,r=0). Adding offset 14
      as in above program to it, results in R0=pkt(id=0,off=14,r=0)
      before the packet range test. Now, testing this against R3=pkt_end
      with 'if r0 > r3 goto out' will transform R0 into R0=pkt(id=0,off=14,r=14)
      for the case when we're within bounds. A write into the packet
      at offset *(u32 *)(r0 -4), that is, 2 + 14 -4, is valid and
      aligned (2 is for NET_IP_ALIGN). After processing this with
      all fall-through paths, we later on check paths from branches.
      When the above skb->mark test is true, then we jump near the
      end of the program, perform r0 += 1, and jump back to the
      'if r0 > r3 goto out' test we've visited earlier already. This
      time, R0 is of type R0=pkt(id=0,off=15,r=0), and we'll prune
      that part because this time we'll have a larger safe packet
      range, and we already found that with off=14 all further insn
      were already safe, so it's safe as well with a larger off.
      However, the problem is that the subsequent write into the packet
      with 2 + 15 -4 is then unaligned, and not caught by the alignment
      tracking. Note that min_align, aux_off, and aux_off_align were
      all 0 in this example.
      
      Since we cannot tell at this time what kind of packet access was
      performed in the prior walk and what minimal requirements it has
      (we might do so in the future, but that requires more complexity),
      fix it to disable this pruning case for strict alignment for now,
      and let the verifier check such paths instead. With that applied,
      the test cases pass and reject the program due to misalignment.
      
      Fixes: d1174416 ("bpf: Track alignment of register values in the verifier.")
      Reference: http://patchwork.ozlabs.org/patch/761909/
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1ad2f583
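The arithmetic behind this bug can be replayed in a few lines. The sketch below mirrors the alignment expression and the third compare_ptrs_to_packet() test as quoted in the commit message; the struct, the function names, and the placement of the strict-alignment guard are assumptions for illustration, not the patch itself.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, reduced view of a packet pointer register. */
struct toy_pkt_reg {
	int off;	/* constant offset added to the packet pointer */
	int range;	/* bytes verified to be within the packet */
};

/* Alignment check in the form the log prints it:
 * "misaligned packet access off 2+15+-4 size 4". */
static bool toy_pkt_access_aligned(int ip_align, int reg_off, int off, int size)
{
	assert(size > 0);
	return (ip_align + reg_off + off) % size == 0;
}

/* Third test quoted above, skipped when strict alignment is enforced
 * (the guard placement is an assumption of this sketch). */
static bool toy_third_test_allows_prune(const struct toy_pkt_reg *old,
					const struct toy_pkt_reg *cur,
					bool strict_alignment)
{
	if (strict_alignment)
		return false;
	return old->off <= cur->off &&
	       old->off >= old->range && cur->off >= cur->range;
}
```

With old off=14 and current off=15, the test allows pruning even though 2 + 14 - 4 = 12 is 4-byte aligned while 2 + 15 - 4 = 13 is not, which is why the case has to be disabled under strict alignment.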
  7. 23 May, 2017 1 commit
  8. 18 May, 2017 1 commit
    • D
      bpf: adjust verifier heuristics · 3c2ce60b
      Committed by Daniel Borkmann
      Current limits with regards to processing program paths do not
      really reflect today's needs anymore due to programs becoming
      more complex and verifier smarter, keeping track of more data
      such as const ALU operations, alignment tracking, spilling of
      PTR_TO_MAP_VALUE_ADJ registers, and other features allowing for
      smarter matching of what LLVM generates.
      
      This also comes with the side-effect that we end up with fewer
      opportunities to prune search states and thus often need to do
      more work to prove safety than in the past due to different
      register states and stack layout where we mismatch. Generally,
      it's quite hard to determine what caused a sudden increase in
      complexity, it could be caused by something as trivial as a
      single branch somewhere at the beginning of the program where
      LLVM assigned a stack slot that is marked differently throughout
      other branches and thus causing a mismatch, where verifier
      then needs to prove safety for the whole rest of the program.
      Subsequently, programs with even less than half the insn size
      limit can get rejected. We noticed that while some programs
      load fine under pre 4.11, they get rejected due to hitting
      limits on more recent kernels. We saw that in the vast majority
      of cases (90+%) pruning failed due to register mismatches. In
      case of stack mismatches, majority of cases failed due to
      different stack slot types (invalid, spill, misc) rather than
      differences in spilled registers.
      
      This patch makes pruning more aggressive by also adding markers
      that sit at conditional jumps as well. Currently, we only mark
      jump targets for pruning. For example in direct packet access,
      these are usually error paths where we bail out. We found that
      adding these markers can reduce the number of processed insns
      by up to 30%. Another option is to ignore reg->id in probing
      PTR_TO_MAP_VALUE_OR_NULL registers, which can help pruning
      slightly as well by up to 7% observed complexity reduction as
      stand-alone. Meaning, if a previous path with register type
      PTR_TO_MAP_VALUE_OR_NULL for map X was found to be safe, then
      in the current state a PTR_TO_MAP_VALUE_OR_NULL register for
      the same map X must be safe as well. Last but not least the
      patch also adds a scheduling point and bumps the current limit
      for instructions to be processed to a more adequate value.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3c2ce60b
  9. 12 May, 2017 4 commits
    • D
      bpf: Handle multiple variable additions into packet pointers in verifier. · 6832a333
      Committed by David S. Miller
      We must accumulate into reg->aux_off rather than use a plain assignment.
      
      Add a test for this situation to test_align.
      Reported-by: Alexei Starovoitov <ast@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6832a333
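A small sketch shows why accumulation matters once more than one variable is added to a packet pointer. The struct and helpers are hypothetical; they assume that adding a variable folds the constant offset seen so far into aux_off and restarts off at 0.

```c
#include <assert.h>

/* Hypothetical, reduced packet pointer state. */
struct toy_pkt_ptr {
	int off;	/* constant offset since the last variable addition */
	int aux_off;	/* constant offsets folded in at variable additions */
};

/* Adding a known constant only moves the constant offset. */
static void toy_add_const(struct toy_pkt_ptr *reg, int k)
{
	assert(reg);
	reg->off += k;
}

/* Adding a variable of unknown value: fold the current constant offset
 * into aux_off. Using '=' here would drop what earlier variable
 * additions had already folded in. */
static void toy_add_variable(struct toy_pkt_ptr *reg)
{
	assert(reg);
	reg->aux_off += reg->off;
	reg->off = 0;
}
```

For the sequence "add 14, add variable, add 4, add variable" the accumulated aux_off is 18, whereas a plain assignment would leave only 4.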
    • D
      bpf: Add strict alignment flag for BPF_PROG_LOAD. · e07b98d9
      Committed by David S. Miller
      Add a new field, "prog_flags", and an initial flag value
      BPF_F_STRICT_ALIGNMENT.
      
      When set, the verifier will enforce strict pointer alignment
      regardless of the setting of CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS.
      
      The verifier, in this mode, will also use a fixed value of "2" in
      place of NET_IP_ALIGN.
      
      This facilitates test cases that will exercise and validate this part
      of the verifier even when run on architectures where alignment doesn't
      matter.
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      e07b98d9
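The effect of the flag can be summarized as a small decision function. This is a sketch under assumptions: TOY_F_STRICT_ALIGNMENT stands in for BPF_F_STRICT_ALIGNMENT, and the struct and function names are invented; only the two rules (strict mode is forced on, and "2" replaces NET_IP_ALIGN) come from the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_F_STRICT_ALIGNMENT (1U << 0)	/* stands in for BPF_F_STRICT_ALIGNMENT */

/* Hypothetical result of combining load flags with platform properties. */
struct toy_align_policy {
	bool strict;	/* enforce pointer alignment checks */
	int ip_align;	/* value used in place of NET_IP_ALIGN */
};

static struct toy_align_policy toy_pick_align_policy(uint32_t prog_flags,
						     bool efficient_unaligned,
						     int net_ip_align)
{
	struct toy_align_policy p;
	bool forced = (prog_flags & TOY_F_STRICT_ALIGNMENT) != 0;

	assert(net_ip_align >= 0);
	/* The flag wins over the platform's unaligned-access capability. */
	p.strict = forced || !efficient_unaligned;
	/* In forced strict mode a fixed "2" is used. */
	p.ip_align = forced ? 2 : net_ip_align;
	return p;
}
```

This is what lets an alignment test behave the same way on a platform where unaligned access is cheap and NET_IP_ALIGN is 0 as on a strict-alignment one.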
    • D
      bpf: Do per-instruction state dumping in verifier when log_level > 1. · c5fc9692
      Committed by David S. Miller
      If log_level > 1, do a state dump every instruction and emit it in
      a more compact way (without a leading newline).
      
      This will facilitate more sophisticated test cases which inspect the
      verifier log for register state.
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      c5fc9692
    • D
      bpf: Track alignment of register values in the verifier. · d1174416
      Committed by David S. Miller
      Currently if we add only constant values to pointers we can fully
      validate the alignment, and properly check if we need to reject the
      program on !CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS architectures.
      
      However, once an unknown value is introduced we only allow byte sized
      memory accesses which is too restrictive.
      
      Add logic to track the known minimum alignment of register values,
      and propagate this state into registers containing pointers.
      
      The most common paradigm that makes use of this new logic is computing
      the transport header using the IP header length field.  For example:
      
      	struct ethhdr *ep = skb->data;
      	struct iphdr *iph = (struct iphdr *) (ep + 1);
      	struct tcphdr *th;
       ...
      	n = iph->ihl;
      	th = ((void *)iph + (n * 4));
      	port = th->dest;
      
      The existing code will reject the load of th->dest because it cannot
      validate that the alignment is at least 2 once "n * 4" is added to
      the packet pointer.
      
      In the new code, the register holding "n * 4" will have a reg->min_align
      value of 4, because any value multiplied by 4 will be at least 4 byte
      aligned.  (actually, the eBPF code emitted by the compiler in this case
      is most likely to use a shift left by 2, but the end result is identical)
      
      At the critical addition:
      
      	th = ((void *)iph + (n * 4));
      
      The register holding 'th' will start with reg->off value of 14.  The
      pointer addition will transform that reg into something that looks like:
      
      	reg->aux_off = 14
      	reg->aux_off_align = 4
      
      Next, the verifier will look at the th->dest load, and it will see
      a load offset of 2, and first check:
      
      	if (reg->aux_off_align % size)
      
      which will pass because aux_off_align is 4.  reg_off will be computed:
      
      	reg_off = reg->off;
       ...
      		reg_off += reg->aux_off;
      
      plus we have off==2, and it will thus check:
      
      	if ((NET_IP_ALIGN + reg_off + off) % size != 0)
      
      which evaluates to:
      
      	if ((NET_IP_ALIGN + 14 + 2) % size != 0)
      
      On strict alignment architectures, NET_IP_ALIGN is 2, thus:
      
      	if ((2 + 14 + 2) % size != 0)
      
      which passes.
      
      These pointer transformations and checks work regardless of whether
      the constant offset or the variable with known alignment is added
      first to the pointer register.
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      d1174416
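The worked example above (a 2-byte load of th->dest at offset 2) can be replayed with a small sketch. The struct and function names are hypothetical, and adding aux_off unconditionally is a simplification of the elided code; the two modulo checks follow the expressions quoted in the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, reduced pointer state after "iph + n * 4". */
struct toy_ptr {
	int off;		/* constant offset still held in reg->off */
	int aux_off;		/* constant offset folded in at the variable add */
	uint32_t aux_off_align;	/* known minimum alignment of the variable part */
};

/* Minimum alignment of a value after a left shift: an unknown value
 * (min_align 1) shifted left by 2 is at least 4-byte aligned. */
static uint32_t toy_min_align_after_shl(uint32_t min_align, unsigned int shift)
{
	assert(shift < 32 && min_align <= (UINT32_MAX >> shift));
	return min_align << shift;
}

/* The two checks walked through above, in simplified form. */
static bool toy_check_pkt_align(const struct toy_ptr *reg, int ip_align,
				int off, int size)
{
	int reg_off;

	assert(size > 0);
	if (reg->aux_off_align % (uint32_t)size)
		return false;	/* variable part not aligned enough */
	reg_off = reg->off + reg->aux_off;
	return (ip_align + reg_off + off) % size == 0;
}
```

For aux_off = 14, aux_off_align = 4, off = 2 and size = 2 both checks pass, since 4 % 2 == 0 and (2 + 14 + 2) % 2 == 0; with an untracked variable (alignment 1) the first check already rejects the load.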
  10. 09 May, 2017 2 commits
  11. 01 May, 2017 1 commit
  12. 29 Apr, 2017 1 commit
  13. 27 Apr, 2017 1 commit
  14. 25 Apr, 2017 2 commits
  15. 18 Apr, 2017 3 commits
    • D
      bpf: fix checking xdp_adjust_head on tail calls · c2002f98
      Committed by Daniel Borkmann
      Commit 17bedab2 ("bpf: xdp: Allow head adjustment in XDP prog")
      added the xdp_adjust_head bit to the BPF prog in order to tell drivers
      that the program that is to be attached requires support for the XDP
      bpf_xdp_adjust_head() helper such that drivers not supporting this
      helper can reject the program. There are also drivers that do support
      the helper, but need to check for xdp_adjust_head bit in order to move
      packet metadata prepended by the firmware away for making headroom.
      
      For these cases, the current check for xdp_adjust_head bit is insufficient
      since there can be cases where the program itself does not use the
      bpf_xdp_adjust_head() helper, but tail calls into another program that
      uses bpf_xdp_adjust_head(). As such, the xdp_adjust_head bit is still
      set to 0. Since the first program has no control over which program it
      calls into, we need to assume that bpf_xdp_adjust_head() helper is used
      upon tail calls. Thus, for the very same reasons in cb_access, set the
      xdp_adjust_head bit to 1 when the main program uses tail calls.
      
      Fixes: 17bedab2 ("bpf: xdp: Allow head adjustment in XDP prog")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Cc: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c2002f98
    • D
      bpf: fix cb access in socket filter programs on tail calls · 6b1bb01b
      Committed by Daniel Borkmann
      Commit ff936a04 ("bpf: fix cb access in socket filter programs")
      added a fix for socket filter programs such that in i) AF_PACKET the
      20 bytes of skb->cb[] area gets zeroed before use in order to not leak
      data, and ii) socket filter programs attached to TCP/UDP sockets need
      to save/restore these 20 bytes since they are also used by protocol
      layers at that time.
      
      The problem is that bpf_prog_run_save_cb() and bpf_prog_run_clear_cb()
      only look at the actual attached program to determine whether to zero
      or save/restore the skb->cb[] parts. There can be cases where the
      actual attached program does not access the skb->cb[], but the program
      tail calls into another program which does access this area. In such
      a case, the zero or save/restore is currently not performed.
      
      Since the programs we tail call into are unknown at verification time
      and can dynamically change, we need to assume that whenever the attached
      program performs a tail call, that later programs could access the
      skb->cb[], and therefore we need to always set cb_access to 1.
      
      Fixes: ff936a04 ("bpf: fix cb access in socket filter programs")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6b1bb01b
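Both tail-call fixes in this group follow the same conservative rule, which can be sketched as below. The enum, struct, and function names are hypothetical; the sketch only encodes the reasoning that a program performing a tail call must be treated as if it used everything its unknown callees might use.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-program capability bits. */
struct toy_prog_bits {
	bool cb_access;		/* program (or a callee) may touch skb->cb[] */
	bool xdp_adjust_head;	/* program (or a callee) may adjust the XDP head */
};

/* Hypothetical helper identifiers seen while checking a program. */
enum toy_helper {
	TOY_HELPER_OTHER,
	TOY_HELPER_TAIL_CALL,
	TOY_HELPER_XDP_ADJUST_HEAD,
};

/* Record what a helper call implies for the program's bits. */
static void toy_note_helper_call(struct toy_prog_bits *bits, enum toy_helper h)
{
	assert(bits);
	switch (h) {
	case TOY_HELPER_XDP_ADJUST_HEAD:
		bits->xdp_adjust_head = true;
		break;
	case TOY_HELPER_TAIL_CALL:
		/* Callees are unknown and can change at run time,
		 * so assume the worst for both bits. */
		bits->cb_access = true;
		bits->xdp_adjust_head = true;
		break;
	case TOY_HELPER_OTHER:
		break;
	}
}
```

The cost of this choice is some unnecessary work (zeroing or saving skb->cb[], making headroom) for programs whose callees never need it, in exchange for never missing a case.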
    • M
      bpf: lru: Lower the PERCPU_NR_SCANS from 16 to 4 · 695ba265
      Committed by Martin KaFai Lau
      After doing map_perf_test with a much bigger
      BPF_F_NO_COMMON_LRU map, the perf report shows a
      lot of time spent in rotating the inactive list (i.e.
      __bpf_lru_list_rotate_inactive):
      > map_perf_test 32 8 10000 1000000 | awk '{sum += $3}END{print sum}'
      19644783 (19M/s)
      > map_perf_test 32 8 10000000 10000000 |  awk '{sum += $3}END{print sum}'
      6283930 (6.28M/s)
      
      By inactive, it usually means the element is not in cache.  Hence,
      there is a need to tune the PERCPU_NR_SCANS value.
      
      This patch finds a better number of elements to
      scan during each list rotation.  The PERCPU_NR_SCANS (which
      is defined the same as PERCPU_FREE_TARGET) decreases
      from 16 elements to 4 elements.  This change only
      affects the BPF_F_NO_COMMON_LRU map.
      
      The test_lru_dist does not show meaningful difference
      between 16 and 4.  Our production L4 load balancer which uses
      the LRU map for conntrack-ing also shows little change in cache
      hit rate.  Since both benchmark and production data show no
      cache-hit difference, PERCPU_NR_SCANS is lowered from 16 to 4.
      We can consider making it configurable if we find a usecase
      later that shows another value works better and/or use
      a different rotation strategy.
      
      After this change:
      > map_perf_test 32 8 10000000 10000000 |  awk '{sum += $3}END{print sum}'
      9240324 (9.2M/s)
      
      i.e. 6.28M/s -> 9.2M/s
      
      The test_lru_dist has not shown meaningful difference:
      > test_lru_dist zipf.100k.a1_01.out 4000 1:
      nr_misses: 31575 (Before) vs 31566 (After)
      
      > test_lru_dist zipf.100k.a0_01.out 40000 1
      nr_misses: 67036 (Before) vs 67031 (After)
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      695ba265
  16. 12 Apr, 2017 3 commits