  1. 09 Jul 2015, 1 commit
  2. 25 Jun 2015, 1 commit
  3. 22 Jun 2015, 1 commit
  4. 19 Jun 2015, 1 commit
  5. 07 Jun 2015, 1 commit
    • bpf: make programs see skb->data == L2 for ingress and egress · 3431205e
      Authored by Alexei Starovoitov
      eBPF programs attached to ingress and egress qdiscs see inconsistent skb->data:
      for ingress the L2 header is already pulled, whereas for egress it's present.
      This is known to program writers, who are currently forced to use the
      BPF_LL_OFF workaround.
      Since programs don't change the skb's internal pointers, it is safe to do the
      pull/push right around the invocation of the program; earlier taps and
      later pt->func() will not be affected.
      Multiple taps via packet_rcv() and tpacket_rcv() already do the same trick
      around run_filter/BPF_PROG_RUN, even if the skb is shared.
      
      This fix finally allows programs to use optimized LD_ABS/IND instructions
      without BPF_LL_OFF for higher performance.
      tc ingress + cls_bpf + samples/bpf/tcbpf1_kern.o
              w/o JIT   w/ JIT
      before   20.5      23.6 Mpps
      after    21.8      26.6 Mpps
      
      Old programs with BPF_LL_OFF will still work as-is.
      
      We can now undo most of the earlier workaround commit:
      a166151c ("bpf: fix bpf helpers to use skb->mac_header relative offsets")
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
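      A minimal sketch of the pull/push around program invocation described
      above (approximate, in the style of the cls_bpf classify path; not the
      verbatim kernel code):

        /* Give the program a consistent skb->data == L2 view. */
        static unsigned int run_prog_at_l2(const struct bpf_prog *prog,
                                           struct sk_buff *skb, bool at_ingress)
        {
                unsigned int res;

                if (at_ingress) {
                        /* Ingress: eth_type_trans() already pulled the L2
                         * header; push it back for the program's duration. */
                        __skb_push(skb, skb->mac_len);
                        res = BPF_PROG_RUN(prog, skb);
                        __skb_pull(skb, skb->mac_len);
                } else {
                        /* Egress: skb->data already points at L2. */
                        res = BPF_PROG_RUN(prog, skb);
                }
                return res;
        }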
  6. 05 Jun 2015, 2 commits
    • net: Add full IPv6 addresses to flow_keys · c3f83241
      Authored by Tom Herbert
      This patch adds full IPv6 addresses to flow_keys and uses them as
      input to the flow hash function. The implementation supports either
      IPv4 or IPv6 addresses in a union, and a selector is used to determine
      how many words to input to jhash2.

      We also add flow_get_u32_dst and flow_get_u32_src functions, which are
      used to get a u32 representation of the source and destination
      addresses. For IPv6, ipv6_addr_hash is called. These functions preserve
      the legacy u32 values of src and dst in flow_keys.
      
      With this patch, Ethertype and IP protocol are now included in the
      flow hash input.
      Signed-off-by: Tom Herbert <tom@herbertland.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
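      A rough sketch of the union-plus-selector idea (field and constant
      names here are illustrative, not necessarily the exact ones used in
      the flow dissector):

        struct flow_keys_addrs {
                union {
                        struct {
                                __be32 src, dst;
                        } v4addrs;
                        struct {
                                struct in6_addr src, dst;
                        } v6addrs;
                };
                u8 addr_type;   /* selects which union member is valid */
        };

        /* Hash exactly as many 32-bit words as the selected family fills. */
        static u32 hash_addrs(const struct flow_keys_addrs *a, u32 hashrnd)
        {
                size_t len = (a->addr_type == AF_INET6) ?
                             sizeof(a->v6addrs) : sizeof(a->v4addrs);

                return jhash2((const u32 *)a, len / sizeof(u32), hashrnd);
        }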
    • net: Get skb hash over flow_keys structure · 42aecaa9
      Authored by Tom Herbert
      This patch changes flow hashing to use jhash2 over the flow_keys
      structure instead of just doing jhash_3words over src, dst, and ports.
      This method will allow us to take more input into the hashing function,
      so that we can include full IPv6 addresses, VLAN, flow labels, etc.
      without needing to resort to xor'ing, which makes for a poor hash.
      Acked-by: Jiri Pirko <jiri@resnulli.us>
      Signed-off-by: Tom Herbert <tom@herbertland.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
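      The change boils down to hashing the whole (suitably padded) structure
      rather than three hand-picked words, roughly:

        /* Before: only src, dst and ports feed the hash. */
        hash = jhash_3words((__force u32)keys.src,
                            (__force u32)keys.dst,
                            (__force u32)keys.ports, hashrnd);

        /* After: the whole flow_keys structure feeds jhash2, so new fields
         * (full IPv6 addresses, VLAN, flow labels, ...) are picked up for
         * free; flow_keys must be padded to a multiple of 4 bytes. */
        hash = jhash2((const u32 *)&keys, sizeof(keys) / sizeof(u32), hashrnd);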
  7. 28 May 2015, 1 commit
    • net_sched: invoke ->attach() after setting dev->qdisc · 86e363dc
      Authored by WANG Cong
      For the mq qdisc, we add the per-tx-queue qdiscs to the root qdisc
      for display purposes. However, that happens too early, before the
      new dev->qdisc is finally set; this causes q->list to point to an
      old root qdisc which is going to be freed right before being
      replaced with a new one.
      
      Fix this by moving ->attach() after setting dev->qdisc.
      
      For the record, this fixes the following crash:
      
       ------------[ cut here ]------------
       WARNING: CPU: 1 PID: 975 at lib/list_debug.c:59 __list_del_entry+0x5a/0x98()
       list_del corruption. prev->next should be ffff8800d1998ae8, but was 6b6b6b6b6b6b6b6b
       CPU: 1 PID: 975 Comm: tc Not tainted 4.1.0-rc4+ #1019
       Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
        0000000000000009 ffff8800d73fb928 ffffffff81a44e7f 0000000047574756
        ffff8800d73fb978 ffff8800d73fb968 ffffffff810790da ffff8800cfc4cd20
        ffffffff814e725b ffff8800d1998ae8 ffffffff82381250 0000000000000000
       Call Trace:
        [<ffffffff81a44e7f>] dump_stack+0x4c/0x65
        [<ffffffff810790da>] warn_slowpath_common+0x9c/0xb6
        [<ffffffff814e725b>] ? __list_del_entry+0x5a/0x98
        [<ffffffff81079162>] warn_slowpath_fmt+0x46/0x48
        [<ffffffff81820eb0>] ? dev_graft_qdisc+0x5e/0x6a
        [<ffffffff814e725b>] __list_del_entry+0x5a/0x98
        [<ffffffff814e72a7>] list_del+0xe/0x2d
        [<ffffffff81822f05>] qdisc_list_del+0x1e/0x20
        [<ffffffff81820cd1>] qdisc_destroy+0x30/0xd6
        [<ffffffff81822676>] qdisc_graft+0x11d/0x243
        [<ffffffff818233c1>] tc_get_qdisc+0x1a6/0x1d4
        [<ffffffff810b5eaf>] ? mark_lock+0x2e/0x226
        [<ffffffff817ff8f5>] rtnetlink_rcv_msg+0x181/0x194
        [<ffffffff817ff72e>] ? rtnl_lock+0x17/0x19
        [<ffffffff817ff72e>] ? rtnl_lock+0x17/0x19
        [<ffffffff817ff774>] ? __rtnl_unlock+0x17/0x17
        [<ffffffff81855dc6>] netlink_rcv_skb+0x4d/0x93
        [<ffffffff817ff756>] rtnetlink_rcv+0x26/0x2d
        [<ffffffff818544b2>] netlink_unicast+0xcb/0x150
        [<ffffffff81161db9>] ? might_fault+0x59/0xa9
        [<ffffffff81854f78>] netlink_sendmsg+0x4fa/0x51c
        [<ffffffff817d6e09>] sock_sendmsg_nosec+0x12/0x1d
        [<ffffffff817d8967>] sock_sendmsg+0x29/0x2e
        [<ffffffff817d8cf3>] ___sys_sendmsg+0x1b4/0x23a
        [<ffffffff8100a1b8>] ? native_sched_clock+0x35/0x37
        [<ffffffff810a1d83>] ? sched_clock_local+0x12/0x72
        [<ffffffff810a1fd4>] ? sched_clock_cpu+0x9e/0xb7
        [<ffffffff810def2a>] ? current_kernel_time+0xe/0x32
        [<ffffffff810b4bc5>] ? lock_release_holdtime.part.29+0x71/0x7f
        [<ffffffff810ddebf>] ? read_seqcount_begin.constprop.27+0x5f/0x76
        [<ffffffff810b6292>] ? trace_hardirqs_on_caller+0x17d/0x199
        [<ffffffff811b14d5>] ? __fget_light+0x50/0x78
        [<ffffffff817d9808>] __sys_sendmsg+0x42/0x60
        [<ffffffff817d9838>] SyS_sendmsg+0x12/0x1c
        [<ffffffff81a50e97>] system_call_fastpath+0x12/0x6f
       ---[ end trace ef29d3fb28e97ae7 ]---
      
      In the long term, we probably need to clean up the qdisc_graft() code,
      in case it hides other bugs like this.
      
      Fixes: 95dc1929 ("pkt_sched: give visibility to mq slave qdiscs")
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
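      The shape of the fix, approximately (simplified; not the verbatim
      kernel code):

        /* Publish the new root qdisc first ... */
        dev->qdisc = qdisc;

        /* ... then let mq graft the per-txq qdiscs, so their q->list
         * entries point at a root that is actually going to stay. */
        if (qdisc != &noop_qdisc && qdisc->ops->attach)
                qdisc->ops->attach(qdisc);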
  8. 22 May 2015, 1 commit
    • net: sched: fix call_rcu() race on classifier module unloads · c78e1746
      Authored by Daniel Borkmann
      Vijay reported that a loop as simple as ...
      
        while true; do
          tc qdisc add dev foo root handle 1: prio
          tc filter add dev foo parent 1: u32 match u32 0 0  flowid 1
          tc qdisc del dev foo root
          rmmod cls_u32
        done
      
      ... will panic the kernel. Moreover, he bisected the change
      apparently introducing it to commit 78fd1d0a ("netlink: Re-add
      locking to netlink_lookup() and seq walker").

      The removal of synchronize_net() from the netlink socket path that
      triggers the qdisc removal seems to have uncovered an RCU, or
      rather module reference count, race in the tc API. Given that the
      RCU conversion was done after commit e341694e ("netlink: Convert
      netlink_lookup() to use RCU protected hash table"), which
      originally added the synchronize_net(), the occasion of hitting
      the bug was less likely (though not impossible):

      When qdiscs that i) support attaching classifiers and ii) have at
      least one of them attached get deleted, they invoke
      tcf_destroy_chain() and thus call into the ->destroy() handler of
      a classifier module.

      After the RCU conversion, all classifiers that have an internal
      prio list unlink it and initiate freeing via call_rcu() deferral.

      Meanwhile, tcf_destroy() already releases the reference to the
      tp->ops->owner module before the queued RCU callback handler has
      been invoked.

      A subsequent rmmod of the classifier module is then not prevented,
      since all module references have already been dropped.

      By the time the kernel invokes the RCU callback handler from the
      module, that function address is invalid.
      
      One way to fix it would be to add an rcu_barrier() to
      unregister_tcf_proto_ops() to wait for all pending call_rcu()s
      to complete.
      
      synchronize_rcu() is not appropriate, as under heavy RCU callback
      load, registered call_rcu()s could be deferred for longer than a
      grace period. In case we don't have any pending call_rcu()s, the
      barrier is allowed to return immediately.

      Since we came here via unregister_tcf_proto_ops(), there are no
      users of a given classifier anymore. Further nested call_rcu()s
      pointing into the module space are not being done anywhere.

      Only cls_bpf_delete_prog() may schedule a work item (to eventually
      unlock pages), but that is no longer in the range/context of
      cls_bpf.
      
      Fixes: 25d8c0d5 ("net: rcu-ify tcf_proto")
      Fixes: 9888faef ("net: sched: cls_basic use RCU")
      Reported-by: Vijay Subramanian <subramanian.vijay@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Cc: John Fastabend <john.r.fastabend@intel.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Thomas Graf <tgraf@suug.ch>
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Cc: Alexei Starovoitov <ast@plumgrid.com>
      Tested-by: Vijay Subramanian <subramanian.vijay@gmail.com>
      Acked-by: Alexei Starovoitov <ast@plumgrid.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
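      The fix, in essence (sketch):

        void unregister_tcf_proto_ops(struct tcf_proto_ops *ops)
        {
                /* ... unlink ops from the classifier ops list ... */

                /* Wait for any call_rcu() callbacks still pointing into
                 * this module before rmmod can pull the code out from
                 * under them; returns at once if none are pending. */
                rcu_barrier();
        }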
  9. 15 May 2015, 1 commit
  10. 14 May 2015, 6 commits
  11. 13 May 2015, 1 commit
    • net_sched: gred: add TCA_GRED_LIMIT attribute · a3eb95f8
      Authored by David Ward
      In a GRED qdisc, if the default "virtual queue" (VQ) does not have drop
      parameters configured, then packets for the default VQ are not subjected
      to RED and are only dropped if the queue is larger than the net_device's
      tx_queue_len. This behavior is useful for WRED mode, since these packets
      will still influence the calculated average queue length and (therefore)
      the drop probability for all of the other VQs. However, for some drivers
      tx_queue_len is zero. In other cases the user may wish to make the limit
      the same for all VQs (including the default VQ with no drop parameters).
      
      This change adds a TCA_GRED_LIMIT attribute to set the GRED queue limit,
      in bytes, during qdisc setup. (This limit is in bytes to be consistent
      with the drop parameters.) The default limit is the same as for a bfifo
      queue (tx_queue_len * psched_mtu). If the drop parameters of any VQ are
      configured with a smaller limit than the GRED queue limit, that VQ will
      still observe the smaller limit instead.
      Signed-off-by: David Ward <david.ward@ll.mit.edu>
      Signed-off-by: David S. Miller <davem@davemloft.net>
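      A sketch of how such a setup-time attribute is typically handled
      (illustrative; not the exact gred_change() code):

        if (tb[TCA_GRED_LIMIT])
                sch->limit = nla_get_u32(tb[TCA_GRED_LIMIT]);
        else
                /* default: same as a bfifo queue */
                sch->limit = qdisc_dev(sch)->tx_queue_len *
                             psched_mtu(qdisc_dev(sch));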
  12. 12 May 2015, 2 commits
  13. 11 May 2015, 2 commits
    • net: sched: further simplify handle_ing · d2788d34
      Authored by Daniel Borkmann
      The ingress qdisc has no purpose other than calling into
      tc_classify(), which executes the attached classifier(s) and action(s).

      It has a 1:1 relationship to dev->ingress_queue. After commit
      087c1a60 ("net: sched: run ingress qdisc without locks") removed
      the central ingress lock, one major contention point is gone.

      The extra indirection layers, however, are not necessary for calling
      into the ingress qdisc. With pktgen calling locally into
      netif_receive_skb() with a dummy u32, the single-CPU result on a
      Supermicro X10SLM-F, Xeon E3-1240, is: before ~21.1 Mpps, after the
      patch ~22.9 Mpps.
      
      We can redirect the private classifier list to the netdev directly,
      without changing any classifier API bits (!), and execute on it from
      the handle_ing() side. The __QDISC_STATE_DEACTIVATE test can be
      removed; the ingress qdisc doesn't have a queue, so
      dev_deactivate_queue() is likewise not applicable, and
      ingress_cl_list provides similar behaviour. In other words, the
      ingress qdisc acts like a TCQ_F_BUILTIN qdisc.
      
      A possible next step is to remove the dev's ingress (dummy)
      netdev_queue and to only have the list member in the netdevice
      itself.
      
      Note, the filter chain is RCU-protected and individual filter
      elements are kfree'd by the sched subsystem after an RCU grace
      period. The RCU read lock is held by __netif_receive_skb_core().
      
      Joint work with Alexei Starovoitov.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
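      Roughly, the ingress fast path then looks like this (sketch;
      identifiers approximate):

        /* Called from __netif_receive_skb_core(), which already holds
         * the RCU read lock. */
        static struct sk_buff *handle_ing_sketch(struct sk_buff *skb)
        {
                struct tcf_proto *cl = rcu_dereference(skb->dev->ingress_cl_list);
                struct tcf_result res;

                if (!cl)                /* nothing attached: near-zero cost */
                        return skb;

                switch (tc_classify(skb, cl, &res)) {
                case TC_ACT_SHOT:
                        kfree_skb(skb);
                        return NULL;    /* packet dropped */
                default:
                        return skb;     /* packet continues */
                }
        }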
    • codel: add ce_threshold attribute · 80ba92fa
      Authored by Eric Dumazet
      For DCTCP or similar ECN-based deployments on fabrics with shallow
      buffers, hosts are responsible for a good part of the buffering.

      This patch adds an optional ce_threshold to the codel & fq_codel
      qdiscs, so that DCTCP can get feedback from queuing in the host.

      A DCTCP-enabled egress port simply has a queue occupancy threshold
      above which ECT packets get CE-marked.

      In CoDel language this translates to a sojourn time, so that one
      doesn't have to worry about bytes or bandwidth, only delays.

      This makes the host an active participant in the health of the whole
      network.

      This also helps with experimenting with DCTCP in a setup without a
      DCTCP-compliant fabric.

      In the following example, ce_threshold is set to 1 ms, and we can see
      from 'ldelay xxx us' that TCP is not trying to go around the 5 ms
      CoDel target.

      The queue has more capacity to absorb inelastic bursts (say, from UDP
      traffic), as queues are maintained at an optimal level.
      
      lpaa23:~# ./tc -s -d qd sh dev eth1
      qdisc mq 1: dev eth1 root
       Sent 87910654696 bytes 58065331 pkt (dropped 0, overlimits 0 requeues 42961)
       backlog 3108242b 364p requeues 42961
      qdisc codel 8063: dev eth1 parent 1:1 limit 1000p target 5.0ms ce_threshold 1.0ms interval 100.0ms
       Sent 7363778701 bytes 4863809 pkt (dropped 0, overlimits 0 requeues 5503)
       rate 2348Mbit 193919pps backlog 255866b 46p requeues 5503
        count 0 lastcount 0 ldelay 1.0ms drop_next 0us
        maxpacket 68130 ecn_mark 0 drop_overlimit 0 ce_mark 72384
      qdisc codel 8064: dev eth1 parent 1:2 limit 1000p target 5.0ms ce_threshold 1.0ms interval 100.0ms
       Sent 7636486190 bytes 5043942 pkt (dropped 0, overlimits 0 requeues 5186)
       rate 2319Mbit 191538pps backlog 207418b 64p requeues 5186
        count 0 lastcount 0 ldelay 694us drop_next 0us
        maxpacket 68130 ecn_mark 0 drop_overlimit 0 ce_mark 69873
      qdisc codel 8065: dev eth1 parent 1:3 limit 1000p target 5.0ms ce_threshold 1.0ms interval 100.0ms
       Sent 11569360142 bytes 7641602 pkt (dropped 0, overlimits 0 requeues 5554)
       rate 3041Mbit 251096pps backlog 210446b 59p requeues 5554
        count 0 lastcount 0 ldelay 889us drop_next 0us
        maxpacket 68130 ecn_mark 0 drop_overlimit 0 ce_mark 37780
      ...
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Florian Westphal <fw@strlen.de>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Glenn Judd <glenn.judd@morganstanley.com>
      Cc: Nandita Dukkipati <nanditad@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
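      In CoDel terms, the check is essentially a one-liner at dequeue time
      (approximate):

        /* Sojourn time above ce_threshold: mark ECT packets CE. */
        if (skb && codel_time_after(vars->ldelay, params->ce_threshold) &&
            INET_ECN_set_ce(skb))
                stats->ce_mark++;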
  14. 10 May 2015, 1 commit
  15. 04 May 2015, 7 commits
  16. 03 May 2015, 1 commit
  17. 30 Apr 2015, 1 commit
  18. 22 Apr 2015, 1 commit
  19. 18 Apr 2015, 1 commit
  20. 17 Apr 2015, 1 commit
    • bpf: fix bpf helpers to use skb->mac_header relative offsets · a166151c
      Authored by Alexei Starovoitov
      As a short-term solution, let's fix the bpf helper functions to use
      skb->mac_header-relative offsets instead of skb->data, in order to
      make the same eBPF programs with cls_bpf and act_bpf work on both
      the ingress and egress qdisc paths. We need to ensure that
      mac_header is set before calling into programs. This is effectively
      the first option from the discussion referenced below.

      A more long-term solution for the LD_ABS|LD_IND instructions will
      be more intrusive, but also more beneficial than this; it will be
      implemented later, as it's too risky at this point in time.

      That is, we plan to look into the option of moving skb_pull() out
      of eth_type_trans() and into netif_receive_skb(), as has been
      suggested as the second option. Meanwhile, this solution ensures
      that ingress can be used with eBPF, too, and that we won't run into
      ABI troubles later. For dealing with negative offsets inside eBPF
      helper functions, we've implemented bpf_skb_clone_unwritable() to
      test for unwritable headers.
      
      Reference: http://thread.gmane.org/gmane.linux.network/359129/focus=359694
      Fixes: 608cd71a ("tc: bpf: generalize pedit action")
      Fixes: 91bc4822 ("tc: bpf: add checksum helpers")
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
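      Conceptually, each helper now resolves its offset like this (sketch;
      not the exact helper bodies):

        /* Offsets from the program are relative to the MAC header, which
         * is set on both ingress and egress, rather than to skb->data,
         * which differs between the two paths. */
        static void *helper_ptr(const struct sk_buff *skb, unsigned int off)
        {
                return skb_mac_header(skb) + off;
        }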
  21. 14 Apr 2015, 1 commit
    • net: use jump label patching for ingress qdisc in __netif_receive_skb_core · 4577139b
      Authored by Daniel Borkmann
      Even if we only make use of classifiers and actions from the egress
      path, we still go into handle_ing(), executing additional code at a
      per-packet cost for the ingress qdisc, just to realize that nothing
      is attached on ingress.

      Instead, this can be blinded out entirely as a no-op with the use of
      a static key. On the input fast path, we already make use of static
      keys in various places, e.g. skb time stamping, in RPS, etc. It
      makes sense to not waste time when we're assured that no ingress
      qdisc is attached anywhere.

      Enabling/disabling of that code path is done via two helpers,
      namely net_{inc,dec}_ingress_queue(), which are invoked under the
      RTNL mutex when an ingress qdisc is either initialized or destroyed.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
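      A condensed sketch of the mechanism (close to, but not verbatim, the
      actual patch):

        static struct static_key ingress_needed __read_mostly;

        void net_inc_ingress_queue(void)
        {
                static_key_slow_inc(&ingress_needed);  /* under RTNL */
        }

        void net_dec_ingress_queue(void)
        {
                static_key_slow_dec(&ingress_needed);  /* under RTNL */
        }

        /* In __netif_receive_skb_core(): patched to a no-op while the
         * key is disabled, i.e. while no ingress qdisc exists anywhere. */
        if (static_key_false(&ingress_needed))
                skb = handle_ing(skb, &pt_prev, &ret, orig_dev);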
  22. 08 Apr 2015, 1 commit
    • netem: Fixes byte backlog accounting for the first of two chained netem instances · 0ad2a836
      Authored by Beshay, Joseph
      This fixes the byte backlog accounting for the first of two chained
      netem instances; the byte backlog reported now corresponds to the
      number of queued packets.

      When two netem instances are chained, for instance to apply rate and
      queue limitation followed by packet delay, the number of backlogged
      bytes reported by the first netem instance is wrong. It reports the
      sum of the bytes in the queues of the first and second netem. The
      first netem reports the correct number of backlogged packets, but
      not bytes. This is shown in the example below.
      
      Consider a chain of two netem schedulers created using the following commands:
      
      $ tc -s qdisc replace dev veth2 root handle 1:0 netem rate 10000kbit limit 100
      $ tc -s qdisc add dev veth2 parent 1:0 handle 2: netem delay 50ms
      
      Start an iperf session to send packets out on the specified interface and
      monitor the backlog using tc:
      
      $ tc -s qdisc show dev veth2
      
      Output using unpatched netem:
      	qdisc netem 1: root refcnt 2 limit 100 rate 10000Kbit
      	 Sent 98422639 bytes 65434 pkt (dropped 123, overlimits 0 requeues 0)
      	 backlog 172694b 73p requeues 0
      	qdisc netem 2: parent 1: limit 1000 delay 50.0ms
      	 Sent 98422639 bytes 65434 pkt (dropped 0, overlimits 0 requeues 0)
      	 backlog 63588b 42p requeues 0
      
      The interface used to produce this output has an MTU of 1500. The
      reported 172694b of backlogged bytes behind netem 1 is not correct.
      Consider the total number of sent bytes and packets: dividing the
      number of sent bytes by the number of sent packets gives an average
      packet size of ~1504. Dividing the number of backlogged bytes by
      packets, however, gives ~2365. This is because the first netem
      incorrectly counts the 63588b sitting in netem 2's queue as being in
      its own queue. To verify this, we subtract them from the reported
      value and divide by the number of packets:
      	172694 - 63588 = 109106 bytes actually backlogged in netem 1
      	109106 / 73 packets ~= 1494 bytes (which matches our MTU)
      
      The root cause is that the byte accounting is not done at the same
      time as the packet accounting. The solution is to update the backlog
      value every time the packet queue is updated.
      Signed-off-by: Joseph D Beshay <joseph.beshay@utdallas.edu>
      Acked-by: Hagen Paul Pfeifer <hagen@jauu.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
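      The essence of the fix is to adjust qstats.backlog at every point
      where the packet queue itself changes (sketch):

        /* enqueue */
        __skb_queue_tail(&sch->q, skb);
        sch->qstats.backlog += qdisc_pkt_len(skb);

        /* dequeue (and likewise on drop) */
        skb = __skb_dequeue(&sch->q);
        if (skb)
                sch->qstats.backlog -= qdisc_pkt_len(skb);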
  23. 02 Apr 2015, 1 commit
  24. 21 Mar 2015, 1 commit
    • act_bpf: add initial eBPF support for actions · a8cb5f55
      Authored by Daniel Borkmann
      This work extends the "classic" BPF programmable tc action by
      extending its scope to native eBPF code as well.

      Together with commit e2e9b654 ("cls_bpf: add initial eBPF support
      for programmable classifiers"), this adds the facility to implement
      fully flexible classifiers and actions for tc that can be written in
      a C subset in user space, "safely" loaded into the kernel, and run
      at native speed when JITed.

      Also, since eBPF maps can be shared between eBPF programs, it offers
      the possibility that cls_bpf and act_bpf can share data 1) between
      themselves and 2) with user space applications. That means that,
      e.g., customized runtime statistics can be collected in user space,
      but more importantly, classifier and action behaviour could be
      altered based on map input from a user space application.
      
      For the remaining details on the workflow and integration, see the
      cls_bpf commit e2e9b654. A preliminary iproute2 part can be found
      under [1].

        [1] http://git.breakpoint.cc/cgit/dborkman/iproute2.git/log/?h=ebpf-act
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Cc: Jiri Pirko <jiri@resnulli.us>
      Acked-by: Jiri Pirko <jiri@resnulli.us>
      Acked-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
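      A tiny restricted-C action of the kind this enables, assuming
      samples/bpf-style helper headers (a hypothetical program, shown only
      for flavour):

        #include <linux/bpf.h>
        #include "bpf_helpers.h"   /* SEC(), bpf_map_lookup_elem(), ... */

        struct bpf_map_def SEC("maps") pkt_cnt = {
                .type        = BPF_MAP_TYPE_ARRAY,
                .key_size    = sizeof(__u32),
                .value_size  = sizeof(long),
                .max_entries = 1,
        };

        SEC("action")
        int act_count(struct __sk_buff *skb)
        {
                __u32 key = 0;
                long *val = bpf_map_lookup_elem(&pkt_cnt, &key);

                if (val)        /* stats readable from user space */
                        __sync_fetch_and_add(val, 1);
                return -1;      /* -1: use the default action set via tc */
        }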
  25. 18 Mar 2015, 1 commit
    • act_bpf: allow non-default TC_ACT opcodes as BPF exec outcome · ced585c8
      Authored by Daniel Borkmann
      Revisiting commit d23b8ad8 ("tc: add BPF based action") with regard
      to eBPF support, I was thinking that it might be better to improve
      the return semantics of a BPF program invoked through BPF_PROG_RUN().

      Currently, in case filter_res is 0, we overwrite the default action
      opcode with TC_ACT_SHOT. A default action opcode configured through
      tc's m_bpf can be: TC_ACT_RECLASSIFY, TC_ACT_PIPE, TC_ACT_SHOT,
      TC_ACT_UNSPEC, or TC_ACT_OK.

      In cls_bpf, we have the possibility to overwrite the default class
      associated with the classifier in case filter_res is _not_
      0xffffffff (-1).

      That allows us to fold multiple [e]BPF programs into a single one,
      where they would otherwise need to be defined as separate
      classifiers with their own classids, needlessly redoing parsing
      work, etc.

      Similarly, we could do better in act_bpf: since the above TC_ACT*
      opcodes are exported to UAPI anyway, we reuse them for the
      return-code-to-tc-opcode mapping, allowing the above possibilities.
      Thus, as in cls_bpf, a filter_res of 0xffffffff (-1) means that the
      configured _default_ action is used. Any unknown return code from
      the BPF program would fail in tcf_bpf() with TC_ACT_UNSPEC.

      Should we one day want to make use of TC_ACT_STOLEN or
      TC_ACT_QUEUED, which both have the same semantics, we have the
      option of using either of them as a default action (filter_res of
      0xffffffff) or as a non-default BPF return code.
      
      All that will allow us to transparently use tcf_bpf() for both BPF
      flavours.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Jiri Pirko <jiri@resnulli.us>
      Cc: Alexei Starovoitov <ast@plumgrid.com>
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Acked-by: Jiri Pirko <jiri@resnulli.us>
      Signed-off-by: David S. Miller <davem@davemloft.net>
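      The resulting mapping in tcf_bpf() looks roughly like this (sketch;
      configured_default stands in for the action's stored default):

        filter_res = BPF_PROG_RUN(prog, skb);

        switch (filter_res) {
        case TC_ACT_PIPE:
        case TC_ACT_RECLASSIFY:
        case TC_ACT_OK:
        case TC_ACT_SHOT:
                action = filter_res;           /* program chose the opcode */
                break;
        case TC_ACT_UNSPEC:                    /* 0xffffffff / -1 */
                action = configured_default;   /* fall back to tc's default */
                break;
        default:
                action = TC_ACT_UNSPEC;        /* unknown code: fail */
                break;
        }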
  26. 13 Mar 2015, 1 commit
    • cls_bpf: do eBPF invocation under non-bh RCU lock variant for maps · 54720df1
      Authored by Daniel Borkmann
      Currently, in cls_bpf it is possible to access eBPF maps only under
      a mix of RCU lock variants: on the ingress side, that is,
      handle_ing(), the classifier is called from
      __netif_receive_skb_core() under rcu_read_lock(); on the egress
      side, however, it's rcu_read_lock_bh() via __dev_queue_xmit().

      This rcu/rcu_bh mix doesn't work together with eBPF maps, as they
      require being called solely under rcu_read_lock(). eBPF maps could
      also be shared among various other eBPF programs (possibly even
      with other eBPF program types, e.g. tracing) and user space
      processes, so any context is assumed.

      Therefore, a possible fix for cls_bpf is to wrap/nest the eBPF
      program invocation under the non-bh RCU lock variant.
      
      Fixes: e2e9b654 ("cls_bpf: add initial eBPF support for programmable classifiers")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
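      The fix is essentially to nest a plain rcu_read_lock() around the
      program invocation (sketch):

        /* Egress may only hold rcu_read_lock_bh(); nest the non-bh
         * variant so eBPF map accesses always run under rcu_read_lock(). */
        rcu_read_lock();
        filter_res = BPF_PROG_RUN(prog, skb);
        rcu_read_unlock();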