1. 04 May 2015 (5 commits)
  2. 03 May 2015 (1 commit)
  3. 30 Apr 2015 (1 commit)
  4. 18 Apr 2015 (1 commit)
  5. 17 Apr 2015 (1 commit)
    •
      bpf: fix bpf helpers to use skb->mac_header relative offsets · a166151c
      Authored by Alexei Starovoitov
      As a short-term solution, let's fix the bpf helper functions to use
      skb->mac_header relative offsets instead of skb->data, so that the
      same eBPF programs work with cls_bpf and act_bpf on both the
      ingress and egress qdisc paths. We need to ensure that mac_header
      is set before calling into programs. This is effectively the first
      option from the discussion referenced below.
      
      The longer-term solution for LD_ABS|LD_IND instructions will be
      more intrusive but also more beneficial; it will be implemented
      later, as it is too risky at this point in time.
      
      That is, we plan to look into the option of moving skb_pull() out
      of eth_type_trans() and into netif_receive_skb(), as has been
      suggested as the second option. Meanwhile, this solution ensures
      that ingress can be used with eBPF, too, and that we won't run into
      ABI trouble later. For dealing with negative offsets inside eBPF
      helper functions, we've implemented bpf_skb_clone_unwritable() to
      test for unwritable headers.
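
      As a hedged illustration of the idea (not the exact patch; the
      helper name below is made up), anchoring offsets at the mac header
      gives ingress and egress one common base:

        /* Illustration only: on ingress, skb->data points at the mac
         * header, while on egress it already sits past it. Resolving
         * helper offsets against skb_mac_header() makes the same
         * offset valid on both paths. */
        static inline u8 *bpf_skb_base_ptr(const struct sk_buff *skb, int offset)
        {
          return skb_mac_header(skb) + offset; /* mac_header must be set */
        }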
      
      Reference: http://thread.gmane.org/gmane.linux.network/359129/focus=359694
      Fixes: 608cd71a ("tc: bpf: generalize pedit action")
      Fixes: 91bc4822 ("tc: bpf: add checksum helpers")
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a166151c
  6. 14 Apr 2015 (1 commit)
    •
      net: use jump label patching for ingress qdisc in __netif_receive_skb_core · 4577139b
      Authored by Daniel Borkmann
      Even if we only make use of classifiers and actions on the egress
      path, we still enter handle_ing() and execute additional code at a
      per-packet cost for the ingress qdisc, just to find that nothing is
      attached on ingress.
      
      Instead, this can be blinded out entirely as a no-op with the use
      of a static key. On the input fast path, we already make use of
      static keys in various places, e.g. skb time stamping, in RPS, etc.
      It makes sense not to waste time when we're assured that no ingress
      qdisc is attached anywhere.
      
      Enabling/disabling of that code path is done via two helpers,
      namely net_{inc,dec}_ingress_queue(), which are invoked under the
      RTNL mutex when an ingress qdisc is initialized or destroyed.
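
      A hedged sketch of the pattern (simplified; close to, but not
      necessarily identical with, the actual patch):

        static struct static_key ingress_needed __read_mostly;

        void net_inc_ingress_queue(void)
        {
          static_key_slow_inc(&ingress_needed);
        }

        void net_dec_ingress_queue(void)
        {
          static_key_slow_dec(&ingress_needed);
        }

        /* in __netif_receive_skb_core(): a no-op unless the key is enabled */
        if (static_key_false(&ingress_needed))
          skb = handle_ing(skb, &pt_prev, &ret, orig_dev);
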
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4577139b
  7. 08 Apr 2015 (1 commit)
    •
      netem: Fixes byte backlog accounting for the first of two chained netem instances · 0ad2a836
      Authored by Joseph Beshay
      Fixes byte backlog accounting for the first of two chained netem
      instances, so that the reported byte backlog corresponds to the
      number of queued packets.
      
      When two netem instances are chained, for instance to apply rate
      and queue limitation followed by packet delay, the number of
      backlogged bytes reported by the first netem instance is wrong: it
      reports the sum of the bytes in the queues of the first and second
      netem. The first netem reports the correct number of backlogged
      packets, but not bytes. This is shown in the example below.
      
      Consider a chain of two netem schedulers created using the following commands:
      
      $ tc -s qdisc replace dev veth2 root handle 1:0 netem rate 10000kbit limit 100
      $ tc -s qdisc add dev veth2 parent 1:0 handle 2: netem delay 50ms
      
      Start an iperf session to send packets out on the specified interface and
      monitor the backlog using tc:
      
      $ tc -s qdisc show dev veth2
      
      Output using unpatched netem:
      	qdisc netem 1: root refcnt 2 limit 100 rate 10000Kbit
      	 Sent 98422639 bytes 65434 pkt (dropped 123, overlimits 0 requeues 0)
      	 backlog 172694b 73p requeues 0
      	qdisc netem 2: parent 1: limit 1000 delay 50.0ms
      	 Sent 98422639 bytes 65434 pkt (dropped 0, overlimits 0 requeues 0)
      	 backlog 63588b 42p requeues 0
      
      The interface used to produce this output has an MTU of 1500. The
      reported byte backlog behind netem 1 is 172694b, which is not
      correct. Consider the total number of sent bytes and packets:
      dividing sent bytes by sent packets gives an average packet size of
      ~1504. Dividing backlogged bytes by backlogged packets, however,
      gives ~2365. This is because the first netem incorrectly counts the
      63588b sitting in netem 2's queue as being in its own queue. To
      verify this is the case, we subtract them from the reported value
      and divide by the number of packets as follows:
      	172694 - 63588 = 109106 bytes actually backlogged in netem 1
      	109106 / 73 packets ~= 1494 bytes (which matches our MTU)
      
      The root cause is that the byte accounting is not done at the same
      time as the packet accounting. The solution is to update the
      backlog value every time the packet queue is updated.
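
      A hedged sketch of that principle (simplified, not the literal
      hunks): wherever q.qlen changes, qstats.backlog changes with it:

        /* enqueue side */
        sch->q.qlen++;
        sch->qstats.backlog += qdisc_pkt_len(skb);

        /* dequeue/drop side */
        sch->q.qlen--;
        sch->qstats.backlog -= qdisc_pkt_len(skb);
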
      Signed-off-by: Joseph D Beshay <joseph.beshay@utdallas.edu>
      Acked-by: Hagen Paul Pfeifer <hagen@jauu.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0ad2a836
  8. 02 Apr 2015 (1 commit)
  9. 21 Mar 2015 (1 commit)
    •
      act_bpf: add initial eBPF support for actions · a8cb5f55
      Authored by Daniel Borkmann
      This work extends the "classic" BPF programmable tc action to also
      cover native eBPF code.
      
      Together with commit e2e9b654 ("cls_bpf: add initial eBPF support
      for programmable classifiers"), this adds the facility to implement
      fully flexible classifiers and actions for tc that can be written
      in a C subset in user space, "safely" loaded into the kernel, and
      run at native speed when JITed.
      
      Also, since eBPF maps can be shared between eBPF programs, it
      offers the possibility for cls_bpf and act_bpf to share data 1)
      between themselves and 2) with user space applications. That means,
      for example, that customized runtime statistics can be collected in
      user space, but, more importantly, that classifier and action
      behaviour could be altered based on map input from the user space
      application.
      
      For the remaining details on the workflow and integration, see the
      cls_bpf commit e2e9b654. A preliminary iproute2 part can be found
      under [1].
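
      As a hedged toy example (assuming the same restricted C subset and
      tc_bpf_api.h header as the cls_bpf example further down in this
      log), a minimal eBPF action could look like:

        #include <linux/bpf.h>
        #include <linux/pkt_cls.h>

        #include "tc_bpf_api.h"

        __section("action")
        int act_main(struct sk_buff *skb)
        {
          return TC_ACT_OK; /* pass; TC_ACT_SHOT would drop the packet */
        }

        char __license[] __section("license") = "GPL";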
      
        [1] http://git.breakpoint.cc/cgit/dborkman/iproute2.git/log/?h=ebpf-act
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Cc: Jiri Pirko <jiri@resnulli.us>
      Acked-by: Jiri Pirko <jiri@resnulli.us>
      Acked-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a8cb5f55
  10. 18 Mar 2015 (1 commit)
    •
      act_bpf: allow non-default TC_ACT opcodes as BPF exec outcome · ced585c8
      Authored by Daniel Borkmann
      Revisiting commit d23b8ad8 ("tc: add BPF based action") with regard
      to eBPF support, I was thinking that it might be better to improve
      the return semantics of a BPF program invoked through
      BPF_PROG_RUN().
      
      Currently, in case filter_res is 0, we overwrite the default action
      opcode with TC_ACT_SHOT. A default action opcode configured through tc's
      m_bpf can be: TC_ACT_RECLASSIFY, TC_ACT_PIPE, TC_ACT_SHOT, TC_ACT_UNSPEC,
      TC_ACT_OK.
      
      In cls_bpf, we have the possibility to overwrite the default class
      associated with the classifier in case filter_res is _not_ 0xffffffff
      (-1).
      
      That allows us to fold multiple [e]BPF programs into a single one,
      where they would otherwise need to be defined as separate
      classifiers with their own classids, needlessly redoing parsing
      work, etc.
      
      Similarly, we could do better in act_bpf: since the above TC_ACT*
      opcodes are exported to UAPI anyway, we reuse them for the
      return-code-to-tc-opcode mapping, allowing the above possibilities.
      Thus, as in cls_bpf, a filter_res of 0xffffffff (-1) means that the
      configured _default_ action is used. Any unknown return code from
      the BPF program would fail in tcf_bpf() with TC_ACT_UNSPEC.
      
      Should we one day want to make use of TC_ACT_STOLEN or
      TC_ACT_QUEUED, which both have the same semantics, we have the
      option of using them either as a default action (filter_res of
      0xffffffff) or as a non-default BPF return code.
      
      All that will allow us to transparently use tcf_bpf() for both BPF
      flavours.
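
      A hedged sketch of that mapping (simplified; not the exact
      tcf_bpf() code, and the function name is illustrative):

        static int tc_map_bpf_result(int filter_res, int default_action)
        {
          switch (filter_res) {
          case TC_ACT_PIPE:
          case TC_ACT_RECLASSIFY:
          case TC_ACT_OK:
          case TC_ACT_SHOT:
            return filter_res;     /* non-default opcode from BPF */
          case TC_ACT_UNSPEC:      /* 0xffffffff (-1) */
            return default_action; /* keep the configured default */
          default:
            return TC_ACT_UNSPEC;  /* unknown return codes fail */
          }
        }
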
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Jiri Pirko <jiri@resnulli.us>
      Cc: Alexei Starovoitov <ast@plumgrid.com>
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Acked-by: Jiri Pirko <jiri@resnulli.us>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ced585c8
  11. 13 Mar 2015 (1 commit)
    •
      cls_bpf: do eBPF invocation under non-bh RCU lock variant for maps · 54720df1
      Authored by Daniel Borkmann
      Currently, it is possible in cls_bpf to access eBPF maps only under
      the rcu_read_lock_bh() variant: while on the ingress side, that is,
      handle_ing(), the classifier is called from
      __netif_receive_skb_core() under rcu_read_lock(), on the egress
      side it's rcu_read_lock_bh() via __dev_queue_xmit().
      
      This rcu/rcu_bh mix doesn't work together with eBPF maps, as they
      require being called solely under rcu_read_lock(). eBPF maps can
      also be shared among various other eBPF programs (possibly even
      other eBPF program types, e.g. tracing) and user space processes,
      so any context is assumed.
      
      Therefore, a possible fix for cls_bpf is to wrap/nest the eBPF
      program invocation under the non-bh RCU lock variant.
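
      Sketched (hedged, simplified), the nesting looks like:

        /* egress already holds rcu_read_lock_bh(); nesting the non-bh
         * variant is legal and gives maps the context they expect */
        rcu_read_lock();
        filter_res = BPF_PROG_RUN(prog->filter, skb);
        rcu_read_unlock();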
      
      Fixes: e2e9b654 ("cls_bpf: add initial eBPF support for programmable classifiers")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      54720df1
  12. 10 Mar 2015 (2 commits)
    •
      net_sched: fix struct tc_u_hnode layout in u32 · 5778d39d
      Authored by WANG Cong
      We dynamically allocate divisor+1 entries for ->ht[] in tc_u_hnode:
      
        ht = kzalloc(sizeof(*ht) + divisor*sizeof(void *), GFP_KERNEL);
      
      So ->ht is supposed to be the last field of this struct; however,
      this is broken, since an rcu head is appended after it.
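
      The layout constraint, as a hedged sketch (other fields abridged):

        struct tc_u_hnode {
          /* ... other fields ... */
          struct rcu_head rcu;            /* the fix moves this above ht[] */
          struct tc_u_knode __rcu *ht[1]; /* over-allocated; must stay last */
        };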
      
      Fixes: 1ce87720 ("net: sched: make cls_u32 lockless")
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Cc: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5778d39d
    •
      net_sched: destroy proto tp when all filters are gone · 1e052be6
      Authored by Cong Wang
      The kernel automatically creates a tp (with handle 0) for each
      (kind, protocol, priority) tuple when we add a new filter, but it
      is still left there after we remove our own filter, unless we omit
      the handle (which literally means removing all the filters under
      the tuple). For example, this one is left behind:
      
        # tc filter show dev eth0
        filter parent 8001: protocol arp pref 49152 basic
      
      It is hard for user space to clean these up for the kernel, because
      filters like u32 are organized in a complex way. So the kernel is
      responsible for removing a tp after all its filters are gone. Each
      type of filter has its own way of storing filters, so each type has
      to provide its own way to check whether all its filters are gone.
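
      As a hedged illustration of such a check (the callback name and
      placement are illustrative; the actual patch folds this into each
      classifier's destroy path):

        static bool basic_destroy(struct tcf_proto *tp, bool force)
        {
          struct basic_head *head = rtnl_dereference(tp->root);

          if (!force && !list_empty(&head->flist))
            return false; /* filters remain, keep the tp alive */
          /* ... otherwise free all filters and the head ... */
          return true;
        }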
      
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: Cong Wang <cwang@twopensource.com>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1e052be6
  13. 06 Mar 2015 (2 commits)
  14. 02 Mar 2015 (1 commit)
    •
      cls_bpf: add initial eBPF support for programmable classifiers · e2e9b654
      Authored by Daniel Borkmann
      This work extends the "classic" BPF programmable tc classifier to
      also cover native eBPF code.
      
      This allows user space to implement its own custom, 'safe', C-like
      classifiers (or whatever other frontend language LLVM et al. may
      provide in the future) that can then be compiled with the LLVM eBPF
      backend to an eBPF ELF file. The result can be loaded into the
      kernel via iproute2's tc. In the kernel, such programs can be JITed
      on major archs and thus run at native performance.
      
      Simple, minimal toy example to demonstrate the workflow:
      
        #include <linux/ip.h>
        #include <linux/if_ether.h>
        #include <linux/bpf.h>
      
        #include "tc_bpf_api.h"
      
        __section("classify")
        int cls_main(struct sk_buff *skb)
        {
          return (0x800 << 16) | load_byte(skb, ETH_HLEN + __builtin_offsetof(struct iphdr, tos));
        }
      
        char __license[] __section("license") = "GPL";
      
      The classifier can then be compiled into eBPF opcodes and loaded
      via tc, for example:
      
        clang -O2 -emit-llvm -c cls.c -o - | llc -march=bpf -filetype=obj -o cls.o
        tc filter add dev em1 parent 1: bpf cls.o [...]
      
      As has been demonstrated, the scope can even reach up to a fully
      fledged flow dissector (similar to samples/bpf/sockex2_kern.c).
      
      For tc, maps are allowed to be used, but from kernel context only;
      in other words, eBPF code can keep state across filter invocations.
      In the future, we may perhaps reattach to those maps from a
      different application, e.g. to read out collected statistics/state.
      
      Similarly to socket filters, we may extend the functionality of
      eBPF classifiers over time depending on the use cases. For that
      purpose, cls_bpf programs use the BPF_PROG_TYPE_SCHED_CLS program
      type, so we can allow additional functions/accessors (e.g. an
      ABI-compatible offset translation to skb fields/metadata). For the
      initial cls_bpf support, we allow the same set of helper functions
      as for eBPF socket filters, but we could diverge at some point in
      time without problem.
      
      I was wondering whether cls_bpf and act_bpf could share C programs;
      I can imagine that at some point we introduce i) further common
      handlers for both (or even beyond their scope), and/or, if truly
      needed, ii) some restricted function space for each of them. Both
      can easily be abstracted through struct bpf_verifier_ops in the
      future.
      
      The context of cls_bpf versus act_bpf is slightly different,
      though: a cls_bpf program returns a specific classid, whereas
      act_bpf returns a drop/non-drop code; the latter may also mangle
      skbs in the future. That said, we can surely have a "classify" and
      an "action" section in a single object file, or, considering the
      mentioned constraint, add the possibility of a shared section.
      
      The workflow for getting native eBPF running from tc [1] is as
      follows: for f_bpf, I've added slightly modified ELF parser code
      from Alexei's kernel sample, which reads out the LLVM-compiled
      object, sets up maps (and dynamically fixes up map fds) if any, and
      loads the eBPF instructions all centrally through the bpf syscall.
      
      The resulting fd of the loaded program is passed down to cls_bpf,
      which looks up struct bpf_prog in the fd store and holds a
      reference, so that the program stays available beyond tc program
      lifetime. On tc filter destruction, it then drops its reference.
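
      Sketched (hedged, simplified), the fd handover looks like:

        struct bpf_prog *prog = bpf_prog_get(bpf_fd);

        if (IS_ERR(prog))
          return PTR_ERR(prog);
        /* ... classify with prog; on tc filter destruction: */
        bpf_prog_put(prog);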
      
      Moreover, I've also added the optional possibility to annotate an
      eBPF filter with a name (e.g. the path to the object file, or
      something else if preferred), so that when tc dumps currently
      installed filters, more context can be given to an admin for a
      given instance (as opposed to just the file descriptor number).
      
      Last but not least, bpf_prog_get() and bpf_prog_put() needed to be
      exported so that eBPF can be used from cls_bpf built as a module.
      Thanks to 60a3b225 ("net: bpf: make eBPF interpreter images
      read-only"), I think this is of no concern, since anything wanting
      to alter eBPF opcodes after the verification stage would crash the
      kernel.
      
        [1] http://git.breakpoint.cc/cgit/dborkman/iproute2.git/log/?h=ebpf
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Cc: Jiri Pirko <jiri@resnulli.us>
      Acked-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e2e9b654
  15. 23 Feb 2015 (1 commit)
  16. 21 Feb 2015 (1 commit)
  17. 05 Feb 2015 (4 commits)
    •
      pkt_sched: fq: better control of DDOS traffic · 06eb395f
      Authored by Eric Dumazet
      FQ has a fast path for skbs attached to a socket, as it does not
      have to compute a flow hash. But for other packets, FQ being
      non-stochastic means that hosts exposed to random Internet traffic
      can allocate millions of flow structures (104 bytes each) pretty
      easily. Not only can the host OOM, but lookups in the RB trees can
      take too much CPU and memory as well.
      
      This patch adds a new attribute, orphan_mask, which adds the
      possibility of having a stochastic hash for orphaned skbs. Its
      default value is 1024 slots, to mimic SFQ behavior.
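
      A hedged sketch of the idea (simplified; the exact expression in
      the patch may differ):

        /* orphaned skbs collapse into at most orphan_mask + 1 buckets;
         * the pointer is used only as a hash key, never dereferenced */
        if (!skb->sk)
          sk = (struct sock *)(skb_get_hash(skb) & q->orphan_mask);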
      
      Note: This does not apply to locally generated TCP traffic,
      and no locally generated traffic will share a flow structure
      with another perfect or stochastic flow.
      
      This patch also handles the specific case of SYNACK messages:
      
      They are attached to the listener socket, and therefore all map to
      a single hash bucket. If the listener has set SO_MAX_PACING_RATE,
      hoping to have newly accepted sockets inherit this rate, SYNACKs
      might be paced and even dropped.
      
      This is very similar to an internal patch Google has used for more
      than a year.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      06eb395f
    •
      tcp: do not pace pure ack packets · 98781965
      Authored by Eric Dumazet
      When we added pacing to TCP, we decided to let sch_fq take care
      of actual pacing.
      
      All TCP had to do was compute sk->pacing_rate using a simple
      formula:
      
      sk->pacing_rate = 2 * cwnd * mss / rtt
      
      It works well for senders (bulk flows), but not very well for
      receivers or even RPC:
      
      cwnd on the receiver can be less than 10 and rtt around 100ms, so
      we can end up pacing ACK packets, slowing down the sender.
      
      Really, only the sender should pace, according to its own logic.
      
      Instead of adding a new bit in the skb, or calling yet another flow
      dissection, we tweak skb->truesize to a small value (2) and
      instruct sch_fq to use a new helper and not pace pure ACKs.
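
      The helper, as a hedged sketch (name and placement as assumed
      here):

        static inline bool skb_is_tcp_pure_ack(const struct sk_buff *skb)
        {
          return skb->truesize == 2; /* deliberately impossible true size */
        }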
      
      Note this also helps TCP small queues, as ACK packets present in
      the qdisc/NIC do not prevent sending a data packet (RPC workload).
      
      This helps reduce tx completion overhead; ACK packets can use the
      regular sock_wfree() instead of tcp_wfree(), which is a bit more
      expensive.
      
      This has no impact when packets are sent to the loopback interface,
      as we do not coalesce ACK packets (where we would detect the
      skb->truesize lie).
      
      In case netem (with a delay) is used, skb_orphan_partial() also sets
      skb->truesize to 1.
      
      This patch is a combination of two patches we used for about one year at
      Google.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      98781965
    •
      cls_api.c: Fix dumping of non-existing actions' stats. · b057df24
      Authored by Ignacy Gawędzki
      In tcf_exts_dump_stats(), ensure that exts->actions is not empty before
      accessing the first element of that list and calling tcf_action_copy_stats()
      on it.  This fixes some random segvs when adding filters of type "basic" with
      no particular action.
      
      This also fixes the dumping of those "no-action" filters, which
      more often than not made calls to tcf_action_copy_stats() fail and,
      consequently, caused netlink attributes added by the caller to be
      removed by a call to nla_nest_cancel().
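
      A hedged sketch of the guard (close to, but not necessarily
      identical with, the actual patch):

        if (exts->action && !list_empty(&exts->actions)) {
          struct tc_action *a = list_first_entry(&exts->actions,
                                                 struct tc_action, list);

          if (tcf_action_copy_stats(skb, a, 1) < 0)
            return -1;
        }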
      
      Fixes: 33be6271 ("net_sched: act: use standard struct list_head")
      Signed-off-by: Ignacy Gawędzki <ignacy.gawedzki@green-communications.fr>
      Acked-by: Cong Wang <cwang@twopensource.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b057df24
    •
      pkt_sched: fq: avoid hang when quantum 0 · 3725a269
      Authored by Kenneth Klette Jonassen
      Configuring fq with quantum 0 hangs the system, presumably because
      of a non-interruptible infinite loop; either way, quantum 0 does
      not make sense.
      
      Reproduce with:
      sudo tc qdisc add dev lo root fq quantum 0 initial_quantum 0
      ping 127.0.0.1
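
      A hedged sketch of the validation (its exact placement in
      fq_change() may differ):

        if (tb[TCA_FQ_QUANTUM]) {
          u32 quantum = nla_get_u32(tb[TCA_FQ_QUANTUM]);

          if (quantum > 0)
            q->quantum = quantum;
          else
            err = -EINVAL; /* a zero quantum would spin forever */
        }
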
      Signed-off-by: Kenneth Klette Jonassen <kennetkl@ifi.uio.no>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3725a269
  18. 29 Jan 2015 (2 commits)
  19. 27 Jan 2015 (4 commits)
  20. 24 Jan 2015 (1 commit)
  21. 20 Jan 2015 (1 commit)
    •
      net: sched: Introduce connmark action · 22a5dc0e
      Authored by Felix Fietkau
      This tc action allows you to retrieve the connection tracking mark.
      This action has been used heavily by OpenWrt for a few years now.
      
      There are currently known limitations (a hedged sketch of the core
      lookup follows the list):
      
      doesn't work for initial packets, since we only query the ct table.
        Fine, given the use case is for returning packets
      
      no implicit defrag.
        frags should be rare so fix later..
      
      won't work for more complex tasks, e.g. lookup of other extensions
        since we have no means to store results
      
      we still have a 2nd lookup later on via the normal conntrack path.
      This shouldn't break anything, though, since skb->nfct isn't altered.
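
      A hedged sketch of the core lookup (tuple construction, zone
      handling, and error paths abridged):

        enum ip_conntrack_info ctinfo;
        struct nf_conntrack_tuple tuple;
        struct nf_conntrack_tuple_hash *thash;
        struct nf_conn *ct;

        /* query the ct table directly; skb->nfct stays untouched */
        thash = nf_conntrack_find_get(net, zone, &tuple);
        if (thash) {
          ct = nf_ct_tuplehash_to_ctrack(thash);
          skb->mark = ct->mark;
          nf_ct_put(ct);
        }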
      
      V2:
      remove unnecessary braces (Jiri)
      change the action identifier to 14 (Jiri)
      Fix some stylistic issues caught by checkpatch
      V3:
      Move module params to bottom (Cong)
      Get rid of tcf_hashinfo_init and friends and conform to newer API (Cong)
      Acked-by: Jiri Pirko <jiri@resnulli.us>
      Signed-off-by: Felix Fietkau <nbd@openwrt.org>
      Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      22a5dc0e
  22. 18 Jan 2015 (2 commits)
  23. 14 Jan 2015 (2 commits)
  24. 13 Jan 2015 (1 commit)
  25. 07 Jan 2015 (1 commit)