1. 04 Dec, 2015 (1 commit)
    • net_sched: fix qdisc_tree_decrease_qlen() races · 4eaf3b84
      Committed by Eric Dumazet
      qdisc_tree_decrease_qlen() suffers from two problems on multiqueue
      devices.
      
      One problem is that it updates sch->q.qlen and sch->qstats.drops
      on the mq/mqprio root qdisc, while it should not: Daniele
      reported underflow errors:
      [  681.774821] PAX: sch->q.qlen: 0 n: 1
      [  681.774825] PAX: size overflow detected in function qdisc_tree_decrease_qlen net/sched/sch_api.c:769 cicus.693_49 min, count: 72, decl: qlen; num: 0; context: sk_buff_head;
      [  681.774954] CPU: 2 PID: 19 Comm: ksoftirqd/2 Tainted: G           O    4.2.6.201511282239-1-grsec #1
      [  681.774955] Hardware name: ASUSTeK COMPUTER INC. X302LJ/X302LJ, BIOS X302LJ.202 03/05/2015
      [  681.774956]  ffffffffa9a04863 0000000000000000 0000000000000000 ffffffffa990ff7c
      [  681.774959]  ffffc90000d3bc38 ffffffffa95d2810 0000000000000007 ffffffffa991002b
      [  681.774960]  ffffc90000d3bc68 ffffffffa91a44f4 0000000000000001 0000000000000001
      [  681.774962] Call Trace:
      [  681.774967]  [<ffffffffa95d2810>] dump_stack+0x4c/0x7f
      [  681.774970]  [<ffffffffa91a44f4>] report_size_overflow+0x34/0x50
      [  681.774972]  [<ffffffffa94d17e2>] qdisc_tree_decrease_qlen+0x152/0x160
      [  681.774976]  [<ffffffffc02694b1>] fq_codel_dequeue+0x7b1/0x820 [sch_fq_codel]
      [  681.774978]  [<ffffffffc02680a0>] ? qdisc_peek_dequeued+0xa0/0xa0 [sch_fq_codel]
      [  681.774980]  [<ffffffffa94cd92d>] __qdisc_run+0x4d/0x1d0
      [  681.774983]  [<ffffffffa949b2b2>] net_tx_action+0xc2/0x160
      [  681.774985]  [<ffffffffa90664c1>] __do_softirq+0xf1/0x200
      [  681.774987]  [<ffffffffa90665ee>] run_ksoftirqd+0x1e/0x30
      [  681.774989]  [<ffffffffa90896b0>] smpboot_thread_fn+0x150/0x260
      [  681.774991]  [<ffffffffa9089560>] ? sort_range+0x40/0x40
      [  681.774992]  [<ffffffffa9085fe4>] kthread+0xe4/0x100
      [  681.774994]  [<ffffffffa9085f00>] ? kthread_worker_fn+0x170/0x170
      [  681.774995]  [<ffffffffa95d8d1e>] ret_from_fork+0x3e/0x70
      
      mq/mqprio have their own ways to report qlen/drops by folding stats on
      all their queues, with appropriate locking.
      
      A second problem is that qdisc_tree_decrease_qlen() calls qdisc_lookup()
      without proper locking: concurrent qdisc updates could corrupt the list
      that qdisc_match_from_root() parses to find a qdisc given its handle.
      
      Fix the first problem by adding a TCQ_F_NOPARENT qdisc flag that
      qdisc_tree_decrease_qlen() can use to abort its tree traversal
      as soon as it meets an mq/mqprio qdisc child.
      
      The second problem can be fixed with RCU protection.
      Qdiscs are already freed after an RCU grace period, so qdisc_list_add() and
      qdisc_list_del() simply have to use the appropriate RCU list variants.
      
      A future patch will add a per struct netdev_queue list anchor, so that
      qdisc_tree_decrease_qlen() can have more efficient lookups.
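      A minimal sketch of the intended shape of both fixes, assuming the flag
      and helper names named above (everything else is simplified and is not
      the literal upstream diff):

        static void qdisc_list_add(struct Qdisc *q)
        {
                if ((q->parent != TC_H_ROOT) && !(q->flags & TCQ_F_INGRESS)) {
                        struct Qdisc *root = qdisc_dev(q)->qdisc;

                        /* RCU variant: lookups may now walk the list locklessly */
                        list_add_tail_rcu(&q->list, &root->list);
                }
        }

        void qdisc_list_del(struct Qdisc *q)
        {
                if ((q->parent != TC_H_ROOT) && !(q->flags & TCQ_F_INGRESS))
                        list_del_rcu(&q->list);
        }

        void qdisc_tree_decrease_qlen(struct Qdisc *sch, unsigned int n)
        {
                u32 parentid;

                if (n == 0)
                        return;
                while ((parentid = sch->parent)) {
                        /* mq/mqprio children carry TCQ_F_NOPARENT: stop before
                         * touching the dummy root's qlen/drops counters */
                        if (sch->flags & TCQ_F_NOPARENT)
                                break;
                        sch = qdisc_lookup(qdisc_dev(sch), TC_H_MAJ(parentid));
                        if (sch == NULL)
                                break;
                        sch->q.qlen -= n;
                        sch->qstats.drops += n;
                }
        }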
      Reported-by: Daniele Fucini <dfucini@gmail.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Cong Wang <cwang@twopensource.com>
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 09 Nov, 2015 (2 commits)
  3. 04 Nov, 2015 (1 commit)
  4. 11 Oct, 2015 (2 commits)
  5. 09 Oct, 2015 (1 commit)
    • net/sched: make sch_blackhole.c explicitly non-modular · 075640e3
      Committed by Paul Gortmaker
      The Kconfig currently controlling compilation of this code is:
      
      net/sched/Kconfig:menuconfig NET_SCHED
      net/sched/Kconfig:      bool "QoS and/or fair queueing"
      
      ...meaning that it currently is not being built as a module by anyone.
      
      Let's remove the modular code that is essentially orphaned, so that
      when reading the driver there is no doubt it is builtin-only.
      
      Since module_init translates to device_initcall in the non-modular
      case, the init ordering remains unchanged with this commit.  We can
      change to one of the other priority initcalls (subsys?) at any later
      date, if desired.
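      The shape of that change, as a rough sketch (the init function name here
      is illustrative, not necessarily the one used in sch_blackhole.c):

        #include <linux/init.h>
        #include <net/pkt_sched.h>

        static int __init blackhole_init(void)
        {
                return register_qdisc(&blackhole_qdisc_ops);
        }

        /* was: module_init(blackhole_init); the module_exit() path and
         * MODULE_LICENSE() boilerplate go away since this can only be builtin */
        device_initcall(blackhole_init);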
      
      We also delete the MODULE_LICENSE tag since all that information
      is already contained at the top of the file in the comments.
      
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: netdev@vger.kernel.org
      Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  6. 08 Oct, 2015 (1 commit)
  7. 05 Oct, 2015 (2 commits)
  8. 03 Oct, 2015 (2 commits)
    • sched, bpf: add helper for retrieving routing realms · c46646d0
      Committed by Daniel Borkmann
      Using routing realms as part of the classifier is quite useful: a realm
      can be viewed as a tag for one or multiple routing entries (think of
      an analogy to the net_cls cgroup for processes), set by user-space routing
      daemons or via iproute2 as an indicator for traffic classifiers and
      later on processed in the eBPF program.
      
      Unlike actions, the classifier can inspect device flags and enable
      netif_keep_dst() if necessary. tc actions don't have that possibility,
      but in case people know what they are doing, it can be used from there
      as well (e.g. via devs that must keep dsts by design anyway).
      
      If a realm is set, the handler returns the non-zero realm. User space
      can set the full 32-bit realm for the dst.
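      A sketch of how a cls_bpf program might consume the new helper; the
      section name, realm numbers, and the helper-declaration header are
      illustrative assumptions, only bpf_get_route_realm() itself comes from
      this change:

        #include <linux/bpf.h>
        #include <linux/pkt_cls.h>
        #include "bpf_helpers.h"   /* assumed to declare bpf_get_route_realm() */

        SEC("classifier")
        int classify_on_realm(struct __sk_buff *skb)
        {
                __u32 realm = bpf_get_route_realm(skb);

                if (!realm)             /* dst has no realm attached */
                        return TC_ACT_OK;
                if (realm == 10)        /* e.g. a realm tagged by the routing daemon */
                        return TC_ACT_SHOT;
                return TC_ACT_OK;
        }

        char _license[] SEC("license") = "GPL";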
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: attach SYNACK messages to request sockets instead of listener · ca6fb065
      Committed by Eric Dumazet
      If a listen backlog is very big (to avoid syncookies), then
      the listener sk->sk_wmem_alloc is the main source of false
      sharing, as we need to touch it twice per SYNACK re-transmit
      and TX completion.
      
      (One SYN packet takes the listener lock once, but up to 6 SYNACKs
      are generated)
      
      By attaching the skb to the request socket, we remove this
      source of contention.
      
      Tested:
      
       listen(fd, 10485760); // single listener (no SO_REUSEPORT)
       16 RX/TX queue NIC
       Sustain a SYNFLOOD attack of ~320,000 SYN per second,
       Sending ~1,400,000 SYNACK per second.
       Perf profiles now show the listener spinlock being the next bottleneck.
      
          20.29%  [kernel]  [k] queued_spin_lock_slowpath
          10.06%  [kernel]  [k] __inet_lookup_established
           5.12%  [kernel]  [k] reqsk_timer_handler
           3.22%  [kernel]  [k] get_next_timer_interrupt
           3.00%  [kernel]  [k] tcp_make_synack
           2.77%  [kernel]  [k] ipt_do_table
           2.70%  [kernel]  [k] run_timer_softirq
           2.50%  [kernel]  [k] ip_finish_output
           2.04%  [kernel]  [k] cascade
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  9. 25 Sep, 2015 (1 commit)
  10. 24 Sep, 2015 (3 commits)
  11. 19 Sep, 2015 (3 commits)
  12. 18 Sep, 2015 (3 commits)
    • sch_dsmark: improve memory locality · 47bbbb30
      Committed by Eric Dumazet
      Memory placement in sch_dsmark is silly: better to place mask/value
      in the same cache line.
      
      Also, we can embed small arrays in the first cache line and
      remove a potential cache miss.
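      A rough illustration of the layout idea, assuming a combined mask/value
      pair struct; the names and sizes below are illustrative, not a claim
      about the exact fields the patch ends up with:

        /* before: mask[i] and value[i] live in separate arrays, so one
         * lookup for index i can touch two different cache lines */
        struct mask_value {
                u8 mask;
                u8 value;       /* mask/value for one index now share a line */
        };

        struct dsmark_qdisc_data {
                struct Qdisc            *q;
                struct tcf_proto __rcu  *filter_list;
                struct mask_value       *mv;    /* points at embedded[] when small */
                u16                      indices;
                u8                       set_tc_index;
                u32                      default_index;
                struct mask_value        embedded[16];  /* small tables: no extra miss */
        };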
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: add bpf_redirect() helper · 27b29f63
      Committed by Alexei Starovoitov
      The existing bpf_clone_redirect() helper clones the skb before redirecting
      it to the RX or TX path of the destination netdev.
      Introduce a bpf_redirect() helper that does the same without cloning.
      
      Benchmarked with two hosts using 10G ixgbe NICs.
      One host is doing line rate pktgen.
      Another host is configured as:
      $ tc qdisc add dev $dev ingress
      $ tc filter add dev $dev root pref 10 u32 match u32 0 0 flowid 1:2 \
         action bpf run object-file tcbpf1_kern.o section clone_redirect_xmit drop
      so it receives the packet on $dev and immediately xmits it on $dev + 1.
      The section 'clone_redirect_xmit' in the tcbpf1_kern.o file contains the
      program that does bpf_clone_redirect(); performance is 2.0 Mpps
      
      $ tc filter add dev $dev root pref 10 u32 match u32 0 0 flowid 1:2 \
         action bpf run object-file tcbpf1_kern.o section redirect_xmit drop
      which is using bpf_redirect() - 2.4 Mpps
      
      and using cls_bpf with integrated actions as:
      $ tc filter add dev $dev root pref 10 \
        bpf run object-file tcbpf1_kern.o section redirect_xmit integ_act classid 1
      performance is 2.5 Mpps
      
      To summarize:
      u32+act_bpf using clone_redirect - 2.0 Mpps
      u32+act_bpf using redirect - 2.4 Mpps
      cls_bpf using redirect - 2.5 Mpps
      
      For comparison, a Linux bridge in this setup does 2.1 Mpps,
      and ixgbe rx + drop in ip_rcv does 7.8 Mpps
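      A sketch of the non-cloning path; the target ifindex, section name, and
      helper-declaration header are illustrative assumptions:

        #include <linux/bpf.h>
        #include <linux/pkt_cls.h>
        #include "bpf_helpers.h"   /* assumed to declare bpf_redirect() */

        SEC("redirect_xmit")
        int redirect_prog(struct __sk_buff *skb)
        {
                /* no clone is taken here; the helper only records the target,
                 * the stack performs the actual forwarding after we return */
                return bpf_redirect(3 /* target ifindex */, 0 /* 0 = egress side */);
        }

        char _license[] SEC("license") = "GPL";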
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: John Fastabend <john.r.fastabend@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • cls_bpf: introduce integrated actions · 045efa82
      Committed by Daniel Borkmann
      Often the cls_bpf classifier is used with a single drop action attached.
      Optimize this use case and let cls_bpf return both a classid and an action.
      For backwards compatibility reasons, enable this feature under the
      TCA_BPF_FLAG_ACT_DIRECT flag.
      
      Then more interesting programs like the following are easier to write:
      int cls_bpf_prog(struct __sk_buff *skb)
      {
        /* classify arp, ip, ipv6 into different traffic classes
         * and drop all other packets
         */
        switch (skb->protocol) {
        case htons(ETH_P_ARP):
          skb->tc_classid = 1;
          break;
        case htons(ETH_P_IP):
          skb->tc_classid = 2;
          break;
        case htons(ETH_P_IPV6):
          skb->tc_classid = 3;
          break;
        default:
          return TC_ACT_SHOT;
        }
      
        return TC_ACT_OK;
      }
      
      Joint work with Daniel Borkmann.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  13. 02 Sep, 2015 (1 commit)
  14. 29 Aug, 2015 (1 commit)
  15. 28 Aug, 2015 (4 commits)
  16. 27 Aug, 2015 (5 commits)
  17. 26 Aug, 2015 (1 commit)
  18. 19 Aug, 2015 (1 commit)
  19. 18 Aug, 2015 (3 commits)
    • net: Change pseudohdr argument of inet_proto_csum_replace* to be a bool · 4b048d6d
      Committed by Tom Herbert
      inet_proto_csum_replace4,2,16 take a pseudohdr argument which indicates
      whether the checksum field carries a pseudo header. This argument should
      be a boolean instead of an int.
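      The resulting signature and a minimal caller, sketched (the call site
      below is invented for illustration):

        void inet_proto_csum_replace4(__sum16 *sum, struct sk_buff *skb,
                                      __be32 from, __be32 to, bool pseudohdr);

        /* e.g. rewriting a source address that the TCP checksum covers via
         * the pseudo header, hence pseudohdr = true rather than a bare 1 */
        inet_proto_csum_replace4(&tcph->check, skb, old_saddr, new_saddr, true);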
      Signed-off-by: Tom Herbert <tom@herbertland.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • netfilter: nf_conntrack: add direction support for zones · deedb590
      Committed by Daniel Borkmann
      This work adds a direction parameter to netfilter zones, so identity
      separation can be performed only in the original/reply direction or in
      both directions (default). This basically opens up the possibility of
      doing NAT with conflicting IP address/port tuples from multiple, isolated
      tenants on a host (e.g. from a netns) without requiring each tenant to
      NAT twice or, alternatively, to use its own dedicated IP address to SNAT
      to, meaning overlapping tuples can be made unique with the zone identifier
      in the original direction, where the NAT engine will then allocate a
      unique tuple in the commonly shared default zone for the reply direction.
      In some restricted, local DNAT cases, port redirection could also be
      used for making the reply traffic unique without requiring SNAT.
      
      The consensus we've reached and discussed at NFWS and since the initial
      implementation [1] was to directly integrate the direction meta data
      into the existing zones infrastructure, as opposed to the ct->mark
      approach we proposed initially.
      
      As we pass the nf_conntrack_zone object directly around, we don't have
      to touch all call-sites, but only those that contain equality checks
      of zones. Thus, based on the current direction (original or reply),
      we either return the actual id, or the default NF_CT_DEFAULT_ZONE_ID.
      CT expectations are direction-agnostic entities when expectations are
      being compared among themselves, so we can only use the identifier
      in this case.
      
      Note that zone identifiers cannot be included in the hash mix anymore,
      as they no longer contain a "stable" value that is equal for both
      directions at all times; e.g. if only zone->id were unconditionally
      xor'ed into the table slot hash, then replies would not find the
      corresponding conntrack entry anymore.
      
      If no particular direction is specified when configuring zones, the
      behaviour is exactly as we expect currently (both directions).
      
      Support has been added for the CT netlink interface as well as the
      x_tables raw CT target, which both already offer existing interfaces
      to user space for the configuration of zones.
      
      Below a minimal, simplified collision example (script in [2]) with
      netperf sessions:
      
        +--- tenant-1 ---+   mark := 1
        |    netperf     |--+
        +----------------+  |                CT zone := mark [ORIGINAL]
         [ip,sport] := X   +--------------+  +--- gateway ---+
                           | mark routing |--|     SNAT      |-- ... +
                           +--------------+  +---------------+       |
        +--- tenant-2 ---+  |                                     ~~~|~~~
        |    netperf     |--+                +-----------+           |
        +----------------+   mark := 2       | netserver |------ ... +
         [ip,sport] := X                     +-----------+
                                              [ip,port] := Y
      On the gateway netns, example:
      
        iptables -t raw -A PREROUTING -j CT --zone mark --zone-dir ORIGINAL
        iptables -t nat -A POSTROUTING -o <dev> -j SNAT --to-source <ip> --random-fully
      
        iptables -t mangle -A PREROUTING -m conntrack --ctdir ORIGINAL -j CONNMARK --save-mark
        iptables -t mangle -A POSTROUTING -m conntrack --ctdir REPLY -j CONNMARK --restore-mark
      
      conntrack dump from gateway netns:
      
        netperf -H 10.1.1.2 -t TCP_STREAM -l60 -p12865,5555 from each tenant netns
      
        tcp 6 431995 ESTABLISHED src=40.1.1.1 dst=10.1.1.2 sport=5555 dport=12865 zone-orig=1
                                 src=10.1.1.2 dst=10.1.1.1 sport=12865 dport=1024
                     [ASSURED] mark=1 secctx=system_u:object_r:unlabeled_t:s0 use=1
      
        tcp 6 431994 ESTABLISHED src=40.1.1.1 dst=10.1.1.2 sport=5555 dport=12865 zone-orig=2
                                 src=10.1.1.2 dst=10.1.1.1 sport=12865 dport=5555
                     [ASSURED] mark=2 secctx=system_u:object_r:unlabeled_t:s0 use=1
      
        tcp 6 299 ESTABLISHED src=40.1.1.1 dst=10.1.1.2 sport=39438 dport=33768 zone-orig=1
                              src=10.1.1.2 dst=10.1.1.1 sport=33768 dport=39438
                     [ASSURED] mark=1 secctx=system_u:object_r:unlabeled_t:s0 use=1
      
        tcp 6 300 ESTABLISHED src=40.1.1.1 dst=10.1.1.2 sport=32889 dport=40206 zone-orig=2
                              src=10.1.1.2 dst=10.1.1.1 sport=40206 dport=32889
                     [ASSURED] mark=2 secctx=system_u:object_r:unlabeled_t:s0 use=2
      
      Taking this further, the test script in [2] creates 200 tenants, each
      running netperf sessions with colliding original tuples. A conntrack -L
      dump in the gateway netns also confirms 200 overlapping entries, all in
      the ESTABLISHED state as expected.
      
      I also ran various other tests with some permutations of the script,
      to mention a few: SNAT in random/random-fully/persistent mode, no zones (no
      overlaps), static zones (original, reply, both directions), etc.
      
        [1] http://thread.gmane.org/gmane.comp.security.firewalls.netfilter.devel/57412/
        [2] https://paste.fedoraproject.org/242835/65657871/
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • net: sch_generic: react upon IFF_NO_QUEUE flag · 4b469955
      Committed by Phil Sutter
      Handle IFF_NO_QUEUE as an alternative to tx_queue_len being zero.
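      A minimal sketch of the intent; the helper name is illustrative, the
      real change touches the default-qdisc attach paths in sch_generic.c:

        /* treat the flag exactly like a zero tx_queue_len when deciding
         * whether a device should get the lightweight noqueue discipline */
        static bool dev_wants_noqueue(const struct net_device *dev)
        {
                return (dev->priv_flags & IFF_NO_QUEUE) || dev->tx_queue_len == 0;
        }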
      Signed-off-by: Phil Sutter <phil@nwl.cc>
      Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  20. 11 Aug, 2015 (1 commit)
  21. 04 Aug, 2015 (1 commit)
    • act_bpf: properly support late binding of bpf action to a classifier · a5c90b29
      Committed by Daniel Borkmann
      Since the introduction of the BPF action in d23b8ad8 ("tc: add BPF
      based action"), late binding has not been working as expected. That is,
      setting the action part for a classifier only via 'bpf index <num>', where
      <num> is the index of an existing action, is rejected by the kernel due
      to other missing parameters.
      
      It doesn't make sense to require these parameters, such as BPF opcodes,
      as they are not going to be used anyway: in this case, they're just
      allocated/parsed and then freed again without doing anything meaningful.
      
      Instead, parse and verify the remaining parameters *after* the test on
      tcf_hash_check(), when we really know that we're dealing with creation
      of a new action or replacement of an existing one and where late binding
      is thus irrelevant.
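      A simplified sketch of the reordered init path; attribute and helper
      signatures are trimmed for illustration, only tcf_hash_check()/
      tcf_hash_create() and the TCA_ACT_BPF_* attributes come from the real
      code:

        static int tcf_bpf_init_sketch(struct nlattr **tb, struct tc_action *act,
                                       int bind)
        {
                struct tc_act_bpf *parm = nla_data(tb[TCA_ACT_BPF_PARMS]);

                if (tcf_hash_check(parm->index, act, bind))
                        return 0;       /* late binding: reuse the existing action,
                                         * no bytecode/fd attributes required */

                /* only creation/replacement needs an actual BPF program */
                if (!tb[TCA_ACT_BPF_OPS] && !tb[TCA_ACT_BPF_FD])
                        return -EINVAL;

                /* ... parse and verify the program, then tcf_hash_create() ... */
                return 0;
        }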
      
      After the patch, the test case is now working:
      
        FOO="1,6 0 0 4294967295,"
        tc actions add action bpf bytecode "$FOO"
        tc filter add dev foo parent 1: bpf bytecode "$FOO" flowid 1:1 action bpf index 1
        tc actions show action bpf
          action order 0: bpf bytecode '1,6 0 0 4294967295' default-action pipe
          index 1 ref 2 bind 1
        tc filter show dev foo
          filter protocol all pref 49152 bpf
          filter protocol all pref 49152 bpf handle 0x1 flowid 1:1 bytecode '1,6 0 0 4294967295'
          action order 1: bpf bytecode '1,6 0 0 4294967295' default-action pipe
          index 1 ref 2 bind 1
      
      Late binding of a BPF action can be useful for preloading maps (e.g. before
      they hit traffic) in case of eBPF programs, or to share a single eBPF action
      with multiple classifiers.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>