1. 02 6月, 2015 2 次提交
  2. 01 6月, 2015 1 次提交
  3. 31 5月, 2015 1 次提交
  4. 28 5月, 2015 1 次提交
  5. 27 5月, 2015 1 次提交
  6. 26 5月, 2015 1 次提交
  7. 22 5月, 2015 3 次提交
    • F
      net: sched: pkt_cls: remove unused macros from uapi · bd5850d3
      Florian Westphal 提交于
      Jamal points out that this header also contains kernel internal magic that
      cannot be used from userspace for anything meaningful.
      
      Lets remove what the kernel doesn't use anymore and wrap remainder with
      __KERNEL__.
      Suggested-by: NJamal Hadi Salim <jhs@mojatatu.com>
      Suggested-by: NAlexei Starovoitov <alexei.starovoitov@gmail.com>
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NJamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      bd5850d3
    • M
      tcp: add tcpi_segs_in and tcpi_segs_out to tcp_info · 2efd055c
      Marcelo Ricardo Leitner 提交于
      This patch tracks the total number of inbound and outbound segments on a
      TCP socket. One may use this number to have an idea on connection
      quality when compared against the retransmissions.
      
      RFC4898 named these : tcpEStatsPerfSegsIn and tcpEStatsPerfSegsOut
      
      These are a 32bit field each and can be fetched both from TCP_INFO
      getsockopt() if one has a handle on a TCP socket, or from inet_diag
      netlink facility (iproute2/ss patch will follow)
      
      Note that tp->segs_out was placed near tp->snd_nxt for good data
      locality and minimal performance impact, while tp->segs_in was placed
      near tp->bytes_received for the same reason.
      
      Join work with Eric Dumazet.
      
      Note that received SYN are accounted on the listener, but sent SYNACK
      are not accounted.
      Signed-off-by: NMarcelo Ricardo Leitner <mleitner@redhat.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2efd055c
    • A
      bpf: allow bpf programs to tail-call other bpf programs · 04fd61ab
      Alexei Starovoitov 提交于
      introduce bpf_tail_call(ctx, &jmp_table, index) helper function
      which can be used from BPF programs like:
      int bpf_prog(struct pt_regs *ctx)
      {
        ...
        bpf_tail_call(ctx, &jmp_table, index);
        ...
      }
      that is roughly equivalent to:
      int bpf_prog(struct pt_regs *ctx)
      {
        ...
        if (jmp_table[index])
          return (*jmp_table[index])(ctx);
        ...
      }
      The important detail that it's not a normal call, but a tail call.
      The kernel stack is precious, so this helper reuses the current
      stack frame and jumps into another BPF program without adding
      extra call frame.
      It's trivially done in interpreter and a bit trickier in JITs.
      In case of x64 JIT the bigger part of generated assembler prologue
      is common for all programs, so it is simply skipped while jumping.
      Other JITs can do similar prologue-skipping optimization or
      do stack unwind before jumping into the next program.
      
      bpf_tail_call() arguments:
      ctx - context pointer
      jmp_table - one of BPF_MAP_TYPE_PROG_ARRAY maps used as the jump table
      index - index in the jump table
      
      Since all BPF programs are idenitified by file descriptor, user space
      need to populate the jmp_table with FDs of other BPF programs.
      If jmp_table[index] is empty the bpf_tail_call() doesn't jump anywhere
      and program execution continues as normal.
      
      New BPF_MAP_TYPE_PROG_ARRAY map type is introduced so that user space can
      populate this jmp_table array with FDs of other bpf programs.
      Programs can share the same jmp_table array or use multiple jmp_tables.
      
      The chain of tail calls can form unpredictable dynamic loops therefore
      tail_call_cnt is used to limit the number of calls and currently is set to 32.
      
      Use cases:
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      
      ==========
      - simplify complex programs by splitting them into a sequence of small programs
      
      - dispatch routine
        For tracing and future seccomp the program may be triggered on all system
        calls, but processing of syscall arguments will be different. It's more
        efficient to implement them as:
        int syscall_entry(struct seccomp_data *ctx)
        {
           bpf_tail_call(ctx, &syscall_jmp_table, ctx->nr /* syscall number */);
           ... default: process unknown syscall ...
        }
        int sys_write_event(struct seccomp_data *ctx) {...}
        int sys_read_event(struct seccomp_data *ctx) {...}
        syscall_jmp_table[__NR_write] = sys_write_event;
        syscall_jmp_table[__NR_read] = sys_read_event;
      
        For networking the program may call into different parsers depending on
        packet format, like:
        int packet_parser(struct __sk_buff *skb)
        {
           ... parse L2, L3 here ...
           __u8 ipproto = load_byte(skb, ... offsetof(struct iphdr, protocol));
           bpf_tail_call(skb, &ipproto_jmp_table, ipproto);
           ... default: process unknown protocol ...
        }
        int parse_tcp(struct __sk_buff *skb) {...}
        int parse_udp(struct __sk_buff *skb) {...}
        ipproto_jmp_table[IPPROTO_TCP] = parse_tcp;
        ipproto_jmp_table[IPPROTO_UDP] = parse_udp;
      
      - for TC use case, bpf_tail_call() allows to implement reclassify-like logic
      
      - bpf_map_update_elem/delete calls into BPF_MAP_TYPE_PROG_ARRAY jump table
        are atomic, so user space can build chains of BPF programs on the fly
      
      Implementation details:
      =======================
      - high performance of bpf_tail_call() is the goal.
        It could have been implemented without JIT changes as a wrapper on top of
        BPF_PROG_RUN() macro, but with two downsides:
        . all programs would have to pay performance penalty for this feature and
          tail call itself would be slower, since mandatory stack unwind, return,
          stack allocate would be done for every tailcall.
        . tailcall would be limited to programs running preempt_disabled, since
          generic 'void *ctx' doesn't have room for 'tail_call_cnt' and it would
          need to be either global per_cpu variable accessed by helper and by wrapper
          or global variable protected by locks.
      
        In this implementation x64 JIT bypasses stack unwind and jumps into the
        callee program after prologue.
      
      - bpf_prog_array_compatible() ensures that prog_type of callee and caller
        are the same and JITed/non-JITed flag is the same, since calling JITed
        program from non-JITed is invalid, since stack frames are different.
        Similarly calling kprobe type program from socket type program is invalid.
      
      - jump table is implemented as BPF_MAP_TYPE_PROG_ARRAY to reuse 'map'
        abstraction, its user space API and all of verifier logic.
        It's in the existing arraymap.c file, since several functions are
        shared with regular array map.
      Signed-off-by: NAlexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      04fd61ab
  8. 20 5月, 2015 1 次提交
  9. 16 5月, 2015 1 次提交
  10. 15 5月, 2015 1 次提交
  11. 14 5月, 2015 5 次提交
    • P
      netfilter: add netfilter ingress hook after handle_ing() under unique static key · e687ad60
      Pablo Neira 提交于
      This patch adds the Netfilter ingress hook just after the existing tc ingress
      hook, that seems to be the consensus solution for this.
      
      Note that the Netfilter hook resides under the global static key that enables
      ingress filtering. Nonetheless, Netfilter still also has its own static key for
      minimal impact on the existing handle_ing().
      
      * Without this patch:
      
      Result: OK: 6216490(c6216338+d152) usec, 100000000 (60byte,0frags)
        16086246pps 7721Mb/sec (7721398080bps) errors: 100000000
      
          42.46%  kpktgend_0   [kernel.kallsyms]   [k] __netif_receive_skb_core
          25.92%  kpktgend_0   [kernel.kallsyms]   [k] kfree_skb
           7.81%  kpktgend_0   [pktgen]            [k] pktgen_thread_worker
           5.62%  kpktgend_0   [kernel.kallsyms]   [k] ip_rcv
           2.70%  kpktgend_0   [kernel.kallsyms]   [k] netif_receive_skb_internal
           2.34%  kpktgend_0   [kernel.kallsyms]   [k] netif_receive_skb_sk
           1.44%  kpktgend_0   [kernel.kallsyms]   [k] __build_skb
      
      * With this patch:
      
      Result: OK: 6214833(c6214731+d101) usec, 100000000 (60byte,0frags)
        16090536pps 7723Mb/sec (7723457280bps) errors: 100000000
      
          41.23%  kpktgend_0      [kernel.kallsyms]  [k] __netif_receive_skb_core
          26.57%  kpktgend_0      [kernel.kallsyms]  [k] kfree_skb
           7.72%  kpktgend_0      [pktgen]           [k] pktgen_thread_worker
           5.55%  kpktgend_0      [kernel.kallsyms]  [k] ip_rcv
           2.78%  kpktgend_0      [kernel.kallsyms]  [k] netif_receive_skb_internal
           2.06%  kpktgend_0      [kernel.kallsyms]  [k] netif_receive_skb_sk
           1.43%  kpktgend_0      [kernel.kallsyms]  [k] __build_skb
      
      * Without this patch + tc ingress:
      
              tc filter add dev eth4 parent ffff: protocol ip prio 1 \
                      u32 match ip dst 4.3.2.1/32
      
      Result: OK: 9269001(c9268821+d179) usec, 100000000 (60byte,0frags)
        10788648pps 5178Mb/sec (5178551040bps) errors: 100000000
      
          40.99%  kpktgend_0   [kernel.kallsyms]  [k] __netif_receive_skb_core
          17.50%  kpktgend_0   [kernel.kallsyms]  [k] kfree_skb
          11.77%  kpktgend_0   [cls_u32]          [k] u32_classify
           5.62%  kpktgend_0   [kernel.kallsyms]  [k] tc_classify_compat
           5.18%  kpktgend_0   [pktgen]           [k] pktgen_thread_worker
           3.23%  kpktgend_0   [kernel.kallsyms]  [k] tc_classify
           2.97%  kpktgend_0   [kernel.kallsyms]  [k] ip_rcv
           1.83%  kpktgend_0   [kernel.kallsyms]  [k] netif_receive_skb_internal
           1.50%  kpktgend_0   [kernel.kallsyms]  [k] netif_receive_skb_sk
           0.99%  kpktgend_0   [kernel.kallsyms]  [k] __build_skb
      
      * With this patch + tc ingress:
      
              tc filter add dev eth4 parent ffff: protocol ip prio 1 \
                      u32 match ip dst 4.3.2.1/32
      
      Result: OK: 9308218(c9308091+d126) usec, 100000000 (60byte,0frags)
        10743194pps 5156Mb/sec (5156733120bps) errors: 100000000
      
          42.01%  kpktgend_0   [kernel.kallsyms]   [k] __netif_receive_skb_core
          17.78%  kpktgend_0   [kernel.kallsyms]   [k] kfree_skb
          11.70%  kpktgend_0   [cls_u32]           [k] u32_classify
           5.46%  kpktgend_0   [kernel.kallsyms]   [k] tc_classify_compat
           5.16%  kpktgend_0   [pktgen]            [k] pktgen_thread_worker
           2.98%  kpktgend_0   [kernel.kallsyms]   [k] ip_rcv
           2.84%  kpktgend_0   [kernel.kallsyms]   [k] tc_classify
           1.96%  kpktgend_0   [kernel.kallsyms]   [k] netif_receive_skb_internal
           1.57%  kpktgend_0   [kernel.kallsyms]   [k] netif_receive_skb_sk
      
      Note that the results are very similar before and after.
      
      I can see gcc gets the code under the ingress static key out of the hot path.
      Then, on that cold branch, it generates the code to accomodate the netfilter
      ingress static key. My explanation for this is that this reduces the pressure
      on the instruction cache for non-users as the new code is out of the hot path,
      and it comes with minimal impact for tc ingress users.
      
      Using gcc version 4.8.4 on:
      
      Architecture:          x86_64
      CPU op-mode(s):        32-bit, 64-bit
      Byte Order:            Little Endian
      CPU(s):                8
      [...]
      L1d cache:             16K
      L1i cache:             64K
      L2 cache:              2048K
      L3 cache:              8192K
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      Acked-by: NAlexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e687ad60
    • J
      geneve: add initial netdev driver for GENEVE tunnels · 2d07dc79
      John W. Linville 提交于
      This is an initial implementation of a netdev driver for GENEVE
      tunnels.  This implementation uses a fixed UDP port, and only supports
      point-to-point links with specific partner endpoints.  Only IPv4
      links are supported at this time.
      Signed-off-by: NJohn W. Linville <linville@tuxdriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2d07dc79
    • W
      packet: rollover statistics · a9b63918
      Willem de Bruijn 提交于
      Rollover indicates exceptional conditions. Export a counter to inform
      socket owners of this state.
      
      If no socket with sufficient room is found, rollover fails. Also count
      these events.
      
      Finally, also count when flows are rolled over early thanks to huge
      flow detection, to validate its correctness.
      
      Tested:
        Read counters in bench_rollover on all other tests in the patchset
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a9b63918
    • J
      tc: introduce Flower classifier · 77b9900e
      Jiri Pirko 提交于
      This patch introduces a flow-based filter. So far, the very essential
      packet fields are supported.
      
      This patch is only the first step. There is a lot of potential performance
      improvements possible to implement. Also a lot of features are missing
      now. They will be addressed in follow-up patches.
      Signed-off-by: NJiri Pirko <jiri@resnulli.us>
      Acked-by: NJamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      77b9900e
    • F
      net: sched: use counter to break reclassify loops · e578d9c0
      Florian Westphal 提交于
      Seems all we want here is to avoid endless 'goto reclassify' loop.
      tc_classify_compat even resets this counter when something other
      than TC_ACT_RECLASSIFY is returned, so this skb-counter doesn't
      break hypothetical loops induced by something other than perpetual
      TC_ACT_RECLASSIFY return values.
      
      skb_act_clone is now identical to skb_clone, so just use that.
      
      Tested with following (bogus) filter:
      tc filter add dev eth0 parent ffff: \
       protocol ip u32 match u32 0 0 police rate 10Kbit burst \
       64000 mtu 1500 action reclassify
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Acked-by: NAlexei Starovoitov <ast@plumgrid.com>
      Acked-by: NJamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e578d9c0
  12. 13 5月, 2015 1 次提交
    • D
      net_sched: gred: add TCA_GRED_LIMIT attribute · a3eb95f8
      David Ward 提交于
      In a GRED qdisc, if the default "virtual queue" (VQ) does not have drop
      parameters configured, then packets for the default VQ are not subjected
      to RED and are only dropped if the queue is larger than the net_device's
      tx_queue_len. This behavior is useful for WRED mode, since these packets
      will still influence the calculated average queue length and (therefore)
      the drop probability for all of the other VQs. However, for some drivers
      tx_queue_len is zero. In other cases the user may wish to make the limit
      the same for all VQs (including the default VQ with no drop parameters).
      
      This change adds a TCA_GRED_LIMIT attribute to set the GRED queue limit,
      in bytes, during qdisc setup. (This limit is in bytes to be consistent
      with the drop parameters.) The default limit is the same as for a bfifo
      queue (tx_queue_len * psched_mtu). If the drop parameters of any VQ are
      configured with a smaller limit than the GRED queue limit, that VQ will
      still observe the smaller limit instead.
      Signed-off-by: NDavid Ward <david.ward@ll.mit.edu>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a3eb95f8
  13. 11 5月, 2015 2 次提交
    • A
      bonding: add netlink support for sys prio, actor sys mac, and port key · 171a42c3
      Andy Gospodarek 提交于
      Adds netlink support for the following bonding options:
      * BOND_OPT_AD_ACTOR_SYS_PRIO
      * BOND_OPT_AD_ACTOR_SYSTEM
      * BOND_OPT_AD_USER_PORT_KEY
      
      When setting the actor system mac address we assume the netlink message
      contains a binary mac and not a string representation of a mac.
      Signed-off-by: NAndy Gospodarek <gospo@cumulusnetworks.com>
      [jt: completed the setting side of the netlink attributes]
      Signed-off-by: NJonathan Toppins <jtoppins@cumulusnetworks.com>
      Signed-off-by: NNikolay Aleksandrov <razor@blackwall.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      171a42c3
    • E
      codel: add ce_threshold attribute · 80ba92fa
      Eric Dumazet 提交于
      For DCTCP or similar ECN based deployments on fabrics with shallow
      buffers, hosts are responsible for a good part of the buffering.
      
      This patch adds an optional ce_threshold to codel & fq_codel qdiscs,
      so that DCTCP can have feedback from queuing in the host.
      
      A DCTCP enabled egress port simply have a queue occupancy threshold
      above which ECT packets get CE mark.
      
      In codel language this translates to a sojourn time, so that one doesn't
      have to worry about bytes or bandwidth but delays.
      
      This makes the host an active participant in the health of the whole
      network.
      
      This also helps experimenting DCTCP in a setup without DCTCP compliant
      fabric.
      
      On following example, ce_threshold is set to 1ms, and we can see from
      'ldelay xxx us' that TCP is not trying to go around the 5ms codel
      target.
      
      Queue has more capacity to absorb inelastic bursts (say from UDP
      traffic), as queues are maintained to an optimal level.
      
      lpaa23:~# ./tc -s -d qd sh dev eth1
      qdisc mq 1: dev eth1 root
       Sent 87910654696 bytes 58065331 pkt (dropped 0, overlimits 0 requeues 42961)
       backlog 3108242b 364p requeues 42961
      qdisc codel 8063: dev eth1 parent 1:1 limit 1000p target 5.0ms ce_threshold 1.0ms interval 100.0ms
       Sent 7363778701 bytes 4863809 pkt (dropped 0, overlimits 0 requeues 5503)
       rate 2348Mbit 193919pps backlog 255866b 46p requeues 5503
        count 0 lastcount 0 ldelay 1.0ms drop_next 0us
        maxpacket 68130 ecn_mark 0 drop_overlimit 0 ce_mark 72384
      qdisc codel 8064: dev eth1 parent 1:2 limit 1000p target 5.0ms ce_threshold 1.0ms interval 100.0ms
       Sent 7636486190 bytes 5043942 pkt (dropped 0, overlimits 0 requeues 5186)
       rate 2319Mbit 191538pps backlog 207418b 64p requeues 5186
        count 0 lastcount 0 ldelay 694us drop_next 0us
        maxpacket 68130 ecn_mark 0 drop_overlimit 0 ce_mark 69873
      qdisc codel 8065: dev eth1 parent 1:3 limit 1000p target 5.0ms ce_threshold 1.0ms interval 100.0ms
       Sent 11569360142 bytes 7641602 pkt (dropped 0, overlimits 0 requeues 5554)
       rate 3041Mbit 251096pps backlog 210446b 59p requeues 5554
        count 0 lastcount 0 ldelay 889us drop_next 0us
        maxpacket 68130 ecn_mark 0 drop_overlimit 0 ce_mark 37780
      ...
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Florian Westphal <fw@strlen.de>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Glenn Judd <glenn.judd@morganstanley.com>
      Cc: Nandita Dukkipati <nanditad@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      80ba92fa
  14. 10 5月, 2015 3 次提交
  15. 06 5月, 2015 4 次提交
  16. 05 5月, 2015 1 次提交
  17. 03 5月, 2015 1 次提交
  18. 02 5月, 2015 1 次提交
  19. 30 4月, 2015 4 次提交
  20. 24 4月, 2015 1 次提交
  21. 22 4月, 2015 2 次提交
  22. 21 4月, 2015 1 次提交
    • M
      KVM: PPC: Book3S HV: Add fast real-mode H_RANDOM implementation. · e928e9cb
      Michael Ellerman 提交于
      Some PowerNV systems include a hardware random-number generator.
      This HWRNG is present on POWER7+ and POWER8 chips and is capable of
      generating one 64-bit random number every microsecond.  The random
      numbers are produced by sampling a set of 64 unstable high-frequency
      oscillators and are almost completely entropic.
      
      PAPR defines an H_RANDOM hypercall which guests can use to obtain one
      64-bit random sample from the HWRNG.  This adds a real-mode
      implementation of the H_RANDOM hypercall.  This hypercall was
      implemented in real mode because the latency of reading the HWRNG is
      generally small compared to the latency of a guest exit and entry for
      all the threads in the same virtual core.
      
      Userspace can detect the presence of the HWRNG and the H_RANDOM
      implementation by querying the KVM_CAP_PPC_HWRNG capability.  The
      H_RANDOM hypercall implementation will only be invoked when the guest
      does an H_RANDOM hypercall if userspace first enables the in-kernel
      H_RANDOM implementation using the KVM_CAP_PPC_ENABLE_HCALL capability.
      Signed-off-by: NMichael Ellerman <michael@ellerman.id.au>
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      e928e9cb
  23. 20 4月, 2015 1 次提交
    • A
      target: Version 2 of TCMU ABI · 0ad46af8
      Andy Grover 提交于
      The initial version of TCMU (in 3.18) does not properly handle
      bidirectional SCSI commands -- those with both an in and out buffer. In
      looking to fix this it also became clear that TCMU's support for adding
      new types of entries (opcodes) to the command ring was broken. We need
      to fix this now, so that future issues can be handled properly by adding
      new opcodes.
      
      We make the most of this ABI break by enabling bidi cmd handling within
      TCMP_OP_CMD opcode. Add an iov_bidi_cnt field to tcmu_cmd_entry.req.
      This enables TCMU to describe bidi commands, but further kernel work is
      needed for full bidi support.
      
      Enlarge tcmu_cmd_entry_hdr by 32 bits by pulling in cmd_id and __pad1. Turn
      __pad1 into two 8 bit flags fields, for kernel-set and userspace-set flags,
      "kflags" and "uflags" respectively.
      
      Update version fields so userspace can tell the interface is changed.
      
      Update tcmu-design.txt with details of how new stuff works:
      - Specify an additional requirement for userspace to set UNKNOWN_OP
        (bit 0) in hdr.uflags for unknown/unhandled opcodes.
      - Define how Data-In and Data-Out fields are described in req.iov[]
      
      Changed in v2:
      - Change name of SKIPPED bit to UNKNOWN bit
      - PAD op does not set the bit any more
      - Change len_op helper functions to take just len_op, not the whole struct
      - Change version to 2 in missed spots, and use defines
      - Add 16 unused bytes to cmd_entry.req, in case additional SAM cmd
        parameters need to be included
      - Add iov_dif_cnt field to specify buffers used for DIF info in iov[]
      - Rearrange fields to naturally align cdb_off
      - Handle if userspace sets UNKNOWN_OP by indicating failure of the cmd
      - Wrap some overly long UPDATE_HEAD lines
      
      (Add missing req.iov_bidi_cnt + req.iov_dif_cnt zeroing - Ilias)
      Signed-off-by: NAndy Grover <agrover@redhat.com>
      Reviewed-by: NIlias Tsitsimpis <iliastsi@arrikto.com>
      Signed-off-by: NNicholas Bellinger <nab@linux-iscsi.org>
      0ad46af8