1. 31 7月, 2018 1 次提交
    • A
      bpf: Support bpf_get_socket_cookie in more prog types · d692f113
      Andrey Ignatov 提交于
      bpf_get_socket_cookie() helper can be used to identify skb that
      correspond to the same socket.
      
      Though socket cookie can be useful in many other use-cases where socket is
      available in program context. Specifically BPF_PROG_TYPE_CGROUP_SOCK_ADDR
      and BPF_PROG_TYPE_SOCK_OPS programs can benefit from it so that one of
      them can augment a value in a map prepared earlier by other program for
      the same socket.
      
      The patch adds support to call bpf_get_socket_cookie() from
      BPF_PROG_TYPE_CGROUP_SOCK_ADDR and BPF_PROG_TYPE_SOCK_OPS.
      
      It doesn't introduce new helpers. Instead it reuses same helper name
      bpf_get_socket_cookie() but adds support to this helper to accept
      `struct bpf_sock_addr` and `struct bpf_sock_ops`.
      
      Documentation in bpf.h is changed in a way that should not break
      automatic generation of markdown.
      Signed-off-by: NAndrey Ignatov <rdna@fb.com>
      Acked-by: NYonghong Song <yhs@fb.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      d692f113
  2. 25 7月, 2018 2 次提交
    • N
      net/sched: add skbprio scheduler · aea5f654
      Nishanth Devarajan 提交于
      Skbprio (SKB Priority Queue) is a queueing discipline that prioritizes packets
      according to their skb->priority field. Under congestion, already-enqueued lower
      priority packets will be dropped to make space available for higher priority
      packets. Skbprio was conceived as a solution for denial-of-service defenses that
      need to route packets with different priorities as a means to overcome DoS
      attacks.
      
      v5
      *Do not reference qdisc_dev(sch)->tx_queue_len for setting limit. Instead set
      default sch->limit to 64.
      
      v4
      *Drop Documentation/networking/sch_skbprio.txt doc file to move it to tc man
      page for Skbprio, in iproute2.
      
      v3
      *Drop max_limit parameter in struct skbprio_sched_data and instead use
      sch->limit.
      
      *Reference qdisc_dev(sch)->tx_queue_len only once, during initialisation for
      qdisc (previously being referenced every time qdisc changes).
      
      *Move qdisc's detailed description from in-code to Documentation/networking.
      
      *When qdisc is saturated, enqueue incoming packet first before dequeueing
      lowest priority packet in queue - improves usage of call stack registers.
      
      *Introduce and use overlimit stat to keep track of number of dropped packets.
      
      v2
      *Use skb->priority field rather than DS field. Rename queueing discipline as
      SKB Priority Queue (previously Gatekeeper Priority Queue).
      
      *Queueing discipline is made classful to expose Skbprio's internal priority
      queues.
      Signed-off-by: NNishanth Devarajan <ndev2021@gmail.com>
      Reviewed-by: NSachin Paryani <sachin.paryani@gmail.com>
      Reviewed-by: NCody Doucette <doucette@bu.edu>
      Reviewed-by: NMichel Machado <michel@digirati.com.br>
      Acked-by: NCong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      aea5f654
    • H
      net: phy: add GBit master / slave error detection · b8f8c8eb
      Heiner Kallweit 提交于
      Certain PHY's have issues when operating in GBit slave mode and can
      be forced to master mode. Examples are RTL8211C, also the Micrel PHY
      driver has a DT setting to force master mode.
      If two such chips are link partners the autonegotiation will fail.
      Standard defines a self-clearing on read, latched-high bit to
      indicate this error. Check this bit to inform the user.
      Signed-off-by: NHeiner Kallweit <hkallweit1@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b8f8c8eb
  3. 24 7月, 2018 4 次提交
    • K
      rds: Extend RDS API for IPv6 support · b7ff8b10
      Ka-Cheong Poon 提交于
      There are many data structures (RDS socket options) used by RDS apps
      which use a 32 bit integer to store IP address. To support IPv6,
      struct in6_addr needs to be used. To ensure backward compatibility, a
      new data structure is introduced for each of those data structures
      which use a 32 bit integer to represent an IP address. And new socket
      options are introduced to use those new structures. This means that
      existing apps should work without a problem with the new RDS module.
      For apps which want to use IPv6, those new data structures and socket
      options can be used. IPv4 mapped address is used to represent IPv4
      address in the new data structures.
      
      v4: Revert changes to SO_RDS_TRANSPORT
      Signed-off-by: NKa-Cheong Poon <ka-cheong.poon@oracle.com>
      Acked-by: NSantosh Shilimkar <santosh.shilimkar@oracle.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b7ff8b10
    • J
      net: sched: introduce chain object to uapi · 32a4f5ec
      Jiri Pirko 提交于
      Allow user to create, destroy, get and dump chain objects. Do that by
      extending rtnl commands by the chain-specific ones. User will now be
      able to explicitly create or destroy chains (so far this was done only
      automatically according the filter/act needs and refcounting). Also, the
      user will receive notification about any chain creation or destuction.
      Signed-off-by: NJiri Pirko <jiri@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      32a4f5ec
    • K
      net/smc: provide smc mode in smc_diag.c · c601171d
      Karsten Graul 提交于
      Rename field diag_fallback into diag_mode and set the smc mode of a
      connection explicitly.
      Signed-off-by: NKarsten Graul <kgraul@linux.ibm.com>
      Signed-off-by: NUrsula Braun <ubraun@linux.ibm.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c601171d
    • N
      net: bridge: add support for backup port · 2756f68c
      Nikolay Aleksandrov 提交于
      This patch adds a new port attribute - IFLA_BRPORT_BACKUP_PORT, which
      allows to set a backup port to be used for known unicast traffic if the
      port has gone carrier down. The backup pointer is rcu protected and set
      only under RTNL, a counter is maintained so when deleting a port we know
      how many other ports reference it as a backup and we remove it from all.
      Also the pointer is in the first cache line which is hot at the time of
      the check and thus in the common case we only add one more test.
      The backup port will be used only for the non-flooding case since
      it's a part of the bridge and the flooded packets will be forwarded to it
      anyway. To remove the forwarding just send a 0/non-existing backup port.
      This is used to avoid numerous scalability problems when using MLAG most
      notably if we have thousands of fdbs one would need to change all of them
      on port carrier going down which takes too long and causes a storm of fdb
      notifications (and again when the port comes back up). In a Multi-chassis
      Link Aggregation setup usually hosts are connected to two different
      switches which act as a single logical switch. Those switches usually have
      a control and backup link between them called peerlink which might be used
      for communication in case a host loses connectivity to one of them.
      We need a fast way to failover in case a host port goes down and currently
      none of the solutions (like bond) cannot fulfill the requirements because
      the participating ports are actually the "master" devices and must have the
      same peerlink as their backup interface and at the same time all of them
      must participate in the bridge device. As Roopa noted it's normal practice
      in routing called fast re-route where a precalculated backup path is used
      when the main one is down.
      Another use case of this is with EVPN, having a single vxlan device which
      is backup of every port. Due to the nature of master devices it's not
      currently possible to use one device as a backup for many and still have
      all of them participate in the bridge (which is master itself).
      More detailed information about MLAG is available at the link below.
      https://docs.cumulusnetworks.com/display/DOCS/Multi-Chassis+Link+Aggregation+-+MLAG
      
      Further explanation and a diagram by Roopa:
      Two switches acting in a MLAG pair are connected by the peerlink
      interface which is a bridge port.
      
      the config on one of the switches looks like the below. The other
      switch also has a similar config.
      eth0 is connected to one port on the server. And the server is
      connected to both switches.
      
      br0 -- team0---eth0
            |
            -- switch-peerlink
      Signed-off-by: NNikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2756f68c
  4. 20 7月, 2018 2 次提交
  5. 18 7月, 2018 2 次提交
  6. 17 7月, 2018 1 次提交
  7. 15 7月, 2018 1 次提交
    • A
      bpf: Add BPF_SOCK_OPS_TCP_LISTEN_CB · f333ee0c
      Andrey Ignatov 提交于
      Add new TCP-BPF callback that is called on listen(2) right after socket
      transition to TCP_LISTEN state.
      
      It fills the gap for listening sockets in TCP-BPF. For example BPF
      program can set BPF_SOCK_OPS_STATE_CB_FLAG when socket becomes listening
      and track later transition from TCP_LISTEN to TCP_CLOSE with
      BPF_SOCK_OPS_STATE_CB callback.
      
      Before there was no way to do it with TCP-BPF and other options were
      much harder to work with. E.g. socket state tracking can be done with
      tracepoints (either raw or regular) but they can't be attached to cgroup
      and their lifetime has to be managed separately.
      Signed-off-by: NAndrey Ignatov <rdna@fb.com>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      f333ee0c
  8. 14 7月, 2018 4 次提交
  9. 13 7月, 2018 5 次提交
  10. 11 7月, 2018 5 次提交
    • T
      sched: Add Common Applications Kept Enhanced (cake) qdisc · 046f6fd5
      Toke Høiland-Jørgensen 提交于
      sch_cake targets the home router use case and is intended to squeeze the
      most bandwidth and latency out of even the slowest ISP links and routers,
      while presenting an API simple enough that even an ISP can configure it.
      
      Example of use on a cable ISP uplink:
      
      tc qdisc add dev eth0 cake bandwidth 20Mbit nat docsis ack-filter
      
      To shape a cable download link (ifb and tc-mirred setup elided)
      
      tc qdisc add dev ifb0 cake bandwidth 200mbit nat docsis ingress wash
      
      CAKE is filled with:
      
      * A hybrid Codel/Blue AQM algorithm, "Cobalt", tied to an FQ_Codel
        derived Flow Queuing system, which autoconfigures based on the bandwidth.
      * A novel "triple-isolate" mode (the default) which balances per-host
        and per-flow FQ even through NAT.
      * An deficit based shaper, that can also be used in an unlimited mode.
      * 8 way set associative hashing to reduce flow collisions to a minimum.
      * A reasonable interpretation of various diffserv latency/loss tradeoffs.
      * Support for zeroing diffserv markings for entering and exiting traffic.
      * Support for interacting well with Docsis 3.0 shaper framing.
      * Extensive support for DSL framing types.
      * Support for ack filtering.
      * Extensive statistics for measuring, loss, ecn markings, latency
        variation.
      
      A paper describing the design of CAKE is available at
      https://arxiv.org/abs/1804.07617, and will be published at the 2018 IEEE
      International Symposium on Local and Metropolitan Area Networks (LANMAN).
      
      This patch adds the base shaper and packet scheduler, while subsequent
      commits add the optional (configurable) features. The full userspace API
      and most data structures are included in this commit, but options not
      understood in the base version will be ignored.
      
      Various versions baking have been available as an out of tree build for
      kernel versions going back to 3.10, as the embedded router world has been
      running a few years behind mainline Linux. A stable version has been
      generally available on lede-17.01 and later.
      
      sch_cake replaces a combination of iptables, tc filter, htb and fq_codel
      in the sqm-scripts, with sane defaults and vastly simpler configuration.
      
      CAKE's principal author is Jonathan Morton, with contributions from
      Kevin Darbyshire-Bryant, Toke Høiland-Jørgensen, Sebastian Moeller,
      Ryan Mounce, Tony Ambardar, Dean Scarff, Nils Andreas Svee, Dave Täht,
      and Loganaden Velvindron.
      
      Testing from Pete Heist, Georgios Amanakis, and the many other members of
      the cake@lists.bufferbloat.net mailing list.
      
      tc -s qdisc show dev eth2
       qdisc cake 8017: root refcnt 2 bandwidth 1Gbit diffserv3 triple-isolate split-gso rtt 100.0ms noatm overhead 38 mpu 84
       Sent 51504294511 bytes 37724591 pkt (dropped 6, overlimits 64958695 requeues 12)
        backlog 0b 0p requeues 12
        memory used: 1053008b of 15140Kb
        capacity estimate: 970Mbit
        min/max network layer size:           28 /    1500
        min/max overhead-adjusted size:       84 /    1538
        average network hdr offset:           14
                          Bulk  Best Effort        Voice
         thresh      62500Kbit        1Gbit      250Mbit
         target          5.0ms        5.0ms        5.0ms
         interval      100.0ms      100.0ms      100.0ms
         pk_delay          5us          5us          6us
         av_delay          3us          2us          2us
         sp_delay          2us          1us          1us
         backlog            0b           0b           0b
         pkts          3164050     25030267      9530280
         bytes      3227519915  35396974782  12879808898
         way_inds            0            8            0
         way_miss           21          366           25
         way_cols            0            0            0
         drops               5            0            1
         marks               0            0            0
         ack_drop            0            0            0
         sp_flows            1            3            0
         bk_flows            0            1            1
         un_flows            0            0            0
         max_len         68130        68130        68130
      Tested-by: NPete Heist <peteheist@gmail.com>
      Tested-by: NGeorgios Amanakis <gamanakis@gmail.com>
      Signed-off-by: NDave Taht <dave.taht@gmail.com>
      Signed-off-by: NToke Høiland-Jørgensen <toke@toke.dk>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      046f6fd5
    • M
      rseq: Remove unused types_32_64.h uapi header · 4f4c0acd
      Mathieu Desnoyers 提交于
      This header was introduced in the 4.18 merge window, and rseq does
      not need it anymore. Nuke it before the final release.
      Signed-off-by: NMathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: linux-api@vger.kernel.org
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Dave Watson <davejwatson@fb.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: "H . Peter Anvin" <hpa@zytor.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Chris Lameter <cl@linux.com>
      Cc: Ben Maurer <bmaurer@fb.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Link: https://lkml.kernel.org/r/20180709195155.7654-6-mathieu.desnoyers@efficios.com
      4f4c0acd
    • M
      rseq: uapi: Declare rseq_cs field as union, update includes · ec9c82e0
      Mathieu Desnoyers 提交于
      Declaring the rseq_cs field as a union between __u64 and two __u32
      allows both 32-bit and 64-bit kernels to read the full __u64, and
      therefore validate that a 32-bit user-space cleared the upper 32
      bits, thus ensuring a consistent behavior between native 32-bit
      kernels and 32-bit compat tasks on 64-bit kernels.
      
      Check that the rseq_cs value read is < TASK_SIZE.
      
      The asm/byteorder.h header needs to be included by rseq.h, now
      that it is not using linux/types_32_64.h anymore.
      
      Considering that only __32 and __u64 types are declared in linux/rseq.h,
      the linux/types.h header should always be included for both kernel and
      user-space code: including stdint.h is just for u64 and u32, which are
      not used in this header at all.
      
      Use copy_from_user()/clear_user() to interact with a 64-bit field,
      because arm32 does not implement 64-bit __get_user, and ppc32 does not
      64-bit get_user. Considering that the rseq_cs pointer does not need to
      be loaded/stored with single-copy atomicity from the kernel anymore, we
      can simply use copy_from_user()/clear_user().
      Signed-off-by: NMathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: linux-api@vger.kernel.org
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Dave Watson <davejwatson@fb.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: "H . Peter Anvin" <hpa@zytor.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Chris Lameter <cl@linux.com>
      Cc: Ben Maurer <bmaurer@fb.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Link: https://lkml.kernel.org/r/20180709195155.7654-5-mathieu.desnoyers@efficios.com
      ec9c82e0
    • M
      rseq: uapi: Update uapi comments · 0fb9a1ab
      Mathieu Desnoyers 提交于
      Update rseq uapi header comments to reflect that user-space need to do
      thread-local loads/stores from/to the struct rseq fields.
      
      As a consequence of this added requirement, the kernel does not need
      to perform loads/stores with single-copy atomicity.
      
      Update the comment associated to the "flags" fields to describe
      more accurately that it's only useful to facilitate single-stepping
      through rseq critical sections with debuggers.
      Signed-off-by: NMathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: linux-api@vger.kernel.org
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Dave Watson <davejwatson@fb.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: "H . Peter Anvin" <hpa@zytor.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Chris Lameter <cl@linux.com>
      Cc: Ben Maurer <bmaurer@fb.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Link: https://lkml.kernel.org/r/20180709195155.7654-4-mathieu.desnoyers@efficios.com
      0fb9a1ab
    • M
      rseq: Use __u64 for rseq_cs fields, validate user inputs · e96d7135
      Mathieu Desnoyers 提交于
      Change the rseq ABI so rseq_cs start_ip, post_commit_offset and abort_ip
      fields are seen as 64-bit fields by both 32-bit and 64-bit kernels rather
      that ignoring the 32 upper bits on 32-bit kernels. This ensures we have a
      consistent behavior for a 32-bit binary executed on 32-bit kernels and in
      compat mode on 64-bit kernels.
      
      Validating the value of abort_ip field to be below TASK_SIZE ensures the
      kernel don't return to an invalid address when returning to userspace
      after an abort. I don't fully trust each architecture code to consistently
      deal with invalid return addresses.
      
      Validating the value of the start_ip and post_commit_offset fields
      prevents overflow on arithmetic performed on those values, used to
      check whether abort_ip is within the rseq critical section.
      
      If validation fails, the process is killed with a segmentation fault.
      
      When the signature encountered before abort_ip does not match the expected
      signature, return -EINVAL rather than -EPERM to be consistent with other
      input validation return codes from rseq_get_rseq_cs().
      Signed-off-by: NMathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: linux-api@vger.kernel.org
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Dave Watson <davejwatson@fb.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: "H . Peter Anvin" <hpa@zytor.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Chris Lameter <cl@linux.com>
      Cc: Ben Maurer <bmaurer@fb.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Link: https://lkml.kernel.org/r/20180709195155.7654-2-mathieu.desnoyers@efficios.com
      e96d7135
  11. 10 7月, 2018 1 次提交
  12. 08 7月, 2018 1 次提交
  13. 07 7月, 2018 1 次提交
  14. 05 7月, 2018 4 次提交
  15. 04 7月, 2018 6 次提交
    • J
      net/sched: Make etf report drops on error_queue · 4b15c707
      Jesus Sanchez-Palencia 提交于
      Use the socket error queue for reporting dropped packets if the
      socket has enabled that feature through the SO_TXTIME API.
      
      Packets are dropped either on enqueue() if they aren't accepted by the
      qdisc or on dequeue() if the system misses their deadline. Those are
      reported as different errors so applications can react accordingly.
      
      Userspace can retrieve the errors through the socket error queue and the
      corresponding cmsg interfaces. A struct sock_extended_err* is used for
      returning the error data, and the packet's timestamp can be retrieved by
      adding both ee_data and ee_info fields as e.g.:
      
          ((__u64) serr->ee_data << 32) + serr->ee_info
      
      This feature is disabled by default and must be explicitly enabled by
      applications. Enabling it can bring some overhead for the Tx cycles
      of the application.
      Signed-off-by: NJesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4b15c707
    • J
      net/sched: Add HW offloading capability to ETF · 88cab771
      Jesus Sanchez-Palencia 提交于
      Add infra so etf qdisc supports HW offload of time-based transmission.
      
      For hw offload, the time sorted list is still used, so packets are
      dequeued always in order of txtime.
      
      Example:
      
      $ tc qdisc replace dev enp2s0 parent root handle 100 mqprio num_tc 3 \
                 map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@2 hw 0
      
      $ tc qdisc add dev enp2s0 parent 100:1 etf offload delta 100000 \
      	   clockid CLOCK_REALTIME
      
      In this example, the Qdisc will use HW offload for the control of the
      transmission time through the network adapter. The hrtimer used for
      packets scheduling inside the qdisc will use the clockid CLOCK_REALTIME
      as reference and packets leave the Qdisc "delta" (100000) nanoseconds
      before their transmission time. Because this will be using HW offload and
      since dynamic clocks are not supported by the hrtimer, the system clock
      and the PHC clock must be synchronized for this mode to behave as
      expected.
      Signed-off-by: NJesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      88cab771
    • V
      net/sched: Introduce the ETF Qdisc · 25db26a9
      Vinicius Costa Gomes 提交于
      The ETF (Earliest TxTime First) qdisc uses the information added
      earlier in this series (the socket option SO_TXTIME and the new
      role of sk_buff->tstamp) to schedule packets transmission based
      on absolute time.
      
      For some workloads, just bandwidth enforcement is not enough, and
      precise control of the transmission of packets is necessary.
      
      Example:
      
      $ tc qdisc replace dev enp2s0 parent root handle 100 mqprio num_tc 3 \
                 map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@2 hw 0
      
      $ tc qdisc add dev enp2s0 parent 100:1 etf delta 100000 \
                 clockid CLOCK_TAI
      
      In this example, the Qdisc will provide SW best-effort for the control
      of the transmission time to the network adapter, the time stamp in the
      socket will be in reference to the clockid CLOCK_TAI and packets
      will leave the qdisc "delta" (100000) nanoseconds before its transmission
      time.
      
      The ETF qdisc will buffer packets sorted by their txtime. It will drop
      packets on enqueue() if their skbuff clockid does not match the clock
      reference of the Qdisc. Moreover, on dequeue(), a packet will be dropped
      if it expires while being enqueued.
      
      The qdisc also supports the SO_TXTIME deadline mode. For this mode, it
      will dequeue a packet as soon as possible and change the skb timestamp
      to 'now' during etf_dequeue().
      
      Note that both the qdisc's and the SO_TXTIME ABIs allow for a clockid
      to be configured, but it's been decided that usage of CLOCK_TAI should
      be enforced until we decide to allow for other clockids to be used.
      The rationale here is that PTP times are usually in the TAI scale, thus
      no other clocks should be necessary. For now, the qdisc will return
      EINVAL if any clocks other than CLOCK_TAI are used.
      Signed-off-by: NJesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
      Signed-off-by: NVinicius Costa Gomes <vinicius.gomes@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      25db26a9
    • R
      net: Add a new socket option for a future transmit time. · 80b14dee
      Richard Cochran 提交于
      This patch introduces SO_TXTIME. User space enables this option in
      order to pass a desired future transmit time in a CMSG when calling
      sendmsg(2). The argument to this socket option is a 8-bytes long struct
      provided by the uapi header net_tstamp.h defined as:
      
      struct sock_txtime {
      	clockid_t 	clockid;
      	u32		flags;
      };
      
      Note that new fields were added to struct sock by filling a 2-bytes
      hole found in the struct. For that reason, neither the struct size or
      number of cachelines were altered.
      Signed-off-by: NRichard Cochran <rcochran@linutronix.de>
      Signed-off-by: NJesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      80b14dee
    • Q
      net:sched: add action inheritdsfield to skbedit · e7e3728b
      Qiaobin Fu 提交于
      The new action inheritdsfield copies the field DS of
      IPv4 and IPv6 packets into skb->priority. This enables
      later classification of packets based on the DS field.
      
      v5:
      *Update the drop counter for TC_ACT_SHOT
      
      v4:
      *Not allow setting flags other than the expected ones.
      
      *Allow dumping the pure flags.
      
      v3:
      *Use optional flags, so that it won't break old versions of tc.
      
      *Allow users to set both SKBEDIT_F_PRIORITY and SKBEDIT_F_INHERITDSFIELD flags.
      
      v2:
      *Fix the style issue
      
      *Move the code from skbmod to skbedit
      
      Original idea by Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: NQiaobin Fu <qiaobinf@bu.edu>
      Reviewed-by: NMichel Machado <michel@digirati.com.br>
      Acked-by: NJamal Hadi Salim <jhs@mojatatu.com>
      Reviewed-by: NMarcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Acked-by: NDavide Caratti <dcaratti@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e7e3728b
    • X
      sctp: add spp_ipv6_flowlabel and spp_dscp for sctp_paddrparams · 0b0dce7a
      Xin Long 提交于
      spp_ipv6_flowlabel and spp_dscp are added in sctp_paddrparams in
      this patch so that users could set sctp_sock/asoc/transport dscp
      and flowlabel with spp_flags SPP_IPV6_FLOWLABEL or SPP_DSCP by
      SCTP_PEER_ADDR_PARAMS , as described section 8.1.12 in RFC6458.
      
      As said in last patch, it uses '| 0x100000' or '|0x1' to mark
      flowlabel or dscp is set,  so that their values could be set
      to 0.
      
      Note that to guarantee that an old app built with old kernel
      headers could work on the newer kernel, the param's check in
      sctp_g/setsockopt_peer_addr_params() is also improved, which
      follows the way that sctp_g/setsockopt_delayed_ack() or some
      other sockopts' process that accept two types of params does.
      Signed-off-by: NXin Long <lucien.xin@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0b0dce7a