1. 11 Jan, 2016 25 commits
    • net, sched: add clsact qdisc · 1f211a1b
      Daniel Borkmann committed
      This work adds a generalization of the ingress qdisc as a qdisc holding
      only classifiers. The clsact qdisc works on ingress, but also on egress.
      In both cases, its execution happens without taking the qdisc lock, and
      the main difference for the egress part compared to the prior version of [1]
      is that this can be applied with _any_ underlying real egress qdisc (also
      classless ones).
      
      Besides solving the use-case of [1], that is, allowing for more programmability
      on assigning skb->priority for the mqprio case that is supported by most
      popular 10G+ NICs, it also opens up a lot more flexibility for other tc
      applications. The main work on classification can already be done at clsact
      egress time if the use-case allows and state stored for later retrieval
      f.e. again in skb->priority with major/minors (which is checked by most
      classful qdiscs before consulting tc_classify()) and/or in other skb fields
      like skb->tc_index for some light-weight post-processing to get to the
      eventual classid in case of a classful qdisc. Another use case is that
      the clsact egress part allows to have a central egress counterpart to
      the ingress classifiers, so that classifiers can easily share state (e.g.
      in cls_bpf via eBPF maps) for ingress and egress.
      
      Currently, default setups like mq + pfifo_fast would require for this to
      use, for example, prio qdisc instead (to get a tc_classify() run) and to
      duplicate the egress classifier for each queue. With clsact, it allows
      for leaving the setup as is, it can additionally assign skb->priority to
      put the skb in one of pfifo_fast's bands and it can share state with maps.
      Moreover, we can access the skb's dst entry (f.e. to retrieve tclassid)
      w/o the need to perform a skb_dst_force() to hold on to it any longer. In
      lwt case, we can also use this facility to setup dst metadata via cls_bpf
      (bpf_skb_set_tunnel_key()) without needing a real egress qdisc just for
      that (case of IFF_NO_QUEUE devices, for example).
      
      The realization can be done without any changes to the scheduler core
      framework. All it takes is that we have two a-priori defined minors/child
      classes, where we can mux between ingress and egress classifier list
      (dev->ingress_cl_list and dev->egress_cl_list, latter stored close to
      dev->_tx to avoid extra cacheline miss for moderate loads). The egress
      part is modelled somewhat similarly to handle_ing() and patched to a noop in
      case the functionality is not used. Both handlers are now called
      sch_handle_ingress() and sch_handle_egress(). Code sharing among the two
      doesn't seem practical as there are various minor differences in both
      paths, so that making them conditional in a single handler would rather
      slow things down.
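
As an illustration of the muxing described above (not code from the patch), the sketch below selects a classifier list from the minor part of a class ID. The handle macros and the values 0xFFFFFFF1, 0xFFF2 and 0xFFF3 are reproduced from memory of linux/pkt_sched.h and should be treated as assumptions; `mock_dev`, `filter_list` and `clsact_pick_list()` are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

/* tc handles: upper 16 bits are the major, lower 16 bits the minor. */
#define TC_H_MAJ_MASK 0xFFFF0000U
#define TC_H_MIN_MASK 0x0000FFFFU
#define TC_H_MIN(h) ((h) & TC_H_MIN_MASK)
#define TC_H_MAKE(maj, min) (((maj) & TC_H_MAJ_MASK) | ((min) & TC_H_MIN_MASK))

/* Assumed values: clsact parent ffff:fff1 and its two fixed minors. */
#define TC_H_CLSACT      0xFFFFFFF1U
#define TC_H_MIN_INGRESS 0xFFF2U
#define TC_H_MIN_EGRESS  0xFFF3U

/* Hypothetical stand-ins for the per-device classifier lists. */
struct filter_list { const char *name; };
struct mock_dev {
    struct filter_list ingress_cl_list;
    struct filter_list egress_cl_list;
};

/* Mux between the two lists based on the minor of the class ID. */
static struct filter_list *clsact_pick_list(struct mock_dev *dev, uint32_t classid)
{
    switch (TC_H_MIN(classid)) {
    case TC_H_MIN_INGRESS:
        return &dev->ingress_cl_list;
    case TC_H_MIN_EGRESS:
        return &dev->egress_cl_list;
    default:
        return NULL; /* clsact only knows the two fixed minors */
    }
}
```

Any other minor yields no list, which mirrors the statement that clsact has exactly two fixed child classes.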
      
      Full compatibility to ingress qdisc is provided as well. Since both
      piggyback on TC_H_CLSACT, only one of them (ingress/clsact) can exist
      per netdevice, and thus ingress qdisc specific behaviour can be retained
      for user space. This means, either a user does 'tc qdisc add dev foo ingress'
      and configures ingress qdisc as usual, or the 'tc qdisc add dev foo clsact'
      alternative, where both ingress and egress classifiers can be configured
      as in the example below. The ingress qdisc supports attaching classifiers to any
      minor number whereas clsact has two fixed minors for muxing between the
      lists, therefore to not break user space setups, they are better done as
      two separate qdiscs.
      
      I decided to extend the sch_ingress module with clsact functionality so
      that commonly used code can be reused; the module is aliased as
      sch_clsact so that it can be auto-loaded properly. An alternative would have been
      to add a flag when initializing ingress to alter its behaviour plus aliasing
      to a different name (as it's more than just ingress). However, the first would
      end up, based on the flag, choosing the new/old behaviour by calling different
      function implementations to handle each anyway, and the latter would require
      registering the ingress qdisc once again under a different alias. So, this really begs
      for a minimal, cleaner approach: separate Qdisc_ops and Qdisc_class_ops
      that share the callbacks used by both.
      
      Example, adding qdisc:
      
         # tc qdisc add dev foo clsact
         # tc qdisc show dev foo
         qdisc mq 0: root
         qdisc pfifo_fast 0: parent :1 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
         qdisc pfifo_fast 0: parent :2 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
         qdisc pfifo_fast 0: parent :3 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
         qdisc pfifo_fast 0: parent :4 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
         qdisc clsact ffff: parent ffff:fff1
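
To make the priomap in the output above concrete, here is a minimal sketch of how a priority value selects one of the three pfifo_fast bands. The table is copied from the `tc qdisc show` output; masking with 15 (TC_PRIO_MAX) is an assumption about how the lookup is bounded, and `pfifo_fast_band()` is a hypothetical name.

```c
#include <stdint.h>

#define TC_PRIO_MAX 15

/* priomap as printed by tc: index is the priority, value is the band. */
static const uint8_t prio2band[TC_PRIO_MAX + 1] = {
    1, 2, 2, 2, 1, 2, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1
};

/* Map skb->priority to a band; band 0 is dequeued first. */
static unsigned int pfifo_fast_band(uint32_t skb_priority)
{
    return prio2band[skb_priority & TC_PRIO_MAX];
}
```

A classifier at clsact egress that sets the priority to 6 or 7 would therefore steer the packet into band 0 without replacing the mq + pfifo_fast setup.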
      
      Adding filters (deleting etc. works analogously by specifying ingress/egress):
      
         # tc filter add dev foo ingress bpf da obj bar.o sec ingress
         # tc filter add dev foo egress  bpf da obj bar.o sec egress
         # tc filter show dev foo ingress
         filter protocol all pref 49152 bpf
         filter protocol all pref 49152 bpf handle 0x1 bar.o:[ingress] direct-action
         # tc filter show dev foo egress
         filter protocol all pref 49152 bpf
         filter protocol all pref 49152 bpf handle 0x1 bar.o:[egress] direct-action
      
      A 'tc filter show dev foo' or 'tc filter show dev foo parent ffff:' will
      show an empty list for clsact. Either using the parent names (ingress/egress)
      or specifying the full major/minor will then show the related filter lists.
      
      Prior work on a mqprio prequeue() facility [1] was done mainly by John Fastabend.
      
        [1] http://patchwork.ozlabs.org/patch/512949/

      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: John Fastabend <john.r.fastabend@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ethernet: amd: au1000: Remove pointless warning · ede55997
      Andrew Lunn committed
      The warning about being able to read any MDIO device, not just the
      attached Ethernet device's PHY, applies to all MDIO drivers. So remove
      it. This also removes a reference to a member of phy_device which has
      moved.
      Signed-off-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • staging: netlogic: Fix build error due to missed API change · 3fe01e24
      Andrew Lunn committed
      Fix a number of build errors due to moving the phy_map and centralizing
      interrupt allocation.
      Signed-off-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ethernet: faraday: Use phy_find_first() instead of open coding it · e574f398
      Guenter Roeck committed
      Use phy_find_first() to find the first phy device instead of
      open coding it.
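
For readers unfamiliar with the open-coded pattern being replaced, the following is a rough user-space sketch of what a "find first PHY" helper does: scan the bus addresses in order and return the first device present. `mock_mii_bus`, `mock_phy` and `find_first_phy_sketch()` are hypothetical; the limit of 32 addresses is an assumption based on the MDIO address width.

```c
#include <stddef.h>

#define PHY_MAX_ADDR 32 /* assumed: 5-bit MDIO address space */

/* Hypothetical stand-ins for the kernel's bus and PHY objects. */
struct mock_phy { int addr; };
struct mock_mii_bus { struct mock_phy *phys[PHY_MAX_ADDR]; };

/* Return the PHY at the lowest populated address, or NULL if none. */
static struct mock_phy *find_first_phy_sketch(struct mock_mii_bus *bus)
{
    for (int addr = 0; addr < PHY_MAX_ADDR; addr++) {
        if (bus->phys[addr])
            return bus->phys[addr];
    }
    return NULL;
}
```

Centralizing this loop in one helper means drivers no longer depend on how the bus stores its devices.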
      
      Cc: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: Guenter Roeck <linux@roeck-us.net>
      Acked-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ethernet: broadcom: Fix build errors · ee64f08e
      Guenter Roeck committed
      Commit 7f854420 ("phy: Add API for {un}registering an mdio device to
      a bus") introduces an API to access mii_bus structures, but failed to
      update the sb1250 driver. This results in the following build error.
      
      drivers/net/ethernet/broadcom/sb1250-mac.c: In function 'sbmac_mii_probe':
      drivers/net/ethernet/broadcom/sb1250-mac.c:2360:24: error:
      	'struct mii_bus' has no member named 'phy_map'
      
      Use phy_find_first() instead of open coding it.
      
      Commit 2220943a ("phy: Centralise print about attached phy") introduces
      the following build error.
      
      drivers/net/ethernet/broadcom/sb1250-mac.c: In function 'sbmac_mii_probe':
      drivers/net/ethernet/broadcom/sb1250-mac.c:2383:20: error: 'phydev' undeclared
      
      Fixes: 7f854420 ("phy: Add API for {un}registering an mdio device to a bus")
      Fixes: 2220943a ("phy: Centralise print about attached phy")
      Cc: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: Guenter Roeck <linux@roeck-us.net>
      Acked-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'mdio-device-fixes' · 5c721d56
      David S. Miller committed
      Andrew Lunn says:
      
      ====================
      Fix breakage from mdio device
      
      These two patches fix MIPS platforms which got broken by
      the recent mdio device patchset.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ethernet-rgmii.c: Fix breakage from moving phdev bus · 0c129bf7
      Andrew Lunn committed
      The mdio device patches moved the bus member in phy_device into a
      substructure. This driver got missed. Fix it.
      Signed-off-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: lantiq_etop.c: Use helper to find first phy · 2a4fc4ea
      Andrew Lunn committed
      Make use of the helper to find the first phy device.
      This also fixes the compile breakage.
      Signed-off-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • stmmac: Don't exit mdio registration when mdio subnode is not found in the DTS · 6c672c9b
      Romain Perier committed
      Originally, most of the platforms using this driver did not define an mdio subnode
      in the devicetree. Commit e34d65 ("stmmac: create of compatible mdio bus for stmmac driver")
      introduced a backward compatibility issue by using of_mdiobus_register explicitly
      with an mdio subnode. This patch fixes the issue by calling the function
      mdiobus_register, when mdio subnode is not found. The driver is now compatible
      with both modes.
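
The fallback described above can be pictured with the following sketch. It is not the driver code: the `mock_*` functions merely stand in for of_mdiobus_register and mdiobus_register so the control flow can be run in user space, and all type and function names here are hypothetical.

```c
#include <stddef.h>

/* Hypothetical stand-ins for a device tree node and an MDIO bus. */
struct mock_node { const char *name; };
struct mock_bus { int registered_with_dt; int registered_plain; };

/* Stand-in for the device-tree-aware registration path. */
static int mock_of_mdiobus_register(struct mock_bus *bus, const struct mock_node *np)
{
    (void)np;
    bus->registered_with_dt = 1;
    return 0;
}

/* Stand-in for the plain registration path that probes the bus. */
static int mock_mdiobus_register(struct mock_bus *bus)
{
    bus->registered_plain = 1;
    return 0;
}

/* Use the DT path only when an mdio subnode exists; otherwise fall back. */
static int mdio_register_sketch(struct mock_bus *bus, const struct mock_node *mdio_node)
{
    if (mdio_node)
        return mock_of_mdiobus_register(bus, mdio_node);
    return mock_mdiobus_register(bus);
}
```

With this shape, older device trees without an mdio subnode keep working while newer ones take the device-tree path.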
      
      Fixes: e34d6569 ("stmmac: create of compatible mdio bus for stmmac driver")
      Signed-off-by: Romain Perier <romain.perier@gmail.com>
      Tested-by: Phil Reid <preid@electromag.com.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'bpf-next' · 749f7df1
      David S. Miller committed
      Daniel Borkmann says:
      
      ====================
      BPF update
      
      Fixes a csum issue on ingress. As mentioned previously, net-next
      seems just fine imho. Later on, will follow up with couple of
      replacements like ovs_skb_postpush_rcsum() etc.
      
      Thanks!
      
      v1 -> v2:
        - Added patch 1 with helper
        - Implemented Hannes' idea to just use csum_partial, thanks!
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: add skb_postpush_rcsum and fix dev_forward_skb occasions · f8ffad69
      Daniel Borkmann committed
      Add a small helper skb_postpush_rcsum() and fix up redirect locations
      that need CHECKSUM_COMPLETE fixups on ingress. dev_forward_skb() expects
      a proper csum that covers also Ethernet header, f.e. since 2c26d34b
      ("net/core: Handle csum for CHECKSUM_COMPLETE VXLAN forwarding"), we
      also do skb_postpull_rcsum() after pulling Ethernet header off via
      eth_type_trans().
      
      When using eBPF in a netns setup f.e. with vxlan in collect metadata mode,
      I can trigger the following csum issue with an IPv6 setup:
      
        [  505.144065] dummy1: hw csum failure
        [...]
        [  505.144108] Call Trace:
        [  505.144112]  <IRQ>  [<ffffffff81372f08>] dump_stack+0x44/0x5c
        [  505.144134]  [<ffffffff81607cea>] netdev_rx_csum_fault+0x3a/0x40
        [  505.144142]  [<ffffffff815fee3f>] __skb_checksum_complete+0xcf/0xe0
        [  505.144149]  [<ffffffff816f0902>] nf_ip6_checksum+0xb2/0x120
        [  505.144161]  [<ffffffffa08c0e0e>] icmpv6_error+0x17e/0x328 [nf_conntrack_ipv6]
        [  505.144170]  [<ffffffffa0898eca>] ? ip6t_do_table+0x2fa/0x645 [ip6_tables]
        [  505.144177]  [<ffffffffa08c0725>] ? ipv6_get_l4proto+0x65/0xd0 [nf_conntrack_ipv6]
        [  505.144189]  [<ffffffffa06c9a12>] nf_conntrack_in+0xc2/0x5a0 [nf_conntrack]
        [  505.144196]  [<ffffffffa08c039c>] ipv6_conntrack_in+0x1c/0x20 [nf_conntrack_ipv6]
        [  505.144204]  [<ffffffff8164385d>] nf_iterate+0x5d/0x70
        [  505.144210]  [<ffffffff816438d6>] nf_hook_slow+0x66/0xc0
        [  505.144218]  [<ffffffff816bd302>] ipv6_rcv+0x3f2/0x4f0
        [  505.144225]  [<ffffffff816bca40>] ? ip6_make_skb+0x1b0/0x1b0
        [  505.144232]  [<ffffffff8160b77b>] __netif_receive_skb_core+0x36b/0x9a0
        [  505.144239]  [<ffffffff8160bdc8>] ? __netif_receive_skb+0x18/0x60
        [  505.144245]  [<ffffffff8160bdc8>] __netif_receive_skb+0x18/0x60
        [  505.144252]  [<ffffffff8160ccff>] process_backlog+0x9f/0x140
        [  505.144259]  [<ffffffff8160c4a5>] net_rx_action+0x145/0x320
        [...]
      
      What happens is that on ingress, we push Ethernet header back in, either
      from cls_bpf or right before skb_do_redirect(), but without updating csum.
      The "hw csum failure" can be fixed by using the new skb_postpush_rcsum()
      helper for the dev_forward_skb() case to correct the csum diff again.
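
The arithmetic behind the fix can be sketched in user space. The code below is a simplified illustration, not the kernel helper: it keeps the running checksum as a folded 16-bit one's-complement sum over big-endian words, whereas the kernel works with a 32-bit partial sum. All `*_sketch` names are hypothetical. Pulling a header subtracts its contribution; pushing it back must add it again, which is the step that was missing.

```c
#include <stddef.h>
#include <stdint.h>

/* One's-complement addition with end-around carry. */
static uint16_t csum_add16(uint16_t a, uint16_t b)
{
    uint32_t s = (uint32_t)a + b;
    return (uint16_t)((s & 0xFFFFU) + (s >> 16));
}

/* Add the 16-bit big-endian words of buf to a running sum. */
static uint16_t csum_partial_sketch(const uint8_t *buf, size_t len, uint16_t sum)
{
    size_t i;

    for (i = 0; i + 1 < len; i += 2)
        sum = csum_add16(sum, (uint16_t)((buf[i] << 8) | buf[i + 1]));
    if (i < len) /* odd trailing byte is padded with zero */
        sum = csum_add16(sum, (uint16_t)(buf[i] << 8));
    return sum;
}

/* After pulling a header: remove its contribution from the sum. */
static uint16_t postpull_rcsum_sketch(uint16_t csum, const uint8_t *hdr, size_t len)
{
    uint16_t h = csum_partial_sketch(hdr, len, 0);
    return csum_add16(csum, (uint16_t)~h);
}

/* After pushing the header back: add its contribution again. */
static uint16_t postpush_rcsum_sketch(uint16_t csum, const uint8_t *hdr, size_t len)
{
    return csum_partial_sketch(hdr, len, csum);
}
```

The sketch relies on the header having an even length (14 bytes for Ethernet), so the payload words keep their alignment in the sum.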
      
      Thanks to Hannes Frederic Sowa for the csum_partial() idea!
      
      Fixes: 3896d655 ("bpf: introduce bpf_clone_redirect() helper")
      Fixes: 27b29f63 ("bpf: add bpf_redirect() helper")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net, sched: add skb_at_tc_ingress helper · fdc5432a
      Daniel Borkmann committed
      Add a skb_at_tc_ingress() as this will be needed elsewhere as well and
      can hide the ugly ifdef.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'tcp-keepalive-namespaceify' · 4156afaf
      David S. Miller committed
      Nikolay Borisov says:
      
      ====================
      Namespaceify tcp keepalive machinery
      
      The following patch series enables the tcp keepalive mechanism
      to be configured per net namespace. This is especially useful
      if you have multiple containers hosted on one node and one of
      them is under DoS. In such situations, one thing which could
      be done is to configure the tcp keepalive settings such that
      connections for that particular container are reset
      faster.
      
      Another scenario where not being able to control those knobs
      per container is problematic occurs when the value of
      net.netfilter.nf_conntrack_tcp_timeout_established is set
      below the keepalive interval. In such situations the server won't
      send an RST packet, resulting in applications not trying to
      reconnect and in stale connections waiting. Changing the global
      keepalive value is a possible solution but it might interfere
      with other containers.
      
      The three patches gradually convert each of the affected knobs
      to be per netns. I thought it would be easier to review than
      putting everything in one patch. If people deem it more appropriate
      to squash everything in one patch (maybe after review) I'd
      be more than happy to do it.
      
      The patches have been compile-tested on 4.4 and functionally
      tested on 3.12 and they work as expected.
      
      These are based off 4.4-rc8
      ====================
      Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv4: Namespecify the tcp_keepalive_intvl sysctl knob · b840d15d
      Nikolay Borisov committed
      This is the final part required to namespaceify the tcp
      keep alive mechanism.
      Signed-off-by: Nikolay Borisov <kernel@kyup.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv4: Namespecify tcp_keepalive_probes sysctl knob · 9bd6861b
      Nikolay Borisov committed
      This is required to have full tcp keepalive mechanism namespace
      support.
      Signed-off-by: Nikolay Borisov <kernel@kyup.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv4: Namespaceify tcp_keepalive_time sysctl knob · 13b287e8
      Nikolay Borisov committed
      Different net namespaces might have different requirements as to
      the keepalive time of tcp sockets. This might be required in cases
      where different firewall rules are in place which require tcp
      timeout sockets to be increased/decreased independently of the host.
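
A minimal sketch of what per-namespace keepalive knobs mean in practice: each namespace carries its own three values, and a socket uses its own override if set, otherwise the value of the namespace it lives in. The structures and function names are hypothetical; the defaults of 7200 s, 9 probes and 75 s are the commonly documented Linux values and are assumed here.

```c
/* Hypothetical per-namespace copy of the three keepalive sysctls. */
struct netns_sketch {
    int keepalive_time;   /* seconds of idle time before the first probe */
    int keepalive_probes; /* unanswered probes before giving up */
    int keepalive_intvl;  /* seconds between probes */
};

/* Hypothetical socket: 0 means "no per-socket override". */
struct sock_sketch {
    const struct netns_sketch *net;
    int keepalive_time;
    int keepalive_probes;
    int keepalive_intvl;
};

static void netns_init_defaults(struct netns_sketch *net)
{
    net->keepalive_time = 7200;
    net->keepalive_probes = 9;
    net->keepalive_intvl = 75;
}

static int keepalive_time_when(const struct sock_sketch *sk)
{
    return sk->keepalive_time ? sk->keepalive_time : sk->net->keepalive_time;
}

static int keepalive_probes_of(const struct sock_sketch *sk)
{
    return sk->keepalive_probes ? sk->keepalive_probes : sk->net->keepalive_probes;
}

static int keepalive_intvl_when(const struct sock_sketch *sk)
{
    return sk->keepalive_intvl ? sk->keepalive_intvl : sk->net->keepalive_intvl;
}

/* Seconds until an idle, unresponsive connection is declared dead. */
static int keepalive_dead_after(const struct sock_sketch *sk)
{
    return keepalive_time_when(sk) +
           keepalive_probes_of(sk) * keepalive_intvl_when(sk);
}
```

Lowering the values in one container's namespace then shortens detection time for that container only, leaving the host and other containers on their own settings.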
      Signed-off-by: Nikolay Borisov <kernel@kyup.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'mlxsw-layer2-multicast' · d3517f19
      David S. Miller committed
      Jiri Pirko says:
      
      ====================
      mlxsw: Adding layer 2 multicast
      
      Elad says:
      
      This patchset adds Linux hardware reflection for L2 multicast offload and adds
      MC support in mlxsw. For every bridge MDB entry insertion, either by IGMP
      snooping or by static insertion/removal, a switchdev op is called.
      In mlxsw, a new multicast group (MID) is created and ports are assigned.
      When all ports are removed, the multicast group is deleted.
      
      ---
      v1->v2:
      - GFP_ATOMIC->GFP_KERNEL change in patch 7/8
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • switchdev: Adding IGMP snooping documentation · 4f5590f8
      Elad Raz committed
      Signed-off-by: Elad Raz <eladr@mellanox.com>
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: Adding layer 2 multicast support · 3a49b4fd
      Elad Raz committed
      Add SWITCHDEV_OBJ_ID_PORT_MDB switchdev ops support. The first MDB insertion
      creates a new multicast group (MID) and adds the member ports to the MID. Also
      add a new MDB entry for the flooding domain (fid-vid) and link the MDB entry
      to the newly constructed MC group.
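
The group lifecycle can be sketched as follows: the first port added for a given (MAC, FID) pair allocates a group, further ports join it, and removing the last port frees it. This is an illustration only, with a fixed-size table and a 64-bit port mask chosen for brevity; none of the names or limits come from the driver.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_MIDS 8 /* hypothetical table size */

/* Hypothetical multicast group: key is (mac, fid), members are a bitmask. */
struct mid_sketch {
    bool in_use;
    uint8_t mac[6];
    uint16_t fid;
    uint64_t port_mask;
};

static struct mid_sketch mids[MAX_MIDS];

static struct mid_sketch *mid_find(const uint8_t mac[6], uint16_t fid)
{
    for (int i = 0; i < MAX_MIDS; i++) {
        if (mids[i].in_use && mids[i].fid == fid &&
            memcmp(mids[i].mac, mac, 6) == 0)
            return &mids[i];
    }
    return NULL;
}

/* Add a port; the group is created on first insertion. */
static int mdb_port_add(const uint8_t mac[6], uint16_t fid, unsigned int port)
{
    struct mid_sketch *mid;

    if (port >= 64)
        return -1;
    mid = mid_find(mac, fid);
    if (!mid) {
        for (int i = 0; i < MAX_MIDS && !mid; i++) {
            if (!mids[i].in_use)
                mid = &mids[i];
        }
        if (!mid)
            return -1; /* table full */
        mid->in_use = true;
        memcpy(mid->mac, mac, 6);
        mid->fid = fid;
        mid->port_mask = 0;
    }
    mid->port_mask |= UINT64_C(1) << port;
    return 0;
}

/* Remove a port; the group is deleted once no ports remain. */
static int mdb_port_del(const uint8_t mac[6], uint16_t fid, unsigned int port)
{
    struct mid_sketch *mid = mid_find(mac, fid);

    if (!mid || port >= 64)
        return -1;
    mid->port_mask &= ~(UINT64_C(1) << port);
    if (mid->port_mask == 0)
        mid->in_use = false;
    return 0;
}
```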
      Signed-off-by: Elad Raz <eladr@mellanox.com>
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: Adding VID to FID translatation · e4b6f693
      Elad Raz committed
      Adding a generic function that translates VID to FID.
      Signed-off-by: Elad Raz <eladr@mellanox.com>
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: reg: Adding SMID register · fabe5483
      Elad Raz committed
      Adding back SMID register definition and packing. For each MC group a new
      SMID entry will be generated.
      Signed-off-by: Elad Raz <eladr@mellanox.com>
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: reg: Add definition of multicast record for SFD register · 5230b25f
      Elad Raz committed
      Multicast-related records have a specific format in the SFD register.
      Signed-off-by: Elad Raz <eladr@mellanox.com>
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bridge: Reflect MDB entries to hardware · f1fecb1d
      Elad Raz committed
      Offload MDB changes per port to hardware
      Signed-off-by: Elad Raz <eladr@mellanox.com>
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • switchdev: Adding MDB entry offload · 4d41e125
      Elad Raz committed
      Define HW multicast entry: MAC and VID.
      Using a MAC address simplifies support for both IPv4 and IPv6.
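
To illustrate why a MAC-plus-VID key covers both address families, the sketch below derives the Ethernet multicast MAC from an IPv4 group (prefix 01:00:5e plus the low 23 bits) and from an IPv6 group (prefix 33:33 plus the last four bytes), following the standard mappings. The structure and function names are hypothetical and not taken from the switchdev API.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical hardware multicast entry: MAC and VLAN ID. */
struct mdb_entry_sketch {
    uint8_t mac[6];
    uint16_t vid;
};

/* IPv4 group (host byte order) to MAC: 01:00:5e + low 23 bits. */
static void ipv4_mc_to_mac(uint32_t group, uint8_t mac[6])
{
    mac[0] = 0x01;
    mac[1] = 0x00;
    mac[2] = 0x5e;
    mac[3] = (uint8_t)((group >> 16) & 0x7f);
    mac[4] = (uint8_t)((group >> 8) & 0xff);
    mac[5] = (uint8_t)(group & 0xff);
}

/* IPv6 group (16 bytes, network order) to MAC: 33:33 + last 4 bytes. */
static void ipv6_mc_to_mac(const uint8_t addr[16], uint8_t mac[6])
{
    mac[0] = 0x33;
    mac[1] = 0x33;
    memcpy(&mac[2], &addr[12], 4);
}

/* Build one entry type regardless of the address family. */
static struct mdb_entry_sketch mdb_entry_from_ipv4(uint32_t group, uint16_t vid)
{
    struct mdb_entry_sketch e;

    ipv4_mc_to_mac(group, e.mac);
    e.vid = vid;
    return e;
}
```

Both families end up as the same kind of key, so the hardware-facing object needs no family-specific fields.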
      Signed-off-by: Elad Raz <eladr@mellanox.com>
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 10 Jan, 2016 4 commits
  3. 09 Jan, 2016 11 commits