1. 03 Dec, 2016 24 commits
  2. 02 Dec, 2016 16 commits
    • sock: reset sk_err for ICMP packets read from error queue · 83a1a1a7
      Committed by Soheil Hassas Yeganeh
      sk_err is set only when ICMP packets are enqueued onto the error
      queue. Before f5f99309 (sock: do not set sk_err
      in sock_dequeue_err_skb), a subsequent error queue read
      would set sk_err to the next error on the queue, or 0 if empty.
      As no error types other than ICMP set this field, sk_err should
      not be modified upon dequeuing them.
      
      Only for ICMP errors, reset the (racy) sk_err. Some applications,
      like traceroute, rely on it and go into a futile busy POLLERR
      loop otherwise.
      
      In principle, sk_err has to be set while an ICMP error is queued.
      Testing is_icmp_err_skb(skb_next) approximates this without
      requiring a full queue walk. Applications that receive both ICMP
      and other errors cannot rely on this legacy behavior, as other
      errors do not set sk_err in the first place.
      
      Fixes: f5f99309 (sock: do not set sk_err in sock_dequeue_err_skb)
      Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Maciej Żenczykowski <maze@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
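
The dequeue rule described in the commit above can be modelled in user space. This is a minimal sketch, not the kernel code: `ErrSkb` and `Sock` are hypothetical stand-ins, and the logic follows one reading of the message (peek at the next queued entry; if it is an ICMP error, carry its error into `sk_err`, otherwise clear `sk_err` only when the entry just read was itself an ICMP error).

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class ErrSkb:
    """Hypothetical stand-in for an skb on the socket error queue."""
    is_icmp: bool  # True for ICMP-originated errors
    errno: int     # error code carried by the entry


class Sock:
    """Toy socket: an error queue plus the sk_err field."""

    def __init__(self):
        self.err_queue = deque()
        self.sk_err = 0

    def enqueue_err(self, skb):
        self.err_queue.append(skb)
        # Only ICMP errors set sk_err when they are queued.
        if skb.is_icmp:
            self.sk_err = skb.errno

    def dequeue_err(self):
        if not self.err_queue:
            return None
        skb = self.err_queue.popleft()
        nxt = self.err_queue[0] if self.err_queue else None
        icmp_next = nxt is not None and nxt.is_icmp
        if icmp_next:
            # Another ICMP error is pending: expose it through sk_err.
            self.sk_err = nxt.errno
        elif skb.is_icmp:
            # The ICMP error was consumed and none follows directly: reset.
            self.sk_err = 0
        # Non-ICMP entries leave sk_err untouched.
        return skb
```

Checking only the next entry, rather than walking the whole queue, is the approximation the message mentions: an ICMP error queued behind a non-ICMP one is not reflected in `sk_err` until it reaches the head.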
    • Merge branch 'lwt-bpf' · f577e22c
      Committed by David S. Miller
      Thomas Graf says:
      
      ====================
      bpf: BPF for lightweight tunnel encapsulation
      
      This series implements BPF program invocation from dst entries via the
      lightweight tunnels infrastructure. The BPF program can be attached to
      lwtunnel_input(), lwtunnel_output() or lwtunnel_xmit() and see an L3
      skb as context. Programs attached to input and output are read-only.
      Programs attached to lwtunnel_xmit() can modify packets, push headers,
      and redirect packets.
      
      The facility can be used to:
       - Collect statistics and generate sampling data for a subset of traffic
         based on the dst utilized by the packet thus allowing to extend the
         existing realms.
       - Apply additional per route/dst filters to prohibit certain outgoing
         or incoming packets based on BPF filters. In particular, this allows
         to maintain per dst custom state across multiple packets in BPF maps
         and apply filters based on statistics and behaviour observed over time.
       - Attachment of L2 headers at transmit where resolving the L2 address
         is not required.
       - Possibly many more.
      
      v3 -> v4:
       - Bumped LWT_BPF_MAX_HEADROOM from 128 to 256 (Alexei)
       - Renamed bpf_skb_push() helper to bpf_skb_change_head() to relate to
         existing bpf_skb_change_tail() helper (Alexei/Daniel)
       - Added check in __bpf_redirect_common() to verify that program added a
         link header before redirecting to a l2 device. Adding the check to
         lwt-bpf code was considered but dropped due to massive code required
         due to retrieval of net_device via per-cpu redirect buffer. A test
         case was added to cover the scenario when a program directs to an l2
         device without adding an appropriate l2 header.
         (Alexei)
       - Prohibited access to tc_classid (Daniel)
       - Collapsed bpf_verifier_ops instance for lwt in/out as they are
         identical (Daniel)
       - Some cosmetic changes
      
      v2 -> v3:
       - Added real world sample lwt_len_hist_kern.c which demonstrates how to
         collect a histogram on packet sizes for all packets flowing through
         a number of routes.
       - Restricted output to be read-only. Since the header can no longer
         be modified, the rerouting functionality has been removed again.
       - Added test cases which cover destructive modification of packet data.
      
      v1 -> v2:
       - Added new BPF_LWT_REROUTE return code for program to indicate
         that new route lookup should be performed. Suggested by Tom.
       - New sample to illustrate rerouting
       - New patch 05: Recursion limit for lwtunnel_output for the case
         when user creates circular dst redirection. Also resolves the
         issue for ILA.
       - Fix to ensure headroom for potential future L2 header is still
         guaranteed
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: Add tests and samples for LWT-BPF · f74599f7
      Committed by Thomas Graf
      Adds a series of tests to verify the functionality of attaching
      BPF programs at LWT hooks.
      
      Also adds a sample which collects a histogram of packet sizes which
      pass through an LWT hook.
      
      $ ./lwt_len_hist.sh
      Starting netserver with host 'IN(6)ADDR_ANY' port '12865' and family AF_UNSPEC
      MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.253.2 () port 0 AF_INET : demo
      Recv   Send    Send
      Socket Socket  Message  Elapsed
      Size   Size    Size     Time     Throughput
      bytes  bytes   bytes    secs.    10^6bits/sec
      
       87380  16384  16384    10.00    39857.69
             1 -> 1        : 0        |                                      |
             2 -> 3        : 0        |                                      |
             4 -> 7        : 0        |                                      |
             8 -> 15       : 0        |                                      |
            16 -> 31       : 0        |                                      |
            32 -> 63       : 22       |                                      |
            64 -> 127      : 98       |                                      |
           128 -> 255      : 213      |                                      |
           256 -> 511      : 1444251  |********                              |
           512 -> 1023     : 660610   |***                                   |
          1024 -> 2047     : 535241   |**                                    |
          2048 -> 4095     : 19       |                                      |
          4096 -> 8191     : 180      |                                      |
          8192 -> 16383    : 5578023  |************************************* |
         16384 -> 32767    : 632099   |***                                   |
         32768 -> 65535    : 6575     |                                      |
      Signed-off-by: Thomas Graf <tgraf@suug.ch>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
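
The rows of the histogram above are power-of-two ranges (1 -> 1, 2 -> 3, 4 -> 7, ...). As a rough illustration of that bucketing, and not the code of the lwt_len_hist sample itself, the following sketch maps a packet length to its row; the function names and the 16-row default are assumptions chosen to match the printed output.

```python
def bucket_index(length):
    """Return the power-of-two bucket of a positive length.

    Bucket 0 covers 1 -> 1, bucket 1 covers 2 -> 3, bucket 2 covers 4 -> 7, ...
    """
    if length < 1:
        raise ValueError("length must be positive")
    return length.bit_length() - 1


def bucket_range(index):
    """Return the inclusive (low, high) range covered by a bucket."""
    low = 1 << index
    high = (1 << (index + 1)) - 1
    return low, high


def histogram(lengths, buckets=16):
    """Count lengths per bucket; oversized values land in the last bucket."""
    counts = [0] * buckets
    for n in lengths:
        idx = min(bucket_index(n), buckets - 1)
        counts[idx] += 1
    return counts
```

For example, a 1500-byte packet falls into the 1024 -> 2047 row, and the sixteenth row covers 32768 -> 65535.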
    • bpf: BPF for lightweight tunnel infrastructure · 3a0af8fd
      Committed by Thomas Graf
      Registers new BPF program types which correspond to the LWT hooks:
        - BPF_PROG_TYPE_LWT_IN   => dst_input()
        - BPF_PROG_TYPE_LWT_OUT  => dst_output()
        - BPF_PROG_TYPE_LWT_XMIT => lwtunnel_xmit()
      
      The separate program types are required to differentiate between the
      capabilities each LWT hook allows:
      
       * Programs attached to dst_input() or dst_output() are restricted and
         may only read the data of an skb. This prevents modification and
         possible invalidation of already validated packet headers on receive
         and the construction of illegal headers while the IP headers are
         still being assembled.
      
       * Programs attached to lwtunnel_xmit() are allowed to modify packet
         content as well as prepending an L2 header via a newly introduced
         helper bpf_skb_change_head(). This is safe as lwtunnel_xmit() is
         invoked after the IP header has been assembled completely.
      
      All BPF programs receive an skb with L3 headers attached and may return
      one of the following return codes:
      
       BPF_OK - Continue routing as per nexthop
       BPF_DROP - Drop skb and return EPERM
       BPF_REDIRECT - Redirect skb to device as per redirect() helper.
                      (Only valid in lwtunnel_xmit() context)
      
      The return codes are binary compatible with their TC_ACT_
      relatives to ease compatibility.
      Signed-off-by: Thomas Graf <tgraf@suug.ch>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
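
The return-code handling described above can be summarised as a small dispatch table. This is a user-space sketch only: the numeric values are the ones I understand the kernel uapi headers to define (BPF_OK/TC_ACT_OK = 0, BPF_DROP/TC_ACT_SHOT = 2, BPF_REDIRECT/TC_ACT_REDIRECT = 7) and should be checked against your tree; the hook names and the choice to drop on an invalid code are assumptions made for illustration.

```python
from enum import IntEnum


class BpfRet(IntEnum):
    """LWT BPF return codes (values assumed from the uapi header)."""
    BPF_OK = 0
    BPF_DROP = 2
    BPF_REDIRECT = 7


class TcAct(IntEnum):
    """The TC_ACT_ relatives the commit message refers to (assumed values)."""
    TC_ACT_OK = 0
    TC_ACT_SHOT = 2
    TC_ACT_REDIRECT = 7


# Only the xmit hook may redirect, per the commit message.
HOOKS_ALLOWING_REDIRECT = {"xmit"}


def handle_return(hook, ret):
    """Map a program's return value on a hook ("in", "out", "xmit") to an action."""
    if ret == BpfRet.BPF_OK:
        return "continue"      # continue routing as per nexthop
    if ret == BpfRet.BPF_DROP:
        return "drop"          # the kernel drops the skb and returns EPERM
    if ret == BpfRet.BPF_REDIRECT and hook in HOOKS_ALLOWING_REDIRECT:
        return "redirect"      # redirect as set up by the redirect() helper
    return "drop"              # assumption: anything else is treated as a drop
```

Because the two enums share values, a program written against the TC_ACT_ names would produce the same actions here.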
    • route: Set lwtstate for local traffic and cached input dsts · efd85700
      Committed by Thomas Graf
      A route on the output path hitting a RTN_LOCAL route will keep the dst
      associated on its way through the loopback device. On the receive path,
      the dst_input() call will thus invoke the input handler of the route
      created in the output path. Thus, lwt redirection for input must be done
      for dsts allocated in the output path as well.
      
      Also, if a route is cached in the input path, the allocated dst should
      respect lwtunnel configuration on the nexthop as well.
      Signed-off-by: Thomas Graf <tgraf@suug.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • route: Set orig_output when redirecting to lwt on locally generated traffic · 11b3d9c5
      Committed by Thomas Graf
      orig_output for IPv4 was only set for dsts which hit an input route.
      Set it consistently for locally generated traffic as well to allow
      lwt to continue the dst_output() path as configured by the nexthop.
      
      Fixes: 25368623 ("lwt: Add support to redirect dst.input")
      Signed-off-by: Thomas Graf <tgraf@suug.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'mlx5-updates' · ee3d7c6e
      Committed by David S. Miller
      Saeed Mahameed says:
      
      ====================
      Mellanox 100G mlx5 updates 2016-11-29
      
      The following series from Tariq and Roi provides some critical fixes
      and updates for the mlx5e driver.
      
      From Tariq:
       - Fix driver coherent memory huge allocation issues by fragmenting
         completion queues, in a way that is transparent to the netdev driver by
         providing a new buffer type "mlx5_frag_buf" with the same access API.
       - Create UMR MKey per RQ to have better scalability.
      
      From Roi:
       - Some fixes for the encap-decap support and tc flower added lately to the
         mlx5e driver.
      
      v1->v2:
       - Fix start index in error flow of mlx5_frag_buf_alloc_node, pointed out by Eric.
      
      This series was generated against commit:
      31ac1c19 ("geneve: fix ip_hdr_len reserved for geneve6 tunnel.")
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/mlx5e: Remove flow encap entry in the correct place · 5067b602
      Committed by Roi Dayan
      Handling flow encap entry should be inside tc del flow
      and is only relevant for offloaded eswitch TC rules.
      
      Fixes: 11a457e9b6c1 ("net/mlx5e: Add basic TC tunnel set action for SRIOV offloads")
      Signed-off-by: Roi Dayan <roid@mellanox.com>
      Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/mlx5e: Refactor tc del flow to accept mlx5e_tc_flow instance · 961e8979
      Committed by Roi Dayan
      Change the function that deletes offloaded TC rule to get
      struct mlx5e_tc_flow instance which contains both the flow
      handle and flow attributes. This is a cleanup needed for
      downstream patches; it doesn't change any functionality.
      Signed-off-by: Roi Dayan <roid@mellanox.com>
      Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/mlx5e: Correct cleanup order when deleting offloaded TC rules · 86a33ae1
      Committed by Roi Dayan
      According to the reverse unwinding principle, on delete time we should
      first handle deletion of the steering rule and later handle the vlan
      deletion from the eswitch.
      
      Fixes: 8b32580d ("net/mlx5e: Add TC vlan action for SRIOV offloads")
      Signed-off-by: Roi Dayan <roid@mellanox.com>
      Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
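
The reverse unwinding principle mentioned in the commit above means that teardown undoes setup steps in the opposite order. A minimal sketch, with hypothetical step names and assuming the vlan is configured on the eswitch before the steering rule is added:

```python
class Unwinder:
    """Records undo callbacks and runs them in reverse order of setup."""

    def __init__(self):
        self._undo = []

    def add(self, undo):
        self._undo.append(undo)

    def unwind(self):
        # Pop from the end so the most recent setup step is undone first.
        while self._undo:
            self._undo.pop()()


def offload_rule(log):
    """Set up a rule in two steps and return the matching unwinder."""
    u = Unwinder()
    log.append("add vlan to eswitch")
    u.add(lambda: log.append("del vlan from eswitch"))
    log.append("add steering rule")
    u.add(lambda: log.append("del steering rule"))
    return u
```

Calling `unwind()` on the returned object removes the steering rule first and the vlan second, which is the order the commit establishes for rule deletion.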
    • net/mlx5e: Remove redundant hashtable lookup in configure flower · 53636068
      Committed by Roi Dayan
      We will never find a flow with the same cookie as cls_flower always
      allocates a new flow and the cookie is the allocated memory address.
      
      Fixes: e3a2b7ed ("net/mlx5e: Support offload cls_flower with drop action")
      Signed-off-by: Roi Dayan <roid@mellanox.com>
      Reviewed-by: Hadar Hen Zion <hadarh@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/mlx5e: Create UMR MKey per RQ · ec8b9981
      Committed by Tariq Toukan
      In Striding RQ implementation, we used a single UMR
      (User-Mode Memory Registration) memory key for all RQs.
      When the product of the number of RQs and their size gets high,
      we hit the limit of a u16 field in FW.
      
      Here we move to using a UMR memory key per RQ, so we can
      scale to any number of rings, with the maximum buffer
      size in each.
      Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
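
The scaling argument above is simple arithmetic. The commit does not say which firmware field is involved or in what units it counts, so the sketch below only assumes that the field holds a count of "entries" and is 16 bits wide; the function names and the sample numbers are hypothetical.

```python
U16_MAX = 0xFFFF  # largest value a u16 field can hold


def fits_shared_mkey(num_rqs, entries_per_rq):
    """With one key for all RQs, the field must cover every RQ at once."""
    return num_rqs * entries_per_rq <= U16_MAX


def fits_per_rq_mkey(entries_per_rq):
    """With one key per RQ, the field only has to cover a single RQ."""
    return entries_per_rq <= U16_MAX
```

With, say, 64 RQs of 2048 entries each, the shared key would need 131072 entries and overflow the field, while each per-RQ key needs only 2048, independent of the number of rings.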
    • net/mlx5e: Move function mlx5e_create_umr_mkey · 3608ae77
      Committed by Tariq Toukan
      The next patch creates a UMR MKey per RQ, so we need
      mlx5e_create_umr_mkey declared before mlx5e_create_rq.
      Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/mlx5e: Implement Fragmented Work Queue (WQ) · 1c1b5228
      Committed by Tariq Toukan
      Add new type of struct mlx5_frag_buf which is used to allocate fragmented
      buffers rather than contiguous, and make the Completion Queues (CQs) use
      it as they are big (default of 2MB per CQ in Striding RQ).
      
      This fixes failures of the type:
      "mlx5e_open_locked: mlx5e_open_channels failed, -12"
      which occur when dma_zalloc_coherent cannot find enough contiguous
      coherent memory to satisfy the driver's request as the user tries to
      set up more or larger rings.
      Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
      Reported-by: Sebastian Ott <sebott@linux.vnet.ibm.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
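
The idea of a fragmented buffer with the same access API can be illustrated with plain index arithmetic. This is a sketch under stated assumptions, not the mlx5_frag_buf implementation: the 4096-byte fragment size, the 64-byte entry size used in the example, and all function names are hypothetical.

```python
FRAG_SIZE = 4096  # assumption: one page per fragment


def alloc_frag_buf(total_bytes, frag_size=FRAG_SIZE):
    """Allocate several small fragments instead of one contiguous buffer."""
    nfrags = -(-total_bytes // frag_size)  # ceiling division
    return [bytearray(frag_size) for _ in range(nfrags)]


def locate(offset, frag_size=FRAG_SIZE):
    """Translate a linear byte offset into (fragment index, offset in fragment)."""
    return divmod(offset, frag_size)


def write_entry(frags, index, entry_size, data, frag_size=FRAG_SIZE):
    """Write one fixed-size queue entry, addressed by its linear index."""
    if frag_size % entry_size != 0:
        raise ValueError("entries must not straddle fragments")
    if len(data) != entry_size:
        raise ValueError("data must be exactly one entry")
    frag, off = locate(index * entry_size, frag_size)
    frags[frag][off:off + entry_size] = data


def read_entry(frags, index, entry_size, frag_size=FRAG_SIZE):
    """Read one fixed-size queue entry, addressed by its linear index."""
    frag, off = locate(index * entry_size, frag_size)
    return bytes(frags[frag][off:off + entry_size])
```

Callers keep addressing entries by index, so the change from one large allocation to many small ones stays invisible to them. A 2MB queue becomes 512 independent 4096-byte allocations, none of which needs a large contiguous region.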
    • Merge branch 'altera-tse-sgmii-pcs' · 6c0c6203
      Committed by David S. Miller
      Neill Whillans says:
      
      ====================
      net: Add support for SGMII PCS on Altera TSE MAC
      
      These patches were created as part of work to add support for SGMII
      PCS functionality to the Altera TSE MAC. Patches are based on 4.9-rc6
      git tree.
      
      The first patch in the series adds support for the VSC8572 dual-port
      Gigabit Ethernet transceiver, used in integration testing.
      
      The second patch adds support for the SGMII PCS functionality to the
      Altera TSE driver.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ethernet: altera_tse: add support for SGMII PCS · 3b804564
      Committed by Neill Whillans
      Add support for the (optional) SGMII PCS functionality of the Altera
      TSE MAC. If the phy-mode is set to 'sgmii' then we attempt to discover
      and initialise the PCS so that the MAC can communicate to the PHY.
      
      The PCS IP block provides a scratch register for testing presence of
      the PCS, which is mapped into one of the two MDIO spaces present in
      the MAC's register space.  Once we have determined that the scratch
      register is functioning, we attempt to initialise the PCS to
      auto-negotiate an SGMII link with the PHY. There is no need to monitor
      or manage the SGMII link beyond this, since the normal PHY MDIO will
      then be used to monitor the media layer.
      
      The Altera TSE MAC has only one way in which it can be configured with an
      SGMII PCS, and as such, this patch only looks to the phy-mode to select
      whether or not to attempt to initialise the PCS registers.  During
      initialisation, we report the PCS's equivalent of a PHY ID register.
      This can be parameterised during the IP instantiation and is often left
      as '0x00000000' which is not an error.
      Signed-off-by: Neill Whillans <neill.whillans@codethink.co.uk>
      Reviewed-by: Daniel Silverstone <daniel.silverstone@codethink.co.uk>
      Signed-off-by: David S. Miller <davem@davemloft.net>
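
Presence detection through a scratch register, as described in the commit above, amounts to writing test patterns and reading them back. The sketch below models this against a fake register bus; the register offset, the patterns, the all-ones read from an absent block, and the `FakePcs` class are all assumptions for illustration, not values taken from the driver.

```python
SCRATCH_REG = 0x10  # hypothetical register offset


class FakePcs:
    """Toy 16-bit register space; an absent block reads back as all ones."""

    def __init__(self, present=True):
        self.present = present
        self.regs = {}

    def write(self, reg, value):
        if self.present:
            self.regs[reg] = value & 0xFFFF

    def read(self, reg):
        if not self.present:
            return 0xFFFF
        return self.regs.get(reg, 0)


def pcs_present(bus, patterns=(0x0000, 0xFFFF, 0xA5A5)):
    """Return True if every pattern written to the scratch register reads back."""
    for pattern in patterns:
        bus.write(SCRATCH_REG, pattern)
        if bus.read(SCRATCH_REG) != pattern:
            return False
    return True
```

Using more than one pattern avoids a false positive from a bus that happens to return a constant value.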