1. 16 April 2016, 21 commits
  2. 15 April 2016, 19 commits
    • P
      tun: use per cpu variables for stats accounting · 608b9977
      Paolo Abeni committed
      Currently the tun device accounting uses dev->stats without applying any
      kind of protection, even though accounting happens in preemptible
      process context.
      This patch moves the tun stats to a per cpu data structure, and protects
      the updates with u64_stats_update_begin()/u64_stats_update_end() or
      this_cpu_inc according to the stat type. The per cpu stats are
      aggregated by the newly added ndo_get_stats64 ops.
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      608b9977
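The per cpu accounting described in the tun commit above can be modelled in userspace roughly as follows. This is only a sketch in Python, not the driver code: the class names `PcpuStats` and `TunStatsModel`, the `account_*` methods, and the choice of counters are assumptions made for illustration. Only the idea (each CPU updates its own counters, and a `get_stats64`-style call sums them) comes from the commit message.

```python
from dataclasses import dataclass, fields
from typing import Dict, List


@dataclass
class PcpuStats:
    """Counters owned by one CPU; only that CPU ever writes them (hypothetical layout)."""
    rx_packets: int = 0
    rx_bytes: int = 0
    tx_packets: int = 0
    tx_bytes: int = 0
    rx_dropped: int = 0


class TunStatsModel:
    """Userspace model: one PcpuStats per CPU, summed on demand."""

    def __init__(self, num_cpus: int) -> None:
        self.pcpu: List[PcpuStats] = [PcpuStats() for _ in range(num_cpus)]

    def account_rx(self, cpu: int, nbytes: int) -> None:
        # Packet and byte counters are updated together on the local CPU.
        stats = self.pcpu[cpu]
        stats.rx_packets += 1
        stats.rx_bytes += nbytes

    def account_tx(self, cpu: int, nbytes: int) -> None:
        stats = self.pcpu[cpu]
        stats.tx_packets += 1
        stats.tx_bytes += nbytes

    def account_rx_drop(self, cpu: int) -> None:
        # A single-counter increment on the local CPU.
        self.pcpu[cpu].rx_dropped += 1

    def get_stats64(self) -> Dict[str, int]:
        # Aggregation step: walk every CPU's counters and add them up.
        totals = {f.name: 0 for f in fields(PcpuStats)}
        for stats in self.pcpu:
            for name in totals:
                totals[name] += getattr(stats, name)
        return totals
```

Because each CPU writes only its own slot, writers never contend with each other; the cost moves to the reader, which has to visit every slot.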
    • D
      Merge branch 'bpf-ARG_PTR_TO_RAW_STACK' · 548aacdd
      David S. Miller committed
      Merge branch 'bpf-ARG_PTR_TO_RAW_STACK'
      
      Daniel Borkmann says:
      
      ====================
      BPF updates
      
      This series adds a new verifier argument type called
      ARG_PTR_TO_RAW_STACK and converts related helpers to make
      use of it. Basic idea is that we can save init of stack
      memory when the helper function is guaranteed to fully
      fill out the passed buffer in every path. Series also adds
      test cases and converts samples. For more details, please
      see individual patches.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      548aacdd
    • D
      bpf, samples: add test cases for raw stack · 3f2050e2
      Daniel Borkmann committed
      This adds test cases mostly around ARG_PTR_TO_RAW_STACK to check the
      verifier behaviour.
      
        [...]
        #84 raw_stack: no skb_load_bytes OK
        #85 raw_stack: skb_load_bytes, no init OK
        #86 raw_stack: skb_load_bytes, init OK
        #87 raw_stack: skb_load_bytes, spilled regs around bounds OK
        #88 raw_stack: skb_load_bytes, spilled regs corruption OK
        #89 raw_stack: skb_load_bytes, spilled regs corruption 2 OK
        #90 raw_stack: skb_load_bytes, spilled regs + data OK
        #91 raw_stack: skb_load_bytes, invalid access 1 OK
        #92 raw_stack: skb_load_bytes, invalid access 2 OK
        #93 raw_stack: skb_load_bytes, invalid access 3 OK
        #94 raw_stack: skb_load_bytes, invalid access 4 OK
        #95 raw_stack: skb_load_bytes, invalid access 5 OK
        #96 raw_stack: skb_load_bytes, invalid access 6 OK
        #97 raw_stack: skb_load_bytes, large access OK
        Summary: 98 PASSED, 0 FAILED
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3f2050e2
    • D
      bpf, samples: don't zero data when not needed · 02413cab
      Daniel Borkmann committed
      Remove the zero initialization in the sample programs where appropriate.
      Note that this is an optimization which is now possible; old programs
      still doing the zero initialization are just fine as well. Also, make
      sure we don't have padding issues when we don't memset() the entire
      struct anymore.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      02413cab
    • D
      bpf: convert relevant helper args to ARG_PTR_TO_RAW_STACK · 074f528e
      Daniel Borkmann committed
      This patch converts all helpers that can use ARG_PTR_TO_RAW_STACK as argument
      type. For tc programs this is bpf_skb_load_bytes(), bpf_skb_get_tunnel_key(),
      bpf_skb_get_tunnel_opt(). For tracing, this optimizes bpf_get_current_comm()
      and bpf_probe_read(). The check in bpf_skb_load_bytes() for MAX_BPF_STACK can
      also be removed since the verifier already makes sure we stay within bounds
      on stack buffers.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      074f528e
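For a helper to be safe with an uninitialized buffer, it has to write the whole buffer on every return path. The following is a minimal userspace sketch of that contract in Python, not the kernel helper itself: the function name `load_bytes`, its signature, and the 0xAA "garbage" pattern used in the usage note are assumptions for illustration.

```python
EFAULT = 14  # errno value used on Linux


def load_bytes(packet: bytes, offset: int, buf: bytearray) -> int:
    """Copy len(buf) bytes from packet[offset:] into buf.

    Hypothetical model of a helper that accepts an uninitialized buffer:
    every path overwrites all of buf, so no stale bytes can leak out.
    """
    length = len(buf)
    if offset < 0 or offset + length > len(packet):
        # Error path: the caller did not initialize buf, so zero it here.
        buf[:] = bytes(length)
        return -EFAULT
    # Fast path: the copy itself fills the whole buffer, no memset needed.
    buf[:] = packet[offset:offset + length]
    return 0
```

Calling it with a buffer pre-filled with 0xAA (standing in for uninitialized stack memory) leaves either the requested packet bytes or all zeroes in the buffer, never the original pattern.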
    • D
      bpf, verifier: add ARG_PTR_TO_RAW_STACK type · 435faee1
      Daniel Borkmann committed
      When passing buffers from eBPF stack space into a helper function, we have
      ARG_PTR_TO_STACK argument type for helpers available. The verifier makes sure
      that such buffers are initialized, within boundaries, etc.
      
      However, the downside with this is that we have a couple of helper functions
      such as bpf_skb_load_bytes() that fill out the passed buffer in the expected
      success case anyway, so zero initializing them prior to the helper call is
      unneeded/wasted instructions in the eBPF program that can be avoided.
      
      Therefore, add a new helper function argument type called ARG_PTR_TO_RAW_STACK.
      The idea is to skip the STACK_MISC check in check_stack_boundary() and color
      the related stack slots as STACK_MISC after we checked all call arguments.
      
      Helper functions using ARG_PTR_TO_RAW_STACK must make sure that every path of
      the helper function will fill the provided buffer area, so that we cannot leak
      any uninitialized stack memory. This means, for example, that error paths need to
      memset() the buffers, but the expected fast-path doesn't have to do this
      anymore.
      
      Since there's no such helper needing more than at most one ARG_PTR_TO_RAW_STACK
      argument, we can keep it simple and don't need to check for multiple areas.
      Should in future such a use-case really appear, we have check_raw_mode() that
      will make sure we implement support for it first.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      435faee1
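The difference between the two argument types in the commit above can be illustrated with a toy model of stack slot tracking. This is a heavily simplified sketch in Python, not the verifier: the `StackModel` class, the `call_helper` method and the per-byte slot list are assumptions for illustration, and spilled registers are left out entirely. What it takes from the commit message is the rule itself: raw mode skips the STACK_MISC check but still enforces bounds, and the slots are colored STACK_MISC once the arguments have been checked.

```python
from enum import Enum

MAX_BPF_STACK = 512  # eBPF stack size in bytes


class Slot(Enum):
    STACK_INVALID = 0  # never written by the program
    STACK_MISC = 1     # holds initialized data


class StackModel:
    """Toy model: one state per stack byte, offsets are negative from the frame pointer."""

    def __init__(self) -> None:
        self.slots = [Slot.STACK_INVALID] * MAX_BPF_STACK

    def _index(self, off: int) -> int:
        return MAX_BPF_STACK + off

    def check_stack_boundary(self, off: int, size: int, raw_mode: bool) -> bool:
        # Bounds are checked for both argument types.
        if size <= 0 or off < -MAX_BPF_STACK or off + size > 0:
            return False
        if raw_mode:
            # Raw mode: skip the "is it initialized" check.
            return True
        return all(self.slots[self._index(off + i)] is Slot.STACK_MISC
                   for i in range(size))

    def call_helper(self, off: int, size: int, raw_mode: bool) -> bool:
        """Return True if the call would be accepted in this model."""
        if not self.check_stack_boundary(off, size, raw_mode):
            return False
        if raw_mode:
            # The helper promises to fill the buffer, so mark it initialized.
            for i in range(size):
                self.slots[self._index(off + i)] = Slot.STACK_MISC
        return True
```

In this model an 8-byte buffer at offset -8 is rejected as a regular stack argument until it has been written, is accepted immediately in raw mode, and counts as initialized for any later call.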
    • D
      bpf, verifier: add bpf_call_arg_meta for passing meta data · 33ff9823
      Daniel Borkmann committed
      Currently, when the verifier checks calls in check_call() function, we
      call check_func_arg() for all 5 arguments e.g. to make sure expected types
      are correct. In some cases, we collect meta data (here: map pointer) to
      perform additional checks such as checking stack boundary on key/value
      sizes for subsequent arguments. As we're going to extend the meta data,
      add a generic struct bpf_call_arg_meta that we can use for passing into
      check_func_arg().
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      33ff9823
    • M
      sctp: add support for RPS and RFS · 486bdee0
      Marcelo Ricardo Leitner committed
      This patch adds what's missing to properly support RPS and RFS on SCTP,
      as some of it is already implemented in common calls.
      
      Having support for RPS and RFS allows better scaling, especially because
      not all NICs support hashing SCTP headers.
      
      Save the hash right when we dequeue a skb from inqueue so we do it only
      once per skb instead of per chunk. New sockets will then inherit the
      hash through sctp_copy_sock().
      Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      486bdee0
    • E
      net: validate_xmit_skb() changes · d21fd63e
      Eric Dumazet committed
      skbs given to validate_xmit_skb() should not have a next
      pointer anymore.
      
      Also, if a packet is dropped, increment dev->tx_dropped;
      __dev_queue_xmit() no longer has to change tx_dropped in this case.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d21fd63e
    • W
      packet: uses kfree_skb() for errors. · da37845f
      Weongyo Jeong committed
      consume_skb() is not meant for error cases; kfree_skb() is the proper
      call there.  This patch fixes tpacket_rcv() and packet_rcv() to use the
      right one for error and non-error cases, letting perf trace its event
      properly.
      Signed-off-by: Weongyo Jeong <weongyo.linux@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      da37845f
    • P
      tipc: fix a race condition leading to subscriber refcnt bug · 333f7962
      Parthasarathy Bhuvaragan committed
      Until now, the requests sent to topology server are queued
      to a workqueue by the generic server framework.
      These messages are processed by worker threads and trigger the
      registered callbacks.
      To reduce latency on uniprocessor systems, explicit rescheduling
      is performed using cond_resched() after MAX_RECV_MSG_COUNT(25)
      messages.
      
      This implementation on SMP systems leads to a subscriber refcnt
      error as described below:
      When a worker thread yields by calling cond_resched() in an SMP
      system, a new worker is created on another CPU to process the
      pending workitem. Sometimes the sleeping thread wakes up before
      the new thread finishes execution.
      This breaks the assumption on ordering and being single threaded.
      The fault is more frequent when MAX_RECV_MSG_COUNT is lowered.
      
      If the first thread was processing subscription create and the
      second thread processing close(), the close request will free
      the subscriber and the create request oops as follows:
      
      [31.224137] WARNING: CPU: 2 PID: 266 at include/linux/kref.h:46 tipc_subscrb_rcv_cb+0x317/0x380 [tipc]
      [31.228143] CPU: 2 PID: 266 Comm: kworker/u8:1 Not tainted 4.5.0+ #97
      [31.228377] Workqueue: tipc_rcv tipc_recv_work [tipc]
      [...]
      [31.228377] Call Trace:
      [31.228377]  [<ffffffff812fbb6b>] dump_stack+0x4d/0x72
      [31.228377]  [<ffffffff8105a311>] __warn+0xd1/0xf0
      [31.228377]  [<ffffffff8105a3fd>] warn_slowpath_null+0x1d/0x20
      [31.228377]  [<ffffffffa0098067>] tipc_subscrb_rcv_cb+0x317/0x380 [tipc]
      [31.228377]  [<ffffffffa00a4984>] tipc_receive_from_sock+0xd4/0x130 [tipc]
      [31.228377]  [<ffffffffa00a439b>] tipc_recv_work+0x2b/0x50 [tipc]
      [31.228377]  [<ffffffff81071925>] process_one_work+0x145/0x3d0
      [31.246554] ---[ end trace c3882c9baa05a4fd ]---
      [31.248327] BUG: spinlock bad magic on CPU#2, kworker/u8:1/266
      [31.249119] BUG: unable to handle kernel NULL pointer dereference at 0000000000000428
      [31.249323] IP: [<ffffffff81099d0c>] spin_dump+0x5c/0xe0
      [31.249323] PGD 0
      [31.249323] Oops: 0000 [#1] SMP
      
      In this commit, we
      - rename tipc_conn_shutdown() to tipc_conn_release().
      - move connection release callback execution from tipc_close_conn()
        to a new function tipc_sock_release(), which is executed before
        we free the connection.
      Thus we release the subscriber during connection release procedure
      rather than connection shutdown procedure.
      Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      333f7962
    • D
      Merge branch 'gro-fixed-id-gso-partial' · edd93cd7
      David S. Miller committed
      Alexander Duyck says:
      
      ====================
      GRO Fixed IPv4 ID support and GSO partial support
      
      This patch series sets up a few different things.
      
      First it adds support for GRO of frames with a fixed IP ID value.  This
      will allow us to perform GRO for frames that go through things like an IPv6
      to IPv4 header translation.
      
      The second item we add is support for segmenting frames that are generated
      this way.  Most devices only support an incrementing IP ID value, and in
      the case of TCP the IP ID can be ignored in many cases since the DF bit
      should be set.  So we can technically segment these frames using existing
      TSO if we are willing to allow the IP ID to be mangled.  As such I have
      added a matching feature for the new form of GRO/GSO called TCP IPv4 ID
      mangling.  With this enabled we can assemble and disassemble a frame with
      the sequence number fixed and the only ill effect will be that the IPv4 ID
      will be altered which may or may not have any noticeable effect.  As such I
      have defaulted the feature to disabled.
      
      The third item this patch series adds is support for partial GSO
      segmentation.  Partial GSO segmentation allows us to split a large frame
      into two pieces.  The first piece will have an even multiple of MSS worth
      of data and the headers before the one pointed to by csum_start will have
      been updated so that they are correct as if the data payload had already
      been segmented.  By doing this we can do things such as precompute the
      outer header checksums for a frame to be segmented allowing us to perform
      TSO on devices that don't support tunneling, or tunneling with outer header
      checksums.
      
      This patch set is based on the net-next tree, but I included "net: remove
      netdevice gso_min_segs" in my tree as I assume it is likely to be applied
      before this patch set will and I wanted to avoid a merge conflict.
      
      v2: Fixed items reported by Jesse Gross
          fixed missing GSO flag in MPLS check
          adding DF check for MANGLEID
          Moved extra GSO feature checks into gso_features_check
          Rebased patches to account for "net: remove netdevice gso_min_segs"
      
      Driver patches from the first patch set should still be compatible.  However
      I do have a few changes in them so I will submit a v2 of those to Jeff
      Kirsher once these patches are accepted into net-next.
      
      Example driver patches for i40e, ixgbe, and igb:
      https://patchwork.ozlabs.org/patch/608221/
      https://patchwork.ozlabs.org/patch/608224/
      https://patchwork.ozlabs.org/patch/608225/
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      edd93cd7
    • A
      Documentation: Add documentation for TSO and GSO features · f7a6272b
      Alexander Duyck committed
      This document is a starting point for defining the TSO and GSO features.
      The whole thing is starting to get a bit messy so I wanted to make sure we
      have notes somewhere to start describing what does and doesn't work.
      Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f7a6272b
    • A
      GSO: Support partial segmentation offload · 802ab55a
      Alexander Duyck committed
      This patch adds support for something I am referring to as GSO partial.
      The basic idea is that we can support a broader range of devices for
      segmentation if we use fixed outer headers and have the hardware only
      really deal with segmenting the inner header.  The idea behind the naming
      is due to the fact that everything before csum_start will be fixed headers,
      and everything after will be the region that is handled by hardware.
      
      With the current implementation it allows us to add support for the
      following GSO types with an inner TSO_MANGLEID or TSO6 offload:
      NETIF_F_GSO_GRE
      NETIF_F_GSO_GRE_CSUM
      NETIF_F_GSO_IPIP
      NETIF_F_GSO_SIT
      NETIF_F_UDP_TUNNEL
      NETIF_F_UDP_TUNNEL_CSUM
      
      In the case of hardware that already supports tunneling we may be able to
      extend this further to support TSO_TCPV4 without TSO_MANGLEID if the
      hardware can support updating inner IPv4 headers.
      Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      802ab55a
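The size arithmetic behind the two-piece split can be sketched as follows. This is an illustrative Python calculation only, not the kernel segmentation code: the function name `gso_partial_split` is hypothetical, and header handling (the fixed outer headers and checksum precomputation) is left out. It shows just that the first piece carries a whole multiple of the MSS and the second piece carries whatever remains.

```python
from typing import Tuple


def gso_partial_split(payload_len: int, mss: int) -> Tuple[int, int]:
    """Return (first_piece_len, remainder_len) for a payload of payload_len bytes.

    The first piece is the largest whole multiple of mss that fits; it is the
    part that can be handed to hardware for segmentation. The remainder is
    shorter than one mss and travels as its own frame.
    """
    if mss <= 0:
        raise ValueError("mss must be positive")
    if payload_len < 0:
        raise ValueError("payload_len must not be negative")
    first = (payload_len // mss) * mss
    return first, payload_len - first
```

For example, a 10000-byte payload with an MSS of 1448 splits into 8688 bytes (six full segments) and a 1312-byte remainder; a payload that is already a whole multiple of the MSS has an empty remainder.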
    • A
      GRO: Add support for TCP with fixed IPv4 ID field, limit tunnel IP ID values · 1530545e
      Alexander Duyck committed
      This patch does two things.
      
      First it allows TCP to aggregate TCP frames with a fixed IPv4 ID field.  As
      a result we should now be able to aggregate flows that were converted from
      IPv6 to IPv4.  In addition this allows us more flexibility for future
      implementations of segmentation as we may be able to use a fixed IP ID when
      segmenting the flow.
      
      The second thing this does is that it places limitations on the outer IPv4
      ID header in the case of tunneled frames.  Specifically it forces the IP ID
      to be incrementing by 1 unless the DF bit is set in the outer IPv4 header.
      This way we can avoid creating overlapping series of IP IDs that could
      possibly be fragmented if the frame goes through GRO and is then
      resegmented via GSO.
      Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1530545e
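The two IP ID rules in the commit above can be expressed as a small check over the IDs of consecutive frames. This is a sketch in Python under stated assumptions, not the GRO code: the function names `classify_ip_ids` and `outer_ids_acceptable` are hypothetical, and the commit message does not spell out what is allowed for the outer header once DF is set, so the sketch simply does not apply the increment rule in that case.

```python
from typing import List


def classify_ip_ids(ids: List[int]) -> str:
    """Classify consecutive 16-bit IPv4 IDs as 'fixed', 'incrementing', 'other' or 'single'."""
    if len(ids) < 2:
        return "single"
    # The IPv4 ID field is 16 bits wide, so differences wrap modulo 65536.
    deltas = {(b - a) & 0xFFFF for a, b in zip(ids, ids[1:])}
    if deltas == {0}:
        return "fixed"
    if deltas == {1}:
        return "incrementing"
    return "other"


def outer_ids_acceptable(ids: List[int], df_set: bool) -> bool:
    """Model of the outer-header restriction for tunneled frames.

    Without DF the outer ID must increment by 1 from frame to frame.
    Assumption: with DF set this sketch imposes no increment requirement.
    """
    if df_set:
        return True
    return classify_ip_ids(ids) in ("incrementing", "single")
```

A run such as 65535, 0, 1 counts as incrementing because of the 16-bit wraparound, and a run of identical IDs is the fixed-ID case that becomes aggregatable for TCP.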
    • A
      GSO: Add GSO type for fixed IPv4 ID · cbc53e08
      Alexander Duyck committed
      This patch adds support for TSO using IPv4 headers with a fixed IP ID
      field.  This is meant to allow us to do a lossless GRO in the case of TCP
      flows that use a fixed IP ID such as those that convert IPv6 headers to IPv4
      headers.
      
      In addition I am adding a feature that for now I am referring to as TSO with
      IP ID mangling.  Basically when this flag is enabled the device has the
      option to either output the flow with incrementing IP IDs or with a fixed
      IP ID regardless of what the original IP ID ordering was.  This is useful
      in cases where the DF bit is set and we do not care if the original IP ID
      value is maintained.
      Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      cbc53e08
    • A
      ethtool: Add support for toggling any of the GSO offloads · 518f213d
      Alexander Duyck committed
      The strings were missing for several of the GSO offloads that are
      available.  This patch provides the missing strings so that we can toggle
      or query any of them via the ethtool command.
      Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      518f213d
    • D
      Merge branch 'mlxsw-devlink-shared-buffers' · cb689269
      David S. Miller committed
      Jiri Pirko says:
      
      ====================
      devlink + mlxsw: add support for config and control of shared buffers
      
      ASICs implement shared buffer for packet forwarding purposes and enable
      flexible partitioning of the shared buffer for different flows and ports,
      enabling non-blocking progress of different flows as well as separation
      of lossy traffic from loss-less traffic when using Per-Priority Flow
      Control (PFC). The shared buffer optimizes the buffer utilization for better
      absorption of packet bursts.
      
      This patchset implements an API which is based on the model SAI uses. That is
      aligned with multiple ASIC vendors so this API should be vendor neutral.
      
      Userspace counterpart patchset for devlink iproute2 tool can be found here:
      https://github.com/jpirko/iproute2_mlxsw/tree/devlink_sb
      
      Couple of examples of usage:
      
      switch$ devlink sb help
      Usage: devlink sb show [ DEV [ sb SB_INDEX ] ]
             devlink sb pool show [ DEV [ sb SB_INDEX ] pool POOL_INDEX ]
             devlink sb pool set DEV [ sb SB_INDEX ] pool POOL_INDEX
                                 size POOL_SIZE thtype { static | dynamic }
             devlink sb port pool show [ DEV/PORT_INDEX [ sb SB_INDEX ]
                                         pool POOL_INDEX ]
             devlink sb port pool set DEV/PORT_INDEX [ sb SB_INDEX ]
                                      pool POOL_INDEX th THRESHOLD
             devlink sb tc bind show [ DEV/PORT_INDEX [ sb SB_INDEX ] tc TC_INDEX ]
             devlink sb tc bind set DEV/PORT_INDEX [ sb SB_INDEX ] tc TC_INDEX
                                    type { ingress | egress } pool POOL_INDEX
                                    th THRESHOLD
             devlink sb occupancy show { DEV | DEV/PORT_INDEX } [ sb SB_INDEX ]
             devlink sb occupancy snapshot DEV [ sb SB_INDEX ]
             devlink sb occupancy clearmax DEV [ sb SB_INDEX ]
      
      switch$ devlink sb show
      pci/0000:03:00.0: sb 0 size 16777216 ing_pools 4 eg_pools 4 ing_tcs 8 eg_tcs 8
      
      switch$ devlink sb pool show
      pci/0000:03:00.0: sb 0 pool 0 type ingress size 12400032 thtype dynamic
      pci/0000:03:00.0: sb 0 pool 1 type ingress size 0 thtype dynamic
      pci/0000:03:00.0: sb 0 pool 2 type ingress size 0 thtype dynamic
      pci/0000:03:00.0: sb 0 pool 3 type ingress size 200064 thtype dynamic
      pci/0000:03:00.0: sb 0 pool 4 type egress size 13220064 thtype dynamic
      pci/0000:03:00.0: sb 0 pool 5 type egress size 0 thtype dynamic
      pci/0000:03:00.0: sb 0 pool 6 type egress size 0 thtype dynamic
      pci/0000:03:00.0: sb 0 pool 7 type egress size 0 thtype dynamic
      
      switch$ devlink sb port pool show sw0p7 pool 0
      sw0p7: sb 0 pool 0 threshold 16
      
      switch$ sudo devlink sb port pool set sw0p7 pool 0 th 15
      
      switch$ devlink sb port pool show sw0p7 pool 0
      sw0p7: sb 0 pool 0 threshold 15
      
      switch$ devlink sb tc bind show sw0p7 tc 0 type ingress
      sw0p7: sb 0 tc 0 type ingress pool 0 threshold 10
      
      switch$ sudo devlink sb tc bind set sw0p7 tc 0 type ingress pool 0 th 9
      
      switch$ devlink sb tc bind show sw0p7 tc 0 type ingress
      sw0p7: sb 0 tc 0 type ingress pool 0 threshold 9
      
      switch$ sudo devlink sb occupancy snapshot pci/0000:03:00.0
      
      switch$ devlink sb occupancy show sw0p7
      sw0p7:
        pool: 0:      82944/3217344 1:          0/0       2:          0/0       3:          0/0
              4:          0/384     5:          0/0       6:          0/0       7:          0/0
        itc:  0(0):   96768/3217344 1(0):       0/0       2(0):       0/0       3(0):       0/0
              4(0):       0/0       5(0):       0/0       6(0):       0/0       7(0):       0/0
        etc:  0(4):       0/384     1(4):       0/0       2(4):       0/0       3(4):       0/0
              4(4):       0/0       5(4):       0/0       6(4):       0/0       7(4):       0/0
      
      switch$ sudo devlink sb occupancy clearmax pci/0000:03:00.0
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      cb689269
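Output in the shape of the `devlink sb pool show` example above is regular enough to post-process with a few lines of script. The following Python sketch assumes the exact line layout shown in that example (device handle, then `key value` pairs); the function names `parse_sb_pool_line` and `total_size_by_type` are hypothetical, and the layout of other devlink versions may differ.

```python
from typing import Dict, Iterable, Union

Value = Union[int, str]


def parse_sb_pool_line(line: str) -> Dict[str, Value]:
    """Parse one line such as
    'pci/0000:03:00.0: sb 0 pool 0 type ingress size 12400032 thtype dynamic'.
    """
    # The device handle itself contains colons, so split on ': ' (colon + space).
    dev, rest = line.strip().split(": ", 1)
    tokens = rest.split()
    if len(tokens) % 2 != 0:
        raise ValueError("expected key/value pairs: %r" % line)
    entry: Dict[str, Value] = {"dev": dev}
    for key, value in zip(tokens[0::2], tokens[1::2]):
        entry[key] = int(value) if value.isdigit() else value
    return entry


def total_size_by_type(lines: Iterable[str]) -> Dict[str, int]:
    """Sum the configured pool sizes per pool type (ingress / egress)."""
    totals: Dict[str, int] = {}
    for line in lines:
        entry = parse_sb_pool_line(line)
        pool_type = str(entry["type"])
        totals[pool_type] = totals.get(pool_type, 0) + int(entry["size"])
    return totals
```

Fed the eight pool lines from the example, this adds up the ingress and egress pool sizes separately, which makes it easy to compare them against the total shared buffer size reported by `devlink sb show`.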
    • J
      mlxsw: spectrum_buffers: Implement occupancy monitoring · 2d0ed39f
      Jiri Pirko committed
      Implement occupancy API introduced in devlink and mlxsw core. This is
      done by accessing SBPM register for Port-Pool and SBSR for Port-TC
      current and max occupancy values. Max clear is implemented using the
      same registers.
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Reviewed-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2d0ed39f