1. 22 Feb, 2016 15 commits
    • D
      Merge branch 'bpf-helper-improvements' · 9c572dc4
      David S. Miller committed
      Daniel Borkmann says:
      
      ====================
      BPF updates
      
      This set contains various updates for eBPF, i.e. the addition of a
      generic csum helper function and other misc bits that mostly improve
      existing helpers and ease programming with eBPF on cls_bpf. For more
      details, please see individual patches.
      
      Set is rebased on top of http://patchwork.ozlabs.org/patch/584465/.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9c572dc4
    • D
      bpf: don't emit mov A,A on return · 6205b9cf
      Daniel Borkmann committed
      While debugging with bpf_jit_disasm I noticed emissions of 'mov %eax,%eax',
      and found that this comes from BPF_RET | BPF_A translations from classic
      BPF. Emitting this is unnecessary as BPF_REG_A is mapped into BPF_REG_0
      already, therefore only emit a mov when immediates are used as return value.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6205b9cf
    • D
      bpf: fix csum update in bpf_l4_csum_replace helper for udp · 2f72959a
      Daniel Borkmann committed
      When using this helper for updating UDP checksums, we need to extend
      this in order to write CSUM_MANGLED_0 for csum computations that result
      in 0 as the sum. The reason we need this is that packets with a checksum
      could otherwise become incorrectly marked as packets without a checksum.
      Likewise, if the user indicates BPF_F_MARK_MANGLED_0, then we should
      not turn packets without a checksum into ones with a checksum.
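
      The rule above can be illustrated with a minimal userspace sketch. It is
      not the kernel helper: the function names and the had_csum parameter are
      hypothetical, and only the 0xffff value matches the kernel's
      CSUM_MANGLED_0.

```c
#include <stdint.h>

/* Same value as the kernel's CSUM_MANGLED_0; the name here is illustrative. */
#define SKETCH_CSUM_MANGLED_0 0xffffu

/* Fold a 32-bit ones'-complement accumulator to 16 bits and invert it. */
static uint16_t sketch_csum_fold(uint32_t sum)
{
    sum = (sum & 0xffff) + (sum >> 16);
    sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}

/*
 * had_csum == 0 models a UDP packet that was sent without a checksum
 * (field is 0): it stays without one. Otherwise a computed result of 0
 * is written as 0xffff so the packet is not mistaken for one without
 * a checksum.
 */
static uint16_t sketch_udp_csum(uint32_t sum, int had_csum)
{
    uint16_t folded;

    if (!had_csum)
        return 0;
    folded = sketch_csum_fold(sum);
    return folded ? folded : SKETCH_CSUM_MANGLED_0;
}
```

      An accumulator of 0xffff folds and inverts to 0, which is the case the
      mangled value exists for.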
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2f72959a
    • D
      bpf: try harder on clones when writing into skb · 3697649f
      Daniel Borkmann committed
      When we're dealing with clones and the area is not writeable, try
      harder and get a copy via pskb_expand_head(). Replace also other
      occurrences in tc actions with the new skb_try_make_writable().
      Reported-by: Ashhad Sheikh <ashhadsheikh394@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3697649f
    • D
      bpf: remove artificial bpf_skb_{load, store}_bytes buffer limitation · 21cafc1d
      Daniel Borkmann committed
      We currently limit bpf_skb_store_bytes() and bpf_skb_load_bytes()
      helpers to only store or load a maximum buffer of 16 bytes. Thus,
      loading, rewriting and storing headers require several bpf_skb_load_bytes()
      and bpf_skb_store_bytes() calls.
      
      Also here we can use a per-cpu scratch buffer instead in order to not
      pressure stack space any further. I do suspect that this limit was mainly
      set in place for this particular reason. So, ease program development
      by removing this limitation and make the scratchpad generic, so it can
      be reused.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      21cafc1d
    • D
      bpf: add generic bpf_csum_diff helper · 7d672345
      Daniel Borkmann committed
      For L4 checksums, we currently have bpf_l4_csum_replace() helper. It's
      currently limited to handle 2 and 4 byte changes in a header and feeds the
      from/to into inet_proto_csum_replace{2,4}() helpers of the kernel. When
      working with IPv6, for example, this makes it rather cumbersome to deal
      with, similarly when editing larger parts of a header.
      
      Instead, extend the API in a more generic way: For bpf_l4_csum_replace(),
      add a case for header field mask of 0 to change the checksum at a given
      offset through inet_proto_csum_replace_by_diff(), and provide a helper
      bpf_csum_diff() that can generically calculate a from/to diff for arbitrary
      amounts of data.
      
      This can be used in multiple ways: for the bpf_l4_csum_replace() only
      part, this even provides us with the option to insert precalculated diffs
      from user space f.e. from a map, or from bpf_csum_diff() during runtime.
      
      bpf_csum_diff() has an optional from/to stack buffer input, so we can
      calculate a diff by using a scratchbuffer for scenarios where we're
      inserting (from is NULL), removing (to is NULL) or diffing (from/to buffers
      don't need to be of equal size) data. Also, bpf_csum_diff() allows to
      feed a previous csum into csum_partial(), so the function can also be
      cascaded.
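
      The from/to diff idea can be sketched in plain userspace C. This is an
      illustration of ones'-complement checksum arithmetic under simplifying
      assumptions (even lengths, big-endian 16-bit words), not the kernel
      implementation; all sketch_* names and the sample headers are
      hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Sample data: the last four bytes differ between old and new header. */
static const uint8_t sketch_old_hdr[8] = {0x45, 0x00, 0x00, 0x54, 0xc0, 0xa8, 0x00, 0x01};
static const uint8_t sketch_new_hdr[8] = {0x45, 0x00, 0x00, 0x54, 0x0a, 0x00, 0x00, 0x02};

/* Reduce a 32-bit accumulator to at most 16 significant bits. */
static uint32_t sketch_fold(uint32_t s)
{
    s = (s & 0xffff) + (s >> 16);
    s = (s & 0xffff) + (s >> 16);
    return s;
}

/* Full checksum over an even-length buffer. */
static uint16_t sketch_csum_full(const uint8_t *p, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i + 1 < len; i += 2)
        sum = sketch_fold(sum + (uint32_t)((p[i] << 8) | p[i + 1]));
    return (uint16_t)~sum;
}

/*
 * Diff between "from" and "to": add the complement of every old word and
 * every new word. Either buffer may be NULL with length 0 (pure insert or
 * pure removal). "seed" allows cascading several diffs.
 */
static uint32_t sketch_csum_diff(const uint8_t *from, size_t from_len,
                                 const uint8_t *to, size_t to_len,
                                 uint32_t seed)
{
    uint32_t sum = sketch_fold(seed);
    size_t i;

    for (i = 0; i + 1 < from_len; i += 2)
        sum = sketch_fold(sum + (~(uint32_t)((from[i] << 8) | from[i + 1]) & 0xffff));
    for (i = 0; i + 1 < to_len; i += 2)
        sum = sketch_fold(sum + (uint32_t)((to[i] << 8) | to[i + 1]));
    return sum;
}

/* Apply a diff to an existing checksum field: new = ~(~old + diff). */
static uint16_t sketch_csum_apply(uint16_t check, uint32_t diff)
{
    return (uint16_t)~sketch_fold((uint32_t)(uint16_t)~check + sketch_fold(diff));
}
```

      Applying the diff of the changed bytes to the old checksum gives the
      same value as recomputing the checksum over the new header, and feeding
      one diff as the seed of the next gives the same result as a single diff
      over both words.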
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7d672345
    • D
      bpf: add new arg_type that allows for 0 sized stack buffer · 8e2fe1d9
      Daniel Borkmann committed
      Currently, when we pass a buffer from the eBPF stack into a helper
      function, the function proto indicates argument types as ARG_PTR_TO_STACK
      and ARG_CONST_STACK_SIZE pair. If R<X> contains the former, then R<X+1>
      must be of the latter type. Then, verifier checks whether the buffer
      points into eBPF stack, is initialized, etc. The verifier also guarantees
      that the constant value passed in R<X+1> is greater than 0, so helper
      functions don't need to test for it and can always assume a non-NULL
      initialized buffer as well as non-0 buffer size.
      
      This patch adds a new argument type ARG_CONST_STACK_SIZE_OR_ZERO that
      allows to also pass NULL as R<X> and 0 as R<X+1> into the helper function.
      Such helper functions, of course, need to be able to handle these cases
      internally then. Verifier guarantees that either R<X> == NULL && R<X+1> == 0
      or R<X> != NULL && R<X+1> != 0 (like the case of ARG_CONST_STACK_SIZE), any
      other combinations are not possible to load.
      
      I went through various options of extending the verifier, and introducing
      the type ARG_CONST_STACK_SIZE_OR_ZERO seems to have most minimal changes
      needed to the verifier.
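
      The two guarantees described above can be written as simple predicates.
      This is only a sketch of the accepted (pointer, size) combinations, with
      hypothetical function names; the real verifier works on register state,
      not on runtime pointers.

```c
#include <stdbool.h>
#include <stddef.h>

static char sketch_buf[16]; /* stands in for an initialized stack buffer */

/* ARG_CONST_STACK_SIZE-style pair: non-NULL buffer and non-zero size. */
static bool sketch_pair_ok(const void *buf, size_t size)
{
    return buf != NULL && size > 0;
}

/*
 * ARG_CONST_STACK_SIZE_OR_ZERO-style pair: additionally accepts
 * NULL together with size 0. Mixed combinations stay rejected.
 */
static bool sketch_pair_ok_or_zero(const void *buf, size_t size)
{
    return (buf == NULL && size == 0) || (buf != NULL && size > 0);
}
```

      A helper using the relaxed pair has to handle the NULL/0 case itself,
      for example by treating it as "no data".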
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8e2fe1d9
    • D
      Merge branch 'geneve-vxlan-outer-checksum' · 8b393f83
      David S. Miller committed
      Alexander Duyck says:
      
      ====================
      GENEVE/VXLAN: Enable outer Tx checksum by default
      
      This patch series makes it so that we enable the outer Tx checksum for IPv4
      tunnels by default.  This makes the behavior consistent with how we were
      handling this for IPv6.  In addition I have updated the internal flags for
      these tunnels so that we use a ZERO_CSUM_TX flag for IPv4 which should
      match up well with the ZERO_CSUM6_TX flag which was already in use for
      IPv6.
      
      For most network devices this should be a net gain in terms of performance
      as having the outer header checksum present allows for devices to report
      CHECKSUM_UNNECESSARY which we can then convert to CHECKSUM_COMPLETE in order
      to determine if the inner header checksum is valid.
      
      Below is some data I collected with ixgbe with an X540 that demonstrates
      this.  I located two PFs connected back to back in two different name
      spaces and then setup a pair of tunnels on each, one with checksum enabled
      and one without.
      
      Recv   Send    Send                          Utilization
      Socket Socket  Message  Elapsed              Send
      Size   Size    Size     Time     Throughput  local
      bytes  bytes   bytes    secs.    10^6bits/s  % S
      
      noudpcsum:
       87380  16384  16384    30.00      8898.67   12.80
      udpcsum:
       87380  16384  16384    30.00      9088.47   5.69
      
      The one spot where this may cause a performance regression is if the
      environment contains devices that can parse the inner headers and a device
      supports NETIF_F_GSO_UDP_TUNNEL but not NETIF_F_GSO_UDP_TUNNEL_CSUM.  In
      the case of such a device we have to fall back to using GSO to segment the
      tunnel instead of TSO and as a result we may take a performance hit as seen
      below with i40e.
      
      Recv   Send    Send                          Utilization
      Socket Socket  Message  Elapsed              Send
      Size   Size    Size     Time     Throughput  local
      bytes  bytes   bytes    secs.    10^6bits/s  % S
      
      noudpcsum:
       87380  16384  16384    30.00      9085.21   3.32
      udpcsum:
       87380  16384  16384    30.00      9089.23   5.54
      
      In addition it will be necessary to update iproute2 so that we don't
      provide the checksum attribute unless specified.  This way on older kernels
      which don't have local checksum offload we will default to disabling the
      outer checksum, and on newer kernels that have LCO we can default to
      enabling it.
      
      I also haven't investigated the effect this will have on OVS.  However I
      suspect the impact should be minimal as the worst case scenario should be
      that Tx checksumming will become enabled by default which should be
      consistent with the existing behavior for IPv6.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8b393f83
    • A
      VXLAN: Support outer IPv4 Tx checksums by default · 6ceb31ca
      Alexander Duyck committed
      This change makes it so that if UDP CSUM is not specified we will default
      to enabling it.  The main motivation behind this is the fact that with the
      use of outer checksum we can greatly improve the performance for VXLAN
      tunnels on devices that don't know how to parse tunnel headers.
      Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
      Acked-by: Tom Herbert <tom@herbertland.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6ceb31ca
    • A
      GENEVE: Support outer IPv4 Tx checksums by default · 14f1f724
      Alexander Duyck committed
      This change makes it so that if UDP CSUM is not specified we will default
      to enabling it.  The main motivation behind this is the fact that with the
      use of outer checksum we can greatly improve the performance for GENEVE
      tunnels on hardware that doesn't know how to parse them.
      Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
      Acked-by: Tom Herbert <tom@herbertland.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      14f1f724
    • D
      Merge branch 'lwt-autoload' · 417b7ca4
      David S. Miller committed
      Robert Shearman says:
      
      ====================
      lwtunnel: autoload of lwt modules
      
      Changes since v1:
       - remove "LWTUNNEL_ENCAP_" prefix for the string form of the encaps
         used when requesting the module to reduce duplication, and don't
         bother returning strings for lwt modules using netdevices, both
         suggested by Jiri.
       - update commit message of first patch to clarify security
         implications, in response to Eric's comments.
      
      The lwt implementations using net devices can autoload using the
      existing mechanism using IFLA_INFO_KIND. However, there's no mechanism
      that lwt modules not using net devices can use.
      
      Therefore, these patches add the ability to autoload modules
      registering lwt operations for lwt implementations not using a net
      device so that users don't have to manually load the modules.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      417b7ca4
    • R
      ila: autoload module · 84a8cbe4
      Robert Shearman committed
      Avoid users having to manually load the module by adding a module
      alias allowing it to be autoloaded by the lwt infra.
      Signed-off-by: Robert Shearman <rshearma@brocade.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      84a8cbe4
    • R
      mpls: autoload lwt module · b2b04edc
      Robert Shearman committed
      Avoid users having to manually load the module by adding a module
      alias allowing it to be autoloaded by the lwt infra.
      Signed-off-by: Robert Shearman <rshearma@brocade.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b2b04edc
    • R
      lwtunnel: autoload of lwt modules · 745041e2
      Robert Shearman committed
      The lwt implementations using net devices can autoload using the
      existing mechanism using IFLA_INFO_KIND. However, there's no mechanism
      that lwt modules not using net devices can use.
      
      Therefore, add the ability to autoload modules registering lwt
      operations for lwt implementations not using a net device so that
      users don't have to manually load the modules.
      
      Only users with the CAP_NET_ADMIN capability can cause modules to be
      loaded, which is ensured by rtnetlink_rcv_msg rejecting non-RTM_GETxxx
      messages for users without this capability, and by
      lwtunnel_build_state not being called in response to RTM_GETxxx
      messages.
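
      As a rough userspace illustration of the lookup this enables: an encap
      type that is not backed by a net device is mapped to a short string
      (without the "LWTUNNEL_ENCAP_" prefix), which is then used to build a
      module alias. The enum, the function names and the "rtnl-lwt-" alias
      prefix are assumptions made for this sketch, not taken from the text
      above.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical encap types for this sketch only. */
enum sketch_encap {
    SKETCH_ENCAP_NONE,
    SKETCH_ENCAP_MPLS,
    SKETCH_ENCAP_IP,   /* net-device based: no string, no autoload here */
    SKETCH_ENCAP_ILA,
};

static char sketch_alias[32]; /* scratch buffer for the examples */

/* Short string form, or NULL when the type uses the netdev mechanism. */
static const char *sketch_encap_str(enum sketch_encap type)
{
    switch (type) {
    case SKETCH_ENCAP_MPLS:
        return "MPLS";
    case SKETCH_ENCAP_ILA:
        return "ILA";
    default:
        return NULL;
    }
}

/* Build the alias that a module loader would be asked for. */
static int sketch_build_alias(enum sketch_encap type, char *buf, size_t len)
{
    const char *name = sketch_encap_str(type);

    if (!name)
        return -1;
    snprintf(buf, len, "rtnl-lwt-%s", name); /* prefix is an assumption */
    return 0;
}
```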
      Signed-off-by: Robert Shearman <rshearma@brocade.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      745041e2
    • Z
      vlan: turn on unicast filtering on vlan device · e817af27
      Zhang Shengju committed
      Currently vlan device inherits unicast filtering flag from underlying
      device. If underlying device doesn't support unicast filter, this will
      put vlan device into promiscuous mode when it's stacked.
      
      Turn on IFF_UNICAST_FLT on the vlan device in any case so that it does
      not go into promiscuous mode needlessly. If underlying device does not
      support unicast filtering, that device will enter promiscuous mode.
      Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e817af27
  2. 20 Feb, 2016 24 commits
    • D
      Merge branch 'bpf-get-stackid' · 80c804bf
      David S. Miller committed
      Alexei Starovoitov says:
      
      ====================
      bpf_get_stackid() and stack_trace map
      
      This patch set introduces new map type to store stack traces and
      corresponding bpf_get_stackid() helper.
      BPF programs already can walk the stack via unrolled loop
      of bpf_probe_read()s which is ok for simple analysis, but it's
      not efficient and limited to <30 frames; beyond that the programs
      don't fit into MAX_BPF_STACK. With bpf_get_stackid() helper
      the programs can collect up to PERF_MAX_STACK_DEPTH both
      user and kernel frames.
      Using stack traces as a key in a map turned out to be very useful
      for generating flame graphs, off-cpu graphs, waker and chain graphs.
      Patch 3 is a simplified version of 'offwaketime' tool which is
      described in detail here:
      http://brendangregg.com/blog/2016-02-01/linux-wakeup-offwake-profiling.html
      
      Earlier versions of this patch were using save_stack_trace() helper,
      but 'unreliable' frames add too much noise and two equivalent
      stack traces produce different 'stackid's.
      Using lockdep style of storing frames with MAX_STACK_TRACE_ENTRIES is
      great for lockdep, but not acceptable for bpf, since the stack_trace
      map needs to be freed when the user hits Ctrl-C on the tool.
      The ftrace style with per_cpu(struct ftrace_stack) is great, but it's
      tightly coupled with ftrace ring buffer and has the same 'unreliable'
      noise. perf_event's perf_callchain() mechanism is also very efficient
      and it only needed minor generalization which is done in patch 1
      to be used by bpf stack_trace maps.
      Peter, please take a look at patch 1.
      If you're ok with it, I'd like to take the whole set via net-next.
      
      Patch 1 - generalization of perf_callchain()
      Patch 2 - stack_trace map done as lock-less hashtable without link list
        to avoid spinlock on insertion which is critical path when
        bpf_get_stackid() helper is called for every task switch event
      Patch 3 - offwaketime example
      
      After the patch, the 'perf report' for an artificial 'sched_bench'
      benchmark that does pthread_cond_wait/signal while the 'offwaketime'
      example is running in the background:
       16.35%  swapper      [kernel.vmlinux]    [k] intel_idle
        2.18%  sched_bench  [kernel.vmlinux]    [k] __switch_to
        2.18%  sched_bench  libpthread-2.12.so  [.] pthread_cond_signal@@GLIBC_2.3.2
        1.72%  sched_bench  libpthread-2.12.so  [.] pthread_mutex_unlock
        1.53%  sched_bench  [kernel.vmlinux]    [k] bpf_get_stackid
        1.44%  sched_bench  [kernel.vmlinux]    [k] entry_SYSCALL_64
        1.39%  sched_bench  [kernel.vmlinux]    [k] __call_rcu.constprop.73
        1.13%  sched_bench  libpthread-2.12.so  [.] pthread_mutex_lock
        1.07%  sched_bench  libpthread-2.12.so  [.] pthread_cond_wait@@GLIBC_2.3.2
        1.07%  sched_bench  [kernel.vmlinux]    [k] hash_futex
        1.05%  sched_bench  [kernel.vmlinux]    [k] do_futex
        1.05%  sched_bench  [kernel.vmlinux]    [k] get_futex_key_refs.isra.13
      
      The hottest part of bpf_get_stackid() is inlined jhash2, so we may consider
      using some faster hash in the future, but it's good enough for now.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      80c804bf
    • A
      samples/bpf: offwaketime example · a6ffe7b9
      Alexei Starovoitov committed
      This is a simplified version of Brendan Gregg's offwaketime:
      This program shows kernel stack traces and task names that were blocked and
      "off-CPU", along with the stack traces and task names for the threads that woke
      them, and the total elapsed time from when they blocked to when they were woken
      up. The combined stacks, task names, and total time are summarized in kernel
      context for efficiency.
      
      Example:
      $ sudo ./offwaketime | flamegraph.pl > demo.svg
      Open demo.svg in the browser as FlameGraph visualization.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a6ffe7b9
    • A
      bpf: introduce BPF_MAP_TYPE_STACK_TRACE · d5a3b1f6
      Alexei Starovoitov committed
      add new map type to store stack traces and corresponding helper
      bpf_get_stackid(ctx, map, flags) - walk user or kernel stack and return id
      @ctx: struct pt_regs*
      @map: pointer to stack_trace map
      @flags: bits 0-7 - number of stack frames to skip
              bit 8 - collect user stack instead of kernel
              bit 9 - compare stacks by hash only
              bit 10 - if two different stacks hash into the same stackid
                       discard old
              other bits - reserved
      Return: >= 0 stackid on success or negative error
      
      stackid is a 32-bit integer handle that can be further combined with
      other data (including other stackid) and used as a key into maps.
      
      Userspace will access stackmap using standard lookup/delete syscall commands to
      retrieve full stack trace for given stackid.
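
      The @flags layout listed above can be decoded as in the following
      sketch. The macro and function names are hypothetical and only mirror
      the bit positions given in the description; rejecting reserved bits is
      an assumption about how "reserved" is enforced.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_SKIP_MASK    0xffULL       /* bits 0-7: frames to skip */
#define SKETCH_USER_STACK   (1ULL << 8)   /* bit 8: user instead of kernel stack */
#define SKETCH_HASH_ONLY    (1ULL << 9)   /* bit 9: compare stacks by hash only */
#define SKETCH_DISCARD_OLD  (1ULL << 10)  /* bit 10: on collision, discard old */
#define SKETCH_ALL_FLAGS    (SKETCH_SKIP_MASK | SKETCH_USER_STACK | \
                             SKETCH_HASH_ONLY | SKETCH_DISCARD_OLD)

/* Reserved bits must be zero (assumed behaviour for this sketch). */
static bool sketch_flags_valid(uint64_t flags)
{
    return (flags & ~SKETCH_ALL_FLAGS) == 0;
}

static unsigned int sketch_skip_frames(uint64_t flags)
{
    return (unsigned int)(flags & SKETCH_SKIP_MASK);
}

static bool sketch_wants_user_stack(uint64_t flags)
{
    return (flags & SKETCH_USER_STACK) != 0;
}

static bool sketch_hash_only(uint64_t flags)
{
    return (flags & SKETCH_HASH_ONLY) != 0;
}
```

      For example, a caller that wants the user stack with the three
      innermost frames skipped would pass SKETCH_USER_STACK | 3 in this
      sketch.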
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d5a3b1f6
    • A
      perf: generalize perf_callchain · 568b329a
      Alexei Starovoitov committed
      . avoid walking the stack when there is no room left in the buffer
      . generalize get_perf_callchain() to be called from bpf helper
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      568b329a
    • D
      net: use skb_postpush_rcsum instead of own implementations · 6b83d28a
      Daniel Borkmann committed
      Replace individual implementations with the recently introduced
      skb_postpush_rcsum() helper.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Tom Herbert <tom@herbertland.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6b83d28a
    • A
      phy: marvell/micrel: Fix Unpossible condition · 321b4d4b
      Andrew Lunn committed
      commit 2b2427d0 ("phy: micrel: Add ethtool statistics counters")
      from Dec 30, 2015, leads to the following static checker
      warning:
      
              drivers/net/phy/micrel.c:609 kszphy_get_stat()
              warn: unsigned 'val' is never less than zero.
      
      drivers/net/phy/micrel.c
         602  static u64 kszphy_get_stat(struct phy_device *phydev, int i)
         603  {
         604          struct kszphy_hw_stat stat = kszphy_hw_stats[i];
         605          struct kszphy_priv *priv = phydev->priv;
         606          u64 val;
         607
         608          val = phy_read(phydev, stat.reg);
         609          if (val < 0) {
                          ^^^^^^^
      Unpossible!
      
         610                  val = UINT64_MAX;
         611          } else {
         612                  val = val & ((1 << stat.bits) - 1);
         613                  priv->stats[i] += val;
         614                  val = priv->stats[i];
         615          }
         616
         617          return val;
         618  }
      
      The same problem exists in the Marvell driver. Fix both.
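
      One way to avoid this class of bug, shown as a userspace sketch rather
      than the actual driver change: keep the raw register read in a signed
      int, test it for an error, and only then widen it to the unsigned
      counter. The function name and parameters are hypothetical, and the
      sketch assumes bits is less than 32.

```c
#include <stdint.h>

static uint64_t sketch_total; /* stands in for priv->stats[i] */

/*
 * raw:   result of a register read; negative means the read failed
 * bits:  width of the hardware counter (assumed < 32)
 * total: running 64-bit total for this counter
 */
static uint64_t sketch_accumulate(int raw, unsigned int bits, uint64_t *total)
{
    if (raw < 0)                 /* the check is meaningful on a signed value */
        return UINT64_MAX;

    *total += (uint64_t)raw & ((1u << bits) - 1);
    return *total;
}
```

      With the value held in a u64 before the comparison, as in the excerpt
      above, the error branch could never be taken.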
      
      Fixes: 2b2427d0 ("phy: micrel: Add ethtool statistics counters")
      Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
      Reported-by: Julia.Lawall <julia.lawall@lip6.fr>
      Signed-off-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      321b4d4b
    • D
      Merge branch 'ethtool-perqueue-params' · 2f860177
      David S. Miller committed
      Kan Liang says:
      
      ====================
      ethtool per queue parameters support
      
      Modern network interface controllers usually support multiple receive
      and transmit queues. Each queue may have its own parameters. For
      example, Intel XL710/X710 hardware supports per queue interrupt
      moderation. However, current ethtool does not support per queue
      parameters option. User has to set parameters for the whole NIC.
      This series extends ethtool to support per queue parameters option.
      
      Since the support of per queue parameters varies with different cards,
      it is impossible to address all cards in one patch. This series only
      supports per queue coalesce options on i40e driver. The framework used
      in the patch can be easily extended to other cards and parameters.
      
      The lib bitmap needs to be extended to facilitate exchanging queue bitmaps
      between user space and kernel space. Two patches from David's latest V8
      patch series are also cited in this series. You may refer to
      https://lkml.org/lkml/2016/2/9/919 for more details.
      
      Changes since V6:
       - Rebase on commit 76d13b56. Did minor change in patch 6.
      
      Changes since V5:
       - Add test_bitmap.c and bitmap.sh to the series. They were mistakenly
         left out previously.
       - Update the first two patches to David's latest V8 version. The changes
         include
            - bitmap u32 API returns number of bits copied, unit tests updated
            - module_exit in test_bitmap
       - Also change the mode of bitmap.sh to 755 according to Ben's suggestion
      
      Changes since V4:
       - Modify set/get_per_queue_coalesce function description
       - Change the queue number to be u32
       - Correct an error of calculating coalesce backup buffer address
       - Rename queue_num to n_queues
       - Don't log error message in __i40e_get_coalesce
      
      Changes since V3:
       - Based on David's lib bitmap.
       - ETHTOOL_PERQUEUE should be handled before the containing switch
       - Make the rollback code unconditional
       - some minor changes according to Ben's feedback
      
      Changes since V2:
       - Add queue-specific settings for interrupt moderation in i40e
      
      Changes since V1:
       - Checking the sub-command number to determine whether the command
         requires CAP_NET_ADMIN
       - Refine the struct ethtool_per_queue_op and improve the comments
       - Use bitmap functions to parse queue mask
       - Improve comments
       - Add rollback support
       - Correct the way to find the vector for specific queue.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2f860177
    • K
      i40e/ethtool: support coalesce setting by queue · f3757a4d
      Kan Liang committed
      This patch implements set_per_queue_coalesce for i40e driver.
      Signed-off-by: Kan Liang <kan.liang@intel.com>
      Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f3757a4d
    • K
      i40e/ethtool: support coalesce getting by queue · be280bad
      Kan Liang committed
      This patch implements get_per_queue_coalesce for i40e driver.
      Signed-off-by: Kan Liang <kan.liang@intel.com>
      Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      be280bad
    • K
      i40e: queue-specific settings for interrupt moderation · a75e8005
      Kan Liang committed
      For the i40e driver, each vector has its own ITR register. However, there
      is no concept of queue-specific settings in the driver proper. Only
      a global variable is used to store ITR values. That will cause problems,
      especially when resetting the vector: the specific ITR values could be
      lost.
      This patch moves rx_itr_setting and tx_itr_setting to i40e_ring to store
      the specific ITR setting for each queue.
      i40e_get_coalesce and i40e_set_coalesce are also modified accordingly to
      support queue-specific settings. To stay compatible with old ethtool,
      if the user doesn't specify the queue number, i40e_get_coalesce will return
      queue 0's value, while i40e_set_coalesce will apply the value to all queues.
      Signed-off-by: Kan Liang <kan.liang@intel.com>
      Acked-by: Shannon Nelson <shannon.nelson@intel.com>
      Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a75e8005
    • K
      net/ethtool: support set coalesce per queue · f38d138a
      Kan Liang committed
      This patch implements sub command ETHTOOL_SCOALESCE for ioctl
      ETHTOOL_PERQUEUE. It introduces an interface set_per_queue_coalesce to
      set the coalesce parameters of each masked queue in the device driver. The
      requested coalesce settings for each masked queue are stored in "data",
      which is copied from userspace.
      If setting coalesce in the device driver fails, the values already
      applied to earlier queues are rolled back on a best-effort basis.
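
      The set-with-rollback pattern can be sketched in userspace as follows.
      This is not the ethtool code: the fake driver callback, the single
      usecs value and all sketch_* names are hypothetical, and a real
      implementation carries a full coalesce structure per queue.

```c
#include <stdint.h>

#define SKETCH_NQUEUES 8

static uint32_t sketch_queue_usecs[SKETCH_NQUEUES]; /* fake per-queue state */
static int sketch_fail_queue = -1;                  /* queue that rejects writes */

/* Fake driver callback: fails for one configurable queue. */
static int sketch_driver_set(unsigned int q, uint32_t usecs)
{
    if ((int)q == sketch_fail_queue)
        return -1;
    sketch_queue_usecs[q] = usecs;
    return 0;
}

/* Apply usecs to every queue in mask; undo earlier queues on failure. */
static int sketch_set_per_queue(uint32_t mask, uint32_t usecs)
{
    uint32_t backup[SKETCH_NQUEUES];
    unsigned int q, r;
    int err;

    for (q = 0; q < SKETCH_NQUEUES; q++) {
        if (!(mask & (1u << q)))
            continue;
        backup[q] = sketch_queue_usecs[q];   /* remember the old value */
        err = sketch_driver_set(q, usecs);
        if (err) {
            /* Restore the queues that were already changed. */
            for (r = 0; r < q; r++)
                if (mask & (1u << r))
                    sketch_driver_set(r, backup[r]);
            return err;
        }
    }
    return 0;
}
```

      Saving each old value before writing keeps the rollback independent of
      whether the driver can report its previous state after a failure.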
      Signed-off-by: Kan Liang <kan.liang@intel.com>
      Reviewed-by: Ben Hutchings <ben@decadent.org.uk>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f38d138a
    • K
      net/ethtool: support get coalesce per queue · 421797b1
      Kan Liang committed
      This patch implements sub command ETHTOOL_GCOALESCE for ioctl
      ETHTOOL_PERQUEUE. It introduces an interface get_per_queue_coalesce to
      get coalesce of each masked queue from device driver. Then the interrupt
      coalescing parameters will be copied back to user space one by one.
      Signed-off-by: Kan Liang <kan.liang@intel.com>
      Reviewed-by: Ben Hutchings <ben@decadent.org.uk>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      421797b1
    • K
      net/ethtool: introduce a new ioctl for per queue setting · ac2c7ad0
      Kan Liang committed
      Introduce a new ioctl ETHTOOL_PERQUEUE for per queue parameters setting.
      The following patches will enable some SUB_COMMANDs for per queue
      setting.
      Signed-off-by: Kan Liang <kan.liang@intel.com>
      Reviewed-by: Ben Hutchings <ben@decadent.org.uk>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ac2c7ad0
    • D
      test_bitmap: unit tests for lib/bitmap.c · 5fd003f5
      David Decotigny committed
      This is mainly testing bitmap construction and conversion to/from u32[]
      for now.
      
      Tested:
        qemu i386, x86_64, ppc, ppc64 BE and LE, ARM.
      Signed-off-by: David Decotigny <decot@googlers.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5fd003f5
    • D
      lib/bitmap.c: conversion routines to/from u32 array · e52bc7c2
      David Decotigny committed
      Aimed at transferring bitmaps to/from user-space in a 32/64-bit agnostic
      way.
      
      Tested:
        unit tests (next patch) on qemu i386, x86_64, ppc, ppc64 BE and LE,
        ARM.
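
      A 32/64-bit agnostic conversion can be sketched bit by bit in userspace.
      The names below are hypothetical and the loop is deliberately simple;
      it is meant to show the idea (the result does not depend on
      sizeof(unsigned long)), not the kernel routine. Returning the number of
      bits copied follows the behaviour mentioned in the series changelog.

```c
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

static unsigned long sketch_map[4]; /* sample source bitmap, starts zeroed */
static uint32_t sketch_out[2];      /* sample destination array */

static void sketch_set_bit(unsigned long *map, size_t bit)
{
    map[bit / SKETCH_BITS_PER_LONG] |= 1UL << (bit % SKETCH_BITS_PER_LONG);
}

/*
 * Copy up to nbits bits from an unsigned long bitmap into an array of
 * u32 words. Bit i always lands in dst[i / 32], bit (i % 32), whatever
 * the native long size is. Returns the number of bits copied.
 */
static size_t sketch_bitmap_to_u32(uint32_t *dst, size_t dst_words,
                                   const unsigned long *src, size_t nbits)
{
    size_t i;

    if (nbits > dst_words * 32)
        nbits = dst_words * 32;     /* never write past the destination */
    for (i = 0; i < dst_words; i++)
        dst[i] = 0;
    for (i = 0; i < nbits; i++)
        if (src[i / SKETCH_BITS_PER_LONG] & (1UL << (i % SKETCH_BITS_PER_LONG)))
            dst[i / 32] |= (uint32_t)1 << (i % 32);
    return nbits;
}
```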
      Signed-off-by: David Decotigny <decot@googlers.com>
      Reviewed-by: Ben Hutchings <ben@decadent.org.uk>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e52bc7c2
    • S
      hv_netvsc: add software transmit timestamp support · 76d13b56
      sixiao@microsoft.com committed
      Enable skb_tx_timestamp in hyperv netvsc.
      Signed-off-by: Simon Xiao <sixiao@microsoft.com>
      Reviewed-by: K. Y. Srinivasan <kys@microsoft.com>
      Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      76d13b56
    • W
      ipv6: pass up EMSGSIZE msg for UDP socket in Ipv6 · e0d8c1b7
      Wei Wang committed
      In ipv4, when the machine receives an ICMP_FRAG_NEEDED message, the
      connected UDP socket will get an EMSGSIZE message on its next read from the
      socket.
      However, this is not the case for ipv6.
      This fix modifies the udp err handler in Ipv6 for ICMP6_PKT_TOOBIG to
      make it similar to ipv4 behavior. That is, when the machine gets an
      ICMP6_PKT_TOOBIG message, the connected UDP socket will get an EMSGSIZE
      message on its next read from the socket.
      Signed-off-by: Wei Wang <weiwan@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e0d8c1b7
    • P
      be2net: Fix pcie error recovery in case of NIC+RoCE adapters · 68f22793
      Padmanabh Ratnakar committed
      Interrupts registered by RoCE driver are not unregistered when
      msix interrupts are disabled during error recovery causing a
      crash. Detach the adapter instance from RoCE driver when error
      is detected to complete the cleanup. Attach the driver again after
      the adapter is recovered from error.
      Signed-off-by: Padmanabh Ratnakar <padmanabh.ratnakar@broadcom.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      68f22793
    • S
      net: macb: make magic-packet property generic · 7c4a1d0c
      Sergio Prado committed
      As requested by Rob Herring on patch
      https://patchwork.ozlabs.org/patch/580862/.
      
      This is a new property that is still in net-next and has never been
      used in production, so we are not breaking anything with the
      incompatible binding change.
      Signed-off-by: Sergio Prado <sergio.prado@e-labworks.com>
      Acked-by: Rob Herring <robh@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7c4a1d0c
    • D
      Merge branch 'bridge-mdb-attrs' · ef240c10
      David S. Miller committed
      Nikolay Aleksandrov says:
      
      ====================
      bridge: mdb: add support for extended attributes
      
      This small set makes it possible to extend the attributes exported per
      mdb entry. Before this set, only a structure was exported, and it could
      not be changed without breaking user-space. After it, we extend the
      attribute that carried the structure and add per-mdb entry attributes
      after the struct (see patch 02 for more details). Note that the reason
      we can't simply add an attribute after MDBA_MDB_ENTRY_INFO is that current
      users (e.g. iproute2) walk over the attribute list directly without
      checking for the attribute type.
      Patch 01 is a simple change that removes one indentation level in order
      to avoid lines over 80 characters.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ef240c10
    • N
      bridge: mdb: add support for more attributes and export timer · 21257156
      Nikolay Aleksandrov committed
      Currently mdb entries are exported directly as a structure inside the
      MDBA_MDB_ENTRY_INFO attribute, so we can't really extend it without
      breaking user-space. In order to export new mdb fields, I've converted
      MDBA_MDB_ENTRY_INFO into a nested attribute which starts like before
      with struct br_mdb_entry (without header, as it's cast directly in
      iproute2) and continues with MDBA_MDB_EATTR_ attributes. This way we
      keep compatibility with older users and can export new data.
      I've tested this with iproute2, both with and without support for the
      added attribute, and it works fine.
      So basically we again have MDBA_MDB_ENTRY_INFO with struct br_mdb_entry
      inside, but it may also contain additional MDBA_MDB_EATTR_ attributes
      such as MDBA_MDB_EATTR_TIMER, which can be parsed by user-space.
      
      So the new structure is:
      [MDBA_MDB] = {
           [MDBA_MDB_ENTRY] = {
               [MDBA_MDB_ENTRY_INFO]
               [MDBA_MDB_ENTRY_INFO] { <- Nested attribute
                   struct br_mdb_entry <- nla_put_nohdr()
                   [MDBA_MDB_ENTRY attributes] <- normal netlink attributes
               }
           }
      }
      Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      21257156
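The layout described in this commit (a headerless struct br_mdb_entry followed by normal netlink attributes inside MDBA_MDB_ENTRY_INFO) can be illustrated with a small user-space parsing sketch. This is not iproute2 code: `split_entry_info` is a hypothetical helper, and `entry_size` must be supplied by the caller as sizeof(struct br_mdb_entry) for the headers in use. The sketch assumes the trailing attributes start at the 4-byte-aligned end of the struct and use the standard netlink attribute header (16-bit length, 16-bit type, host byte order).

```python
import struct

NLA_HDR = struct.Struct("=HH")  # nla_len, nla_type in host byte order
NLA_ALIGNTO = 4                 # netlink attributes are 4-byte aligned


def nla_align(n):
    """Round n up to the next multiple of NLA_ALIGNTO."""
    return (n + NLA_ALIGNTO - 1) & ~(NLA_ALIGNTO - 1)


def split_entry_info(payload, entry_size):
    """Split an MDBA_MDB_ENTRY_INFO payload into (entry_bytes, attrs).

    entry_size is an assumption supplied by the caller:
    sizeof(struct br_mdb_entry). attrs maps attribute type to raw bytes.
    """
    entry = payload[:entry_size]  # what an older reader casts directly
    attrs = {}
    off = nla_align(entry_size)   # extended attributes follow the struct
    while off + NLA_HDR.size <= len(payload):
        nla_len, nla_type = NLA_HDR.unpack_from(payload, off)
        if nla_len < NLA_HDR.size or off + nla_len > len(payload):
            break  # malformed or truncated attribute, stop parsing
        attrs[nla_type] = payload[off + NLA_HDR.size:off + nla_len]
        off += nla_align(nla_len)
    return entry, attrs
```

A payload with nothing after the struct yields an empty attribute dictionary, while a reader that only looks at the first `entry_size` bytes sees the same struct either way, which is the compatibility property the commit message relies on.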
    • N
      bridge: mdb: reduce the indentation level in br_mdb_fill_info · 76cc173d
      Nikolay Aleksandrov committed
      Switch the port check around and skip if it's NULL; this allows us to
      remove one indentation level.
      Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      76cc173d
    • S
      bpf: grab rcu read lock for bpf_percpu_hash_update · 6bbd9a05
      Sasha Levin committed
      bpf_percpu_hash_update() expects rcu lock to be held and warns if it's not,
      which pointed out a missing rcu read lock.
      
      Fixes: 15a07b33 ("bpf: add lookup/update support for per-cpu hash and array maps")
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6bbd9a05
    • D
      Merge branch '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue · dfa2eb86
      David S. Miller committed
      Jeff Kirsher says:
      
      ====================
      40GbE Intel Wired LAN Driver Updates 2016-02-19
      
      This series contains updates to i40e/i40evf only.
      
      Alex Duyck splits up the descriptor count function from the function that
      stops the ring to have access to the descriptor count used for the data
      portion of the frame.  Then rewrites the logic for how we determine if we
      can transmit the frame or if it needs to be linearized.  Places the checksum
      close to TSO since they have a lot in common and it can help to reduce the
      decision tree for how to handle the frame as the first check in TSO is to
      see if checksumming is offloaded.

      Carolyn adds functions to blink LEDs on devices using 10GBaseT PHY since
      MAC registers used in other designs do not work in this device configuration.
      Fixes an issue where a previously removed message has returned.

      Kevin increases the timeout when checking the GLGEN_RSTAT_DEVSTATE bit
      since, when linking with particular PHY types, the time it takes for
      GLGEN_RSTAT_DEVSTATE to be set increases greatly.
      
      Neerav changes the receive queues to not wait to be disabled before DCB
      has been reconfigured, like transmit queues.
      
      Anjali adds new register definitions for programming the parser, flow
      director and RSS blocks in the hardware.
      
      Shannon adds the new opcodes and structures used for asking the firmware
      to update receive control registers that need extra care when being
      accessed while under heavy traffic.  Integrates the new AdminQ functions
      for safely accessing the receive control registers that may be affected
      by heavy small packet traffic.
      
      Mitch provides another colorful patch description on letting go of
      the stale local VSI pointer when the VF resets.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      dfa2eb86
  3. 19 Feb, 2016 1 commit