1. 06 Feb, 2016 15 commits
    • A
      bpf: add lookup/update support for per-cpu hash and array maps · 15a07b33
      Alexei Starovoitov committed
      The functions bpf_map_lookup_elem(map, key, value) and
      bpf_map_update_elem(map, key, value, flags) need to get/set
      values from all-cpus for per-cpu hash and array maps,
      so that user space can aggregate/update them as necessary.
      
      Example of single counter aggregation in user space:
        unsigned int nr_cpus = sysconf(_SC_NPROCESSORS_CONF);
        long values[nr_cpus];
        long value = 0;
        unsigned int i;
      
        bpf_lookup_elem(fd, key, values);
        for (i = 0; i < nr_cpus; i++)
          value += values[i];
      
      User space must provide an array of round_up(value_size, 8) * nr_cpus
      bytes to get/set values, since the kernel copies per-cpu values as
      'long' words to try to copy good counters atomically.
      This is best-effort, since bpf programs and user space are racing
      to access the same memory.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      15a07b33
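The buffer sizing and aggregation described in this commit can be sketched in plain C. This is a minimal sketch, not the kernel or libbpf code: the helper names `round_up8`, `percpu_buf_size` and `sum_percpu_u64` are hypothetical, and the sketch assumes each per-cpu slot holds a single 8-byte counter, so that the slot stride equals round_up(value_size, 8).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Round x up to the next multiple of 8, as round_up(value_size, 8) does. */
static size_t round_up8(size_t x)
{
	return (x + 7) & ~(size_t)7;
}

/* Bytes user space must provide for one per-cpu lookup or update. */
static size_t percpu_buf_size(size_t value_size, unsigned int nr_cpus)
{
	assert(nr_cpus > 0);
	return round_up8(value_size) * nr_cpus;
}

/*
 * Sum one 8-byte counter across all per-cpu slots.
 * Assumption: 'values' was filled by a per-cpu map lookup (not shown here).
 */
static uint64_t sum_percpu_u64(const uint64_t *values, unsigned int nr_cpus)
{
	uint64_t total = 0;
	unsigned int i;

	for (i = 0; i < nr_cpus; i++)
		total += values[i];
	return total;
}
```

With a 4-byte value and 4 possible CPUs, `percpu_buf_size(4, 4)` gives 32 bytes, because each slot is padded to 8 bytes.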
    • A
      bpf: introduce BPF_MAP_TYPE_PERCPU_ARRAY map · a10423b8
      Alexei Starovoitov committed
      The primary use case is a latency histogram, where a bpf program
      computes the latency of block requests or other events and stores
      a histogram of that latency in an array of 64 elements.
      All cpus are constantly running, so a normal increment is not accurate
      and bpf_xadd causes cache ping-pong; this per-cpu approach allows
      the fastest collision-free counters.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a10423b8
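The histogram use case can be illustrated with a user-space simulation. This is only a sketch under stated assumptions: the log2 bucketing, the simulated CPU count `NR_CPUS_SIM`, and the helpers `log2_slot`, `record_latency` and `slot_total` are hypothetical and not taken from the commit. Only the 64-element array size comes from the message above.

```c
#include <assert.h>
#include <stdint.h>

#define NR_SLOTS    64 /* histogram size named in the commit message */
#define NR_CPUS_SIM 4  /* hypothetical CPU count for this simulation */

/* One private histogram per simulated CPU, so increments never collide. */
static uint64_t hist[NR_CPUS_SIM][NR_SLOTS];

/* Slot index is floor(log2(latency)); latencies 0 and 1 both map to slot 0. */
static unsigned int log2_slot(uint64_t latency)
{
	unsigned int slot = 0;

	while (latency >>= 1)
		slot++;
	return slot;
}

/* What a program running on 'cpu' would do: a plain, non-atomic increment. */
static void record_latency(unsigned int cpu, uint64_t latency)
{
	assert(cpu < NR_CPUS_SIM);
	hist[cpu][log2_slot(latency)]++;
}

/* What the reader would do: add up one slot across all CPUs. */
static uint64_t slot_total(unsigned int slot)
{
	uint64_t total = 0;
	unsigned int cpu;

	assert(slot < NR_SLOTS);
	for (cpu = 0; cpu < NR_CPUS_SIM; cpu++)
		total += hist[cpu][slot];
	return total;
}
```

Because a 64-bit latency has at most 64 significant bits, `log2_slot` never returns more than 63, which is one reason a 64-element array fits this kind of bucketing.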
    • A
      bpf: introduce BPF_MAP_TYPE_PERCPU_HASH map · 824bd0ce
      Alexei Starovoitov committed
      Introduce the BPF_MAP_TYPE_PERCPU_HASH map type, which is used to keep
      accurate counters without the BPF_XADD instruction, which turned
      out to be too costly for high-performance network monitoring.
      In the typical use case the 'key' is the flow tuple or another
      long-lived object that sees a lot of events per second.
      
      bpf_map_lookup_elem() returns per-cpu area.
      Example:
      struct {
        u32 packets;
        u32 bytes;
      } * ptr = bpf_map_lookup_elem(&map, &key);
      /* ptr points to this_cpu area of the value, so the following
       * increments will not collide with other cpus
       */
      ptr->packets++;
      ptr->bytes += skb->len;
      
      bpf_update_elem() atomically creates a new element where all per-cpu
      values are zero initialized and the this_cpu value is populated with
      the given 'value'.
      Note that the non-per-cpu hash map always allocates a new element
      and then deletes the old one after an RCU grace period to maintain
      atomicity of the update. The per-cpu hash map updates element values
      in-place.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      824bd0ce
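The update semantics described in this commit (zero all per-cpu values on creation, then write only the current CPU's value) can be modelled in user space. This is a hedged sketch, not the kernel implementation: `struct percpu_elem`, `SIM_CPUS`, `elem_create` and `elem_update_inplace` are hypothetical names, and the element layout is an assumption made for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SIM_CPUS 4 /* hypothetical CPU count for this simulation */

/* Same shape as the value in the commit's example. */
struct flow_stats {
	uint32_t packets;
	uint32_t bytes;
};

/* Hypothetical model of one map element: one value per CPU. */
struct percpu_elem {
	struct flow_stats cpu[SIM_CPUS];
};

/* New element: every per-cpu value is zeroed, then this_cpu gets 'value'. */
static void elem_create(struct percpu_elem *e, unsigned int this_cpu,
			const struct flow_stats *value)
{
	assert(this_cpu < SIM_CPUS);
	memset(e, 0, sizeof(*e));
	e->cpu[this_cpu] = *value;
}

/* Existing element: only this_cpu's value is overwritten, in place. */
static void elem_update_inplace(struct percpu_elem *e, unsigned int this_cpu,
				const struct flow_stats *value)
{
	assert(this_cpu < SIM_CPUS);
	e->cpu[this_cpu] = *value;
}
```

In this model an in-place update on CPU 2 leaves the counters that CPU 0 accumulated untouched, which is the property that makes per-cpu counters collision-free.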
    • K
      ethtool: Declare netdev_rss_key as __read_mostly. · ba905f5e
      Kim Jones committed
      netdev_rss_key is written to once and thereafter is read by
      drivers when they are initialising. The fact that it is mostly
      read and not written to makes it a candidate for a __read_mostly
      declaration.
      Signed-off-by: Kim Jones <kim-marie.jones@intel.com>
      Signed-off-by: Alan Carey <alan.carey@intel.com>
      Acked-by: Rami Rosen <rami.rosen@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ba905f5e
    • D
      Merge branch 'tcp_fast_open_synack_fin' · ef449678
      David S. Miller committed
      Eric Dumazet says:
      
      ====================
      tcp: fastopen: accept data/FIN present in SYNACK
      
      Implements RFC 7413 (TCP Fast Open) 4.2.2, accepting payload and/or FIN
      in SYNACK messages, and prepares for removal of the SYN flag test in
      tcp_recvmsg().
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ef449678
    • E
      tcp: do not enqueue skb with SYN flag · 9d691539
      Eric Dumazet committed
      If we remove the SYN flag from the skbs that tcp_fastopen_add_skb()
      places in socket receive queue, then we can remove the test that
      tcp_recvmsg() has to perform in fast path.
      
      All we have to do is to adjust SEQ in the slow path.
      
      For the moment, we place an unlikely() and output a message
      if we find an skb with the SYN flag set.
      The goal is to eventually get rid of the test completely.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9d691539
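The idea of moving the SYN handling out of the fast path can be sketched as follows. This is an illustrative model only, not the kernel code: `struct seg` and the three helpers are hypothetical. It relies on the fact that a SYN occupies one sequence number, so a reader that sees a segment with SYN set must subtract one when computing the payload offset, unless the sequence number was already advanced when the segment was queued.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the relevant fields of a queued skb. */
struct seg {
	uint32_t seq;     /* sequence number of the first octet (the SYN, if set) */
	int      has_syn; /* nonzero while the SYN flag is still set */
};

/* Fast path with the per-segment test: SYN consumes one sequence number. */
static uint32_t offset_with_syn_test(const struct seg *s, uint32_t copied_seq)
{
	uint32_t offset = copied_seq - s->seq;

	if (s->has_syn)
		offset--;
	return offset;
}

/* Slow path, done once at enqueue time: clear SYN and advance seq past it. */
static void strip_syn(struct seg *s)
{
	if (s->has_syn) {
		s->has_syn = 0;
		s->seq++;
	}
}

/* Fast path after stripping: no flag test is needed any more. */
static uint32_t offset_no_test(const struct seg *s, uint32_t copied_seq)
{
	assert(!s->has_syn);
	return copied_seq - s->seq;
}
```

Both paths yield the same payload offset for the same segment, which is why the adjustment can be paid once when the skb is queued instead of on every read.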
    • E
      tcp: fastopen: accept data/FIN present in SYNACK message · 61d2bcae
      Eric Dumazet committed
      RFC 7413 (TCP Fast Open) 4.2.2 states that the SYNACK message
      MAY include data and/or FIN.
      
      This patch adds support for the client side:
      
      If we receive a SYNACK with payload or FIN, queue the skb instead
      of ignoring it.
      
      Since we already support the same for SYN, we refactor the existing
      code and reuse it. Note we need to clone the skb, so this operation
      might fail under memory pressure.
      
      Sara Dickinson pointed out that the FreeBSD server Fast Open
      implementation was planned to generate such SYNACKs in the future.
      
      The server side might be implemented on Linux later.
      Reported-by: Sara Dickinson <sara@sinodun.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      61d2bcae
    • D
      Merge branch 'rx_nohandler' · df03288b
      David S. Miller committed
      Jarod Wilson says:
      
      ====================
      net: add and use rx_nohandler stat counter
      
      The network core tries to keep track of dropped packets, but some packets
      you wouldn't really call dropped, so much as intentionally ignored, under
      certain circumstances. One such case is that of bonding and team device
      slaves that are currently inactive. Their respective rx_handler functions
      return RX_HANDLER_EXACT (the only places in the kernel that return that),
      which ends up tracking into the network core's __netif_receive_skb_core()
      function's drop path, with no pt_prev set. On a noisy network, this can
      result in a very rapidly incrementing rx_dropped counter, not only on the
      inactive slave(s), but also on the master device, such as the following:
      
      $ cat /proc/net/dev
      Inter-|   Receive                                                |  Transmit
       face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
        p7p1: 14783346  140430    0 140428    0     0          0      2040      680       8    0    0    0     0       0          0
        p7p2: 14805198  140648    0    0    0     0          0      2034        0       0    0    0    0     0       0          0
       bond0: 53365248  532798    0 421160    0     0          0    115151     2040      24    0    0    0     0       0          0
          lo:    5420      54    0    0    0     0          0         0     5420      54    0    0    0     0       0          0
        p5p1: 19292195  196197    0 140368    0     0          0     56564      680       8    0    0    0     0       0          0
        p5p2: 19289707  196171    0 140364    0     0          0     56547      680       8    0    0    0     0       0          0
         em3: 20996626  158214    0    0    0     0          0       383        0       0    0    0    0     0       0          0
         em2: 14065122  138462    0    0    0     0          0       310        0       0    0    0    0     0       0          0
         em1: 14063162  138440    0    0    0     0          0       308        0       0    0    0    0     0       0          0
         em4: 21050830  158729    0    0    0     0          0       385    71662     469    0    0    0     0       0          0
         ib0:       0       0    0    0    0     0          0         0        0       0    0    0    0     0       0          0
      
      In this scenario, p5p1, p5p2 and p7p1 are all inactive slaves in an
      active-backup bond0, and you can see that all three have high drop counts,
      with the master bond0 showing a tally of all three.
      
      I know that this was discussed previously here:
      
          http://www.spinics.net/lists/netdev/msg226341.html
      
      It seems additional counters never came to fruition, so this is a first
      attempt at creating one of them, so that we stop calling these drops,
      which for users monitoring rx_dropped, causes great alarm, and renders the
      counter much less useful for them.
      
      This adds a sysfs statistics node and makes the counter available via
      netlink.
      
      Additionally, I'm not certain if this set qualifies for net, or if it
      should be put aside and resubmitted for net-next after 4.5 is put to
      bed, but I do have users who consider this an important bugfix.
      
      This has been tested quite a bit on x86_64, and now lightly on i686 as
      well, to verify functionality of updates to netdev_stats_to_stats64()
      on 32-bit arches.
      ====================
      Signed-off-by: Jarod Wilson <jarod@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      df03288b
    • J
      bond: track sum of rx_nohandler for all slaves · f344b0d9
      Jarod Wilson committed
      Sample output with this set applied for an active-backup bond:
      
      $ cat /sys/devices/virtual/net/bond0/lower_p7p1/statistics/rx_nohandler
      16568
      $ cat /sys/devices/virtual/net/bond0/lower_p5p2/statistics/rx_nohandler
      16583
      $ cat /sys/devices/virtual/net/bond0/statistics/rx_nohandler
      33151
      
      CC: Jay Vosburgh <j.vosburgh@gmail.com>
      CC: Veaceslav Falico <vfalico@gmail.com>
      CC: Andy Gospodarek <gospo@cumulusnetworks.com>
      CC: netdev@vger.kernel.org
      Signed-off-by: Jarod Wilson <jarod@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f344b0d9
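The sample output above is consistent with the master reporting the sum of its slaves' counters (16568 + 16583 = 33151). A minimal sketch of that aggregation, assuming a hypothetical `struct slave_counter` and helper `master_rx_nohandler` that are not part of the bonding driver:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-slave record; the real driver reads per-device stats. */
struct slave_counter {
	const char *name;
	uint64_t    rx_nohandler;
};

/* The master's rx_nohandler is modelled as the sum over all of its slaves. */
static uint64_t master_rx_nohandler(const struct slave_counter *slaves, size_t n)
{
	uint64_t total = 0;
	size_t i;

	assert(slaves != NULL || n == 0);
	for (i = 0; i < n; i++)
		total += slaves[i].rx_nohandler;
	return total;
}
```

Feeding in the two slave values from the sample output reproduces the bond0 figure shown there.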
    • J
      team: track sum of rx_nohandler for all slaves · bb63daf9
      Jarod Wilson committed
      CC: Jiri Pirko <jiri@resnulli.us>
      CC: netdev@vger.kernel.org
      Signed-off-by: Jarod Wilson <jarod@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      bb63daf9
    • J
      net: add rx_nohandler stat counter · 6e7333d3
      Jarod Wilson committed
      This adds an rx_nohandler stat counter, along with a sysfs statistics
      node, and copies the counter out via netlink as well.
      
      CC: "David S. Miller" <davem@davemloft.net>
      CC: Eric Dumazet <edumazet@google.com>
      CC: Jiri Pirko <jiri@mellanox.com>
      CC: Daniel Borkmann <daniel@iogearbox.net>
      CC: Tom Herbert <tom@herbertland.com>
      CC: Jay Vosburgh <j.vosburgh@gmail.com>
      CC: Veaceslav Falico <vfalico@gmail.com>
      CC: Andy Gospodarek <gospo@cumulusnetworks.com>
      CC: netdev@vger.kernel.org
      Signed-off-by: Jarod Wilson <jarod@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6e7333d3
    • J
      net/core: relax BUILD_BUG_ON in netdev_stats_to_stats64 · 9256645a
      Jarod Wilson committed
      The netdev_stats_to_stats64 function copies the deprecated
      net_device_stats format stats into rtnl_link_stats64 for legacy support
      purposes, but with the BUILD_BUG_ON as it was, it wasn't possible to
      extend rtnl_link_stats64 without also extending net_device_stats. Relax
      the BUILD_BUG_ON to only require that rtnl_link_stats64 is larger, and
      zero out all the stat counters that aren't present in net_device_stats.
      
      CC: Eric Dumazet <edumazet@google.com>
      CC: netdev@vger.kernel.org
      Signed-off-by: Jarod Wilson <jarod@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9256645a
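The relaxed check can be illustrated with two small stand-in structures. This is a sketch under stated assumptions, not the kernel function: `struct legacy_stats` and `struct stats64` are hypothetical miniatures of net_device_stats and rtnl_link_stats64, and C11 `_Static_assert` plays the role of BUILD_BUG_ON. The build-time check only requires that the 64-bit structure has at least as many counters as the legacy one, and the counters with no legacy source are zeroed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical miniature of the legacy structure: all 'unsigned long'. */
struct legacy_stats {
	unsigned long rx_packets;
	unsigned long tx_packets;
	unsigned long rx_dropped;
};

/* Hypothetical miniature of the 64-bit structure, extended by one counter. */
struct stats64 {
	uint64_t rx_packets;
	uint64_t tx_packets;
	uint64_t rx_dropped;
	uint64_t rx_nohandler; /* has no counterpart in legacy_stats */
};

/* Relaxed check: the 64-bit structure may be larger, but never smaller. */
_Static_assert(sizeof(struct legacy_stats) / sizeof(unsigned long) <=
	       sizeof(struct stats64) / sizeof(uint64_t),
	       "stats64 must hold every legacy counter");

static void stats_to_stats64(struct stats64 *dst, const struct legacy_stats *src)
{
	const unsigned long *in = (const unsigned long *)src;
	uint64_t *out = (uint64_t *)dst;
	size_t n = sizeof(*src) / sizeof(unsigned long);
	size_t i;

	/* Widen each legacy counter into the matching 64-bit slot. */
	for (i = 0; i < n; i++)
		out[i] = in[i];
	/* Zero the counters that the legacy structure does not provide. */
	memset(out + n, 0, sizeof(*dst) - n * sizeof(uint64_t));
}
```

The sketch depends on both structures listing their shared counters in the same order, since the copy is positional rather than by field name.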
    • R
      tipc: fix link priority propagation · 81729810
      Richard Alpe committed
      Currently, link priority changes aren't handled for active links. In
      this patch we resolve this by changing our priority if the peer passes
      a valid priority in a state message.
      Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: Richard Alpe <richard.alpe@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      81729810
    • R
      tipc: fix link attribute propagation bug · d01332f1
      Richard Alpe committed
      Changing certain link attributes (link tolerance and link priority)
      from the TIPC management tool is supposed to automatically take
      effect at both endpoints of the affected link.
      
      Currently the media address is not instantiated for the link and is
      used uninstantiated when crafting protocol messages designated for the
      peer endpoint. This means that changing a link property currently
      results in the property being changed on the local machine, but the
      protocol message designated for the peer gets lost, resulting in a
      property discrepancy between the endpoints.
      
      In this patch we resolve this by using the media address from the
      link entry and using the bearer transmit function to send it. Hence,
      we can now eliminate the redundant function tipc_link_prot_xmit() and
      the redundant field tipc_link::media_addr.
      
      Fixes: 2af5ae37 (tipc: clean up unused code and structures)
      Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
      Reported-by: Jason Hu <huzhijiang@gmail.com>
      Signed-off-by: Richard Alpe <richard.alpe@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d01332f1
    • D
      Merge branch '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue · 6247fd9f
      David S. Miller committed
      Jeff Kirsher says:
      
      ====================
      40GbE Intel Wired LAN Driver Updates 2016-02-03
      
      This series contains updates to i40e and i40evf only.
      
      Kiran adds the MAC filter element to the end of the list instead of HEAD
      just in case there are ever any ordering issues in the future.
      
      Anjali fixes several RSS issues. First, she fixes the hash PCTYPE enable
      for X722, since it supports a broader selection of PCTYPES for TCP and
      UDP. She then fixes a bug in XL710, X710, and X722 support for RSS: we
      cannot reduce the 4-tuple for RSS for TCP/IPv4/IPv6 or UDP/IPv4/IPv6
      packets, since this requires a product feature change coming in a later
      release. She cleans up the reset code where the restart-autoneg
      workaround is applied; since X722 does not need the workaround, she adds
      a flag to indicate which MAC and firmware version require it. She also
      adds new device IDs for X722 and the code to support them, plus another
      way to access the RSS keys and lookup table using the admin queue for
      X722 devices.
      
      Catherine updates the driver to replace the MAC check with a feature
      flag check for 100M SGMII, since it is only supported on X722 devices
      currently.
      
      Mitch reworks the VF driver to allow channel bonding, which was not
      possible before this patch due to the asynchronous nature of the admin
      queue mechanism.  He also fixes a rare case that causes a panic if the
      VF driver is removed during reset recovery, by setting the ring
      pointers to NULL after freeing them.
      
      Shannon cleans up the driver where device capabilities were defined in
      two different places, and neither had all the definitions, so he
      consolidates the definitions in the admin queue API.  Also adds the new
      proxy-wake-on-lan capability bit available with the new X722 device.
      Lastly, added the new External Device Power Ability field to the
      get_link_status data structure by using a reserved field at the end
      of the structure.
      
      Jesse mimics the ixgbe driver's use of a private work queue in the i40e
      and i40evf drivers to avoid blocking the system work queue.
      
      Greg cleans up the driver so that the firmware revision checks used to
      properly handle DCB configurations from the firmware are limited to
      the older devices which need these checks (specifically X710 and XL710
      devices only).
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6247fd9f
  2. 05 Feb, 2016 1 commit
    • M
      ipvlan: inherit MTU from master device · 296d4856
      Mahesh Bandewar committed
      When we create an IPvlan slave, we use ether_setup(), which
      sets the default MTU to 1500, while the master device may have a
      lower / different MTU. Any subsequent changes to the master's
      MTU are reflected in the slaves' MTU setting. However, if those
      don't happen (the most likely scenario), the slaves' MTU stays at
      1500, which could be bad.
      
      This change adds code to inherit MTU from the master device
      instead of using the default value during the link initialization
      phase.
      Signed-off-by: Mahesh Bandewar <maheshb@google.com>
      CC: Eric Dumazet <eric.dumazet@gmail.com>
      CC: Tim Hockins <thockins@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      296d4856
  3. 04 Feb, 2016 20 commits
  4. 02 Feb, 2016 4 commits
    • D
      b45efa30
    • L
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · 34229b27
      Linus Torvalds committed
      Pull networking fixes from David Miller:
       "This looks like a lot but it's a mixture of regression fixes as well
        as fixes for longer standing issues.
      
         1) Fix on-channel cancellation in mac80211, from Johannes Berg.
      
         2) Handle CHECKSUM_COMPLETE properly in xt_TCPMSS netfilter xtables
            module, from Eric Dumazet.
      
         3) Avoid infinite loop in UDP SO_REUSEPORT logic, also from Eric
            Dumazet.
      
         4) Avoid a NULL deref if we try to set SO_REUSEPORT after a socket is
            bound, from Craig Gallek.
      
         5) GRO key comparisons don't take lightweight tunnels into account,
            from Jesse Gross.
      
         6) Fix struct pid leak via SCM credentials in AF_UNIX, from Eric
            Dumazet.
      
         7) We need to set the rtnl_link_ops of ipv6 SIT tunnels before we
            register them, otherwise the NEWLINK netlink message is missing
            the proper attributes.  From Thadeu Lima de Souza Cascardo.
      
         8) Several Spectrum chip bug fixes for mlxsw switch driver, from Ido
            Schimmel.
      
         9) Handle fragments properly in ipv4 early socket demux, from Eric
            Dumazet.
      
        10) Don't ignore the ifindex key specifier on ipv6 output route
            lookups, from Paolo Abeni"
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (128 commits)
        tcp: avoid cwnd undo after receiving ECN
        irda: fix a potential use-after-free in ircomm_param_request
        net: tg3: avoid uninitialized variable warning
        net: nb8800: avoid uninitialized variable warning
        net: vxge: avoid unused function warnings
        net: bgmac: clarify CONFIG_BCMA dependency
        net: hp100: remove unnecessary #ifdefs
        net: davinci_cpdma: use dma_addr_t for DMA address
        ipv6/udp: use sticky pktinfo egress ifindex on connect()
        ipv6: enforce flowi6_oif usage in ip6_dst_lookup_tail()
        netlink: not trim skb for mmaped socket when dump
        vxlan: fix a out of bounds access in __vxlan_find_mac
        net: dsa: mv88e6xxx: fix port VLAN maps
        fib_trie: Fix shift by 32 in fib_table_lookup
        net: moxart: use correct accessors for DMA memory
        ipv4: ipconfig: avoid unused ic_proto_used symbol
        bnxt_en: Fix crash in bnxt_free_tx_skbs() during tx timeout.
        bnxt_en: Exclude rx_drop_pkts hw counter from the stack's rx_dropped counter.
        bnxt_en: Ring free response from close path should use completion ring
        net_sched: drr: check for NULL pointer in drr_dequeue
        ...
      34229b27
    • L
      Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 · 2c923414
      Linus Torvalds committed
      Pull crypto fixes from Herbert Xu:
       "This fixes the following issues:
      
        API:
         - algif_hash needs to wait for init operations to complete.
         - The has_key setting for shash was always true.
      
        Algorithms:
         - Add missing selections of CRYPTO_HASH.
         - Fix pkcs7 authentication.
      
        Drivers:
         - Fix stack alignment bug in chacha20-ssse3.
         - Fix performance regression in caam due to incorrect setting.
         - Fix potential compile-only build failure of stm32"
      
      * 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
        crypto: atmel-aes - remove calls of clk_prepare() from atomic contexts
        crypto: algif_hash - wait for crypto_ahash_init() to complete
        crypto: shash - Fix has_key setting
        hwrng: stm32 - Fix dependencies for !HAS_IOMEM archs
        crypto: ghash,poly1305 - select CRYPTO_HASH where needed
        crypto: chacha20-ssse3 - Align stack pointer to 64 bytes
        PKCS#7: Don't require SpcSpOpusInfo in Authenticode pkcs7 signatures
        crypto: caam - make write transactions bufferable on PPC platforms
      2c923414
    • L
      Merge branch 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm · 29a8ea4f
      Linus Torvalds committed
      Pull libnvdimm fixes from Dan Williams:
       "1/ Fixes to the libnvdimm 'pfn' device that establishes a reserved
           area for storing a struct page array.
      
        2/ Fixes for dax operations on a raw block device to prevent pagecache
           collisions with dax mappings.
      
        3/ A fix for pfn_t usage in vm_insert_mixed that led to a NULL
           pointer dereference.
      
        These have received build success notification from the kbuild robot
        across 153 configs and pass the latest ndctl tests"
      
      * 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
        phys_to_pfn_t: use phys_addr_t
        mm: fix pfn_t to page conversion in vm_insert_mixed
        block: use DAX for partition table reads
        block: revert runtime dax control of the raw block device
        fs, block: force direct-I/O for dax-enabled block devices
        devm_memremap_pages: fix vmem_altmap lifetime + alignment handling
        libnvdimm, pfn: fix restoring memmap location
        libnvdimm: fix mode determination for e820 devices
      29a8ea4f