1. 03 Jul, 2014 13 commits
  2. 02 Jul, 2014 27 commits
    • E
      inet: move ipv6only in sock_common · 9fe516ba
      Eric Dumazet committed
      When a UDP application switches from AF_INET to AF_INET6 sockets, we
      have a small performance degradation for IPv4 communications because of
      extra cache line misses to access ipv6only information.
      
      This can also be noticed for TCP listeners, as ipv6_only_sock() is also
      used from __inet_lookup_listener()->compute_score()
      
      This is magnified when SO_REUSEPORT is used.
      
      Move ipv6only into struct sock_common so that it is available at
      no extra cost in lookups.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9fe516ba
    • D
      Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-next · 090cce42
      David S. Miller committed
      Jeff Kirsher says:
      
      ====================
      Intel Wired LAN Driver Updates 2014-07-01
      
      This series contains updates to i40e, i40evf, igb and ixgbe.
      
      Shannon adds the Base Address High and Low to the admin queue structure
      to simplify the logic in the configuration routines.  Also adds code to
      clear all queues and interrupts to help clean up after a PXE or other
      early boot activity.
      
      Kevin fixes mask assignment value since -1 cannot be used for unsigned
      integer types.
      
      Mitch fixes an issue where in some circumstances the reply from the PF
      would come back before we were able to properly modify the admin queue
      pending and required flags.  This would mess up the flags and put the
      driver in an indeterminate state, so fix this by simply setting the flags
      before sending the request to the admin queue.  Also changes the branding
      string for i40evf to reduce confusion and to match up with our other
      marketing materials.
      
      Kamil adds a new variable defining admin send queue (ASQ) command write
      back timeout to allow for dynamic modification of this timeout.
      
      Anjali fixes a bug in the flow director filter replay logic, so that we
      call a replay after a sideband reset correctly.
      
      Jesse adds code to initialize all members of the context descriptor to
      prevent possible stale data.
      
      Christopher fixes i40e to prevent writing to reserved bits, since the
      queue index is only 0-127.
      
      Jacob removes the unneeded header export.h from the i40e PTP code.
      He also fixes the ixgbe PTP code, where the PPS signal was not correct:
      it generated a one-half Hz clock signal, that is, only one level
      change per second.  To generate a full clock, we need two level changes
      per second.
      
      Todd provides a fix for igb to bring up link when the PHY has powered
      up, which was reported by Jeff Westfahl.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      090cce42
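The PPS arithmetic in Jacob's ixgbe fix above can be checked with a small sketch. It assumes an ideal square wave, which toggles twice per period (one rising and one falling edge); the function name `level_changes_per_second` is hypothetical and not taken from the driver.

```python
from fractions import Fraction


def level_changes_per_second(frequency_hz):
    """Return the number of level changes per second of an ideal square wave.

    Assumption: each period contains exactly two level changes
    (one rising edge and one falling edge).
    """
    return 2 * Fraction(frequency_hz)


# A one-half Hz clock only changes level once per second ...
half_hz_changes = level_changes_per_second(Fraction(1, 2))
# ... while a full 1 Hz clock needs two level changes per second.
full_hz_changes = level_changes_per_second(1)

print(half_hz_changes, full_hz_changes)
```

Under that assumption, doubling the toggle rate is what turns the half-rate signal into a full one-pulse-per-second clock.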
    • J
      bonding: allow to add vlans on top of empty bond · 763e0ecd
      Jiri Pirko committed
      This limitation may have had a reason in the past, but there is none
      now, so remove it.
      Signed-off-by: Jiri Pirko <jiri@resnulli.us>
      Acked-by: Veaceslav Falico <vfalico@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      763e0ecd
    • D
      Merge branch 'cxgb4-next' · 813f8e29
      David S. Miller committed
      Hariprasad Shenai says:
      
      ====================
      cxgb4: Fix for PCI passthrough and some Misc. fixes
      
      This patch series fixes a probe failure in a VM when the PF is exposed through
      PCI Passthrough. It adds support for using the firmware interface to get the
      BAR0 value, and replaces the backdoor mechanism for accessing the HW memory
      with the PCIe Window method, which fixes memory I/O. It also adds device IDs
      of a few more adapters for the cxgb4 and cxgb4vf drivers.
      
      The patch series is created against the 'net-next' tree and includes
      patches for the cxgb4, cxgb4vf and iw_cxgb4 drivers.
      
      Since this patch-series contains mainly cxgb4 related changes, we would like to
      request this patch series to get merged via David Miller's 'net-next' tree.
      
      We have included all the maintainers of respective drivers. Kindly review the
      change and let us know in case of any review comments.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      813f8e29
    • H
    • H
    • H
      cxgb4: Replaced the backdoor mechanism to access the HW memory with PCIe Window method · fc5ab020
      Hariprasad Shenai committed
      Rip out a bunch of redundant PCI-E Memory Window Read/Write routines,
      collapse the more general purpose routines into a single routine
      thereby eliminating the need for a large stack frame (and extra data
      copying) in the outer routine, change everything to use the improved
      routine t4_memory_rw.
      
      Based on original work by Casey Leedom <leedom@chelsio.com> and
      Steve Wise <swise@opengridcomputing.com>
      Signed-off-by: Casey Leedom <leedom@chelsio.com>
      Signed-off-by: Steve Wise <swise@opengridcomputing.com>
      Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      fc5ab020
    • H
      cxgb4: Use FW interface to get BAR0 value · 0abfd152
      Hariprasad Shenai committed
      Use the firmware interface to get the BAR0 value since we really don't want
      to use the PCI-E Configuration Space Backdoor access which is owned by the
      firmware.
      
      Set up PCI-E Memory Window registers using the true values programmed into
      BAR registers.  When the PF4 "Master Function" is exported to a Virtual
      Machine, the values returned by pci_resource_start() will be for the
      synthetic PCI-E Configuration Space and not the real addresses. But we need
      to program the PCI-E Memory Window address decoders with the real addresses
      that we're going to be using in order to have accesses through the Memory
      Windows work.
      
      Based on original work by Casey Leedom <leedom@chelsio.com>
      Signed-off-by: Casey Leedom <leedom@chelsio.com>
      Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0abfd152
    • H
      rdma/cxgb4: Fixes cxgb4 probe failure in VM when PF is exposed through PCI Passthrough · 35b1de55
      Hariprasad Shenai committed
      Change logic which determines our Physical Function at PCI Probe time.
      Now we read the PL_WHOAMI register and get the Physical Function.
      
      Pass Physical Function to Upper Layer Drivers in lld_info structure in the
      new field "pf" added to lld_info.  This is useful for the cases where the
      PF, say PF4, is attached to a Virtual Machine via some form of "PCI
      Pass Through" technology and the PCI Function shows up as PF0 in the VM.
      
      Based on original work by Casey Leedom <leedom@chelsio.com>
      Signed-off-by: Casey Leedom <leedom@chelsio.com>
      Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      35b1de55
    • D
      Merge branch 'dp83640-next' · 2eb27a16
      David S. Miller committed
      Stefan Sørensen says:
      
      ====================
      dp83640: Increase supported perout pins
      
      This patch series increases the number of periodic output pins supported
      on the dp83640 to 7, and allows for reprogramming the calibration pin.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2eb27a16
    • S
      ptp: Allow reassigning calibration pin function · 72df7a72
      Stefan Sørensen committed
      The ptp pin function programming does not allow the calibration pin to change
      function. This is problematic on hardware that uses the default calibration
      pin for other purposes.
      
      Removing this limitation does not impact calibration if userspace does not
      reprogram the calibration pin.
      Signed-off-by: Stefan Sørensen <stefan.sorensen@spectralink.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      72df7a72
    • S
      dp83640: Get calibration pin with ptp_find_pin · e0155950
      Stefan Sørensen committed
      For consistency, use the ptp_find_pin function to get the calibration pin,
      not gpio_tab.
      Signed-off-by: Stefan Sørensen <stefan.sorensen@spectralink.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e0155950
    • S
      dp83640: Verify calibration pin assignment · 6f39eb87
      Stefan Sørensen committed
      This constrains the pin assignment so that the calibration function cannot
      be reassigned, and allows reassigning the calibration pin only if a single
      PHY is connected.
      Signed-off-by: Stefan Sørensen <stefan.sorensen@spectralink.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6f39eb87
    • S
      dp83640: Increase supported perout pins to 7 · ad01577a
      Stefan Sørensen committed
      This patch increases the number of supported periodic output pins from
      1 to 7. The last pin is reserved for sync.
      Signed-off-by: Stefan Sørensen <stefan.sorensen@spectralink.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ad01577a
    • S
      dp83640: Program pulsewidth2 values of perout triggers 0 and 1 · 35e872ae
      Stefan Sørensen committed
      Periodic output triggers 0 and 1 of the dp83640 have a programmable
      duty cycle, which is controlled by the Pulsewidth2 field of the trigger
      data register.  This field is not documented in the datasheet, but it
      is described in the "PHYTER Software Development Guide" section
      3.1.4.1. Failing to set the field can result in unstable/no trigger
      output.
      
      Add programming of the Pulsewidth2 field, setting it to the same value
      as the Pulsewidth field for a 50% duty cycle.
      Signed-off-by: Stefan Sørensen <stefan.sorensen@spectralink.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      35e872ae
    • D
      Merge branch 'bnx2x-next' · b6fd8b7f
      David S. Miller committed
      Yuval Mintz says:
      
      ====================
      bnx2x: Enhancement patch series
      
      This patch series introduces the ability to propagate link parameters
      to VFs as well as control the VF link via hypervisor.
      
      In addition, it contains 2 small improvements [one IOV-related and the
      other improves performance on machines with short cache lines].
      
      Please consider applying these patches to `net-next'.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b6fd8b7f
    • Y
      bnx2x: Fail probe of VFs using an old incompatible driver · ebf457f9
      Yuval Mintz committed
      There are Linux distributions where the inbox bnx2x driver contains SRIOV
      support but doesn't contain the changes introduced in b9871bcf
      "bnx2x: VF RSS support - PF side".
      
      A VF in a VM running that distribution over a new hypervisor will access
      incorrect addresses when trying to transmit packets, causing an attention
      in the hypervisor and making that VF inactive until FLRed.
      
      The driver in the VM has to be upgraded [there is no real way to overcome
      this], but because of the HW attention that currently arises, upgrading the
      driver in the VM would not suffice [since the VF also needs to be FLRed if
      the previous driver was already loaded].
      
      This patch causes the PF to fail the acquire message from a VF running an
      old problematic driver; the VF will then gracefully fail its probe, preventing
      the HW attention [and allowing a clean upgrade of the driver in the VM].
      Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com>
      Signed-off-by: Ariel Elior <Ariel.Elior@qlogic.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ebf457f9
    • D
      bnx2x: enlarge minimal alignment of data offset · 9927b514
      Dmitry Kravkov committed
      This improves the performance of the driver on machines whose L1 cache line
      (1 << L1_CACHE_SHIFT) is at most 32 bytes [HW was planned for 64-byte
      aligned fastpath data].
      Signed-off-by: Dmitry Kravkov <Dmitry.Kravkov@qlogic.com>
      Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com>
      Signed-off-by: Ariel Elior <Ariel.Elior@qlogic.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9927b514
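The alignment idea in the bnx2x commit above can be illustrated with a rough sketch. It is not the driver's code: the helper names `rx_align_shift` and `align_up`, and the choice of a plain lower bound of 6 (64 bytes), are assumptions made for illustration only, based on the commit's statement that the hardware was planned for 64-byte aligned fastpath data.

```python
def rx_align_shift(l1_cache_shift, min_shift=6):
    """Pick an alignment shift that never drops below 64 bytes (1 << 6).

    Hypothetical helper: min_shift=6 reflects the 64-byte alignment the
    commit message says the hardware was planned for.
    """
    return max(min_shift, l1_cache_shift)


def align_up(offset, shift):
    """Round offset up to the next multiple of (1 << shift)."""
    align = 1 << shift
    return (offset + align - 1) & ~(align - 1)


# On a machine with 32-byte cache lines (shift 5), the data offset is
# still aligned to 64 bytes rather than 32.
shift = rx_align_shift(5)
print(1 << shift, align_up(70, shift))
```

On machines whose cache lines are already 64 bytes or larger, this sketch leaves the alignment unchanged, which matches the commit's claim that only short-cache-line machines are affected.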
    • D
      bnx2x: VF can report link speed · 6495d15a
      Dmitry Kravkov committed
      Until now VFs were oblivious to the actual configured link parameters.
      This patch does 2 things:
      
        1. It enables a PF to inform its VF using the bulletin board of the link
           configured, and allows the VF to present that information.
      
        2. It adds support of `ndo_set_vf_link_state', allowing the hypervisor
           to set the VF link state.
      Signed-off-by: Dmitry Kravkov <Dmitry.Kravkov@qlogic.com>
      Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com>
      Signed-off-by: Ariel Elior <Ariel.Elior@qlogic.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6495d15a
    • D
      Merge branch 'pktgen' · edd79ca8
      David S. Miller committed
      Jesper Dangaard Brouer says:
      
      ====================
      Optimizing pktgen for single CPU performance
      
      This series focuses on optimizing "pktgen" for single CPU performance.
      
      V2-series:
       - Removed some patches
       - Doc real reason for TX ring buffer filling up
      
      NIC tuning for pktgen:
       http://netoptimizer.blogspot.dk/2014/06/pktgen-for-network-overload-testing.html
      
      General overload setup according to:
       http://netoptimizer.blogspot.dk/2014/04/basic-tuning-for-network-overload.html
      
      Hardware:
       System: CPU E5-2630
       NIC: Intel ixgbe/82599 chip
      
      Testing done with net-next git tree on top of
       commit 6623b419 ("Merge branch 'master' of...jkirsher/net-next")
      
      Pktgen script exercising race condition:
       https://github.com/netoptimizer/network-testing/blob/master/pktgen/unit_test01_race_add_rem_device_loop.sh
      
      Tool for measuring LOCK overhead:
       https://github.com/netoptimizer/network-testing/blob/master/src/overhead_cmpxchg.c
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      edd79ca8
    • J
      pktgen: RCU-ify "if_list" to remove lock in next_to_run() · 8788370a
      Jesper Dangaard Brouer committed
      The if_lock()/if_unlock() in next_to_run() adds a significant
      overhead, because it's called for every packet in the busy loop of
      pktgen_thread_worker().  (Thomas Graf originally pointed me
      at this lock problem).
      
      Removing these two "LOCK" operations should in theory save us approx
      16ns (8ns x 2).  As illustrated below, we do save 16ns when removing
      the locks and introducing RCU protection.
      
      Performance data with CLONE_SKB==100000, TX-size=512, rx-usecs=30:
       (single CPU performance, ixgbe 10Gbit/s, E5-2630)
       * Prev   : 5684009 pps --> 175.93ns (1/5684009*10^9)
       * RCU-fix: 6272204 pps --> 159.43ns (1/6272204*10^9)
       * Diff   : +588195 pps --> -16.50ns
      
      To understand this RCU patch, I describe the pktgen thread model
      below.
      
      In pktgen there are several kernel threads, but only one CPU runs
      each kernel thread.  Communication with the kernel threads is
      done through some thread control flags.  This allows the thread to
      change data structures at a known synchronization point, see the main
      thread func pktgen_thread_worker().
      
      Userspace changes are communicated through proc-file writes.  There
      are three types of changes, general control changes "pgctrl"
      (func:pgctrl_write), thread changes "kpktgend_X"
      (func:pktgen_thread_write), and interface config changes "ethX@N"
      (func:pktgen_if_write).
      
      Userspace "pgctrl" and "thread" changes are synchronized via the mutex
      pktgen_thread_lock, thus only a single userspace instance can run.
      The mutex is taken while the packet generator is running, by pgctrl
      "start".  Thus e.g. "add_device" cannot be invoked when pktgen is
      running/started.
      
      All "pgctrl" and all "thread" changes, except thread "add_device",
      communicate via the thread control flags.  The main problem is the
      exception "add_device", which modifies the thread's "if_list" directly.
      
      Fortunately "add_device" cannot be invoked while pktgen is running.
      But there exists a race between "rem_device_all" and "add_device"
      (which normally doesn't occur, because "rem_device_all" waits 125ms
      before returning). Background'ing "rem_device_all" and running
      "add_device" immediately allows the race to occur.
      
      The race affects the thread's (list of devices) "if_list".  The if_lock
      is used for protecting this "if_list".  Other readers are given
      lock-free access to the list under RCU read sections.
      
      Note, interface config changes (via proc) can occur while pktgen is
      running, which worries me a bit.  I'm assuming proc_remove() takes
      appropriate locks, to ensure no writers exist after proc_remove()
      finishes.
      
      I've been running a script exercising the race condition (leading me
      to fix the proc_remove order), without any issues.  The script also
      exercises concurrent proc writes, while the interface config is
      getting removed.
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Reviewed-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8788370a
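The per-packet figures quoted in the RCU-ify commit above follow from inverting the packet rate (1/pps * 10^9). A minimal sketch of that conversion, using only the pps values given in the commit message; the function name `ns_per_packet` is an illustrative choice:

```python
def ns_per_packet(pps):
    """Convert a packets-per-second rate into nanoseconds spent per packet."""
    return 1e9 / pps


prev_pps = 5684009  # "Prev" rate from the commit message
rcu_pps = 6272204   # "RCU-fix" rate from the commit message

prev_ns = ns_per_packet(prev_pps)
rcu_ns = ns_per_packet(rcu_pps)

# Difference in rate and in per-packet cost.
diff_pps = rcu_pps - prev_pps
diff_ns = rcu_ns - prev_ns

print(f"{prev_ns:.2f} ns, {rcu_ns:.2f} ns, {diff_pps:+d} pps, {diff_ns:+.2f} ns")
```

The computed difference of roughly 16.5 ns per packet is consistent with the commit's estimate of two LOCK operations at about 8 ns each.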
    • J
      pktgen: avoid expensive set_current_state() call in loop · baac167b
      Jesper Dangaard Brouer committed
      Avoid calling set_current_state() inside the busy-loop in
      pktgen_thread_worker().  When pkt_dev->delay is set, it is still
      used/enabled in pktgen_xmit() via the spin() call.
      
      The set_current_state(TASK_INTERRUPTIBLE) uses an xchg, which is
      implicitly LOCK prefixed.  I've measured the asm LOCK operation to take approx
      8ns on this E5-2630 CPU.  The performance increase correlates with this
      measurement.
      
      Performance data with CLONE_SKB==100000, rx-usecs=30:
       (single CPU performance, ixgbe 10Gbit/s, E5-2630)
       * Prev:  5454050 pps --> 183.35ns (1/5454050*10^9)
       * Now:   5684009 pps --> 175.93ns (1/5684009*10^9)
       * Diff:  +229959 pps -->  -7.42ns
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      baac167b
    • J
      pktgen: document tuning for max NIC performance · 9ceb87fc
      Jesper Dangaard Brouer committed
      Using pktgen I'm seeing the ixgbe driver "push-back", due to the TX ring
      running full.  Thus, the TX ring is artificially limiting pktgen.
      (Diagnose via "ethtool -S", look for "tx_restart_queue" or "tx_busy"
      counters.)
      
      With ixgbe, the real reason the TX ring runs full is that the TX ring
      is not being cleaned up fast enough. The ixgbe driver combines
      TX+RX ring cleanups, and the cleanup interval is affected by the
      ethtool --coalesce setting of parameter "rx-usecs".
      
      Do not increase the default NIC TX ring buffer or default cleanup
      interval.  Instead simply document that pktgen needs special NIC
      tuning for maximum packet per sec performance.
      
      Performance results with pktgen with clone_skb=100000.
      TX ring size 512 (default), adjusting "rx-usecs":
       (Single CPU performance, E5-2630, ixgbe)
       - 3935002 pps - rx-usecs:  1 (irqs:  9346)
       - 5132350 pps - rx-usecs: 10 (irqs: 99157)
       - 5375111 pps - rx-usecs: 20 (irqs: 50154)
       - 5454050 pps - rx-usecs: 30 (irqs: 33872)
       - 5496320 pps - rx-usecs: 40 (irqs: 26197)
       - 5502510 pps - rx-usecs: 50 (irqs: 21527)
      
      TX ring size adjusting (ethtool -G), "rx-usecs==1" (default):
       - 3935002 pps - tx-size:  512
       - 5354401 pps - tx-size:  768
       - 5356847 pps - tx-size: 1024
       - 5327595 pps - tx-size: 1536
       - 5356779 pps - tx-size: 2048
       - 5353438 pps - tx-size: 4096
      
      Notice after commit 6f25cd47 (pktgen: fix xmit test for BQL enabled
      devices) pktgen uses netif_xmit_frozen_or_drv_stopped() and ignores
      the BQL "stack" pause (QUEUE_STATE_STACK_XOFF) flag.  This allows us to put
      more pressure on the TX ring buffers.
      
      It is the ixgbe_maybe_stop_tx() call that stops the transmits, and
      pktgen respects this in the call to netif_xmit_frozen_or_drv_stopped(txq).
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9ceb87fc
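The diagnosis step in the tuning commit above (checking "ethtool -S" for "tx_restart_queue" or "tx_busy") could be scripted roughly as follows. This is a sketch that parses already-captured `ethtool -S` text in its usual `name: value` layout; the helper names and the sample counter values are hypothetical and are not measurements from the commit.

```python
def parse_ethtool_stats(text):
    """Parse 'name: value' lines from `ethtool -S` output into a dict."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.strip().rpartition(":")
        if not sep:
            continue  # not a 'name: value' line
        value = value.strip()
        if value.isdigit():  # skips the 'NIC statistics:' header line
            stats[name.strip()] = int(value)
    return stats


def tx_ring_pressure(stats, counters=("tx_restart_queue", "tx_busy")):
    """Return the non-zero counters that hint at the TX ring running full."""
    return {name: stats[name] for name in counters if stats.get(name, 0) > 0}


# Hypothetical sample output, for illustration only.
SAMPLE = """NIC statistics:
     rx_packets: 12
     tx_packets: 5454050
     tx_restart_queue: 1731
     tx_busy: 0
"""

print(tx_ring_pressure(parse_ethtool_stats(SAMPLE)))
```

A non-empty result suggests the driver is pushing back, in which case the commit recommends tuning "rx-usecs" or the TX ring size rather than changing the kernel defaults.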
    • J
      openvswitch: introduce rtnl ops stub · 5b9e7e16
      Jiri Pirko committed
      This stub now allows userspace to see IFLA_INFO_KIND for ovs master and
      IFLA_INFO_SLAVE_KIND for slave.
      Signed-off-by: Jiri Pirko <jiri@resnulli.us>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5b9e7e16
    • J
      rtnetlink: allow to register ops without ops->setup set · b0ab2fab
      Jiri Pirko committed
      So far, it has been assumed that ops->setup is always set. But there may
      be cases where ops make sense even without ->setup. In that case,
      forbid newlink and dellink.
      
      This allows registering simple rtnl link ops containing only ->kind,
      which provides a consistent way of passing the device kind (either
      device-kind or slave-kind) to userspace.
      Signed-off-by: Jiri Pirko <jiri@resnulli.us>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b0ab2fab
    • Y
      net: fix some typos in comment · 9bf2b8c2
      Ying Xue committed
      In commit 37112105 ("net: QDISC_STATE_RUNNING dont need atomic bit
      ops") __QDISC_STATE_RUNNING was renamed to __QDISC___STATE_RUNNING,
      but the old name was not completely replaced with the new one in
      comments.
      Signed-off-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9bf2b8c2
    • B
      ipv6: Allow accepting RA from local IP addresses. · d9333196
      Ben Greear committed
      This can be used in virtual networking applications, and
      may have other uses as well.  The option is disabled by
      default.
      
      A specific use case is setting up virtual routers, bridges, and
      hosts on a single OS without the use of network namespaces or
      virtual machines.  With proper use of ip rules, routing tables,
      veth interface pairs and/or other virtual interfaces,
      and applications that can bind to interfaces and/or IP addresses,
      it is possible to create one or more virtual routers with multiple
      hosts attached.  The host interfaces can act as IPv6 systems,
      with radvd running on the ports in the virtual routers.  With the
      option provided in this patch enabled, those hosts can now properly
      obtain IPv6 addresses from the radvd.
      Signed-off-by: Ben Greear <greearb@candelatech.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d9333196