1. 13 Jun, 2014 16 commits
    • D
      Merge branch 'fec' · fba0e1a3
      David S. Miller committed
      Fugang Duan says:
      
      ====================
      net: fec: Enable Software TSO to improve the tx performance
      
      Add SG and software TSO support for FEC.
      This feature allows to improve outbound throughput performance.
      Tested on imx6dl sabresd board, running iperf tcp tests shows:
              * 82% improvement comparing with NO SG & TSO patch
      
      $ ethtool -K eth0 sg on
      $ ethtool -K eth0 tso on
      [  3] local 10.192.242.108 port 35388 connected with 10.192.242.167 port 5001
      [ ID] Interval       Transfer     Bandwidth
      [  3]  0.0- 3.0 sec   181 MBytes   506 Mbits/sec
      * cpu loading is 30%
      
      $ ethtool -K eth0 sg off
      $ ethtool -K eth0 tso off
      [  3] local 10.192.242.108 port 52618 connected with 10.192.242.167 port 5001
      [ ID] Interval       Transfer     Bandwidth
      [  3]  0.0- 3.0 sec  99.5 MBytes   278 Mbits/sec
      
      The FEC HW supports IP header and TCP/UDP hw checksum, and can transfer one frame across
      multiple buffer descriptors, but it doesn't support HW TSO. On imx6q/dl SoCs the FEC at Gbps
      speed has a HW bus bandwidth limitation (400Mbps ~ 700Mbps); on imx6sx SoCs the FEC at Gbps
      speed has no HW bandwidth limitation.
      
      The patch set just enables the TSO feature, following the approach of the mv643xx_eth driver.
      
      Test result analysis:
      imx6dl sabresd board: there is an 82% improvement. Since the imx6dl FEC HW has a bandwidth
                            limitation, the performance with SW TSO is a milestone.
      
      Additional test:
      imx6sx sdb board:
      Upstream still doesn't support imx6sx because some patches are still being upstreamed... they use the same FEC IP.
      Testing the SW TSO patches on the imx6sx sdb board in an internal kernel tree:
      No SW TSO patch: tx bandwidth 840Mbps, cpu loading is 100%.
      SW TSO patch:    tx bandwidth 942Mbps, cpu loading is 65%.
      This means the patch set greatly improves imx6sx FEC performance.
      
      V2:
      * From Frank Li's suggestion:
      	Change the API "fec_enet_txdesc_entry_free" name to "fec_enet_get_free_txdesc_num".
      * Summarizing David Laight's and Eric Dumazet's thoughts:
      	Change the RX BD entry number to 256.
      * From ezequiel's suggestion:
      	Follow the latest TSO fixes from his solution to rework the queue stop/wake-up.
      	Avoid unmapping the TSO header buffers.
      * From Eric Dumazet's suggestion:
      	Avoid copying more bytes than necessary by copying only the unaligned part of the payload
      	into the first descriptor. This suggestion would make the driver more complex. The imx6dl FEC
      	DMA needs 16-byte alignment, but cpu loading is not a problem (it is 30%) and the current
      	performance is already much better. Later chips like imx6sx have a Gigabit FEC whose DMA
      	supports byte alignment, so no memory copy is needed there. So the V2 version drops
      	the suggestion.
      	Anyway, thanks for Eric's response and suggestion.
      
      V3:
      * From David Laight's feedback:
      	Decided to drop the RX BD entry number change from the SW TSO patch set.
      	I will send a separate patch to increase the RX BD entries for the interrupt coalescing feature,
      	which will be supported in a later patch set.
      
      V4:
      * From David Laight's feedback:
      	Remove the conditional in fec_enet_get_bd_index().
      
      V5:
      * Patch #4 update:
        From David Laight's feedback:
      	"expect fec_enet_get_free_txdesc_num() to return one less than it does currently."
      	Change the function to return the space available, 0..size-1. It always leaves one free entry,
      	which is the same as the Linux circ_buf.
      
      Thanks to Eric and ezequiel for their help and ideas.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      fba0e1a3
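The V5 note above describes a free-descriptor count in the range 0..size-1 that always leaves one entry unused, as the Linux circ_buf helpers do. A minimal Python sketch of that arithmetic follows. It assumes a power-of-two ring size, and `ring_space`, `head`, and `tail` are illustrative names rather than the driver's own; the actual FEC code may compute this differently.

```python
def ring_space(head: int, tail: int, size: int) -> int:
    """Return the number of free slots in a ring of `size` entries.

    head: index where the producer will write next.
    tail: index of the oldest entry not yet consumed.
    One slot is always kept empty so that head == tail means "empty"
    and never "full". `size` must be a power of two (an assumption of
    this sketch, matching the circ_buf-style masking).
    """
    if size <= 0 or size & (size - 1):
        raise ValueError("size must be a power of two")
    return (tail - (head + 1)) & (size - 1)


if __name__ == "__main__":
    # Empty ring: every slot but the reserved one is available.
    print(ring_space(0, 0, 8))
    # Producer is one step behind the consumer: no space left.
    print(ring_space(7, 0, 8))
```

Keeping one slot permanently free costs a descriptor but lets the empty and full states be told apart without a separate counter.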
    • N
      net: fec: Add software TSO support · 79f33912
      Nimrod Andy committed
      Add software TSO support for FEC.
      This feature allows to improve outbound throughput performance.
      
      Tested on imx6dl sabresd board, running iperf tcp tests shows:
      - 16.2% improvement comparing with FEC SG patch
      - 82% improvement comparing with NO SG & TSO patch
      
      $ ethtool -K eth0 tso on
      $ iperf -c 10.192.242.167 -t 3 &
      [  3] local 10.192.242.108 port 35388 connected with 10.192.242.167 port 5001
      [ ID] Interval       Transfer     Bandwidth
      [  3]  0.0- 3.0 sec   181 MBytes   506 Mbits/sec
      
      During the testing, CPU loading is 30%.
      Since imx6dl FEC Bandwidth is limited to SOC system bus bandwidth, the
      performance with SW TSO is a milestone.
      
      CC: Ezequiel Garcia <ezequiel.garcia@free-electrons.com>
      CC: Eric Dumazet <eric.dumazet@gmail.com>
      CC: David Laight <David.Laight@ACULAB.COM>
      CC: Li Frank <B20596@freescale.com>
      Signed-off-by: Fugang Duan <B38611@freescale.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      79f33912
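As a conceptual illustration of what software TSO does, the sketch below splits one large TCP payload into MSS-sized segments, each with its own advancing sequence number. It is not the FEC driver's code: `tso_segments`, the MSS value, and the starting sequence number are assumptions chosen for illustration, and header construction and descriptor handling are left out.

```python
from typing import List, Tuple


def tso_segments(payload: bytes, mss: int, start_seq: int) -> List[Tuple[int, bytes]]:
    """Split one large TCP payload into (sequence number, data) segments.

    Each segment carries at most `mss` bytes, and its sequence number is
    the starting sequence number plus the byte offset of the segment.
    """
    if mss <= 0:
        raise ValueError("mss must be positive")
    segments = []
    for offset in range(0, len(payload), mss):
        chunk = payload[offset:offset + mss]
        # TCP sequence numbers are 32-bit and wrap around.
        seq = (start_seq + offset) & 0xFFFFFFFF
        segments.append((seq, chunk))
    return segments


if __name__ == "__main__":
    for seq, data in tso_segments(b"x" * 3000, mss=1448, start_seq=1000):
        print(seq, len(data))
```

Doing this split in the driver rather than higher in the stack means the stack processes one large buffer per send instead of one per segment, which is the kind of saving the CPU loading figures above point to.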
    • N
      net: fec: Add Scatter/gather support · 6e909283
      Nimrod Andy committed
      Add Scatter/gather support for FEC.
      This feature allows to improve outbound throughput performance.
      
      Tested on imx6dl sabresd board:
      Running iperf tests shows a 55.4% improvement.
      
      $ ethtool -K eth0 sg off
      $ iperf -c 10.192.242.167 -t 3 &
      [  3] local 10.192.242.108 port 52618 connected with 10.192.242.167 port 5001
      [ ID] Interval       Transfer     Bandwidth
      [  3]  0.0- 3.0 sec  99.5 MBytes   278 Mbits/sec
      
      $ ethtool -K eth0 sg on
      $ iperf -c 10.192.242.167 -t 3 &
      [  3] local 10.192.242.108 port 52617 connected with 10.192.242.167 port 5001
      [ ID] Interval       Transfer     Bandwidth
      [  3]  0.0- 3.0 sec   154 MBytes   432 Mbits/sec
      
      CC: Li Frank <B20596@freescale.com>
      Signed-off-by: Fugang Duan <B38611@freescale.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6e909283
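The improvement percentages quoted in these commit messages can be reproduced from the iperf bandwidth figures. A small sketch, where `improvement_percent` is an illustrative helper name:

```python
def improvement_percent(before_mbps: float, after_mbps: float) -> float:
    """Relative throughput gain of `after_mbps` over `before_mbps`, in percent."""
    if before_mbps <= 0:
        raise ValueError("baseline bandwidth must be positive")
    return (after_mbps - before_mbps) / before_mbps * 100.0


if __name__ == "__main__":
    # SG off (278 Mbits/sec) versus SG on (432 Mbits/sec).
    print(round(improvement_percent(278, 432), 1))
    # No SG and no TSO (278 Mbits/sec) versus SG plus TSO (506 Mbits/sec).
    print(round(improvement_percent(278, 506), 1))
```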
    • N
      net: fec: Increase buffer descriptor entry number · 55d0218a
      Nimrod Andy committed
      In order to support SG, software TSO, let's increase BD entry number.
      
      CC: Ezequiel Garcia <ezequiel.garcia@free-electrons.com>
      CC: Eric Dumazet <eric.dumazet@gmail.com>
      CC: David Laight <David.Laight@ACULAB.COM>
      Signed-off-by: Fugang Duan <B38611@freescale.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      55d0218a
    • N
      net: fec: Factorize feature setting · 09d1e541
      Nimrod Andy committed
      To make the code more readable, let's factor out the
      feature list.
      Signed-off-by: Fugang Duan <B38611@freescale.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      09d1e541
    • N
      net: fec: Enable IP header hardware checksum · 96c50caa
      Nimrod Andy committed
      By default the IP header checksum is calculated by the network layer.
      To support software TSO, it is better to let the HW calculate the
      IP header checksum.

      The FEC hw checksum feature requires the checksum field in the frame
      to be zero, otherwise the calculated CRC is not correct.

      For a segmented TCP packet, the HW calculates the IP header checksum again,
      which has no adverse impact. For SW TSO, the HW-calculated checksum brings
      better performance.
      Signed-off-by: Fugang Duan <B38611@freescale.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      96c50caa
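The requirement that the checksum field be zero before the checksum is calculated can be illustrated with the standard Internet checksum, the ones' complement of the ones' complement sum of the header's 16-bit words. The sketch below is plain Python arithmetic, not the FEC hardware logic; `ipv4_header_checksum` and the sample header bytes are illustrative.

```python
def ipv4_header_checksum(header: bytes) -> int:
    """Internet checksum over an IPv4 header.

    The caller is expected to have zeroed the checksum field (bytes 10-11)
    first; a stale value there would be folded into the sum and give a
    wrong result.
    """
    if len(header) % 2:
        raise ValueError("header length must be even")
    total = 0
    for i in range(0, len(header), 2):
        total += (header[i] << 8) | header[i + 1]
    # Fold carries back into the low 16 bits (ones' complement addition).
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


if __name__ == "__main__":
    # A 20-byte sample header with the checksum field set to zero.
    zeroed = bytes.fromhex("450000730000400040110000c0a80001c0a800c7")
    print(hex(ipv4_header_checksum(zeroed)))
```

Running the same function over a header that already contains its correct checksum yields zero, which is how a receiver verifies it.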
    • N
      net: fec: Factorize the .xmit transmit function · 61a4427b
      Nimrod Andy committed
      Make the code more readable and make it easier to support other features
      like SG and TSO by moving the common transmit code into one function.

      The patch also factors getting the BD index out into its own function.

      CC: David Laight <David.Laight@ACULAB.COM>
      Signed-off-by: Fugang Duan <B38611@freescale.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      61a4427b
    • L
      bridge: fix compile error when compiling without IPv6 support · 3993c4e1
      Linus Lüssing committed
      Some fields in "struct net_bridge" aren't available when compiling the
      kernel without IPv6 support. Therefore adding a check/macro to skip the
      complaining code sections in that case.
      
      Introduced by 2cd41431
      ("bridge: memorize and export selected IGMP/MLD querier port")
      Reported-by: kbuild test robot <fengguang.wu@intel.com>
      Signed-off-by: Linus Lüssing <linus.luessing@web.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3993c4e1
    • L
      bridge: fix smatch warning / potential null pointer dereference · 6c03ee8b
      Linus Lüssing committed
      "New smatch warnings:
        net/bridge/br_multicast.c:1368 br_ip6_multicast_query() error:
          we previously assumed 'group' could be null (see line 1349)"
      
      In the rare (sort of broken) case of a query having a Maximum
      Response Delay of zero, we could create a potential null pointer
      dereference.
      
      Fixing this by skipping the multicast specific MLD Query parsing again
      if no multicast group address is available.
      
      Introduced by dc4eb53a
      ("bridge: adhere to querier election mechanism specified by RFCs")
      Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Linus Lüssing <linus.luessing@web.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6c03ee8b
    • F
      via-rhine: fix full-duplex with autoneg disable · 17958438
      François Cachereul committed
      With some specific configurations (VT6105M on Soekris 5510 and depending
      on the device at the other end), fragmented packets were not transmitted
      when forcing 100 full-duplex with autoneg disabled.

      This fix now writes the chip's full-duplex register when forcing full or
      half-duplex, not only when autoneg is enabled.
      Signed-off-by: François Cachereul <f.cachereul@alphalink.fr>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      17958438
    • D
      Merge branch 'bnx2x' · a4d3de0d
      David S. Miller committed
      Yuval Mintz says:
      
      ====================
      bnx2x: Bug fixes patch series
      
      This patch series contains various bug fixes - 2 link related fixes,
      one sriov-related issue and an additional fix for a theoretical bug
      on new boards.
      
      Please consider applying these patches to `net'.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a4d3de0d
    • A
      bnx2x: Enlarge the dorq threshold for VFs · f2cfa997
      Ariel Elior committed
      A malicious VF might try to starve the other VFs & PF by creating
      continuous doorbell floods. In order to negate this, HW has a threshold of
      doorbells per client, which will stop the client doorbells from arriving
      if crossed.
      
      The threshold currently configured for VFs is too low - under extreme traffic
      scenarios, it's possible for a VF to reach the threshold and thus for its
      fastpath to stop working.
      Signed-off-by: Ariel Elior <ariel.elior@qlogic.com>
      Signed-off-by: Yuval Mintz <yuval.mintz@qlogic.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f2cfa997
    • Y
      bnx2x: Check for UNDI in uncommon branch · b17b0ca1
      Yuval Mintz committed
      If L2FW utilized by the UNDI driver has the same version number as that
      of the regular FW, a driver loading after UNDI and receiving an uncommon
      answer from management will mistakenly assume the loaded FW matches its
      own requirement and try to exit the flow via FLR.
      Signed-off-by: Yuval Mintz <yuval.mintz@qlogic.com>
      Signed-off-by: Ariel Elior <ariel.elior@qlogic.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b17b0ca1
    • Y
      bnx2x: Fix 1G-baseT link · a2755be5
      Yaniv Rosner committed
      Set the phy access mode even in case of link-flap avoidance.
      Signed-off-by: Yaniv Rosner <yaniv.rosner@qlogic.com>
      Signed-off-by: Yuval Mintz <yuval.mintz@qlogic.com>
      Signed-off-by: Ariel Elior <ariel.elior@qlogic.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a2755be5
    • Y
      bnx2x: Fix link for KR with swapped polarity lane · dad91ee4
      Yaniv Rosner committed
      This avoids clearing the RX polarity setting in KR mode when the polarity lane
      is swapped, as otherwise this will result in a failed link.
      Signed-off-by: Yaniv Rosner <yaniv.rosner@qlogic.com>
      Signed-off-by: Yuval Mintz <yuval.mintz@qlogic.com>
      Signed-off-by: Ariel Elior <ariel.elior@qlogic.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      dad91ee4
    • X
      sctp: Fix sk_ack_backlog wrap-around problem · d3217b15
      Xufeng Zhang committed
      Consider the scenario:
      For a TCP-style socket, while processing the COOKIE_ECHO chunk in
      sctp_sf_do_5_1D_ce(), after it has passed a series of sanity checks,
      a new association is created in sctp_unpack_cookie(). If some later
      processing fails, sctp_association_free() is called to free the
      previously allocated association, and in sctp_association_free() the
      sk_ack_backlog value is decremented for this socket. Since the initial
      value of sk_ack_backlog is 0, after the decrement it becomes 65535:
      a wrap-around. If we then try to establish new associations on the
      same socket, ABORT is triggered because sctp deems the accept queue
      to be full.
      Fix this issue by only decrementing sk_ack_backlog for associations in
      the endpoint's list.
      Fix-suggested-by: Neil Horman <nhorman@tuxdriver.com>
      Signed-off-by: Xufeng Zhang <xufeng.zhang@windriver.com>
      Acked-by: Daniel Borkmann <dborkman@redhat.com>
      Acked-by: Vlad Yasevich <vyasevich@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d3217b15
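The wrap-around described above is ordinary unsigned 16-bit arithmetic: decrementing 0 yields 65535. The sketch below models that and the shape of the fix in Python. `dec_u16` and `release_association` are hypothetical names for illustration, not functions from the SCTP code.

```python
U16_MASK = 0xFFFF


def dec_u16(value: int) -> int:
    """Decrement a value stored in an unsigned 16-bit field."""
    return (value - 1) & U16_MASK


def release_association(backlog: int, on_endpoint_list: bool) -> int:
    """Model of the fix: only associations that were actually added to the
    endpoint's list (and therefore counted) decrement the backlog."""
    return dec_u16(backlog) if on_endpoint_list else backlog


if __name__ == "__main__":
    # Unconditional decrement of an empty backlog wraps around.
    print(dec_u16(0))
    # With the guard, freeing a never-listed association leaves it at 0.
    print(release_association(0, on_endpoint_list=False))
```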
  2. 12 Jun, 2014 24 commits
    • D
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · 902455e0
      David S. Miller committed
      Conflicts:
      	net/core/rtnetlink.c
      	net/core/skbuff.c
      
      Both conflicts were very simple overlapping changes.
      Signed-off-by: David S. Miller <davem@davemloft.net>
      902455e0
    • D
      net/core: Add VF link state control policy · c5b46160
      Doug Ledford committed
      Commit 1d8faf48 (net/core: Add VF link state control) added VF link state
      control to the netlink VF nested structure, but failed to add a proper entry
      for the new structure into the VF policy table.  Add the missing entry so
      the table and the actual data copied into the netlink nested struct are in
      sync.
      Signed-off-by: Doug Ledford <dledford@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c5b46160
    • A
      39f33367
    • S
      net/fsl: Make xgmac_mdio read error message useful · 55fd3641
      Shruti Kanetkar committed
      Print the device address, the register number and the PHY ID for
      which the MDIO read operation failed.
      Signed-off-by: Shruti Kanetkar <Shruti@Freescale.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      55fd3641
    • F
      net_sched: drr: warn when qdisc is not work conserving · 6e765a00
      Florian Westphal committed
      The DRR scheduler requires that items on the active list are work
      conserving, i.e. do not hold on to skbs for throttling purposes, etc.
      Attaching e.g. tbf renders DRR useless because all other classes on the
      active list are delayed as well.
      
      So, warn users that this configuration won't work as expected; we
      already do this in couple of other qdiscs, see e.g.
      
      commit b00355db
      ('pkt_sched: sch_hfsc: sch_htb: Add non-work-conserving warning handler')
      
      The 'const' change is needed to avoid compiler warning ("discards 'const'
      qualifier from pointer target type").
      
      tested with:
      drr_hier() {
              parent=$1
              classes=$2
              for i in  $(seq 1 $classes); do
                      classid=$parent$(printf %x $i)
                      tc class add dev eth0 parent $parent classid $classid drr
      		tc qdisc add dev eth0 parent $classid tbf rate 64kbit burst 256kbit limit 64kbit
              done
      }
      tc qdisc add dev eth0 root handle 1: drr
      drr_hier 1: 32
      tc filter add dev eth0 protocol all pref 1 parent 1: handle 1 flow hash keys dst perturb 1 divisor 32
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6e765a00
    • D
      Merge branch 'inet_csums' · f3591fd4
      David S. Miller committed
      Tom Herbert says:
      
      ====================
      net: Checksum offload changes - Part IV
      
      I am working on overhauling RX checksum offload. Goals of this effort
      are:
      
      - Specify what exactly it means when driver returns CHECKSUM_UNNECESSARY
      - Preserve CHECKSUM_COMPLETE through encapsulation layers
      - Don't do skb_checksum more than once per packet
      - Unify GRO and non-GRO csum verification as much as possible
      - Unify the checksum functions (checksum_init)
      - Simplify code
      
      What is in this fourth patch set:
      
      - Preserve CHECKSUM_COMPLETE instead of changing it to
        CHECKSUM_UNNECESSARY. This allows correct reuse in validating multiple
        csums in a packet.
      - When SW needs to compute the packet checksum, save it as
        CHECKSUM_COMPLETE. Also mark that checksum was computed by SW.
      - Add skb_gro_postpull_rcsum to udp and vxlan to make GRO work with
        CHECKSUM_COMPLETE.
      
      v2: Removed patch setting skb_encapsulation when validating checksum
          in tcp_gro_receive
      
      Please review carefully and test if possible, mucking with basic
      checksum functions is always a little precarious :-)
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f3591fd4
    • T
      net: Add skb_gro_postpull_rcsum to udp and vxlan · 6bae1d4c
      Tom Herbert committed
      Need to gro_postpull_rcsum for GRO to work with checksum complete.
      Signed-off-by: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6bae1d4c
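The idea behind a "post-pull" checksum adjustment can be shown with ones' complement arithmetic: if a full-packet sum is known and a header is pulled off the front, the header's sum can be subtracted so the remaining value still describes the remaining data. The sketch below is an arithmetic illustration under simplifying assumptions (16-bit folded sums, even-length header); the kernel uses its own checksum types and helpers, and `fold16`, `csum16`, and `csum_sub` here are illustrative names only.

```python
def fold16(total: int) -> int:
    """Fold carries into the low 16 bits (ones' complement addition)."""
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return total


def csum16(data: bytes) -> int:
    """Ones' complement sum of big-endian 16-bit words (not inverted)."""
    if len(data) % 2:
        raise ValueError("data length must be even in this sketch")
    return fold16(sum((data[i] << 8) | data[i + 1] for i in range(0, len(data), 2)))


def csum_sub(csum: int, removed: int) -> int:
    """Subtract `removed` from `csum`: add its ones' complement."""
    return fold16(csum + (~removed & 0xFFFF))


if __name__ == "__main__":
    header = bytes.fromhex("12345678")
    payload = bytes.fromhex("00010002")
    whole = csum16(header + payload)
    # After pulling the header, the adjusted sum matches the payload's sum.
    print(hex(csum_sub(whole, csum16(header))), hex(csum16(payload)))
```

Ones' complement has two representations of zero (0x0000 and 0xFFFF), so a comparison of adjusted sums should treat those two values as equal; the sample data here avoids that case.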
    • T
      net: Save software checksum complete · 7e3cead5
      Tom Herbert committed
      In skb_checksum complete, if we need to compute the checksum for the
      packet (via skb_checksum) save the result as CHECKSUM_COMPLETE.
      Subsequent checksum verification can use this.
      
      Also, added csum_complete_sw flag to distinguish between software and
      hardware generated checksum complete, we should always be able to trust
      the software computation.
      Signed-off-by: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7e3cead5
    • T
      net: Preserve CHECKSUM_COMPLETE at validation · 5d0c2b95
      Tom Herbert committed
      Currently when the first checksum in a packet is validated using
      CHECKSUM_COMPLETE, ip_summed is overwritten to be CHECKSUM_UNNECESSARY
      so that any subsequent checksums in the packet are not correctly
      validated.
      
      This patch adds csum_valid flag in sk_buff and uses that to indicate
      validated checksum instead of setting CHECKSUM_UNNECESSARY. The bit
      is set accordingly in the skb_checksum_validate_* functions. The flag
      is checked in skb_checksum_complete, so that validation is communicated
      between checksum_init and checksum_complete sequence in TCP and UDP.
      Signed-off-by: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5d0c2b95
    • D
      Merge branch 'qlcnic-next' · 1054cc15
      David S. Miller committed
      Shahed Shaikh says:
      
      ====================
      This series contains an enhancement in the area of firmware minidump collection
      and optimization of ring count validation function.
      
      Please apply this series to net-next.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1054cc15
    • S
      qlcnic: Update version to 5.3.60 · 038782d6
      Shahed Shaikh committed
      Signed-off-by: Shahed Shaikh <shahed.shaikh@qlogic.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      038782d6
    • S
      qlcnic: Optimize ring count validations · 18e0d625
      Shahed Shaikh committed
      - Check interrupt mode at the start of qlcnic_set_channels().
      - Do not validate ring counts if they are not going to change.
      Signed-off-by: Shahed Shaikh <shahed.shaikh@qlogic.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      18e0d625
    • S
      qlcnic: Pre-allocate DMA buffer used for minidump collection · 4da005cf
      Shahed Shaikh committed
      Pre-allocate the physically contiguous DMA buffer used for
      minidump collection at driver load time, rather than at
      run time, to minimize allocation failures. Driver will allocate
      the buffer at load time if PEX DMA support capability is indicated
      by the adapter.
      Signed-off-by: Shahed Shaikh <shahed.shaikh@qlogic.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4da005cf
    • D
      ip_vti: fix sparse warnings for VTI_ISVTI · efd0f11d
      Dmitry Popov committed
      This patch fixes the following sparse warnings:
      
      net/ipv4/ip_tunnel.c:245:53: warning: restricted __be16 degrades to integer
      net/ipv4/ip_vti.c:321:19: warning: incorrect type in assignment (different base types)
      net/ipv4/ip_vti.c:321:19:    expected restricted __be16 [addressable] [assigned] [usertype] i_flags
      net/ipv4/ip_vti.c:321:19:    got int
      net/ipv4/ip_vti.c:447:24: warning: incorrect type in assignment (different base types)
      net/ipv4/ip_vti.c:447:24:    expected restricted __be16 [usertype] i_flags
      net/ipv4/ip_vti.c:447:24:    got int
      
      Since VTI_ISVTI is always used with ip_tunnel_parm->i_flags (which is __be16),
      we can __force cast VTI_ISVTI to __be16 in header file.
      Signed-off-by: Dmitry Popov <ixaphire@qrator.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      efd0f11d
    • D
      drivers: net: davinci_cpdma: double free on error · 2f87208e
      Dan Carpenter committed
      We recently changed the kzalloc() to devm_kzalloc() so freeing "ctlr"
      here could lead to a double free.

      Fixes: e1943128 ('drivers: net: davinci_cpdma: Convert kzalloc() to devm_kzalloc().')
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2f87208e
    • D
      amd-xgbe: unwind on error in xgbe_mdio_register() · 8fc908c3
      Dan Carpenter committed
      There is a typo here so we return directly instead of unwinding.
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Acked-by: Tom Lendacky <thomas.lendacky@amd.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8fc908c3
    • V
      mrf24j40: add device managed APIs · 0aaf43f5
      Varka Bhadram committed
      Add the device managed APIs so that there is no need to worry about
      freeing the resources.
      Signed-off-by: Varka Bhadram <varkab@cdac.in>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0aaf43f5
    • S
      ceph: remove bogus extern · f6479449
      stephen hemminger committed
      Sparse complained about this bogus extern on the definition of
      a function.
      Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f6479449
    • A
      net: filter: document internal instruction encoding · 783e327b
      Alexei Starovoitov committed
      This patch adds a description of eBPF's instruction encoding in order
      to bring the documentation in line with the implementation.
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      783e327b
    • A
      net: filter: mention eBPF terminology as well · e4ad4032
      Alexei Starovoitov committed
      Since the term eBPF is used anyway in mailing list discussions, let's
      also document that in the main BPF documentation file and replace a
      couple of occurrences with eBPF terminology to be more clear.
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e4ad4032
    • E
      ipv4: fix a race in ip4_datagram_release_cb() · 9709674e
      Eric Dumazet committed
      Alexey provided an AddressSanitizer[1] report that finally gave a good hint
      at the origin of various problems already reported by Dormando
      in the past [2].
      
      Problem comes from the fact that UDP can have a lockless TX path, and
      concurrent threads can manipulate sk_dst_cache, while another thread,
      is holding socket lock and calls __sk_dst_set() in
      ip4_datagram_release_cb() (this was added in linux-3.8)
      
      It seems that all we need to do is to use sk_dst_check() and
      sk_dst_set() so that all the writers hold same spinlock
      (sk->sk_dst_lock) to prevent corruptions.
      
      The TCP stack does not need this protection, as all sk_dst_cache writers hold
      the socket lock.
      
      [1]
      https://code.google.com/p/address-sanitizer/wiki/AddressSanitizerForKernel
      
      AddressSanitizer: heap-use-after-free in ipv4_dst_check
      Read of size 2 by thread T15453:
       [<ffffffff817daa3a>] ipv4_dst_check+0x1a/0x90 ./net/ipv4/route.c:1116
       [<ffffffff8175b789>] __sk_dst_check+0x89/0xe0 ./net/core/sock.c:531
       [<ffffffff81830a36>] ip4_datagram_release_cb+0x46/0x390 ??:0
       [<ffffffff8175eaea>] release_sock+0x17a/0x230 ./net/core/sock.c:2413
       [<ffffffff81830882>] ip4_datagram_connect+0x462/0x5d0 ??:0
       [<ffffffff81846d06>] inet_dgram_connect+0x76/0xd0 ./net/ipv4/af_inet.c:534
       [<ffffffff817580ac>] SYSC_connect+0x15c/0x1c0 ./net/socket.c:1701
       [<ffffffff817596ce>] SyS_connect+0xe/0x10 ./net/socket.c:1682
       [<ffffffff818b0a29>] system_call_fastpath+0x16/0x1b
      ./arch/x86/kernel/entry_64.S:629
      
      Freed by thread T15455:
       [<ffffffff8178d9b8>] dst_destroy+0xa8/0x160 ./net/core/dst.c:251
       [<ffffffff8178de25>] dst_release+0x45/0x80 ./net/core/dst.c:280
       [<ffffffff818304c1>] ip4_datagram_connect+0xa1/0x5d0 ??:0
       [<ffffffff81846d06>] inet_dgram_connect+0x76/0xd0 ./net/ipv4/af_inet.c:534
       [<ffffffff817580ac>] SYSC_connect+0x15c/0x1c0 ./net/socket.c:1701
       [<ffffffff817596ce>] SyS_connect+0xe/0x10 ./net/socket.c:1682
       [<ffffffff818b0a29>] system_call_fastpath+0x16/0x1b
      ./arch/x86/kernel/entry_64.S:629
      
      Allocated by thread T15453:
       [<ffffffff8178d291>] dst_alloc+0x81/0x2b0 ./net/core/dst.c:171
       [<ffffffff817db3b7>] rt_dst_alloc+0x47/0x50 ./net/ipv4/route.c:1406
       [<     inlined    >] __ip_route_output_key+0x3e8/0xf70
      __mkroute_output ./net/ipv4/route.c:1939
       [<ffffffff817dde08>] __ip_route_output_key+0x3e8/0xf70 ./net/ipv4/route.c:2161
       [<ffffffff817deb34>] ip_route_output_flow+0x14/0x30 ./net/ipv4/route.c:2249
       [<ffffffff81830737>] ip4_datagram_connect+0x317/0x5d0 ??:0
       [<ffffffff81846d06>] inet_dgram_connect+0x76/0xd0 ./net/ipv4/af_inet.c:534
       [<ffffffff817580ac>] SYSC_connect+0x15c/0x1c0 ./net/socket.c:1701
       [<ffffffff817596ce>] SyS_connect+0xe/0x10 ./net/socket.c:1682
       [<ffffffff818b0a29>] system_call_fastpath+0x16/0x1b
      ./arch/x86/kernel/entry_64.S:629
      
      [2]
      <4>[196727.311203] general protection fault: 0000 [#1] SMP
      <4>[196727.311224] Modules linked in: xt_TEE xt_dscp xt_DSCP macvlan bridge coretemp crc32_pclmul ghash_clmulni_intel gpio_ich microcode ipmi_watchdog ipmi_devintf sb_edac edac_core lpc_ich mfd_core tpm_tis tpm tpm_bios ipmi_si ipmi_msghandler isci igb libsas i2c_algo_bit ixgbe ptp pps_core mdio
      <4>[196727.311333] CPU: 17 PID: 0 Comm: swapper/17 Not tainted 3.10.26 #1
      <4>[196727.311344] Hardware name: Supermicro X9DRi-LN4+/X9DR3-LN4+/X9DRi-LN4+/X9DR3-LN4+, BIOS 3.0 07/05/2013
      <4>[196727.311364] task: ffff885e6f069700 ti: ffff885e6f072000 task.ti: ffff885e6f072000
      <4>[196727.311377] RIP: 0010:[<ffffffff815f8c7f>]  [<ffffffff815f8c7f>] ipv4_dst_destroy+0x4f/0x80
      <4>[196727.311399] RSP: 0018:ffff885effd23a70  EFLAGS: 00010282
      <4>[196727.311409] RAX: dead000000200200 RBX: ffff8854c398ecc0 RCX: 0000000000000040
      <4>[196727.311423] RDX: dead000000100100 RSI: dead000000100100 RDI: dead000000200200
      <4>[196727.311437] RBP: ffff885effd23a80 R08: ffffffff815fd9e0 R09: ffff885d5a590800
      <4>[196727.311451] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
      <4>[196727.311464] R13: ffffffff81c8c280 R14: 0000000000000000 R15: ffff880e85ee16ce
      <4>[196727.311510] FS:  0000000000000000(0000) GS:ffff885effd20000(0000) knlGS:0000000000000000
      <4>[196727.311554] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      <4>[196727.311581] CR2: 00007a46751eb000 CR3: 0000005e65688000 CR4: 00000000000407e0
      <4>[196727.311625] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      <4>[196727.311669] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      <4>[196727.311713] Stack:
      <4>[196727.311733]  ffff8854c398ecc0 ffff8854c398ecc0 ffff885effd23ab0 ffffffff815b7f42
      <4>[196727.311784]  ffff88be6595bc00 ffff8854c398ecc0 0000000000000000 ffff8854c398ecc0
      <4>[196727.311834]  ffff885effd23ad0 ffffffff815b86c6 ffff885d5a590800 ffff8816827821c0
      <4>[196727.311885] Call Trace:
      <4>[196727.311907]  <IRQ>
      <4>[196727.311912]  [<ffffffff815b7f42>] dst_destroy+0x32/0xe0
      <4>[196727.311959]  [<ffffffff815b86c6>] dst_release+0x56/0x80
      <4>[196727.311986]  [<ffffffff81620bd5>] tcp_v4_do_rcv+0x2a5/0x4a0
      <4>[196727.312013]  [<ffffffff81622b5a>] tcp_v4_rcv+0x7da/0x820
      <4>[196727.312041]  [<ffffffff815fd9e0>] ? ip_rcv_finish+0x360/0x360
      <4>[196727.312070]  [<ffffffff815de02d>] ? nf_hook_slow+0x7d/0x150
      <4>[196727.312097]  [<ffffffff815fd9e0>] ? ip_rcv_finish+0x360/0x360
      <4>[196727.312125]  [<ffffffff815fda92>] ip_local_deliver_finish+0xb2/0x230
      <4>[196727.312154]  [<ffffffff815fdd9a>] ip_local_deliver+0x4a/0x90
      <4>[196727.312183]  [<ffffffff815fd799>] ip_rcv_finish+0x119/0x360
      <4>[196727.312212]  [<ffffffff815fe00b>] ip_rcv+0x22b/0x340
      <4>[196727.312242]  [<ffffffffa0339680>] ? macvlan_broadcast+0x160/0x160 [macvlan]
      <4>[196727.312275]  [<ffffffff815b0c62>] __netif_receive_skb_core+0x512/0x640
      <4>[196727.312308]  [<ffffffff811427fb>] ? kmem_cache_alloc+0x13b/0x150
      <4>[196727.312338]  [<ffffffff815b0db1>] __netif_receive_skb+0x21/0x70
      <4>[196727.312368]  [<ffffffff815b0fa1>] netif_receive_skb+0x31/0xa0
      <4>[196727.312397]  [<ffffffff815b1ae8>] napi_gro_receive+0xe8/0x140
      <4>[196727.312433]  [<ffffffffa00274f1>] ixgbe_poll+0x551/0x11f0 [ixgbe]
      <4>[196727.312463]  [<ffffffff815fe00b>] ? ip_rcv+0x22b/0x340
      <4>[196727.312491]  [<ffffffff815b1691>] net_rx_action+0x111/0x210
      <4>[196727.312521]  [<ffffffff815b0db1>] ? __netif_receive_skb+0x21/0x70
      <4>[196727.312552]  [<ffffffff810519d0>] __do_softirq+0xd0/0x270
      <4>[196727.312583]  [<ffffffff816cef3c>] call_softirq+0x1c/0x30
      <4>[196727.312613]  [<ffffffff81004205>] do_softirq+0x55/0x90
      <4>[196727.312640]  [<ffffffff81051c85>] irq_exit+0x55/0x60
      <4>[196727.312668]  [<ffffffff816cf5c3>] do_IRQ+0x63/0xe0
      <4>[196727.312696]  [<ffffffff816c5aaa>] common_interrupt+0x6a/0x6a
      <4>[196727.312722]  <EOI>
      <1>[196727.313071] RIP  [<ffffffff815f8c7f>] ipv4_dst_destroy+0x4f/0x80
      <4>[196727.313100]  RSP <ffff885effd23a70>
      <4>[196727.313377] ---[ end trace 64b3f14fae0f2e29 ]---
      <0>[196727.380908] Kernel panic - not syncing: Fatal exception in interrupt
      Reported-by: Alexey Preobrazhensky <preobr@google.com>
      Reported-by: dormando <dormando@rydia.ne>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Fixes: 8141ed9f ("ipv4: Add a socket release callback for datagram sockets")
      Cc: Steffen Klassert <steffen.klassert@secunet.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9709674e
    • D
      net: filter: add test_bpf module under MAINTAINERS' networking section · a101ccd1
      Daniel Borkmann committed
      Add lib/test_bpf.c entry to maintainers file under networking.
      All changes were posted via netdev for review, so make sure
      other people Cc it as well when they call get_maintainer.pl.
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Cc: Alexei Starovoitov <ast@plumgrid.com>
      Acked-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a101ccd1
    • O
      net: add __pskb_copy_fclone and pskb_copy_for_clone · bad93e9d
      Octavian Purdila committed
      There are several instances where a pskb_copy or __pskb_copy is
      immediately followed by an skb_clone.
      
      Add a couple of new functions to allow the copy skb to be allocated
      from the fclone cache and thus speed up subsequent skb_clone calls.
      
      Cc: Alexander Smirnov <alex.bluesman.smirnov@gmail.com>
      Cc: Dmitry Eremin-Solenikov <dbaryshkov@gmail.com>
      Cc: Marek Lindner <mareklindner@neomailbox.ch>
      Cc: Simon Wunderlich <sw@simonwunderlich.de>
      Cc: Antonio Quartulli <antonio@meshcoding.com>
      Cc: Marcel Holtmann <marcel@holtmann.org>
      Cc: Gustavo Padovan <gustavo@padovan.org>
      Cc: Johan Hedberg <johan.hedberg@gmail.com>
      Cc: Arvid Brodin <arvid.brodin@alten.se>
      Cc: Patrick McHardy <kaber@trash.net>
      Cc: Pablo Neira Ayuso <pablo@netfilter.org>
      Cc: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
      Cc: Lauro Ramos Venancio <lauro.venancio@openbossa.org>
      Cc: Aloisio Almeida Jr <aloisio.almeida@openbossa.org>
      Cc: Samuel Ortiz <sameo@linux.intel.com>
      Cc: Jon Maloy <jon.maloy@ericsson.com>
      Cc: Allan Stephens <allan.stephens@windriver.com>
      Cc: Andrew Hendry <andrew.hendry@gmail.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Reviewed-by: Christoph Paasch <christoph.paasch@uclouvain.be>
      Signed-off-by: Octavian Purdila <octavian.purdila@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      bad93e9d
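The idea behind allocating the copy from the fclone cache can be sketched in userspace C. This is a toy model only: `struct buf`, `struct buf_pair`, `buf_copy()` and `buf_clone()` are hypothetical names invented for illustration and are not the kernel's `struct sk_buff` API. The sketch assumes that a "pair" allocation reserves a companion header up front, so that the first clone needs no further header allocation.

```c
#include <stdlib.h>
#include <string.h>

/* Toy model: NOT the kernel's struct sk_buff. */
struct buf {
    unsigned char *data;  /* payload, shared between a buffer and its clones */
    size_t len;
    int from_pair;        /* 1 if this header is the first half of a pair */
};

/* A pair holds the original header plus one preallocated clone slot. */
struct buf_pair {
    struct buf orig;      /* must stay the first member (see buf_clone) */
    struct buf companion;
    int companion_used;
};

static int alloc_count;   /* number of header allocations performed */

/* Copy `len` bytes into a new buffer. If `want_clone` is nonzero the
 * header comes from a pair, anticipating a later buf_clone(). */
static struct buf *buf_copy(const unsigned char *data, size_t len, int want_clone)
{
    struct buf *b;

    if (want_clone) {
        struct buf_pair *p = calloc(1, sizeof(*p));
        if (!p)
            return NULL;
        b = &p->orig;
        b->from_pair = 1;
    } else {
        b = calloc(1, sizeof(*b));
        if (!b)
            return NULL;
    }
    alloc_count++;

    b->data = malloc(len);
    if (!b->data) {
        free(b);          /* valid in both cases: orig is the first member */
        return NULL;
    }
    memcpy(b->data, data, len);
    b->len = len;
    return b;
}

/* Clone a buffer header. Uses the companion slot when one is free,
 * otherwise falls back to a separate allocation. Freeing and reference
 * counting are omitted to keep the sketch short. */
static struct buf *buf_clone(struct buf *b)
{
    struct buf *c;

    if (b->from_pair) {
        struct buf_pair *p = (struct buf_pair *)b;
        if (!p->companion_used) {
            p->companion_used = 1;
            c = &p->companion;
            c->data = b->data;
            c->len = b->len;
            c->from_pair = 0;
            return c;
        }
    }
    c = calloc(1, sizeof(*c));
    if (!c)
        return NULL;
    alloc_count++;
    c->data = b->data;
    c->len = b->len;
    return c;
}
```

In this model a copy followed by a clone costs one header allocation when the pair path is used and two otherwise, which is the saving the commit message describes for copy-then-clone call sites.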
    • J
      sfc: PIO:Restrict to 64bit arch and use 64-bit writes. · daf37b55
      Jon Cooper committed
      Fixes: ee45fd92 ("sfc: Use TX PIO for sufficiently small packets")
      
      The Linux net driver uses memcpy_toio() to copy into the PIO
      buffers.
      Even on a 64bit machine this causes 32bit accesses to a
      write-combined memory region.
      Hardware limitations mean that only naturally aligned 64bit
      accesses are safe in all cases.
      Because the region is write-combined, two 32bit accesses may
      be coalesced into a 64bit access that is not 64bit aligned.
      The solution is to open-code the memory copy routines using
      pointers and to enable PIO only on x86_64 machines.
      
      Not tested on platforms other than x86_64 because this patch
      disables the PIO feature on other platforms.
      Compile-tested on x86 to ensure that it still builds.
      
      The WARN_ON_ONCE() code in the previous version of this patch
      has been moved into the internal sfc debug driver as the
      assertion was unnecessary in the upstream kernel code.
      
      This bug fix applies to v3.13 and v3.14 stable branches.
      Signed-off-by: Shradha Shah <sshah@solarflare.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      daf37b55
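The open-coded copy described in the sfc commit above can be illustrated with a small userspace sketch. It is not the driver's code: `copy_qwords()` is a hypothetical helper, and an ordinary array stands in for the write-combined PIO region. It assumes the destination is 8-byte aligned and the length is a multiple of 8, and it rejects anything else instead of falling back to narrower stores.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not an sfc driver function): copy `len` bytes to
 * `dst` using only naturally aligned 64-bit stores.
 * Returns 0 on success, -1 if the alignment or length requirement is
 * not met. */
static int copy_qwords(volatile uint64_t *dst, const void *src, size_t len)
{
    const unsigned char *s = src;
    size_t i;

    /* Refuse anything that would require a narrower or unaligned store. */
    if (((uintptr_t)dst & 7) != 0 || (len & 7) != 0)
        return -1;

    for (i = 0; i < len / 8; i++) {
        uint64_t v;

        /* The source may be unaligned, so assemble the value first. */
        memcpy(&v, s + i * 8, sizeof(v));
        /* One 64-bit store per iteration; volatile keeps the compiler
         * from splitting or merging the stores. */
        dst[i] = v;
    }
    return 0;
}
```

In a real driver the destination would be an I/O mapping written through the kernel's MMIO accessors rather than a plain pointer; the sketch only shows the access-width discipline.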