1. 08 11月, 2013 6 次提交
  2. 07 11月, 2013 6 次提交
  3. 06 11月, 2013 10 次提交
    • J
      virtio-net: switch to use XPS to choose txq · 9bb8ca86
      Jason Wang 提交于
      We used to use a percpu structure vq_index to record the cpu to queue
      mapping, this is suboptimal since it duplicates the work of XPS and
      loses all other XPS functionality such as allowing user to configure
      their own transmission steering strategy.
      
      So this patch switches to use XPS and suggest a default mapping when
      the number of cpus is equal to the number of queues. With XPS support,
      there's no need for keeping per-cpu vq_index and .ndo_select_queue(),
      so they were removed also.
      
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Acked-by: NRusty Russell <rusty@rustcorp.com.au>
      Acked-by: NMichael S. Tsirkin <mst@redhat.com>
      Signed-off-by: NJason Wang <jasowang@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9bb8ca86
    • D
      ipv6: drop the judgement in rt6_alloc_cow() · 249a3630
      Duan Jiong 提交于
      Now rt6_alloc_cow() is only called by ip6_pol_route() when
      rt->rt6i_flags doesn't contain both RTF_NONEXTHOP and RTF_GATEWAY,
      and rt->rt6i_flags hasn't been changed in ip6_rt_copy().
      So there is no neccessary to judge whether rt->rt6i_flags contains
      RTF_GATEWAY or not.
      Signed-off-by: NDuan Jiong <duanj.fnst@cn.fujitsu.com>
      Acked-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      249a3630
    • H
      ipv6: fix headroom calculation in udp6_ufo_fragment · 0e033e04
      Hannes Frederic Sowa 提交于
      Commit 1e2bd517 ("udp6: Fix udp
      fragmentation for tunnel traffic.") changed the calculation if
      there is enough space to include a fragment header in the skb from a
      skb->mac_header dervived one to skb_headroom. Because we already peeled
      off the skb to transport_header this is wrong. Change this back to check
      if we have enough room before the mac_header.
      
      This fixes a panic Saran Neti reported. He used the tbf scheduler which
      skb_gso_segments the skb. The offsets get negative and we panic in memcpy
      because the skb was erroneously not expanded at the head.
      Reported-by: NSaran Neti <Saran.Neti@telus.com>
      Cc: Pravin B Shelar <pshelar@nicira.com>
      Signed-off-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0e033e04
    • J
      net: mv643xx_eth: Add missing phy_addr_set in DT mode · 1cce16d3
      Jason Gunthorpe 提交于
      Commit cc9d4598 'net: mv643xx_eth: use of_phy_connect if phy_node
      present' made the call to phy_scan optional, if the DT has a link to
      the phy node.
      
      However phy_scan has the side effect of calling phy_addr_set, which
      writes the phy MDIO address to the ethernet controller. If phy_addr_set
      is not called, and the bootloader has not set the correct address then
      the driver will fail to function.
      
      Tested on Kirkwood.
      Signed-off-by: NJason Gunthorpe <jgunthorpe@obsidianresearch.com>
      Acked-by: NSebastian Hesselbarth <sebastian.hesselbarth@gmail.com>
      Tested-by: NArnaud Ebalard <arno@natisbad.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1cce16d3
    • H
      ipv4: introduce new IP_MTU_DISCOVER mode IP_PMTUDISC_INTERFACE · 482fc609
      Hannes Frederic Sowa 提交于
      Sockets marked with IP_PMTUDISC_INTERFACE won't do path mtu discovery,
      their sockets won't accept and install new path mtu information and they
      will always use the interface mtu for outgoing packets. It is guaranteed
      that the packet is not fragmented locally. But we won't set the DF-Flag
      on the outgoing frames.
      
      Florian Weimer had the idea to use this flag to ensure DNS servers are
      never generating outgoing fragments. They may well be fragmented on the
      path, but the server never stores or usees path mtu values, which could
      well be forged in an attack.
      
      (The root of the problem with path MTU discovery is that there is
      no reliable way to authenticate ICMP Fragmentation Needed But DF Set
      messages because they are sent from intermediate routers with their
      source addresses, and the IMCP payload will not always contain sufficient
      information to identify a flow.)
      
      Recent research in the DNS community showed that it is possible to
      implement an attack where DNS cache poisoning is feasible by spoofing
      fragments. This work was done by Amir Herzberg and Haya Shulman:
      <https://sites.google.com/site/hayashulman/files/fragmentation-poisoning.pdf>
      
      This issue was previously discussed among the DNS community, e.g.
      <http://www.ietf.org/mail-archive/web/dnsext/current/msg01204.html>,
      without leading to fixes.
      
      This patch depends on the patch "ipv4: fix DO and PROBE pmtu mode
      regarding local fragmentation with UFO/CORK" for the enforcement of the
      non-fragmentable checks. If other users than ip_append_page/data should
      use this semantic too, we have to add a new flag to IPCB(skb)->flags to
      suppress local fragmentation and check for this in ip_finish_output.
      
      Many thanks to Florian Weimer for the idea and feedback while implementing
      this patch.
      
      Cc: David S. Miller <davem@davemloft.net>
      Suggested-by: NFlorian Weimer <fweimer@redhat.com>
      Signed-off-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      482fc609
    • D
      Merge branch 'huawei_cdc_ncm' · b9155501
      David S. Miller 提交于
      Bjørn Mork says:
      
      ====================
      The huawei_cdc_ncm driver.
      
      Enrico has been kind enough to let me repost his driver with the changes
      requested by Oliver Neukum during the last review of this series.
      
      The changes I have made from Enricos original v5 series to this version
      are:
      
      v6:
       - fix to avoid corrupting drvstate->pmcount
       - fix error return value from huawei_cdc_ncm_suspend()
       - drop redundant testing for subdriver->suspend during resume
       - broke a few lines to keep within the 80 columns recommendation
       - rebased on top of current net-next
      
      Enrico's orginal introduction to the v5 series follows below.  It explains
      the background much better than I can.
      
      Bjørn
      
      [quote Enrico Mioso]
      
      So this is a new, revised, edition of the huawei_cdc_ncm.c driver, which
      supports devices resembling the NCM standard, but using it also as a mean
      to encapsulate other protocols, as is the case for the Huawei E3131 and
      E3251 modem devices.
      Some precisations are needed however - and I encourage discussion on this: and
      that's why I'm sending this message with a broader CC.
      Merging those patches might change:
      - the way Modem Manager interacts with those devices
      - some regressions might be possible if there are some unknown firmware
        variants around (Franko?)
      
      First of all: I observed the behaviours of two devices.
      Huawei E3131: this device doesn't accept NDIS setup requests unless they're
      sent via the embedded AT channel exposed by this driver.
      So actually we gain funcionality in this case!
      
      The second case, is the Huawei E3251: which works with standard NCM driver,
      still exposing an AT embedded channel. Whith this patch set applied, you gain
      some funcionality, loosing the ability to catch standard NCM events for now.
      The device will work in both ways with no problems, but this has to be
      acknowledged and discussed. Might be we can develop this driver further to
      change this, when more devices are tested.
      
      We where thinking Huawei changed their interfaces on new devices - but probably
      this driver only works around a nice firmware bug present in E3131, which
      prevented the modem from being used in NDIS mode.
      
      I think committing this is definitely wortth-while, since it will allow for
      more Huawei devices to be used without serial connection. Some devices like the
      E3251 also, reports some status information only via the embedded AT channel,
      at least in my case.
      Note: I'm not subscribed to any list except the Modem Manager's one, so please
      CC me, thanks!!
      
      [/quote]
      
      Enrico Mioso (3):
        net: cdc_ncm: Export cdc_ncm_{tx,rx}_fixup functions for re-use
        net: huawei_cdc_ncm: Introduce the huawei_cdc_ncm driver
        net: cdc_ncm: remove non-standard NCM device IDs
      ====================
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b9155501
    • E
      net: cdc_ncm: remove non-standard NCM device IDs · 9fea037d
      Enrico Mioso 提交于
      Remove device IDs of NCM-like (but not NCM-conformant) devices, that are
      handled by the huawwei_cdc_ncm driver now.
      Signed-off-by: NEnrico Mioso <mrkiko.rs@gmail.com>
      Signed-off-by: NBjørn Mork <bjorn@mork.no>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9fea037d
    • E
      net: huawei_cdc_ncm: Introduce the huawei_cdc_ncm driver · 41c47d8c
      Enrico Mioso 提交于
      This driver supports devices using the NCM protocol as an encapsulation layer
      for other protocols, like the E3131 Huawei 3G modem. This drivers approach was
      heavily inspired by the qmi_wwan/cdc_mbim approach & code model.
      Signed-off-by: NEnrico Mioso <mrkiko.rs@gmail.com>
      Signed-off-by: NBjørn Mork <bjorn@mork.no>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      41c47d8c
    • E
      net: cdc_ncm: Export cdc_ncm_{tx, rx}_fixup functions for re-use · 2f69702c
      Enrico Mioso 提交于
      Some drivers implementing NCM-like protocols, may re-use those functions, as is
      the case in the huawei_cdc_ncm driver.
      Export them via EXPORT_SYMBOL_GPL, in accordance with how other functions have
      been exported.
      Signed-off-by: NEnrico Mioso <mrkiko.rs@gmail.com>
      Signed-off-by: NBjørn Mork <bjorn@mork.no>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2f69702c
    • F
      ipv6: remove old conditions on flow label sharing · b579035f
      Florent Fourcot 提交于
      The code of flow label in Linux Kernel follows
      the rules of RFC 1809 (an informational one) for
      conditions on flow label sharing. There rules are
      not in the last proposed standard for flow label
      (RFC 6437), or in the previous one (RFC 3697).
      
      Since this code does not follow any current or
      old standard, we can remove it.
      
      With this removal, the ipv6_opt_cmp function is
      now a dead code and it can be removed too.
      
      Changelog to v1:
       * add justification for the change
       * remove the condition on IPv6 options
      
      [ Remove ipv6_hdr_cmp and it is now unused as well. -DaveM ]
      Signed-off-by: NFlorent Fourcot <florent.fourcot@enst-bretagne.fr>
      Acked-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b579035f
  4. 05 11月, 2013 18 次提交
    • D
      Merge branch 'for-davem' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-next · cfce0a2b
      David S. Miller 提交于
      John W. Linville says:
      
      ====================
      Please accept the following pull request intended for the 3.13 tree...
      
      I had intended to pass most of these to you as much as two weeks ago.
      Unfortunately, I failed to account for the effects of bad Internet
      connections and my own fatique/laziness while traveling.  On the bright
      side, at least these have been baking in linux-next for some time!
      
      For the mac80211 bits, Johannes says:
      
      "This time I have two fixes for P2P (which requires not using CCK rates)
      and a workaround for APs with broken WMM information."
      
      For the iwlwifi bits, Johannes says:
      
      "I have a few fixes for warnings/issues: one from Alex, fixing scan
      timings, one from Emmanuel fixing a WARN_ON in the DVM driver, one from
      Stanislaw removing a trigger-happy WARN_ON in the MVM driver and a
      change from myself to try to recover when the device isn't processing
      commands quickly."
      
      And:
      
      "For this round, I have a lot of changes:
       * power management improvements
       * BT coexistence improvements/updates
       * new device support
       * VHT support
       * IBSS support (though due to a small bug it requires new firmware)
       * various other fixes/improvements."
      
      For the Bluetooth bits, Gustavo says:
      
      "More patches for 3.12, busy times for Bluetooth. More than a 100 commits since
      the last pull. The bulk of work comes from Johan and Marcel, they are doing
      fixes and improvements all over the Bluetooth subsystem, as the diffstat can
      show."
      
      For the ath10k and ath6kl bits, Kalle says:
      
      "Bartosz added support to ath10k for our 10.x AP firmware branch, which
      gives us AP specific features and fixes. We still support the main
      firmware branch as well just like before, ath10k detects runtime what
      firmware is used. Unfortunately the firmware interface in 10.x branch is
      somewhat different so there was quite a lot of changes in ath10k for
      this.
      
      Michal and Sujith did some performance improvements in ath10k. Vladimir
      fixed a compiler warning and Fengguang removed an extra semicolon."
      
      For the NFC bits, Samuel says:
      
      "It's a fairly big one, with the following highlights:
      
      - NFC digital layer implementation: Most NFC chipsets implement the NFC
        digital layer in firmware, but others have more basic functionalities
        and expect the host to implement the digital layer. This layer sits
        below the NFC core.
      
      - Sony's port100 support: This is "soft" NFC USB dongle that expects the
        digital layer to be implemented on the host. This is the first user of
        our NFC digital stack implementation.
      
      - Secure element API: We now provide a netlink API for enabling,
        disabling and discovering NFC attached (embedded or UICC ones) secure
        elements. With some userspace help, this allows us to support NFC
        payments.
        Only the pn544 driver currently supports that API.
      
      - NCI SPI fixes and improvements: In order to support NCI devices over
        SPI, we fixed and improved our NCI/SPI implementation. The currently
        most deployed NFC NCI chipset, Broadcom's bcm2079x, supports that mode
        and we're planning to use our NCI/SPI framework to implement a
        driver for it.
      
      - pn533 fragmentation support in target mode: This was the only missing
        feature from our pn533 impementation. We now support fragmentation in
        both Tx and Rx modes, in target mode."
      
      On top of all that, brcmfmac and rt2x00 both get the usual flurry
      of updates.  A few other drivers get hit here or there as well.
      ====================
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cfce0a2b
    • J
      virtio-net: coalesce rx frags when possible during rx · ba275241
      Jason Wang 提交于
      Commit 2613af0e (virtio_net: migrate mergeable
      rx buffers to page frag allocators) try to increase the payload/truesize for
      MTU-sized traffic. But this will introduce the extra overhead for GSO packets
      received because of the frag list. This commit tries to reduce this issue by
      coalesce the possible rx frags when possible during rx. Test result shows the
      about 15% improvement on full size GSO packet receiving (and even better than
      before commit 2613af0e).
      
      Before this commit:
      ./netperf -H 192.168.100.4
      MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.100.4
      () port 0 AF_INET : demo
      Recv   Send    Send
      Socket Socket  Message  Elapsed
      Size   Size    Size     Time     Throughput
      bytes  bytes   bytes    secs.    10^6bits/sec
      
       87380  16384  16384    10.00    20303.87
      
      After this commit:
      ./netperf -H 192.168.100.4
      MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.100.4
      () port 0 AF_INET : demo
      Recv   Send    Send
      Socket Socket  Message  Elapsed
      Size   Size    Size     Time     Throughput
      bytes  bytes   bytes    secs.    10^6bits/sec
      
       87380  16384  16384    10.00    23841.26
      
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Cc: Michael Dalton <mwdalton@google.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Acked-by: NMichael S. Tsirkin <mst@redhat.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NJason Wang <jasowang@redhat.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ba275241
    • J
      net: introduce skb_coalesce_rx_frag() · f8e617e1
      Jason Wang 提交于
      Sometimes we need to coalesce the rx frags to avoid frag list. One example is
      virtio-net driver which tries to use small frags for both MTU sized packet and
      GSO packet. So this patch introduce skb_coalesce_rx_frag() to do this.
      
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Cc: Michael Dalton <mwdalton@google.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Acked-by: NMichael S. Tsirkin <mst@redhat.com>
      Signed-off-by: NJason Wang <jasowang@redhat.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f8e617e1
    • D
      vxlan: Use ERR_CAST inlined function instead of ERR_PTR(PTR_ERR(...)) · e50fddc8
      Duan Jiong 提交于
      trivial patch converting ERR_PTR(PTR_ERR()) into ERR_CAST().
      No functional changes.
      Signed-off-by: NDuan Jiong <duanj.fnst@cn.fujitsu.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e50fddc8
    • J
      net: codel: Avoid undefined behavior from signed overflow · 1ba3aab3
      Jesper Dangaard Brouer 提交于
      As described in commit 5a581b36 (jiffies: Avoid undefined
      behavior from signed overflow), according to the C standard
      3.4.3p3, overflow of a signed integer results in undefined
      behavior.
      
      To fix this, do as the above commit, and do an unsigned
      subtraction, and interpreting the result as a signed
      two's-complement number.  This is based on the theory from
      RFC 1982 and is nicely described in wikipedia here:
       https://en.wikipedia.org/wiki/Serial_number_arithmetic#General_Solution
      
      A side-note, I have seen practical issues with the previous logic
      when dealing with 16-bit, on a 64-bit machine (gcc version
      4.4.5). This were 32-bit, which I have not observed issues with.
      
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: NJesper Dangaard Brouer <netoptimizer@brouer.com>
      Acked-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1ba3aab3
    • D
      Merge branch 'for-davem' of git://gitorious.org/linux-can/linux-can-next · 13521a57
      David S. Miller 提交于
      Marc Kleine-Budde says:
      
      ====================
      here's a pull request for net-next.
      
      It includes a patch by Oliver Hartkopp et al. that adds documentation
      for the broadcast manager to Documentation/networking/can.txt. Three
      patches by me that clean up the netlink handling code in the CAN core.
      And another patch that removes a not needed function from the ti_hecc
      driver.
      ====================
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      13521a57
    • Y
      tcp: properly handle stretch acks in slow start · 9f9843a7
      Yuchung Cheng 提交于
      Slow start now increases cwnd by 1 if an ACK acknowledges some packets,
      regardless the number of packets. Consequently slow start performance
      is highly dependent on the degree of the stretch ACKs caused by
      receiver or network ACK compression mechanisms (e.g., delayed-ACK,
      GRO, etc).  But slow start algorithm is to send twice the amount of
      packets of packets left so it should process a stretch ACK of degree
      N as if N ACKs of degree 1, then exits when cwnd exceeds ssthresh. A
      follow up patch will use the remainder of the N (if greater than 1)
      to adjust cwnd in the congestion avoidance phase.
      
      In addition this patch retires the experimental limited slow start
      (LSS) feature. LSS has multiple drawbacks but questionable benefit. The
      fractional cwnd increase in LSS requires a loop in slow start even
      though it's rarely used. Configuring such an increase step via a global
      sysctl on different BDPS seems hard. Finally and most importantly the
      slow start overshoot concern is now better covered by the Hybrid slow
      start (hystart) enabled by default.
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9f9843a7
    • Y
      tcp: enable sockets to use MSG_FASTOPEN by default · 0d41cca4
      Yuchung Cheng 提交于
      Applications have started to use Fast Open (e.g., Chrome browser has
      such an optional flag) and the feature has gone through several
      generations of kernels since 3.7 with many real network tests. It's
      time to enable this flag by default for applications to test more
      conveniently and extensively.
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0d41cca4
    • D
      Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nftables · f8785c55
      David S. Miller 提交于
      Pablo Neira Ayuso says:
      
      ====================
      This batch contains fives nf_tables patches for your net-next tree,
      they are:
      
      * Fix possible use after free in the module removal path of the
        x_tables compatibility layer, from Dan Carpenter.
      
      * Add filter chain type for the bridge family, from myself.
      
      * Fix Kconfig dependencies of the nf_tables bridge family with
        the core, from myself.
      
      * Fix sparse warnings in nft_nat, from Tomasz Bursztyka.
      
      * Remove duplicated include in the IPv4 family support for nf_tables,
        from Wei Yongjun.
      ====================
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f8785c55
    • D
      Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next · 72c39a0a
      David S. Miller 提交于
      Pablo Neira Ayuso says:
      
      ====================
      This is another batch containing Netfilter/IPVS updates for your net-next
      tree, they are:
      
      * Six patches to make the ipt_CLUSTERIP target support netnamespace,
        from Gao feng.
      
      * Two cleanups for the nf_conntrack_acct infrastructure, introducing
        a new structure to encapsulate conntrack counters, from Holger
        Eitzenberger.
      
      * Fix missing verdict in SCTP support for IPVS, from Daniel Borkmann.
      
      * Skip checksum recalculation in SCTP support for IPVS, also from
        Daniel Borkmann.
      
      * Fix behavioural change in xt_socket after IP early demux, from
        Florian Westphal.
      
      * Fix bogus large memory allocation in the bitmap port set type in ipset,
        from Jozsef Kadlecsik.
      
      * Fix possible compilation issues in the hash netnet set type in ipset,
        also from Jozsef Kadlecsik.
      
      * Define constants to identify netlink callback data in ipset dumps,
        again from Jozsef Kadlecsik.
      
      * Use sock_gen_put() in xt_socket to replace xt_socket_put_sk,
        from Eric Dumazet.
      
      * Improvements for the SH scheduler in IPVS, from Alexander Frolkin.
      
      * Remove extra delay due to unneeded rcu barrier in IPVS net namespace
        cleanup path, from Julian Anastasov.
      
      * Save some cycles in ip6t_REJECT by skipping checksum validation in
        packets leaving from our stack, from Stanislav Fomichev.
      
      * Fix IPVS_CMD_ATTR_MAX definition in IPVS, larger that required, from
        Julian Anastasov.
      ====================
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      72c39a0a
    • D
      netfilter: nft_compat: use _safe version of list_for_each · c359c415
      Dan Carpenter 提交于
      We need to use the _safe version of list_for_each_entry() here otherwise
      we have a use after free bug.
      Signed-off-by: NDan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      c359c415
    • D
      Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/jesse/openvswitch · 6fcf018a
      David S. Miller 提交于
      Jesse Gross says:
      
      ====================
      Open vSwitch
      
      A set of updates for net-next/3.13. Major changes are:
       * Restructure flow handling code to be more logically organized and
         easier to read.
       * Rehashing of the flow table is moved from a workqueue to flow
         installation time. Before, heavy load could block the workqueue for
         excessive periods of time.
       * Additional debugging information is provided to help diagnose megaflows.
       * It's now possible to match on TCP flags.
      ====================
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6fcf018a
    • D
      Merge branch 'mlx4' · 5a6e55c4
      David S. Miller 提交于
      Or Gerlitz says:
      
      ====================
      Mellanox driver updates
      
      This patch set from Jack Morgenstein does the following:
      
      1. Fix MAC/VLAN SRIOV implementation, and add wrapper functions for VLAN allocation
         and de-allocation (patches 1-6).
      
      2. Implements resource quotas when running under SRIOV (patches 7-10).
         Patch 7 is a small bug fix, and patches 8-10 implement the quotas.
      
      Quotas are implemented per resource type for VFs and the PF, to prevent
      any entity from simply grabbing all the resources for itself and leaving
      the other entities unable to obtain such resources.
      
      The series is against net-next commit ba486502 "ipv6: remove the unnecessary statement in find_match()"
      
      changes from V0:
       - dropped the 1st patch which needs to go to -stable and hence through net,
         not net-next
      ====================
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5a6e55c4
    • J
      net/mlx4_core: Implement resource quota enforcement · 146f3ef4
      Jack Morgenstein 提交于
      Implements resource quota grant decision when resources are requested,
      for the following resources:  QPs, CQs, SRQs, MPTs, MTTs, vlans, MACs,
      and Counters.
      
      When granting a resource, the quota system increases the allocated-count
      for that slave.
      
      When the slave later frees the resource, its allocated-count is reduced.
      
      A spinlock is used to protect the integrity of each resource's free-pool counter.
      (One slave may be in the process of being granted a resource while another
      slave has crashed, initiating cleanup of that slave's resource quotas).
      Signed-off-by: NJack Morgenstein <jackm@dev.mellanox.co.il>
      Signed-off-by: NOr Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      146f3ef4
    • J
      net/mlx4_core: Fix quota handling in the QUERY_FUNC_CAP wrapper · eb456a68
      Jack Morgenstein 提交于
      In current kernels, the mlx4 driver running on a VM does not
      differentiate between max resource numbers for the HCA and
      max quotas -- it simply takes the quota values passed to it
      as max-resource values.
      
      However, the driver actually requires the VFs to be aware of
      the actual number of resources that the HCA was initialized with,
      for QPs, CQs, SRQs and MPTs.
      
      For QPs, CQs and SRQs, the reason is that in completion handling
      the driver must know which of the 24 bits are the actual resource
      number, and which are "padding" bits.
      
      For MPTs, also, the driver assumes knowledge of the number of MPTs
      in the system.
      
      The previous commit fixes the quota logic on the VM for the quota values
      passed to it by QUERY_FUNC_CAPS.
      
      For QPs, CQs, SRQs, and MPTs, it takes the max resource numbers
      from QUERY_HCA (and not QUERY_FUNC_CAPS).  The quotas passed
      in QUERY_FUNC_CAPS are used to report max resource number values
      in the response to ib_query_device.
      
      However, the Hypervisor driver must consider that VMs
      may be running previous kernels, and compatibility must be preserved.
      
      To resolve the incompatibility with previous kernels running on VMs,
      we deprecated the quota fields in mlx4_QUERY_FUNC_CAP.  In the
      deprecated fields, we pass the max-resource values from INIT_HCA
      
      The quota fields are moved to a new location, and the current kernel
      driver takes the proper values from that location. There is
      also a new flag in dword 0, bit 28 of the mlx4_QUERY_FUNC_CAP mailbox;
      if this flag is set, the (VM) driver takes the quota values from the
      new location.
      
      VMs running previous kernels will work properly, except that the max resource
      numbers reported in ib_query_device for these resources will be
      too high.  The Hypervisor driver will, however, enforce the quotas
      for these VMs.
      Signed-off-by: NJack Morgenstein <jackm@dev.mellanox.co.il>
      Signed-off-by: NOr Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      eb456a68
    • J
      mlx4: Structures and init/teardown for VF resource quotas · 5a0d0a61
      Jack Morgenstein 提交于
      This is step #1 for implementing SRIOV resource quotas for VFs.
      
      Quotas are implemented per resource type for VFs and the PF, to prevent
      any entity from simply grabbing all the resources for itself and leaving
      the other entities unable to obtain such resources.
      
      Resources which are allocated using quotas:  QPs, CQs, SRQs, MPTs, MTTs, MAC,
                                                   VLAN, and Counters.
      
      The quota system works as follows:
      Each entity (VF or PF) is given a max number of a given resource (its quota),
      and a guaranteed minimum number for each resource (starvation prevention).
      
      For QPs, CQs, SRQs, MPTs and MTTs:
      50% of the available quantity for the resource is divided equally among
      the PF and all the active VFs (i.e., the number of VFs in the mlx4_core module
      parameter "num_vfs"). This 50% represents the "guaranteed minimum" pool.
      The other 50% is the "free pool", allocated on a first-come-first-serve basis.
      For each VF/PF, resources are first allocated from its "guaranteed-minimum"
      pool. When that pool is exhausted, the driver attempts to allocate from
      the resource "free-pool".
      
      The quota (i.e., max) for the VFs and the PF is:
        The free-pool amount (50% of the real max) + the guaranteed minimum
      
      For MACs:
        Guarantee 2 MACs per VF/PF per port. As a result, since we have only
        128 MACs per port, reduce the allowable number of VFs from 64 to 63.
        Any remaining MACs are put into a free pool.
      
      For VLANs:
        For the PF, the per-port quota is 128 and guarantee is 64
           (to allow the PF to register at least a VLAN per VF in VST mode).
        For the VFs, the per-port quota is 64 and the guarantee is 0.
            We assume that VGT VFs are trusted not to abuse the VLAN resource.
      
      For Counters:
        For all functions (PF and VFs), the quota is 128 and the guarantee is 0.
      
      In this patch, we define the needed structures, which are added to the
      resource-tracker struct.  In addition, we do initialization
      for the resource quota, and adjust the query_device response to use quotas
      rather than resource maxima.
      
      As part of the implementation, we introduce a new field in
      mlx4_dev: quotas.  This field holds the resource quotas used
      to report maxima to the upper layers (ib_core, via query_device).
      
      The HCA maxima of these values are passed to the VFs (via
      QUERY_HCA) so that they may continue to use these in handling
      QPs, CQs, SRQs and MPTs.
      Signed-off-by: NJack Morgenstein <jackm@dev.mellanox.co.il>
      Signed-off-by: NOr Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5a0d0a61
    • J
      net/mlx4_core: Fix checking order in MR table init · a30f1bc5
      Jack Morgenstein 提交于
      In procedure mlx4_init_mr_table(), slaves should do no processing,
      but should return success. This initialization is hypervisor-only.
      
      However, the check for num_mpts being a power-of-2 was performed
      before the check to return immediately if the driver is for a slave.
      This resulted in spurious failures.
      
      The order of performing the checks is reversed, so that if the
      driver is for a slave, no processing is done and success is returned.
      Signed-off-by: NJack Morgenstein <jackm@dev.mellanox.co.il>
      Signed-off-by: NOr Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a30f1bc5
    • J
      net/mlx4_core: Don't fail reg/unreg vlan for older guests · 2c957ff2
      Jack Morgenstein 提交于
      In upstream kernels under SRIOV, the vlan register/unregister calls
      were NOPs (doing nothing and returning OK). We detect these old
      calls from guests (via the comm channel), since previously the
      port number in mlx4_register_vlan was passed (improperly) in the
      out_param. This has been corrected so that the port number is now
      passed in bits 8..15 of the in_modifier field.
      
      For old calls, these bits will be zero, so if the passed port
      number is zero, we can still look at the out_param field to see
      if it contains a valid port number. If yes, the VM is running
      an old driver.
      
      Since for old drivers, the register/unregister_vlan wrappers were
      NOPs, we continue this policy -- the reason being that upstream
      had an additional bug in eth driver running on guests (where
      procedure mlx4_en_vlan_rx_kill_vid() had the following code:
      
      if (!mlx4_find_cached_vlan(mdev->dev, priv->port, vid, &idx))
              mlx4_unregister_vlan(mdev->dev, priv->port, idx);
      else
              en_err(priv, "could not find vid %d in cache\n", vid);
      
      On a VM, mlx4_find_cached_vlan() will always fail, since the
      vlan cache is located on the Hypervisor; on guests it is empty.
      
      Therefore, if we allow upstream guests to register vlans, we will
      have vlan leakage since the unregister will never be performed.
      Leaving vlan reg/unreg for old guest drivers as a NOP is not a
      feature regression, since in upstream the register/unregister
      vlan wrapper is a NOP.
      Signed-off-by: NJack Morgenstein <jackm@dev.mellanox.co.il>
      Signed-off-by: NOr Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2c957ff2