1. 23 May, 2014 16 commits
    •
      vlan: more careful checksum features handling · da08143b
      Michal Kubeček committed
      When combining real_dev's features and vlan_features, simple
      bitwise AND is used. This doesn't work well for checksum
      offloading features as if one set has NETIF_F_HW_CSUM and the
      other NETIF_F_IP_CSUM and/or NETIF_F_IPV6_CSUM, we end up with
      no checksum offloading. However, from the logical point of view
      (how can_checksum_protocol() works), NETIF_F_HW_CSUM contains
      the functionality of NETIF_F_IP_CSUM and NETIF_F_IPV6_CSUM so
      that the result should be IP/IPV6.
      
      Add helper function netdev_intersect_features() implementing
      this logic and use it in vlan_dev_fix_features().
      Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
      Signed-off-by: David S. Miller <davem@davemloft.net>
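The intersection rule described in the commit above can be sketched in plain C. This is a minimal userspace illustration, not the kernel's netdev_intersect_features(): the flag names and bit values (F_HW_CSUM and friends) are hypothetical stand-ins, and only the "generic checksum offload implies the IPv4/IPv6 ones" logic is taken from the commit message.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bit values for illustration; the kernel's real flags differ. */
#define F_IP_CSUM    (UINT32_C(1) << 0) /* checksum offload for IPv4 only */
#define F_HW_CSUM    (UINT32_C(1) << 1) /* generic checksum offload */
#define F_IPV6_CSUM  (UINT32_C(1) << 2) /* checksum offload for IPv6 only */
#define F_SG         (UINT32_C(1) << 3) /* an unrelated feature bit */

#define F_PROTO_CSUM (F_IP_CSUM | F_IPV6_CSUM)

/* Intersect two feature sets, treating F_HW_CSUM as a superset of the
 * protocol-specific checksum bits. */
static uint32_t intersect_features(uint32_t f1, uint32_t f2)
{
    /* Expand the generic bit into the specific bits it covers. */
    if (f1 & F_HW_CSUM)
        f1 |= F_PROTO_CSUM;
    if (f2 & F_HW_CSUM)
        f2 |= F_PROTO_CSUM;

    f1 &= f2;

    /* If both sides had the generic bit, the specific bits are redundant. */
    if (f1 & F_HW_CSUM)
        f1 &= ~F_PROTO_CSUM;
    return f1;
}

/* Small usage example: a plain AND loses checksum offload entirely,
 * while the helper keeps the IPv4 part. */
void intersect_features_example(void)
{
    assert(((F_HW_CSUM | F_SG) & (F_IP_CSUM | F_SG)) == F_SG);
    assert(intersect_features(F_HW_CSUM | F_SG, F_IP_CSUM | F_SG) ==
           (F_IP_CSUM | F_SG));
}
```

Expanding before the AND and collapsing afterwards keeps the result minimal: two sets that both offer generic offload still intersect to the generic bit alone.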
    •
      net: fec: correct the MDIO clock source · 98a6eeb8
      Nimrod Andy committed
      On the i.MX series, the FEC/ENET MDIO clock source is the internal ipg
      clock, while the "ahb" clock is defined as the FEC/ENET bus clock, so
      this patch corrects the MDIO clock source used by the fec driver.
      Signed-off-by: Fugang Duan <B38611@freescale.com>
      Acked-by: Frank Li <frank.li@freescale.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    •
      net: fec: optimize the clock management to save power · e8fcfcd5
      Nimrod Andy committed
      Add the following clock management to save fec power:
      - After probe, disable all clocks, including the ipg, ahb, enet_out and ptp clocks.
      - When an ethx interface is opened, enable the necessary clocks.
        When an ethx interface is closed, disable all clocks.

      The patch also encapsulates enabling and disabling of all enet clocks in
      fec_enet_clk_enable(), which reduces repeated code in the driver.
      Signed-off-by: Fugang Duan <B38611@freescale.com>
      Acked-by: Frank Li <Frank.li@freescale.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    •
      Merge branch 'sw_tso' · fddb8872
      David S. Miller committed
      Ezequiel Garcia says:
      
      ====================
      net: Introduce a software TSO helper API
      
      Here's a first proposal for a generic software TSO helper API, following
      David's suggestion.
      
      There are at least two drivers that currently implement some form of software
      TSO: Solarflare network driver (drivers/net/ethernet/sfc) and Tilera GX
      network driver (drivers/net/ethernet/tile/tilegx.c).
      
      The rationale behind adding a generic API is to provide a boiler plate with the
      segmentation and other common tasks, making this support easier to add in other
      drivers.
      
      When designing the API, I've considered mainly two design choices:
      
        1. Implement a series of callbacks that each driver would implement
           and the net core code would call to fill in descriptors and egress
           that data.
      
        2. Implement an API for drivers to use in a driver's specific tx_tso
           function. This function would exhaust a sk_buff payload, and use the
           API as helper for building the headers and segmented data.
      
      I've chosen (2), to avoid function pointers (which was Willy's concern) and
      because it seemed less fragile. Of course, this is arguable.
      
      The API is by no means complete, and lacks some features, however it allows
      to support TSO in mv643xx_eth and mvneta network drivers with some very
      good performance results. I've added this support as an example of the API
      in action.
      
      In particular the following needs some revisiting:
      
        1. IPv6 support is lacking.
      
        2. The required descriptor counting needs some verification. The current
           implementation might be too "sketchy". The tilegx one can be a good
           starting point.
      
        3. The implementation assumes the hardware can compute the TCP and IP
           checksums for the built headers. However, some controllers may need
           some initial calculation (such as tilegx, for instance).
      
      Despite this, I hope this proposal is good enough to trigger some discussion
      and to check if I'm on the right track. Feedback is much appreciated!
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
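Design choice (2) in the cover letter above has the driver loop over the sk_buff payload and cut it into segments. The core of that loop can be sketched as follows. This is a standalone illustration only: struct tso_seg and tso_split() are hypothetical names, not the helper API added by the series, and header building and descriptor filling are left out.

```c
#include <assert.h>
#include <stddef.h>

/* One output segment: where it starts in the payload and how long it is. */
struct tso_seg {
    size_t offset;
    size_t len;
};

/* Split payload_len bytes into segments of at most mss bytes each.
 * Writes at most max_segs entries and returns how many were written. */
static size_t tso_split(size_t payload_len, size_t mss,
                        struct tso_seg *segs, size_t max_segs)
{
    size_t n = 0;
    size_t off = 0;

    assert(segs != NULL && mss > 0);

    while (off < payload_len && n < max_segs) {
        size_t len = payload_len - off;

        if (len > mss)
            len = mss; /* every segment except the last is full-sized */
        segs[n].offset = off;
        segs[n].len = len;
        off += len;
        n++;
    }
    return n;
}
```

In a real driver each iteration would also build a fresh set of headers for the segment (sequence number advanced by the segment offset) and queue the descriptors; the max_segs bound stands in for the descriptor counting that the cover letter flags as needing verification.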
    •
      net: mv643xx_eth: Implement software TSO · 3ae8f4e0
      Ezequiel Garcia committed
      Now that the TSO helper API has been introduced, this commit makes use
      of it to add support for software TSO in this driver.
      
      This feature significantly improves outbound throughput performance.
      Running iperf tests shows a 30% improvement, tested on a Kirkwood Openblocks
      A6 board.
      
      $ ethtool -K eth0 tso off
      $ iperf -c 192.168.0.45 -t 3
      ------------------------------------------------------------
      Client connecting to 192.168.0.45, TCP port 5001
      TCP window size: 43.8 KByte (default)
      ------------------------------------------------------------
      [  3] local 192.168.0.159 port 46389 connected with 192.168.0.45 port 5001
      [ ID] Interval       Transfer     Bandwidth
      [  3]  0.0- 3.0 sec   217 MBytes   607 Mbits/sec
      
      $ ethtool -K eth0 tso on
      $ iperf -c 192.168.0.45 -t 3
      ------------------------------------------------------------
      Client connecting to 192.168.0.45, TCP port 5001
      TCP window size: 43.8 KByte (default)
      ------------------------------------------------------------
      [  3] local 192.168.0.159 port 46390 connected with 192.168.0.45 port 5001
      [ ID] Interval       Transfer     Bandwidth
      [  3]  0.0- 3.0 sec   336 MBytes   938 Mbits/sec
      
      This commit is just an example of the usage of the TSO API; it works fine
      but needs some more work. In particular, the descriptor unmapping path must
      avoid unmapping the TSO headers.
      Signed-off-by: Ezequiel Garcia <ezequiel.garcia@free-electrons.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    •
      net: mv643xx_eth: Use dma_map_single() to map the skb fragments · 69ad0dd7
      Ezequiel Garcia committed
      Using dma_map_single() instead of skb_frag_dma_map() makes it possible
      to unmap all the descriptors using dma_unmap_single(). This change allows
      software TSO to be introduced in a less intrusive way.
      Signed-off-by: Ezequiel Garcia <ezequiel.garcia@free-electrons.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    •
      net: mv643xx_eth: Factorize feature setting · 4d48d589
      Ezequiel Garcia committed
      In order to ease the addition of new features, let's factorize the
      feature list.
      Signed-off-by: Ezequiel Garcia <ezequiel.garcia@free-electrons.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    •
      net: mv643xx_eth: Avoid setting the initial TCP checksum · 84411f73
      Ezequiel Garcia committed
      As specified in the datasheet, the driver can set the "L4Chk_Mode" flag
      (bit 10) in the Tx descriptor command/status to specify that a frame is not
      IP fragmented and that the controller is in charge of generating the TCP/IP
      checksum. This must be used together with the "GL4chk" flag (bit 17).

      With these two flags the driver no longer has to set the initial TCP
      checksum in the l4i_chk field of the Tx descriptor, which is a
      prerequisite for supporting software TSO.
      Signed-off-by: Ezequiel Garcia <ezequiel.garcia@free-electrons.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
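As a small illustration of the two descriptor flags named in the commit above, the command/status word could be prepared like this. The bit positions (10 and 17) come from the commit message; the macro and function names are hypothetical and are not the driver's own identifiers.

```c
#include <assert.h>
#include <stdint.h>

/* Bit positions as stated in the commit message; names are made up here. */
#define TXD_L4CHK_MODE (UINT32_C(1) << 10) /* frame is not IP fragmented */
#define TXD_GL4CHK     (UINT32_C(1) << 17) /* controller generates L4 checksum */

/* Return cmd_sts with both flags set, so that no initial checksum has to
 * be supplied in the descriptor's l4i_chk field. */
static uint32_t txd_request_hw_l4_csum(uint32_t cmd_sts)
{
    return cmd_sts | TXD_L4CHK_MODE | TXD_GL4CHK;
}

/* Usage example: the two flags must always be set together. */
void txd_request_hw_l4_csum_example(void)
{
    uint32_t cmd = txd_request_hw_l4_csum(0);

    assert((cmd & TXD_L4CHK_MODE) && (cmd & TXD_GL4CHK));
}
```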
    •
      net: mv643xx_eth: Factorize initial checksum and command preparation · 0a8fa933
      Ezequiel Garcia committed
      Make the code more readable by moving the initial checksum setup
      and the command/status preparation to their own function.
      Signed-off-by: Ezequiel Garcia <ezequiel.garcia@free-electrons.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    •
      net: mvneta: Implement software TSO · 2adb719d
      Ezequiel Garcia committed
      Now that the TSO helper API has been introduced, this commit makes use
      of it to implement TSO in this driver.

      Testing with iperf and checking the CPU usage with vmstat shows a
      substantial CPU usage drop when TSO is on (~15% vs. ~25%). HTTP-based
      tests performed by Willy Tarreau have shown performance improvements.
      Signed-off-by: Ezequiel Garcia <ezequiel.garcia@free-electrons.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    •
      net: mvneta: Clean mvneta_tx() sk_buff handling · e19d2dda
      Ezequiel Garcia committed
      Rework mvneta_tx() so that the code that performs the final handling
      before a sk_buff is transmitted is run only if the number of fragments
      processed is positive.

      This is preparation work to add the support for software TSO.
      Signed-off-by: Ezequiel Garcia <ezequiel.garcia@free-electrons.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    •
      net: mvneta: Factorize feature setting · 01ef26ca
      Ezequiel Garcia committed
      In order to ease the addition of new features, let's factorize the
      feature list.
      Signed-off-by: Ezequiel Garcia <ezequiel.garcia@free-electrons.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    •
      net: Add a software TSO helper API · e876f208
      Ezequiel Garcia committed
      Although the implementation probably needs a lot of work, this initial API
      makes it possible to implement software TSO in the mvneta and mv643xx_eth
      drivers in a not so intrusive way.
      Signed-off-by: Ezequiel Garcia <ezequiel.garcia@free-electrons.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    •
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nftables · 8af750d7
      David S. Miller committed
      Pablo Neira Ayuso says:
      
      ====================
      Netfilter/nftables updates for net-next
      
      The following patchset contains Netfilter/nftables updates for net-next,
      most relevantly they are:
      
      1) Add set element update notification via netlink, from Arturo Borrero.
      
      2) Put all object updates in one single message batch that is sent to
         kernel-space. Before this patch only rules were included in the batch.
         This series also introduces the generic transaction infrastructure so
         updates to all objects (tables, chains, rules and sets) are applied in
         an all-or-nothing fashion, these series from me.
      
      3) Defer release of objects via call_rcu to reduce the time required to
         commit changes. The assumption is that all objects are destroyed in
         reverse order to ensure that dependencies between them are fulfilled
         (ie. rules and sets are destroyed first, then chains, and finally
         tables).
      
      4) Allow matching by bridge port name, from Tomasz Bursztyka. This series
         includes two patches to prepare this new feature.
      
      5) Implement the proper set selection based on the characteristics of the
         data. The new infrastructure also allows you to specify your preferences
         in terms of memory and computational complexity so the underlying set
         type is also selected according to your needs, from Patrick McHardy.
      
      6) Several cleanup patches for nft expressions, including one minor possible
         compilation breakage due to missing mark support, also from Patrick.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    •
      Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-next · 758bd61a
      David S. Miller committed
      Jeff Kirsher says:
      
      ====================
      Intel Wired LAN Driver Updates
      
      This series contains updates to i40e and i40evf.
      
      Shannon makes minor changes to the AdminQ interface to bring it up to
      date.  Removes the hard coding of stats struct size in ethtool, in prep
      for adding data fields which are configuration dependent.
      
      Catherine removes some unused and unneeded PCI bus defines.
      
      Jesse fixes the copyright headers and finishes up the removal of the PTP
      Tx work functionality which allows us to rely on the Tx timesync interrupt.
      
      Mitch provides a number of fixes and cleanups for i40e/i40evf based on
      suggestions from Ben Hutchings.  First is to use a macro parameter for
      ethtool stats instead of just assuming that a valid netdev variable
      exists.  Second is not to tell ethtool that the VF can do 10GbaseT, when
      it really has no idea what its link speed is, so set the supported value
      to 0 instead.  Make the ethtool_ops structure constant since it is
      extremely unlikely to change at runtime.  Ethtool consistently reports
      0 values for our ITR settings because we never actually use them, so
      fix this by setting the default values to the specified default values.
      
      Greg avoids a compile error by wrapping the call to i40e_alloc_vfs() in
      CONFIG_PCI_IOV because the function itself is wrapped in the same
      conditional compile block.
      
      Alexander Gordeev updates the driver to use the new pci_enable_msi_range()
      and pci_enable_msix_range() or pci_enable_msi_exact() and
      pci_enable_msix_exact().
      
      Jean Sacren provides a fix where the wrong error code was being passed to
      i40e_open().
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    •
      tcp: make cwnd-limited checks measurement-based, and gentler · ca8a2263
      Neal Cardwell committed
      Experience with the recent e114a710 ("tcp: fix cwnd limited
      checking to improve congestion control") has shown that there are
      common cases where that commit can cause cwnd to be much larger than
      necessary. This leads to TSO autosizing cooking skbs that are too
      large, among other things.
      
      The main problems seemed to be:
      
      (1) That commit attempted to predict the future behavior of the
      connection by looking at the write queue (if TSO or TSQ limit
      sending). That prediction sometimes overestimated future outstanding
      packets.
      
      (2) That commit always allowed cwnd to grow to twice the number of
      outstanding packets (even in congestion avoidance, where this is not
      needed).
      
      This commit improves both of these, by:
      
      (1) Switching to a measurement-based approach where we explicitly
      track the largest number of packets in flight during the past window
      ("max_packets_out"), and remember whether we were cwnd-limited at the
      moment we finished sending that flight.
      
      (2) Only allowing cwnd to grow to twice the number of outstanding
      packets ("max_packets_out") in slow start. In congestion avoidance
      mode we now only allow cwnd to grow if it was fully utilized.
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
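The two rules in the commit above (measure the largest flight in the past window, and allow growth to twice that flight only in slow start) can be sketched as a small state machine. This is an illustrative userspace model, not the kernel code: struct cwnd_state and both function names are hypothetical, and treating "cwnd <= ssthresh" as slow start is an assumption made for the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-connection state for the sketch. */
struct cwnd_state {
    uint32_t cwnd;             /* congestion window, in packets */
    uint32_t ssthresh;         /* slow start threshold, in packets */
    uint32_t packets_out;      /* packets currently in flight */
    uint32_t max_packets_out;  /* largest flight seen in the current window */
    bool was_cwnd_limited;     /* cwnd-limited when that flight was sent? */
};

/* Record a measurement after sending a flight. new_window is true when the
 * previous measurement window has ended. */
static void record_flight(struct cwnd_state *s, bool new_window,
                          bool cwnd_limited_now)
{
    assert(s != NULL);
    if (new_window || s->packets_out > s->max_packets_out) {
        s->max_packets_out = s->packets_out;
        s->was_cwnd_limited = cwnd_limited_now;
    }
}

/* Decide whether cwnd may grow on the next ACK. */
static bool may_grow_cwnd(const struct cwnd_state *s)
{
    assert(s != NULL);
    /* Slow start (assumed: cwnd <= ssthresh): grow up to twice the
     * largest measured flight. */
    if (s->cwnd <= s->ssthresh)
        return s->cwnd < 2 * s->max_packets_out;
    /* Congestion avoidance: grow only if cwnd was fully utilized. */
    return s->was_cwnd_limited;
}
```

Because the decision uses a recorded measurement rather than a look at the write queue, it cannot overestimate packets that have not actually been sent.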
  2. 22 May, 2014 11 commits
    •
      wimax/i2400m: make return of 0 explicit · aff4b974
      Julia Lawall committed
      Delete unnecessary local variable whose value is always 0 and that hides
      the fact that the result is always 0.
      
      A simplified version of the semantic patch that fixes this problem is as
      follows: (http://coccinelle.lip6.fr/)
      
      // <smpl>
      @r exists@
      local idexpression ret;
      expression e;
      position p;
      @@
      
      -ret = 0;
      ... when != ret = e
      return
      - ret
      + 0
        ;
      // </smpl>
      Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    •
      net: filter: cleanup invocation of internal BPF · 5fe821a9
      Alexei Starovoitov committed
      Kernel API for classic BPF socket filters is:
      
      sk_unattached_filter_create() - validate classic BPF, convert, JIT
      SK_RUN_FILTER() - run it
      sk_unattached_filter_destroy() - destroy socket filter
      
      Clean up the internal BPF kernel API as follows:
      
      sk_filter_select_runtime() - final step of internal BPF creation.
        Try to JIT internal BPF program, if JIT is not available select interpreter
      SK_RUN_FILTER() - run it
      sk_filter_free() - free internal BPF program
      
      Disallow direct calls to BPF interpreter. Execution of the BPF program should
      be done with SK_RUN_FILTER() macro.
      
      Example of internal BPF create, run, destroy:
      
        struct sk_filter *fp;
      
        fp = kzalloc(sk_filter_size(prog_len), GFP_KERNEL);
        memcpy(fp->insni, prog, prog_len * sizeof(fp->insni[0]));
        fp->len = prog_len;
      
        sk_filter_select_runtime(fp);
      
        SK_RUN_FILTER(fp, ctx);
      
        sk_filter_free(fp);
      
      Sockets, seccomp, testsuite, tracing are using different ways to populate
      sk_filter, so first steps of program creation are not common.
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Acked-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    •
      Merge branch 'enic-next' · 21ea04fa
      David S. Miller committed
      Govindarajulu Varadarajan says:
      
      ====================
      enic: Add adaptive coalescing interrupt support
      
      This series adds support for adaptive interrupt coalescing and updates
      the enic maintainers.
      
      v1->v2:
      	* Add commit log
      	* do vnic_intr_coalescing_timer_set only while enabling intr
      	* use ktime_get instead of hrtimer
      	* make enic_set_rx_coal_setting return type void
      	* change func name enic_apply_int_moderation to enic_calc_int_moderation
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    •
      MAINTAINERS: Update enic maintainers · c327e8f4
      Govindarajulu Varadarajan committed
      Cc: Sujith Sankar <ssujith@cisco.com>
      Cc: Christian Benvenuti <benve@cisco.com>
      Cc: Neel Patel <neepatel@cisco.com>
      Signed-off-by: Govindarajulu Varadarajan <_govind@gmx.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    •
      enic: Add support for adaptive interrupt coalescing · 7c2ce6e6
      Sujith Sankar committed
      This patch adds support for adaptive interrupt coalescing.
      
      For small pkts with low pkt rate, we can decrease the coalescing interrupt
      dynamically which decreases the latency. This however increases the cpu
      utilization. Based on testing with different coal intr and pkt rate we came up
      with a table(mod_table) with rx_rate and coalescing interrupt value where we
      get low latency without significant increase in cpu. mod_table table stores
      the coalescing timer percentage value for different throughputs.
      
      Function enic_calc_int_moderation() calculates the desired coalescing intr timer
      value. This function is called in the driver's rx napi_poll. The actual value
      is set by enic_set_int_moderation(), which is called when napi_poll is
      complete, i.e. when we unmask the rx intr.
      
      Adaptive coal intr is supported only when the driver is using msix intr,
      because then the intr is not shared.
      
      Struct mod_range is used to store only the default adaptive coalescing intr
      value.
      
      The adaptive coal intr value is calculated by
      
      timer = range_start + ((rx_coal->range_end - range_start) *
      		       mod_table[index].range_percent / 100);
      
      rx_coal->range_end is the rx-usecs-high value set using ethtool.
      range_start is rx-usecs-low, set using ethtool, if rx_small_pkt_bytes_cnt is
      greater than 2 * rx_large_pkt_bytes_cnt, i.e. small pkts are dominant.
      Otherwise it is rx-usecs-low + 3.
      
      Cc: Christian Benvenuti <benve@cisco.com>
      Cc: Neel Patel <neepatel@cisco.com>
      Signed-off-by: Sujith Sankar <ssujith@cisco.com>
      Signed-off-by: Govindarajulu Varadarajan <_govind@gmx.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
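The timer formula and the choice of range_start from the commit above can be worked through in a short standalone sketch. The function names coal_range_start() and coal_timer() are hypothetical, the mod_table lookup is replaced by a plain range_percent argument, and the sketch assumes range_end is not smaller than range_start.

```c
#include <assert.h>
#include <stdint.h>

/* Pick range_start: rx-usecs-low when small packets dominate (small byte
 * count greater than twice the large byte count), else rx-usecs-low + 3. */
static uint32_t coal_range_start(uint32_t rx_usecs_low,
                                 uint64_t small_pkt_bytes,
                                 uint64_t large_pkt_bytes)
{
    if (small_pkt_bytes > 2 * large_pkt_bytes)
        return rx_usecs_low;
    return rx_usecs_low + 3;
}

/* timer = range_start + (range_end - range_start) * range_percent / 100 */
static uint32_t coal_timer(uint32_t range_start, uint32_t range_end,
                           uint32_t range_percent)
{
    assert(range_end >= range_start && range_percent <= 100);
    return range_start + ((range_end - range_start) * range_percent / 100);
}
```

For example, with rx-usecs-low = 5, rx-usecs-high = 120, equal small and large byte counts, and a table entry of 50 percent, range_start is 8 and the timer is 8 + (112 * 50 / 100) = 64 microseconds.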
    •
      vxge: Use time_before() · f6e92d10
      Manuel Schölling committed
      To be future-proof and for better readability, the time comparisons are
      modified to use time_before() instead of plain, error-prone math.
      Signed-off-by: Manuel Schölling <manuel.schoelling@gmx.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
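Why plain comparisons of tick counters are error-prone can be shown with a userspace sketch of the wraparound-safe idiom. tick_before() below is a hypothetical stand-in written for this illustration, not the kernel's time_before() macro, and it assumes a two's complement platform where converting an out-of-range unsigned long to long wraps.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* True if tick a is earlier than tick b, even when the counter has wrapped
 * in between. The unsigned difference is reinterpreted as signed, so the
 * result is correct as long as the two ticks are less than half the
 * counter range apart. */
static bool tick_before(unsigned long a, unsigned long b)
{
    return (long)(a - b) < 0;
}

/* Usage example: a deadline that falls just after the counter wraps. */
void tick_before_example(void)
{
    unsigned long now = ULONG_MAX - 5; /* shortly before wraparound */
    unsigned long deadline = now + 16; /* wraps around to 10 */

    assert(deadline == 10);
    assert(!(now < deadline));          /* the plain comparison gets it wrong */
    assert(tick_before(now, deadline)); /* the signed difference gets it right */
}
```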
    •
      ieee802154: Introduce the use of the managed version of kzalloc · 12b5c38f
      Himangi Saraogi committed
      This patch moves data allocated using kzalloc to managed data allocated
      using devm_kzalloc and removes the now unnecessary kfrees in probe and remove
      functions. An explicit linux/device.h include is added to make sure
      the devm_*() routine declarations are unambiguously available.
      
      The following Coccinelle semantic patch was used for making the change:
      
      @platform@
      identifier p, probefn, removefn;
      @@
      struct platform_driver p = {
        .probe = probefn,
        .remove = removefn,
      };
      
      @prb@
      identifier platform.probefn, pdev;
      expression e, e1, e2;
      @@
      probefn(struct platform_device *pdev, ...) {
        <+...
      - e = kzalloc(e1, e2)
      + e = devm_kzalloc(&pdev->dev, e1, e2)
        ...
      ?-kfree(e);
        ...+>
      }
      
      @rem depends on prb@
      identifier platform.removefn;
      expression e;
      @@
      removefn(...) {
        <...
      - kfree(e);
        ...>
      }
      Signed-off-by: Himangi Saraogi <himangi774@gmail.com>
      Acked-by: Julia Lawall <julia.lawall@lip6.fr>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    •
      atm: idt77252: Remove redundant error check · 7e910357
      Peter Senna Tschudin committed
      Remove double checks, convert printk to pr_warn, and move the call to
      pr_warn to the first check. The simplified version of the coccinelle
      semantic patch that finds this issue is as follows:
      
      // <smpl>
      @@
      expression E; identifier pr; expression list es;
      @@
      while(...){
      ...
      -       if (E) break;
      +       if (E){
      +               pr(es);
      +               break;
      +       }
      ...
      }
      - if(E) pr(es);
      // </smpl>
      
      Tested by compilation only.
      Signed-off-by: Peter Senna Tschudin <peter.senna@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    •
      ipv6: slight optimization in ip6_dst_gc · 14956643
      Li RongQing committed
      entries is always greater than rt_max_size here, since if entries is less
      than rt_max_size, the fib6_run_gc function will be skipped.
      Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    •
      net-tun: restructure tun_do_read for better sleep/wakeup efficiency · 9e641bdc
      Xi Wang committed
      tun_do_read always adds current thread to wait queue, even if a packet
      is ready to read. This is inefficient because both sleeper and waker
      want to acquire the wait queue spin lock when packet rate is high.
      
      We restructure the read function and use common kernel networking
      routines to handle receive, sleep and wakeup. With the change
      available packets are checked first before the reading thread is added
      to the wait queue.
      
      Ran performance tests with the following configuration:
      
       - my packet generator -> tap1 -> br0 -> tap0 -> my packet consumer
       - sender pinned to one core and receiver pinned to another core
       - sender sends small UDP packets (64 bytes total) as fast as it can
       - sandy bridge cores
       - throughput numbers are receiver-side goodput
      
      The results are
      
      baseline: 731k pkts/sec, cpu utilization at 1.50 cpus
       changed: 783k pkts/sec, cpu utilization at 1.53 cpus
      
      The performance difference is largely determined by packet rate and
      inter-cpu communication cost. For example, if the sender and
      receiver are pinned to different cpu sockets, the results are
      
      baseline: 558k pkts/sec, cpu utilization at 1.71 cpus
       changed: 690k pkts/sec, cpu utilization at 1.67 cpus
      Co-authored-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Xi Wang <xii@google.com>
      Acked-by: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    •
      net: tunnels - enable module autoloading · f98f89a0
      Tom Gundersen committed
      Enable the module alias hookup to allow tunnel modules to be autoloaded on demand.

      This is in line with how most other netdev kinds work, and will allow userspace
      to create tunnels without having CAP_SYS_MODULE.
      Signed-off-by: Tom Gundersen <teg@jklm.no>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 21 May, 2014 13 commits