1. 13 Jun, 2013 8 commits
    • C
      net: add doc for ip_early_demux sysctl · e3d73bce
      Cong Wang committed
      commit 6648bd7e (ipv4: Add sysctl knob to control
      early socket demux) introduced such sysctl, but forgot to add
      doc into Documentation/networking/ip-sysctl.txt. This patch adds it.
      
      Basically I grab the doc from the description of commit 41063e9d
      (ipv4: Early TCP socket demux.) and the above commit.
      
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Alexander Duyck <alexander.h.duyck@intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Signed-off-by: Cong Wang <amwang@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • P
      tun: Turn tun_flow_init() into void fn · 944a1376
      Pavel Emelyanov committed
      This routine doesn't fail since 9fdc6bef (tuntap: dont use a private kmem_cache)
      so it makes sense to compact the code a little bit.
      Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • P
      tun: Report "persist" flag to userspace · 274038f8
      Pavel Emelyanov committed
      The TUN_PERSIST flag is not reported at all -- both TUNGETIFF and the sysfs
      "flags" attribute skip it. Knowing whether a device is persistent or not
      is critical for checkpoint-restore, thus I propose to add the read-only
      IFF_PERSIST one for this.
      
      Setting this new IFF_PERSIST is hardly possible, as TUNSETIFF doesn't check
      for unknown flags being zero and thus there can be trash.
      Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
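A checkpoint tool could test the reported flag as in the minimal sketch below. It is not code from the patch. The flag values are believed to mirror linux/if_tun.h and are redefined under hypothetical `SKETCH_` names so the snippet builds without kernel headers. The helper names are also hypothetical, and masking the bit out before a later TUNSETIFF is an assumption drawn from the read-only wording above.

```c
#include <stdbool.h>

/* Assumed to mirror linux/if_tun.h; local names are hypothetical. */
#define SKETCH_IFF_TUN     0x0001
#define SKETCH_IFF_TAP     0x0002
#define SKETCH_IFF_PERSIST 0x0800

/* flags stands for the value TUNGETIFF returns in ifr.ifr_flags. */
static bool tun_is_persistent(short flags)
{
    return (flags & SKETCH_IFF_PERSIST) != 0;
}

/* Strip the read-only persist bit before reusing the flags elsewhere. */
static short tun_settable_flags(short flags)
{
    return (short)(flags & ~SKETCH_IFF_PERSIST);
}
```

On restore, persistence itself would be requested separately (for example with the TUNSETPERSIST ioctl) rather than through these flags.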
    • E
      udp: fix two sparse errors · 7c0cadc6
      Eric Dumazet committed
      commit ba418fa3 ("soreuseport: UDP/IPv4 implementation")
      added following sparse errors :
      
      net/ipv4/udp.c:433:60: warning: cast from restricted __be16
      net/ipv4/udp.c:433:60: warning: incorrect type in argument 1 (different base types)
      net/ipv4/udp.c:433:60:    expected unsigned short [unsigned] [usertype] val
      net/ipv4/udp.c:433:60:    got restricted __be16 [usertype] sport
      net/ipv4/udp.c:433:60: warning: cast from restricted __be16
      net/ipv4/udp.c:433:60: warning: cast from restricted __be16
      net/ipv4/udp.c:514:60: warning: cast from restricted __be16
      net/ipv4/udp.c:514:60: warning: incorrect type in argument 1 (different base types)
      net/ipv4/udp.c:514:60:    expected unsigned short [unsigned] [usertype] val
      net/ipv4/udp.c:514:60:    got restricted __be16 [usertype] sport
      net/ipv4/udp.c:514:60: warning: cast from restricted __be16
      net/ipv4/udp.c:514:60: warning: cast from restricted __be16
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
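Sparse raises these warnings when a big-endian (`__be16`) value is used where a plain host-order integer is expected. The userspace sketch below only illustrates the distinction; it is not the fix applied to net/ipv4/udp.c. `be16_model` and both helpers are hypothetical names, and a plain typedef gives none of sparse's checking.

```c
#include <stdint.h>
#include <arpa/inet.h>

/* Stand-in for the kernel's __be16: a 16-bit value in network byte order. */
typedef uint16_t be16_model;

/* Host order to wire order; the explicit call documents the conversion. */
static be16_model port_to_wire(uint16_t host_port)
{
    return htons(host_port);
}

/* Wire order back to host order before doing arithmetic on the port. */
static uint16_t port_from_wire(be16_model wire_port)
{
    return ntohs(wire_port);
}
```

Keeping the two representations in separate types makes every crossing between them visible at the call site, which is the property sparse enforces in the kernel.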
    • E
      gro: remove a sparse error · 5b9b6263
      Eric Dumazet committed
      Fix following sparse error :
      
      net/ipv4/af_inet.c:1410:59: warning: restricted __be16 degrades to
      integer
      
      added in commit db8caf3d
      ("gro: should aggregate frames without DF")
      Reported-by: kbuild test robot <fengguang.wu@intel.com>
      From: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • D
      Merge branch 'for-davem' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-next · 4a2e667a
      David S. Miller committed
      John W. Linville says:
      
      ====================
      This pull request is intended for the 3.11 stream...
      
      One big highlight is the cw1200 driver for the ST-E CW1100 & CW1200
      WLAN chipsets.  This one has been lingering for a while, lacking
      some review comments.  Once it started getting pulled into linux-next,
      it got a bit more attention and a number of improvements were made
      over the initial cut.  No doubt there will be more changes ahead,
      but I think it is looking alright at this point.
      
      Along with that, there is the usual flurry of updates to the mac80211
      core and the iwlwifi, mwifiex, ath9k, rt2x00, wil6210, and other
      drivers.  A few of the highlights are some rt2x00 refactoring/cleanup
      by Gabor Juhos, some rt2800 hardware support enhancements by Stanislaw
      Gruszka, some iwlwifi power management updates from Alexander Bondar,
      some enhanced bcma SPROM support from Rafał Miłecki, and a variety
      of other things here and there.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • S
      sh_eth: split 'sh_eth_netdev_ops' · 8f728d79
      Sergei Shtylyov committed
      Commit 9f861341 (sh_eth: remove SH_ETH_HAS_TSU)
      removes 'const' from 'sh_eth_netdev_ops' and modifies it in case TSU registers
      are present. I originally suggested to Iwamatsu-san to split this structure
      in two instead, and afterwards Dave M. suggested doing the same.
      Split 'sh_eth_netdev_ops_tsu' from 'sh_eth_netdev_ops', making both 'const', and
      assign 'ndev->netdev_ops' depending on the presence of TSU registers.
      Signed-off-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
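The pattern described here, two constant operation tables selected once at probe time instead of one mutable table, can be sketched as follows. This is a simplified model, not the driver code. `ops_model`, its members and `pick_ops()` are hypothetical, and which callbacks the real TSU variant adds is not stated in the message above.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced stand-in for struct net_device_ops. */
struct ops_model {
    int (*open)(void);
    int (*set_rx_mode)(void);   /* assumed to be needed only with TSU */
};

static int model_open(void)        { return 0; }
static int model_set_rx_mode(void) { return 0; }

/* Both tables are const, so neither is ever patched at run time. */
static const struct ops_model ops_plain = {
    .open = model_open,
};

static const struct ops_model ops_tsu = {
    .open        = model_open,
    .set_rx_mode = model_set_rx_mode,
};

/* Chosen once, when the device is probed. */
static const struct ops_model *pick_ops(bool has_tsu)
{
    return has_tsu ? &ops_tsu : &ops_plain;
}
```

Because both tables stay `const`, they can live in read-only memory, and the choice is a single pointer assignment.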
    • E
      igmp: fix new sparse errors · c70eba74
      Eric Dumazet committed
      Fix following sparse errors :
      
      net/ipv4/igmp.c:1222:25: warning: cast from restricted __be32
      net/ipv4/igmp.c:1234:31: warning: incorrect type in assignment (different address spaces)
      net/ipv4/igmp.c:1234:31:    expected struct ip_mc_list [noderef] <asn:4>*next_hash
      net/ipv4/igmp.c:1234:31:    got struct ip_mc_list *<noident>
      net/ipv4/igmp.c:1250:31: warning: incorrect type in assignment (different address spaces)
      net/ipv4/igmp.c:1250:31:    expected struct ip_mc_list [noderef] <asn:4>*next_hash
      net/ipv4/igmp.c:1250:31:    got struct ip_mc_list *<noident>
      net/ipv4/igmp.c:2380:37: warning: cast from restricted __be32
      
      These were added by commit e9897071
      ("igmp: hash a hash table to speedup ip_check_mc_rcu()")
      Reported-by: kbuild test robot <fengguang.wu@intel.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 12 Jun, 2013 22 commits
  3. 11 Jun, 2013 10 commits
    • E
      net_sched: add 64bit rate estimators · 45203a3b
      Eric Dumazet committed
      struct gnet_stats_rate_est contains u32 fields, so the bytes per second
      field can wrap at 34360Mbit.
      
      Add a new gnet_stats_rate_est64 structure to get 64bit bps/pps fields,
      and switch the kernel to use this structure natively.
      
      This structure is dumped to user space as a new attribute :
      
      TCA_STATS_RATE_EST64
      
      Old tc command will now display the capped bps (to 34360Mbit), instead
      of wrapped values, and updated tc command will display correct
      information.
      
      Old tc command output, after patch :
      
      eric:~# tc -s -d qd sh dev lo
      qdisc pfifo 8001: root refcnt 2 limit 1000p
       Sent 80868245400 bytes 1978837 pkt (dropped 0, overlimits 0 requeues 0)
       rate 34360Mbit 189696pps backlog 0b 0p requeues 0
      
      This patch carefully reorganizes "struct Qdisc" layout to get optimal
      performance on SMP.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Ben Hutchings <bhutchings@solarflare.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
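The 34360Mbit figure follows from the width of the old field: a u32 bytes-per-second counter tops out at 4294967295 B/s, which is 34359738360 bit/s, or about 34360 Mbit/s. The sketch below shows the capping behaviour the message describes for old tc binaries. It is an illustration, not the kernel code, and the helper names are hypothetical.

```c
#include <stdint.h>

/* Clamp a 64-bit bytes-per-second rate into the legacy 32-bit field
 * instead of letting it wrap around. */
static uint32_t legacy_bps(uint64_t bps64)
{
    return bps64 > UINT32_MAX ? UINT32_MAX : (uint32_t)bps64;
}

/* Bytes per second to megabits per second (1 Mbit = 10^6 bits),
 * rounded to the nearest whole megabit. */
static uint64_t bps_to_mbit(uint64_t bytes_per_sec)
{
    return (bytes_per_sec * 8 + 500000) / 1000000;
}
```

For a true rate of 5000000000 B/s, a plain truncation to 32 bits would report 705032704 B/s, while the clamped value reports the 34360 Mbit ceiling.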
    • P
      net: pass correct parameter to skb_headers_offset_update() · b41abb42
      Peter Pan(潘卫平) committed
      Since commit 1a37e412(net: Use 16bits for *_headers fields of struct
      skbuff), skb->*_header are relative to skb->head,
      so copy_skb_header() should not call skb_headers_offset_update() now,
      and we should pass correct parameter to skb_headers_offset_update() in
      pskb_expand_head() and skb_copy_expand().
      Signed-off-by: Weiping Pan <panweiping3@gmail.com>
      Reviewed-by: Simon Horman <horms@verge.net.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • G
      netlink: Add compare function for netlink_table · da12c90e
      Gao feng committed
      Netlink sockets are a private resource of a net namespace:
      they can communicate with each other only when they are in
      the same net namespace. This works well until we try to add
      namespace support for other subsystems that use netlink.
      
      Unlike ipv4 and the routing table, some of these subsystems
      are not suited to belong to a net namespace. The audit and
      crypto subsystems, for example, fit better in a user namespace.
      
      So we need a way to let netlink sockets in the same user
      namespace communicate with each other.
      
      This patch adds a new function pointer, "compare", to
      netlink_table. Through this self-defined compare function we
      can decide whether netlink sockets can communicate with each
      other.
      
      The behavior is unchanged if no compare function is provided
      for a netlink_table.
      Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
      Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
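The optional-callback idea can be modelled in a few lines. This is a simplified userspace sketch under stated assumptions, not the kernel implementation: all type and function names are hypothetical, and the real hook's signature differs from the two-socket form used here.

```c
#include <stdbool.h>
#include <stddef.h>

struct ns { int id; };                      /* stand-in for a namespace */

struct sock_model {
    struct ns *net_ns;
    struct ns *user_ns;
};

struct table_model {
    /* Optional per-protocol hook; NULL keeps the default behaviour. */
    bool (*compare)(const struct sock_model *a, const struct sock_model *b);
};

/* Example hook for a subsystem that is scoped by user namespace. */
static bool same_user_ns(const struct sock_model *a, const struct sock_model *b)
{
    return a->user_ns == b->user_ns;
}

static bool can_communicate(const struct table_model *t,
                            const struct sock_model *a,
                            const struct sock_model *b)
{
    if (t->compare)
        return t->compare(a, b);
    return a->net_ns == b->net_ns;          /* default: same net namespace */
}
```

A table that leaves `compare` unset behaves exactly as before, which matches the compatibility note at the end of the commit message.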
    • L
      xen-netfront: use skb_partial_csum_set() to simplify the codes · 8249152c
      Li RongQing committed
      use skb_partial_csum_set() to simplify the codes
      
      Cc: Jason Wang <jasowang@redhat.com>
      Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
      Acked-by: Jason Wang <jasowang@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • D
      Merge branch 'bridge_flags' · 2e422069
      David S. Miller committed
      Vlad Yasevich says:
      
      ====================
      The following series adds 2 new flags to bridge.  One flag allows
      the user to control whether mac learning is performed on the interface
      or not.  By default mac learning is on.
      The other flag allows the user to control whether unicast traffic
      is flooded (sent without an fdb) to a given unicast port.  Default is
      on.
      
      Changes since v4:
       - Implemented Stephen's suggestions.
      
      Changes since v2:
       - removed unused "unlock" tag.
      
      Changes since v1:
       - Integrated suggestion from MST to not impact RTM_NEWNEIGH and to
         skip lookups when learning is disabled.
      
      Vlad Yasevich (2):
        bridge: Add flag to control mac learning.
        bridge: Add a flag to control unicast packet flood.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • V
      bridge: Add a flag to control unicast packet flood. · 867a5943
      Vlad Yasevich committed
      Add a flag to control flood of unicast traffic.  By default, flood is
      on and the bridge will flood unicast traffic if it doesn't know
      the destination.  When the flag is turned off, unicast traffic
      without an FDB will not be forwarded to the specified port.
      Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
      Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
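The per-port decision described above can be sketched as a small predicate. This is a model of the behaviour in the commit message, not the bridge code. `port_model` and `forward_unicast()` are hypothetical names.

```c
#include <stdbool.h>

/* Hypothetical per-port state: the new flood flag, on by default. */
struct port_model {
    bool unicast_flood;
};

/* Should a unicast frame leave through this port?
 * fdb_hit:         the destination MAC has an FDB entry
 * fdb_points_here: that entry names this port */
static bool forward_unicast(const struct port_model *p,
                            bool fdb_hit, bool fdb_points_here)
{
    if (fdb_hit)
        return fdb_points_here;   /* known destination: only its port */
    return p->unicast_flood;      /* unknown destination: flood if allowed */
}
```

Turning the flag off only affects frames with an unknown destination; traffic to learned addresses is forwarded as before.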
    • V
      bridge: Add flag to control mac learning. · 9ba18891
      Vlad Yasevich committed
      Allow user to control whether mac learning is enabled on the port.
      By default, mac learning is enabled.  Disabling mac learning will
      cause new dynamic FDB entries to not be created for a particular port.
      Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
      Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
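A minimal model of the learning switch: with learning disabled, a frame's source address creates no new dynamic entry. The fixed-size table and every name below are hypothetical and serve only to illustrate the rule stated in the commit message.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define FDB_MAX 8

/* Hypothetical, fixed-size forwarding database. */
struct fdb_model {
    uint8_t mac[FDB_MAX][6];
    int     port[FDB_MAX];
    int     used;
};

/* Returns true only when a new dynamic entry was created. */
static bool fdb_learn(struct fdb_model *fdb, int port, bool learning,
                      const uint8_t mac[6])
{
    if (!learning)
        return false;                     /* learning disabled on this port */

    for (int i = 0; i < fdb->used; i++) {
        if (memcmp(fdb->mac[i], mac, 6) == 0) {
            fdb->port[i] = port;          /* known address: refresh its port */
            return false;
        }
    }

    if (fdb->used == FDB_MAX)
        return false;                     /* table full in this toy model */

    memcpy(fdb->mac[fdb->used], mac, 6);
    fdb->port[fdb->used++] = port;
    return true;
}
```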
    • C
      net: remove last caller of skb_tail_offset() and itself · 30f3a40f
      Cong Wang committed
      Similar to the following commits:
      
      commit 00f97da1 (netpoll: fix position of network header)
      commit 525cebed (pktgen: Fix position of ip and udp header)
      
      using skb_tail_offset() seems not correct since the offset
      is based on head pointer.
      
      With the last caller removed, skb_tail_offset() can be killed
      finally.
      
      Cc: Thomas Graf <tgraf@suug.ch>
      Cc: Daniel Borkmann <dborkmann@redhat.com>
      Cc: David S. Miller <davem@davemloft.net>
      Signed-off-by: Cong Wang <amwang@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • D
      Merge branch 'll_poll' · 0a4db187
      David S. Miller committed
      Eliezer Tamir says:
      
      ====================
      This patch set adds the ability for the socket layer code to
      poll directly on an Ethernet device's RX queue.
      This eliminates the cost of the interrupt and context switch
      and with proper tuning allows us to get very close to the HW latency.
      
      This is a follow up to Jesse Brandeburg's Kernel Plumbers talk from
      last year
      http://www.linuxplumbersconf.org/2012/wp-content/uploads/2012/09/2012-lpc-Low-Latency-Sockets-slides-brandeburg.pdf
      
      Patch 1 adds a napi_id and a hashing mechanism to lookup a napi by id.
      Patch 2 adds an ndo_ll_poll method and the code that supports it.
      Patch 3 adds support for busy-polling on UDP sockets.
      Patch 4 adds support for TCP.
      Patch 5 adds the ixgbe driver code implementing ndo_ll_poll.
      Patch 6 adds additional statistics to the ixgbe driver for ndo_ll_poll.
      
      Performance numbers:
           setup                         TCP_RR           UDP_RR
      kernel  Config     C3/6 rx-usecs tps cpu% S.dem  tps cpu% S.dem
      patched optimized  on   100      87k 3.13 11.4   94K 3.17 10.7
      patched optimized  on   0        71k 3.12 14.0   84k 3.19 12.0
      patched optimized  on   adaptive 80k 3.13 12.5   90k 3.46 12.2
      patched typical    on   100      72  3.13 14.0   79k 3.17 12.8
      patched typical    on   0        60k 2.13 16.5   71k 3.18 14.0
      patched typical    on   adaptive 67k 3.51 16.7   75k 3.36 14.5
      3.9     optimized  on   adaptive 25k 1.0  12.7   28k 0.98 11.2
      3.9     typical    off  0        48k 1.09  7.3   52k 1.11 4.18
      3.9     typical    off  adaptive 35k 1.12 4.08   38k 0.65 5.49
      3.9     optimized  off  adaptive 40k 0.82 4.83   43k 0.70 5.23
      3.9     optimized  off  0        57k 1.17 4.08   62k 1.04 3.95
      
      Test setup details:
      Machines: each with two Intel Xeon 2680 CPUs and X520 (82599) optical
      NICs
      Tests: Netperf tcp_rr and udp_rr, 1 byte (round trips per second)
      Kernel: unmodified 3.9 and patched 3.9
      Config: typical is derived from RH6.2, optimized is a stripped down
      config.
      Interrupt coalescing (ethtool rx-usecs) settings: 0=off, 1=adaptive,
      100 us
      When C3/6 states were turned on (via BIOS) the performance governor
      was used.
      
      These performance numbers were measured with v2 of the patch set.
      Performance of the optimized config with an rx-usecs setting of 100
      (the first line in the table above) was tracked during the evolution
      of the patches and has never varied by more than 1%.
      
      Design:
      A global hash table that allows us to look up a struct napi by a
      unique id was added.
      
      A napi_id field was added to both struct sk_buff and struct sock.
      This is used to track which NAPI we need to poll for a specific
      socket.
      
      The device driver marks every incoming skb with this id.
      This is propagated to the sk when the socket is looked up in the
      protocol handler.
      
      When the socket code does not find any more data on the socket queue,
      it now may call ndo_ll_poll which will crank the device's rx queue and
      feed incoming packets to the stack directly from the context of the
      socket.
      
      A sysctl value (net.core.low_latency_poll) controls how many
      microseconds we busy-wait before giving up. (Setting it to 0 globally
      disables busy-polling.)
      
      Locking:
      
      1. Locking between napi poll and ndo_ll_poll:
      Since what needs to be locked between a device's NAPI poll and
      ndo_ll_poll, is highly device / configuration dependent, we do this
      inside the Ethernet driver.
      For example, when packets for high priority connections are sent to
      separate rx queues, you might not need locking between napi poll and
      ndo_ll_poll at all.
      
      For ixgbe we only lock the RX queue.
      ndo_ll_poll does not touch the interrupt state or the TX queues.
      (earlier versions of this patchset did touch them,
      but this design is simpler and works better.)
      
      If a queue is actively polled by a socket (on another CPU) napi poll
      will not service it, but will wait until the queue can be locked
      and cleaned before doing a napi_complete().
      If a socket can't lock the queue because another CPU has it,
      either from napi or from another socket polling on the queue,
      the socket code can busy wait on the socket's skb queue.
      
      ndo_ll_poll does not have preferential treatment for the data from the
      calling socket vs. data from others, so if another CPU is polling,
      you will see your data on this socket's queue when it arrives.
      
      ndo_ll_poll is called with local BHs disabled, so it won't race on
      the same CPU with net_rx_action, which calls the napi poll method.
      
      2. Napi_hash
      The napi hash mechanism uses RCU.
      napi_by_id() must be called under rcu_read_lock().
      After a call to napi_hash_del(), caller must take care to wait an rcu
      grace period before freeing the memory containing the napi struct.
      (Ixgbe already had this because the queue vector structure uses rcu to
      protect the statistics counters in it.)
      
      How to test:
      
      1. The patchset should apply cleanly to net-next.
      (don't forget to configure INET_LL_RX_POLL).
      
      2. The ethtool -c setting for rx-usecs should be on the order of 100.
      
      3. Use ethtool -K to disable GRO and LRO.
      (You are encouraged to try it both ways. If you find that your
      workload does better with GRO on, do tell us.)
      
      4. Sysctl value net.core.low_latency_poll controls how long
      (in us) to busy-wait for more data. You are encouraged to play
      with this and see what works for you. The default is now 0, so you
      need to set it to turn the feature on. I recommend a value around 50.
      
      5. The benchmark thread and IRQ should be bound to separate cores.
      Both cores should be on the same CPU NUMA node as the NIC.
      When the app and the IRQ run on the same CPU you get a small penalty.
      If interrupt coalescing is set to a low value this penalty can be very
      large.
      
      6. If you suspect that your machine is not configured properly,
      use numademo to make sure that the CPU to memory BW is OK.
      numademo 128m memcpy local copy numbers should be more than
      8GB/s on a properly configured machine.
      
      Change log:
      v10
      - removed select/poll support. (we will work on this some more and try again)
      v9
      - correct sysctl proc_handler, reported by Eric Dumazet and Amir Vadai.
      - more int -> bool changes, reported by Eric Dumazet.
      - better mask testing in sock_poll(), reported by Eric Dumazet.
      
      v8
      - split out udp and select/poll into separate patches.
        what used to be patch 2/5 is now three patches.
      - type corrections from Amir Vadai and Cong Wang:
        one unsigned long that was left when changing to cycles_t
        int -> bool
      - more detailed patch descriptions.
      
      v7
      - suggested by Ben Hutchings and Eric Dumazet:
        type fixes, static for globals in net/core.c,
        avoid napi_id collisions in napi_hash_add()
      
      v6
      - many small fixes suggested by Eric Dumazet:
        data locality, typos, documentation
        protect napi_hash insert/delete with a spinlock (napi_gen_id is no
        longer atomic_t since it's only accessed with the spinlock held.)
      - added IPv6 TCP and UDP support (only minimally tested)
      
      v5
      - corrections suggested by Ben Hutchings:
        fixed typos, moved the config option and sysctl value from IPv4 to net
      - moved sk_mark_ll() to the protocol handlers
      - removed global id mechanism, replaced with a hashed napi_id.
        based on code sample from Eric Dumazet
        Note that ixgbe_free_q_vector() already waits an rcu grace period
        before freeing the q_vector, so nothing additional needs to be done
        when adding a call to napi_hash_del().
      - simple poll/select support
      
      v4
      - removed separate config option for TCP as suggested by Eric Dumazet.
      - added linux mib counter for packets received through the low latency path,
        as suggested by Andi Kleen.
      - re-allow module unloading, remove module param, use a global generation id
        instead to prevent the use of a stale napi pointer, as suggested
        by Eric Dumazet
      - updated Documentation/networking/ip-sysctl.txt text
      
      v3
      - coding style changes suggested by Dave Miller
      
      v2
      - the sysctl knob is now in microseconds. The default value is now 0 (off).
      - for now the code depends at configure time on CONFIG_X86_TSC
      - the napi reference in struct skb is now a union with the dma cookie
        since the former is only used on RX and the latter on TX,
        as suggested by Eric Dumazet.
      - we do a better job at honoring non-blocking operations.
      - removed busy-polling support for tcp_read_sock()
      - remove dynamic disabling of GRO
      - coding style fixes
      - disallow unloading the device module after the feature has been used
      
      Credit:
      Jesse Brandeburg, Arun Chekhov Ilango, Julie Cummings,
      Alexander Duyck, Eric Geisler, Jason Neighbors, Yadong Li,
      Mike Polehn, Anil Vasudevan, Don Wood
      Special thanks for finding bugs in earlier versions:
      Willem de Bruijn and Andi Kleen
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
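The core of the design, polling the device from socket context until data shows up or a microsecond budget runs out, can be sketched with a fake clock and a fake device. This is a toy model under stated assumptions, not the kernel implementation. `ll_model`, `device_poll()` and `busy_wait_for_data()` are hypothetical names, and each poll is assumed to cost 1 us.

```c
#include <stdbool.h>
#include <stdint.h>

struct ll_model {
    uint64_t now_us;          /* fake clock, advanced by each poll */
    int      queued;          /* packets on the socket's receive queue */
    int      arrives_at_poll; /* poll number that delivers a packet; <0 = never */
    int      polls;           /* how many times the device was polled */
};

/* Stand-in for the driver's poll hook: may move a packet to the queue. */
static void device_poll(struct ll_model *m)
{
    m->polls++;
    m->now_us += 1;           /* assumption: one poll costs 1 us */
    if (m->polls == m->arrives_at_poll)
        m->queued++;
}

/* Busy-wait for data for at most budget_us microseconds.
 * A budget of 0 disables polling, as with the sysctl described above. */
static bool busy_wait_for_data(struct ll_model *m, uint64_t budget_us)
{
    uint64_t deadline = m->now_us + budget_us;

    while (m->queued == 0 && m->now_us < deadline)
        device_poll(m);

    return m->queued > 0;
}
```

When the function returns false, the caller would fall back to the normal blocking or non-blocking receive path.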
    • E
      ixgbe: add extra stats for ndo_ll_poll · 7e15b90f
      Eliezer Tamir committed
      Add additional statistics to the ixgbe driver for ndo_ll_poll
      Defined under LL_EXTENDED_STATS
      Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
      Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
      Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>