1. 12 May 2010 (2 commits)
  2. 27 April 2010 (1 commit)
  3. 23 April 2010 (1 commit)
  4. 20 April 2010 (4 commits)
    • netfilter: bridge-netfilter: fix refragmenting IP traffic encapsulated in PPPoE traffic · 6c79bf0f
      Committed by Bart De Schuymer
      The MTU for IP traffic encapsulated inside PPPoE traffic is smaller
      than the MTU of the Ethernet device (1500). Connection tracking
      gathers all IP packets and sometimes will refragment them in
      ip_fragment(). We then need to subtract the length of the
      encapsulating header from the mtu used in ip_fragment(). The check in
      br_nf_dev_queue_xmit() which determines if ip_fragment() has to be
      called is also updated for the PPPoE-encapsulated packets.
      nf_bridge_copy_header() is also updated to make sure the PPPoE data
      length field has the correct value.
      Signed-off-by: Bart De Schuymer <bdschuym@pandora.be>
      Signed-off-by: Patrick McHardy <kaber@trash.net>
      6c79bf0f
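      The MTU arithmetic above is easy to get wrong, so here is a minimal,
      self-contained C sketch (not the kernel code) modelling the decision
      br_nf_dev_queue_xmit() has to make: the struct and helper names are
      invented for illustration, and only PPPOE_SES_HLEN (8 bytes) and
      ETH_DATA_LEN (1500) mirror the kernel's constants.

      /* Models refragmentation of IP traffic carried inside PPPoE. */
      #include <stdio.h>
      #include <stdbool.h>

      #define ETH_DATA_LEN   1500
      #define PPPOE_SES_HLEN 8            /* PPPoE session header + PPP proto id */

      struct fake_skb {                   /* hypothetical stand-in for sk_buff */
          unsigned int len;               /* length of the IP packet           */
          bool pppoe_encapsulated;        /* frame carried inside PPPoE?       */
      };

      /* Effective MTU seen by the IP layer on this path. */
      static unsigned int path_mtu(const struct fake_skb *skb)
      {
          unsigned int mtu = ETH_DATA_LEN;

          if (skb->pppoe_encapsulated)
              mtu -= PPPOE_SES_HLEN;      /* leave room for the PPPoE header */
          return mtu;
      }

      static bool needs_refragmentation(const struct fake_skb *skb)
      {
          return skb->len > path_mtu(skb);
      }

      int main(void)
      {
          struct fake_skb skb = { .len = 1496, .pppoe_encapsulated = true };

          printf("refragment: %s\n", needs_refragmentation(&skb) ? "yes" : "no");
          return 0;
      }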
    • netfilter: xt_TEE: resolve oif using netdevice notifiers · 22265a5c
      Committed by Patrick McHardy
      Replace the runtime oif name resolving by netdevice notifier based
      resolving. When an oif is given, a netdevice notifier is registered
      to resolve the name on NETDEV_REGISTER or NETDEV_CHANGE and unresolve
      it again on NETDEV_UNREGISTER or NETDEV_CHANGE to a different name.
      Signed-off-by: Patrick McHardy <kaber@trash.net>
      22265a5c
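      For reference, a kernel-module style sketch of name-based oif
      resolution through a netdevice notifier, in the spirit of the change
      above (circa 2.6.3x, where the notifier callback's data argument is
      the struct net_device itself). Locking, name-change handling and
      error paths are omitted, and the identifiers are illustrative rather
      than the actual xt_TEE ones.

      #include <linux/module.h>
      #include <linux/netdevice.h>
      #include <linux/string.h>

      static char oif_name[IFNAMSIZ] = "eth1";  /* hypothetical target oif         */
      static int  oif_ifindex = -1;             /* resolved index, -1 = unresolved */

      static int oif_netdev_event(struct notifier_block *nb,
                                  unsigned long event, void *ptr)
      {
          struct net_device *dev = ptr;

          switch (event) {
          case NETDEV_REGISTER:                 /* device appeared: resolve name */
              if (!strcmp(dev->name, oif_name))
                  oif_ifindex = dev->ifindex;
              break;
          case NETDEV_UNREGISTER:               /* device went away: unresolve   */
              if (dev->ifindex == oif_ifindex)
                  oif_ifindex = -1;
              break;
          }
          return NOTIFY_DONE;
      }

      static struct notifier_block oif_notifier = {
          .notifier_call = oif_netdev_event,
      };

      static int __init oif_init(void)
      {
          return register_netdevice_notifier(&oif_notifier);
      }

      static void __exit oif_exit(void)
      {
          unregister_netdevice_notifier(&oif_notifier);
      }

      module_init(oif_init);
      module_exit(oif_exit);
      MODULE_LICENSE("GPL");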
    • rps: cleanups · e36fa2f7
      Committed by Eric Dumazet
      struct softnet_data holds many queues, so consistently using the
      name "sd" instead of "queue" is better.
      
      Adds a rps_ipi_queued() helper to clean up enqueue_to_backlog().
      
      Adds a _and_irq_disable suffix to net_rps_action() name, as David
      suggested.
      
      incr_input_queue_head() becomes input_queue_head_incr()
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e36fa2f7
    • rps: shortcut net_rps_action() · 88751275
      Committed by Eric Dumazet
      net_rps_action() is a bit expensive on NR_CPUS=64..4096 kernels, even if
      RPS is not active.
      
      Tom Herbert used two bitmasks to hold information needed to send IPI,
      but a single LIFO list seems more appropriate.
      
      Move all RPS logic into net_rps_action() to clean up the
      net_rx_action() code (removing two ifdefs).
      
      Move rps_remote_softirq_cpus into softnet_data to share its first cache
      line, filling an existing hole.
      
      In a future patch, we could call net_rps_action() from process_backlog()
      to make sure we send IPIs before handling this CPU's backlog.
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      88751275
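      A self-contained user-space model of the LIFO-list idea (not the
      kernel implementation): each remote enqueue pushes the target CPU's
      backlog structure onto a single-linked list, and net_rps_action()
      just detaches and walks that list, so nothing is scanned when no
      remote enqueue happened. All names are illustrative.

      #include <stdio.h>

      struct backlog {                     /* stand-in for softnet_data       */
          int cpu;
          struct backlog *rps_ipi_next;    /* next entry still needing an IPI */
      };

      static struct backlog *ipi_list;     /* LIFO head, built at enqueue time */

      static void remote_enqueue(struct backlog *sd)
      {
          sd->rps_ipi_next = ipi_list;     /* O(1) push, no NR_CPUS-wide scan */
          ipi_list = sd;
      }

      static void net_rps_action(void)
      {
          struct backlog *sd = ipi_list;

          ipi_list = NULL;                 /* detach the whole list at once */
          while (sd) {
              struct backlog *next = sd->rps_ipi_next;

              printf("send IPI to cpu %d\n", sd->cpu);
              sd = next;
          }
      }

      int main(void)
      {
          struct backlog a = { .cpu = 2 }, b = { .cpu = 5 };

          remote_enqueue(&a);
          remote_enqueue(&b);
          net_rps_action();                /* nothing at all to do when the list is empty */
          return 0;
      }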
  5. 19 April 2010 (2 commits)
  6. 17 April 2010 (1 commit)
    • rfs: Receive Flow Steering · fec5e652
      Committed by Tom Herbert
      This patch implements receive flow steering (RFS).  RFS steers
      received packets for layer 3 and 4 processing to the CPU where
      the application for the corresponding flow is running.  RFS is an
      extension of Receive Packet Steering (RPS).
      
      The basic idea of RFS is that when an application calls recvmsg
      (or sendmsg) the application's running CPU is stored in a hash
      table that is indexed by the connection's rxhash which is stored in
      the socket structure.  The rxhash is passed in skb's received on
      the connection from netif_receive_skb.  For each received packet,
      the associated rxhash is used to look up the CPU in the hash table,
      if a valid CPU is set then the packet is steered to that CPU using
      the RPS mechanisms.
      
      The convolution of the simple approach is that it would potentially
      allow OOO packets.  If threads are thrashing around CPUs or multiple
      threads are trying to read from the same sockets, a quickly changing
      CPU value in the hash table could cause rampant OOO packets--
      we consider this a non-starter.
      
      To avoid OOO packets, this solution implements two types of hash
      tables: rps_sock_flow_table and rps_dev_flow_table.
      
      rps_sock_flow_table is a global hash table.  Each entry is just a CPU
      number and it is populated in recvmsg and sendmsg as described above.
      This table contains the "desired" CPUs for flows.
      
      rps_dev_flow_table is specific to each device queue.  Each entry
      contains a CPU and a tail queue counter.  The CPU is the "current"
      CPU for a matching flow.  The tail queue counter holds the value
      of a tail queue counter for the associated CPU's backlog queue at
      the time of last enqueue for a flow matching the entry.
      
      Each backlog queue has a queue head counter which is incremented
      on dequeue, and so a queue tail counter is computed as queue head
      count + queue length.  When a packet is enqueued on a backlog queue,
      the current value of the queue tail counter is saved in the hash
      entry of the rps_dev_flow_table.
      
      And now the trick: when selecting the CPU for RPS (get_rps_cpu)
      the rps_sock_flow table and the rps_dev_flow table for the RX queue
      are consulted.  When the desired CPU for the flow (found in the
      rps_sock_flow table) does not match the current CPU (found in the
      rps_dev_flow table), the current CPU is changed to the desired CPU
      if one of the following is true:
      
      - The current CPU is unset (equal to RPS_NO_CPU)
      - Current CPU is offline
      - The current CPU's queue head counter >= queue tail counter in the
      rps_dev_flow table.  This checks if the queue tail has advanced
      beyond the last packet that was enqueued using this table entry.
      This guarantees that all packets queued using this entry have been
      dequeued, thus preserving in order delivery.
      
      Making each queue have its own rps_dev_flow table has two advantages:
      1) the tail queue counters will be written on each receive, so
      keeping the table local to the interrupting CPU is good for locality.
      2) this allows lockless access to the table: the CPU number and queue
      tail counter only need to be accessed together under mutual exclusion
      from netif_receive_skb, and we assume that it is only called from
      device napi_poll, which is non-reentrant.
      
      This patch implements RFS for TCP and connected UDP sockets.
      It should be usable for other flow oriented protocols.
      
      There are two configuration parameters for RFS.  The
      "rps_flow_entries" kernel init parameter sets the number of
      entries in the rps_sock_flow_table; the per-rxqueue sysfs entry
      "rps_flow_cnt" contains the number of entries in the rps_dev_flow
      table for the rxqueue.  Both are rounded to a power of two.
      
      The obvious benefit of RFS (over just RPS) is that it achieves
      CPU locality between the receive processing for a flow and the
      application's processing; this can result in increased performance
      (higher pps, lower latency).
      
      The benefits of RFS are dependent on cache hierarchy, application
      load, and other factors.  On simple benchmarks, we don't necessarily
      see improvement and sometimes see degradation.  However, for more
      complex benchmarks and for applications where cache pressure is
      much higher this technique seems to perform very well.
      
      Below are some benchmark results which show the potential benefit of
      this patch.  The netperf test has 500 instances of the netperf TCP_RR
      test with 1 byte req. and resp.  The RPC test is a request/response
      test similar in structure to the netperf RR test with 100 threads on
      each host, but does more work in userspace than netperf.
      
      e1000e on 8 core Intel
         No RFS or RPS		104K tps at 30% CPU
         No RFS (best RPS config):    290K tps at 63% CPU
         RFS				303K tps at 61% CPU
      
      RPC test	tps	CPU%	50/90/99% usec latency	Latency StdDev
        No RFS/RPS	103K	48%	757/900/3185		4472.35
        RPS only:	174K	73%	415/993/2468		491.66
        RFS		223K	73%	379/651/1382		315.61
      Signed-off-by: Tom Herbert <therbert@google.com>
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      fec5e652
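      The ordering rule above boils down to a single counter comparison.
      The following self-contained C model (not the kernel's get_rps_cpu())
      shows the CPU-switch condition and the tail-counter bookkeeping; the
      table layout, sizes and the toy cpu_online() are assumptions made
      only for illustration.

      #include <stdio.h>
      #include <stdbool.h>

      #define RPS_NO_CPU 0xffffu
      #define NCPUS      4

      struct dev_flow { unsigned int cpu, last_qtail; };  /* rps_dev_flow_table entry */
      struct backlog  { unsigned int head, len; };        /* per-CPU backlog counters */

      static struct backlog backlogs[NCPUS];

      static bool cpu_online(unsigned int cpu) { return cpu < NCPUS; }

      static unsigned int queue_tail(unsigned int cpu)
      {
          return backlogs[cpu].head + backlogs[cpu].len;  /* tail = head + length */
      }

      /* Decide which CPU the next packet of a flow should be steered to. */
      static unsigned int rfs_select_cpu(unsigned int desired_cpu, struct dev_flow *flow)
      {
          unsigned int cur = flow->cpu;

          if (desired_cpu != RPS_NO_CPU && desired_cpu != cur &&
              (cur == RPS_NO_CPU || !cpu_online(cur) ||
               /* the old CPU has dequeued everything enqueued via this entry */
               (int)(backlogs[cur].head - flow->last_qtail) >= 0))
              cur = flow->cpu = desired_cpu;

          if (cur != RPS_NO_CPU)
              flow->last_qtail = queue_tail(cur);         /* record tail at enqueue */
          return cur;
      }

      int main(void)
      {
          struct dev_flow flow = { .cpu = RPS_NO_CPU, .last_qtail = 0 };

          backlogs[1].head = 10;
          backlogs[1].len  = 3;                           /* CPU 1 still has a backlog */
          printf("steered to cpu %u\n", rfs_select_cpu(2, &flow));  /* unset -> 2     */
          printf("steered to cpu %u\n", rfs_select_cpu(1, &flow));  /* 2 drained -> 1 */
          return 0;
      }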
  7. 15 April 2010 (3 commits)
  8. 14 April 2010 (5 commits)
    • stmmac: new descriptor field for the driver's platform · e326e850
      Committed by Giuseppe CAVALLARO
      The new enh_desc field is used for selecting the enhanced descriptor
      structure. There are several scenarios; some chips (mac10/100
      or gmac) want to use the enhanced descriptors; others want the normal
      ones.
      For example, on ST platforms: MAC10/100 uses the normal desc structure
      and the GMAC uses the enhanced one.
      It can be useful to get this information from the platform.
      This could also be decided at run time by looking at the chip's ID
      number, but it could happen that chips with the same ID want to use
      different descriptor structures.
      Signed-off-by: Giuseppe Cavallaro <peppe.cavallaro@st.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e326e850
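      A brief sketch of how a board file might publish this choice through
      platform data; the structure below is a trimmed, illustrative
      stand-in in the stmmac style (the real platform data carries many
      more fields), with enh_desc used as described above.

      struct my_stmmac_plat_data {         /* trimmed stand-in, not the real struct */
          int bus_id;
          int enh_desc;                    /* 1 = enhanced descriptors (e.g. GMAC),
                                              0 = normal ones (MAC10/100)          */
      };

      static struct my_stmmac_plat_data my_board_stmmac_pdata = {
          .bus_id   = 0,
          .enh_desc = 1,                   /* this board's GMAC wants the enhanced layout */
      };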
    • ipv4: ipmr: support multiple tables · f0ad0860
      Committed by Patrick McHardy
      This patch adds support for multiple independent multicast routing instances,
      named "tables".
      
      Userspace multicast routing daemons can bind to a specific table instance by
      issuing a setsockopt call using a new option MRT_TABLE. The table number is
      stored in the raw socket data and affects all following ipmr setsockopt(),
      getsockopt() and ioctl() calls. By default, a single table (RT_TABLE_DEFAULT)
      is created with a default routing rule pointing to it. Newly created pimreg
      devices have the table number appended ("pimregX"), with the exception of
      devices created in the default table, which are named just "pimreg" for
      compatibility reasons.
      
      Packets are directed to a specific table instance using routing rules,
      similar to how regular routing rules work. Currently iif, oif and mark
      are supported as keys; source and destination addresses could be
      supported additionally.
      
      Example usage:
      
      - bind pimd/xorp/... to a specific table:
      
      uint32_t table = 123;
      setsockopt(fd, IPPROTO_IP, MRT_TABLE, &table, sizeof(table));
      
      - create routing rules directing packets to the new table:
      
      # ip mrule add iif eth0 lookup 123
      # ip mrule add oif eth0 lookup 123
      Signed-off-by: Patrick McHardy <kaber@trash.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f0ad0860
    • ipv4: ipmr: remove net pointer from struct mfc_cache · d658f8a0
      Committed by Patrick McHardy
      Now that cache entries in unres_queue don't need to be distinguished by their
      network namespace pointer anymore, we can remove it from struct mfc_cache
      and pass the namespace as a function argument to the functions that need it.
      Signed-off-by: Patrick McHardy <kaber@trash.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d658f8a0
    • net: fib_rules: decouple address families from real address families · 0f87b1dd
      Committed by Patrick McHardy
      Decouple the address family values used for fib_rules from the real
      address families in socket.h. This allows fib_rules to be used for
      code that is not a real address family without increasing AF_MAX/NPROTO.
      
      Values up to 127 are reserved for real address families and map directly
      to the corresponding AF value, values starting from 128 are for other
      uses. rtnetlink is changed to invoke the AF_UNSPEC dumpit/doit handlers
      for these families.
      Signed-off-by: Patrick McHardy <kaber@trash.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0f87b1dd
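      A tiny self-contained illustration of the split described above:
      rule-family values up to 127 are treated as real AF_* values, while
      anything higher is a pseudo-family with no socket.h counterpart. The
      value 128 used here is an assumed example, not necessarily what the
      kernel assigns.

      #include <stdio.h>
      #include <sys/socket.h>              /* AF_INET, AF_UNSPEC */

      #define RULE_FAMILY_REAL_MAX 127
      #define RULE_FAMILY_EXAMPLE  128     /* hypothetical pseudo-family */

      /* Map a fib_rules family value to a real address family, if any. */
      static int rule_family_to_af(int family)
      {
          return family <= RULE_FAMILY_REAL_MAX ? family : AF_UNSPEC;
      }

      int main(void)
      {
          printf("family %d -> AF %d\n", AF_INET, rule_family_to_af(AF_INET));
          printf("family %d -> AF %d\n", RULE_FAMILY_EXAMPLE,
                 rule_family_to_af(RULE_FAMILY_EXAMPLE));
          return 0;
      }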
  9. 13 April 2010 (7 commits)
  10. 10 April 2010 (2 commits)
    • radix_tree_tag_get() is not as safe as the docs make out [ver #2] · ce82653d
      Committed by David Howells
      radix_tree_tag_get() is not safe to use concurrently with radix_tree_tag_set()
      or radix_tree_tag_clear().  The problem is that the double tag_get() in
      radix_tree_tag_get():
      
      		if (!tag_get(node, tag, offset))
      			saw_unset_tag = 1;
      		if (height == 1) {
      			int ret = tag_get(node, tag, offset);
      
      may see the value change due to the action of set/clear.  RCU is no protection
      against this as no pointers are being changed, no nodes are being replaced
      according to a COW protocol - set/clear alter the node directly.
      
      The documentation in linux/radix-tree.h, however, says that
      radix_tree_tag_get() is an exception to the rule that "any function modifying
      the tree or tags (...) must exclude other modifications, and exclude any
      functions reading the tree".
      
      The problem is that the next statement in radix_tree_tag_get() checks that the
      tag doesn't vary over time:
      
      			BUG_ON(ret && saw_unset_tag);
      
      This has been seen happening in FS-Cache:
      
      	https://www.redhat.com/archives/linux-cachefs/2010-April/msg00013.html
      
      To this end, remove the BUG_ON() from radix_tree_tag_get() and note in various
      comments that the value of the tag may change whilst the RCU read lock is held,
      and thus that the return value of radix_tree_tag_get() may not be relied upon
      unless radix_tree_tag_set/clear() and radix_tree_delete() are excluded from
      running concurrently with it.
      Reported-by: Romain DEGEZ <romain.degez@smartjog.com>
      Signed-off-by: David Howells <dhowells@redhat.com>
      Acked-by: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ce82653d
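      In practice the rule above means a caller that must trust the
      returned tag value has to hold whatever lock the modifiers use,
      while a plain RCU reader may still call radix_tree_tag_get() but
      must treat the answer as a possibly stale hint. A minimal
      kernel-style sketch, with an assumed tree and lock of its own:

      #include <linux/radix-tree.h>
      #include <linux/spinlock.h>
      #include <linux/gfp.h>

      static RADIX_TREE(my_tree, GFP_ATOMIC);     /* hypothetical tree             */
      static DEFINE_SPINLOCK(my_tree_lock);       /* taken by tag_set/clear/delete */

      #define MY_DIRTY_TAG 0

      /* Reliable read: excludes concurrent radix_tree_tag_set/clear/delete. */
      static int my_tag_get_locked(unsigned long index)
      {
          int set;

          spin_lock(&my_tree_lock);
          set = radix_tree_tag_get(&my_tree, index, MY_DIRTY_TAG);
          spin_unlock(&my_tree_lock);
          return set;
      }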
    • slab: Generify kernel pointer validation · fc1c1833
      Committed by Pekka Enberg
      As suggested by Linus, introduce a kern_ptr_validate() helper that does some
      sanity checks to make sure a pointer is a valid kernel pointer.  This is a
      preparatory step for fixing SLUB's kmem_ptr_validate().
      
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Matt Mackall <mpm@selenic.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fc1c1833
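      For orientation, a sketch of the kind of checks such a helper
      performs (non-NULL, a kernel address, fits below high memory,
      pointer-aligned); this is an illustration of the idea only, not the
      mm/util.c implementation.

      #include <linux/kernel.h>            /* IS_ALIGNED               */
      #include <linux/mm.h>                /* PAGE_OFFSET, high_memory */

      /* Illustrative only: 1 if ptr looks like a plausible kernel pointer
       * to an object of the given size, 0 otherwise. */
      static int my_ptr_looks_valid(const void *ptr, unsigned long size)
      {
          unsigned long addr = (unsigned long)ptr;

          if (!ptr)
              return 0;
          if (addr < PAGE_OFFSET)                        /* not a kernel address */
              return 0;
          if (addr > (unsigned long)high_memory - size)  /* beyond lowmem        */
              return 0;
          if (!IS_ALIGNED(addr, sizeof(void *)))         /* misaligned           */
              return 0;
          return 1;
      }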
  11. 09 April 2010 (2 commits)
  12. 08 April 2010 (5 commits)
    • net: fix ethtool coding style errors and warnings · 97f8aefb
      Committed by chavey
      Fix coding style errors and warnings output while running checkpatch.pl
      on the files net/core/ethtool.c and include/linux/ethtool.h
      Signed-off-by: chavey <chavey@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      97f8aefb
    • virtio: disable multiport console support. · b7a41301
      Committed by Michael S. Tsirkin
      Move MULTIPORT feature and related config changes
      out of exported headers, and disable the feature
      at runtime.
      
      At this point, it seems less risky to keep code around
      until we can enable it than rip it out completely.
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
      b7a41301
    • net: fix definition of netdev_for_each_mc_addr() · 18e225f2
      Committed by Pavel Roskin
      The first argument should be called ha, not mclist.  All callers use the
      name "ha", but if they used a different name, there would be a compile
      error.
      Signed-off-by: Pavel Roskin <proski@gnu.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      18e225f2
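      A self-contained illustration of the bug class being fixed here: if
      a macro's body hard-codes the name "ha" while its parameter is
      called "mclist", the macro only compiles for callers whose iterator
      variable happens to be named "ha". The types and the list macro
      below are simplified stand-ins, not the netdev code.

      #include <stdio.h>

      struct ha { const char *addr; struct ha *next; };

      /* Broken shape: parameter is "mclist", but the body uses "ha". */
      #define for_each_mc_broken(mclist, list) \
              for (ha = (list); ha; ha = ha->next)

      /* Fixed shape: the body uses its own first parameter. */
      #define for_each_mc_fixed(ha, list) \
              for ((ha) = (list); (ha); (ha) = (ha)->next)

      int main(void)
      {
          struct ha b = { "00:11:22:33:44:56", NULL };
          struct ha a = { "00:11:22:33:44:55", &b };
          struct ha *ha;

          for_each_mc_fixed(ha, &a)        /* expands using its own parameter */
              printf("%s\n", ha->addr);
          return 0;
      }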
    • mac80211: clean up/fix aggregation code · 098a6070
      Committed by Johannes Berg
      The aggregation code has a number of quirks, like
      inventing an unneeded WLAN_BACK_TIMER value and
      leaking memory under certain circumstances during
      station destruction. Fix these issues by using
      the regular aggregation session teardown code and
      blocking new aggregation sessions, all before the
      station is really destructed.
      
      As a side effect, this gets rid of the long code
      block to destroy aggregation safely.
      
      Additionally, rename tid_state_rx which can only
      have the values IDLE and OPERATIONAL to
      tid_active_rx to make it easier to understand
      that there is no bitwise stuff going on on the
      RX side -- the TX side remains because it needs
      to keep track of the driver and peer states.
      Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
      Signed-off-by: John W. Linville <linville@tuxdriver.com>
      098a6070
    • cfg80211: Add local-state-change-only auth/deauth/disassoc · d5cdfacb
      Committed by Jouni Malinen
      cfg80211 is quite strict on allowing authentication and association
      commands only in certain states. In order to meet these requirements,
      user space applications may need to clear authentication or
      association state in some cases. Currently, this can be done with a
      deauth/disassoc command, but that ends up sending out a Deauthentication
      or Disassociation frame unnecessarily. Add a new nl80211 attribute that
      allows the sending of the frame to be skipped, while all other
      deauth/disassoc operations are still completed.
      
      A similar state change is also needed for the IEEE 802.11r FT protocol in
      the FT-over-DS case which does not use Authentication frame exchange
      in a transition to another BSS. For this to work with cfg80211, an
      authentication entry needs to be created for the target BSS without
      sending out an Authentication frame. The nl80211 authentication
      command can be used for this purpose, too, with the new attribute to
      indicate that the command is only for changing local state. This
      enables wpa_supplicant to complete FT-over-DS transition successfully.
      Signed-off-by: Jouni Malinen <j@w1.fi>
      Signed-off-by: John W. Linville <linville@tuxdriver.com>
      d5cdfacb
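      A hedged sketch (assuming libnl-3's genl API and the attribute and
      command names from nl80211.h) of how user space might issue a
      deauthentication that only changes local cfg80211 state; the
      interface index, BSSID and reason code are placeholders and error
      handling is omitted.

      #include <netlink/netlink.h>
      #include <netlink/genl/genl.h>
      #include <netlink/genl/ctrl.h>
      #include <linux/nl80211.h>

      static int local_only_deauth(int ifindex, const unsigned char bssid[6])
      {
          struct nl_sock *sk = nl_socket_alloc();
          struct nl_msg *msg = nlmsg_alloc();
          int family;

          genl_connect(sk);
          family = genl_ctrl_resolve(sk, "nl80211");

          genlmsg_put(msg, 0, 0, family, 0, 0, NL80211_CMD_DEAUTHENTICATE, 0);
          nla_put_u32(msg, NL80211_ATTR_IFINDEX, ifindex);
          nla_put(msg, NL80211_ATTR_MAC, 6, bssid);
          nla_put_u16(msg, NL80211_ATTR_REASON_CODE, 3);  /* STA is leaving */
          /* The new part: clear local state only, send no frame over the air. */
          nla_put_flag(msg, NL80211_ATTR_LOCAL_STATE_CHANGE);

          nl_send_auto_complete(sk, msg);

          nlmsg_free(msg);
          nl_socket_free(sk);
          return 0;
      }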
  13. 07 April 2010 (5 commits)