1. 29 November 2010 (1 commit)
  2. 28 November 2010 (1 commit)
    • rtnl: make link af-specific updates atomic · cf7afbfe
      Authored by Thomas Graf
      As David pointed out correctly, updates to af-specific attributes
      are currently not atomic. If multiple changes are requested and
      one of them fails, previous updates may have been applied already
      leaving the link behind in an undefined state.
      
      This patch splits the function parse_link_af() into two functions,
      validate_link_af() and set_link_af(). validate_link_af() is added to
      validate_linkmsg() to check for errors as early as possible, before
      any changes to the link have been made. set_link_af() is called to
      commit the changes later.
      
      This method is not fail-proof: while it is currently sufficient
      to make set_link_af() unable to fail and thus 100% atomic, the
      validation-function approach will not be able to detect all error
      scenarios in the future; there will likely always be errors
      depending on state which is, for example, not protected by
      rtnl_mutex and thus may change between validation and setting.
      
      Also, instead of silently ignoring unknown address families and
      config blocks for address families which did not register a set
      function, the errors EAFNOSUPPORT and EOPNOTSUPP, respectively, are
      returned to avoid committing 4 out of 5 update requests without
      notifying the user.
      Signed-off-by: Thomas Graf <tgraf@infradead.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
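
      Below is a minimal, user-space sketch of the validate-then-commit
      pattern this commit message describes. The names (af_change,
      apply_changes, validate_nonneg) are assumptions made for the sketch,
      not the kernel's actual rtnl_af_ops interface; the point is only that
      every request is validated before any of them is committed.

         /* Toy model: validate every requested change first, commit only if no
          * validation failed, so a partial update can never be observed. */
         #include <errno.h>
         #include <stdio.h>

         struct af_change {
                 int family;
                 int value;
                 int (*validate)(int value);             /* may fail, must not touch state */
                 void (*commit)(int family, int value);  /* must not fail */
         };

         static int validate_nonneg(int value)
         {
                 return value < 0 ? -EINVAL : 0;
         }

         static void commit_print(int family, int value)
         {
                 printf("family %d set to %d\n", family, value);
         }

         static int apply_changes(struct af_change *c, int n)
         {
                 int i, err;

                 for (i = 0; i < n; i++) {
                         err = c[i].validate(c[i].value);
                         if (err)
                                 return err;             /* nothing has been committed yet */
                 }
                 for (i = 0; i < n; i++)
                         c[i].commit(c[i].family, c[i].value);
                 return 0;
         }

         int main(void)
         {
                 struct af_change changes[] = {
                         {  2,  1, validate_nonneg, commit_print },   /* AF_INET  */
                         { 10, -5, validate_nonneg, commit_print },   /* AF_INET6 */
                 };

                 if (apply_changes(changes, 2))
                         fprintf(stderr, "rejected before any change was applied\n");
                 return 0;
         }
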
  3. 25 November 2010 (3 commits)
    • xps: Transmit Packet Steering · 1d24eb48
      Authored by Tom Herbert
      This patch implements transmit packet steering (XPS) for multiqueue
      devices.  XPS selects a transmit queue during packet transmission based
      on configuration.  This is done by mapping the CPU transmitting the
      packet to a queue.  This is the transmit-side analogue to RPS: where
      RPS selects a CPU based on the receive queue, XPS selects a queue
      based on the CPU (previously there was an XPS patch from Eric
      Dumazet, but that might more appropriately be called transmit completion
      steering).
      
      Each transmit queue can be associated with a number of CPUs which will
      use the queue to send packets.  This is configured as a CPU mask on a
      per queue basis in:
      
      /sys/class/net/eth<n>/queues/tx-<n>/xps_cpus
      
      The mappings are stored per device in an inverted data structure that
      maps CPUs to queues.  In the netdevice structure this is an array of
      num_possible_cpu structures where each structure holds an array of
      queue_indexes for queues which that CPU can use.
      
      The benefits of XPS are improved locality in the per queue data
      structures.  Also, transmit completions are more likely to be done
      nearer to the sending thread, so this should promote locality back
      to the socket on free (e.g. UDP).  The benefits of XPS are dependent on
      cache hierarchy, application load, and other factors.  XPS would
      nominally be configured so that a queue would only be shared by CPUs
      which share a cache; the degenerate configuration would be that
      each CPU has its own queue.
      
      Below are some benchmark results which show the potential benefit of
      this patch.  The netperf test has 500 instances of netperf TCP_RR test
      with 1 byte req. and resp.
      
      bnx2x on 16 core AMD
         XPS (16 queues, 1 TX queue per CPU)  1234K at 100% CPU
         No XPS (16 queues)                   996K at 100% CPU
      Signed-off-by: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
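
      As a rough illustration of the inverted CPU-to-queue mapping described
      above, here is a small user-space model. The names (cpu_map, xps_maps,
      pick_tx_queue) and the fallback behaviour are assumptions for the
      sketch, not the kernel's actual XPS data structures.

         /* Simplified model of an inverted CPU-to-queue map: indexed by CPU,
          * each entry lists the tx queues that CPU may use. */
         #include <stdio.h>

         #define MAX_CPUS   4
         #define MAX_QUEUES 4

         struct cpu_map {
                 int len;                    /* number of usable queues */
                 int queues[MAX_QUEUES];     /* queue indexes usable from this CPU */
         };

         static struct cpu_map xps_maps[MAX_CPUS] = {
                 [0] = { 1, { 0 } },
                 [1] = { 1, { 1 } },
                 [2] = { 2, { 2, 3 } },      /* CPUs 2 and 3 share queues 2 and 3 */
                 [3] = { 2, { 2, 3 } },
         };

         static int pick_tx_queue(int cpu, unsigned int flow_hash)
         {
                 const struct cpu_map *map = &xps_maps[cpu];

                 if (map->len == 0)
                         return -1;                        /* no mapping: fall back to default */
                 if (map->len == 1)
                         return map->queues[0];
                 return map->queues[flow_hash % map->len]; /* spread flows over the queues */
         }

         int main(void)
         {
                 printf("cpu 0 -> queue %d\n", pick_tx_queue(0, 12345));
                 printf("cpu 3 -> queue %d\n", pick_tx_queue(3, 12345));
                 return 0;
         }
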
    • xps: Improvements in TX queue selection · 3853b584
      Authored by Tom Herbert
      In dev_pick_tx, don't do the work of calculating a queue index or
      setting the index in the sock unless the device has more than one
      queue.  This allows the sock to be set only with a queue index of a
      multi-queue device, which is desirable if devices are stacked, as in
      a tunnel.
      
      We also allow the mapping of a socket to a queue to be changed.  To
      maintain in-order packet transmission, a flag (ooo_okay) has been
      added to the sk_buff structure.  If a transport layer sets this flag
      on a packet, the transmit queue can be changed for the socket.
      Presumably, the transport would set this if there was no possibility
      of creating OOO packets (for instance, there are no packets in flight
      for the socket).  This patch includes the modification in TCP output
      for setting this flag.
      Signed-off-by: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
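
      The following toy model illustrates the two behaviours described above:
      the cached queue index is only kept for multi-queue devices, and it may
      only be recomputed when the transport marks reordering as harmless
      (ooo_okay). The struct and function names are invented for the sketch
      and do not match the kernel's sock/net_device layout.

         /* Toy model: a cached queue index is only kept for multi-queue devices
          * and only recomputed when reordering is known to be harmless. */
         #include <stdbool.h>
         #include <stdio.h>

         struct device { int num_tx_queues; };
         struct sock   { int tx_queue; bool tx_queue_valid; };

         static int compute_queue(const struct device *dev, unsigned int hash)
         {
                 return hash % dev->num_tx_queues;
         }

         static int pick_tx(struct device *dev, struct sock *sk,
                            unsigned int hash, bool ooo_okay)
         {
                 if (dev->num_tx_queues == 1)
                         return 0;               /* nothing to pick, nothing cached */

                 if (!sk->tx_queue_valid || ooo_okay) {
                         /* (re)compute only when allowed, preserving in-order delivery */
                         sk->tx_queue = compute_queue(dev, hash);
                         sk->tx_queue_valid = true;
                 }
                 return sk->tx_queue;
         }

         int main(void)
         {
                 struct device dev = { .num_tx_queues = 8 };
                 struct sock sk = { 0 };

                 printf("first pick:           %d\n", pick_tx(&dev, &sk, 0x1234, false));
                 printf("sticky pick:          %d\n", pick_tx(&dev, &sk, 0xffff, false));
                 printf("rehash with ooo_okay: %d\n", pick_tx(&dev, &sk, 0xffff, true));
                 return 0;
         }
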
    • scm: lower SCM_MAX_FD · bba14de9
      Authored by Eric Dumazet
      Lower SCM_MAX_FD from 255 to 253 so that allocations for scm_fp_list are
      halved. (commit f8d570a4 added two pointers in this structure)
      
      scm_fp_dup() should not copy the whole structure (and trigger kmemcheck
      warnings), but only the used part. While we are at it, only allocate
      the needed size.
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
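
      A small sketch of the "allocate and copy only the used part" idea, using
      a stand-in structure; the fd_list type, its fields and fd_list_dup() are
      assumptions for illustration, not the kernel's scm_fp_list API.

         /* Stand-in for the idea: duplicate only the header plus the used part
          * of the fd array, and allocate only that much. */
         #include <stddef.h>
         #include <stdio.h>
         #include <stdlib.h>
         #include <string.h>

         #define FD_LIST_MAX 253         /* the lowered limit mentioned above */

         struct fd_list {
                 short count;
                 short max;
                 int fds[FD_LIST_MAX];
         };

         static struct fd_list *fd_list_dup(const struct fd_list *src)
         {
                 size_t used = offsetof(struct fd_list, fds)
                               + (size_t)src->count * sizeof(src->fds[0]);
                 struct fd_list *dup = malloc(used);     /* only the needed size */

                 if (!dup)
                         return NULL;
                 memcpy(dup, src, used);                 /* copy only the used part */
                 return dup;
         }

         int main(void)
         {
                 struct fd_list src = { .count = 2, .max = FD_LIST_MAX, .fds = { 3, 4 } };
                 struct fd_list *dup = fd_list_dup(&src);

                 if (dup)
                         printf("copied %d fds, first is %d\n", dup->count, dup->fds[0]);
                 free(dup);
                 return 0;
         }
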
  4. 22 November 2010 (1 commit)
    • pktgen: allow faster module unload · 551eaff1
      Authored by Eric Dumazet
      Unloading the pktgen module takes ~6 seconds on a 64-CPU machine, to
      stop its 64 kthreads.
      
      Add a pktgen_exiting variable to let the kernel threads die faster, so
      that kthread_stop() doesn't have to wait too long for them. This
      variable is not tested in the fast path.
      
      Note: before exiting from pktgen_thread_worker(), we must make sure
      kthread_stop() is waiting for this thread to be stopped, as is done
      in kernel/softirq.c.
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
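
      Here is a minimal pthread-based sketch of the idea: worker threads poll
      a single shared "exiting" flag so that shutdown only has to set it once
      and then join them. The names (module_exiting, worker) are illustrative
      assumptions, not pktgen's internals.

         /* pthread sketch: workers re-check a shared "exiting" flag on every
          * loop iteration, so shutdown only needs to set it once and join. */
         #include <pthread.h>
         #include <stdatomic.h>
         #include <stdbool.h>
         #include <stdio.h>
         #include <unistd.h>

         static atomic_bool module_exiting;

         static void *worker(void *arg)
         {
                 (void)arg;
                 while (!atomic_load(&module_exiting)) {
                         /* one unit of work per iteration (modelled as a short
                          * sleep), then the exit flag is checked again */
                         usleep(1000);
                 }
                 return NULL;
         }

         int main(void)
         {
                 pthread_t threads[4];
                 int i;

                 for (i = 0; i < 4; i++)
                         pthread_create(&threads[i], NULL, worker, NULL);

                 /* unload path: flip the flag once, then join every worker promptly */
                 atomic_store(&module_exiting, true);
                 for (i = 0; i < 4; i++)
                         pthread_join(threads[i], NULL);

                 puts("all workers stopped");
                 return 0;
         }
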
  5. 20 November 2010 (4 commits)
  6. 19 November 2010 (4 commits)
  7. 18 November 2010 (2 commits)
    • net: zero kobject in rx_queue_release · 9ea19481
      Authored by John Fastabend
      netif_set_real_num_rx_queues() can decrement and increment
      the number of rx queues. For example ixgbe does this as
      features and offloads are toggled. Presumably this could
      also happen across down/up on most devices if the available
      resources changed (cpu offlined).
      
      The kobject needs to be zeroed in this case so that the
      state is not preserved across kobject_put()/kobject_init_and_add().
      
      This resolves the following error report.
      
      ixgbe 0000:03:00.0: eth2: NIC Link is Up 10 Gbps, Flow Control: RX/TX
      kobject (ffff880324b83210): tried to init an initialized object, something is seriously wrong.
      Pid: 1972, comm: lldpad Not tainted 2.6.37-rc18021qaz+ #169
      Call Trace:
       [<ffffffff8121c940>] kobject_init+0x3a/0x83
       [<ffffffff8121cf77>] kobject_init_and_add+0x23/0x57
       [<ffffffff8107b800>] ? mark_lock+0x21/0x267
       [<ffffffff813c6d11>] net_rx_queue_update_kobjects+0x63/0xc6
       [<ffffffff813b5e0e>] netif_set_real_num_rx_queues+0x5f/0x78
       [<ffffffffa0261d49>] ixgbe_set_num_queues+0x1c6/0x1ca [ixgbe]
       [<ffffffffa0262509>] ixgbe_init_interrupt_scheme+0x1e/0x79c [ixgbe]
       [<ffffffffa0274596>] ixgbe_dcbnl_set_state+0x167/0x189 [ixgbe]
      Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • rtnetlink: Link address family API · f8ff182c
      Authored by Thomas Graf
      Each net_device contains address family specific data such as
      per device settings and statistics. We already expose this data
      via procfs/sysfs and partially via netlink.
      
      The netlink method requires the requester to send one RTM_GETLINK
      request for each address family it wishes to receive data for,
      and then merge this data itself.
      
      This patch implements a new API which combines all address family
      specific link data in a new netlink attribute IFLA_AF_SPEC.
      IFLA_AF_SPEC contains a sequence of nested attributes, one for each
      address family which in turn defines the structure of its own
      attribute. Example:
      
         [IFLA_AF_SPEC] = {
             [AF_INET] = {
                 [IFLA_INET_CONF] = ...,
             },
             [AF_INET6] = {
                 [IFLA_INET6_FLAGS] = ...,
                 [IFLA_INET6_CONF] = ...,
             }
         }
      
      The API also allows for address families to implement a function
      which parses the IFLA_AF_SPEC attribute sent by userspace to
      implement address family specific link options.
      Signed-off-by: Thomas Graf <tgraf@infradead.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  8. 16 November 2010 (6 commits)
  9. 13 November 2010 (1 commit)
    • rtnetlink: Fix message size calculation for link messages · 369cf77a
      Authored by Thomas Graf
      nlmsg_total_size() calculates the length of a netlink message
      including header and alignment. nla_total_size() calculates the
      space an individual attribute consumes, which is what was meant to
      be used in this context.
      
      Also, make sure to account for the attribute header of the
      IFLA_INFO_XSTATS attribute, as implementations of get_xstats_size()
      seem to assume that we do so.
      
      The addition of two message headers minus the missing attribute
      header resulted in a calculated message size that was larger than
      required. Therefore we never risked running out of skb tailroom.
      Signed-off-by: Thomas Graf <tgraf@infradead.org>
      Acked-by: Patrick McHardy <kaber@trash.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
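
      For reference, a small Linux-only sketch contrasting the two helpers
      named above. The alignment macros come from the uapi <linux/netlink.h>
      header; the two helper functions are re-declared locally for a
      user-space build and are assumed to mirror the kernel's definitions.

         #include <linux/netlink.h>
         #include <stdio.h>

         static int nlmsg_total_size(int payload)
         {
                 /* whole netlink message: message header + payload, aligned */
                 return NLMSG_ALIGN(NLMSG_HDRLEN + payload);
         }

         static int nla_total_size(int payload)
         {
                 /* single attribute: attribute header + payload, aligned */
                 return NLA_ALIGN(NLA_HDRLEN + payload);
         }

         int main(void)
         {
                 int payload = 100;      /* e.g. what a get_xstats_size() hook reports */

                 printf("nlmsg_total_size(%d) = %d\n", payload, nlmsg_total_size(payload));
                 printf("nla_total_size(%d)   = %d\n", payload, nla_total_size(payload));
                 return 0;
         }
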
  10. 11 November 2010 (2 commits)
  11. 10 November 2010 (2 commits)
  12. 09 November 2010 (1 commit)
  13. 07 November 2010 (1 commit)
  14. 02 November 2010 (1 commit)
  15. 29 October 2010 (2 commits)
    • pktgen: Limit how much data we copy onto the stack. · 448d7b5d
      Authored by Nelson Elhage
      A program that accidentally writes too much data to the pktgen file can overflow
      the kernel stack and oops the machine. This is only triggerable by root, so
      there's no security issue, but it's still an unfortunate bug.
      
      printk() won't print more than 1024 bytes in a single call anyway, so
      let's just never copy more than that much data. We're on a fairly
      shallow stack, so
      that should be safe even with CONFIG_4KSTACKS.
      Signed-off-by: Nelson Elhage <nelhage@ksplice.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
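
      A user-space sketch of the clamping idea, assuming a 1024-byte bound
      taken from the printk limit mentioned above; the buffer name and
      debug_write() are illustrative, not pktgen's code, and memcpy() stands
      in for copy_from_user().

         /* Sketch: never copy more user data onto a fixed-size stack buffer
          * than it can hold. */
         #include <stdio.h>
         #include <string.h>

         #define MAX_DEBUG_COPY 1023     /* mirrors the 1024-byte printk limit, minus NUL */

         static void debug_write(const char *user_buf, size_t count)
         {
                 char tb[MAX_DEBUG_COPY + 1];            /* bounded stack buffer */

                 if (count > MAX_DEBUG_COPY)
                         count = MAX_DEBUG_COPY;         /* clamp instead of overflowing */
                 memcpy(tb, user_buf, count);
                 tb[count] = '\0';
                 printf("pktgen: %s\n", tb);
         }

         int main(void)
         {
                 char big[4096];

                 memset(big, 'x', sizeof(big));
                 debug_write(big, sizeof(big));          /* copies at most 1023 bytes */
                 return 0;
         }
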
    • net: Limit socket I/O iovec total length to INT_MAX. · 8acfe468
      Authored by David S. Miller
      This helps protect us from overflow issues down in the
      individual protocol sendmsg/recvmsg handlers.  Once
      we hit INT_MAX we truncate out the rest of the iovec
      by setting the iov_len members to zero.
      
      This works because:
      
      1) For SOCK_STREAM and SOCK_SEQPACKET sockets, partial
         writes are allowed and the application will just continue
         with another write to send the rest of the data.
      
      2) For datagram oriented sockets, where there must be a
         one-to-one correspondence between write() calls and
         packets on the wire, INT_MAX is going to be far larger
         than the packet size limit the protocol is going to
         check for and signal with -EMSGSIZE.
      
      Based upon a patch by Linus Torvalds.
      Signed-off-by: David S. Miller <davem@davemloft.net>
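
      Below is a user-space sketch of the truncation rule described above:
      cap the running total at INT_MAX and zero out iov_len for everything
      past the cap. The function name cap_iov_length() is an assumption for
      the sketch, not the kernel's actual implementation.

         /* Sketch: cap the iovec total at INT_MAX and zero out iov_len for
          * every segment beyond the cap, so callers only see an int total. */
         #include <limits.h>
         #include <stdio.h>
         #include <sys/uio.h>

         static int cap_iov_length(struct iovec *iov, unsigned long nr_segs)
         {
                 unsigned long seg;
                 int total = 0;

                 for (seg = 0; seg < nr_segs; seg++) {
                         size_t len = iov[seg].iov_len;

                         if (len > (size_t)(INT_MAX - total)) {
                                 len = INT_MAX - total;  /* truncate this and later segments */
                                 iov[seg].iov_len = len;
                         }
                         total += len;
                 }
                 return total;
         }

         int main(void)
         {
                 static char buf[16];
                 struct iovec iov[3] = {
                         { .iov_base = buf, .iov_len = sizeof(buf) },
                         { .iov_base = buf, .iov_len = (size_t)INT_MAX }, /* pushes past INT_MAX */
                         { .iov_base = buf, .iov_len = sizeof(buf) },
                 };

                 printf("capped total     = %d\n", cap_iov_length(iov, 3));
                 printf("last segment len = %zu\n", iov[2].iov_len);
                 return 0;
         }
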
  16. 28 October 2010 (3 commits)
  17. 27 October 2010 (1 commit)
    • fib: fix fib_nl_newrule() · ebb9fed2
      Authored by Eric Dumazet
      Some panic reports in fib_rules_lookup() show a rule could have a NULL
      pointer as a next pointer in the rules_list.
      
      This can actually happen because of a bug in fib_nl_newrule(): it
      checks whether the current rule is the destination of unresolved
      gotos (i.e. other rules have gotos pointing to this
      about-to-be-inserted rule).
      
      The problem is that it resolves the gotos before the rule is
      inserted into the rules_list (and thus before it has a valid next
      pointer).
      
      Fix this by moving the rules_list insertion before the changes on gotos.
      
      A lockless reader can no longer follow a ctarget pointer unless the
      destination is ready (has a valid next pointer).
      Reported-by: Oleg A. Arkhangelsky <sysoleg@yandex.ru>
      Reported-by: Joe Buehler <aspam@cox.net>
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  18. 26 October 2010 (4 commits)