1. 29 Jan, 2008 1 commit
  2. 21 Dec, 2007 1 commit
  3. 14 Nov, 2007 1 commit
    • [PKT_SCHED]: Check subqueue status before calling hard_start_xmit · 5f1a485d
      Peter P Waskiewicz Jr authored
      The only qdiscs that check subqueue state before dequeue'ing are PRIO
      and RR.  The other qdiscs, including the default pfifo_fast qdisc,
      will allow traffic bound for subqueue 0 through to hard_start_xmit.
      The check for netif_queue_stopped() is done above in pkt_sched.h, so
      it is unnecessary for qdisc_restart().  However, if the underlying
      driver is multiqueue capable, and only sets queue states on subqueues,
      this will allow packets to enter the driver when it's currently unable
      to process packets, resulting in expensive requeues and driver
      entries.  This patch re-adds the check for the subqueue status before
      calling hard_start_xmit, so we can try and avoid the driver entry when
      the queues are stopped.
      Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
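The check described in this commit can be sketched in plain C. This is only an illustration under stated assumptions: `toy_dev`, `toy_pkt`, `toy_hard_start_xmit` and `toy_try_xmit` are hypothetical stand-ins, not the kernel's structures or functions. The point is that the per-subqueue flag is consulted before the driver is entered, so a stopped subqueue costs a requeue but not a driver entry.

```c
#include <stdbool.h>

#define MAX_SUBQUEUES 8

/* Hypothetical, simplified device: one global stop flag plus per-subqueue flags. */
struct toy_dev {
	bool queue_stopped;
	bool subqueue_stopped[MAX_SUBQUEUES];
	int  xmit_calls;                /* counts simulated driver entries */
};

/* Hypothetical packet: only the subqueue it is bound for matters here. */
struct toy_pkt {
	unsigned int queue_mapping;
};

/* Stand-in for the driver's transmit routine. */
static int toy_hard_start_xmit(struct toy_dev *dev, const struct toy_pkt *pkt)
{
	(void)pkt;
	dev->xmit_calls++;
	return 0;
}

/* Returns 0 if the packet was handed to the driver, -1 if it must be requeued. */
static int toy_try_xmit(struct toy_dev *dev, const struct toy_pkt *pkt)
{
	if (dev->queue_stopped)
		return -1;
	/* The added check: do not enter the driver for a stopped subqueue. */
	if (pkt->queue_mapping >= MAX_SUBQUEUES ||
	    dev->subqueue_stopped[pkt->queue_mapping])
		return -1;
	return toy_hard_start_xmit(dev, pkt);
}
```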
  4. 11 Nov, 2007 1 commit
  5. 07 Nov, 2007 2 commits
    • [PKT_SCHED] CLS_U32: Fix endianness problem with u32 classifier hash masks. · 543821c6
      Radu Rendec authored
      While trying to implement u32 hashes in my shaping machine I ran into
      a possible bug in the u32 hash/bucket computing algorithm
      (net/sched/cls_u32.c).
      
      The problem occurs only with hash masks that extend over the octet
      boundary, on little endian machines (where htonl() actually does
      something).
      
      Let's say that I would like to use 0x3fc0 as the hash mask. This means
      8 contiguous "1" bits starting at b6. With such a mask, the expected
      (and logical) behavior is to hash any address in, for instance,
      192.168.0.0/26 in bucket 0, then any address in 192.168.0.64/26 in
      bucket 1, then 192.168.0.128/26 in bucket 2 and so on.
      
      This is exactly what would happen on a big endian machine, but on
      little endian machines, what would actually happen with current
      implementation is 0x3fc0 being reversed (into 0xc03f0000) by htonl()
      in the userspace tool and then applied to 192.168.x.x in the u32
      classifier. When shifting right by 16 bits (rank of first "1" bit in
      the reversed mask) and applying the divisor mask (0xff for divisor
      256), what would actually remain is 0x3f applied on the "168" octet of
      the address.
      
      One could say this can be easily worked around by taking endianness
      into account in userspace and supplying an appropriate mask (0xfc03)
      that would be turned into contiguous "1" bits when reversed
      (0x03fc0000). But the actual problem is the network address (inside
      the packet) not being converted to host order, but used as a
      host-order value when computing the bucket.
      
      Let's say the network address is written as n31 n30 ... n0, with n0
      being the least significant bit. When used directly (without any
      conversion) on a little endian machine, it becomes n7 ... n0 n15 ... n8
      etc in the machine's registers. Thus bits n7 and n8 would no longer be
      adjacent and 192.168.0.64/26 and 192.168.0.128/26 would no longer be
      consecutive.
      
      The fix is to apply ntohl() on the hmask before computing fshift,
      and in u32_hash_fold() convert the packet data to host order before
      shifting down by fshift.
      
      With helpful feedback from Jamal Hadi Salim and Jarek Poplawski.
      Signed-off-by: David S. Miller <davem@davemloft.net>
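The fix described above can be illustrated with a small user-space sketch. It assumes a POSIX system for `htonl()`/`ntohl()`. The helper names `mask_shift` and `hash_bucket` are hypothetical and only mirror the idea in the commit message: convert the mask to host order before computing the shift, and convert the masked packet word to host order before shifting down.

```c
#include <stdint.h>
#include <arpa/inet.h>  /* htonl(), ntohl() */

/* Index of the lowest set bit of a host-order mask; 0 for an empty mask. */
static unsigned int mask_shift(uint32_t host_mask)
{
	unsigned int shift = 0;

	if (host_mask == 0)
		return 0;
	while (!(host_mask & 1)) {
		host_mask >>= 1;
		shift++;
	}
	return shift;
}

/*
 * Hypothetical stand-in for the fixed hash fold. Both the packet word and
 * the hash mask arrive in network byte order, as they would from the wire
 * and from the userspace tool; both are converted to host order before
 * any shifting takes place.
 */
static unsigned int hash_bucket(uint32_t key_be, uint32_t hmask_be,
				unsigned int divisor_mask)
{
	unsigned int fshift = mask_shift(ntohl(hmask_be));

	return (ntohl(key_be & hmask_be) >> fshift) & divisor_mask;
}
```

With the mask 0x3fc0 and a divisor mask of 0xff, this sketch maps 192.168.0.0 to bucket 0, 192.168.0.64 to bucket 1 and 192.168.0.128 to bucket 2 regardless of host endianness, which is the behavior the commit message calls expected.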
    • [PKT_SCHED]: Fix OOPS when removing devices from a teql queuing discipline · 4f9f8311
      Evgeniy Polyakov authored
      teql_reset() is called from deactivate and qdisc is set to noop already,
      but subsequent teql_xmit does not know about it and dereferences private
      data as teql qdisc and thus oopses.
      not catch it first :)
      Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  6. 26 Oct, 2007 1 commit
  7. 24 Oct, 2007 1 commit
  8. 22 Oct, 2007 2 commits
  9. 20 Oct, 2007 1 commit
  10. 19 Oct, 2007 2 commits
  11. 18 Oct, 2007 1 commit
    • [NET]: fix carrier-on bug? · bfaae0f0
      Jeff Garzik authored
      While looking at a net driver with the following construct,
      
      	if (!netif_carrier_ok(dev))
      		netif_carrier_on(dev);
      
      it struck me that the netif_carrier_ok() check was redundant, since
      netif_carrier_on() checks bit __LINK_STATE_NOCARRIER anyway.  This is
      the same reason why netif_queue_stopped() need not be called prior to
      netif_wake_queue().
      
      This is true, but there is however an unwanted side effect from assuming
      that netif_carrier_on() can be called multiple times:  it touches the
      watchdog, regardless of pre-existing carrier state.
      
      The fix:  move watchdog-up inside the bit-cleared code path.
      Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
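The shape of this fix can be sketched as follows. The types and names (`toy_link`, `test_and_clear`, `toy_carrier_on`) are hypothetical, and the flag handling here is deliberately non-atomic, unlike the kernel's bit operations. The sketch only shows the watchdog being touched inside the "bit was set and has now been cleared" path, so that repeated calls become harmless.

```c
#include <stdbool.h>

/* Hypothetical link state, loosely modelling the no-carrier bit. */
struct toy_link {
	bool no_carrier;        /* true while the link has no carrier */
	int  watchdog_touches;  /* how many times the watchdog was re-armed */
};

/* Clear the flag and report whether it was set beforehand (not atomic). */
static bool test_and_clear(bool *flag)
{
	bool was_set = *flag;

	*flag = false;
	return was_set;
}

static void toy_carrier_on(struct toy_link *link)
{
	if (test_and_clear(&link->no_carrier)) {
		/* Only a real off-to-on transition re-arms the watchdog. */
		link->watchdog_touches++;
	}
}
```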
  12. 16 Oct, 2007 1 commit
  13. 11 Oct, 2007 11 commits
    • [NET_SCHED]: Show timer resolution instead of clock resolution in /proc/net/psched · 3c0cfc13
      Patrick McHardy authored
      The fourth parameter of /proc/net/psched is supposed to show the timer
      resolution and is used by HTB userspace to calculate the necessary
      burst rate. Currently we show the clock resolution, which results in a
      too low burst rate when the two differ.
      Signed-off-by: Patrick McHardy <kaber@trash.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • [NET]: sparse warning fixes · cfcabdcc
      Stephen Hemminger authored
      Fix a bunch of sparse warnings. Mostly about 0 used as
      NULL pointer, and shadowed variable declarations.
      One notable case was that hash size should have been unsigned.
      Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • [PKT_SCHED]: Add stateless NAT · b4219952
      Herbert Xu authored
      Stateless NAT is useful in controlled environments where restrictions are
      placed on through traffic such that we don't need connection tracking to
      correctly NAT protocol-specific data.
      
      In particular, this is of interest when the number of flows or the number
      of addresses being NATed is large, or if connection tracking information
      has to be replicated and where it is not practical to do so.
      
      Previously we had stateless NAT functionality which was integrated into
      the IPv4 routing subsystem.  This was a great solution as long as the NAT
      worked on a subnet to subnet basis such that the number of NAT rules was
      relatively small.  The reason is that for SNAT the routing based system
      had to perform a linear scan through the rules.
      
      If the number of rules is large then major renovations would have to take
      place in the routing subsystem to make this practical.
      
      For the time being, the least intrusive way of achieving this is to use
      the u32 classifier written by Alexey Kuznetsov along with the actions
      infrastructure implemented by Jamal Hadi Salim.
      
      The following patch is an attempt at this problem by creating a new nat
      action that can be invoked from u32 hash tables which would allow large
      number of stateless NAT rules that can be used/updated in constant time.
      
      The actual NAT code is mostly based on the previous stateless NAT code
      written by Alexey.  In future we might be able to utilise the protocol
      NAT code from netfilter to improve support for other protocols.
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • [NET]: Move hardware header operations out of netdevice. · 3b04ddde
      Stephen Hemminger authored
      Since hardware header operations are part of the protocol class
      not the device instance, make them into a separate object and
      save memory.
      Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • [NET]: Wrap netdevice hardware header creation. · 0c4e8581
      Stephen Hemminger authored
      Add inline for common usage of hardware header creation, and
      fix bug in IPV6 mcast where the assumption about negative return is
      an errno. Negative return from hard_header means not enough space
      was available (i.e. -N bytes).
      Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • [NET_SCHED]: explicit hold dev tx lock · 8236632f
      Jamal Hadi Salim authored
      For N CPUs, with full throttle traffic on all N CPUs, funneling traffic
      to the same ethernet device, the device's queue lock is contended by all
      N CPUs constantly. The TX lock is only contended by a max of 2 CPUs.
      In the current mode of operation, after all the work of entering the
      dequeue region, we may end up aborting the path if we are unable to get
      the tx lock and go back to contend for the queue lock. As N goes up,
      this gets worse.
      
      The changes in this patch result in a small increase in performance
      on a 4-CPU (2x dual-core) system with no irq binding. Both e1000 and tg3
      showed similar behavior.
      Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • [NET]: Nuke SET_MODULE_OWNER macro. · 10d024c1
      Ralf Baechle authored
      It's been a useless no-op for long enough in 2.6 so I figured it's time to
      remove it.  The number of people that could object because they're
      maintaining unified 2.4 and 2.6 drivers is probably rather small.
      
      [ Handled drivers added by netdev tree and some missed IRDA cases... -DaveM ]
      Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
      Signed-off-by: Jeff Garzik <jeff@garzik.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • [NET_SCHED]: Cleanup L2T macros and handle oversized packets · e9bef55d
      Jesper Dangaard Brouer authored
      Change L2T (length to time) macros, in all rate based schedulers, to
      call a common function qdisc_l2t() that does the rate table lookup.
      This function handles the case where the packet size lookup is larger
      than the rate table, which often occurs with TSO enabled.
      Signed-off-by: Jesper Dangaard Brouer <hawk@comx.dk>
      Acked-by: Patrick McHardy <kaber@trash.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
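A length-to-time lookup that tolerates oversized packets might look roughly like the sketch below. The commit message only says that the common function handles lookups past the end of the rate table. The struct layout, the 256-slot table size and the particular extrapolation used here (charging the last entry once per full table span, plus the remainder) are assumptions made for illustration. `toy_rate_table` and `toy_l2t` are hypothetical names.

```c
#include <stdint.h>

#define RATE_TABLE_SLOTS 256

/* Hypothetical rate table: data[slot] is the transmit time for that size class. */
struct toy_rate_table {
	unsigned int cell_log;               /* packet length is shifted right by this */
	uint32_t     data[RATE_TABLE_SLOTS];
};

/* Length-to-time lookup that never indexes past the end of the table. */
static uint32_t toy_l2t(const struct toy_rate_table *rtab, unsigned int pktlen)
{
	unsigned int slot = pktlen >> rtab->cell_log;

	if (slot > RATE_TABLE_SLOTS - 1) {
		/*
		 * Oversized packet (assumed policy): charge the last entry
		 * once per full table span, then add the in-table remainder.
		 */
		return rtab->data[RATE_TABLE_SLOTS - 1] * (slot >> 8) +
		       rtab->data[slot & 0xFF];
	}
	return rtab->data[slot];
}
```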
    • [NET]: Make the device list and device lookups per namespace. · 881d966b
      Eric W. Biederman authored
      This patch makes most of the generic device layer network
      namespace safe.  This patch makes dev_base_head a
      network namespace variable, and then it picks up
      a few associated variables.  The functions:
      dev_getbyhwaddr
      dev_getfirsthwbytype
      dev_get_by_flags
      dev_get_by_name
      __dev_get_by_name
      dev_get_by_index
      __dev_get_by_index
      dev_ioctl
      dev_ethtool
      dev_load
      wireless_process_ioctl
      
      were modified to take a network namespace argument, and
      deal with it.
      
      vlan_ioctl_set and brioctl_set were modified so their
      hooks will receive a network namespace argument.
      
      So basically anything in the core of the network stack that was
      affected by the change of dev_base was modified to handle
      multiple network namespaces.  The rest of the network stack was
      simply modified to explicitly use &init_net, the initial network
      namespace.  This can be fixed when those components of the network
      stack are modified to handle multiple network namespaces.
      
      For now the ifindex generator is left global.
      
      Fundamentally ifindex numbers are per namespace, or else
      we will have corner case problems with migration when
      we get that far.
      
      At the same time there are assumptions in the network stack
      that the ifindex of a network device won't change.  Making
      the ifindex number global seems a good compromise until
      the network stack can cope with ifindex changes when
      you change namespaces, and the like.
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • [NET]: Make /proc/net per network namespace · 457c4cbc
      Eric W. Biederman authored
      This patch makes /proc/net per network namespace.  It modifies the global
      variables proc_net and proc_net_stat to be per network namespace.
      The proc_net file helpers are modified to take a network namespace argument,
      and all of their callers are fixed to pass &init_net for that argument.
      This ensures that all of the /proc/net files are only visible and
      usable in the initial network namespace until the code behind them
      has been updated to handle multiple network namespaces.
      
      Making /proc/net per namespace is necessary as at least some files
      in /proc/net depend upon the set of network devices which is per
      network namespace, and even more files in /proc/net have contents
      that are relevant to a single network namespace.
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • [NET]: Make NAPI polling independent of struct net_device objects. · bea3348e
      Stephen Hemminger authored
      Several devices have multiple independent RX queues per net
      device, and some have a single interrupt doorbell for several
      queues.
      
      In either case, it's easier to support layouts like that if the
      structure representing the poll is independent from the net
      device itself.
      
      The signature of the ->poll() call back goes from:
      
      	int foo_poll(struct net_device *dev, int *budget)
      
      to
      
      	int foo_poll(struct napi_struct *napi, int budget)
      
      The caller is returned the number of RX packets processed (or
      the number of "NAPI credits" consumed if you want to get
      abstract).  The callee no longer messes around bumping
      dev->quota, *budget, etc. because that is all handled in the
      caller upon return.
      
      The napi_struct is to be embedded in the device driver private data
      structures.
      
      Furthermore, it is the driver's responsibility to disable all NAPI
      instances in its ->stop() device close handler.  Since the
      napi_struct is privatized into the driver's private data structures,
      only the driver knows how to get at all of the napi_struct instances
      it may have per-device.
      
      With lots of help and suggestions from Rusty Russell, Roland Dreier,
      Michael Chan, Jeff Garzik, and Jamal Hadi Salim.
      
      Bug fixes from Thomas Graf, Roland Dreier, Peter Zijlstra,
      Joseph Fannin, Scott Wood, Hans J. Koch, and Michael Chan.
      
      [ Ported to current tree and all drivers converted.  Integrated
        Stephen's follow-on kerneldoc additions, and restored poll_list
        handling to the old style to fix mutual exclusion issues.  -DaveM ]
      Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
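The new callback contract can be sketched in user-space C. `toy_napi`, `toy_priv`, `toy_container_of` and `toy_poll` are hypothetical stand-ins rather than the kernel's definitions. What the sketch shows is the pattern the message describes: the poll structure is embedded in the driver's private data, the callback recovers that private data from the pointer it is handed, and it simply returns how many packets it processed within the budget.

```c
#include <stddef.h>

/* Hypothetical stand-in for the per-queue poll structure. */
struct toy_napi {
	int weight;
};

/* Hypothetical driver private data with the poll structure embedded in it. */
struct toy_priv {
	int rx_pending;         /* packets waiting in the simulated RX ring */
	int rx_done;            /* packets processed so far */
	struct toy_napi napi;
};

/* Recover the enclosing structure from a pointer to one of its members. */
#define toy_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* New-style poll: takes the poll structure and a budget, returns work done. */
static int toy_poll(struct toy_napi *napi, int budget)
{
	struct toy_priv *priv = toy_container_of(napi, struct toy_priv, napi);
	int work_done = 0;

	while (work_done < budget && priv->rx_pending > 0) {
		priv->rx_pending--;
		priv->rx_done++;
		work_done++;
	}
	/* No quota or budget bookkeeping here; the caller handles that. */
	return work_done;
}
```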
  14. 08 Oct, 2007 1 commit
  15. 02 Oct, 2007 1 commit
  16. 21 Sep, 2007 1 commit
  17. 17 Sep, 2007 1 commit
  18. 15 Sep, 2007 1 commit
    • [NET_SCHED] protect action config/dump from irqs · e1e992e5
      Jamal Hadi Salim authored
      (with no apologies to C Heston)
      
      On Mon, 2007-10-09 at 21:00 +0800, Herbert Xu wrote:
      On Sun, Sep 02, 2007 at 01:11:29PM +0000, Christian Kujau wrote:
      > >
      > > after upgrading to 2.6.23-rc5 (and applying davem's fix [0]), lockdep
      > > was quite noisy when I tried to shape my external (wireless) interface:
      > >
      > > [ 6400.534545] FahCore_78.exe/3552 just changed the state of lock:
      > > [ 6400.534713]  (&dev->ingress_lock){-+..}, at: [<c038d595>]
      > > netif_receive_skb+0x2d5/0x3c0
      > > [ 6400.534941] but this lock took another, soft-read-irq-unsafe lock in the
      > > past:
      > > [ 6400.535145]  (police_lock){-.--}
      >
      > This is a genuine dead-lock.  The police lock can be taken
      > for reading with softirqs on.  If a second CPU tries to take
      > the police lock for writing, while holding the ingress lock,
      > then a softirq on the first CPU can dead-lock when it tries
      > to get the ingress lock.
      Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca>
      Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  19. 31 Aug, 2007 1 commit
    • [NET_SCHED] sch_prio.c: remove duplicate call of tc_classify() · dbaaa07a
      Lucas Nussbaum authored
      When CONFIG_NET_CLS_ACT is enabled, tc_classify() is called twice in
      prio_classify(). This causes "interesting" behaviour: with the setup
      below, packets are duplicated, sent twice to ifb0, and then loop in and
      out of ifb0.
      
      The patch uses the previously calculated return value in the switch,
      which is probably what Patrick had in mind in commit
      bdba91ec -- maybe Patrick can
      double-check this?
      
      -- example setup --
      ifconfig ifb0 up
      tc qdisc add dev ifb0 root netem delay 2s
      tc qdisc add dev $ETH root handle 1: prio
      tc filter add dev $ETH parent 1: protocol ip prio 10 u32 \
       match ip dst 172.24.110.6/32 flowid 1:1 \
       action mirred egress redirect dev ifb0
      ping -c1 172.24.110.6
      Signed-off-by: Lucas Nussbaum <lucas.nussbaum@imag.fr>
      Signed-off-by: David S. Miller <davem@davemloft.net>
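The pattern behind this fix, classifying once and reusing the stored result in the switch, can be sketched as below. Everything here is hypothetical (`toy_verdict`, `toy_classify`, `toy_prio_classify`, the three-band mapping). The call counter stands in for the side effects a real classifier with attached actions can have, which is why a second call is harmful rather than merely wasteful.

```c
/* Hypothetical classifier verdicts. */
enum toy_verdict { TOY_OK, TOY_STOLEN, TOY_SHOT };

/* Counts classifier invocations; stands in for action side effects. */
static int toy_classify_calls;

/* Hypothetical classifier: negative packets are dropped, others accepted. */
static enum toy_verdict toy_classify(int pkt)
{
	toy_classify_calls++;
	return pkt < 0 ? TOY_SHOT : TOY_OK;
}

/* Returns the chosen band (0..2), or -1 if the packet must not be queued. */
static int toy_prio_classify(int pkt)
{
	/* Classify exactly once and keep the result. */
	enum toy_verdict result = toy_classify(pkt);

	/* Reuse the stored result instead of calling toy_classify() again. */
	switch (result) {
	case TOY_STOLEN:
	case TOY_SHOT:
		return -1;
	default:
		return pkt % 3;
	}
}
```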
  20. 14 Aug, 2007 1 commit
  21. 31 Jul, 2007 3 commits
  22. 18 Jul, 2007 2 commits
  23. 15 Jul, 2007 2 commits
    • [NET_SCHED]: Kill CONFIG_NET_CLS_POLICE · c3bc7cff
      Patrick McHardy authored
      The NET_CLS_ACT option is now a full replacement for NET_CLS_POLICE,
      remove the old code. The config option will be kept around to select
      the equivalent NET_CLS_ACT options for a short time to allow easier
      upgrades.
      Signed-off-by: Patrick McHardy <kaber@trash.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • [NET_SCHED]: act_api: qdisc internal reclassify support · 73ca4918
      Patrick McHardy authored
      The behaviour of NET_CLS_POLICE for TC_POLICE_RECLASSIFY was to return
      it to the qdisc, which could handle it internally or ignore it. With
      NET_CLS_ACT however, tc_classify starts over at the first classifier
      and never returns it to the qdisc. This makes it impossible to support
      qdisc-internal reclassification, which in turn makes it impossible to
      remove the old NET_CLS_POLICE code without breaking compatibility since
      we have two qdiscs (CBQ and ATM) that support this.
      
      This patch adds a tc_classify_compat function that handles
      reclassification the old way and changes CBQ and ATM to use it.
      
      This again is of course not fully backwards compatible with the previous
      NET_CLS_ACT behaviour. Unfortunately there is no way to fully maintain
      compatibility *and* support qdisc internal reclassification with
      NET_CLS_ACT, but this seems like the better choice over keeping the two
      incompatible options around forever.
      Signed-off-by: Patrick McHardy <kaber@trash.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>