1. 20 Sep, 2016 1 commit
    • net sched ife action: Introduce skb tcindex metadata encap decap · 408fbc22
      Committed by Jamal Hadi Salim
      Sample use case of how this is encoded:
      user space via tuntap (or a connected VM/Machine/container)
      encodes the tcindex TLV.
      
      Sample use case of decoding:
      IFE action decodes it and the skb->tc_index is then used to classify.
      So something like this for encoded ICMP packets:
      
      ... first decode, then reclassify; skb->tc_index will be set
      sudo $TC filter add dev $ETH parent ffff: prio 2 protocol 0xbeef \
      u32 match u32 0 0 flowid 1:1 \
      action ife decode reclassify
      
      ... next, match the decoded ICMP packet ...
      sudo $TC filter add dev $ETH parent ffff: prio 4 protocol ip \
      u32 match ip protocol 1 0xff flowid 1:1 \
      action continue
      
      ... last, classify it using the tcindex classifier and apply some action ...
      sudo $TC filter add dev $ETH parent ffff: prio 5 protocol ip \
      handle 0x11 tcindex classid 1:1 \
      action blah..
      Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 16 Sep, 2016 1 commit
    • net_sched: Introduce skbmod action · 86da71b5
      Committed by Jamal Hadi Salim
      This action is intended to be an upgrade over pedit in usability
      (as well as in operational debuggability).
      Compare this:
      
      sudo tc filter add dev $ETH parent 1: protocol ip prio 10 \
      u32 match ip protocol 1 0xff flowid 1:2 \
      action pedit munge offset -14 u8 set 0x02 \
      munge offset -13 u8 set 0x15 \
      munge offset -12 u8 set 0x15 \
      munge offset -11 u8 set 0x15 \
      munge offset -10 u16 set 0x1515 \
      pipe
      
      to:
      
      sudo tc filter add dev $ETH parent 1: protocol ip prio 10 \
      u32 match ip protocol 1 0xff flowid 1:2 \
      action skbmod dmac 02:15:15:15:15:15
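      To see why the two rules do the same job, the sketch below replays the
      pedit writes on a dummy frame. It assumes, as the offsets in the rule
      suggest, that pedit offsets are relative to the start of the IP header, so
      offset -14 is the first byte of an untagged Ethernet header. The helper
      names are hypothetical and only model the byte writes, not tc itself.

```python
# Hypothetical helpers; they only model the byte writes, not tc itself.
ETH_HLEN = 14  # assumed untagged Ethernet header length


def apply_pedit(frame, munges):
    """Apply (offset, width_in_bytes, value) writes relative to the IP header."""
    out = bytearray(frame)
    for offset, width, value in munges:
        start = ETH_HLEN + offset          # -14 maps to frame byte 0
        out[start:start + width] = value.to_bytes(width, "big")
    return bytes(out)


def apply_skbmod_dmac(frame, mac):
    """Overwrite the 6-byte destination MAC at the front of the frame."""
    return bytes.fromhex(mac.replace(":", "")) + frame[6:]


# The five munge keys from the pedit rule above.
PEDIT_RULE = [(-14, 1, 0x02), (-13, 1, 0x15), (-12, 1, 0x15),
              (-11, 1, 0x15), (-10, 2, 0x1515)]

frame = bytes(ETH_HLEN + 20)               # zeroed Ethernet + IPv4 header
via_pedit = apply_pedit(frame, PEDIT_RULE)
via_skbmod = apply_skbmod_dmac(frame, "02:15:15:15:15:15")
print(via_pedit[:6].hex(":"))
```

      Five offset/width/value triples collapse into one readable MAC address,
      which is the usability point being made here.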
      
      Also try doing a MAC address swap with pedit or, worse,
      try debugging a policy with destination MAC, source MAC and
      ethertype. Then make a few rules out of those and you'll get my point.
      
      In the future, common pedit use cases can be migrated to this action
      (for example, different fields in IPv4/IPv6 and transports like TCP/UDP/SCTP).
      For this first cut, it allows modifying the basic Ethernet header.
      
      The most important Ethernet use case at the moment is redirecting or
      mirroring packets to a remote machine. The dst MAC address needs a rewrite
      so that the packet does not confuse an interconnecting (learning) switch
      and is not dropped by the target machine (which looks at the dst MAC). And at
      times, when flipping the packet back, a swap of the MAC addresses is needed.
      Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 11 Sep, 2016 1 commit
  4. 25 Jul, 2016 1 commit
  5. 02 Mar, 2016 3 commits
    • Support to encoding decoding skb prio on IFE action · 200e10f4
      Committed by Jamal Hadi Salim
          Example usage:
          Set the skb priority using skbedit then allow it to be encoded
      
          sudo tc qdisc add dev $ETH root handle 1: prio
          sudo tc filter add dev $ETH parent 1: protocol ip prio 10 \
          u32 match ip protocol 1 0xff flowid 1:2 \
          action skbedit prio 17 \
          action ife encode \
          allow prio \
          dst 02:15:15:15:15:15
      
          Note: You don't need the skbedit action if you are already setting the
          skb priority earlier. A zero skb priority will not be sent.
      
          Alternatively, hard-code a static priority of decimal 33 (unlike skbedit)
          and a mark of 0x12 every time the filter matches:
      
          sudo $TC filter add dev $ETH parent 1: protocol ip prio 10 \
          u32 match ip protocol 1 0xff flowid 1:2 \
          action ife encode \
          type 0xDEAD \
          use prio 33 \
          use mark 0x12 \
          dst 02:15:15:15:15:15
      Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Support to encoding decoding skb mark on IFE action · 084e2f65
      Committed by Jamal Hadi Salim
      Example usage:
      Set the skb mark using skbedit, then allow it to be encoded
      
      sudo tc qdisc add dev $ETH root handle 1: prio
      sudo tc filter add dev $ETH parent 1: protocol ip prio 10 \
      u32 match ip protocol 1 0xff flowid 1:2 \
      action skbedit mark 17 \
      action ife encode \
      allow mark \
      dst 02:15:15:15:15:15
      
      Note: You don't need the skbedit action if you are already setting the
      skb mark earlier. A zero skb mark, when seen, will not be encoded.
      
      Alternatively, hard-code a static mark of 0x12 every time the filter matches:
      
      sudo $TC filter add dev $ETH parent 1: protocol ip prio 10 \
      u32 match ip protocol 1 0xff flowid 1:2 \
      action ife encode \
      type 0xDEAD \
      use mark 0x12 \
      dst 02:15:15:15:15:15
      Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • introduce IFE action · ef6980b6
      Committed by Jamal Hadi Salim
      This action allows for a sending side to encapsulate arbitrary metadata
      which is decapsulated by the receiving end.
      The sender runs in encoding mode and the receiver in decode mode.
      Both sender and receiver must specify the same ethertype.
      At some point we hope to have a registered ethertype, and we'll
      then provide a default so the user doesn't have to specify it.
      For now we require the user to specify it.
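      As a rough illustration of what "encapsulating metadata" can look like, the
      sketch below packs and unpacks metadata as type-length-value records. The
      wire layout (16-bit type, 16-bit length that includes the 4-byte header,
      4-byte alignment) and the numeric type codes are assumptions made for this
      example, not something stated in this commit message.

```python
import struct

# Assumed metadata type codes, for illustration only.
IFE_META_SKBMARK = 1
IFE_META_PRIO = 3
TLV_HDR_LEN = 4  # 16-bit type + 16-bit length


def encode_tlv(meta_type, value):
    """Pack one 32-bit metadatum as type, length, value (network byte order)."""
    return struct.pack("!HHI", meta_type, TLV_HDR_LEN + 4, value)


def decode_tlvs(blob):
    """Walk a run of TLVs and return {type: value} for 32-bit values."""
    out, pos = {}, 0
    while pos + TLV_HDR_LEN <= len(blob):
        meta_type, length = struct.unpack_from("!HH", blob, pos)
        if length < TLV_HDR_LEN or pos + length > len(blob):
            raise ValueError("malformed TLV")
        if length == TLV_HDR_LEN + 4:
            out[meta_type] = struct.unpack_from("!I", blob, pos + TLV_HDR_LEN)[0]
        pos += (length + 3) & ~3           # TLVs are assumed 4-byte aligned
    return out


wire = encode_tlv(IFE_META_SKBMARK, 17) + encode_tlv(IFE_META_PRIO, 33)
print(decode_tlvs(wire))
```

      A decoder that skips records it does not recognise is what lets sender and
      receiver agree on only a subset of the metadata.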
      
      Let's show example usage where we encode ICMP from a sender towards
      a receiver with an skb mark of 17; both sender and receiver use
      an ethertype of 0xdead to interoperate.
      
      YYYY: Let's start with the receiver-side policy config:
      xxx: add an ingress qdisc
      sudo tc qdisc add dev $ETH ingress
      
      xxx: any packets with ethertype 0xdead will be subjected to ife decoding
      xxx: we then restart the classification so we can match on icmp at prio 3
      sudo $TC filter add dev $ETH parent ffff: prio 2 protocol 0xdead \
      u32 match u32 0 0 flowid 1:1 \
      action ife decode reclassify
      
      xxx: on restarting the classification from above if it was an icmp
      xxx: packet, then match it here and continue to the next rule at prio 4
      xxx: which will match based on skb mark of 17
      sudo tc filter add dev $ETH parent ffff: prio 3 protocol ip \
      u32 match ip protocol 1 0xff flowid 1:1 \
      action continue
      
      xxx: match on skbmark of 0x11 (decimal 17) and accept
      sudo tc filter add dev $ETH parent ffff: prio 4 protocol ip \
      handle 0x11 fw flowid 1:1 \
      action ok
      
      xxx: Let's show the decoding policy
      sudo tc -s filter ls dev $ETH parent ffff: protocol 0xdead
      xxx:
      filter pref 2 u32
      filter pref 2 u32 fh 800: ht divisor 1
      filter pref 2 u32 fh 800::800 order 2048 key ht 800 bkt 0 flowid 1:1  (rule hit 0 success 0)
        match 00000000/00000000 at 0 (success 0 )
              action order 1: ife decode action reclassify
               index 1 ref 1 bind 1 installed 14 sec used 14 sec
               type: 0x0
               Metadata: allow mark allow hash allow prio allow qmap
              Action statistics:
              Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
              backlog 0b 0p requeues 0
      xxx:
      Observe that the output above lists all the metadata it can decode. Typically
      these submodules will already be compiled into a monolithic kernel or
      loaded as modules.
      
      YYYY: Let's show the sender side now ..
      
      xxx: Add an egress qdisc on the sender netdev
      sudo tc qdisc add dev $ETH root handle 1: prio
      xxx:
      xxx: Match all icmp packets to 192.168.122.237/24, then
      xxx: tag the packet with skb mark of decimal 17, then
      xxx: Encode it with:
      xxx:	ethertype 0xdead
      xxx:	add skb->mark to whitelist of metadatum to send
      xxx:	rewrite target dst MAC address to 02:15:15:15:15:15
      xxx:
      sudo $TC filter add dev $ETH parent 1: protocol ip prio 10  u32 \
      match ip dst 192.168.122.237/24 \
      match ip protocol 1 0xff \
      flowid 1:2 \
      action skbedit mark 17 \
      action ife encode \
      type 0xDEAD \
      allow mark \
      dst 02:15:15:15:15:15
      
      xxx: Let's show the encoding policy
      sudo tc -s filter ls dev $ETH parent 1: protocol ip
      xxx:
      filter pref 10 u32
      filter pref 10 u32 fh 800: ht divisor 1
      filter pref 10 u32 fh 800::800 order 2048 key ht 800 bkt 0 flowid 1:2  (rule hit 0 success 0)
        match c0a87aed/ffffffff at 16 (success 0 )
        match 00010000/00ff0000 at 8 (success 0 )
      
      	action order 1:  skbedit mark 17
      	 index 6 ref 1 bind 1
       	Action statistics:
      	Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
      	backlog 0b 0p requeues 0
      
      	action order 2: ife encode action pipe
      	 index 3 ref 1 bind 1
      	 dst MAC: 02:15:15:15:15:15 type: 0xDEAD
       	 Metadata: allow mark
       	Action statistics:
      	Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
      	backlog 0b 0p requeues 0
      xxx:
      
      Test by sending a ping from the sender to the destination.
      Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  6. 14 May, 2015 1 commit
  7. 20 Jan, 2015 1 commit
    • net: sched: Introduce connmark action · 22a5dc0e
      Committed by Felix Fietkau
      This tc action allows you to retrieve the connection tracking mark.
      It has been used heavily by OpenWrt for a few years now.
      
      There are known limitations currently:
      
      - doesn't work for initial packets, since we only query the ct table.
        Fine, given that the use case is returning packets.
      
      - no implicit defrag.
        Frags should be rare, so fix later.
      
      - won't work for more complex tasks, e.g. lookup of other extensions,
        since we have no means to store results.
      
      - we still have a 2nd lookup later on via the normal conntrack path.
        This shouldn't break anything though, since skb->nfct isn't altered.
      
      V2:
      remove unnecessary braces (Jiri)
      change the action identifier to 14 (Jiri)
      Fix some stylistic issues caught by checkpatch
      V3:
      Move module params to bottom (Cong)
      Get rid of tcf_hashinfo_init and friends and conform to newer API (Cong)
      Acked-by: Jiri Pirko <jiri@resnulli.us>
      Signed-off-by: Felix Fietkau <nbd@openwrt.org>
      Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  8. 18 Jan, 2015 1 commit
  9. 22 Nov, 2014 1 commit
  10. 07 Jan, 2014 1 commit
    • net: pkt_sched: PIE AQM scheme · d4b36210
      Committed by Vijay Subramanian
      Proportional Integral controller Enhanced (PIE) is a scheduler to address the
      bufferbloat problem.
      
      From the IETF draft below:
      " Bufferbloat is a phenomenon where excess buffers in the network cause high
      latency and jitter. As more and more interactive applications (e.g. voice over
      IP, real time video streaming and financial transactions) run in the Internet,
      high latency and jitter degrade application performance. There is a pressing
      need to design intelligent queue management schemes that can control latency and
      jitter; and hence provide desirable quality of service to users.
      
      We present here a lightweight design, PIE(Proportional Integral controller
      Enhanced) that can effectively control the average queueing latency to a target
      value. Simulation results, theoretical analysis and Linux testbed results have
      shown that PIE can ensure low latency and achieve high link utilization under
      various congestion situations. The design does not require per-packet
      timestamp, so it incurs very small overhead and is simple enough to implement
      in both hardware and software.  "
      
      Many thanks to Dave Taht for extensive feedback, reviews, testing and
      suggestions. Thanks also to Stephen Hemminger and Eric Dumazet for reviews and
      suggestions.  Naeem Khademi and Dave Taht independently contributed to ECN
      support.
      
      For more information, please see the technical paper about PIE in the IEEE
      Conference on High Performance Switching and Routing 2013. A copy of the paper
      can be found at ftp://ftpeng.cisco.com/pie/.
      
      Please also refer to the IETF draft submission at
      http://tools.ietf.org/html/draft-pan-tsvwg-pie-00
      
      All relevant code, documents and test scripts and results can be found at
      ftp://ftpeng.cisco.com/pie/.
      
      For problems with the iproute2/tc or Linux kernel code, please contact Vijay
      Subramanian (vijaynsu@cisco.com or subramanian.vijay@gmail.com) or Mythili
      Prabhu (mysuryan@cisco.com).
      Signed-off-by: Vijay Subramanian <subramanian.vijay@gmail.com>
      Signed-off-by: Mythili Prabhu <mysuryan@cisco.com>
      CC: Dave Taht <dave.taht@bufferbloat.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  11. 20 Dec, 2013 1 commit
    • net-qdisc-hhf: Heavy-Hitter Filter (HHF) qdisc · 10239edf
      Committed by Terry Lam
      This patch implements the first size-based qdisc that attempts to
      differentiate between small flows and heavy-hitters.  The goal is to
      catch the heavy-hitters and move them to a separate queue with less
      priority so that bulk traffic does not affect the latency of critical
      traffic.  Currently "less priority" means less weight (2:1 in
      particular) in a Weighted Deficit Round Robin (WDRR) scheduler.
      
      In essence, this patch addresses the "delay-bloat" problem due to
      bloated buffers. In some systems, large queues may be necessary for
      obtaining CPU efficiency, or due to the presence of unresponsive
      traffic like UDP, or just a large number of connections with each
      having a small amount of outstanding traffic. In these circumstances,
      HHF aims to reduce the HoL blocking for latency sensitive traffic,
      while not impacting the queues built up by bulk traffic.  HHF can also
      be used in conjunction with other AQM mechanisms such as CoDel.
      
      To capture heavy-hitters, we implement the "multi-stage filter" design
      in the following paper:
      C. Estan and G. Varghese, "New Directions in Traffic Measurement and
      Accounting", in ACM SIGCOMM, 2002.
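      The multi-stage filter idea can be sketched in a few lines of Python. This
      is a simplified model for illustration, not the kernel code: the class name,
      the SHA-256 based per-stage hash and the sizes are assumptions. Each packet
      adds its length to one counter per stage, and a flow is treated as a
      heavy-hitter only when every one of its counters has reached the threshold.

```python
import hashlib


class MultiStageFilter:
    """Byte counters in several hashed stages; a flow is heavy only if all agree."""

    def __init__(self, stages=4, counters_per_stage=1000, admit_bytes=128 * 1024):
        self.counters = [[0] * counters_per_stage for _ in range(stages)]
        self.admit_bytes = admit_bytes

    def _slots(self, flow_id):
        # One independent-looking hash per stage (stand-in for the real hashes).
        for stage, row in enumerate(self.counters):
            digest = hashlib.sha256(f"{stage}:{flow_id}".encode()).digest()
            yield row, int.from_bytes(digest[:4], "big") % len(row)

    def add(self, flow_id, nbytes):
        """Account a packet; return True once the flow looks like a heavy-hitter."""
        heavy = True
        for row, idx in self._slots(flow_id):
            row[idx] += nbytes
            heavy = heavy and row[idx] >= self.admit_bytes
        return heavy

    def reset(self):
        """Clear all counters, as would happen every reset timeout."""
        for row in self.counters:
            for i in range(len(row)):
                row[i] = 0


mf = MultiStageFilter()
small = mf.add("ping", 100)                          # one small packet
flagged = [mf.add("bulk", 1500) for _ in range(100)]  # 150000 bytes in total
print(small, flagged.index(True))
```

      Because a flow's own bytes land in all of its counters, a flow that really
      sends more than the threshold can never stay below it, which is why the
      scheme has no false negatives.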
      
      Some configurable qdisc settings through 'tc':
      - hhf_reset_timeout: period to reset counter values in the multi-stage
                           filter (default 40ms)
      - hhf_admit_bytes:   threshold to classify heavy-hitters
                           (default 128KB)
      - hhf_evict_timeout: threshold to evict idle heavy-hitters
                           (default 1s)
      - hhf_non_hh_weight: Weighted Deficit Round Robin (WDRR) weight for
                           non-heavy-hitters (default 2)
      - hh_flows_limit:    max number of heavy-hitter flow entries
                           (default 2048)
      
      Note that the ratio between hhf_admit_bytes and hhf_reset_timeout
      reflects the bandwidth of heavy-hitters that we attempt to capture
      (25Mbps with the above default settings).
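      The quoted figure follows from dividing the admit threshold by the reset
      period. A quick check, assuming KB means 1024 bytes (the function name is
      hypothetical):

```python
def heavy_hitter_rate_bps(admit_bytes, reset_timeout_s):
    """Rate a flow must sustain to reach admit_bytes within one reset period."""
    return admit_bytes * 8 / reset_timeout_s


rate = heavy_hitter_rate_bps(admit_bytes=128 * 1024, reset_timeout_s=0.040)
print(f"{rate / 1e6:.1f} Mbit/s")
```

      This lands at roughly 26 Mbit/s, in line with the approximate 25Mbps stated
      above.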
      
      The false negative rate (heavy-hitter flows getting away unclassified)
      is zero by the design of the multi-stage filter algorithm.
      With 100 heavy-hitter flows, using four hashes and 4000 counters yields
      a false positive rate (non-heavy-hitters mistakenly classified as
      heavy-hitters) of less than 1e-4.
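      The false positive figure can be reproduced with a back-of-the-envelope
      model. The sketch assumes the 4000 counters are split evenly over the four
      hashes, that hashes are uniform and independent, and that a small flow is
      misclassified only when it shares a counter with some heavy-hitter in every
      stage. These are simplifying assumptions, not the exact analysis from the
      paper.

```python
def false_positive_bound(heavy_flows, stages, total_counters):
    """Chance that a small flow shares a counter with a heavy-hitter in every stage."""
    per_stage = total_counters // stages
    p_collide = 1 - (1 - 1 / per_stage) ** heavy_flows
    return p_collide ** stages


fp = false_positive_bound(heavy_flows=100, stages=4, total_counters=4000)
print(f"{fp:.1e}")
```

      Under these assumptions the estimate comes out a little below 1e-4.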
      Signed-off-by: Terry Lam <vtlam@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  12. 30 Oct, 2013 1 commit
    • net: sched: cls_bpf: add BPF-based classifier · 7d1d65cb
      Committed by Daniel Borkmann
      This work contains a lightweight BPF-based traffic classifier that can
      serve as a flexible alternative to ematch-based tree classification,
      particularly now that the BPF filter engine can also be JITed in the kernel.
      Naturally, tc actions and policies are supported as well with cls_bpf.
      Multiple BPF programs/filters can be attached to a class, or they can just
      as well be written as a single BPF program; it is up to the user how
      to run/optimize the code, e.g. also for inversion of verdicts etc.
      The notion of a BPF program's return/exit codes is kept as follows:
      
           0: No match
          -1: Select classid given in "tc filter ..." command
        else: flowid, overwrite the default one
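      The return-code convention can be modelled as a small function. This is a
      sketch for illustration; resolve_classid and classid are hypothetical names,
      and the class ID is assumed to be the usual 32-bit major:minor value.

```python
def resolve_classid(bpf_ret, filter_classid):
    """Map a BPF program's return value to a classification result.

    Returns None for "no match", otherwise the class ID to use.
    """
    ret = bpf_ret & 0xFFFFFFFF             # BPF return values are 32-bit
    if ret == 0:
        return None                        # no match, try the next filter
    if ret == 0xFFFFFFFF:                  # -1: keep classid from the tc command
        return filter_classid
    return ret                             # anything else overrides the classid


def classid(major, minor):
    """Build a 32-bit tc class ID such as 1:2 from its two 16-bit halves."""
    return (major << 16) | minor


print(resolve_classid(-1, classid(1, 1)) == classid(1, 1))
```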
      
      As a minimal usage example with iproute2, we use a 3-band prio root qdisc
      on a router, each band with an sfq leaf, and assign ssh and icmp bpf-based
      filters to band 1, http traffic to band 2 and the rest to band 3. For the
      first two bands we load the bytecode from a file; for the third we load it
      inline as an example:
      
      echo 1 > /proc/sys/net/core/bpf_jit_enable
      
      tc qdisc del dev em1 root
      tc qdisc add dev em1 root handle 1: prio bands 3 priomap 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
      
      tc qdisc add dev em1 parent 1:1 sfq perturb 16
      tc qdisc add dev em1 parent 1:2 sfq perturb 16
      tc qdisc add dev em1 parent 1:3 sfq perturb 16
      
      tc filter add dev em1 parent 1: bpf run bytecode-file /etc/tc/ssh.bpf flowid 1:1
      tc filter add dev em1 parent 1: bpf run bytecode-file /etc/tc/icmp.bpf flowid 1:1
      tc filter add dev em1 parent 1: bpf run bytecode-file /etc/tc/http.bpf flowid 1:2
      tc filter add dev em1 parent 1: bpf run bytecode "`bpfc -f tc -i misc.ops`" flowid 1:3
      
      BPF programs can be easily created and passed to tc, either as inline
      'bytecode' or 'bytecode-file'. There are a couple of front-ends that can
      compile opcodes, for example:
      
      1) People familiar with tcpdump-like filters:
      
         tcpdump -iem1 -ddd port 22 | tr '\n' ',' > /etc/tc/ssh.bpf
      
      2) People that want to low-level program their filters or use BPF
         extensions that lack support by libpcap's compiler:
      
         bpfc -f tc -i ssh.ops > /etc/tc/ssh.bpf
      
         ssh.ops example code:
         ldh [12]
         jne #0x800, drop
         ldb [23]
         jneq #6, drop
         ldh [20]
         jset #0x1fff, drop
         ldxb 4 * ([14] & 0xf)
         ldh [%x + 14]
         jeq #0x16, pass
         ldh [%x + 16]
         jne #0x16, drop
         pass: ret #-1
         drop: ret #0
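      For readers less used to BPF assembly, here is the same logic as ssh.ops
      written out in Python. It is only a reading aid: ssh_filter and make_frame
      are hypothetical helpers, and the frame is assumed to start at the Ethernet
      header, as the offsets in ssh.ops imply.

```python
import struct


def ssh_filter(frame):
    """Return -1 (match) for unfragmented IPv4/TCP with port 22, else 0."""
    if len(frame) < 14 + 20:
        return 0
    if struct.unpack_from("!H", frame, 12)[0] != 0x0800:   # ldh [12]: ethertype
        return 0
    if frame[23] != 6:                                      # ldb [23]: IP protocol
        return 0
    if struct.unpack_from("!H", frame, 20)[0] & 0x1FFF:     # ldh [20]: frag offset
        return 0
    x = 4 * (frame[14] & 0x0F)                              # ldxb: IP header length
    if len(frame) < x + 18:
        return 0
    sport, dport = struct.unpack_from("!HH", frame, x + 14)
    return -1 if 22 in (sport, dport) else 0


def make_frame(sport, dport, proto=6):
    """Build a minimal Ethernet + IPv4 + port-pair frame for testing."""
    eth = bytes(12) + struct.pack("!H", 0x0800)
    ip = bytes([0x45, 0]) + bytes(7) + bytes([proto]) + bytes(10)
    return eth + ip + struct.pack("!HH", sport, dport)


print(ssh_filter(make_frame(40000, 22)), ssh_filter(make_frame(40000, 80)))
```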
      
      It was chosen to load bytecode into tc, since the reverse operation,
      tc filter list dev em1, is then able to show the exact commands again.
      Possible follow-up work could also include a small expression compiler
      for iproute2. Tested with the help of bmon. This idea came up during
      the Netfilter Workshop 2013 in Copenhagen. Also thanks to feedback from
      Eric Dumazet!
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Cc: Thomas Graf <tgraf@suug.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  13. 30 Aug, 2013 1 commit
    • pkt_sched: fq: Fair Queue packet scheduler · afe4fd06
      Committed by Eric Dumazet
      - Uses perfect flow match (not stochastic hash like SFQ/FQ_codel)
      - Uses the new_flow/old_flow separation from FQ_codel
      - New flows get an initial credit allowing IW10 without added delay.
      - Special FIFO queue for high prio packets (no need for PRIO + FQ)
      - Uses a hash table of RB trees to locate the flows at enqueue() time
      - Smart on demand gc (at enqueue() time, RB tree lookup evicts old
        unused flows)
      - Dynamic memory allocations.
      - Designed to allow millions of concurrent flows per Qdisc.
      - Small memory footprint : ~8K per Qdisc, and 104 bytes per flow.
      - Single high resolution timer for throttled flows (if any).
      - One RB tree to link throttled flows.
      - Ability to have a max rate per flow. We might add a socket option
        to add per socket limitation.
      
      Attempts have been made to add TCP pacing to the TCP stack, but this
      seems to add complex code to an already complex stack.
      
      TCP pacing is welcome for flows that have idle times, as the cwnd
      permits the TCP stack to queue a possibly large number of packets.
      
      This removes the need for the 'slow start after idle' choice, which badly
      hurts large BDP flows and applications delivering data in chunks,
      such as video streams.
      
      Nicely spaced packets :
      Here interface is 10Gbit, but flow bottleneck is ~20Mbit
      
      cwnd is big, yet FQ avoids the typical bursts generated by TCP
      (as in netperf TCP_RR -- -r 100000,100000)
      
      15:01:23.545279 IP A > B: . 78193:81089(2896) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805>
      15:01:23.545394 IP B > A: . ack 81089 win 3668 <nop,nop,timestamp 11597985 1115>
      15:01:23.546488 IP A > B: . 81089:83985(2896) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805>
      15:01:23.546565 IP B > A: . ack 83985 win 3668 <nop,nop,timestamp 11597986 1115>
      15:01:23.547713 IP A > B: . 83985:86881(2896) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805>
      15:01:23.547778 IP B > A: . ack 86881 win 3668 <nop,nop,timestamp 11597987 1115>
      15:01:23.548911 IP A > B: . 86881:89777(2896) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805>
      15:01:23.548949 IP B > A: . ack 89777 win 3668 <nop,nop,timestamp 11597988 1115>
      15:01:23.550116 IP A > B: . 89777:92673(2896) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805>
      15:01:23.550182 IP B > A: . ack 92673 win 3668 <nop,nop,timestamp 11597989 1115>
      15:01:23.551333 IP A > B: . 92673:95569(2896) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805>
      15:01:23.551406 IP B > A: . ack 95569 win 3668 <nop,nop,timestamp 11597991 1115>
      15:01:23.552539 IP A > B: . 95569:98465(2896) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805>
      15:01:23.552576 IP B > A: . ack 98465 win 3668 <nop,nop,timestamp 11597992 1115>
      15:01:23.553756 IP A > B: . 98465:99913(1448) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805>
      15:01:23.554138 IP A > B: P 99913:100001(88) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805>
      15:01:23.554204 IP B > A: . ack 100001 win 3668 <nop,nop,timestamp 11597993 1115>
      15:01:23.554234 IP B > A: . 65248:68144(2896) ack 100001 win 3668 <nop,nop,timestamp 11597993 1115>
      15:01:23.555620 IP B > A: . 68144:71040(2896) ack 100001 win 3668 <nop,nop,timestamp 11597993 1115>
      15:01:23.557005 IP B > A: . 71040:73936(2896) ack 100001 win 3668 <nop,nop,timestamp 11597993 1115>
      15:01:23.558390 IP B > A: . 73936:76832(2896) ack 100001 win 3668 <nop,nop,timestamp 11597993 1115>
      15:01:23.559773 IP B > A: . 76832:79728(2896) ack 100001 win 3668 <nop,nop,timestamp 11597993 1115>
      15:01:23.561158 IP B > A: . 79728:82624(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
      15:01:23.562543 IP B > A: . 82624:85520(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
      15:01:23.563928 IP B > A: . 85520:88416(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
      15:01:23.565313 IP B > A: . 88416:91312(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
      15:01:23.566698 IP B > A: . 91312:94208(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
      15:01:23.568083 IP B > A: . 94208:97104(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
      15:01:23.569467 IP B > A: . 97104:100000(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
      15:01:23.570852 IP B > A: . 100000:102896(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
      15:01:23.572237 IP B > A: . 102896:105792(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
      15:01:23.573639 IP B > A: . 105792:108688(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
      15:01:23.575024 IP B > A: . 108688:111584(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
      15:01:23.576408 IP B > A: . 111584:114480(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
      15:01:23.577793 IP B > A: . 114480:117376(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
      
      TCP timestamps show that most packets from B were queued in the same ms
      timeframe (TSval 1159799{3,4}), but FQ managed to send them right
      in time to avoid a big burst.
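      The pacing can be read directly off the trace. The sketch below takes four
      consecutive B > A segments from the capture above and computes the payload
      rate implied by their spacing; the function name is hypothetical and
      header overhead is ignored.

```python
from datetime import datetime

# (timestamp, payload bytes) for four consecutive B > A segments in the trace.
SEGMENTS = [
    ("15:01:23.554234", 2896),
    ("15:01:23.555620", 2896),
    ("15:01:23.557005", 2896),
    ("15:01:23.558390", 2896),
]


def paced_rate_bps(segments):
    """Average payload rate implied by the gaps between segment departures."""
    times = [datetime.strptime(ts, "%H:%M:%S.%f") for ts, _ in segments]
    elapsed = (times[-1] - times[0]).total_seconds()
    sent = sum(size for _, size in segments[:-1])   # bytes sent during elapsed
    return sent * 8 / elapsed


rate = paced_rate_bps(SEGMENTS)
print(f"{rate / 1e6:.1f} Mbit/s")
```

      The segments leave about 1.4 ms apart, which works out to roughly 17 Mbit/s
      of payload, the same order as the ~20Mbit bottleneck mentioned above.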
      
      In slow start or steady state, very few packets are throttled [1]
      
      FQ has the following tunables:
      
        limit : max number of packets on whole Qdisc (default 10000)
      
        flow_limit : max number of packets per flow (default 100)
      
        quantum : the credit per RR round (default is 2 MTU)
      
        initial_quantum : initial credit for new flows (default is 10 MTU)
      
        maxrate : max per flow rate (default : unlimited)
      
        buckets : number of RB trees (default : 1024) in hash table.
                     (consumes 8 bytes per bucket)
      
        [no]pacing : disable/enable pacing (default is enable)
      
      All of them can be changed on a live qdisc.
      
      $ tc qd add dev eth0 root fq help
      Usage: ... fq [ limit PACKETS ] [ flow_limit PACKETS ]
                    [ quantum BYTES ] [ initial_quantum BYTES ]
                    [ maxrate RATE  ] [ buckets NUMBER ]
                    [ [no]pacing ]
      
      $ tc -s -d qd
      qdisc fq 8002: dev eth0 root refcnt 32 limit 10000p flow_limit 100p buckets 256 quantum 3028 initial_quantum 15140
       Sent 216532416 bytes 148395 pkt (dropped 0, overlimits 0 requeues 14)
       backlog 0b 0p requeues 14
        511 flows, 511 inactive, 0 throttled
        110 gc, 0 highprio, 0 retrans, 1143 throttled, 0 flows_plimit
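      The quantum values in this output match the "2 MTU" and "10 MTU" defaults
      if one MTU unit is taken as a 1514-byte Ethernet frame. That frame size is
      an assumption inferred from the numbers shown, and the helper name is
      hypothetical.

```python
ETH_FRAME_BYTES = 1500 + 14   # assumed: IP MTU plus Ethernet header


def fq_default_quanta(frame_bytes=ETH_FRAME_BYTES):
    """Defaults as described above: 2 frames per round, 10 for a new flow."""
    return {"quantum": 2 * frame_bytes, "initial_quantum": 10 * frame_bytes}


print(fq_default_quanta())
```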
      
      [1] Except if initial srtt is overestimated, as if using
      cached srtt in tcp metrics. We'll provide a fix for this issue.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  14. 12 Jul, 2012 1 commit
  15. 04 Jul, 2012 1 commit
  16. 13 May, 2012 1 commit
    • fq_codel: Fair Queue Codel AQM · 4b549a2e
      Committed by Eric Dumazet
      Fair Queue Codel packet scheduler
      
      Principles :
      
      - Packets are classified (internal classifier or external) on flows.
      - This is a Stochastic model (as we use a hash, several flows might
                                    be hashed on same slot)
      - Each flow has a CoDel managed queue.
      - Flows are linked onto two (Round Robin) lists,
        so that new flows have priority on old ones.
      
      - For a given flow, packets are not reordered (CoDel uses a FIFO)
      - head drops only.
      - ECN capability is on by default.
      - Very low memory footprint (64 bytes per flow)
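      The two-list round robin can be sketched as follows. This is a simplified
      model and not the kernel implementation: it leaves out CoDel, drops and ECN
      entirely, uses flow_id modulo the table size as a stand-in for the flow
      hash, and all class and method names are hypothetical.

```python
from collections import deque


class Flow:
    def __init__(self):
        self.queue = deque()    # FIFO of (flow_id, length) packets
        self.deficit = 0
        self.listed = False     # True while on new_flows or old_flows


class TwoListScheduler:
    def __init__(self, nflows=1024, quantum=1514):
        self.flows = [Flow() for _ in range(nflows)]
        self.quantum = quantum
        self.new_flows = deque()
        self.old_flows = deque()

    def enqueue(self, flow_id, length):
        # flow_id % nflows stands in for the real (stochastic) flow hash.
        flow = self.flows[flow_id % len(self.flows)]
        flow.queue.append((flow_id, length))
        if not flow.listed:
            flow.listed = True
            flow.deficit = self.quantum
            self.new_flows.append(flow)

    def dequeue(self):
        while self.new_flows or self.old_flows:
            from_new = bool(self.new_flows)
            flows = self.new_flows if from_new else self.old_flows
            flow = flows[0]
            if flow.deficit <= 0:
                # Out of credit: refill and go to the back of old_flows.
                flow.deficit += self.quantum
                flows.popleft()
                self.old_flows.append(flow)
            elif not flow.queue:
                flows.popleft()
                if from_new and self.old_flows:
                    self.old_flows.append(flow)   # one more pass before leaving
                else:
                    flow.listed = False
            else:
                packet = flow.queue.popleft()
                flow.deficit -= packet[1]
                return packet
        return None


sched = TwoListScheduler()
for _ in range(5):
    sched.enqueue(1, 1514)                 # a bulk flow builds a backlog
first_two = [sched.dequeue(), sched.dequeue()]
sched.enqueue(2, 100)                      # a new, sparse flow shows up
print(sched.dequeue())
```

      The packet from the newly arrived flow is served ahead of the bulk flow's
      remaining backlog, which is the priority for new flows described above.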
      
      tc qdisc ... fq_codel [ limit PACKETS ] [ flows number ]
                            [ target TIME ] [ interval TIME ] [ noecn ]
                            [ quantum BYTES ]
      
      defaults : 1024 flows, 10240 packets limit, quantum : device MTU
                 target : 5ms (CoDel default)
                 interval : 100ms (CoDel default)
      
      Impressive results on load :
      
      class htb 1:1 root leaf 10: prio 0 quantum 1514 rate 200000Kbit ceil 200000Kbit burst 1475b/8 mpu 0b overhead 0b cburst 1475b/8 mpu 0b overhead 0b level 0
       Sent 43304920109 bytes 33063109 pkt (dropped 0, overlimits 0 requeues 0)
       rate 201691Kbit 28595pps backlog 0b 312p requeues 0
       lended: 33063109 borrowed: 0 giants: 0
       tokens: -912 ctokens: -912
      
      class fq_codel 10:1735 parent 10:
       (dropped 1292, overlimits 0 requeues 0)
       backlog 15140b 10p requeues 0
        deficit 1514 count 1 lastcount 1 ldelay 7.1ms
      class fq_codel 10:4524 parent 10:
       (dropped 1291, overlimits 0 requeues 0)
       backlog 16654b 11p requeues 0
        deficit 1514 count 1 lastcount 1 ldelay 7.1ms
      class fq_codel 10:4e74 parent 10:
       (dropped 1290, overlimits 0 requeues 0)
       backlog 6056b 4p requeues 0
        deficit 1514 count 1 lastcount 1 ldelay 6.4ms dropping drop_next 92.0ms
      class fq_codel 10:628a parent 10:
       (dropped 1289, overlimits 0 requeues 0)
       backlog 7570b 5p requeues 0
        deficit 1514 count 1 lastcount 1 ldelay 5.4ms dropping drop_next 90.9ms
      class fq_codel 10:a4b3 parent 10:
       (dropped 302, overlimits 0 requeues 0)
       backlog 16654b 11p requeues 0
        deficit 1514 count 1 lastcount 1 ldelay 7.1ms
      class fq_codel 10:c3c2 parent 10:
       (dropped 1284, overlimits 0 requeues 0)
       backlog 13626b 9p requeues 0
        deficit 1514 count 1 lastcount 1 ldelay 5.9ms
      class fq_codel 10:d331 parent 10:
       (dropped 299, overlimits 0 requeues 0)
       backlog 15140b 10p requeues 0
        deficit 1514 count 1 lastcount 1 ldelay 7.0ms
      class fq_codel 10:d526 parent 10:
       (dropped 12160, overlimits 0 requeues 0)
       backlog 35870b 211p requeues 0
        deficit 1508 count 12160 lastcount 1 ldelay 15.3ms dropping drop_next 247us
      class fq_codel 10:e2c6 parent 10:
       (dropped 1288, overlimits 0 requeues 0)
       backlog 15140b 10p requeues 0
        deficit 1514 count 1 lastcount 1 ldelay 7.1ms
      class fq_codel 10:eab5 parent 10:
       (dropped 1285, overlimits 0 requeues 0)
       backlog 16654b 11p requeues 0
        deficit 1514 count 1 lastcount 1 ldelay 5.9ms
      class fq_codel 10:f220 parent 10:
       (dropped 1289, overlimits 0 requeues 0)
       backlog 15140b 10p requeues 0
        deficit 1514 count 1 lastcount 1 ldelay 7.1ms
      
      qdisc htb 1: root refcnt 6 r2q 10 default 1 direct_packets_stat 0 ver 3.17
       Sent 43331086547 bytes 33092812 pkt (dropped 0, overlimits 66063544 requeues 71)
       rate 201697Kbit 28602pps backlog 0b 260p requeues 71
      qdisc fq_codel 10: parent 1:1 limit 10240p flows 65536 target 5.0ms interval 100.0ms ecn
       Sent 43331086547 bytes 33092812 pkt (dropped 949359, overlimits 0 requeues 0)
       rate 201697Kbit 28602pps backlog 189352b 260p requeues 0
        maxpacket 1514 drop_overlimit 0 new_flow_count 5582 ecn_mark 125593
        new_flows_len 0 old_flows_len 11
      
      PING 172.30.42.18 (172.30.42.18) 56(84) bytes of data.
      64 bytes from 172.30.42.18: icmp_req=1 ttl=64 time=0.227 ms
      64 bytes from 172.30.42.18: icmp_req=2 ttl=64 time=0.165 ms
      64 bytes from 172.30.42.18: icmp_req=3 ttl=64 time=0.166 ms
      64 bytes from 172.30.42.18: icmp_req=4 ttl=64 time=0.151 ms
      64 bytes from 172.30.42.18: icmp_req=5 ttl=64 time=0.164 ms
      64 bytes from 172.30.42.18: icmp_req=6 ttl=64 time=0.172 ms
      64 bytes from 172.30.42.18: icmp_req=7 ttl=64 time=0.175 ms
      64 bytes from 172.30.42.18: icmp_req=8 ttl=64 time=0.183 ms
      64 bytes from 172.30.42.18: icmp_req=9 ttl=64 time=0.158 ms
      64 bytes from 172.30.42.18: icmp_req=10 ttl=64 time=0.200 ms
      
      10 packets transmitted, 10 received, 0% packet loss, time 8999ms
      rtt min/avg/max/mdev = 0.151/0.176/0.227/0.022 ms
      
      Much better than SFQ because of the priority given to new flows, and
      because the fast path dirties fewer cache lines.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4b549a2e
  17. 11 May, 2012 1 commit
    • E
      codel: Controlled Delay AQM · 76e3cc12
      Eric Dumazet committed
      An implementation of CoDel AQM, from Kathleen Nichols and Van Jacobson.
      
      http://queue.acm.org/detail.cfm?id=2209336
      
      The main input of this AQM is no longer the queue size in bytes or
      packets, but the delay packets spend in the (FIFO) queue.
      
      As we don't have infinite memory, we can still drop packets in enqueue()
      in case of massive load, but the point of CoDel is to drop packets in
      dequeue(), using a control law based on two simple parameters:
      
      target : target sojourn time (default 5ms)
      interval : width of moving time window (default 100ms)
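
      As an illustration of how these two parameters interact, here is a
      minimal Python sketch of the control law as described in the
      Nichols/Jacobson article: dropping starts once the sojourn time has
      stayed above target for a full interval, and successive drops are then
      spaced by interval / sqrt(count). The function names are hypothetical
      and this is not the code in include/net/codel.h.

```python
import math

TARGET_MS = 5.0      # target sojourn time (default quoted above)
INTERVAL_MS = 100.0  # width of the moving time window (default quoted above)

def next_drop_time(now_ms, count, interval_ms=INTERVAL_MS):
    """Schedule the next drop: the gap shrinks as interval / sqrt(count)."""
    return now_ms + interval_ms / math.sqrt(count)

def should_start_dropping(sojourn_ms, above_since_ms, now_ms,
                          target_ms=TARGET_MS, interval_ms=INTERVAL_MS):
    """Enter the dropping state only after the sojourn time has stayed
    above target for at least one full interval."""
    if sojourn_ms < target_ms or above_since_ms is None:
        return False
    return now_ms - above_since_ms >= interval_ms
```

      With the defaults, the first drop is followed by another one 100 ms
      later, the fourth by one 50 ms later, and so on until the sojourn time
      falls back below target.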
      
      Based on initial work from Dave Taht.
      
      Refactored to ease future inclusion of codel as a plugin for other Linux
      qdiscs (FQ_CODEL, ...), like RED.
      
      include/net/codel.h contains the codel algorithm, kept as close as
      possible to Kathleen's reference.
      
      net/sched/sch_codel.c contains the linux qdisc specific glue.
      
      Separate structures permit a memory efficient implementation of fq_codel
      (to be sent as a separate work) : Each flow has its own struct
      codel_vars.
      
      Timestamps are taken at enqueue() time with 1024 ns precision, allowing
      a range of 2199 seconds in queue and support for 100Gb links. iproute2
      uses usec as base unit.
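
      The 2199 second figure is consistent with a 32-bit timestamp counted in
      1024 ns ticks and compared as a signed difference, which leaves 2^31
      usable ticks. The Python sketch below only checks that arithmetic; the
      32-bit width and the signed comparison are assumptions not stated in
      this message, and to_ticks() is a hypothetical helper.

```python
SHIFT = 10               # one tick = 2**10 ns = 1024 ns
TICK_NS = 1 << SHIFT

def to_ticks(time_ns):
    """Convert nanoseconds to 1024 ns ticks, truncated to 32 bits."""
    return (time_ns >> SHIFT) & 0xFFFFFFFF

# Assumed: time differences are compared as signed 32-bit values,
# so only 2**31 ticks are usable before the sign flips.
usable_range_s = (1 << 31) * TICK_NS / 1e9
print(f"usable range: about {usable_range_s:.0f} seconds")
```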
      
      Selected packets are dropped, unless ECN is enabled and the packet can
      be ECN-marked instead.
      
      Tested from 2Mb to 10Gb speeds with no particular problems, on ixgbe and
      tg3 drivers (BQL enabled).
      
      Usage: tc qdisc ... codel [ limit PACKETS ] [ target TIME ]
                                [ interval TIME ] [ ecn ]
      
      qdisc codel 10: parent 1:1 limit 2000p target 3.0ms interval 60.0ms ecn
       Sent 13347099587 bytes 8815805 pkt (dropped 0, overlimits 0 requeues 0)
       rate 202365Kbit 16708pps backlog 113550b 75p requeues 0
        count 116 lastcount 98 ldelay 4.3ms dropping drop_next 816us
        maxpacket 1514 ecn_mark 84399 drop_overlimit 0
      
      CoDel must be seen as a base module, and should be used keeping in mind
      there is still a FIFO queue. So a typical setup will probably need a
      hierarchy of several qdiscs and packet classifiers to be able to meet
      whatever constraints a user might have.
      
      One possible example would be to use fq_codel, which combines Fair
      Queueing and CoDel, as a replacement for sfq / sfq_red.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Dave Taht <dave.taht@bufferbloat.net>
      Cc: Kathleen Nichols <nichols@pollere.com>
      Cc: Van Jacobson <van@pollere.net>
      Cc: Tom Herbert <therbert@google.com>
      Cc: Matt Mathis <mattmathis@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Stephen Hemminger <shemminger@vyatta.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      76e3cc12
  18. 08 Feb, 2012 1 commit
    • S
      net/sched: sch_plug - Queue traffic until an explicit release command · c3059be1
      Shriram Rajagopalan committed
      The qdisc supports two operations - plug and unplug. When the
      qdisc receives a plug command via netlink request, packets arriving
      henceforth are buffered until a corresponding unplug command is received.
      Depending on the type of unplug command, the queue can be unplugged
      indefinitely or selectively.
      
      This qdisc can be used to implement output buffering, an essential
      functionality required for consistent recovery in checkpoint based
      fault-tolerance systems. Output buffering enables speculative execution
      by allowing generated network traffic to be rolled back. It is used to
      provide network protection for Xen Guests in the Remus high availability
      project, available as part of Xen.
      
      This module is generic enough to be used by any other system that wishes
      to add speculative execution and output buffering to its applications.
      
      This module was originally available in the linux 2.6.32 PV-OPS tree,
      used as dom0 for Xen.
      
      For more information, please refer to http://nss.cs.ubc.ca/remus/
      and http://wiki.xensource.com/xenwiki/Remus
      
      Changes in V3:
        * Removed debug output (printk) on queue overflow
        * Added TCQ_PLUG_RELEASE_INDEFINITE - that allows the user to
          use this qdisc, for simple plug/unplug operations.
        * Use of packet counts instead of pointers to keep track of
          the buffers in the queue.
      Signed-off-by: Shriram Rajagopalan <rshriram@cs.ubc.ca>
      Signed-off-by: Brendan Cully <brendan@cs.ubc.ca>
      [author of the code in the linux 2.6.32 pvops tree]
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c3059be1
  19. 05 Apr, 2011 1 commit
  20. 24 Feb, 2011 1 commit
    • E
      net_sched: SFB flow scheduler · e13e02a3
      Eric Dumazet committed
      This is the Stochastic Fair Blue scheduler, based on work from :
      
      W. Feng, D. Kandlur, D. Saha, K. Shin. Blue: A New Class of Active Queue
      Management Algorithms. U. Michigan CSE-TR-387-99, April 1999.
      
      http://www.thefengs.com/wuchang/blue/CSE-TR-387-99.pdf
      
      This implementation is based on work done by Juliusz Chroboczek
      
      General SFB algorithm can be found in figure 14, page 15:
      
      B[l][n] : L x N array of bins (L levels, N bins per level)
      enqueue()
      Calculate hash function values h{0}, h{1}, .. h{L-1}
      Update bins at each level
      for i = 0 to L - 1
         if (B[i][h{i}].qlen > bin_size)
            B[i][h{i}].p_mark += p_increment;
         else if (B[i][h{i}].qlen == 0)
            B[i][h{i}].p_mark -= p_decrement;
      p_min = min(B[0][h{0}].p_mark ... B[L-1][h{L-1}].p_mark);
      if (p_min == 1.0)
          ratelimit();
      else
          mark/drop with probability p_min;
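
      The pseudocode above can be turned into a small runnable model. This
      Python sketch is only an illustration: sfb_enqueue() and its dict-based
      bins are hypothetical, clamping p_mark to [0, 1] is my assumption, and
      qlen accounting on enqueue/dequeue is left out.

```python
import random

def sfb_enqueue(bins, hashes, bin_size, p_increment, p_decrement,
                rng=random.random):
    """bins[i][j] is a dict with 'qlen' and 'p_mark'; hashes[i] selects the
    bin used at level i. Returns 'ratelimit', 'mark' or 'enqueue'."""
    for level, h in enumerate(hashes):
        b = bins[level][h]
        if b["qlen"] > bin_size:
            # Bin is overloaded: raise its marking probability (clamped, assumed).
            b["p_mark"] = min(1.0, b["p_mark"] + p_increment)
        elif b["qlen"] == 0:
            # Bin is idle: lower its marking probability (clamped, assumed).
            b["p_mark"] = max(0.0, b["p_mark"] - p_decrement)
    # A flow is only penalised as much as its least loaded bin.
    p_min = min(bins[level][h]["p_mark"] for level, h in enumerate(hashes))
    if p_min >= 1.0:
        return "ratelimit"
    if rng() < p_min:
        return "mark"      # stands for mark/drop
    return "enqueue"
```

      Taking the minimum over all levels is what keeps a well-behaved flow
      from being punished when it shares a single bin with an aggressive one.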
      
      I adapted Juliusz's code to meet current kernel standards, and made
      various changes to address previous comments:
      
      http://thread.gmane.org/gmane.linux.network/90225
      http://thread.gmane.org/gmane.linux.network/90375
      
      Default flow classifier is the rxhash introduced by RPS in 2.6.35, but
      we can use an external flow classifier if wanted.
      
      tc qdisc add dev $DEV parent 1:11 handle 11:  \
              est 0.5sec 2sec sfb limit 128
      
      tc filter add dev $DEV protocol ip parent 11: handle 3 \
              flow hash keys dst divisor 1024
      
      Notes:
      
      1) The default SFB child qdisc is pfifo_fast. It can be replaced by
      another qdisc, but a child qdisc MUST NOT drop a previously queued
      packet. This is because SFB needs to handle a dequeued packet in order
      to maintain its virtual queue states. pfifo_head_drop or CHOKe should
      not be used.
      
      2) ECN is enabled by default, unlike RED/CHOKe/GRED
      
      With help from Patrick McHardy & Andi Kleen
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      CC: Juliusz Chroboczek <Juliusz.Chroboczek@pps.jussieu.fr>
      CC: Stephen Hemminger <shemminger@vyatta.com>
      CC: Patrick McHardy <kaber@trash.net>
      CC: Andi Kleen <andi@firstfloor.org>
      CC: John W. Linville <linville@tuxdriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e13e02a3
  21. 03 Feb, 2011 1 commit
    • S
      sched: CHOKe flow scheduler · 45e14433
      stephen hemminger committed
      CHOKe ("CHOose and Kill" or "CHOose and Keep") is an alternative
      packet scheduler based on the Random Early Detection (RED) algorithm.
      
      The core idea is:
        For every packet arrival:
          Calculate Qave
          if (Qave < minth)
               Queue the new packet
          else
               Select randomly a packet from the queue
               if (both packets from same flow)
               then Drop both the packets
               else if (Qave > maxth)
                    Drop packet
               else
                    Admit packet with probability p (same as RED)
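
      For readers who prefer something executable, the steps above can be
      sketched in Python as follows. This follows the pseudocode literally
      and is not the sch_choke code: choke_enqueue() is a hypothetical name,
      Qave is passed in rather than computed, and skipping the flow
      comparison when the queue is empty is my assumption.

```python
import random

def choke_enqueue(queue, pkt, qave, minth, maxth, p, rng=random):
    """queue is a list of (flow_id, payload) tuples and pkt is one such tuple.
    Returns True if pkt was admitted to the queue."""
    if qave < minth:
        queue.append(pkt)
        return True
    if queue:  # assumed: nothing to compare against when the queue is empty
        idx = rng.randrange(len(queue))
        if queue[idx][0] == pkt[0]:
            # Same flow: drop the queued victim and the arriving packet.
            del queue[idx]
            return False
    if qave > maxth:
        return False
    # Between the thresholds: admit with probability p, as the text states.
    if rng.random() < p:
        queue.append(pkt)
        return True
    return False
```

      A flow that occupies most of the queue is the most likely to be picked
      as the random victim, which is how the scheme penalises heavy flows
      without keeping per-flow state.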
      
      See also:
        Rong Pan, Balaji Prabhakar, Konstantinos Psounis, "CHOKe: a stateless active
         queue management scheme for approximating fair bandwidth allocation",
        Proceedings of INFOCOM'2000, March 2000.
      
      Help from:
           Eric Dumazet <eric.dumazet@gmail.com>
           Patrick McHardy <kaber@trash.net>
      Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      45e14433
  22. 20 Jan, 2011 1 commit
    • J
      net_sched: implement a root container qdisc sch_mqprio · b8970f0b
      John Fastabend committed
      This implements a mqprio queueing discipline that by default creates
      a pfifo_fast qdisc per tx queue and provides the needed configuration
      interface.
      
      Using the mqprio qdisc, the number of tcs currently in use along
      with the range of queues allotted to each class can be configured. By
      default skbs are mapped to traffic classes using the skb priority.
      This mapping is configurable.
      
      Configurable parameters,
      
      struct tc_mqprio_qopt {
      	__u8    num_tc;
      	__u8    prio_tc_map[TC_BITMASK + 1];
      	__u8    hw;
      	__u16   count[TC_MAX_QUEUE];
      	__u16   offset[TC_MAX_QUEUE];
      };
      
      Here the count/offset pairing gives the queue alignment and the
      prio_tc_map gives the mapping from skb->priority to tc.
      
      The hw bit determines if the hardware should configure the count
      and offset values. If the hardware bit is set then the operation
      will fail if the hardware does not implement the ndo_setup_tc
      operation. This is to avoid undetermined states where the hardware
      may or may not control the queue mapping. Also minimal bounds
      checking is done on the count/offset to verify a queue does not
      exceed num_tx_queues and that queue ranges do not overlap. Otherwise
      it is left to user policy or hardware configuration to create
      useful mappings.
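
      The bounds checking described above can be illustrated with a short
      Python sketch. It is a rough model rather than the sch_mqprio code:
      the function names are hypothetical, rejecting an empty range is my
      assumption, and the default mask of 15 assumes TC_BITMASK is 15.

```python
def validate_queue_ranges(num_tc, count, offset, num_tx_queues):
    """Return True if every traffic class has a queue range that fits in
    [0, num_tx_queues) and no two ranges overlap."""
    ranges = []
    for tc in range(num_tc):
        first = offset[tc]
        last = offset[tc] + count[tc] - 1
        # Assumed: an empty range is invalid, as is one past the last tx queue.
        if count[tc] == 0 or last >= num_tx_queues:
            return False
        ranges.append((first, last))
    ranges.sort()
    for (_, prev_last), (next_first, _) in zip(ranges, ranges[1:]):
        if next_first <= prev_last:
            return False
    return True

def tc_for_priority(prio_tc_map, skb_priority, mask=15):
    """Map an skb priority to a traffic class through prio_tc_map."""
    return prio_tc_map[skb_priority & mask]
```

      For example, two classes with count 4 each at offsets 0 and 4 are
      accepted on an 8-queue device, while offsets 0 and 2 are rejected
      because queues 2 and 3 would belong to both classes.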
      
      It is expected that hardware QOS schemes can be implemented by
      creating appropriate mappings of queues in ndo_setup_tc().
      
      One expected use case is drivers will use the ndo_setup_tc to map
      queue ranges onto 802.1Q traffic classes. This provides a generic
      mechanism to map network traffic onto these traffic classes and
      removes the need for lower layer drivers to know specifics about
      traffic types.
      Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b8970f0b
  23. 20 Aug, 2010 1 commit
    • G
      net/sched: add ACT_CSUM action to update packets checksums · eb4d4065
      Grégoire Baron committed
      ACT_CSUM can be called just after ACT_PEDIT in order to re-compute some
      altered checksums in IPv4 and IPv6 packets. The following checksums are
      supported by this patch:
       - IPv4: IPv4 header, ICMP, IGMP, TCP, UDP & UDPLite
       - IPv6: ICMPv6, TCP, UDP & UDPLite
      A single action can be asked to update several kinds of checksums,
      for example when the packet flow mixes TCP, UDP and UDPLite.
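
      As background, the IPv4 header checksum in the list above is the
      standard Internet checksum (RFC 1071). The Python sketch below shows
      that computation on its own; it is not taken from the ACT_CSUM code
      and internet_checksum() is a hypothetical helper name.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement sum of 16-bit big-endian words (RFC 1071)."""
    if len(data) % 2:
        data += b"\x00"          # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:           # fold the carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

# IPv4 header with the checksum field (bytes 10-11) zeroed before computing.
header = bytes.fromhex("450000730000400040110000c0a80001c0a800c7")
checksum = internet_checksum(header)
```

      Recomputing over a header whose checksum field is already correct
      yields 0, which is the usual way to verify it.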
      
      A usage example is given in the associated iproute2 patch.
      
      Version 3 changes:
       - remove useless goto instructions
       - improve IPv6 hop options decoding
      
      Version 2 changes:
       - coding style correction
       - remove useless arguments of some functions
       - use stack in tcf_csum_dump()
       - add tcf_csum_skb_nextlayer() to factor code
      Signed-off-by: Gregoire Baron <baronchon@n7mm.org>
      Acked-by: jamal <hadi@cyberus.ca>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      eb4d4065
  24. 06 Sep, 2009 1 commit
    • D
      net_sched: add classful multiqueue dummy scheduler · 6ec1c69a
      David S. Miller committed
      This patch adds a classful dummy scheduler which can be used as root qdisc
      for multiqueue devices and exposes each device queue as a child class.
      
      This allows queues to be addressed individually and grafted like regular
      classes. Additionally, it presents an accumulated view of the statistics of
      all real root qdiscs in the dummy root.
      
      Two new callbacks are added to the qdisc_ops and qdisc_class_ops:
      
      - cl_ops->select_queue selects the tx queue number for new child classes.
      
      - qdisc_ops->attach() overrides root qdisc device grafting to attach
        non-shared qdiscs to the queues.
      Signed-off-by: Patrick McHardy <kaber@trash.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6ec1c69a
  25. 20 Nov, 2008 1 commit
  26. 08 Nov, 2008 1 commit
    • T
      pkt_sched: Control group classifier · f4009237
      Thomas Graf committed
      The classifier should cover the most common use case and will work
      without any special configuration.
      
      The principle of the classifier is to directly access the
      task_struct via get_current(). In order for this to work,
      classification requests from softirqs must be ignored. This is
      not a problem because the vast majority of packets in softirq
      context are not assigned to a task anyway. For this to work, a
      mechanism is needed to trace softirq context. 
      
      This repost goes back to the method of relying on the number of
      nested bh disable calls for the sake of not adding too much
      complexity and the option to come up with something more reliable
      if actually needed.
      Signed-off-by: Thomas Graf <tgraf@suug.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f4009237
  27. 13 Sep, 2008 2 commits
  28. 01 Feb, 2008 1 commit
    • P
      [NET_SCHED]: Add flow classifier · e5dfb815
      Patrick McHardy committed
      Add new "flow" classifier, which is meant to extend the SFQ hashing
      capabilities without hard-coding new hash functions and also allows
      deterministic mappings of keys to classes, replacing some out of tree
      iptables patches like IPCLASSIFY (maps IPs to classes), IPMARK (maps
      IPs to marks, with fw filters to classes), ...
      
      Some examples:
      
      - Classic SFQ hash:
      
        tc filter add ... flow hash \
        	keys src,dst,proto,proto-src,proto-dst divisor 1024
      
      - Classic SFQ hash, but using information from conntrack to work properly in
        combination with NAT:
      
        tc filter add ... flow hash \
        	keys nfct-src,nfct-dst,proto,nfct-proto-src,nfct-proto-dst divisor 1024
      
      - Map destination IPs of 192.168.0.0/24 to classids 1-257:
      
        tc filter add ... flow map \
        	key dst addend -192.168.0.0 divisor 256
      
      - alternatively:
      
        tc filter add ... flow map \
        	key dst and 0xff
      
      - similar, but reverse ordered:
      
        tc filter add ... flow map \
        	key dst and 0xff xor 0xff
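
      To make the three "flow map" examples concrete, here is a small Python
      sketch of the arithmetic they suggest. The order of operations (and,
      then xor, then addend, then divisor) and the 32-bit wraparound are
      assumptions on my part, flow_map() and ip() are hypothetical helpers,
      and the result is only the offset that the classifier would add to its
      base class.

```python
import ipaddress

U32 = 0xFFFFFFFF

def flow_map(key, mask=U32, xor=0, addend=0, divisor=0):
    """Apply and/xor, then the addend, then the optional divisor,
    all in 32-bit unsigned arithmetic (assumed order)."""
    value = ((key & mask) ^ xor) & U32
    value = (value + addend) & U32
    if divisor:
        value %= divisor
    return value

def ip(text):
    """Dotted-quad IPv4 address to a 32-bit integer."""
    return int(ipaddress.IPv4Address(text))

dst = ip("192.168.0.5")
by_addend = flow_map(dst, addend=-ip("192.168.0.0"), divisor=256)
by_mask = flow_map(dst, mask=0xFF)
reversed_order = flow_map(dst, mask=0xFF, xor=0xFF)
```

      Under these assumptions the first two variants map 192.168.0.5 to the
      same offset, and the xor variant counts down from the top of the range
      instead.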
      
      Perturbation is currently not supported because we can't reliably kill the
      timer on destruction.
      Signed-off-by: Patrick McHardy <kaber@trash.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e5dfb815
  29. 11 Oct, 2007 1 commit
    • H
      [PKT_SCHED]: Add stateless NAT · b4219952
      Herbert Xu committed
      Stateless NAT is useful in controlled environments where restrictions are
      placed on through traffic such that we don't need connection tracking to
      correctly NAT protocol-specific data.
      
      In particular, this is of interest when the number of flows or the number
      of addresses being NATed is large, or if connection tracking information
      has to be replicated and where it is not practical to do so.
      
      Previously we had stateless NAT functionality which was integrated into
      the IPv4 routing subsystem.  This was a great solution as long as the NAT
      worked on a subnet to subnet basis such that the number of NAT rules was
      relatively small.  The reason is that for SNAT the routing based system
      had to perform a linear scan through the rules.
      
      If the number of rules is large then major renovations would have to take
      place in the routing subsystem to make this practical.
      
      For the time being, the least intrusive way of achieving this is to use
      the u32 classifier written by Alexey Kuznetsov along with the actions
      infrastructure implemented by Jamal Hadi Salim.
      
      The following patch is an attempt at this problem by creating a new nat
      action that can be invoked from u32 hash tables, which would allow a large
      number of stateless NAT rules that can be used/updated in constant time.
      
      The actual NAT code is mostly based on the previous stateless NAT code
      written by Alexey.  In future we might be able to utilise the protocol
      NAT code from netfilter to improve support for other protocols.
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b4219952
  30. 15 Jul, 2007 1 commit
  31. 27 Mar, 2007 1 commit
  32. 03 Dec, 2006 1 commit
  33. 10 Jan, 2006 1 commit
  34. 06 Jul, 2005 1 commit
  35. 24 Jun, 2005 1 commit
  36. 25 Apr, 2005 1 commit
  37. 17 Apr, 2005 1 commit
    • L
      Linux-2.6.12-rc2 · 1da177e4
      Linus Torvalds committed
      Initial git repository build. I'm not bothering with the full history,
      even though we have it. We can create a separate "historical" git
      archive of that later if we want to, and in the meantime it's about
      3.2GB when imported into git - space that would just make the early
      git days unnecessarily complicated, when we don't have a lot of good
      infrastructure for it.
      
      Let it rip!
      1da177e4