1. 04 9月, 2013 3 次提交
  2. 03 9月, 2013 1 次提交
  3. 01 9月, 2013 6 次提交
  4. 31 8月, 2013 2 次提交
  5. 30 8月, 2013 9 次提交
    • J
      drivers:net: Convert dma_alloc_coherent(...__GFP_ZERO) to dma_zalloc_coherent · ede23fa8
      Joe Perches 提交于
      __GFP_ZERO is an uncommon flag and perhaps is better
      not used.  static inline dma_zalloc_coherent exists
      so convert the uses of dma_alloc_coherent with __GFP_ZERO
      to the more common kernel style with zalloc.
      
      Remove memset from the static inline dma_zalloc_coherent
      and add just one use of __GFP_ZERO instead.
      
      Trivially reduces the size of the existing uses of
      dma_zalloc_coherent.
      
      Realign arguments as appropriate.
      Signed-off-by: NJoe Perches <joe@perches.com>
      Acked-by: NNeil Horman <nhorman@tuxdriver.com>
      Acked-by: NJesse Brandeburg <jesse.brandeburg@intel.com>
      Acked-by: NJeff Kirsher <jeffrey.t.kirsher@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ede23fa8
    • E
      pkt_sched: fq: Fair Queue packet scheduler · afe4fd06
      Eric Dumazet 提交于
      - Uses perfect flow match (not stochastic hash like SFQ/FQ_codel)
      - Uses the new_flow/old_flow separation from FQ_codel
      - New flows get an initial credit allowing IW10 without added delay.
      - Special FIFO queue for high prio packets (no need for PRIO + FQ)
      - Uses a hash table of RB trees to locate the flows at enqueue() time
      - Smart on demand gc (at enqueue() time, RB tree lookup evicts old
        unused flows)
      - Dynamic memory allocations.
      - Designed to allow millions of concurrent flows per Qdisc.
      - Small memory footprint : ~8K per Qdisc, and 104 bytes per flow.
      - Single high resolution timer for throttled flows (if any).
      - One RB tree to link throttled flows.
      - Ability to have a max rate per flow. We might add a socket option
        to add per socket limitation.
      
      Attempts have been made to add TCP pacing in TCP stack, but this
      seems to add complex code to an already complex stack.
      
      TCP pacing is welcomed for flows having idle times, as the cwnd
      permits TCP stack to queue a possibly large number of packets.
      
      This removes the 'slow start after idle' choice, hitting badly
      large BDP flows, and applications delivering chunks of data
      as video streams.
      
      Nicely spaced packets :
      Here interface is 10Gbit, but flow bottleneck is ~20Mbit
      
      cwin is big, yet FQ avoids the typical bursts generated by TCP
      (as in netperf TCP_RR -- -r 100000,100000)
      
      15:01:23.545279 IP A > B: . 78193:81089(2896) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805>
      15:01:23.545394 IP B > A: . ack 81089 win 3668 <nop,nop,timestamp 11597985 1115>
      15:01:23.546488 IP A > B: . 81089:83985(2896) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805>
      15:01:23.546565 IP B > A: . ack 83985 win 3668 <nop,nop,timestamp 11597986 1115>
      15:01:23.547713 IP A > B: . 83985:86881(2896) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805>
      15:01:23.547778 IP B > A: . ack 86881 win 3668 <nop,nop,timestamp 11597987 1115>
      15:01:23.548911 IP A > B: . 86881:89777(2896) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805>
      15:01:23.548949 IP B > A: . ack 89777 win 3668 <nop,nop,timestamp 11597988 1115>
      15:01:23.550116 IP A > B: . 89777:92673(2896) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805>
      15:01:23.550182 IP B > A: . ack 92673 win 3668 <nop,nop,timestamp 11597989 1115>
      15:01:23.551333 IP A > B: . 92673:95569(2896) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805>
      15:01:23.551406 IP B > A: . ack 95569 win 3668 <nop,nop,timestamp 11597991 1115>
      15:01:23.552539 IP A > B: . 95569:98465(2896) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805>
      15:01:23.552576 IP B > A: . ack 98465 win 3668 <nop,nop,timestamp 11597992 1115>
      15:01:23.553756 IP A > B: . 98465:99913(1448) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805>
      15:01:23.554138 IP A > B: P 99913:100001(88) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805>
      15:01:23.554204 IP B > A: . ack 100001 win 3668 <nop,nop,timestamp 11597993 1115>
      15:01:23.554234 IP B > A: . 65248:68144(2896) ack 100001 win 3668 <nop,nop,timestamp 11597993 1115>
      15:01:23.555620 IP B > A: . 68144:71040(2896) ack 100001 win 3668 <nop,nop,timestamp 11597993 1115>
      15:01:23.557005 IP B > A: . 71040:73936(2896) ack 100001 win 3668 <nop,nop,timestamp 11597993 1115>
      15:01:23.558390 IP B > A: . 73936:76832(2896) ack 100001 win 3668 <nop,nop,timestamp 11597993 1115>
      15:01:23.559773 IP B > A: . 76832:79728(2896) ack 100001 win 3668 <nop,nop,timestamp 11597993 1115>
      15:01:23.561158 IP B > A: . 79728:82624(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
      15:01:23.562543 IP B > A: . 82624:85520(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
      15:01:23.563928 IP B > A: . 85520:88416(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
      15:01:23.565313 IP B > A: . 88416:91312(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
      15:01:23.566698 IP B > A: . 91312:94208(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
      15:01:23.568083 IP B > A: . 94208:97104(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
      15:01:23.569467 IP B > A: . 97104:100000(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
      15:01:23.570852 IP B > A: . 100000:102896(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
      15:01:23.572237 IP B > A: . 102896:105792(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
      15:01:23.573639 IP B > A: . 105792:108688(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
      15:01:23.575024 IP B > A: . 108688:111584(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
      15:01:23.576408 IP B > A: . 111584:114480(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
      15:01:23.577793 IP B > A: . 114480:117376(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
      
      TCP timestamps show that most packets from B were queued in the same ms
      timeframe (TSval 1159799{3,4}), but FQ managed to send them right
      in time to avoid a big burst.
      
      In slow start or steady state, very few packets are throttled [1]
      
      FQ gets a bunch of tunables as :
      
        limit : max number of packets on whole Qdisc (default 10000)
      
        flow_limit : max number of packets per flow (default 100)
      
        quantum : the credit per RR round (default is 2 MTU)
      
        initial_quantum : initial credit for new flows (default is 10 MTU)
      
        maxrate : max per flow rate (default : unlimited)
      
        buckets : number of RB trees (default : 1024) in hash table.
                     (consumes 8 bytes per bucket)
      
        [no]pacing : disable/enable pacing (default is enable)
      
      All of them can be changed on a live qdisc.
      
      $ tc qd add dev eth0 root fq help
      Usage: ... fq [ limit PACKETS ] [ flow_limit PACKETS ]
                    [ quantum BYTES ] [ initial_quantum BYTES ]
                    [ maxrate RATE  ] [ buckets NUMBER ]
                    [ [no]pacing ]
      
      $ tc -s -d qd
      qdisc fq 8002: dev eth0 root refcnt 32 limit 10000p flow_limit 100p buckets 256 quantum 3028 initial_quantum 15140
       Sent 216532416 bytes 148395 pkt (dropped 0, overlimits 0 requeues 14)
       backlog 0b 0p requeues 14
        511 flows, 511 inactive, 0 throttled
        110 gc, 0 highprio, 0 retrans, 1143 throttled, 0 flows_plimit
      
      [1] Except if initial srtt is overestimated, as if using
      cached srtt in tcp metrics. We'll provide a fix for this issue.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      afe4fd06
    • O
      can: gw: add a per rule limitation of frame hops · 391ac128
      Oliver Hartkopp 提交于
      Usually the received CAN frames can be processed/routed as much as 'max_hops'
      times (which is given at module load time of the can-gw module).
      Introduce a new configuration option to reduce the number of possible hops
      for a specific gateway rule to a value smaller then max_hops.
      Signed-off-by: NOliver Hartkopp <socketcan@hartkopp.net>
      Signed-off-by: NMarc Kleine-Budde <mkl@pengutronix.de>
      391ac128
    • D
      net: packet: add randomized fanout scheduler · 5df0ddfb
      Daniel Borkmann 提交于
      We currently allow for different fanout scheduling policies in pf_packet
      such as scheduling by skb's rxhash, round-robin, by cpu, and rollover.
      Also allow for a random, equidistributed selection of the socket from the
      fanout process group.
      Signed-off-by: NDaniel Borkmann <dborkman@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5df0ddfb
    • V
      net: add netdev_for_each_upper_dev_rcu() · 8b5be856
      Veaceslav Falico 提交于
      The new macro netdev_for_each_upper_dev_rcu(dev, upper, iter) iterates
      through the dev->upper_dev_list starting from the first element, using
      the netdev_upper_get_next_dev_rcu(dev, &iter).
      
      Must be called under RCU read lock.
      
      CC: "David S. Miller" <davem@davemloft.net>
      CC: Eric Dumazet <edumazet@google.com>
      CC: Jiri Pirko <jiri@resnulli.us>
      CC: Alexander Duyck <alexander.h.duyck@intel.com>
      CC: Cong Wang <amwang@redhat.com>
      Signed-off-by: NVeaceslav Falico <vfalico@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8b5be856
    • V
      net: add lower_dev_list to net_device and make a full mesh · 5d261913
      Veaceslav Falico 提交于
      This patch adds lower_dev_list list_head to net_device, which is the same
      as upper_dev_list, only for lower devices, and begins to use it in the same
      way as the upper list.
      
      It also changes the way the whole adjacent device lists work - now they
      contain *all* of upper/lower devices, not only the first level. The first
      level devices are distinguished by the bool neighbour field in
      netdev_adjacent, also added by this patch.
      
      There are cases when a device can be added several times to the adjacent
      list, the simplest would be:
      
           /---- eth0.10 ---\
      eth0-		       --- bond0
           \---- eth0.20 ---/
      
      where both bond0 and eth0 'see' each other in the adjacent lists two times.
      To avoid duplication of netdev_adjacent structures ref_nr is being kept as
      the number of times the device was added to the list.
      
      The 'full view' is achieved by adding, on link creation, all of the
      upper_dev's upper_dev_list devices as upper devices to all of the
      lower_dev's lower_dev_list devices (and to the lower_dev itself), and vice
      versa. On unlink they are removed using the same logic.
      
      I've tested it with thousands vlans/bonds/bridges, everything works ok and
      no observable lags even on a huge number of interfaces.
      
      Memory footprint for 128 devices interconnected with each other via both
      upper and lower (which is impossible, but for the comparison) lists would be:
      
      128*128*2*sizeof(netdev_adjacent) = 1.5MB
      
      but in the real world we usualy have at most several devices with slaves
      and a lot of vlans, so the footprint will be much lower.
      
      CC: "David S. Miller" <davem@davemloft.net>
      CC: Eric Dumazet <edumazet@google.com>
      CC: Jiri Pirko <jiri@resnulli.us>
      CC: Alexander Duyck <alexander.h.duyck@intel.com>
      CC: Cong Wang <amwang@redhat.com>
      Signed-off-by: NVeaceslav Falico <vfalico@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5d261913
    • E
      tcp: TSO packets automatic sizing · 95bd09eb
      Eric Dumazet 提交于
      After hearing many people over past years complaining against TSO being
      bursty or even buggy, we are proud to present automatic sizing of TSO
      packets.
      
      One part of the problem is that tcp_tso_should_defer() uses an heuristic
      relying on upcoming ACKS instead of a timer, but more generally, having
      big TSO packets makes little sense for low rates, as it tends to create
      micro bursts on the network, and general consensus is to reduce the
      buffering amount.
      
      This patch introduces a per socket sk_pacing_rate, that approximates
      the current sending rate, and allows us to size the TSO packets so
      that we try to send one packet every ms.
      
      This field could be set by other transports.
      
      Patch has no impact for high speed flows, where having large TSO packets
      makes sense to reach line rate.
      
      For other flows, this helps better packet scheduling and ACK clocking.
      
      This patch increases performance of TCP flows in lossy environments.
      
      A new sysctl (tcp_min_tso_segs) is added, to specify the
      minimal size of a TSO packet (default being 2).
      
      A follow-up patch will provide a new packet scheduler (FQ), using
      sk_pacing_rate as an input to perform optional per flow pacing.
      
      This explains why we chose to set sk_pacing_rate to twice the current
      rate, allowing 'slow start' ramp up.
      
      sk_pacing_rate = 2 * cwnd * mss / srtt
      
      v2: Neal Cardwell reported a suspect deferring of last two segments on
      initial write of 10 MSS, I had to change tcp_tso_should_defer() to take
      into account tp->xmit_size_goal_segs
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Van Jacobson <vanj@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Acked-by: NYuchung Cheng <ycheng@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      95bd09eb
    • H
      ipv6: drop fragmented ndisc packets by default (RFC 6980) · b800c3b9
      Hannes Frederic Sowa 提交于
      This patch implements RFC6980: Drop fragmented ndisc packets by
      default. If a fragmented ndisc packet is received the user is informed
      that it is possible to disable the check.
      
      Cc: Fernando Gont <fernando@gont.com.ar>
      Cc: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
      Signed-off-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b800c3b9
    • D
      net: sctp: reorder sctp_globals to reduce cacheline usage · 76bfd898
      Daniel Borkmann 提交于
      Reduce cacheline usage from 2 to 1 cacheline for sctp_globals structure. By
      reordering elements, we can close gaps and simply achieve the following:
      
      Current situation:
        /* size: 80, cachelines: 2, members: 10 */
        /* sum members: 57, holes: 4, sum holes: 16 */
        /* padding: 7 */
        /* last cacheline: 16 bytes */
      
      Afterwards:
        /* size: 64, cachelines: 1, members: 10 */
        /* padding: 7 */
      Signed-off-by: NDaniel Borkmann <dborkman@redhat.com>
      Acked-by: NNeil Horman <nhorman@tuxdriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      76bfd898
  6. 28 8月, 2013 4 次提交
  7. 27 8月, 2013 4 次提交
  8. 24 8月, 2013 2 次提交
  9. 23 8月, 2013 6 次提交
  10. 22 8月, 2013 3 次提交
    • P
      tun: Get skfilter layout · 76975e9c
      Pavel Emelyanov 提交于
      The only thing we may have from tun device is the fprog, whic contains
      the number of filter elements and a pointer to (user-space) memory
      where the elements are. The program itself may not be available if the
      device is persistent and detached.
      Signed-off-by: NPavel Emelyanov <xemul@parallels.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      76975e9c
    • P
      tun: Allow to skip filter on attach · 849c9b6f
      Pavel Emelyanov 提交于
      There's a small problem with sk-filters on tun devices. Consider
      an application doing this sequence of steps:
      
      fd = open("/dev/net/tun");
      ioctl(fd, TUNSETIFF, { .ifr_name = "tun0" });
      ioctl(fd, TUNATTACHFILTER, &my_filter);
      ioctl(fd, TUNSETPERSIST, 1);
      close(fd);
      
      At that point the tun0 will remain in the system and will keep in
      mind that there should be a socket filter at address '&my_filter'.
      
      If after that we do
      
      fd = open("/dev/net/tun");
      ioctl(fd, TUNSETIFF, { .ifr_name = "tun0" });
      
      we most likely receive the -EFAULT error, since tun_attach() would
      try to connect the filter back. But (!) if we provide a filter at
      address &my_filter, then tun0 will be created and the "new" filter
      would be attached, but application may not know about that.
      
      This may create certain problems to anyone using tun-s, but it's
      critical problem for c/r -- if we meet a persistent tun device
      with a filter in mind, we will not be able to attach to it to dump
      its state (flags, owner, address, vnethdr size, etc.).
      
      The proposal is to allow to attach to tun device (with TUNSETIFF)
      w/o attaching the filter to the tun-file's socket. After this
      attach app may e.g clean the device by dropping the filter, it
      doesn't want to have one, or (in case of c/r) get information
      about the device with tun ioctls.
      Signed-off-by: NPavel Emelyanov <xemul@parallels.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      849c9b6f
    • P
      tun: Add ability to create tun device with given index · fb7589a1
      Pavel Emelyanov 提交于
      Tun devices cannot be created with ifidex user wants, but it's
      required by checkpoint-restore project.
      
      Long time ago such ability was implemented for rtnl_ops-based
      interface for creating links (9c7dafbf net: Allow to create links
      with given ifindex), but the only API for creating and managing
      tuntap devices is ioctl-based and is evolving with adding new ones
      (cde8b15f tuntap: add ioctl to attach or detach a file form tuntap
      device).
      
      Following that trend, here's how a new ioctl that sets the ifindex
      for device, that _will_ be created by TUNSETIFF ioctl looks like.
      So those who want a tuntap device with the ifindex N, should open
      the tun device, call ioctl(fd, TUNSETIFINDEX, &N), then call TUNSETIFF.
      If the index N is busy, then the register_netdev will find this out
      and the ioctl would be failed with -EBUSY.
      
      If setifindex is not called, then it will be generated as before.
      Signed-off-by: NPavel Emelyanov <xemul@parallels.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      fb7589a1