1. 28 8月, 2013 2 次提交
  2. 27 8月, 2013 1 次提交
  3. 24 8月, 2013 1 次提交
    • A
      openvswitch: Mega flow implementation · 03f0d916
      Andy Zhou 提交于
      Add wildcarded flow support in kernel datapath.
      
      Wildcarded flow can improve OVS flow set up performance by avoid sending
      matching new flows to the user space program. The exact performance boost
      will largely dependent on wildcarded flow hit rate.
      
      In case all new flows hits wildcard flows, the flow set up rate is
      within 5% of that of linux bridge module.
      
      Pravin has made significant contributions to this patch. Including API
      clean ups and bug fixes.
      Signed-off-by: NPravin B Shelar <pshelar@nicira.com>
      Signed-off-by: NAndy Zhou <azhou@nicira.com>
      Signed-off-by: NJesse Gross <jesse@nicira.com>
      03f0d916
  4. 23 8月, 2013 1 次提交
    • S
      ipv4: expose IPV4_DEVCONF · 4a5a8aa6
      stephen hemminger 提交于
      IP sends device configuration (see inet_fill_link_af) as an array
      in the netlink information, but the indices in that array are not
      exposed to userspace through any current santized header file.
      
      It was available back in 2.6.32 (in /usr/include/linux/sysctl.h)
      but was broken by:
        commit 02291680
        Author: Eric W. Biederman <ebiederm@xmission.com>
        Date:   Sun Feb 14 03:25:51 2010 +0000
      
          net ipv4: Decouple ipv4 interface parameters from binary sysctl numbers
      
      Eric was solving the sysctl problem but then the indices were re-exposed
      by a later addition of devconf support for IPV4
      
        commit 9f0f7272
        Author: Thomas Graf <tgraf@infradead.org>
        Date:   Tue Nov 16 04:32:48 2010 +0000
      
          ipv4: AF_INET link address family
      
      Putting them in /usr/include/linux/ip.h seemed the logical match
      for the DEVCONF_ definitions for IPV6 in /usr/include/linux/ip6.h
      Signed-off-by: NStephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4a5a8aa6
  5. 22 8月, 2013 3 次提交
    • P
      tun: Get skfilter layout · 76975e9c
      Pavel Emelyanov 提交于
      The only thing we may have from tun device is the fprog, whic contains
      the number of filter elements and a pointer to (user-space) memory
      where the elements are. The program itself may not be available if the
      device is persistent and detached.
      Signed-off-by: NPavel Emelyanov <xemul@parallels.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      76975e9c
    • P
      tun: Allow to skip filter on attach · 849c9b6f
      Pavel Emelyanov 提交于
      There's a small problem with sk-filters on tun devices. Consider
      an application doing this sequence of steps:
      
      fd = open("/dev/net/tun");
      ioctl(fd, TUNSETIFF, { .ifr_name = "tun0" });
      ioctl(fd, TUNATTACHFILTER, &my_filter);
      ioctl(fd, TUNSETPERSIST, 1);
      close(fd);
      
      At that point the tun0 will remain in the system and will keep in
      mind that there should be a socket filter at address '&my_filter'.
      
      If after that we do
      
      fd = open("/dev/net/tun");
      ioctl(fd, TUNSETIFF, { .ifr_name = "tun0" });
      
      we most likely receive the -EFAULT error, since tun_attach() would
      try to connect the filter back. But (!) if we provide a filter at
      address &my_filter, then tun0 will be created and the "new" filter
      would be attached, but application may not know about that.
      
      This may create certain problems to anyone using tun-s, but it's
      critical problem for c/r -- if we meet a persistent tun device
      with a filter in mind, we will not be able to attach to it to dump
      its state (flags, owner, address, vnethdr size, etc.).
      
      The proposal is to allow to attach to tun device (with TUNSETIFF)
      w/o attaching the filter to the tun-file's socket. After this
      attach app may e.g clean the device by dropping the filter, it
      doesn't want to have one, or (in case of c/r) get information
      about the device with tun ioctls.
      Signed-off-by: NPavel Emelyanov <xemul@parallels.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      849c9b6f
    • P
      tun: Add ability to create tun device with given index · fb7589a1
      Pavel Emelyanov 提交于
      Tun devices cannot be created with ifidex user wants, but it's
      required by checkpoint-restore project.
      
      Long time ago such ability was implemented for rtnl_ops-based
      interface for creating links (9c7dafbf net: Allow to create links
      with given ifindex), but the only API for creating and managing
      tuntap devices is ioctl-based and is evolving with adding new ones
      (cde8b15f tuntap: add ioctl to attach or detach a file form tuntap
      device).
      
      Following that trend, here's how a new ioctl that sets the ifindex
      for device, that _will_ be created by TUNSETIFF ioctl looks like.
      So those who want a tuntap device with the ifindex N, should open
      the tun device, call ioctl(fd, TUNSETIFINDEX, &N), then call TUNSETIFF.
      If the index N is busy, then the register_netdev will find this out
      and the ioctl would be failed with -EBUSY.
      
      If setifindex is not called, then it will be generated as before.
      Signed-off-by: NPavel Emelyanov <xemul@parallels.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      fb7589a1
  6. 20 8月, 2013 1 次提交
  7. 15 8月, 2013 1 次提交
    • J
      net_sched: restore "linklayer atm" handling · 8a8e3d84
      Jesper Dangaard Brouer 提交于
      commit 56b765b7 ("htb: improved accuracy at high rates")
      broke the "linklayer atm" handling.
      
       tc class add ... htb rate X ceil Y linklayer atm
      
      The linklayer setting is implemented by modifying the rate table
      which is send to the kernel.  No direct parameter were
      transferred to the kernel indicating the linklayer setting.
      
      The commit 56b765b7 ("htb: improved accuracy at high rates")
      removed the use of the rate table system.
      
      To keep compatible with older iproute2 utils, this patch detects
      the linklayer by parsing the rate table.  It also supports future
      versions of iproute2 to send this linklayer parameter to the
      kernel directly. This is done by using the __reserved field in
      struct tc_ratespec, to convey the choosen linklayer option, but
      only using the lower 4 bits of this field.
      
      Linklayer detection is limited to speeds below 100Mbit/s, because
      at high rates the rtab is gets too inaccurate, so bad that
      several fields contain the same values, this resembling the ATM
      detect.  Fields even start to contain "0" time to send, e.g. at
      1000Mbit/s sending a 96 bytes packet cost "0", thus the rtab have
      been more broken than we first realized.
      Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8a8e3d84
  8. 14 8月, 2013 4 次提交
  9. 13 8月, 2013 1 次提交
  10. 10 8月, 2013 1 次提交
  11. 09 8月, 2013 1 次提交
    • E
      net: add SNMP counters tracking incoming ECN bits · 1f07d03e
      Eric Dumazet 提交于
      With GRO/LRO processing, there is a problem because Ip[6]InReceives SNMP
      counters do not count the number of frames, but number of aggregated
      segments.
      
      Its probably too late to change this now.
      
      This patch adds four new counters, tracking number of frames, regardless
      of LRO/GRO, and on a per ECN status basis, for IPv4 and IPv6.
      
      Ip[6]NoECTPkts : Number of packets received with NOECT
      Ip[6]ECT1Pkts  : Number of packets received with ECT(1)
      Ip[6]ECT0Pkts  : Number of packets received with ECT(0)
      Ip[6]CEPkts    : Number of packets received with Congestion Experienced
      
      lph37:~# nstat | egrep "Pkts|InReceive"
      IpInReceives                    1634137            0.0
      Ip6InReceives                   3714107            0.0
      Ip6InNoECTPkts                  19205              0.0
      Ip6InECT0Pkts                   52651828           0.0
      IpExtInNoECTPkts                33630              0.0
      IpExtInECT0Pkts                 15581379           0.0
      IpExtInCEPkts                   6                  0.0
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1f07d03e
  12. 04 8月, 2013 1 次提交
    • S
      fib_rules: fix suppressor names and default values · 73f5698e
      Stefan Tomanek 提交于
      This change brings the suppressor attribute names into line; it also changes
      the data types to provide a more consistent interface.
      
      While -1 indicates that the suppressor is not enabled, values >= 0 for
      suppress_prefixlen or suppress_ifgroup  reject routing decisions violating the
      constraint.
      
      This changes the previously presented behaviour of suppress_prefixlen, where a
      prefix length _less_ than the attribute value was rejected. After this change,
      a prefix length less than *or* equal to the value is considered a violation of
      the rule constraint.
      
      It also changes the default values for default and newly added rules (disabling
      any suppression for those).
      Signed-off-by: NStefan Tomanek <stefan.tomanek@wertarbyte.de>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      73f5698e
  13. 03 8月, 2013 2 次提交
  14. 02 8月, 2013 1 次提交
  15. 01 8月, 2013 1 次提交
    • S
      fib_rules: add .suppress operation · 7764a45a
      Stefan Tomanek 提交于
      This change adds a new operation to the fib_rules_ops struct; it allows the
      suppression of routing decisions if certain criteria are not met by its
      results.
      
      The first implemented constraint is a minimum prefix length added to the
      structures of routing rules. If a rule is added with a minimum prefix length
      >0, only routes meeting this threshold will be considered. Any other (more
      general) routing table entries will be ignored.
      
      When configuring a system with multiple network uplinks and default routes, it
      is often convinient to reference the main routing table multiple times - but
      omitting the default route. Using this patch and a modified "ip" utility, this
      can be achieved by using the following command sequence:
      
        $ ip route add table secuplink default via 10.42.23.1
      
        $ ip rule add pref 100            table main prefixlength 1
        $ ip rule add pref 150 fwmark 0xA table secuplink
      
      With this setup, packets marked 0xA will be processed by the additional routing
      table "secuplink", but only if no suitable route in the main routing table can
      be found. By using a minimal prefixlength of 1, the default route (/0) of the
      table "main" is hidden to packets processed by rule 100; packets traveling to
      destinations with more specific routing entries are processed as usual.
      Signed-off-by: NStefan Tomanek <stefan.tomanek@wertarbyte.de>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7764a45a
  16. 31 7月, 2013 2 次提交
  17. 28 7月, 2013 2 次提交
  18. 25 7月, 2013 2 次提交
    • E
      tcp: TCP_NOTSENT_LOWAT socket option · c9bee3b7
      Eric Dumazet 提交于
      Idea of this patch is to add optional limitation of number of
      unsent bytes in TCP sockets, to reduce usage of kernel memory.
      
      TCP receiver might announce a big window, and TCP sender autotuning
      might allow a large amount of bytes in write queue, but this has little
      performance impact if a large part of this buffering is wasted :
      
      Write queue needs to be large only to deal with large BDP, not
      necessarily to cope with scheduling delays (incoming ACKS make room
      for the application to queue more bytes)
      
      For most workloads, using a value of 128 KB or less is OK to give
      applications enough time to react to POLLOUT events in time
      (or being awaken in a blocking sendmsg())
      
      This patch adds two ways to set the limit :
      
      1) Per socket option TCP_NOTSENT_LOWAT
      
      2) A sysctl (/proc/sys/net/ipv4/tcp_notsent_lowat) for sockets
      not using TCP_NOTSENT_LOWAT socket option (or setting a zero value)
      Default value being UINT_MAX (0xFFFFFFFF), meaning this has no effect.
      
      This changes poll()/select()/epoll() to report POLLOUT
      only if number of unsent bytes is below tp->nosent_lowat
      
      Note this might increase number of sendmsg()/sendfile() calls
      when using non blocking sockets,
      and increase number of context switches for blocking sockets.
      
      Note this is not related to SO_SNDLOWAT (as SO_SNDLOWAT is
      defined as :
       Specify the minimum number of bytes in the buffer until
       the socket layer will pass the data to the protocol)
      
      Tested:
      
      netperf sessions, and watching /proc/net/protocols "memory" column for TCP
      
      With 200 concurrent netperf -t TCP_STREAM sessions, amount of kernel memory
      used by TCP buffers shrinks by ~55 % (20567 pages instead of 45458)
      
      lpq83:~# echo -1 >/proc/sys/net/ipv4/tcp_notsent_lowat
      lpq83:~# (super_netperf 200 -t TCP_STREAM -H remote -l 90 &); sleep 60 ; grep TCP /proc/net/protocols
      TCPv6     1880      2   45458   no     208   yes  ipv6        y  y  y  y  y  y  y  y  y  y  y  y  y  n  y  y  y  y  y
      TCP       1696    508   45458   no     208   yes  kernel      y  y  y  y  y  y  y  y  y  y  y  y  y  n  y  y  y  y  y
      
      lpq83:~# echo 131072 >/proc/sys/net/ipv4/tcp_notsent_lowat
      lpq83:~# (super_netperf 200 -t TCP_STREAM -H remote -l 90 &); sleep 60 ; grep TCP /proc/net/protocols
      TCPv6     1880      2   20567   no     208   yes  ipv6        y  y  y  y  y  y  y  y  y  y  y  y  y  n  y  y  y  y  y
      TCP       1696    508   20567   no     208   yes  kernel      y  y  y  y  y  y  y  y  y  y  y  y  y  n  y  y  y  y  y
      
      Using 128KB has no bad effect on the throughput or cpu usage
      of a single flow, although there is an increase of context switches.
      
      A bonus is that we hold socket lock for a shorter amount
      of time and should improve latencies of ACK processing.
      
      lpq83:~# echo -1 >/proc/sys/net/ipv4/tcp_notsent_lowat
      lpq83:~# perf stat -e context-switches ./netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3
      OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.7.84 () port 0 AF_INET : +/-2.500% @ 99% conf.
      Local       Remote      Local  Elapsed Throughput Throughput  Local Local  Remote Remote Local   Remote  Service
      Send Socket Recv Socket Send   Time               Units       CPU   CPU    CPU    CPU    Service Service Demand
      Size        Size        Size   (sec)                          Util  Util   Util   Util   Demand  Demand  Units
      Final       Final                                             %     Method %      Method
      1651584     6291456     16384  20.00   17447.90   10^6bits/s  3.13  S      -1.00  U      0.353   -1.000  usec/KB
      
       Performance counter stats for './netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3':
      
                 412,514 context-switches
      
           200.034645535 seconds time elapsed
      
      lpq83:~# echo 131072 >/proc/sys/net/ipv4/tcp_notsent_lowat
      lpq83:~# perf stat -e context-switches ./netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3
      OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.7.84 () port 0 AF_INET : +/-2.500% @ 99% conf.
      Local       Remote      Local  Elapsed Throughput Throughput  Local Local  Remote Remote Local   Remote  Service
      Send Socket Recv Socket Send   Time               Units       CPU   CPU    CPU    CPU    Service Service Demand
      Size        Size        Size   (sec)                          Util  Util   Util   Util   Demand  Demand  Units
      Final       Final                                             %     Method %      Method
      1593240     6291456     16384  20.00   17321.16   10^6bits/s  3.35  S      -1.00  U      0.381   -1.000  usec/KB
      
       Performance counter stats for './netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3':
      
               2,675,818 context-switches
      
           200.029651391 seconds time elapsed
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Acked-By: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c9bee3b7
    • D
      net: sctp: trivial: update mailing list address · 91705c61
      Daniel Borkmann 提交于
      The SCTP mailing list address to send patches or questions
      to is linux-sctp@vger.kernel.org and not
      lksctp-developers@lists.sourceforge.net anymore. Therefore,
      update all occurences.
      Signed-off-by: NDaniel Borkmann <dborkman@redhat.com>
      Acked-by: NNeil Horman <nhorman@tuxdriver.com>
      Acked-by: NVlad Yasevich <vyasevich@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      91705c61
  19. 23 7月, 2013 1 次提交
  20. 17 7月, 2013 1 次提交
  21. 16 7月, 2013 3 次提交
  22. 11 7月, 2013 1 次提交
    • M
      dm: optimize use SRCU and RCU · 83d5e5b0
      Mikulas Patocka 提交于
      This patch removes "io_lock" and "map_lock" in struct mapped_device and
      "holders" in struct dm_table and replaces these mechanisms with
      sleepable-rcu.
      
      Previously, the code would call "dm_get_live_table" and "dm_table_put" to
      get and release table. Now, the code is changed to call "dm_get_live_table"
      and "dm_put_live_table". dm_get_live_table locks sleepable-rcu and
      dm_put_live_table unlocks it.
      
      dm_get_live_table_fast/dm_put_live_table_fast can be used instead of
      dm_get_live_table/dm_put_live_table. These *_fast functions use
      non-sleepable RCU, so the caller must not block between them.
      
      If the code changes active or inactive dm table, it must call
      dm_sync_table before destroying the old table.
      Signed-off-by: NMikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: NJun'ichi Nomura <j-nomura@ce.jp.nec.com>
      Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
      83d5e5b0
  23. 10 7月, 2013 1 次提交
    • M
      fatfs: add FAT_IOCTL_GET_VOLUME_ID · 6e5b93ee
      Mike Lockwood 提交于
      This patch, originally from Android kernel, adds vfat ioctl command
      FAT_IOCTL_GET_VOLUME_ID, with this command we can get the vfat volume ID
      using following code:
      
      	ioctl(fd, FAT_IOCTL_GET_VOLUME_ID, &volume_ID)
      
      This patch is a modified version of the patch by Mike Lockwood, with
      changes from Dmitry Pervushin, who noticed the original patch makes some
      volume IDs abiguous with error returns: for example, if volume id is
      0xFFFFFDAD, that matches -ENOIOCTLCMD, we get "FFFFFFFF" from the user
      space.
      
      So add a parameter to ioctl to get the correct volume ID.
      
      Android uses vfat volume ID to identify different sd card, when a new sd
      card is inserted to device, android can scan the media on it and pop up
      new contents.
      Signed-off-by: NBintian Wang <bintian.wang@linaro.org>
      Cc: dmitry pervushin <dpervushin@gmail.com>
      Cc: Mike Lockwood <lockwood@android.com>
      Cc: Colin Cross <ccross@android.com>
      Acked-by: NOGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Sean McNeil <sean@mcneil.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6e5b93ee
  24. 09 7月, 2013 1 次提交
  25. 04 7月, 2013 1 次提交
    • A
      ptrace: add ability to get/set signal-blocked mask · 29000cae
      Andrey Vagin 提交于
      crtools uses a parasite code for dumping processes.  The parasite code is
      injected into a process with help PTRACE_SEIZE.
      
      Currently crtools blocks signals from a parasite code.  If a process has
      pending signals, crtools wait while a process handles these signals.
      
      This method is not suitable for stopped tasks.  A stopped task can have a
      few pending signals, when we will try to execute a parasite code, we will
      need to drop SIGSTOP, but all other signals must remain pending, because a
      state of processes must not be changed during checkpointing.
      
      This patch adds two ptrace commands to set/get signal-blocked mask.
      
      I think gdb can use this commands too.
      
      [akpm@linux-foundation.org: be consistent with brace layout]
      Signed-off-by: NAndrey Vagin <avagin@openvz.org>
      Reviewed-by: NOleg Nesterov <oleg@redhat.com>
      Cc: Roland McGrath <roland@redhat.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      29000cae
  26. 02 7月, 2013 1 次提交
  27. 01 7月, 2013 2 次提交