1. 03 2月, 2015 1 次提交
    • W
      net-timestamp: no-payload only sysctl · b245be1f
      Willem de Bruijn 提交于
      Tx timestamps are looped onto the error queue on top of an skb. This
      mechanism leaks packet headers to processes unless the no-payload
      options SOF_TIMESTAMPING_OPT_TSONLY is set.
      
      Add a sysctl that optionally drops looped timestamp with data. This
      only affects processes without CAP_NET_RAW.
      
      The policy is checked when timestamps are generated in the stack.
      It is possible for timestamps with data to be reported after the
      sysctl is set, if these were queued internally earlier.
      
      No vulnerability is immediately known that exploits knowledge
      gleaned from packet headers, but it may still be preferable to allow
      administrators to lock down this path at the cost of possible
      breakage of legacy applications.
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      
      ----
      
      Changes
        (v1 -> v2)
        - test socket CAP_NET_RAW instead of capable(CAP_NET_RAW)
        (rfc -> v1)
        - document the sysctl in Documentation/sysctl/net.txt
        - fix access control race: read .._OPT_TSONLY only once,
              use same value for permission check and skb generation.
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b245be1f
  2. 17 11月, 2014 1 次提交
    • E
      net: provide a per host RSS key generic infrastructure · 960fb622
      Eric Dumazet 提交于
      RSS (Receive Side Scaling) typically uses Toeplitz hash and a 40 or 52 bytes
      RSS key.
      
      Some drivers use a constant (and well known key), some drivers use a random
      key per port, making bonding setups hard to tune. Well known keys increase
      attack surface, considering that number of queues is usually a power of two.
      
      This patch provides infrastructure to help drivers doing the right thing.
      
      netdev_rss_key_fill() should be used by drivers to initialize their RSS key,
      even if they provide ethtool -X support to let user redefine the key later.
      
      A new /proc/sys/net/core/netdev_rss_key file can be used to get the host
      RSS key even for drivers not providing ethtool -x support, in case some
      applications want to precisely setup flows to match some RX queues.
      
      Tested:
      
      myhost:~# cat /proc/sys/net/core/netdev_rss_key
      11:63:99:bb:79:fb:a5:a7:07:45:b2:20:bf:02:42:2d:08:1a:dd:19:2b:6b:23:ac:56:28:9d:70:c3:ac:e8:16:4b:b7:c1:10:53:a4:78:41:36:40:74:b6:15:ca:27:44:aa:b3:4d:72
      
      myhost:~# ethtool -x eth0
      RX flow hash indirection table for eth0 with 8 RX ring(s):
          0:      0     1     2     3     4     5     6     7
      RSS hash key:
      11:63:99:bb:79:fb:a5:a7:07:45:b2:20:bf:02:42:2d:08:1a:dd:19:2b:6b:23:ac:56:28:9d:70:c3:ac:e8:16:4b:b7:c1:10:53:a4:78:41
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      960fb622
  3. 12 11月, 2014 1 次提交
    • J
      net: Convert LIMIT_NETDEBUG to net_dbg_ratelimited · ba7a46f1
      Joe Perches 提交于
      Use the more common dynamic_debug capable net_dbg_ratelimited
      and remove the LIMIT_NETDEBUG macro.
      
      All messages are still ratelimited.
      
      Some KERN_<LEVEL> uses are changed to KERN_DEBUG.
      
      This may have some negative impact on messages that were
      emitted at KERN_INFO that are not not enabled at all unless
      DEBUG is defined or dynamic_debug is enabled.  Even so,
      these messages are now _not_ emitted by default.
      
      This also eliminates the use of the net_msg_warn sysctl
      "/proc/sys/net/core/warnings".  For backward compatibility,
      the sysctl is not removed, but it has no function.  The extern
      declaration of net_msg_warn is removed from sock.h and made
      static in net/core/sysctl_net_core.c
      
      Miscellanea:
      
      o Update the sysctl documentation
      o Remove the embedded uses of pr_fmt
      o Coalesce format fragments
      o Realign arguments
      Signed-off-by: NJoe Perches <joe@perches.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ba7a46f1
  4. 02 9月, 2014 1 次提交
    • E
      tipc: add name distributor resiliency queue · a5325ae5
      Erik Hugne 提交于
      TIPC name table updates are distributed asynchronously in a cluster,
      entailing a risk of certain race conditions. E.g., if two nodes
      simultaneously issue conflicting (overlapping) publications, this may
      not be detected until both publications have reached a third node, in
      which case one of the publications will be silently dropped on that
      node. Hence, we end up with an inconsistent name table.
      
      In most cases this conflict is just a temporary race, e.g., one
      node is issuing a publication under the assumption that a previous,
      conflicting, publication has already been withdrawn by the other node.
      However, because of the (rtt related) distributed update delay, this
      may not yet hold true on all nodes. The symptom of this failure is a
      syslog message: "tipc: Cannot publish {%u,%u,%u}, overlap error".
      
      In this commit we add a resiliency queue at the receiving end of
      the name table distributor. When insertion of an arriving publication
      fails, we retain it in this queue for a short amount of time, assuming
      that another update will arrive very soon and clear the conflict. If so
      happens, we insert the publication, otherwise we drop it.
      
      The (configurable) retention value defaults to 2000 ms. Knowing from
      experience that the situation described above is extremely rare, there
      is no risk that the queue will accumulate any large number of items.
      Signed-off-by: NErik Hugne <erik.hugne@ericsson.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a5325ae5
  5. 31 8月, 2013 1 次提交
    • S
      qdisc: allow setting default queuing discipline · 6da7c8fc
      stephen hemminger 提交于
      By default, the pfifo_fast queue discipline has been used by default
      for all devices. But we have better choices now.
      
      This patch allow setting the default queueing discipline with sysctl.
      This allows easy use of better queueing disciplines on all devices
      without having to use tc qdisc scripts. It is intended to allow
      an easy path for distributions to make fq_codel or sfq the default
      qdisc.
      
      This patch also makes pfifo_fast more of a first class qdisc, since
      it is now possible to manually override the default and explicitly
      use pfifo_fast. The behavior for systems who do not use the sysctl
      is unchanged, they still get pfifo_fast
      
      Also removes leftover random # in sysctl net core.
      Signed-off-by: NStephen Hemminger <stephen@networkplumber.org>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6da7c8fc
  6. 02 8月, 2013 1 次提交
  7. 11 7月, 2013 1 次提交
  8. 09 7月, 2013 1 次提交
  9. 26 6月, 2013 1 次提交
    • E
      net: poll/select low latency socket support · 2d48d67f
      Eliezer Tamir 提交于
      select/poll busy-poll support.
      
      Split sysctl value into two separate ones, one for read and one for poll.
      updated Documentation/sysctl/net.txt
      
      Add a new poll flag POLL_LL. When this flag is set, sock_poll will call
      sk_poll_ll if possible. sock_poll sets this flag in its return value
      to indicate to select/poll when a socket that can busy poll is found.
      
      When poll/select have nothing to report, call the low-level
      sock_poll again until we are out of time or we find something.
      
      Once the system call finds something, it stops setting POLL_LL, so it can
      return the result to the user ASAP.
      Signed-off-by: NEliezer Tamir <eliezer.tamir@linux.intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2d48d67f
  10. 24 6月, 2013 1 次提交
  11. 18 6月, 2013 1 次提交
    • Y
      tipc: change socket buffer overflow control to respect sk_rcvbuf · cc79dd1b
      Ying Xue 提交于
      As per feedback from the netdev community, we change the buffer
      overflow protection algorithm in receiving sockets so that it
      always respects the nominal upper limit set in sk_rcvbuf.
      
      Instead of scaling up from a small sk_rcvbuf value, which leads to
      violation of the configured sk_rcvbuf limit, we now calculate the
      weighted per-message limit by scaling down from a much bigger value,
      still in the same field, according to the importance priority of the
      received message.
      
      To allow for administrative tunability of the socket receive buffer
      size, we create a tipc_rmem sysctl variable to allow the user to
      configure an even bigger value via sysctl command.  It is a size of
      three (min/default/max) to be consistent with things like tcp_rmem.
      
      By default, the value initialized in tipc_rmem[1] is equal to the
      receive socket size needed by a TIPC_CRITICAL_IMPORTANCE message.
      This value is also set as the default value of sk_rcvbuf.
      Originally-by: NJon Maloy <jon.maloy@ericsson.com>
      Cc: Neil Horman <nhorman@tuxdriver.com>
      Cc: Jon Maloy <jon.maloy@ericsson.com>
      [Ying: added sysctl variation to Jon's original patch]
      Signed-off-by: NYing Xue <ying.xue@windriver.com>
      [PG: don't compile sysctl.c if not config'd; add Documentation]
      Signed-off-by: NPaul Gortmaker <paul.gortmaker@windriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cc79dd1b
  12. 11 6月, 2013 1 次提交
  13. 20 5月, 2013 1 次提交
  14. 27 4月, 2012 1 次提交
  15. 28 4月, 2011 1 次提交
    • E
      net: filter: Just In Time compiler for x86-64 · 0a14842f
      Eric Dumazet 提交于
      In order to speedup packet filtering, here is an implementation of a
      JIT compiler for x86_64
      
      It is disabled by default, and must be enabled by the admin.
      
      echo 1 >/proc/sys/net/core/bpf_jit_enable
      
      It uses module_alloc() and module_free() to get memory in the 2GB text
      kernel range since we call helpers functions from the generated code.
      
      EAX : BPF A accumulator
      EBX : BPF X accumulator
      RDI : pointer to skb   (first argument given to JIT function)
      RBP : frame pointer (even if CONFIG_FRAME_POINTER=n)
      r9d : skb->len - skb->data_len (headlen)
      r8  : skb->data
      
      To get a trace of generated code, use :
      
      echo 2 >/proc/sys/net/core/bpf_jit_enable
      
      Example of generated code :
      
      # tcpdump -p -n -s 0 -i eth1 host 192.168.20.0/24
      
      flen=18 proglen=147 pass=3 image=ffffffffa00b5000
      JIT code: ffffffffa00b5000: 55 48 89 e5 48 83 ec 60 48 89 5d f8 44 8b 4f 60
      JIT code: ffffffffa00b5010: 44 2b 4f 64 4c 8b 87 b8 00 00 00 be 0c 00 00 00
      JIT code: ffffffffa00b5020: e8 24 7b f7 e0 3d 00 08 00 00 75 28 be 1a 00 00
      JIT code: ffffffffa00b5030: 00 e8 fe 7a f7 e0 24 00 3d 00 14 a8 c0 74 49 be
      JIT code: ffffffffa00b5040: 1e 00 00 00 e8 eb 7a f7 e0 24 00 3d 00 14 a8 c0
      JIT code: ffffffffa00b5050: 74 36 eb 3b 3d 06 08 00 00 74 07 3d 35 80 00 00
      JIT code: ffffffffa00b5060: 75 2d be 1c 00 00 00 e8 c8 7a f7 e0 24 00 3d 00
      JIT code: ffffffffa00b5070: 14 a8 c0 74 13 be 26 00 00 00 e8 b5 7a f7 e0 24
      JIT code: ffffffffa00b5080: 00 3d 00 14 a8 c0 75 07 b8 ff ff 00 00 eb 02 31
      JIT code: ffffffffa00b5090: c0 c9 c3
      
      BPF program is 144 bytes long, so native program is almost same size ;)
      
      (000) ldh      [12]
      (001) jeq      #0x800           jt 2    jf 8
      (002) ld       [26]
      (003) and      #0xffffff00
      (004) jeq      #0xc0a81400      jt 16   jf 5
      (005) ld       [30]
      (006) and      #0xffffff00
      (007) jeq      #0xc0a81400      jt 16   jf 17
      (008) jeq      #0x806           jt 10   jf 9
      (009) jeq      #0x8035          jt 10   jf 17
      (010) ld       [28]
      (011) and      #0xffffff00
      (012) jeq      #0xc0a81400      jt 16   jf 13
      (013) ld       [38]
      (014) and      #0xffffff00
      (015) jeq      #0xc0a81400      jt 16   jf 17
      (016) ret      #65535
      (017) ret      #0
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
      Cc: Ben Hutchings <bhutchings@solarflare.com>
      Cc: Hagen Paul Pfeifer <hagen@jauu.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0a14842f
  16. 16 5月, 2010 1 次提交
    • E
      net: Consistent skb timestamping · 3b098e2d
      Eric Dumazet 提交于
      With RPS inclusion, skb timestamping is not consistent in RX path.
      
      If netif_receive_skb() is used, its deferred after RPS dispatch.
      
      If netif_rx() is used, its done before RPS dispatch.
      
      This can give strange tcpdump timestamps results.
      
      I think timestamping should be done as soon as possible in the receive
      path, to get meaningful values (ie timestamps taken at the time packet
      was delivered by NIC driver to our stack), even if NAPI already can
      defer timestamping a bit (RPS can help to reduce the gap)
      
      Tom Herbert prefer to sample timestamps after RPS dispatch. In case
      sampling is expensive (HPET/acpi_pm on x86), this makes sense.
      
      Let admins switch from one mode to another, using a new
      sysctl, /proc/sys/net/core/netdev_tstamp_prequeue
      
      Its default value (1), means timestamps are taken as soon as possible,
      before backlog queueing, giving accurate timestamps.
      
      Setting a 0 value permits to sample timestamps when processing backlog,
      after RPS dispatch, to lower the load of the pre-RPS cpu.
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3b098e2d
  17. 14 4月, 2009 1 次提交
  18. 03 4月, 2009 2 次提交
    • L
      documentation: fix unix_dgram_qlen description · 45dad7bd
      Li Xiaodong 提交于
      Previous description about system parameter in /proc/sys/net/unix/ is
      wrong (or missed).  Simply add a new description about unix_dgram_qlen
      according to latest kernel.
      Signed-off-by: NLi Xiaodong <lixd@cn.fujitsu.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      45dad7bd
    • S
      documentation: update Documentation/filesystem/proc.txt and Documentation/sysctls · 760df93e
      Shen Feng 提交于
      Now /proc/sys is described in many places and much information is
      redundant.  This patch updates the proc.txt and move the /proc/sys
      desciption out to the files in Documentation/sysctls.
      
      Details are:
      
      merge
      -  2.1  /proc/sys/fs - File system data
      -  2.11 /proc/sys/fs/mqueue - POSIX message queues filesystem
      -  2.17 /proc/sys/fs/epoll - Configuration options for the epoll interface
      with Documentation/sysctls/fs.txt.
      
      remove
      -  2.2  /proc/sys/fs/binfmt_misc - Miscellaneous binary formats
      since it's not better then the Documentation/binfmt_misc.txt.
      
      merge
      -  2.3  /proc/sys/kernel - general kernel parameters
      with Documentation/sysctls/kernel.txt
      
      remove
      -  2.5  /proc/sys/dev - Device specific parameters
      since it's obsolete the sysfs is used now.
      
      remove
      -  2.6  /proc/sys/sunrpc - Remote procedure calls
      since it's not better then the Documentation/sysctls/sunrpc.txt
      
      move
      -  2.7  /proc/sys/net - Networking stuff
      -  2.9  Appletalk
      -  2.10 IPX
      to newly created Documentation/sysctls/net.txt.
      
      remove
      -  2.8  /proc/sys/net/ipv4 - IPV4 settings
      since it's not better then the Documentation/networking/ip-sysctl.txt.
      
      add
      - Chapter 3 Per-Process Parameters
      to descibe /proc/<pid>/xxx parameters.
      Signed-off-by: NShen Feng <shen@cn.fujitsu.com>
      Cc: Randy Dunlap <randy.dunlap@oracle.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      760df93e