1. 06 Dec 2016 (7 commits)
  2. 04 Dec 2016 (5 commits)
    • ipv4: fib: Replay events when registering FIB notifier · c3852ef7
      Authored by Ido Schimmel
      Commit b90eb754 ("fib: introduce FIB notification infrastructure")
      introduced a new notification chain to notify listeners (f.e., switchdev
      drivers) about addition and deletion of routes.
      
      However, upon registration to the chain the FIB tables can already be
      populated, which means potential listeners will have an incomplete view
      of the tables.
      
      Solve that by dumping the FIB tables and replaying the events to the
      passed notification block. The dump itself is done using RCU in order
      not to starve consumers that need RTNL to make progress.
      
      The integrity of the dump is ensured by reading the FIB change sequence
      counter before and after the dump under RTNL. This allows us to avoid
      the problematic situation in which the dumping process sends an
      ENTRY_ADD notification following an ENTRY_DEL generated by another
      process holding RTNL.
      
      Callers of the registration function may pass a callback that is
      executed in case the dump was inconsistent with current FIB tables.
      
      The number of retries until a consistent dump is achieved is set to a
      fixed number to prevent callers from looping for long periods of time.
      In case the current limit proves to be problematic in the future, it
      can easily be converted to be configurable using a sysctl.
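
      A minimal sketch of how a listener might use the new registration
      path, assuming the signature described above (a notifier block plus
      an inconsistency callback); the handler and callback names here are
      illustrative:

      	#include <linux/notifier.h>
      	#include <net/ip_fib.h>

      	static int my_fib_event(struct notifier_block *nb,
      				unsigned long event, void *ptr)
      	{
      		switch (event) {
      		case FIB_EVENT_ENTRY_ADD:	/* replayed for pre-existing routes too */
      		case FIB_EVENT_ENTRY_DEL:
      			/* ptr points at the fib_notifier_info; defer the
      			 * actual hardware programming to a workqueue
      			 */
      			break;
      		}
      		return NOTIFY_DONE;
      	}

      	static void my_fib_dump_flush(struct notifier_block *nb)
      	{
      		/* the dump raced with concurrent FIB changes: drop any
      		 * state built from the replayed events before the retry
      		 */
      	}

      	static struct notifier_block my_fib_nb = {
      		.notifier_call = my_fib_event,
      	};

      	static int __init my_listener_init(void)
      	{
      		/* replays ENTRY_ADD events for already-present routes */
      		return register_fib_notifier(&my_fib_nb, my_fib_dump_flush);
      	}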
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c3852ef7
    • ipv4: fib: Allow for consistent FIB dumping · cacaad11
      Authored by Ido Schimmel
      The next patch will enable listeners of the FIB notification chain to
      request a dump of the FIB tables. However, since RTNL isn't taken during
      the dump, it's possible for the FIB tables to change mid-dump, which
      will result in inconsistency between the listener's table and the
      kernel's.
      
      Allow listeners to know about changes that occurred mid-dump, by adding
      a change sequence counter to each net namespace. The counter is
      incremented just before a notification is sent in the FIB chain.
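
      A hedged sketch of the resulting dump/verify loop; the field and
      helper names used here (fib_seq, fib_replay_events(),
      FIB_DUMP_MAX_RETRIES) are illustrative rather than the exact upstream
      symbols:

      	static int fib_dump_consistent(struct net *net, struct notifier_block *nb)
      	{
      		unsigned int seq_before, seq_after;
      		int retries = 0;

      		do {
      			rtnl_lock();
      			seq_before = net->ipv4.fib_seq;	/* bumped before each notification */
      			rtnl_unlock();

      			fib_replay_events(net, nb);	/* walks the tables under RCU */

      			rtnl_lock();
      			seq_after = net->ipv4.fib_seq;
      			rtnl_unlock();
      		} while (seq_before != seq_after && ++retries < FIB_DUMP_MAX_RETRIES);

      		return seq_before == seq_after ? 0 : -EBUSY;
      	}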
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      cacaad11
    • ipv4: fib: Convert FIB notification chain to be atomic · d3f706f6
      Authored by Ido Schimmel
      In order not to hold RTNL for long periods of time we're going to dump
      the FIB tables using RCU.
      
      Convert the FIB notification chain to be atomic, as we can't block in
      RCU critical sections.
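
      In notifier-chain terms the conversion is roughly the following (a
      sketch, assuming the chain was previously a sleepable one as the text
      implies):

      	#include <linux/notifier.h>

      	static ATOMIC_NOTIFIER_HEAD(fib_chain);	/* was a blocking chain */

      	static int call_fib_notifiers_sketch(unsigned long event, void *info)
      	{
      		/* atomic chains invoke their callbacks under rcu_read_lock(),
      		 * so this is safe from an RCU walk of the FIB tables; the
      		 * callbacks themselves must not sleep
      		 */
      		return atomic_notifier_call_chain(&fib_chain, event, info);
      	}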
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d3f706f6
    • ipv4: fib: Export free_fib_info() · b423cb10
      Authored by Ido Schimmel
      The FIB notification chain is going to be converted to an atomic chain,
      which means switchdev drivers will have to offload FIB entries in
      deferred work, as hardware operations entail sleeping.
      
      However, while the work is queued the fib info might be freed, so a
      reference must be taken. To release the reference (and potentially free
      the fib info), fib_info_put() will be called, which in turn calls
      free_fib_info().
      
      Export free_fib_info() so that modules will be able to invoke
      fib_info_put().
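
      A sketch of the deferred-offload pattern this enables in a module;
      struct fib_work and the function names are illustrative, and the
      reference taken here is the fib_clntref count that fib_info_put()
      drops:

      	#include <linux/slab.h>
      	#include <linux/workqueue.h>
      	#include <net/ip_fib.h>

      	struct fib_work {
      		struct work_struct work;
      		struct fib_info *fi;
      	};

      	static void fib_offload_work(struct work_struct *work)
      	{
      		struct fib_work *fw = container_of(work, struct fib_work, work);

      		/* sleeping hardware operations go here */

      		fib_info_put(fw->fi);	/* may call the now-exported free_fib_info() */
      		kfree(fw);
      	}

      	/* called from the atomic notifier callback */
      	static void fib_offload_schedule(struct fib_info *fi)
      	{
      		struct fib_work *fw = kzalloc(sizeof(*fw), GFP_ATOMIC);

      		if (!fw)
      			return;
      		atomic_inc(&fi->fib_clntref);	/* hold fi while the work is queued */
      		fw->fi = fi;
      		INIT_WORK(&fw->work, fib_offload_work);
      		schedule_work(&fw->work);
      	}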
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b423cb10
    • udp: be less conservative with sock rmem accounting · 363dc73a
      Authored by Paolo Abeni
      Before commit 850cbadd ("udp: use it's own memory accounting
      schema"), the udp protocol allowed sk_rmem_alloc to grow beyond
      the rcvbuf by the whole current packet's truesize. After said commit
      we allow sk_rmem_alloc to exceed the rcvbuf only if the receive queue
      is empty. As reported by Jesper, this causes a performance regression
      for some (small) values of rcvbuf.
      
      This commit is intended to fix the regression by restoring the old
      handling of the rcvbuf limit.
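
      In accounting terms, the restored rule looks roughly like this (a
      paraphrase of the idea, not the literal upstream diff; the function
      name is illustrative):

      	static bool udp_rmem_admit(struct sock *sk, int truesize)
      	{
      		int rmem = atomic_add_return(truesize, &sk->sk_rmem_alloc);

      		/* accept as long as the previously charged memory was
      		 * below the limit, i.e. sk_rmem_alloc may overshoot
      		 * sk_rcvbuf by up to one packet's truesize
      		 */
      		if (rmem > sk->sk_rcvbuf + truesize) {
      			atomic_sub(truesize, &sk->sk_rmem_alloc);
      			return false;
      		}
      		return true;
      	}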
      Reported-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Fixes: 850cbadd ("udp: use it's own memory accounting schema")
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      363dc73a
  3. 03 Dec 2016 (4 commits)
  4. 02 Dec 2016 (2 commits)
  5. 01 Dec 2016 (2 commits)
  6. 30 Nov 2016 (8 commits)
  7. 29 Nov 2016 (1 commit)
  8. 26 Nov 2016 (1 commit)
    • net: ipv4, ipv6: run cgroup eBPF egress programs · 33b48679
      Authored by Daniel Mack
      If the cgroup associated with the receiving socket has eBPF
      programs installed, run them from ip_output(), ip6_output() and
      ip_mc_output(). In the mentioned functions we have two socket contexts,
      as per 7026b1dd ("netfilter: Pass socket pointer down through
      okfn()."). We explicitly need to use sk instead of skb->sk here,
      since otherwise the same program would run multiple times on egress
      when encap devices are involved, which is not desired in our case.
      
      eBPF programs used in this context are expected to either return 1 to
      let the packet pass, or != 1 to drop it. The programs have access to
      the skb through bpf_skb_load_bytes(), and the payload starts at the
      network headers (L3).
      
      Note that cgroup_bpf_run_filter() is stubbed out as a static inline nop
      for !CONFIG_CGROUP_BPF, and is otherwise guarded by a static key if
      the feature is unused.
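
      For illustration, a minimal program for this attach point might look
      as follows (restricted C built with clang -target bpf; the program
      name and the samples/bpf-style helper declaration are illustrative):

      	#include <linux/bpf.h>

      	static int (*bpf_skb_load_bytes)(void *ctx, int off, void *to, int len) =
      		(void *) BPF_FUNC_skb_load_bytes;

      	__attribute__((section("cgroup/skb"), used))
      	int egress_prog(struct __sk_buff *skb)
      	{
      		__u8 version;

      		/* the readable payload starts at the network header (L3) */
      		if (bpf_skb_load_bytes(skb, 0, &version, sizeof(version)) < 0)
      			return 1;	/* unreadable: let the packet pass */

      		/* a real policy would inspect the headers here */
      		return 1;		/* 1 = pass, anything else = drop */
      	}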
      Signed-off-by: Daniel Mack <daniel@zonque.org>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      33b48679
  9. 25 Nov 2016 (2 commits)
    • tcp: enhance tcp_collapse_retrans() with skb_shift() · f8071cde
      Authored by Eric Dumazet
      In commit 2331ccc5 ("tcp: enhance tcp collapsing"),
      we made a first step by allowing the right skb's data to be copied
      into the left skb's head.
      
      Since all skbs in the socket write queue are headless (except possibly
      the very first one), this strategy often does not work.
      
      This patch extends tcp_collapse_retrans() to perform frag shifting,
      thanks to skb_shift() helper.
      
      The helper therefore must not BUG on non-headless skbs, as callers are
      OK with that.
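
      Condensed, the new collapse step looks roughly like this (a sketch of
      the logic inside tcp_collapse_retrans(), not the verbatim upstream
      code):

      	if (next_skb_size) {
      		if (next_skb_size <= skb_availroom(skb))
      			/* small skb: copy the bytes into skb's head, as before */
      			skb_copy_bits(next_skb, 0,
      				      skb_put(skb, next_skb_size), next_skb_size);
      		else if (!skb_shift(skb, next_skb, next_skb_size))
      			/* the frags could not be moved over either: give up */
      			return false;
      	}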
      
      Tested:
      
      The following packetdrill test now passes:
      
      0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
         +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
         +0 bind(3, ..., ...) = 0
         +0 listen(3, 1) = 0
      
         +0 < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 8>
         +0 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 8>
      +.100 < . 1:1(0) ack 1 win 257
         +0 accept(3, ..., ...) = 4
      
         +0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0
         +0 write(4, ..., 200) = 200
         +0 > P. 1:201(200) ack 1
      +.001 write(4, ..., 200) = 200
         +0 > P. 201:401(200) ack 1
      +.001 write(4, ..., 200) = 200
         +0 > P. 401:601(200) ack 1
      +.001 write(4, ..., 200) = 200
         +0 > P. 601:801(200) ack 1
      +.001 write(4, ..., 200) = 200
         +0 > P. 801:1001(200) ack 1
      +.001 write(4, ..., 100) = 100
         +0 > P. 1001:1101(100) ack 1
      +.001 write(4, ..., 100) = 100
         +0 > P. 1101:1201(100) ack 1
      +.001 write(4, ..., 100) = 100
         +0 > P. 1201:1301(100) ack 1
      +.001 write(4, ..., 100) = 100
         +0 > P. 1301:1401(100) ack 1
      
      +.099 < . 1:1(0) ack 201 win 257
      +.001 < . 1:1(0) ack 201 win 257 <nop,nop,sack 1001:1401>
         +0 > P. 201:1001(800) ack 1
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Acked-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f8071cde
    • udplite: call proper backlog handlers · 30c7be26
      Authored by Eric Dumazet
      In commits 93821778 ("udp: Fix rcv socket locking") and
      f7ad74fe ("net/ipv6/udp: UDP encapsulation: break backlog_rcv into
      __udpv6_queue_rcv_skb") UDP backlog handlers were renamed, but UDPlite
      was forgotten.
      
      This leads to crashes if the UDPlite header is pulled twice, which
      happens starting from commit e6afc8ac ("udp: remove headers from UDP
      packets before queueing").
      
      Bug found by the syzkaller team, thanks a lot guys!
      
      Note that backlog use in UDP/UDPlite is scheduled to be removed starting
      from linux-4.10, so this patch is only needed up to linux-4.9.
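
      The essence of the fix, sketched for the IPv4 side (field list
      abridged; the point is which handler .backlog_rcv refers to):

      	/* net/ipv4/udplite.c */
      	struct proto udplite_prot = {
      		.name		= "UDP-Lite",
      		/* ... */
      		.backlog_rcv	= __udp_queue_rcv_skb,	/* not udp_queue_rcv_skb,
      							 * which would pull the
      							 * header a second time */
      		/* ... */
      	};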
      
      Fixes: 93821778 ("udp: Fix rcv socket locking")
      Fixes: f7ad74fe ("net/ipv6/udp: UDP encapsulation: break backlog_rcv into __udpv6_queue_rcv_skb")
      Fixes: e6afc8ac ("udp: remove headers from UDP packets before queueing")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Andrey Konovalov <andreyknvl@google.com>
      Cc: Benjamin LaHaise <bcrl@kvack.org>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      30c7be26
  10. 24 Nov 2016 (1 commit)
    • netfilter: Update ip_route_me_harder to consider L3 domain · 6d8b49c3
      Authored by David Ahern
      ip_route_me_harder does not consider the L3 domain and sends lookups
      to the wrong table. For example, consider the following output rule:
      
      iptables -I OUTPUT -p tcp --dport 12345 -j REJECT --reject-with tcp-reset
      
      using perf to analyze lookups via the fib_table_lookup tracepoint shows:
      
      vrf-test  1187 [001] 46887.295927: fib:fib_table_lookup: table 255 oif 0 iif 0 src 0.0.0.0 dst 10.100.1.254 tos 0 scope 0 flags 0
              ffffffff8143922c perf_trace_fib_table_lookup ([kernel.kallsyms])
              ffffffff81493aac fib_table_lookup ([kernel.kallsyms])
              ffffffff8148dda3 __inet_dev_addr_type ([kernel.kallsyms])
              ffffffff8148ddf6 inet_addr_type ([kernel.kallsyms])
              ffffffff8149e344 ip_route_me_harder ([kernel.kallsyms])
      
      and
      
      vrf-test  1187 [001] 46887.295933: fib:fib_table_lookup: table 255 oif 0 iif 1 src 10.100.1.254 dst 10.100.1.2 tos 0 scope 0 flags
              ffffffff8143922c perf_trace_fib_table_lookup ([kernel.kallsyms])
              ffffffff81493aac fib_table_lookup ([kernel.kallsyms])
              ffffffff814998ff fib4_rule_action ([kernel.kallsyms])
              ffffffff81437f35 fib_rules_lookup ([kernel.kallsyms])
              ffffffff81499758 __fib_lookup ([kernel.kallsyms])
              ffffffff8144f010 fib_lookup.constprop.34 ([kernel.kallsyms])
              ffffffff8144f759 __ip_route_output_key_hash ([kernel.kallsyms])
              ffffffff8144fc6a ip_route_output_flow ([kernel.kallsyms])
              ffffffff8149e39b ip_route_me_harder ([kernel.kallsyms])
      
      In both cases the lookups are directed to table 255 rather than the
      table associated with the device via the L3 domain. Update both
      lookups to pull the L3 domain from the dst currently attached to the
      skb.
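
      The gist of the change can be sketched as follows (a hedged fragment;
      inet_addr_type_dev_table() and l3mdev_master_ifindex() are the
      existing device-aware helpers, though the exact upstream diff may
      differ):

      	/* inside ip_route_me_harder(), sketched */
      	struct net_device *dev = skb_dst(skb)->dev;

      	/* scope the output route lookup to the device's L3 domain */
      	fl4.flowi4_oif = l3mdev_master_ifindex(dev);

      	/* and use the device-aware variant of the address-type lookup */
      	addr_type = inet_addr_type_dev_table(net, dev, fl4.saddr);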
      Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      6d8b49c3
  11. 22 Nov 2016 (4 commits)
  12. 20 Nov 2016 (1 commit)
    • net: fix bogus cast in skb_pagelen() and use unsigned variables · c72d8cda
      Authored by Alexey Dobriyan
      1) The cast to "int" is unnecessary:
         u8 will be promoted to int before decrementing, and
         small positive numbers fit into "int", so their values won't be
         changed during promotion.
      
         Once everything is int including loop counters, signedness doesn't
         matter: 32-bit operations will stay 32-bit operations.
      
      But! Someone tried to make this loop smart by making everything the
      same type, apparently in an attempt to optimise it.
      Do the optimisation, just differently.
      Do the cast where it matters. :^)
      
      2) The frag size is an unsigned entity, and the sum of fragment sizes
         is also unsigned.
      
      Make everything unsigned, leave no MOVSX instruction behind.
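
      The resulting helper looks roughly like this (a hedged reconstruction,
      not the verbatim diff):

      	static inline unsigned int skb_pagelen(const struct sk_buff *skb)
      	{
      		unsigned int len = skb_headlen(skb);	/* unsigned sum, no MOVSX */
      		int i;

      		for (i = skb_shinfo(skb)->nr_frags - 1; i >= 0; i--)
      			len += skb_frag_size(&skb_shinfo(skb)->frags[i]);
      		return len;
      	}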
      
      	add/remove: 0/0 grow/shrink: 0/3 up/down: 0/-4 (-4)
      	function                                     old     new   delta
      	skb_cow_data                                 835     834      -1
      	ip_do_fragment                              2549    2548      -1
      	ip6_fragment                                3130    3128      -2
      	Total: Before=154865032, After=154865028, chg -0.00%
      Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c72d8cda
  13. 18 Nov 2016 (2 commits)
    • netns: make struct pernet_operations::id unsigned int · c7d03a00
      Authored by Alexey Dobriyan
      Make struct pernet_operations::id unsigned.
      
      There are 2 reasons to do so:
      
      1)
      This field is really an index into a zero-based array and is thus an
      unsigned entity. Using a negative value is an out-of-bounds access by
      definition.
      
      2)
      On x86_64, unsigned 32-bit data which are mixed with pointers via
      array indexing or via offsets added to or subtracted from pointers
      are preferred to signed 32-bit data.
      
      "int" being used as an array index needs to be sign-extended
      to 64-bit before being used.
      
      	void f(long *p, int i)
      	{
      		g(p[i]);
      	}
      
        roughly translates to
      
      	movsx	rsi, esi
      	mov	rdi, [rsi+...]
      	call 	g
      
      MOVSX is a 3-byte instruction which isn't necessary if the variable is
      unsigned, because x86_64 zero-extends by default.
      
      Now, there is the net_generic() function which, you guessed it right,
      uses "int" as an array index:
      
      	static inline void *net_generic(const struct net *net, int id)
      	{
      		...
      		ptr = ng->ptr[id - 1];
      		...
      	}
      
      And this function is used a lot, so those sign extensions add up.
      
      The patch snips ~1730 bytes off an allyesconfig kernel (without all the
      junk messing with code generation):
      
      	add/remove: 0/0 grow/shrink: 70/598 up/down: 396/-2126 (-1730)
      
      Unfortunately, some functions actually grow bigger.
      This is a seemingly random artefact of code generation, with the
      register allocator being used differently: gcc decides that some
      variable needs to live in the new r8+ registers, so every access now
      requires a REX prefix. Or it is shifted into r12, so the [r12+0]
      addressing mode has to be used, which is longer than [r8].
      
      However, the overall balance is in the negative direction:
      
      	add/remove: 0/0 grow/shrink: 70/598 up/down: 396/-2126 (-1730)
      	function                                     old     new   delta
      	nfsd4_lock                                  3886    3959     +73
      	tipc_link_build_proto_msg                   1096    1140     +44
      	mac80211_hwsim_new_radio                    2776    2808     +32
      	tipc_mon_rcv                                1032    1058     +26
      	svcauth_gss_legacy_init                     1413    1429     +16
      	tipc_bcbase_select_primary                   379     392     +13
      	nfsd4_exchange_id                           1247    1260     +13
      	nfsd4_setclientid_confirm                    782     793     +11
      		...
      	put_client_renew_locked                      494     480     -14
      	ip_set_sockfn_get                            730     716     -14
      	geneve_sock_add                              829     813     -16
      	nfsd4_sequence_done                          721     703     -18
      	nlmclnt_lookup_host                          708     686     -22
      	nfsd4_lockt                                 1085    1063     -22
      	nfs_get_client                              1077    1050     -27
      	tcf_bpf_init                                1106    1076     -30
      	nfsd4_encode_fattr                          5997    5930     -67
      	Total: Before=154856051, After=154854321, chg -0.00%
      Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c7d03a00
    • udp: enable busy polling for all sockets · e68b6e50
      Authored by Eric Dumazet
      UDP busy polling is restricted to connected UDP sockets.
      
      This is because sk_busy_loop() only takes care of one NAPI context.
      
      There are cases where it could be extended.
      
      1) Some hosts receive traffic on a single NIC, with one RX queue.
      
      2) Some applications use SO_REUSEPORT and an associated BPF filter
         to split the incoming traffic onto one UDP socket per RX
         queue/thread/cpu.

      3) Some UDP sockets are used to send/receive traffic for one flow, but
         they do not bother with connect().
      
      This patch records the napi_id of the first received skb, giving busy
      polling more reach.
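
      The recording step amounts to a one-time mark on the first received
      skb, roughly (a sketch; treat the helper name as illustrative):

      	/* include/net/busy_poll.h, for unconnected sockets */
      	static inline void sk_mark_napi_id_once(struct sock *sk,
      						const struct sk_buff *skb)
      	{
      	#ifdef CONFIG_NET_RX_BUSY_POLL
      		if (!sk->sk_napi_id)
      			sk->sk_napi_id = skb->napi_id;
      	#endif
      	}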
      
      Tested:
      
      lpaa23:~# echo 70 >/proc/sys/net/core/busy_read
      lpaa24:~# echo 70 >/proc/sys/net/core/busy_read
      
      lpaa23:~# for f in `seq 1 10`; do ./super_netperf 1 -H lpaa24 -t UDP_RR -l 5; done
      
      Before patch :
         27867   28870   37324   41060   41215
         36764   36838   44455   41282   43843
      After patch :
         73920   73213   70147   74845   71697
         68315   68028   75219   70082   73707
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e68b6e50