1. 04 4月, 2019 1 次提交
  2. 02 4月, 2019 1 次提交
  3. 30 3月, 2019 8 次提交
  4. 28 3月, 2019 3 次提交
    • E
      inet: switch IP ID generator to siphash · df453700
      Eric Dumazet 提交于
      According to Amit Klein and Benny Pinkas, IP ID generation is too weak
      and might be used by attackers.
      
      Even with recent net_hash_mix() fix (netns: provide pure entropy for net_hash_mix())
      having 64bit key and Jenkins hash is risky.
      
      It is time to switch to siphash and its 128bit keys.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NAmit Klein <aksecurity@gmail.com>
      Reported-by: NBenny Pinkas <benny@pinkas.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      df453700
    • E
      tcp: fix zerocopy and notsent_lowat issues · 4f661542
      Eric Dumazet 提交于
      My recent patch had at least three problems :
      
      1) TX zerocopy wants notification when skb is acknowledged,
         thus we need to call skb_zcopy_clear() if the skb is
         cached into sk->sk_tx_skb_cache
      
      2) Some applications might expect precise EPOLLOUT
         notifications, so we need to update sk->sk_wmem_queued
         and call sk_mem_uncharge() from sk_wmem_free_skb()
         in all cases. The SOCK_QUEUE_SHRUNK flag must also be set.
      
      3) Reuse of saved skb should have used skb_cloned() instead
        of simply checking if the fast clone has been freed.
      
      Fixes: 472c2e07 ("tcp: add one skb cache for tx")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Cc: Soheil Hassas Yeganeh <soheil@google.com>
      Acked-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Tested-by: NHolger Hoffstätte <holger@applied-asynchrony.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4f661542
    • K
      fou: Support binding FoU socket · 1713cb37
      Kristian Evensen 提交于
      An FoU socket is currently bound to the wildcard-address. While this
      works fine, there are several use-cases where the use of the
      wildcard-address is not desirable. For example, I use FoU on some
      multi-homed servers and would like to use FoU on only one of the
      interfaces.
      
      This commit adds support for binding FoU sockets to a given source
      address/interface, as well as connecting the socket to a given
      destination address/port. udp_tunnel already provides the required
      infrastructure, so most of the code added is for exposing and setting
      the different attributes (local address, peer address, etc.).
      
      The lookups performed when we add, delete or get an FoU-socket has also
      been updated to compare all the attributes a user can set. Since the
      comparison now involves several elements, I have added a separate
      comparison-function instead of open-coding.
      
      In order to test the code and ensure that the new comparison code works
      correctly, I started by creating a wildcard socket bound to port 1234 on
      my machine. I then tried to create a non-wildcarded socket bound to the
      same port, as well as fetching and deleting the socket (including source
      address, peer address or interface index in the netlink request).  Both
      the create, fetch and delete request failed. Deleting/fetching the
      socket was only successful when my netlink request attributes matched
      those used to create the socket.
      
      I then repeated the tests, but with a socket bound to a local ip
      address, a socket bound to a local address + interface, and a bound
      socket that was also «connected» to a peer. Add only worked when no
      socket with the matching source address/interface (or wildcard) existed,
      while fetch/delete was only successful when all attributes matched.
      
      In addition to testing that the new code work, I also checked that the
      current behavior is kept. If none of the new attributes are provided,
      then an FoU-socket is configured as before (i.e., wildcarded).  If any
      of the new attributes are provided, the FoU-socket is configured as
      expected.
      
      v1->v2:
      * Fixed building with IPv6 disabled (kbuild).
      * Fixed a return type warning and make the ugly comparison function more
      readable (kbuild).
      * Describe more in detail what has been tested (thanks David Miller).
      * Make peer port required if peer address is specified.
      Signed-off-by: NKristian Evensen <kristian.evensen@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1713cb37
  5. 24 3月, 2019 3 次提交
    • E
      tcp: add one skb cache for rx · 8b27dae5
      Eric Dumazet 提交于
      Often times, recvmsg() system calls and BH handling for a particular
      TCP socket are done on different cpus.
      
      This means the incoming skb had to be allocated on a cpu,
      but freed on another.
      
      This incurs a high spinlock contention in slab layer for small rpc,
      but also a high number of cache line ping pongs for larger packets.
      
      A full size GRO packet might use 45 page fragments, meaning
      that up to 45 put_page() can be involved.
      
      More over performing the __kfree_skb() in the recvmsg() context
      adds a latency for user applications, and increase probability
      of trapping them in backlog processing, since the BH handler
      might found the socket owned by the user.
      
      This patch, combined with the prior one increases the rpc
      performance by about 10 % on servers with large number of cores.
      
      (tcp_rr workload with 10,000 flows and 112 threads reach 9 Mpps
       instead of 8 Mpps)
      
      This also increases single bulk flow performance on 40Gbit+ links,
      since in this case there are often two cpus working in tandem :
      
       - CPU handling the NIC rx interrupts, feeding the receive queue,
        and (after this patch) freeing the skbs that were consumed.
      
       - CPU in recvmsg() system call, essentially 100 % busy copying out
        data to user space.
      
      Having at most one skb in a per-socket cache has very little risk
      of memory exhaustion, and since it is protected by socket lock,
      its management is essentially free.
      
      Note that if rps/rfs is used, we do not enable this feature, because
      there is high chance that the same cpu is handling both the recvmsg()
      system call and the TCP rx path, but that another cpu did the skb
      allocations in the device driver right before the RPS/RFS logic.
      
      To properly handle this case, it seems we would need to record
      on which cpu skb was allocated, and use a different channel
      to give skbs back to this cpu.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Acked-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8b27dae5
    • E
      tcp: add one skb cache for tx · 472c2e07
      Eric Dumazet 提交于
      On hosts with a lot of cores, RPC workloads suffer from heavy contention on slab spinlocks.
      
          20.69%  [kernel]       [k] queued_spin_lock_slowpath
           5.64%  [kernel]       [k] _raw_spin_lock
           3.83%  [kernel]       [k] syscall_return_via_sysret
           3.48%  [kernel]       [k] __entry_text_start
           1.76%  [kernel]       [k] __netif_receive_skb_core
           1.64%  [kernel]       [k] __fget
      
      For each sendmsg(), we allocate one skb, and free it at the time ACK packet comes.
      
      In many cases, ACK packets are handled by another cpus, and this unfortunately
      incurs heavy costs for slab layer.
      
      This patch uses an extra pointer in socket structure, so that we try to reuse
      the same skb and avoid these expensive costs.
      
      We cache at most one skb per socket so this should be safe as far as
      memory pressure is concerned.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Acked-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      472c2e07
    • E
      tcp: remove conditional branches from tcp_mstamp_refresh() · e6d14070
      Eric Dumazet 提交于
      tcp_clock_ns() (aka ktime_get_ns()) is using monotonic clock,
      so the checks we had in tcp_mstamp_refresh() are no longer
      relevant.
      
      This patch removes cpu stall (when the cache line is not hot)
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e6d14070
  6. 22 3月, 2019 3 次提交
    • J
      genetlink: make policy common to family · 3b0f31f2
      Johannes Berg 提交于
      Since maxattr is common, the policy can't really differ sanely,
      so make it common as well.
      
      The only user that did in fact manage to make a non-common policy
      is taskstats, which has to be really careful about it (since it's
      still using a common maxattr!). This is no longer supported, but
      we can fake it using pre_doit.
      
      This reduces the size of e.g. nl80211.o (which has lots of commands):
      
         text	   data	    bss	    dec	    hex	filename
       398745	  14323	   2240	 415308	  6564c	net/wireless/nl80211.o (before)
       397913	  14331	   2240	 414484	  65314	net/wireless/nl80211.o (after)
      --------------------------------
         -832      +8       0    -824
      
      Which is obviously just 8 bytes for each command, and an added 8
      bytes for the new policy pointer. I'm not sure why the ops list is
      counted as .text though.
      
      Most of the code transformations were done using the following spatch:
          @ops@
          identifier OPS;
          expression POLICY;
          @@
          struct genl_ops OPS[] = {
          ...,
           {
          -	.policy = POLICY,
           },
          ...
          };
      
          @@
          identifier ops.OPS;
          expression ops.POLICY;
          identifier fam;
          expression M;
          @@
          struct genl_family fam = {
                  .ops = OPS,
                  .maxattr = M,
          +       .policy = POLICY,
                  ...
          };
      
      This also gets rid of devlink_nl_cmd_region_read_dumpit() accessing
      the cb->data as ops, which we want to change in a later genl patch.
      Signed-off-by: NJohannes Berg <johannes.berg@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3b0f31f2
    • J
      net: dst: remove gc leftovers · 02afc7ad
      Julian Wiedmann 提交于
      Get rid of some obsolete gc-related documentation and macros that were
      missed in commit 5b7c9a8f ("net: remove dst gc related code").
      
      CC: Wei Wang <weiwan@google.com>
      Signed-off-by: NJulian Wiedmann <jwi@linux.ibm.com>
      Acked-by: NWei Wang <weiwan@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      02afc7ad
    • D
      ipv4: Allow amount of dirty memory from fib resizing to be controllable · 9ab948a9
      David Ahern 提交于
      fib_trie implementation calls synchronize_rcu when a certain amount of
      pages are dirty from freed entries. The number of pages was determined
      experimentally in 2009 (commit c3059477).
      
      At the current setting, synchronize_rcu is called often -- 51 times in a
      second in one test with an average of an 8 msec delay adding a fib entry.
      The total impact is a lot of slow down modifying the fib. This is seen
      in the output of 'time' - the difference between real time and sys+user.
      For example, using 720,022 single path routes and 'ip -batch'[1]:
      
          $ time ./ip -batch ipv4/routes-1-hops
          real    0m14.214s
          user    0m2.513s
          sys     0m6.783s
      
      So roughly 35% of the actual time to install the routes is from the ip
      command getting scheduled out, most notably due to synchronize_rcu (this
      is observed using 'perf sched timehist').
      
      This patch makes the amount of dirty memory configurable between 64k where
      the synchronize_rcu is called often (small, low end systems that are memory
      sensitive) to 64M where synchronize_rcu is called rarely during a large
      FIB change (for high end systems with lots of memory). The default is 512kB
      which corresponds to the current setting of 128 pages with a 4kB page size.
      
      As an example, at 16MB the worst interval shows 4 calls to synchronize_rcu
      in a second blocking for up to 30 msec in a single instance, and a total
      of almost 100 msec across the 4 calls in the second. The trade off is
      allowing FIB entries to consume more memory in a given time window but
      but with much better fib insertion rates (~30% increase in prefixes/sec).
      With this patch and net.ipv4.fib_sync_mem set to 16MB, the same batch
      file runs in:
      
          $ time ./ip -batch ipv4/routes-1-hops
          real    0m9.692s
          user    0m2.491s
          sys     0m6.769s
      
      So the dead time is reduced to about 1/2 second or <5% of the real time.
      
      [1] 'ip' modified to not request ACK messages which improves route
          insertion times by about 20%
      Signed-off-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9ab948a9
  7. 20 3月, 2019 1 次提交
    • G
      tcp: free request sock directly upon TFO or syncookies error · 9403cf23
      Guillaume Nault 提交于
      Since the request socket is created locally, it'd make more sense to
      use reqsk_free() instead of reqsk_put() in TFO and syncookies' error
      path.
      
      However, tcp_get_cookie_sock() may set ->rsk_refcnt before freeing the
      socket; tcp_conn_request() may also have non-null ->rsk_refcnt because
      of tcp_try_fastopen(). In both cases 'req' hasn't been exposed
      to the outside world and is safe to free immediately, but that'd
      trigger the WARN_ON_ONCE in reqsk_free().
      
      Define __reqsk_free() for these situations where we know nobody's
      referencing the socket, even though ->rsk_refcnt might be non-null.
      Now we can consolidate the error path of tcp_get_cookie_sock() and
      tcp_conn_request().
      Signed-off-by: NGuillaume Nault <gnault@redhat.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9403cf23
  8. 12 3月, 2019 1 次提交
    • C
      tcp: Don't access TCP_SKB_CB before initializing it · f2feaefd
      Christoph Paasch 提交于
      Since commit eeea10b8 ("tcp: add
      tcp_v4_fill_cb()/tcp_v4_restore_cb()"), tcp_vX_fill_cb is only called
      after tcp_filter(). That means, TCP_SKB_CB(skb)->end_seq still points to
      the IP-part of the cb.
      
      We thus should not mock with it, as this can trigger bugs (thanks
      syzkaller):
      [   12.349396] ==================================================================
      [   12.350188] BUG: KASAN: slab-out-of-bounds in ip6_datagram_recv_specific_ctl+0x19b3/0x1a20
      [   12.351035] Read of size 1 at addr ffff88006adbc208 by task test_ip6_datagr/1799
      
      Setting end_seq is actually no more necessary in tcp_filter as it gets
      initialized later on in tcp_vX_fill_cb.
      
      Cc: Eric Dumazet <edumazet@google.com>
      Fixes: eeea10b8 ("tcp: add tcp_v4_fill_cb()/tcp_v4_restore_cb()")
      Signed-off-by: NChristoph Paasch <cpaasch@apple.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f2feaefd
  9. 09 3月, 2019 3 次提交
    • G
      tcp: handle inet_csk_reqsk_queue_add() failures · 9d3e1368
      Guillaume Nault 提交于
      Commit 7716682c ("tcp/dccp: fix another race at listener
      dismantle") let inet_csk_reqsk_queue_add() fail, and adjusted
      {tcp,dccp}_check_req() accordingly. However, TFO and syncookies
      weren't modified, thus leaking allocated resources on error.
      
      Contrary to tcp_check_req(), in both syncookies and TFO cases,
      we need to drop the request socket. Also, since the child socket is
      created with inet_csk_clone_lock(), we have to unlock it and drop an
      extra reference (->sk_refcount is initially set to 2 and
      inet_csk_reqsk_queue_add() drops only one ref).
      
      For TFO, we also need to revert the work done by tcp_try_fastopen()
      (with reqsk_fastopen_remove()).
      
      Fixes: 7716682c ("tcp/dccp: fix another race at listener dismantle")
      Signed-off-by: NGuillaume Nault <gnault@redhat.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9d3e1368
    • E
      fou, fou6: avoid uninit-value in gue_err() and gue6_err() · 5355ed63
      Eric Dumazet 提交于
      My prior commit missed the fact that these functions
      were using udp_hdr() (aka skb_transport_header())
      to get access to GUE header.
      
      Since pskb_transport_may_pull() does not exist yet, we have to add
      transport_offset to our pskb_may_pull() calls.
      
      BUG: KMSAN: uninit-value in gue_err+0x514/0xfa0 net/ipv4/fou.c:1032
      CPU: 1 PID: 10648 Comm: syz-executor.1 Not tainted 5.0.0+ #11
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       <IRQ>
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x173/0x1d0 lib/dump_stack.c:113
       kmsan_report+0x12e/0x2a0 mm/kmsan/kmsan.c:600
       __msan_warning+0x82/0xf0 mm/kmsan/kmsan_instr.c:313
       gue_err+0x514/0xfa0 net/ipv4/fou.c:1032
       __udp4_lib_err_encap_no_sk net/ipv4/udp.c:571 [inline]
       __udp4_lib_err_encap net/ipv4/udp.c:626 [inline]
       __udp4_lib_err+0x12e6/0x1d40 net/ipv4/udp.c:665
       udp_err+0x74/0x90 net/ipv4/udp.c:737
       icmp_socket_deliver net/ipv4/icmp.c:767 [inline]
       icmp_unreach+0xb65/0x1070 net/ipv4/icmp.c:884
       icmp_rcv+0x11a1/0x1950 net/ipv4/icmp.c:1066
       ip_protocol_deliver_rcu+0x584/0xbb0 net/ipv4/ip_input.c:208
       ip_local_deliver_finish net/ipv4/ip_input.c:234 [inline]
       NF_HOOK include/linux/netfilter.h:289 [inline]
       ip_local_deliver+0x624/0x7b0 net/ipv4/ip_input.c:255
       dst_input include/net/dst.h:450 [inline]
       ip_rcv_finish net/ipv4/ip_input.c:414 [inline]
       NF_HOOK include/linux/netfilter.h:289 [inline]
       ip_rcv+0x6bd/0x740 net/ipv4/ip_input.c:524
       __netif_receive_skb_one_core net/core/dev.c:4973 [inline]
       __netif_receive_skb net/core/dev.c:5083 [inline]
       process_backlog+0x756/0x10e0 net/core/dev.c:5923
       napi_poll net/core/dev.c:6346 [inline]
       net_rx_action+0x78b/0x1a60 net/core/dev.c:6412
       __do_softirq+0x53f/0x93a kernel/softirq.c:293
       invoke_softirq kernel/softirq.c:375 [inline]
       irq_exit+0x214/0x250 kernel/softirq.c:416
       exiting_irq+0xe/0x10 arch/x86/include/asm/apic.h:536
       smp_apic_timer_interrupt+0x48/0x70 arch/x86/kernel/apic/apic.c:1064
       apic_timer_interrupt+0x2e/0x40 arch/x86/entry/entry_64.S:814
       </IRQ>
      RIP: 0010:finish_lock_switch+0x2b/0x40 kernel/sched/core.c:2597
      Code: 48 89 e5 53 48 89 fb e8 63 e7 95 00 8b b8 88 0c 00 00 48 8b 00 48 85 c0 75 12 48 89 df e8 dd db 95 00 c6 00 00 c6 03 00 fb 5b <5d> c3 e8 4e e6 95 00 eb e7 66 90 66 2e 0f 1f 84 00 00 00 00 00 55
      RSP: 0018:ffff888081a0fc80 EFLAGS: 00000296 ORIG_RAX: ffffffffffffff13
      RAX: ffff88821fd6bd80 RBX: ffff888027898000 RCX: ccccccccccccd000
      RDX: ffff88821fca8d80 RSI: ffff888000000000 RDI: 00000000000004a0
      RBP: ffff888081a0fc80 R08: 0000000000000002 R09: ffff888081a0fb08
      R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000001
      R13: ffff88811130e388 R14: ffff88811130da00 R15: ffff88812fdb7d80
       finish_task_switch+0xfc/0x2d0 kernel/sched/core.c:2698
       context_switch kernel/sched/core.c:2851 [inline]
       __schedule+0x6cc/0x800 kernel/sched/core.c:3491
       schedule+0x15b/0x240 kernel/sched/core.c:3535
       freezable_schedule include/linux/freezer.h:172 [inline]
       do_nanosleep+0x2ba/0x980 kernel/time/hrtimer.c:1679
       hrtimer_nanosleep kernel/time/hrtimer.c:1733 [inline]
       __do_sys_nanosleep kernel/time/hrtimer.c:1767 [inline]
       __se_sys_nanosleep+0x746/0x960 kernel/time/hrtimer.c:1754
       __x64_sys_nanosleep+0x3e/0x60 kernel/time/hrtimer.c:1754
       do_syscall_64+0xbc/0xf0 arch/x86/entry/common.c:291
       entry_SYSCALL_64_after_hwframe+0x63/0xe7
      RIP: 0033:0x4855a0
      Code: 00 00 48 c7 c0 d4 ff ff ff 64 c7 00 16 00 00 00 31 c0 eb be 66 0f 1f 44 00 00 83 3d b1 11 5d 00 00 75 14 b8 23 00 00 00 0f 05 <48> 3d 01 f0 ff ff 0f 83 04 e2 f8 ff c3 48 83 ec 08 e8 3a 55 fd ff
      RSP: 002b:0000000000a4fd58 EFLAGS: 00000246 ORIG_RAX: 0000000000000023
      RAX: ffffffffffffffda RBX: 0000000000085780 RCX: 00000000004855a0
      RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000a4fd60
      RBP: 00000000000007ec R08: 0000000000000001 R09: 0000000000ceb940
      R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000008
      R13: 0000000000a4fdb0 R14: 0000000000085711 R15: 0000000000a4fdc0
      
      Uninit was created at:
       kmsan_save_stack_with_flags mm/kmsan/kmsan.c:205 [inline]
       kmsan_internal_poison_shadow+0x92/0x150 mm/kmsan/kmsan.c:159
       kmsan_kmalloc+0xa6/0x130 mm/kmsan/kmsan_hooks.c:176
       kmsan_slab_alloc+0xe/0x10 mm/kmsan/kmsan_hooks.c:185
       slab_post_alloc_hook mm/slab.h:445 [inline]
       slab_alloc_node mm/slub.c:2773 [inline]
       __kmalloc_node_track_caller+0xe9e/0xff0 mm/slub.c:4398
       __kmalloc_reserve net/core/skbuff.c:140 [inline]
       __alloc_skb+0x309/0xa20 net/core/skbuff.c:208
       alloc_skb include/linux/skbuff.h:1012 [inline]
       alloc_skb_with_frags+0x186/0xa60 net/core/skbuff.c:5287
       sock_alloc_send_pskb+0xafd/0x10a0 net/core/sock.c:2091
       sock_alloc_send_skb+0xca/0xe0 net/core/sock.c:2108
       __ip_append_data+0x34cd/0x5000 net/ipv4/ip_output.c:998
       ip_append_data+0x324/0x480 net/ipv4/ip_output.c:1220
       icmp_push_reply+0x23d/0x7e0 net/ipv4/icmp.c:375
       __icmp_send+0x2ea3/0x30f0 net/ipv4/icmp.c:737
       icmp_send include/net/icmp.h:47 [inline]
       ipv4_link_failure+0x6d/0x230 net/ipv4/route.c:1190
       dst_link_failure include/net/dst.h:427 [inline]
       arp_error_report+0x106/0x1a0 net/ipv4/arp.c:297
       neigh_invalidate+0x359/0x8e0 net/core/neighbour.c:992
       neigh_timer_handler+0xdf2/0x1280 net/core/neighbour.c:1078
       call_timer_fn+0x285/0x600 kernel/time/timer.c:1325
       expire_timers kernel/time/timer.c:1362 [inline]
       __run_timers+0xdb4/0x11d0 kernel/time/timer.c:1681
       run_timer_softirq+0x2e/0x50 kernel/time/timer.c:1694
       __do_softirq+0x53f/0x93a kernel/softirq.c:293
      
      Fixes: 26fc181e ("fou, fou6: do not assume linear skbs")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Cc: Stefano Brivio <sbrivio@redhat.com>
      Cc: Sabrina Dubroca <sd@queasysnail.net>
      Acked-by: NStefano Brivio <sbrivio@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5355ed63
    • X
      route: set the deleted fnhe fnhe_daddr to 0 in ip_del_fnhe to fix a race · ee60ad21
      Xin Long 提交于
      The race occurs in __mkroute_output() when 2 threads lookup a dst:
      
        CPU A                 CPU B
        find_exception()
                              find_exception() [fnhe expires]
                              ip_del_fnhe() [fnhe is deleted]
        rt_bind_exception()
      
      In rt_bind_exception() it will bind a deleted fnhe with the new dst, and
      this dst will get no chance to be freed. It causes a dev defcnt leak and
      consecutive dmesg warnings:
      
        unregister_netdevice: waiting for ethX to become free. Usage count = 1
      
      Especially thanks Jon to identify the issue.
      
      This patch fixes it by setting fnhe_daddr to 0 in ip_del_fnhe() to stop
      binding the deleted fnhe with a new dst when checking fnhe's fnhe_daddr
      and daddr in rt_bind_exception().
      
      It works as both ip_del_fnhe() and rt_bind_exception() are protected by
      fnhe_lock and the fhne is freed by kfree_rcu().
      
      Fixes: deed49df ("route: check and remove route cache when we get route")
      Signed-off-by: NJon Maxwell <jmaxwell37@gmail.com>
      Signed-off-by: NXin Long <lucien.xin@gmail.com>
      Reviewed-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ee60ad21
  10. 07 3月, 2019 4 次提交
    • S
      tcp: do not report TCP_CM_INQ of 0 for closed connections · 6466e715
      Soheil Hassas Yeganeh 提交于
      Returning 0 as inq to userspace indicates there is no more data to
      read, and the application needs to wait for EPOLLIN. For a connection
      that has received FIN from the remote peer, however, the application
      must continue reading until getting EOF (return value of 0
      from tcp_recvmsg) or an error, if edge-triggered epoll (EPOLLET) is
      being used. Otherwise, the application will never receive a new
      EPOLLIN, since there is no epoll edge after the FIN.
      
      Return 1 when there is no data left on the queue but the
      connection has received FIN, so that the applications continue
      reading.
      
      Fixes: b75eba76 (tcp: send in-queue bytes in cmsg upon read)
      Signed-off-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6466e715
    • V
      tcp: detecting the misuse of .sendpage for Slab objects · a10674bf
      Vasily Averin 提交于
      sendpage was not designed for processing of the Slab pages,
      in some situations it can trigger BUG_ON on receiving side.
      Signed-off-by: NVasily Averin <vvs@virtuozzo.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a10674bf
    • A
      iptunnel: NULL pointer deref for ip_md_tunnel_xmit · f4b3ec4e
      Alan Maguire 提交于
      Naresh Kamboju noted the following oops during execution of selftest
      tools/testing/selftests/bpf/test_tunnel.sh on x86_64:
      
      [  274.120445] BUG: unable to handle kernel NULL pointer dereference
      at 0000000000000000
      [  274.128285] #PF error: [INSTR]
      [  274.131351] PGD 8000000414a0e067 P4D 8000000414a0e067 PUD 3b6334067 PMD 0
      [  274.138241] Oops: 0010 [#1] SMP PTI
      [  274.141734] CPU: 1 PID: 11464 Comm: ping Not tainted
      5.0.0-rc4-next-20190129 #1
      [  274.149046] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS
      2.0b 07/27/2017
      [  274.156526] RIP: 0010:          (null)
      [  274.160280] Code: Bad RIP value.
      [  274.163509] RSP: 0018:ffffbc9681f83540 EFLAGS: 00010286
      [  274.168726] RAX: 0000000000000000 RBX: ffffdc967fa80a18 RCX: 0000000000000000
      [  274.175851] RDX: ffff9db2ee08b540 RSI: 000000000000000e RDI: ffffdc967fa809a0
      [  274.182974] RBP: ffffbc9681f83580 R08: ffff9db2c4d62690 R09: 000000000000000c
      [  274.190098] R10: 0000000000000000 R11: ffff9db2ee08b540 R12: ffff9db31ce7c000
      [  274.197222] R13: 0000000000000001 R14: 000000000000000c R15: ffff9db3179cf400
      [  274.204346] FS:  00007ff4ae7c5740(0000) GS:ffff9db31fa80000(0000)
      knlGS:0000000000000000
      [  274.212424] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  274.218162] CR2: ffffffffffffffd6 CR3: 00000004574da004 CR4: 00000000003606e0
      [  274.225292] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  274.232416] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [  274.239541] Call Trace:
      [  274.241988]  ? tnl_update_pmtu+0x296/0x3b0
      [  274.246085]  ip_md_tunnel_xmit+0x1bc/0x520
      [  274.250176]  gre_fb_xmit+0x330/0x390
      [  274.253754]  gre_tap_xmit+0x128/0x180
      [  274.257414]  dev_hard_start_xmit+0xb7/0x300
      [  274.261598]  sch_direct_xmit+0xf6/0x290
      [  274.265430]  __qdisc_run+0x15d/0x5e0
      [  274.269007]  __dev_queue_xmit+0x2c5/0xc00
      [  274.273011]  ? dev_queue_xmit+0x10/0x20
      [  274.276842]  ? eth_header+0x2b/0xc0
      [  274.280326]  dev_queue_xmit+0x10/0x20
      [  274.283984]  ? dev_queue_xmit+0x10/0x20
      [  274.287813]  arp_xmit+0x1a/0xf0
      [  274.290952]  arp_send_dst.part.19+0x46/0x60
      [  274.295138]  arp_solicit+0x177/0x6b0
      [  274.298708]  ? mod_timer+0x18e/0x440
      [  274.302281]  neigh_probe+0x57/0x70
      [  274.305684]  __neigh_event_send+0x197/0x2d0
      [  274.309862]  neigh_resolve_output+0x18c/0x210
      [  274.314212]  ip_finish_output2+0x257/0x690
      [  274.318304]  ip_finish_output+0x219/0x340
      [  274.322314]  ? ip_finish_output+0x219/0x340
      [  274.326493]  ip_output+0x76/0x240
      [  274.329805]  ? ip_fragment.constprop.53+0x80/0x80
      [  274.334510]  ip_local_out+0x3f/0x70
      [  274.337992]  ip_send_skb+0x19/0x40
      [  274.341391]  ip_push_pending_frames+0x33/0x40
      [  274.345740]  raw_sendmsg+0xc15/0x11d0
      [  274.349403]  ? __might_fault+0x85/0x90
      [  274.353151]  ? _copy_from_user+0x6b/0xa0
      [  274.357070]  ? rw_copy_check_uvector+0x54/0x130
      [  274.361604]  inet_sendmsg+0x42/0x1c0
      [  274.365179]  ? inet_sendmsg+0x42/0x1c0
      [  274.368937]  sock_sendmsg+0x3e/0x50
      [  274.372460]  ___sys_sendmsg+0x26f/0x2d0
      [  274.376293]  ? lock_acquire+0x95/0x190
      [  274.380043]  ? __handle_mm_fault+0x7ce/0xb70
      [  274.384307]  ? lock_acquire+0x95/0x190
      [  274.388053]  ? __audit_syscall_entry+0xdd/0x130
      [  274.392586]  ? ktime_get_coarse_real_ts64+0x64/0xc0
      [  274.397461]  ? __audit_syscall_entry+0xdd/0x130
      [  274.401989]  ? trace_hardirqs_on+0x4c/0x100
      [  274.406173]  __sys_sendmsg+0x63/0xa0
      [  274.409744]  ? __sys_sendmsg+0x63/0xa0
      [  274.413488]  __x64_sys_sendmsg+0x1f/0x30
      [  274.417405]  do_syscall_64+0x55/0x190
      [  274.421064]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      [  274.426113] RIP: 0033:0x7ff4ae0e6e87
      [  274.429686] Code: 64 89 02 48 c7 c0 ff ff ff ff eb b9 0f 1f 80 00
      00 00 00 8b 05 ca d9 2b 00 48 63 d2 48 63 ff 85 c0 75 10 b8 2e 00 00
      00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 53 48 89 f3 48 83 ec 10 48 89 7c
      24 08
      [  274.448422] RSP: 002b:00007ffcd9b76db8 EFLAGS: 00000246 ORIG_RAX:
      000000000000002e
      [  274.455978] RAX: ffffffffffffffda RBX: 0000000000000040 RCX: 00007ff4ae0e6e87
      [  274.463104] RDX: 0000000000000000 RSI: 00000000006092e0 RDI: 0000000000000003
      [  274.470228] RBP: 0000000000000000 R08: 00007ffcd9bc40a0 R09: 00007ffcd9bc4080
      [  274.477349] R10: 000000000000060a R11: 0000000000000246 R12: 0000000000000003
      [  274.484475] R13: 0000000000000016 R14: 00007ffcd9b77fa0 R15: 00007ffcd9b78da4
      [  274.491602] Modules linked in: cls_bpf sch_ingress iptable_filter
      ip_tables algif_hash af_alg x86_pkg_temp_thermal fuse [last unloaded:
      test_bpf]
      [  274.504634] CR2: 0000000000000000
      [  274.507976] ---[ end trace 196d18386545eae1 ]---
      [  274.512588] RIP: 0010:          (null)
      [  274.516334] Code: Bad RIP value.
      [  274.519557] RSP: 0018:ffffbc9681f83540 EFLAGS: 00010286
      [  274.524775] RAX: 0000000000000000 RBX: ffffdc967fa80a18 RCX: 0000000000000000
      [  274.531921] RDX: ffff9db2ee08b540 RSI: 000000000000000e RDI: ffffdc967fa809a0
      [  274.539082] RBP: ffffbc9681f83580 R08: ffff9db2c4d62690 R09: 000000000000000c
      [  274.546205] R10: 0000000000000000 R11: ffff9db2ee08b540 R12: ffff9db31ce7c000
      [  274.553329] R13: 0000000000000001 R14: 000000000000000c R15: ffff9db3179cf400
      [  274.560456] FS:  00007ff4ae7c5740(0000) GS:ffff9db31fa80000(0000)
      knlGS:0000000000000000
      [  274.568541] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  274.574277] CR2: ffffffffffffffd6 CR3: 00000004574da004 CR4: 00000000003606e0
      [  274.581403] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  274.588535] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [  274.595658] Kernel panic - not syncing: Fatal exception in interrupt
      [  274.602046] Kernel Offset: 0x14400000 from 0xffffffff81000000
      (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
      [  274.612827] ---[ end Kernel panic - not syncing: Fatal exception in
      interrupt ]---
      [  274.620387] ------------[ cut here ]------------
      
      I'm also seeing the same failure on x86_64, and it reproduces
      consistently.
      
      >From poking around it looks like the skb's dst entry is being used
      to calculate the mtu in:
      
      mtu = skb_dst(skb) ? dst_mtu(skb_dst(skb)) : dev->mtu;
      
      ...but because that dst_entry  has an "ops" value set to md_dst_ops,
      the various ops (including mtu) are not set:
      
      crash> struct sk_buff._skb_refdst ffff928f87447700 -x
            _skb_refdst = 0xffffcd6fbf5ea590
      crash> struct dst_entry.ops 0xffffcd6fbf5ea590
        ops = 0xffffffffa0193800
      crash> struct dst_ops.mtu 0xffffffffa0193800
        mtu = 0x0
      crash>
      
      I confirmed that the dst entry also has dst->input set to
      dst_md_discard, so it looks like it's an entry that's been
      initialized via __metadata_dst_init alright.
      
      I think the fix here is to use skb_valid_dst(skb) - it checks
      for  DST_METADATA also, and with that fix in place, the
      problem - which was previously 100% reproducible - disappears.
      
      The below patch resolves the panic and all bpf tunnel tests pass
      without incident.
      
      Fixes: c8b34e68 ("ip_tunnel: Add tnl_update_pmtu in ip_md_tunnel_xmit")
      Reported-by: NNaresh Kamboju <naresh.kamboju@linaro.org>
      Signed-off-by: NAlan Maguire <alan.maguire@oracle.com>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Tested-by: NAnders Roxell <anders.roxell@linaro.org>
      Reported-by: NNicolas Dichtel <nicolas.dichtel@6wind.com>
      Tested-by: NNicolas Dichtel <nicolas.dichtel@6wind.com>
      Acked-by: NNicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f4b3ec4e
    • P
      ipv4/route: fail early when inet dev is missing · 22c74764
      Paolo Abeni 提交于
      If a non local multicast packet reaches ip_route_input_rcu() while
      the ingress device IPv4 private data (in_dev) is NULL, we end up
      doing a NULL pointer dereference in IN_DEV_MFORWARD().
      
      Since the later call to ip_route_input_mc() is going to fail if
      !in_dev, we can fail early in such scenario and avoid the dangerous
      code path.
      
      v1 -> v2:
       - clarified the commit message, no code changes
      Reported-by: NTianhao Zhao <tizhao@redhat.com>
      Fixes: e58e4159 ("net: Enable support for VRF with ipv4 multicast")
      Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
      Reviewed-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      22c74764
  11. 05 3月, 2019 1 次提交
  12. 02 3月, 2019 2 次提交
  13. 01 3月, 2019 3 次提交
  14. 28 2月, 2019 3 次提交
  15. 27 2月, 2019 3 次提交
    • F
      netfilter: nat: remove nf_nat_l3proto.h and nf_nat_core.h · d2c5c103
      Florian Westphal 提交于
      The l3proto name is gone, its header file is the last trace.
      While at it, also remove nf_nat_core.h, its very small and all users
      include nf_nat.h too.
      
      before:
         text    data     bss     dec     hex filename
        22948    1612    4136   28696    7018 nf_nat.ko
      
      after removal of l3proto register/unregister functions:
         text	   data	    bss	    dec	    hex	filename
        22196	   1516	   4136	  27848	   6cc8 nf_nat.ko
      
      checkpatch complains about overly long lines, but line breaks
      do not make things more readable and the line length gets smaller
      here, not larger.
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      d2c5c103
    • F
      netfilter: nat: merge nf_nat_ipv4,6 into nat core · 3bf195ae
      Florian Westphal 提交于
      before:
         text    data     bss     dec     hex filename
        16566    1576    4136   22278    5706 nf_nat.ko
         3598	    844	      0	   4442	   115a	nf_nat_ipv6.ko
         3187	    844	      0	   4031	    fbf	nf_nat_ipv4.ko
      
      after:
         text    data     bss     dec     hex filename
        22948    1612    4136   28696    7018 nf_nat.ko
      
      ... with ipv4/v6 nat now provided directly via nf_nat.ko.
      
      Also changes:
             ret = nf_nat_ipv4_fn(priv, skb, state);
             if (ret != NF_DROP && ret != NF_STOLEN &&
      into
      	if (ret != NF_ACCEPT)
      		return ret;
      
      everywhere.
      
      The nat hooks never should return anything other than
      ACCEPT or DROP (and the latter only in rare error cases).
      
      The original code uses multi-line ANDing including assignment-in-if:
              if (ret != NF_DROP && ret != NF_STOLEN &&
                 !(IPCB(skb)->flags & IPSKB_XFRM_TRANSFORMED) &&
                  (ct = nf_ct_get(skb, &ctinfo)) != NULL) {
      
      I removed this while moving, breaking those in separate conditionals
      and moving the assignments into extra lines.
      
      checkpatch still generates some warnings:
       1. Overly long lines (of moved code).
          Breaking them is even more ugly. so I kept this as-is.
       2. use of extern function declarations in a .c file.
          This is necessary evil, we must call
          nf_nat_l3proto_register() from the nat core now.
          All l3proto related functions are removed later in this series,
          those prototypes are then removed as well.
      
      v2: keep empty nf_nat_ipv6_csum_update stub for CONFIG_IPV6=n case.
      v3: remove IS_ENABLED(NF_NAT_IPV4/6) tests, NF_NAT_IPVx toggles
          are removed here.
      v4: also get rid of the assignments in conditionals.
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      3bf195ae
    • F
      netfilter: nat: move nlattr parse and xfrm session decode to core · 096d0906
      Florian Westphal 提交于
      None of these functions calls any external functions, moving them allows
      to avoid both the indirection and a need to export these symbols.
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      096d0906