1. 17 Jan, 2020 1 commit
    • X
      alinux: hotfix: Add Cloud Kernel hotfix enhancement · f94e5b1a
      Xunlei Pang committed
      We reserve some fields beforehand in core structures that are prone to
      change, so that a hotfix which has to add extra fields is not blocked,
      thereby increasing the hotfix success rate. With this enhancement we
      can even hot-add features.
      
      The reserved fields normally have no cache impact, since they (usually
      placed at the tail) are not accessed at all.
      
      The following structures are currently covered:
          MM:
          struct zone
          struct pglist_data
          struct mm_struct
          struct vm_area_struct
          struct mem_cgroup
          struct writeback_control
      
          Block:
          struct gendisk
          struct backing_dev_info
          struct bio
          struct queue_limits
          struct request_queue
          struct blkcg
          struct blkcg_policy
          struct blk_mq_hw_ctx
          struct blk_mq_tag_set
          struct blk_mq_queue_data
          struct blk_mq_ops
          struct elevator_mq_ops
          struct inode
          struct dentry
          struct address_space
          struct block_device
          struct hd_struct
          struct bio_set
      
          Network:
          struct sk_buff
          struct sock
          struct net_device_ops
          struct xt_target
          struct dst_entry
          struct dst_ops
          struct fib_rule
      
          Scheduler:
          struct task_struct
          struct cfs_rq
          struct rq
          struct sched_statistics
          struct sched_entity
          struct signal_struct
          struct task_group
          struct cpuacct
      
          cgroup:
          struct cgroup_root
          struct cgroup_subsys_state
          struct cgroup_subsys
          struct css_set
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>
      [ caspar: use SPDX-License-Identifier ]
      Signed-off-by: Caspar Zhang <caspar@linux.alibaba.com>
      f94e5b1a
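The reservation idea in the commit above can be illustrated with a minimal user-space sketch. The structure names (`widget_v1`, `widget_v2`) and field names are hypothetical and are not taken from the patch; the sketch only shows why repurposing a reserved tail slot leaves the size and the offsets of existing fields unchanged.

```c
#include <stddef.h>

/* Hypothetical structure as first shipped: live fields plus reserved tail slots. */
struct widget_v1 {
	int id;
	unsigned long flags;
	unsigned long reserved1;	/* unused, kept for future hotfixes */
	unsigned long reserved2;	/* unused, kept for future hotfixes */
};

/* Hypothetical hotfixed layout: reserved1 is repurposed, nothing else moves. */
struct widget_v2 {
	int id;
	unsigned long flags;
	unsigned long hotfix_counter;	/* takes over the reserved1 slot */
	unsigned long reserved2;
};

/* Compile-time checks: the hotfix changes neither size nor existing offsets. */
_Static_assert(sizeof(struct widget_v1) == sizeof(struct widget_v2),
	       "structure size must not change");
_Static_assert(offsetof(struct widget_v1, flags) ==
	       offsetof(struct widget_v2, flags),
	       "existing fields must not move");
_Static_assert(offsetof(struct widget_v1, reserved1) ==
	       offsetof(struct widget_v2, hotfix_counter),
	       "new field must reuse the reserved slot");

/* Returns 1 when both layouts have the same size. */
int layouts_match(void)
{
	return sizeof(struct widget_v1) == sizeof(struct widget_v2);
}
```

Code built against the old layout keeps working with objects of the new layout because every pre-existing field stays at the same offset.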
  2. 27 Dec, 2019 2 commits
    • G
      alinux: net/tcp: Support tunable tcp timeout value in TIME-WAIT state · 0e7b6b7f
      George Zhang committed
      By default the tcp_tw_timeout value is 60 seconds. The minimum is
      1 second and the maximum is 600 seconds. This setting is useful on
      systems under heavy TCP load.
      
      NOTE: setting tcp_tw_timeout below 60 seconds violates the "quiet time"
      restriction and puts your system at risk of some old data being
      accepted as new, or new data being rejected as old duplicates, by some
      receivers.
      
      Link: http://web.archive.org/web/20150102003320/http://tools.ietf.org/html/rfc793
      Signed-off-by: George Zhang <georgezhang@linux.alibaba.com>
      Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      0e7b6b7f
    • G
      alinux: net: kernel hookers service for toa module · e209e3cb
      George Zhang committed
      LVS fullnat will replace network traffic's source ip with its local ip,
      and thus the backend servers cannot obtain the real client ip.
      
      To solve this, LVS has introduced the tcp option address (TOA) to store
      the essential ip address information in the last tcp ack packet of the
      3-way handshake, and the backend servers need to retrieve it from the
      packet header.
      
      In this patch, we have introduced the sk_toa_data member in the sock
      structure to hold the TOA information. There used to be an in-tree
      module for managing TOA, but it is now maintained as a standalone
      module.
      
      In this case, the toa module should register its hook function(s) using
      the provided interfaces in the hookers module.
      
      TOA in sock structure:
      
      	__be32 sk_toa_data[16];
      
      The hookers module only provides the sk_toa_data placeholder, and the
      toa module can use this variable through the layout it needs.
      
      Hook interfaces:
      
      The hookers module replaces the kernel's syn_recv_sock and getname
      handler with a stub that chains the toa module's hook function(s) to the
      original handling function. The hookers module allows hook functions to
      be installed and uninstalled in any order.
      
      toa module:
      
      The external toa module will be provided in a separate RPM package.
      
      [xuyu@linux.alibaba.com: amend commit log]
      Signed-off-by: George Zhang <georgezhang@linux.alibaba.com>
      Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: Caspar Zhang <caspar@linux.alibaba.com>
      e209e3cb
  3. 21 Dec, 2019 4 commits
    • G
      tcp: Protect accesses to .ts_recent_stamp with {READ,WRITE}_ONCE() · fbcf85b0
      Guillaume Nault committed
      [ Upstream commit 721c8dafad26ccfa90ff659ee19755e3377b829d ]
      
      Syncookies borrow the ->rx_opt.ts_recent_stamp field to store the
      timestamp of the last synflood. Protect them with READ_ONCE() and
      WRITE_ONCE() since reads and writes aren't serialised.
      
      Use of .rx_opt.ts_recent_stamp for storing the synflood timestamp was
      introduced by a0f82f64 ("syncookies: remove last_synq_overflow from
      struct tcp_sock"). But unprotected accesses were already there when
      timestamp was stored in .last_synq_overflow.
      
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Signed-off-by: Guillaume Nault <gnault@redhat.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      fbcf85b0
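The annotation pattern described in the commit above can be sketched in user space as follows. The two macros are simplified stand-ins for the kernel's READ_ONCE()/WRITE_ONCE() (the real ones carry extra type checks) and assume the GCC/Clang `__typeof__` extension. The structure and function names are hypothetical.

```c
#include <stdint.h>

/* Simplified stand-ins for the kernel macros: force exactly one
 * volatile access so the compiler cannot tear, fuse or repeat it. */
#define READ_ONCE(x)		(*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, v)	(*(volatile __typeof__(x) *)&(x) = (v))

/* Hypothetical stand-in for the field that syncookies borrow. */
struct rx_opt_sketch {
	uint32_t ts_recent_stamp;
};

/* Writer side: record the time of the latest synflood. */
void record_synflood(struct rx_opt_sketch *opt, uint32_t now)
{
	WRITE_ONCE(opt->ts_recent_stamp, now);
}

/* Reader side: load the timestamp once, without holding any lock. */
uint32_t last_synflood(struct rx_opt_sketch *opt)
{
	return READ_ONCE(opt->ts_recent_stamp);
}
```

These annotations do not serialise readers and writers; they only make each individual load and store a single access, which is what the commit message relies on.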
    • G
      tcp: tighten acceptance of ACKs not matching a child socket · 4b8a9869
      Guillaume Nault committed
      [ Upstream commit cb44a08f8647fd2e8db5cc9ac27cd8355fa392d8 ]
      
      When no synflood occurs, the synflood timestamp isn't updated.
      Therefore it can be so old that time_after32() can consider it to be
      in the future.
      
      That's a problem for tcp_synq_no_recent_overflow() as it may report
      that a recent overflow occurred while, in fact, it's just that jiffies
      has grown past 'last_overflow' + TCP_SYNCOOKIE_VALID + 2^31.
      
      Spurious detection of recent overflows lead to extra syncookie
      verification in cookie_v[46]_check(). At that point, the verification
      should fail and the packet dropped. But we should have dropped the
      packet earlier as we didn't even send a syncookie.
      
      Let's refine tcp_synq_no_recent_overflow() to report a recent overflow
      only if jiffies is within the
      [last_overflow, last_overflow + TCP_SYNCOOKIE_VALID] interval. This
      way, no spurious recent overflow is reported when jiffies wraps and
      'last_overflow' becomes in the future from the point of view of
      time_after32().
      
      However, if jiffies wraps and enters the
      [last_overflow, last_overflow + TCP_SYNCOOKIE_VALID] interval (with
      'last_overflow' being a stale synflood timestamp), then
      tcp_synq_no_recent_overflow() still erroneously reports an
      overflow. In such cases, we have to rely on syncookie verification
      to drop the packet. We unfortunately have no way to differentiate
      between a fresh and a stale syncookie timestamp.
      
      In practice, using last_overflow as lower bound is problematic.
      If the synflood timestamp is concurrently updated between the time
      we read jiffies and the moment we store the timestamp in
      'last_overflow', then 'now' becomes smaller than 'last_overflow' and
      tcp_synq_no_recent_overflow() returns true, potentially dropping a
      valid syncookie.
      
      Reading jiffies after loading the timestamp could fix the problem,
      but that'd require a memory barrier. Let's just accommodate for
      potential timestamp growth instead and extend the interval using
      'last_overflow - HZ' as lower bound.
      Signed-off-by: Guillaume Nault <gnault@redhat.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      4b8a9869
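The interval test described in the commit above can be approximated with the following user-space sketch. `HZ_SKETCH`, `between32()` and `synq_no_recent_overflow()` are illustrative names, and the constants assume HZ=1000 and a two-minute syncookie validity. This is not the kernel code itself.

```c
#include <stdbool.h>
#include <stdint.h>

#define HZ_SKETCH		1000u			/* assumed tick rate */
#define SYNCOOKIE_VALID_SKETCH	(120u * HZ_SKETCH)	/* assumed: 2 minutes */

/* True when t lies in [l, h]; all arithmetic is modulo 2^32, so the
 * test keeps working when the 32-bit jiffies counter wraps. */
bool between32(uint32_t t, uint32_t l, uint32_t h)
{
	return (uint32_t)(h - l) >= (uint32_t)(t - l);
}

/* Reports "no recent overflow" unless now falls inside
 * [last_overflow - HZ, last_overflow + SYNCOOKIE_VALID]. The lower
 * bound is widened by one second to tolerate a timestamp that was
 * updated concurrently, just after now was sampled. */
bool synq_no_recent_overflow(uint32_t now, uint32_t last_overflow)
{
	return !between32(now, last_overflow - HZ_SKETCH,
			  last_overflow + SYNCOOKIE_VALID_SKETCH);
}
```

With `last_overflow == 0` and `now == 2147484649`, the sketch reports no recent overflow, so the stale timestamp no longer triggers a syncookie check.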
    • G
      tcp: fix rejected syncookies due to stale timestamps · bac9e8f3
      Guillaume Nault committed
      [ Upstream commit 04d26e7b159a396372646a480f4caa166d1b6720 ]
      
      If no synflood happens for a long enough period of time, then the
      synflood timestamp isn't refreshed and jiffies can advance so much
      that time_after32() can't accurately compare them any more.
      
      Therefore, we can end up in a situation where time_after32(now,
      last_overflow + HZ) returns false, just because these two values are
      too far apart. In that case, the synflood timestamp isn't updated as
      it should be, which can trick tcp_synq_no_recent_overflow() into
      rejecting valid syncookies.
      
      For example, let's consider the following scenario on a system
      with HZ=1000:
      
        * The synflood timestamp is 0, either because that's the timestamp
          of the last synflood or, more commonly, because we're working with
          a freshly created socket.
      
        * We receive a new SYN, which triggers synflood protection. Let's say
          that this happens when jiffies == 2147484649 (that is,
          'synflood timestamp' + HZ + 2^31 + 1).
      
        * Then tcp_synq_overflow() doesn't update the synflood timestamp,
          because time_after32(2147484649, 1000) returns false.
          With:
            - 2147484649: the value of jiffies, aka. 'now'.
            - 1000: the value of 'last_overflow' + HZ.
      
        * A bit later, we receive the ACK completing the 3WHS. But
          cookie_v[46]_check() rejects it because tcp_synq_no_recent_overflow()
          says that we're not under synflood. That's because
          time_after32(2147484649, 120000) returns false.
          With:
            - 2147484649: the value of jiffies, aka. 'now'.
            - 120000: the value of 'last_overflow' + TCP_SYNCOOKIE_VALID.
      
          Of course, in reality jiffies would have increased a bit, but this
          condition will last for the next 119 seconds, which is far enough
          to accommodate for jiffie's growth.
      
      Fix this by updating the overflow timestamp whenever jiffies isn't
      within the [last_overflow, last_overflow + HZ] range. That shouldn't
      have any performance impact since the update still happens at most once
      per second.
      
      Now we're guaranteed to have fresh timestamps while under synflood, so
      tcp_synq_no_recent_overflow() can safely use it with time_after32() in
      such situations.
      
      Stale timestamps can still make tcp_synq_no_recent_overflow() return
      the wrong verdict when not under synflood. This will be handled in the
      next patch.
      
      For 64 bits architectures, the problem was introduced with the
      conversion of ->tw_ts_recent_stamp to 32 bits integer by commit
      cca9bab1 ("tcp: use monotonic timestamps for PAWS").
      The problem has always been there on 32 bits architectures.
      
      Fixes: cca9bab1 ("tcp: use monotonic timestamps for PAWS")
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Signed-off-by: Guillaume Nault <gnault@redhat.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      bac9e8f3
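The first comparison in the scenario above, and the range-based update rule that replaces it, can be reproduced with a small user-space sketch. `after32()` and `in_range32()` are re-implementations written for illustration, and the function names and the `hz` parameter are assumptions. The signed conversion relies on the usual two's-complement behaviour of GCC and Clang.

```c
#include <stdbool.h>
#include <stdint.h>

/* True when a is after b, judged on a 32-bit wrapping counter. */
bool after32(uint32_t a, uint32_t b)
{
	return (int32_t)(b - a) < 0;
}

/* True when t lies in [l, h], computed modulo 2^32. */
bool in_range32(uint32_t t, uint32_t l, uint32_t h)
{
	return (uint32_t)(h - l) >= (uint32_t)(t - l);
}

/* Old rule: refresh the timestamp only if now is after last + HZ.
 * This fails once the two values are more than 2^31 ticks apart. */
bool old_should_update(uint32_t now, uint32_t last_overflow, uint32_t hz)
{
	return after32(now, last_overflow + hz);
}

/* New rule: refresh whenever now is outside [last, last + HZ]. */
bool should_update_overflow_stamp(uint32_t now, uint32_t last_overflow,
				  uint32_t hz)
{
	return !in_range32(now, last_overflow, last_overflow + hz);
}
```

For `now == 2147484649`, `last_overflow == 0` and `hz == 1000`, the old rule declines to refresh the timestamp while the range-based rule refreshes it.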
    • E
      inet: protect against too small mtu values. · d80d67cd
      Eric Dumazet committed
      [ Upstream commit 501a90c945103e8627406763dac418f20f3837b2 ]
      
      syzbot was once again able to crash a host by setting a very small mtu
      on loopback device.
      
      Let's make inetdev_valid_mtu() available in include/net/ip.h,
      and use it in ip_setup_cork(), so that we protect both ip_append_page()
      and __ip_append_data()
      
      Also add a READ_ONCE() when the device mtu is read.
      
      Pairs this lockless read with one WRITE_ONCE() in __dev_set_mtu(),
      even if other code paths might write over this field.
      
      Add a big comment in include/linux/netdevice.h about dev->mtu
      needing READ_ONCE()/WRITE_ONCE() annotations.
      
      Hopefully we will add the missing ones in followup patches.
      
      [1]
      
      refcount_t: saturated; leaking memory.
      WARNING: CPU: 0 PID: 9464 at lib/refcount.c:22 refcount_warn_saturate+0x138/0x1f0 lib/refcount.c:22
      Kernel panic - not syncing: panic_on_warn set ...
      CPU: 0 PID: 9464 Comm: syz-executor850 Not tainted 5.4.0-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x197/0x210 lib/dump_stack.c:118
       panic+0x2e3/0x75c kernel/panic.c:221
       __warn.cold+0x2f/0x3e kernel/panic.c:582
       report_bug+0x289/0x300 lib/bug.c:195
       fixup_bug arch/x86/kernel/traps.c:174 [inline]
       fixup_bug arch/x86/kernel/traps.c:169 [inline]
       do_error_trap+0x11b/0x200 arch/x86/kernel/traps.c:267
       do_invalid_op+0x37/0x50 arch/x86/kernel/traps.c:286
       invalid_op+0x23/0x30 arch/x86/entry/entry_64.S:1027
      RIP: 0010:refcount_warn_saturate+0x138/0x1f0 lib/refcount.c:22
      Code: 06 31 ff 89 de e8 c8 f5 e6 fd 84 db 0f 85 6f ff ff ff e8 7b f4 e6 fd 48 c7 c7 e0 71 4f 88 c6 05 56 a6 a4 06 01 e8 c7 a8 b7 fd <0f> 0b e9 50 ff ff ff e8 5c f4 e6 fd 0f b6 1d 3d a6 a4 06 31 ff 89
      RSP: 0018:ffff88809689f550 EFLAGS: 00010286
      RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
      RDX: 0000000000000000 RSI: ffffffff815e4336 RDI: ffffed1012d13e9c
      RBP: ffff88809689f560 R08: ffff88809c50a3c0 R09: fffffbfff15d31b1
      R10: fffffbfff15d31b0 R11: ffffffff8ae98d87 R12: 0000000000000001
      R13: 0000000000040100 R14: ffff888099041104 R15: ffff888218d96e40
       refcount_add include/linux/refcount.h:193 [inline]
       skb_set_owner_w+0x2b6/0x410 net/core/sock.c:1999
       sock_wmalloc+0xf1/0x120 net/core/sock.c:2096
       ip_append_page+0x7ef/0x1190 net/ipv4/ip_output.c:1383
       udp_sendpage+0x1c7/0x480 net/ipv4/udp.c:1276
       inet_sendpage+0xdb/0x150 net/ipv4/af_inet.c:821
       kernel_sendpage+0x92/0xf0 net/socket.c:3794
       sock_sendpage+0x8b/0xc0 net/socket.c:936
       pipe_to_sendpage+0x2da/0x3c0 fs/splice.c:458
       splice_from_pipe_feed fs/splice.c:512 [inline]
       __splice_from_pipe+0x3ee/0x7c0 fs/splice.c:636
       splice_from_pipe+0x108/0x170 fs/splice.c:671
       generic_splice_sendpage+0x3c/0x50 fs/splice.c:842
       do_splice_from fs/splice.c:861 [inline]
       direct_splice_actor+0x123/0x190 fs/splice.c:1035
       splice_direct_to_actor+0x3b4/0xa30 fs/splice.c:990
       do_splice_direct+0x1da/0x2a0 fs/splice.c:1078
       do_sendfile+0x597/0xd00 fs/read_write.c:1464
       __do_sys_sendfile64 fs/read_write.c:1525 [inline]
       __se_sys_sendfile64 fs/read_write.c:1511 [inline]
       __x64_sys_sendfile64+0x1dd/0x220 fs/read_write.c:1511
       do_syscall_64+0xfa/0x790 arch/x86/entry/common.c:294
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      RIP: 0033:0x441409
      Code: e8 ac e8 ff ff 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 eb 08 fc ff c3 66 2e 0f 1f 84 00 00 00 00
      RSP: 002b:00007fffb64c4f78 EFLAGS: 00000246 ORIG_RAX: 0000000000000028
      RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 0000000000441409
      RDX: 0000000000000000 RSI: 0000000000000006 RDI: 0000000000000005
      RBP: 0000000000073b8a R08: 0000000000000010 R09: 0000000000000010
      R10: 0000000000010001 R11: 0000000000000246 R12: 0000000000402180
      R13: 0000000000402210 R14: 0000000000000000 R15: 0000000000000000
      Kernel Offset: disabled
      Rebooting in 86400 seconds..
      
      Fixes: 1470ddf7 ("inet: Remove explicit write references to sk/inet in ip_append_data")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      d80d67cd
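A rough user-space sketch of the guard described in the commit above is shown below. The function names are hypothetical, the 68-byte minimum is the IPv4 minimum MTU from RFC 791, and the choice of `-ENETUNREACH` as the error code is an assumption made for illustration.

```c
#include <errno.h>
#include <stdbool.h>

#define IPV4_MIN_MTU_SKETCH 68u	/* minimum IPv4 MTU per RFC 791 */

/* Rejects MTU values too small to carry an IPv4 datagram. */
bool valid_mtu_sketch(unsigned int mtu)
{
	return mtu >= IPV4_MIN_MTU_SKETCH;
}

/* Hypothetical stand-in for the cork setup step: validate the device
 * MTU once, before any append path uses it as the fragment size. */
int setup_cork_sketch(unsigned int dev_mtu, unsigned int *fragsize)
{
	if (!valid_mtu_sketch(dev_mtu))
		return -ENETUNREACH;	/* assumed error code */

	*fragsize = dev_mtu;
	return 0;
}
```

Doing the check in the shared setup step is what lets a single test cover both append paths named in the commit message.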
  4. 13 Dec, 2019 3 commits
  5. 05 Dec, 2019 3 commits
    • X
      sctp: cache netns in sctp_ep_common · 1bef71f5
      Xin Long committed
      [ Upstream commit 312434617cb16be5166316cf9d08ba760b1042a1 ]
      
      This patch is to fix a data-race reported by syzbot:
      
        BUG: KCSAN: data-race in sctp_assoc_migrate / sctp_hash_obj
      
        write to 0xffff8880b67c0020 of 8 bytes by task 18908 on cpu 1:
          sctp_assoc_migrate+0x1a6/0x290 net/sctp/associola.c:1091
          sctp_sock_migrate+0x8aa/0x9b0 net/sctp/socket.c:9465
          sctp_accept+0x3c8/0x470 net/sctp/socket.c:4916
          inet_accept+0x7f/0x360 net/ipv4/af_inet.c:734
          __sys_accept4+0x224/0x430 net/socket.c:1754
          __do_sys_accept net/socket.c:1795 [inline]
          __se_sys_accept net/socket.c:1792 [inline]
          __x64_sys_accept+0x4e/0x60 net/socket.c:1792
          do_syscall_64+0xcc/0x370 arch/x86/entry/common.c:290
          entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        read to 0xffff8880b67c0020 of 8 bytes by task 12003 on cpu 0:
          sctp_hash_obj+0x4f/0x2d0 net/sctp/input.c:894
          rht_key_get_hash include/linux/rhashtable.h:133 [inline]
          rht_key_hashfn include/linux/rhashtable.h:159 [inline]
          rht_head_hashfn include/linux/rhashtable.h:174 [inline]
          head_hashfn lib/rhashtable.c:41 [inline]
          rhashtable_rehash_one lib/rhashtable.c:245 [inline]
          rhashtable_rehash_chain lib/rhashtable.c:276 [inline]
          rhashtable_rehash_table lib/rhashtable.c:316 [inline]
          rht_deferred_worker+0x468/0xab0 lib/rhashtable.c:420
          process_one_work+0x3d4/0x890 kernel/workqueue.c:2269
          worker_thread+0xa0/0x800 kernel/workqueue.c:2415
          kthread+0x1d4/0x200 drivers/block/aoe/aoecmd.c:1253
          ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:352
      
      It was caused by rhashtable access asoc->base.sk when sctp_assoc_migrate
      is changing its value. However, what rhashtable wants is netns from asoc
      base.sk, and for an asoc, its netns won't change once set. So we can
      simply fix it by caching netns since created.
      
      Fixes: d6c0256a ("sctp: add the rhashtable apis for sctp global transport hashtable")
      Reported-by: syzbot+e3b35fe7918ff0ee474e@syzkaller.appspotmail.com
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      1bef71f5
    • E
      net: fix possible overflow in __sk_mem_raise_allocated() · d1dde413
      Eric Dumazet committed
      [ Upstream commit 5bf325a53202b8728cf7013b72688c46071e212e ]
      
      With many active TCP sockets, fat TCP sockets could fool
      __sk_mem_raise_allocated() thanks to an overflow.
      
      They would increase their share of the memory, instead
      of decreasing it.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      d1dde413
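The kind of overflow described in the commit above can be illustrated with a short user-space sketch. The function names and the numbers are hypothetical. The 32-bit variant uses unsigned arithmetic so that the wrap-around is well defined in C, which is an assumption of this sketch rather than a description of the original kernel types.

```c
#include <stdbool.h>
#include <stdint.h>

/* Narrow comparison: the product wraps modulo 2^32, so a very large
 * memory footprint can look small and pass the limit check. */
bool under_limit_32(uint32_t limit, uint32_t sockets, uint32_t pages_per_socket)
{
	uint32_t product = sockets * pages_per_socket;

	return limit > product;
}

/* Widened comparison: the product is computed in 64 bits and cannot
 * wrap for any pair of 32-bit inputs. */
bool under_limit_64(uint64_t limit, uint32_t sockets, uint32_t pages_per_socket)
{
	uint64_t product = (uint64_t)sockets * pages_per_socket;

	return limit > product;
}
```

With 100000 sockets of 50000 pages each and a limit of 1000000000 pages, the true product is 5000000000. The narrow version wraps to 705032704 and wrongly reports the socket as under the limit, while the widened version reports it as over the limit.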
    • T
      net/fq_impl: Switch to kvmalloc() for memory allocation · b707e0da
      Toke Høiland-Jørgensen committed
      [ Upstream commit 71e67c3bd127cfe7863f54e4b087eba1cc8f9a7a ]
      
      The FQ implementation used by mac80211 allocates memory using kmalloc(),
      which can fail; and Johannes reported that this actually happens in
      practice.
      
      To avoid this, switch the allocation to kvmalloc() instead; this also
      brings fq_impl in line with all the FQ qdiscs.
      
      Fixes: 557fc4a0 ("fq: add fair queuing framework")
      Reported-by: Johannes Berg <johannes@sipsolutions.net>
      Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Link: https://lore.kernel.org/r/20191105155750.547379-1-toke@redhat.com
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      b707e0da
  6. 21 Nov, 2019 1 commit
  7. 13 Nov, 2019 5 commits
    • E
      net: prevent load/store tearing on sk->sk_stamp · 99ea48af
      Eric Dumazet committed
      [ Upstream commit f75359f3ac855940c5718af10ba089b8977bf339 ]
      
      Add a couple of READ_ONCE() and WRITE_ONCE() to prevent
      load-tearing and store-tearing in sock_read_timestamp()
      and sock_write_timestamp()
      
      This might prevent another KCSAN report.
      
      Fixes: 3a0ed3e96197 ("sock: Make sock->sk_stamp thread-safe")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Deepa Dinamani <deepa.kernel@gmail.com>
      Acked-by: Deepa Dinamani <deepa.kernel@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      99ea48af
    • E
      ipvs: move old_secure_tcp into struct netns_ipvs · 50e31318
      Eric Dumazet committed
      [ Upstream commit c24b75e0f9239e78105f81c5f03a751641eb07ef ]
      
      syzbot reported the following issue :
      
      BUG: KCSAN: data-race in update_defense_level / update_defense_level
      
      read to 0xffffffff861a6260 of 4 bytes by task 3006 on cpu 1:
       update_defense_level+0x621/0xb30 net/netfilter/ipvs/ip_vs_ctl.c:177
       defense_work_handler+0x3d/0xd0 net/netfilter/ipvs/ip_vs_ctl.c:225
       process_one_work+0x3d4/0x890 kernel/workqueue.c:2269
       worker_thread+0xa0/0x800 kernel/workqueue.c:2415
       kthread+0x1d4/0x200 drivers/block/aoe/aoecmd.c:1253
       ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:352
      
      write to 0xffffffff861a6260 of 4 bytes by task 7333 on cpu 0:
       update_defense_level+0xa62/0xb30 net/netfilter/ipvs/ip_vs_ctl.c:205
       defense_work_handler+0x3d/0xd0 net/netfilter/ipvs/ip_vs_ctl.c:225
       process_one_work+0x3d4/0x890 kernel/workqueue.c:2269
       worker_thread+0xa0/0x800 kernel/workqueue.c:2415
       kthread+0x1d4/0x200 drivers/block/aoe/aoecmd.c:1253
       ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:352
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 0 PID: 7333 Comm: kworker/0:5 Not tainted 5.4.0-rc3+ #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Workqueue: events defense_work_handler
      
      Indeed, old_secure_tcp is currently a static variable, while it
      needs to be a per netns variable.
      
      Fixes: a0840e2e ("IPVS: netns, ip_vs_ctl local vars moved to ipvs struct.")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: Simon Horman <horms@verge.net.au>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      50e31318
    • L
      netfilter: nf_tables: Align nft_expr private data to 64-bit · 1b0e60f6
      Lukas Wunner committed
      commit 250367c59e6ba0d79d702a059712d66edacd4a1a upstream.
      
      Invoking the following commands on a 32-bit architecture with strict
      alignment requirements (such as an ARMv7-based Raspberry Pi) results
      in an alignment exception:
      
       # nft add table ip test-ip4
       # nft add chain ip test-ip4 output { type filter hook output priority 0; }
       # nft add rule  ip test-ip4 output quota 1025 bytes
      
      Alignment trap: not handling instruction e1b26f9f at [<7f4473f8>]
      Unhandled fault: alignment exception (0x001) at 0xb832e824
      Internal error: : 1 [#1] PREEMPT SMP ARM
      Hardware name: BCM2835
      [<7f4473fc>] (nft_quota_do_init [nft_quota])
      [<7f447448>] (nft_quota_init [nft_quota])
      [<7f4260d0>] (nf_tables_newrule [nf_tables])
      [<7f4168dc>] (nfnetlink_rcv_batch [nfnetlink])
      [<7f416bd0>] (nfnetlink_rcv [nfnetlink])
      [<8078b334>] (netlink_unicast)
      [<8078b664>] (netlink_sendmsg)
      [<8071b47c>] (sock_sendmsg)
      [<8071bd18>] (___sys_sendmsg)
      [<8071ce3c>] (__sys_sendmsg)
      [<8071ce94>] (sys_sendmsg)
      
      The reason is that nft_quota_do_init() calls atomic64_set() on an
      atomic64_t which is only aligned to 32-bit, not 64-bit, because it
      succeeds struct nft_expr in memory which only contains a 32-bit pointer.
      Fix by aligning the nft_expr private data to 64-bit.
      
      Fixes: 96518518 ("netfilter: add nftables")
      Signed-off-by: Lukas Wunner <lukas@wunner.de>
      Cc: stable@vger.kernel.org # v3.13+
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      1b0e60f6
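The alignment issue described in the commit above can be sketched in user space. The structure names are hypothetical, and a `uint32_t` member stands in for the 32-bit pointer mentioned in the commit message. The `aligned` attribute is a GCC/Clang extension.

```c
#include <stddef.h>
#include <stdint.h>

/* Header with a single 32-bit member: the trailing private data
 * starts at offset 4, which is not suitable for a 64-bit atomic. */
struct expr_unaligned {
	uint32_t ops;		/* stands in for a 32-bit pointer */
	unsigned char data[];
};

/* Same header, but the private data is forced to 64-bit alignment. */
struct expr_aligned {
	uint32_t ops;		/* stands in for a 32-bit pointer */
	unsigned char data[]
		__attribute__((aligned(__alignof__(uint64_t))));
};

/* Offset of the private data without the alignment attribute. */
size_t unaligned_data_offset(void)
{
	return offsetof(struct expr_unaligned, data);
}

/* Offset of the private data with the alignment attribute. */
size_t aligned_data_offset(void)
{
	return offsetof(struct expr_aligned, data);
}
```

The attribute makes the compiler insert padding after the header, so a 64-bit value stored at the start of `data` is naturally aligned as long as the enclosing allocation is.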
    • E
      net: fix data-race in neigh_event_send() · b9bda52f
      Eric Dumazet committed
      [ Upstream commit 1b53d64435d56902fc234ff2507142d971a09687 ]
      
      KCSAN reported the following data-race [1]
      
      The fix will also prevent the compiler from optimizing out
      the condition.
      
      [1]
      
      BUG: KCSAN: data-race in neigh_resolve_output / neigh_resolve_output
      
      write to 0xffff8880a41dba78 of 8 bytes by interrupt on cpu 1:
       neigh_event_send include/net/neighbour.h:443 [inline]
       neigh_resolve_output+0x78/0x480 net/core/neighbour.c:1474
       neigh_output include/net/neighbour.h:511 [inline]
       ip_finish_output2+0x4af/0xe40 net/ipv4/ip_output.c:228
       __ip_finish_output net/ipv4/ip_output.c:308 [inline]
       __ip_finish_output+0x23a/0x490 net/ipv4/ip_output.c:290
       ip_finish_output+0x41/0x160 net/ipv4/ip_output.c:318
       NF_HOOK_COND include/linux/netfilter.h:294 [inline]
       ip_output+0xdf/0x210 net/ipv4/ip_output.c:432
       dst_output include/net/dst.h:436 [inline]
       ip_local_out+0x74/0x90 net/ipv4/ip_output.c:125
       __ip_queue_xmit+0x3a8/0xa40 net/ipv4/ip_output.c:532
       ip_queue_xmit+0x45/0x60 include/net/ip.h:237
       __tcp_transmit_skb+0xe81/0x1d60 net/ipv4/tcp_output.c:1169
       tcp_transmit_skb net/ipv4/tcp_output.c:1185 [inline]
       __tcp_retransmit_skb+0x4bd/0x15f0 net/ipv4/tcp_output.c:2976
       tcp_retransmit_skb+0x36/0x1a0 net/ipv4/tcp_output.c:2999
       tcp_retransmit_timer+0x719/0x16d0 net/ipv4/tcp_timer.c:515
       tcp_write_timer_handler+0x42d/0x510 net/ipv4/tcp_timer.c:598
       tcp_write_timer+0xd1/0xf0 net/ipv4/tcp_timer.c:618
      
      read to 0xffff8880a41dba78 of 8 bytes by interrupt on cpu 0:
       neigh_event_send include/net/neighbour.h:442 [inline]
       neigh_resolve_output+0x57/0x480 net/core/neighbour.c:1474
       neigh_output include/net/neighbour.h:511 [inline]
       ip_finish_output2+0x4af/0xe40 net/ipv4/ip_output.c:228
       __ip_finish_output net/ipv4/ip_output.c:308 [inline]
       __ip_finish_output+0x23a/0x490 net/ipv4/ip_output.c:290
       ip_finish_output+0x41/0x160 net/ipv4/ip_output.c:318
       NF_HOOK_COND include/linux/netfilter.h:294 [inline]
       ip_output+0xdf/0x210 net/ipv4/ip_output.c:432
       dst_output include/net/dst.h:436 [inline]
       ip_local_out+0x74/0x90 net/ipv4/ip_output.c:125
       __ip_queue_xmit+0x3a8/0xa40 net/ipv4/ip_output.c:532
       ip_queue_xmit+0x45/0x60 include/net/ip.h:237
       __tcp_transmit_skb+0xe81/0x1d60 net/ipv4/tcp_output.c:1169
       tcp_transmit_skb net/ipv4/tcp_output.c:1185 [inline]
       __tcp_retransmit_skb+0x4bd/0x15f0 net/ipv4/tcp_output.c:2976
       tcp_retransmit_skb+0x36/0x1a0 net/ipv4/tcp_output.c:2999
       tcp_retransmit_timer+0x719/0x16d0 net/ipv4/tcp_timer.c:515
       tcp_write_timer_handler+0x42d/0x510 net/ipv4/tcp_timer.c:598
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.4.0-rc3+ #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      b9bda52f
    • J
      bonding: fix state transition issue in link monitoring · 27b5f4bf
      Jay Vosburgh committed
      [ Upstream commit 1899bb325149e481de31a4f32b59ea6f24e176ea ]
      
      Since de77ecd4 ("bonding: improve link-status update in
      mii-monitoring"), the bonding driver has utilized two separate variables
      to indicate the next link state a particular slave should transition to.
      Each is used to communicate to a different portion of the link state
      change commit logic; one to the bond_miimon_commit function itself, and
      another to the state transition logic.
      
      	Unfortunately, the two variables can become unsynchronized,
      resulting in incorrect link state transitions within bonding.  This can
      cause slaves to become stuck in an incorrect link state until a
      subsequent carrier state transition.
      
      	The issue occurs when a special case in bond_slave_netdev_event
      sets slave->link directly to BOND_LINK_FAIL.  On the next pass through
      bond_miimon_inspect after the slave goes carrier up, the BOND_LINK_FAIL
      case will set the proposed next state (link_new_state) to BOND_LINK_UP,
      but the new_link to BOND_LINK_DOWN.  The setting of the final link state
      from new_link comes after that from link_new_state, and so the slave
      will end up incorrectly in _DOWN state.
      
      	Resolve this by combining the two variables into one.
      Reported-by: Aleksei Zakharov <zakharov.a.g@yandex.ru>
      Reported-by: Sha Zhang <zhangsha.zhang@huawei.com>
      Cc: Mahesh Bandewar <maheshb@google.com>
      Fixes: de77ecd4 ("bonding: improve link-status update in mii-monitoring")
      Signed-off-by: Jay Vosburgh <jay.vosburgh@canonical.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      27b5f4bf
  8. 10 Nov, 2019 5 commits
    • E
      net/flow_dissector: switch to siphash · 558d2bda
      Eric Dumazet committed
      [ Upstream commit 55667441c84fa5e0911a0aac44fb059c15ba6da2 ]
      
      UDP IPv6 packets auto flowlabels are using a 32bit secret
      (static u32 hashrnd in net/core/flow_dissector.c) and
      apply jhash() over fields known by the receivers.
      
      Attackers can easily infer the 32bit secret and use this information
      to identify a device and/or user, since this 32bit secret is only
      set at boot time.
      
      Really, using jhash() to generate cookies sent on the wire
      is a serious security concern.
      
      Trying to change the rol32(hash, 16) in ip6_make_flowlabel() would be
      a dead end. Trying to periodically change the secret (like in sch_sfq.c)
      could change paths taken in the network for long lived flows.
      
      Let's switch to siphash, as we did in commit df453700e8d8
      ("inet: switch IP ID generator to siphash")
      
      Using a cryptographically strong pseudo random function will solve this
      privacy issue and more generally remove other weak points in the stack.
      
      Packet schedulers using skb_get_hash_perturb() benefit from this change.
      
      Fixes: b5677416 ("ipv6: Enable auto flow labels by default")
      Fixes: 42240901 ("ipv6: Implement different admin modes for automatic flow labels")
      Fixes: 67800f9b ("ipv6: Call skb_get_hash_flowi6 to get skb->hash in ip6_make_flowlabel")
      Fixes: cb1ce2ef ("ipv6: Implement automatic flow label generation on transmit")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Jonathan Berger <jonathann1@walla.com>
      Reported-by: Amit Klein <aksecurity@gmail.com>
      Reported-by: Benny Pinkas <benny@pinkas.net>
      Cc: Tom Herbert <tom@herbertland.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      558d2bda
    • G
      netns: fix GFP flags in rtnl_net_notifyid() · 40400fdd
      Guillaume Nault committed
      [ Upstream commit d4e4fdf9e4a27c87edb79b1478955075be141f67 ]
      
      In rtnl_net_notifyid(), we certainly can't pass a null GFP flag to
      rtnl_notify(). A GFP_KERNEL flag would be fine in most circumstances,
      but there are a few paths calling rtnl_net_notifyid() from atomic
      context or from RCU critical sections. The latter also precludes the use
      of gfp_any() as it wouldn't detect the RCU case. Also, the nlmsg_new()
      call is wrong too, as it uses GFP_KERNEL unconditionally.
      
      Therefore, we need to pass the GFP flags as a parameter and propagate
      them through function calls until the proper flags can be determined.
      
      In most cases, GFP_KERNEL is fine. The exceptions are:
        * openvswitch: ovs_vport_cmd_get() and ovs_vport_cmd_dump()
          indirectly call rtnl_net_notifyid() from RCU critical section,
      
        * rtnetlink: rtmsg_ifinfo_build_skb() already receives GFP flags as
          parameter.
      
      Also, in ovs_vport_cmd_build_info(), let's change the GFP flags used
      by nlmsg_new(). The function is allowed to sleep, so better make the
      flags consistent with the ones used in the following
      ovs_vport_cmd_fill_info() call.
      
      Found by code inspection.
      
      Fixes: 9a963454 ("netns: notify netns id events")
      Signed-off-by: Guillaume Nault <gnault@redhat.com>
      Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Acked-by: Pravin B Shelar <pshelar@ovn.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      40400fdd
    • T
      net: fix sk_page_frag() recursion from memory reclaim · 1d5cb12a
      Tejun Heo committed
      [ Upstream commit 20eb4f29b60286e0d6dc01d9c260b4bd383c58fb ]
      
      sk_page_frag() optimizes skb_frag allocations by using per-task
      skb_frag cache when it knows it's the only user.  The condition is
      determined by seeing whether the socket allocation mask allows
      blocking - if the allocation may block, it obviously owns the task's
      context and ergo exclusively owns current->task_frag.
      
      Unfortunately, this misses recursion through memory reclaim path.
      Please take a look at the following backtrace.
      
       [2] RIP: 0010:tcp_sendmsg_locked+0xccf/0xe10
           ...
           tcp_sendmsg+0x27/0x40
           sock_sendmsg+0x30/0x40
           sock_xmit.isra.24+0xa1/0x170 [nbd]
           nbd_send_cmd+0x1d2/0x690 [nbd]
           nbd_queue_rq+0x1b5/0x3b0 [nbd]
           __blk_mq_try_issue_directly+0x108/0x1b0
           blk_mq_request_issue_directly+0xbd/0xe0
           blk_mq_try_issue_list_directly+0x41/0xb0
           blk_mq_sched_insert_requests+0xa2/0xe0
           blk_mq_flush_plug_list+0x205/0x2a0
           blk_flush_plug_list+0xc3/0xf0
       [1] blk_finish_plug+0x21/0x2e
           _xfs_buf_ioapply+0x313/0x460
           __xfs_buf_submit+0x67/0x220
           xfs_buf_read_map+0x113/0x1a0
           xfs_trans_read_buf_map+0xbf/0x330
           xfs_btree_read_buf_block.constprop.42+0x95/0xd0
           xfs_btree_lookup_get_block+0x95/0x170
           xfs_btree_lookup+0xcc/0x470
           xfs_bmap_del_extent_real+0x254/0x9a0
           __xfs_bunmapi+0x45c/0xab0
           xfs_bunmapi+0x15/0x30
           xfs_itruncate_extents_flags+0xca/0x250
           xfs_free_eofblocks+0x181/0x1e0
           xfs_fs_destroy_inode+0xa8/0x1b0
           destroy_inode+0x38/0x70
           dispose_list+0x35/0x50
           prune_icache_sb+0x52/0x70
           super_cache_scan+0x120/0x1a0
           do_shrink_slab+0x120/0x290
           shrink_slab+0x216/0x2b0
           shrink_node+0x1b6/0x4a0
           do_try_to_free_pages+0xc6/0x370
           try_to_free_mem_cgroup_pages+0xe3/0x1e0
           try_charge+0x29e/0x790
           mem_cgroup_charge_skmem+0x6a/0x100
           __sk_mem_raise_allocated+0x18e/0x390
           __sk_mem_schedule+0x2a/0x40
       [0] tcp_sendmsg_locked+0x8eb/0xe10
           tcp_sendmsg+0x27/0x40
           sock_sendmsg+0x30/0x40
           ___sys_sendmsg+0x26d/0x2b0
           __sys_sendmsg+0x57/0xa0
           do_syscall_64+0x42/0x100
           entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      In [0], tcp_sendmsg_locked() was using current->task_frag when it
      called sk_wmem_schedule().  It had already calculated how many bytes
      fit into current->task_frag.  Due to memory pressure,
      sk_wmem_schedule() called into the memory reclaim path, which called
      into xfs and then the IO issue path.  Because the filesystem in question
      is backed by nbd, control goes back into the tcp layer - back into
      tcp_sendmsg_locked().
      
      nbd sets sk_allocation to (GFP_NOIO | __GFP_MEMALLOC) which makes
      sense - it's in the process of freeing memory and wants to be able to,
      e.g., drop clean pages to make forward progress.  However, this
      confused sk_page_frag() called from [2].  Because it only tests
      whether the allocation allows blocking, which it does, it now thinks
      current->task_frag can be used again although it was already being
      used in [0].
      
      After [2] used current->task_frag, the offset would be increased by
      the used amount.  When control returns to [0],
      current->task_frag's offset has increased and the previously calculated
      number of bytes may now overrun the end of the allocated memory, leading
      to silent memory corruption.
      
      Fix it by adding gfpflags_normal_context() which tests sleepable &&
      !reclaim and use it to determine whether to use current->task_frag.
      
      v2: Eric didn't like gfp flags being tested twice.  Introduce a new
          helper gfpflags_normal_context() and combine the two tests.
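
The "sleepable && !reclaim" test can be sketched in userspace as a single masked comparison. This is only an illustration: the `SKETCH_` flag values and the `sketch_` names are hypothetical stand-ins, not the kernel's __GFP_* constants or its sk_page_frag() code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative bit values only; the kernel's real __GFP_* constants differ. */
#define SKETCH_GFP_DIRECT_RECLAIM 0x1u /* caller may block to reclaim memory */
#define SKETCH_GFP_MEMALLOC       0x2u /* caller is itself part of reclaim */

/* True only when the caller may sleep AND is not running inside reclaim. */
static bool sketch_normal_context(unsigned int gfp)
{
    return (gfp & (SKETCH_GFP_DIRECT_RECLAIM | SKETCH_GFP_MEMALLOC)) ==
           SKETCH_GFP_DIRECT_RECLAIM;
}

struct sketch_frag {
    int offset;
};

/*
 * Use the per-task fragment only in normal context; otherwise fall back to
 * the per-socket fragment so a nested sender cannot clobber the task's one.
 */
static struct sketch_frag *sketch_page_frag(unsigned int sk_allocation,
                                            struct sketch_frag *task_frag,
                                            struct sketch_frag *sk_frag)
{
    assert(task_frag != NULL && sk_frag != NULL);
    return sketch_normal_context(sk_allocation) ? task_frag : sk_frag;
}
```

With a mask like nbd's (blocking allowed, memalloc set), the sketch selects the per-socket fragment, so the nested call in [2] would leave the fragment used by [0] untouched.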
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      1d5cb12a
    • E
      net: annotate lockless accesses to sk->sk_napi_id · 2c50a36d
      Eric Dumazet committed
      [ Upstream commit ee8d153d46a3b98c064ee15c0c0a3bbf1450e5a1 ]
      
      We already annotated most accesses to sk->sk_napi_id.
      
      We missed sk_mark_napi_id() and sk_mark_napi_id_once(),
      which might be called without the socket lock held in the UDP stack.
      
      KCSAN reported :
      BUG: KCSAN: data-race in udpv6_queue_rcv_one_skb / udpv6_queue_rcv_one_skb
      
      write to 0xffff888121c6d108 of 4 bytes by interrupt on cpu 0:
       sk_mark_napi_id include/net/busy_poll.h:125 [inline]
       __udpv6_queue_rcv_skb net/ipv6/udp.c:571 [inline]
       udpv6_queue_rcv_one_skb+0x70c/0xb40 net/ipv6/udp.c:672
       udpv6_queue_rcv_skb+0xb5/0x400 net/ipv6/udp.c:689
       udp6_unicast_rcv_skb.isra.0+0xd7/0x180 net/ipv6/udp.c:832
       __udp6_lib_rcv+0x69c/0x1770 net/ipv6/udp.c:913
       udpv6_rcv+0x2b/0x40 net/ipv6/udp.c:1015
       ip6_protocol_deliver_rcu+0x22a/0xbe0 net/ipv6/ip6_input.c:409
       ip6_input_finish+0x30/0x50 net/ipv6/ip6_input.c:450
       NF_HOOK include/linux/netfilter.h:305 [inline]
       NF_HOOK include/linux/netfilter.h:299 [inline]
       ip6_input+0x177/0x190 net/ipv6/ip6_input.c:459
       dst_input include/net/dst.h:442 [inline]
       ip6_rcv_finish+0x110/0x140 net/ipv6/ip6_input.c:76
       NF_HOOK include/linux/netfilter.h:305 [inline]
       NF_HOOK include/linux/netfilter.h:299 [inline]
       ipv6_rcv+0x1a1/0x1b0 net/ipv6/ip6_input.c:284
       __netif_receive_skb_one_core+0xa7/0xe0 net/core/dev.c:5010
       __netif_receive_skb+0x37/0xf0 net/core/dev.c:5124
       process_backlog+0x1d3/0x420 net/core/dev.c:5955
       napi_poll net/core/dev.c:6392 [inline]
       net_rx_action+0x3ae/0xa90 net/core/dev.c:6460
      
      write to 0xffff888121c6d108 of 4 bytes by interrupt on cpu 1:
       sk_mark_napi_id include/net/busy_poll.h:125 [inline]
       __udpv6_queue_rcv_skb net/ipv6/udp.c:571 [inline]
       udpv6_queue_rcv_one_skb+0x70c/0xb40 net/ipv6/udp.c:672
       udpv6_queue_rcv_skb+0xb5/0x400 net/ipv6/udp.c:689
       udp6_unicast_rcv_skb.isra.0+0xd7/0x180 net/ipv6/udp.c:832
       __udp6_lib_rcv+0x69c/0x1770 net/ipv6/udp.c:913
       udpv6_rcv+0x2b/0x40 net/ipv6/udp.c:1015
       ip6_protocol_deliver_rcu+0x22a/0xbe0 net/ipv6/ip6_input.c:409
       ip6_input_finish+0x30/0x50 net/ipv6/ip6_input.c:450
       NF_HOOK include/linux/netfilter.h:305 [inline]
       NF_HOOK include/linux/netfilter.h:299 [inline]
       ip6_input+0x177/0x190 net/ipv6/ip6_input.c:459
       dst_input include/net/dst.h:442 [inline]
       ip6_rcv_finish+0x110/0x140 net/ipv6/ip6_input.c:76
       NF_HOOK include/linux/netfilter.h:305 [inline]
       NF_HOOK include/linux/netfilter.h:299 [inline]
       ipv6_rcv+0x1a1/0x1b0 net/ipv6/ip6_input.c:284
       __netif_receive_skb_one_core+0xa7/0xe0 net/core/dev.c:5010
       __netif_receive_skb+0x37/0xf0 net/core/dev.c:5124
       process_backlog+0x1d3/0x420 net/core/dev.c:5955
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 1 PID: 10890 Comm: syz-executor.0 Not tainted 5.4.0-rc3+ #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      
      Fixes: e68b6e50 ("udp: enable busy polling for all sockets")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      2c50a36d
    • E
      net: annotate accesses to sk->sk_incoming_cpu · 0cfaf03c
      Eric Dumazet committed
      [ Upstream commit 7170a977743b72cf3eb46ef6ef89885dc7ad3621 ]
      
      This socket field can be read and written by concurrent cpus.
      
      Use READ_ONCE() and WRITE_ONCE() annotations to document this,
      and avoid some compiler 'optimizations'.
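
A rough userspace sketch of the annotation pattern follows. The `SKETCH_` macros are simplified stand-ins built on a volatile cast (they rely on the GCC/Clang `__typeof__` extension) and are not the kernel's definitions; `struct sketch_sock` and the update helper are likewise hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Simplified stand-ins: the volatile access makes the compiler emit exactly
 * one load or store and forbids it from caching, refetching or tearing it.
 */
#define SKETCH_READ_ONCE(x)       (*(const volatile __typeof__(x) *)&(x))
#define SKETCH_WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))

struct sketch_sock {
    int incoming_cpu; /* may be read and written by several CPUs at once */
};

/* Record the CPU that handled the packet, storing only when it changed. */
static void sketch_incoming_cpu_update(struct sketch_sock *sk, int cpu)
{
    assert(sk != NULL);
    if (SKETCH_READ_ONCE(sk->incoming_cpu) != cpu)
        SKETCH_WRITE_ONCE(sk->incoming_cpu, cpu);
}
```

The annotations do not add locking or ordering; they only mark the access as intentionally racy and keep the compiler from transforming it.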
      
      KCSAN reported :
      
      BUG: KCSAN: data-race in tcp_v4_rcv / tcp_v4_rcv
      
      write to 0xffff88812220763c of 4 bytes by interrupt on cpu 0:
       sk_incoming_cpu_update include/net/sock.h:953 [inline]
       tcp_v4_rcv+0x1b3c/0x1bb0 net/ipv4/tcp_ipv4.c:1934
       ip_protocol_deliver_rcu+0x4d/0x420 net/ipv4/ip_input.c:204
       ip_local_deliver_finish+0x110/0x140 net/ipv4/ip_input.c:231
       NF_HOOK include/linux/netfilter.h:305 [inline]
       NF_HOOK include/linux/netfilter.h:299 [inline]
       ip_local_deliver+0x133/0x210 net/ipv4/ip_input.c:252
       dst_input include/net/dst.h:442 [inline]
       ip_rcv_finish+0x121/0x160 net/ipv4/ip_input.c:413
       NF_HOOK include/linux/netfilter.h:305 [inline]
       NF_HOOK include/linux/netfilter.h:299 [inline]
       ip_rcv+0x18f/0x1a0 net/ipv4/ip_input.c:523
       __netif_receive_skb_one_core+0xa7/0xe0 net/core/dev.c:5010
       __netif_receive_skb+0x37/0xf0 net/core/dev.c:5124
       process_backlog+0x1d3/0x420 net/core/dev.c:5955
       napi_poll net/core/dev.c:6392 [inline]
       net_rx_action+0x3ae/0xa90 net/core/dev.c:6460
       __do_softirq+0x115/0x33f kernel/softirq.c:292
       do_softirq_own_stack+0x2a/0x40 arch/x86/entry/entry_64.S:1082
       do_softirq.part.0+0x6b/0x80 kernel/softirq.c:337
       do_softirq kernel/softirq.c:329 [inline]
       __local_bh_enable_ip+0x76/0x80 kernel/softirq.c:189
      
      read to 0xffff88812220763c of 4 bytes by interrupt on cpu 1:
       sk_incoming_cpu_update include/net/sock.h:952 [inline]
       tcp_v4_rcv+0x181a/0x1bb0 net/ipv4/tcp_ipv4.c:1934
       ip_protocol_deliver_rcu+0x4d/0x420 net/ipv4/ip_input.c:204
       ip_local_deliver_finish+0x110/0x140 net/ipv4/ip_input.c:231
       NF_HOOK include/linux/netfilter.h:305 [inline]
       NF_HOOK include/linux/netfilter.h:299 [inline]
       ip_local_deliver+0x133/0x210 net/ipv4/ip_input.c:252
       dst_input include/net/dst.h:442 [inline]
       ip_rcv_finish+0x121/0x160 net/ipv4/ip_input.c:413
       NF_HOOK include/linux/netfilter.h:305 [inline]
       NF_HOOK include/linux/netfilter.h:299 [inline]
       ip_rcv+0x18f/0x1a0 net/ipv4/ip_input.c:523
       __netif_receive_skb_one_core+0xa7/0xe0 net/core/dev.c:5010
       __netif_receive_skb+0x37/0xf0 net/core/dev.c:5124
       process_backlog+0x1d3/0x420 net/core/dev.c:5955
       napi_poll net/core/dev.c:6392 [inline]
       net_rx_action+0x3ae/0xa90 net/core/dev.c:6460
       __do_softirq+0x115/0x33f kernel/softirq.c:292
       run_ksoftirqd+0x46/0x60 kernel/softirq.c:603
       smpboot_thread_fn+0x37d/0x4a0 kernel/smpboot.c:165
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 1 PID: 16 Comm: ksoftirqd/1 Not tainted 5.4.0-rc3+ #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      0cfaf03c
  9. 06 Nov, 2019 2 commits
    • E
      sch_netem: fix rcu splat in netem_enqueue() · a6c91087
      Eric Dumazet committed
      commit 159d2c7d8106177bd9a986fd005a311fe0d11285 upstream.
      
      qdisc_root() use from netem_enqueue() triggers a lockdep warning.
      
      __dev_queue_xmit() uses rcu_read_lock_bh(), which is
      not equivalent to rcu_read_lock() + local_bh_disable() as far
      as lockdep is concerned.
      
      WARNING: suspicious RCU usage
      5.3.0-rc7+ #0 Not tainted
      -----------------------------
      include/net/sch_generic.h:492 suspicious rcu_dereference_check() usage!
      
      other info that might help us debug this:
      
      rcu_scheduler_active = 2, debug_locks = 1
      3 locks held by syz-executor427/8855:
       #0: 00000000b5525c01 (rcu_read_lock_bh){....}, at: lwtunnel_xmit_redirect include/net/lwtunnel.h:92 [inline]
       #0: 00000000b5525c01 (rcu_read_lock_bh){....}, at: ip_finish_output2+0x2dc/0x2570 net/ipv4/ip_output.c:214
       #1: 00000000b5525c01 (rcu_read_lock_bh){....}, at: __dev_queue_xmit+0x20a/0x3650 net/core/dev.c:3804
       #2: 00000000364bae92 (&(&sch->q.lock)->rlock){+.-.}, at: spin_lock include/linux/spinlock.h:338 [inline]
       #2: 00000000364bae92 (&(&sch->q.lock)->rlock){+.-.}, at: __dev_xmit_skb net/core/dev.c:3502 [inline]
       #2: 00000000364bae92 (&(&sch->q.lock)->rlock){+.-.}, at: __dev_queue_xmit+0x14b8/0x3650 net/core/dev.c:3838
      
      stack backtrace:
      CPU: 0 PID: 8855 Comm: syz-executor427 Not tainted 5.3.0-rc7+ #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x172/0x1f0 lib/dump_stack.c:113
       lockdep_rcu_suspicious+0x153/0x15d kernel/locking/lockdep.c:5357
       qdisc_root include/net/sch_generic.h:492 [inline]
       netem_enqueue+0x1cfb/0x2d80 net/sched/sch_netem.c:479
       __dev_xmit_skb net/core/dev.c:3527 [inline]
       __dev_queue_xmit+0x15d2/0x3650 net/core/dev.c:3838
       dev_queue_xmit+0x18/0x20 net/core/dev.c:3902
       neigh_hh_output include/net/neighbour.h:500 [inline]
       neigh_output include/net/neighbour.h:509 [inline]
       ip_finish_output2+0x1726/0x2570 net/ipv4/ip_output.c:228
       __ip_finish_output net/ipv4/ip_output.c:308 [inline]
       __ip_finish_output+0x5fc/0xb90 net/ipv4/ip_output.c:290
       ip_finish_output+0x38/0x1f0 net/ipv4/ip_output.c:318
       NF_HOOK_COND include/linux/netfilter.h:294 [inline]
       ip_mc_output+0x292/0xf40 net/ipv4/ip_output.c:417
       dst_output include/net/dst.h:436 [inline]
       ip_local_out+0xbb/0x190 net/ipv4/ip_output.c:125
       ip_send_skb+0x42/0xf0 net/ipv4/ip_output.c:1555
       udp_send_skb.isra.0+0x6b2/0x1160 net/ipv4/udp.c:887
       udp_sendmsg+0x1e96/0x2820 net/ipv4/udp.c:1174
       inet_sendmsg+0x9e/0xe0 net/ipv4/af_inet.c:807
       sock_sendmsg_nosec net/socket.c:637 [inline]
       sock_sendmsg+0xd7/0x130 net/socket.c:657
       ___sys_sendmsg+0x3e2/0x920 net/socket.c:2311
       __sys_sendmmsg+0x1bf/0x4d0 net/socket.c:2413
       __do_sys_sendmmsg net/socket.c:2442 [inline]
       __se_sys_sendmmsg net/socket.c:2439 [inline]
       __x64_sys_sendmmsg+0x9d/0x100 net/socket.c:2439
       do_syscall_64+0xfd/0x6a0 arch/x86/entry/common.c:296
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      a6c91087
    • E
      llc: fix sk_buff leak in llc_conn_service() · d634bd01
      Eric Biggers committed
      commit b74555de21acd791f12c4a1aeaf653dd7ac21133 upstream.
      
      syzbot reported:
      
          BUG: memory leak
          unreferenced object 0xffff88811eb3de00 (size 224):
             comm "syz-executor559", pid 7315, jiffies 4294943019 (age 10.300s)
             hex dump (first 32 bytes):
               00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
               00 a0 38 24 81 88 ff ff 00 c0 f2 15 81 88 ff ff  ..8$............
             backtrace:
               [<000000008d1c66a1>] kmemleak_alloc_recursive  include/linux/kmemleak.h:55 [inline]
               [<000000008d1c66a1>] slab_post_alloc_hook mm/slab.h:439 [inline]
               [<000000008d1c66a1>] slab_alloc_node mm/slab.c:3269 [inline]
               [<000000008d1c66a1>] kmem_cache_alloc_node+0x153/0x2a0 mm/slab.c:3579
               [<00000000447d9496>] __alloc_skb+0x6e/0x210 net/core/skbuff.c:198
               [<000000000cdbf82f>] alloc_skb include/linux/skbuff.h:1058 [inline]
               [<000000000cdbf82f>] llc_alloc_frame+0x66/0x110 net/llc/llc_sap.c:54
               [<000000002418b52e>] llc_conn_ac_send_sabme_cmd_p_set_x+0x2f/0x140  net/llc/llc_c_ac.c:777
               [<000000001372ae17>] llc_exec_conn_trans_actions net/llc/llc_conn.c:475  [inline]
               [<000000001372ae17>] llc_conn_service net/llc/llc_conn.c:400 [inline]
               [<000000001372ae17>] llc_conn_state_process+0x1ac/0x640  net/llc/llc_conn.c:75
               [<00000000f27e53c1>] llc_establish_connection+0x110/0x170  net/llc/llc_if.c:109
               [<00000000291b2ca0>] llc_ui_connect+0x10e/0x370 net/llc/af_llc.c:477
               [<000000000f9c740b>] __sys_connect+0x11d/0x170 net/socket.c:1840
               [...]
      
      The bug is that most callers of llc_conn_send_pdu() assume it consumes a
      reference to the skb, when actually due to commit b85ab56c ("llc:
      properly handle dev_queue_xmit() return value") it doesn't.
      
      Revert most of that commit, and instead make the few places that need
      llc_conn_send_pdu() to *not* consume a reference call skb_get() before.
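
The ownership convention restored here can be sketched with a toy reference-counted buffer. All `sketch_` names are hypothetical; this models only the "send consumes one reference, take an extra one first if you need the buffer afterwards" rule, not the llc code or the real skb API.

```c
#include <assert.h>
#include <stdlib.h>

struct sketch_buf {
    int refs;
};

/* A new buffer starts with one reference owned by the caller. */
static struct sketch_buf *sketch_alloc(void)
{
    struct sketch_buf *b = malloc(sizeof(*b));
    if (b)
        b->refs = 1;
    return b;
}

/* Take an extra reference (plays the role skb_get() has in the text). */
static struct sketch_buf *sketch_get(struct sketch_buf *b)
{
    b->refs++;
    return b;
}

/* Drop one reference; returns 1 if this freed the buffer, 0 otherwise. */
static int sketch_put(struct sketch_buf *b)
{
    assert(b->refs > 0);
    if (--b->refs == 0) {
        free(b);
        return 1;
    }
    return 0;
}

/* "Send" always consumes exactly one reference, on success or failure. */
static int sketch_send(struct sketch_buf *b)
{
    return sketch_put(b);
}
```

A caller that simply hands the buffer off calls sketch_send() and forgets it. A caller that still needs the buffer calls sketch_get() first and later drops its own reference with sketch_put(); under this convention neither pattern leaks.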
      
      Fixes: b85ab56c ("llc: properly handle dev_queue_xmit() return value")
      Reported-by: syzbot+6b825a6494a04cc0e3f7@syzkaller.appspotmail.com
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      d634bd01
  10. 21 Sep, 2019 1 commit
  11. 16 Sep, 2019 1 commit
    • M
      {nl,mac}80211: fix interface combinations on crypto controlled devices · 1aa38ece
      Manikanta Pubbisetty committed
      [ Upstream commit e6f4051123fd33901e9655a675b22aefcdc5d277 ]
      
      Commit 33d915d9e8ce ("{nl,mac}80211: allow 4addr AP operation on
      crypto controlled devices") has introduced a change which allows
      4addr operation on crypto controlled devices (ex: ath10k). This
      change has inadvertently impacted the interface combinations logic
      on such devices.
      
      The general rule is that software interfaces like AP/VLAN should not be
      listed under supported interface combinations and should not be
      considered during validation of these combinations. Because of the
      aforementioned change, AP/VLAN interfaces (if present) are checked
      against the interfaces supported by the device, which blocks valid
      interface combinations.
      
      Consider a case where an AP and AP/VLAN are up and running; when a
      second AP device is brought up on the same physical device, this AP
      is checked against the AP/VLAN interface (which is not part of the
      supported interface combinations of the device), which blocks the
      second AP from coming up.
      
      Add a new API, cfg80211_iftype_allowed(), to fix the problem. This
      API works for all devices, with or without SW crypto control.
      Signed-off-by: Manikanta Pubbisetty <mpubbise@codeaurora.org>
      Fixes: 33d915d9e8ce ("{nl,mac}80211: allow 4addr AP operation on crypto controlled devices")
      Link: https://lore.kernel.org/r/1563779690-9716-1-git-send-email-mpubbise@codeaurora.org
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      1aa38ece
  12. 10 Sep, 2019 3 commits
    • P
      netfilter: nf_tables: use-after-free in failing rule with bound set · 5776970f
      Pablo Neira Ayuso committed
      [ Upstream commit 6a0a8d10a3661a036b55af695542a714c429ab7c ]
      
      If a rule that already has a bound anonymous set fails to be added, the
      preparation phase releases the rule and the bound set. However, the
      transaction object from the abort path still has a reference to the set
      object that is stale, leading to a use-after-free when checking for the
      set->bound field. Add a new field to the transaction that specifies if
      the set is bound, so the abort path can skip releasing it since the rule
      command owns it and it takes care of releasing it. After this update,
      the set->bound field is removed.
      
      [   24.649883] Unable to handle kernel paging request at virtual address 0000000000040434
      [   24.657858] Mem abort info:
      [   24.660686]   ESR = 0x96000004
      [   24.663769]   Exception class = DABT (current EL), IL = 32 bits
      [   24.669725]   SET = 0, FnV = 0
      [   24.672804]   EA = 0, S1PTW = 0
      [   24.675975] Data abort info:
      [   24.678880]   ISV = 0, ISS = 0x00000004
      [   24.682743]   CM = 0, WnR = 0
      [   24.685723] user pgtable: 4k pages, 48-bit VAs, pgdp=0000000428952000
      [   24.692207] [0000000000040434] pgd=0000000000000000
      [   24.697119] Internal error: Oops: 96000004 [#1] SMP
      [...]
      [   24.889414] Call trace:
      [   24.891870]  __nf_tables_abort+0x3f0/0x7a0
      [   24.895984]  nf_tables_abort+0x20/0x40
      [   24.899750]  nfnetlink_rcv_batch+0x17c/0x588
      [   24.904037]  nfnetlink_rcv+0x13c/0x190
      [   24.907803]  netlink_unicast+0x18c/0x208
      [   24.911742]  netlink_sendmsg+0x1b0/0x350
      [   24.915682]  sock_sendmsg+0x4c/0x68
      [   24.919185]  ___sys_sendmsg+0x288/0x2c8
      [   24.923037]  __sys_sendmsg+0x7c/0xd0
      [   24.926628]  __arm64_sys_sendmsg+0x2c/0x38
      [   24.930744]  el0_svc_common.constprop.0+0x94/0x158
      [   24.935556]  el0_svc_handler+0x34/0x90
      [   24.939322]  el0_svc+0x8/0xc
      [   24.942216] Code: 37280300 f9404023 91014262 aa1703e0 (f9401863)
      [   24.948336] ---[ end trace cebbb9dcbed3b56f ]---
      
      Fixes: f6ac85858976 ("netfilter: nf_tables: unbind set in rule from commit path")
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      5776970f
    • C
      net_sched: fix a NULL pointer deref in ipt action · 38166934
      Cong Wang committed
      [ Upstream commit 981471bd3abf4d572097645d765391533aac327d ]
      
      The net pointer in struct xt_tgdtor_param is not explicitly
      initialized and is therefore still NULL when it is dereferenced.
      So we have to find a way to pass the correct net pointer to
      ipt_destroy_target().
      
      The best way I found is to save the net pointer inside the per-netns
      struct tcf_idrinfo, which also keeps this patch small.
      
      Fixes: 0c66dc1e ("netfilter: conntrack: register hooks in netns when needed by ruleset")
      Reported-and-tested-by: itugrok@yahoo.com
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Cc: Jiri Pirko <jiri@resnulli.us>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      38166934
    • V
      net: sched: act_sample: fix psample group handling on overwrite · 5ff0ab0c
      Vlad Buslov committed
      [ Upstream commit dbf47a2a094edf58983265e323ca4bdcdb58b5ee ]
      
      Action sample doesn't properly handle the psample_group pointer in the
      overwrite case. The following issues need to be fixed:
      
      - In the tcf_sample_init() function, RCU_INIT_POINTER() is used to set
        s->psample_group, even though we are neither setting the pointer to
        NULL nor preventing concurrent readers from accessing the pointer in
        some way. Use rcu_swap_protected() instead to safely reset the pointer.
      
      - The old value of s->psample_group is not released or deallocated in any
        way, which results in a resource leak. Use psample_group_put() on the
        non-NULL value obtained with rcu_swap_protected().
      
      - The function psample_group_put(), which releases the reference to the
        struct psample_group pointed to by the RCU pointer s->psample_group,
        doesn't respect the RCU grace period when deallocating it. Extend
        struct psample_group with an rcu head and use kfree_rcu() when freeing
        it.
      
      Fixes: 5c5670fa ("net/sched: Introduce sample tc action")
      Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      5ff0ab0c
  13. 28 Jul, 2019 4 commits
  14. 26 Jul, 2019 1 commit
    • J
      ipvs: fix tinfo memory leak in start_sync_thread · fe2ceeb4
      Julian Anastasov committed
      [ Upstream commit 5db7c8b9f9fc2aeec671ae3ca6375752c162e0e7 ]
      
      syzkaller reports a memory leak in start_sync_thread [1]
      
      As Eric points out, a kthread may start and stop before the
      threadfn function is called, so there is no chance for the
      data (tinfo in our case) to be released in the thread.
      
      Fix this by releasing tinfo in the controlling code instead.
      
      [1]
      BUG: memory leak
      unreferenced object 0xffff8881206bf700 (size 32):
       comm "syz-executor761", pid 7268, jiffies 4294943441 (age 20.470s)
       hex dump (first 32 bytes):
         00 40 7c 09 81 88 ff ff 80 45 b8 21 81 88 ff ff  .@|......E.!....
         00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
       backtrace:
         [<0000000057619e23>] kmemleak_alloc_recursive include/linux/kmemleak.h:55 [inline]
         [<0000000057619e23>] slab_post_alloc_hook mm/slab.h:439 [inline]
         [<0000000057619e23>] slab_alloc mm/slab.c:3326 [inline]
         [<0000000057619e23>] kmem_cache_alloc_trace+0x13d/0x280 mm/slab.c:3553
         [<0000000086ce5479>] kmalloc include/linux/slab.h:547 [inline]
         [<0000000086ce5479>] start_sync_thread+0x5d2/0xe10 net/netfilter/ipvs/ip_vs_sync.c:1862
         [<000000001a9229cc>] do_ip_vs_set_ctl+0x4c5/0x780 net/netfilter/ipvs/ip_vs_ctl.c:2402
         [<00000000ece457c8>] nf_sockopt net/netfilter/nf_sockopt.c:106 [inline]
         [<00000000ece457c8>] nf_setsockopt+0x4c/0x80 net/netfilter/nf_sockopt.c:115
         [<00000000942f62d4>] ip_setsockopt net/ipv4/ip_sockglue.c:1258 [inline]
         [<00000000942f62d4>] ip_setsockopt+0x9b/0xb0 net/ipv4/ip_sockglue.c:1238
         [<00000000a56a8ffd>] udp_setsockopt+0x4e/0x90 net/ipv4/udp.c:2616
         [<00000000fa895401>] sock_common_setsockopt+0x38/0x50 net/core/sock.c:3130
         [<0000000095eef4cf>] __sys_setsockopt+0x98/0x120 net/socket.c:2078
         [<000000009747cf88>] __do_sys_setsockopt net/socket.c:2089 [inline]
         [<000000009747cf88>] __se_sys_setsockopt net/socket.c:2086 [inline]
         [<000000009747cf88>] __x64_sys_setsockopt+0x26/0x30 net/socket.c:2086
         [<00000000ded8ba80>] do_syscall_64+0x76/0x1a0 arch/x86/entry/common.c:301
         [<00000000893b4ac8>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Reported-by: syzbot+7e2e50c8adfccd2e5041@syzkaller.appspotmail.com
      Suggested-by: Eric Biggers <ebiggers@kernel.org>
      Fixes: 998e7a76 ("ipvs: Use kthread_run() instead of doing a double-fork via kernel_thread()")
      Signed-off-by: Julian Anastasov <ja@ssi.bg>
      Acked-by: Simon Horman <horms@verge.net.au>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      fe2ceeb4
  15. 14 Jul, 2019 1 commit
  16. 10 Jul, 2019 1 commit
    • E
      ip6: fix skb leak in ip6frag_expire_frag_queue() · a8891c5e
      Eric Dumazet committed
      [ Upstream commit 47d3d7fdb10a21c223036b58bd70ffdc24a472c4 ]
      
      Since ip6frag_expire_frag_queue() now pulls the head skb
      from frag queue, we should no longer use skb_get(), since
      this leads to an skb leak.
      
      Stefan Bader initially reported a problem in 4.4.stable [1] caused
      by the skb_get(), so this patch should also fix this issue.
      
      [296583.091021] kernel BUG at /build/linux-6VmqmP/linux-4.4.0/net/core/skbuff.c:1207!
      [296583.091734] Call Trace:
      [296583.091749]  [<ffffffff81740e50>] __pskb_pull_tail+0x50/0x350
      [296583.091764]  [<ffffffff8183939a>] _decode_session6+0x26a/0x400
      [296583.091779]  [<ffffffff817ec719>] __xfrm_decode_session+0x39/0x50
      [296583.091795]  [<ffffffff818239d0>] icmpv6_route_lookup+0xf0/0x1c0
      [296583.091809]  [<ffffffff81824421>] icmp6_send+0x5e1/0x940
      [296583.091823]  [<ffffffff81753238>] ? __netif_receive_skb+0x18/0x60
      [296583.091838]  [<ffffffff817532b2>] ? netif_receive_skb_internal+0x32/0xa0
      [296583.091858]  [<ffffffffc0199f74>] ? ixgbe_clean_rx_irq+0x594/0xac0 [ixgbe]
      [296583.091876]  [<ffffffffc04eb260>] ? nf_ct_net_exit+0x50/0x50 [nf_defrag_ipv6]
      [296583.091893]  [<ffffffff8183d431>] icmpv6_send+0x21/0x30
      [296583.091906]  [<ffffffff8182b500>] ip6_expire_frag_queue+0xe0/0x120
      [296583.091921]  [<ffffffffc04eb27f>] nf_ct_frag6_expire+0x1f/0x30 [nf_defrag_ipv6]
      [296583.091938]  [<ffffffff810f3b57>] call_timer_fn+0x37/0x140
      [296583.091951]  [<ffffffffc04eb260>] ? nf_ct_net_exit+0x50/0x50 [nf_defrag_ipv6]
      [296583.091968]  [<ffffffff810f5464>] run_timer_softirq+0x234/0x330
      [296583.091982]  [<ffffffff8108a339>] __do_softirq+0x109/0x2b0
      
      Fixes: d4289fcc9b16 ("net: IP6 defrag: use rbtrees for IPv6 defrag")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Stefan Bader <stefan.bader@canonical.com>
      Cc: Peter Oskolkov <posk@google.com>
      Cc: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      a8891c5e
  17. 03 Jul, 2019 2 commits
    • T
      9p: Add refcount to p9_req_t · 3665a4d9
      Tomas Bortoli committed
      [ Upstream commit 728356dedeff8ef999cb436c71333ef4ac51a81c ]
      
      To avoid use-after-free(s), use a refcount to keep track of the
      usable references to any instantiated struct p9_req_t.
      
      This commit adds p9_req_put(), p9_req_get() and p9_req_try_get() as
      wrappers to kref_put(), kref_get() and kref_get_unless_zero().
      These are used by the client and the transports to keep track of
      valid requests' references.
      
      p9_free_req() is added back and used as callback by kref_put().
      
      Add SLAB_TYPESAFE_BY_RCU as it ensures that the memory freed by
      kmem_cache_free() will not be reused for another type until the rcu
      synchronisation period is over, so an address gotten under rcu read
      lock is safe to inc_ref() without corrupting random memory while
      the lock is held.
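
The get/put/try-get pattern described above can be illustrated outside the kernel. The following is a minimal userspace sketch, assuming C11 atomics in place of `struct kref`; `struct req`, `req_get()`, `req_try_get()`, `req_put()` and `freed_count` are hypothetical names chosen for illustration and are not the 9p client's actual code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for a request object; only the refcount matters. */
struct req {
    atomic_int refcount;
    int tag;
};

/* Demonstration counter: how many times the release callback ran. */
static int freed_count;

/* Release callback, run when the last reference is dropped. */
static void req_free(struct req *r)
{
    freed_count++;
    free(r);
}

/* Allocate a request that starts with one reference held by the caller. */
static struct req *req_alloc(int tag)
{
    struct req *r = malloc(sizeof(*r));
    if (!r)
        return NULL;
    atomic_init(&r->refcount, 1);
    r->tag = tag;
    return r;
}

/* Take a reference; the caller must already hold a valid one. */
static void req_get(struct req *r)
{
    atomic_fetch_add(&r->refcount, 1);
}

/*
 * Take a reference only if the count has not already dropped to zero,
 * in the spirit of kref_get_unless_zero(). Returns true on success.
 */
static bool req_try_get(struct req *r)
{
    int old = atomic_load(&r->refcount);

    while (old != 0) {
        /* On failure, old is reloaded with the current value. */
        if (atomic_compare_exchange_weak(&r->refcount, &old, old + 1))
            return true;
    }
    return false;
}

/* Drop a reference; returns true if this call released the object. */
static bool req_put(struct req *r)
{
    int old = atomic_fetch_sub(&r->refcount, 1);

    assert(old > 0); /* dropping below zero would be a caller bug */
    if (old == 1) {
        req_free(r);
        return true;
    }
    return false;
}
```

In this sketch a lookup path would call `req_try_get()` on a pointer it found and give up if that returns false, while every holder of a reference ends with `req_put()`. The memory behind a pointer must still be valid when `req_try_get()` reads the counter, which is the guarantee the commit obtains from `SLAB_TYPESAFE_BY_RCU`; the sketch does not model that part.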
      
      Link: http://lkml.kernel.org/r/1535626341-20693-1-git-send-email-asmadeus@codewreck.org
      Co-developed-by: Dominique Martinet <dominique.martinet@cea.fr>
      Signed-off-by: Tomas Bortoli <tomasbortoli@gmail.com>
      Reported-by: syzbot+467050c1ce275af2a5b8@syzkaller.appspotmail.com
      Signed-off-by: Dominique Martinet <dominique.martinet@cea.fr>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • D
      9p: add a per-client fcall kmem_cache · be87f21e
      Committed by Dominique Martinet
      [ Upstream commit 91a76be37ff89795526c452a6799576b03bec501 ]
      
      Having a specific cache for the fcall allocations helps speed up
      end-to-end latency.
      
      The caches will automatically be merged if there are multiple caches
      of items with the same size so we do not need to try to share a cache
      between different clients of the same size.
      
      Since the msize is negotiated with the server, only allocate the cache
      after that negotiation has happened - previous allocations or
      allocations of different sizes (e.g. zero-copy fcall) are made with
      kmalloc directly.
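
The allocation policy described above (use the dedicated cache only for buffers of the negotiated size, and fall back to a plain allocation otherwise) can be sketched in userspace. This is a rough illustration under stated assumptions, not the kernel implementation: `struct buf_cache`, `struct fcall`, `buf_cache_create()`, `fcall_init()` and `fcall_fini()` are hypothetical names, a simple free list stands in for a kmem_cache, `malloc()` stands in for kmalloc, and the code is not thread-safe.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical fixed-size buffer cache, standing in for a kmem_cache. */
struct buf_cache {
    size_t obj_size;  /* 0 until the size has been negotiated */
    void *free_list;  /* singly linked list of returned buffers */
    int reused;       /* demonstration counter: buffers served from the list */
};

/* Hypothetical call buffer that remembers where its memory came from. */
struct fcall {
    void *data;
    size_t capacity;
    struct buf_cache *cache; /* NULL when data came from malloc() directly */
};

/* Create the cache once the buffer size is known (after negotiation). */
static void buf_cache_create(struct buf_cache *c, size_t size)
{
    assert(size > 0);
    c->obj_size = size;
    c->free_list = NULL;
    c->reused = 0;
}

/* Allocate a buffer: cached path only when the size matches the cache. */
static int fcall_init(struct buf_cache *c, struct fcall *fc, size_t size)
{
    fc->cache = NULL;
    if (c->obj_size != 0 && size == c->obj_size) {
        fc->cache = c;
        if (c->free_list) {
            /* Pop a previously returned buffer. */
            fc->data = c->free_list;
            c->free_list = *(void **)fc->data;
            c->reused++;
        } else {
            /* Leave room for the free-list link stored in idle buffers. */
            fc->data = malloc(size < sizeof(void *) ? sizeof(void *) : size);
        }
    } else {
        /* Cache not created yet, or a different size: direct allocation. */
        fc->data = malloc(size);
    }
    if (!fc->data)
        return -1;
    fc->capacity = size;
    return 0;
}

/* Release a buffer to wherever it came from. */
static void fcall_fini(struct fcall *fc)
{
    if (!fc->data)
        return;
    if (fc->cache) {
        /* Push onto the cache's free list for later reuse. */
        *(void **)fc->data = fc->cache->free_list;
        fc->cache->free_list = fc->data;
    } else {
        free(fc->data);
    }
    fc->data = NULL;
}
```

Recording the origin in the buffer itself means the release path never has to guess which allocator to return the memory to, even when allocations made before the cache existed are freed after it was created.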
      
      Some figures on two beefy VMs with Connect-IB (sriov) / trans=rdma,
      with ior running 32 processes in parallel doing small 32 bytes IOs:
       - no alloc (4.18-rc7 request cache): 65.4k req/s
       - non-power of two alloc, no patch: 61.6k req/s
       - power of two alloc, no patch: 62.2k req/s
       - non-power of two alloc, with patch: 64.7k req/s
       - power of two alloc, with patch: 65.1k req/s
      
      Link: http://lkml.kernel.org/r/1532943263-24378-2-git-send-email-asmadeus@codewreck.org
      Signed-off-by: Dominique Martinet <dominique.martinet@cea.fr>
      Acked-by: Jun Piao <piaojun@huawei.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Greg Kurz <groug@kaod.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>