1. May 17, 2017 (4 commits)
    • tcp: internal implementation for pacing · 218af599
      Committed by Eric Dumazet
      BBR congestion control depends on pacing, and pacing is
      currently handled by the sch_fq packet scheduler for performance
      reasons, and also because implementing pacing with FQ was convenient
      for truly avoiding bursts.
      
      However there are many cases where this packet scheduler constraint
      is not practical.
      - Many linux hosts are not focusing on handling thousands of TCP
        flows in the most efficient way.
      - Some routers use fq_codel or other AQM, but still would like
        to use BBR for the few TCP flows they initiate/terminate.
      
      This patch implements an automatic fallback to internal pacing.
      
      Pacing is requested either by BBR or by use of the SO_MAX_PACING_RATE option.
      
      If sch_fq happens to be in the egress path, pacing is delegated to
      the qdisc, otherwise pacing is done by TCP itself.
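      
      As a rough illustration (not part of this patch; assumes a libc that
      exposes SO_MAX_PACING_RATE), an application can request a pacing rate
      on a TCP socket like this, and the rate is now honored whether or not
      sch_fq sits in the egress path:
      
      #include <stdio.h>
      #include <sys/socket.h>
      
      /* Request ~48 Mbit/s of pacing; SO_MAX_PACING_RATE takes bytes/sec. */
      static int set_pacing_rate(int fd, unsigned int bytes_per_sec)
      {
              if (setsockopt(fd, SOL_SOCKET, SO_MAX_PACING_RATE,
                             &bytes_per_sec, sizeof(bytes_per_sec)) < 0) {
                      perror("setsockopt(SO_MAX_PACING_RATE)");
                      return -1;
              }
              return 0;
      }
      
      int main(void)
      {
              int fd = socket(AF_INET, SOCK_STREAM, 0);
      
              if (fd < 0)
                      return 1;
              /* 48 Mbit/s is roughly 6,000,000 bytes per second. */
              return set_pacing_rate(fd, 6 * 1000 * 1000) ? 1 : 0;
      }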
      
      One advantage of pacing from the TCP stack is more precise RTT
      estimations and less work done from TX completion, since TCP Small
      Queues limits are generally not hit. Setups with a single TX queue but
      many CPUs might even benefit from this.
      
      Note that unlike sch_fq, we do not take into account header sizes.
      Taking care of these headers would add additional complexity for
      no practical differences in behavior.
      
      Some performance numbers using 800 TCP_STREAM flows rate limited to
      ~48 Mbit per second on 40Gbit NIC.
      
      If MQ+pfifo_fast is used on the NIC :
      
      $ sar -n DEV 1 5 | grep eth
      14:48:44         eth0 725743.00 2932134.00  46776.76 4335184.68      0.00      0.00      1.00
      14:48:45         eth0 725349.00 2932112.00  46751.86 4335158.90      0.00      0.00      0.00
      14:48:46         eth0 725101.00 2931153.00  46735.07 4333748.63      0.00      0.00      0.00
      14:48:47         eth0 725099.00 2931161.00  46735.11 4333760.44      0.00      0.00      1.00
      14:48:48         eth0 725160.00 2931731.00  46738.88 4334606.07      0.00      0.00      0.00
      Average:         eth0 725290.40 2931658.20  46747.54 4334491.74      0.00      0.00      0.40
      $ vmstat 1 5
      procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
       r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
       4  0      0 259825920  45644 2708324    0    0    21     2  247   98  0  0 100  0  0
       4  0      0 259823744  45644 2708356    0    0     0     0 2400825 159843  0 19 81  0  0
       0  0      0 259824208  45644 2708072    0    0     0     0 2407351 159929  0 19 81  0  0
       1  0      0 259824592  45644 2708128    0    0     0     0 2405183 160386  0 19 80  0  0
       1  0      0 259824272  45644 2707868    0    0     0    32 2396361 158037  0 19 81  0  0
      
      Now use MQ+FQ :
      
      lpaa23:~# echo fq >/proc/sys/net/core/default_qdisc
      lpaa23:~# tc qdisc replace dev eth0 root mq
      
      $ sar -n DEV 1 5 | grep eth
      14:49:57         eth0 678614.00 2727930.00  43739.13 4033279.14      0.00      0.00      0.00
      14:49:58         eth0 677620.00 2723971.00  43674.69 4027429.62      0.00      0.00      1.00
      14:49:59         eth0 676396.00 2719050.00  43596.83 4020125.02      0.00      0.00      0.00
      14:50:00         eth0 675197.00 2714173.00  43518.62 4012938.90      0.00      0.00      1.00
      14:50:01         eth0 676388.00 2719063.00  43595.47 4020171.64      0.00      0.00      0.00
      Average:         eth0 676843.00 2720837.40  43624.95 4022788.86      0.00      0.00      0.40
      $ vmstat 1 5
      procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
       r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
       2  0      0 259832240  46008 2710912    0    0    21     2  223  192  0  1 99  0  0
       1  0      0 259832896  46008 2710744    0    0     0     0 1702206 198078  0 17 82  0  0
       0  0      0 259830272  46008 2710596    0    0     0     0 1696340 197756  1 17 83  0  0
       4  0      0 259829168  46024 2710584    0    0    16     0 1688472 197158  1 17 82  0  0
       3  0      0 259830224  46024 2710408    0    0     0     0 1692450 197212  0 18 82  0  0
      
      As expected, number of interrupts per second is very different.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Van Jacobson <vanj@google.com>
      Cc: Jerry Chu <hkchu@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • udp: keep the sk_receive_queue held when splicing · 6dfb4367
      Committed by Paolo Abeni
      On packet reception, when we are forced to splice the
      sk_receive_queue, we can keep the related lock held, so
      that we can avoid re-acquiring it, if fwd memory
      scheduling is required.
      
      v1 -> v2:
        the rx_queue_lock_held param in udp_rmem_release() is
        now a bool
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • udp: use a separate rx queue for packet reception · 2276f58a
      Committed by Paolo Abeni
      Under udp flood the sk_receive_queue spinlock is heavily contended.
      This patch tries to reduce the contention on that lock by adding a
      second receive queue to the udp sockets; recvmsg() looks first
      in this queue and, only if it is empty, tries to fetch the data from
      sk_receive_queue. The latter is spliced into the newly added
      queue every time the receive path has to acquire the
      sk_receive_queue lock.
      
      The accounting of forward allocated memory is still protected with
      the sk_receive_queue lock, so udp_rmem_release() needs to acquire
      both locks when the forward deficit is flushed.
      
      In specific scenarios we can end up acquiring and releasing the
      sk_receive_queue lock multiple times; that will be addressed by
      the next patch.
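      
      A minimal userspace analogue of the resulting two-queue scheme
      (hypothetical names, pthreads standing in for the kernel locking
      primitives): the reader drains a private list without any lock, and
      only takes the shared lock to splice the whole producer queue across
      when the private list runs dry:
      
      #include <pthread.h>
      
      struct pkt { struct pkt *next; };
      
      static pthread_mutex_t shared_lock = PTHREAD_MUTEX_INITIALIZER;
      static struct pkt *shared_q;   /* producers append here under shared_lock */
      static struct pkt *private_q;  /* consumed by the single reader, lock-free */
      
      /* Reader side: look at the private queue first; only if it is empty,
       * grab the shared lock once and move the whole shared queue over.
       * (Ordering details are ignored here; only the locking pattern matters.) */
      static struct pkt *dequeue(void)
      {
              struct pkt *p;
      
              if (!private_q) {
                      pthread_mutex_lock(&shared_lock);
                      private_q = shared_q;      /* splice in O(1) */
                      shared_q = NULL;
                      pthread_mutex_unlock(&shared_lock);
              }
              p = private_q;
              if (p)
                      private_q = p->next;
              return p;
      }
      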
      Suggested-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/sock: factor out dequeue/peek with offset code · 65101aec
      Committed by Paolo Abeni
      And update __sk_queue_drop_skb() to work on the specified queue.
      This will help the udp protocol to use an additional private
      rx queue in a later patch.
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. May 16, 2017 (3 commits)
  3. May 12, 2017 (8 commits)
    • sctp: fix src address selection if using secondary addresses for ipv6 · dbc2b5e9
      Committed by Xin Long
      Commit 0ca50d12 ("sctp: fix src address selection if using secondary
      addresses") has fixed a src address selection issue when using secondary
      addresses for ipv4.
      
      Now sctp ipv6 has a similar issue. When using a secondary address,
      sctp_v6_get_dst tries to choose the saddr that shares the most leading
      bits with the daddr, via sctp_v6_addr_match_len. This can make some
      cases not work as expected.
      
      hostA:
        [1] fd21:356b:459a:cf10::11 (eth1)
        [2] fd21:356b:459a:cf20::11 (eth2)
      
      hostB:
        [a] fd21:356b:459a:cf30::2  (eth1)
        [b] fd21:356b:459a:cf40::2  (eth2)
      
      route from hostA to hostB:
        fd21:356b:459a:cf30::/64 dev eth1  metric 1024  mtu 1500
      
      The expected path should be:
        fd21:356b:459a:cf10::11 <-> fd21:356b:459a:cf30::2
      But addr[2] matches addr[a] in more bits than addr[1] does, according to
      sctp_v6_addr_match_len. This causes the path to be:
        fd21:356b:459a:cf20::11 <-> fd21:356b:459a:cf30::2
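      
      The bit arithmetic behind that: all three addresses share the first 48
      bits; in the fourth 16-bit group, 0xcf20 shares 11 more bits with 0xcf30
      while 0xcf10 shares only 10, so addr[2] scores 59 against addr[1]'s 58.
      A standalone way to check it (illustrative helper, not the kernel's
      sctp_v6_addr_match_len):
      
      #include <arpa/inet.h>
      #include <stdio.h>
      
      /* Count how many leading bits two IPv6 addresses have in common. */
      static int common_prefix_bits(const char *a, const char *b)
      {
              unsigned char x[16], y[16];
              int bits = 0, i;
      
              inet_pton(AF_INET6, a, x);
              inet_pton(AF_INET6, b, y);
              for (i = 0; i < 16; i++) {
                      unsigned char diff = x[i] ^ y[i];
      
                      if (!diff) {
                              bits += 8;
                              continue;
                      }
                      while (!(diff & 0x80)) {   /* count equal leading bits */
                              bits++;
                              diff <<= 1;
                      }
                      break;
              }
              return bits;
      }
      
      int main(void)
      {
              printf("%d\n", common_prefix_bits("fd21:356b:459a:cf10::11",
                                                "fd21:356b:459a:cf30::2")); /* 58 */
              printf("%d\n", common_prefix_bits("fd21:356b:459a:cf20::11",
                                                "fd21:356b:459a:cf30::2")); /* 59 */
              return 0;
      }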
      
      This patch fixes it in the same way as Marcelo's fix for sctp ipv4.
      As there is no ip_dev_find for ipv6, it uses ipv6_chk_addr to check
      whether the saddr is assigned to a dev instead.
      
      Note that for backwards compatibility, it will still do the addr_match_len
      check here when no optimal is found.
      Reported-by: Patrick Talbert <ptalbert@redhat.com>
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tipc: make macro tipc_wait_for_cond() smp safe · 844cf763
      Committed by Jon Paul Maloy
      The macro tipc_wait_for_cond() embeds the macro sk_wait_event()
      to fulfil its task. The latter, in turn, evaluates the stated
      condition outside the socket lock context. This is problematic if
      the condition accesses non-trivial data structures which may be
      altered by incoming interrupts, as is the case with the cong_links()
      linked list, used by the socket to keep track of the current set of
      congested links. We sometimes see crashes when this list is accessed
      by a condition function at the same time as a SOCK_WAKEUP interrupt
      is removing an element from the list.
      
      We fix this by expanding selected parts of sk_wait_event() into the
      outer macro, while ensuring that all evaluations of a given condition
      are performed under socket lock protection.
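      
      The principle, shown as a runnable userspace analogue (pthreads standing
      in for the socket lock; not the actual tipc macro): the condition is
      only ever evaluated while the lock is held, and the lock is given up
      only for the sleep itself:
      
      #include <pthread.h>
      #include <stdbool.h>
      
      static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
      static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
      static bool congested = true;   /* stands in for the congested-links state */
      
      /* The condition (!congested) is checked only with the lock held;
       * pthread_cond_wait() drops and re-takes the lock atomically around
       * the sleep, so no check ever races with a concurrent update. */
      static void wait_until_uncongested(void)
      {
              pthread_mutex_lock(&lock);
              while (congested)
                      pthread_cond_wait(&cond, &lock);
              pthread_mutex_unlock(&lock);
      }
      
      /* The "wakeup" side updates the state under the same lock. */
      static void mark_uncongested(void)
      {
              pthread_mutex_lock(&lock);
              congested = false;
              pthread_cond_broadcast(&cond);
              pthread_mutex_unlock(&lock);
      }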
      
      Fixes: commit 365ad353 ("tipc: reduce risk of user starvation during link congestion")
      Reviewed-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sched: optimize class dumps · cb395b20
      Committed by Eric Dumazet
      In commit 59cc1f61 ("net: sched: convert qdisc linked list to
      hashtable") we missed the opportunity to considerably speed up
      tc_dump_tclass_root() if a qdisc handle is provided by user.
      
      Instead of iterating all the qdiscs, use qdisc_match_from_root()
      to directly get the one we look for.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: Jiri Pirko <jiri@resnulli.us>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: avoid fragmenting peculiar skbs in SACK · b451e5d2
      Committed by Yuchung Cheng
      This patch fixes a bug in splitting an SKB during SACK
      processing. Specifically if an skb contains multiple
      packets and is only partially sacked in the higher sequences,
      tcp_match_sack_to_skb() splits the skb and marks the second fragment
      as SACKed.
      
      The current code further attempts to round the first fragment up
      to MSS boundaries. But it misses a boundary condition when the
      rounded-up fragment size (pkt_len) is exactly the skb size. Splitting
      such an skb is pointless and causes a kernel warning and aborts
      the SACK processing. This patch universally checks for such an
      over-split before calling tcp_fragment to prevent these unnecessary
      warnings.
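      
      Put differently, if rounding the SACKed part up to an MSS boundary
      already covers the whole skb, there is nothing left to split. A
      standalone sketch of that arithmetic (simplified; not the kernel code):
      
      #include <stdbool.h>
      #include <stdio.h>
      
      /* Round the in-SACK part up to an MSS boundary and decide whether a
       * split is still meaningful. */
      static bool needs_split(unsigned int skb_len, unsigned int in_sack_len,
                              unsigned int mss)
      {
              unsigned int pkt_len;
      
              if (in_sack_len % mss)
                      pkt_len = (in_sack_len / mss + 1) * mss;  /* round up */
              else
                      pkt_len = in_sack_len;
      
              /* The over-split check added here: splitting at or past the
               * end of the skb is pointless. */
              return pkt_len < skb_len;
      }
      
      int main(void)
      {
              printf("%d\n", needs_split(2920, 1000, 1460)); /* 1: split at 1460 */
              printf("%d\n", needs_split(2920, 2000, 1460)); /* 0: would split at 2920 */
              return 0;
      }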
      
      Fixes: adb92db8 ("tcp: Make SACK code to split only at mss boundaries")
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • netem: fix skb_orphan_partial() · f6ba8d33
      Committed by Eric Dumazet
      I should have known that lowering skb->truesize was dangerous :/
      
      In case packets are not leaving the host via a standard Ethernet device,
      but looped back to local sockets, bad things can happen, as reported
      by Michael Madsen ( https://bugzilla.kernel.org/show_bug.cgi?id=195713 )
      
      So instead of tweaking skb->truesize, let's change skb->destructor
      and keep a reference on the owner socket via its sk_refcnt.
      
      Fixes: f2f872f9 ("netem: Introduce skb_orphan_partial() helper")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Michael Madsen <mkm@nabto.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • xdp: refine xdp api with regards to generic xdp · d67b9cd2
      Committed by Daniel Borkmann
      While working on the iproute2 generic XDP frontend, I noticed that
      as of right now it's possible to have native *and* generic XDP
      programs loaded both at the same time for the case when a driver
      supports native XDP.
      
      The intended model for generic XDP from b5cdae32 ("net: Generic
      XDP") is, however, that only one of the two can be present at
      once, which is also indicated as such in the XDP netlink dump part.
      The main rationale for generic XDP is to ease accessibility (in
      case a driver does not yet have XDP support) and to generically
      provide a semantic model as an example for driver developers
      wanting to add XDP support. The generic XDP option for an XDP
      aware driver can still be useful for comparing and testing both
      implementations.
      
      However, it is not intended to have a second XDP processing stage
      or layer with exactly the same functionality as the first native
      stage. The only reason could be to have a partial fallback for future
      XDP features that are not yet supported in the native implementation,
      and we probably shouldn't strive for such a fallback anyway, but
      instead encourage native feature support in the first place. Given
      there's currently no such fallback issue or use case, let's not go
      there yet if we don't need to.
      
      Therefore, change the semantics for loading XDP and bail out if the
      user tries to load a generic XDP program when a native one is
      present, and vice versa. An alternative to bailing out would
      be to handle the transition from one flavor to another gracefully,
      but that would require bringing the device down, exchanging both
      types of programs, and bringing it up again in order to avoid a tiny
      window where a packet could hit both hooks. Given this complicates
      the logic for just a debugging feature in the native case, I went
      with the simpler variant.
      
      For the dump, remove IFLA_XDP_FLAGS that was added with b5cdae32
      and reuse IFLA_XDP_ATTACHED for indicating the mode. Dumping all
      or just a subset of flags that were used for loading the XDP prog
      is suboptimal in the long run since not all flags are useful for
      dumping and if we start to reuse the same flag definitions for
      load and dump, then we'll waste bit space. What we really just
      want is to dump the mode for now.
      
      Current IFLA_XDP_ATTACHED semantics are: nothing was installed (0),
      or a program is running at the native driver layer (1). Thus, add a
      mode that says that a program is running at the generic XDP layer (2).
      Applications will handle this fine, in that older binaries will
      just indicate that something is attached at the XDP layer; effectively
      this is similar to the IFLA_XDP_FLAGS attr that we would have had,
      modulo the redundancy.
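      
      For reference, a dump consumer would interpret the attribute roughly as
      below (sketch; assumes kernel headers that carry the XDP_ATTACHED_*
      constants):
      
      #include <linux/if_link.h>
      #include <stdio.h>
      
      /* Map the IFLA_XDP_ATTACHED value from an IFLA_XDP nest to a label. */
      static const char *xdp_mode_str(unsigned char attached)
      {
              switch (attached) {
              case XDP_ATTACHED_NONE: return "no XDP program";
              case XDP_ATTACHED_DRV:  return "native (driver) XDP";
              case XDP_ATTACHED_SKB:  return "generic (skb) XDP";
              default:                return "unknown mode";
              }
      }
      
      int main(void)
      {
              printf("%s\n", xdp_mode_str(XDP_ATTACHED_SKB));
              return 0;
      }
      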
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • xdp: add flag to enforce driver mode · 0489df9a
      Committed by Daniel Borkmann
      After commit b5cdae32 ("net: Generic XDP") we automatically fall
      back to a generic XDP variant if the driver does not support native
      XDP. Allow for an option where the user can specify that the native
      XDP variant should always be selected and, in case it's not supported
      by a driver, just bail out.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6/dccp: do not inherit ipv6_mc_list from parent · 83eaddab
      Committed by WANG Cong
      Like commit 657831ff ("dccp/tcp: do not inherit mc_list from parent")
      we should clear ipv6_mc_list etc. for IPv6 sockets too.
      
      Cc: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. May 10, 2017 (1 commit)
  5. May 9, 2017 (14 commits)
    • DECnet: Use container_of() for embedded struct · f92ceb01
      Committed by Kees Cook
      Instead of a direct cross-type cast, use container_of() to locate
      the embedded structure, even in the face of future struct layout
      randomization.
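      
      For readers unfamiliar with the helper, container_of() recovers a
      pointer to the enclosing structure from a pointer to one of its
      members; a minimal standalone illustration (simplified macro, without
      the kernel's extra type checking):
      
      #include <stddef.h>
      #include <stdio.h>
      
      #define container_of(ptr, type, member) \
              ((type *)((char *)(ptr) - offsetof(type, member)))
      
      struct inner { int x; };
      struct outer {
              int id;
              struct inner emb;   /* embedded struct, as in the DECnet case */
      };
      
      int main(void)
      {
              struct outer o = { .id = 7 };
              struct inner *ip = &o.emb;
              /* Recover the enclosing struct outer from the embedded member. */
              struct outer *op = container_of(ip, struct outer, emb);
      
              printf("%d\n", op->id);   /* prints 7 */
              return 0;
      }
      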
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Revert "ipv4: restore rt->fi for reference counting" · 32f1bc0f
      Committed by David S. Miller
      This reverts commit 82486aa6.
      
      As implemented, this causes dangling netdevice refs.
      Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • treewide: convert PF_MEMALLOC manipulations to new helpers · f1083048
      Committed by Vlastimil Babka
      We now have memalloc_noreclaim_{save,restore} helpers for robust setting
      and clearing of PF_MEMALLOC.  Let's convert the code which was using the
      generic tsk_restore_flags().  No functional change.
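      
      The converted call sites all end up with the same save/restore shape,
      roughly like this kernel-context sketch (not a specific hunk from the
      patch):
      
      #include <linux/sched/mm.h>
      
      /* Enter a PF_MEMALLOC section around allocations that must not recurse
       * into reclaim, then restore the previous task state. */
      static void do_emergency_alloc(void)
      {
              unsigned int noreclaim_flag;
      
              noreclaim_flag = memalloc_noreclaim_save();
              /* ... allocate memory needed to make forward progress ... */
              memalloc_noreclaim_restore(noreclaim_flag);
      }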
      
      [vbabka@suse.cz: in net/core/sock.c the hunk is missing]
      Link: http://lkml.kernel.org/r/20170405074700.29871-4-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Josef Bacik <jbacik@fb.com>
      Cc: Lee Duncan <lduncan@suse.com>
      Cc: Chris Leech <cleech@redhat.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Boris Brezillon <boris.brezillon@free-electrons.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Wouter Verhelst <w@uter.be>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • fs: ceph: CURRENT_TIME with ktime_get_real_ts() · 1134e091
      Committed by Deepa Dinamani
      CURRENT_TIME is not y2038 safe.  The macro will be deleted and all the
      references to it will be replaced by ktime_get_* apis.
      
      struct timespec is also not y2038 safe.  Retain timespec for timestamp
      representation here as ceph uses it internally everywhere.  These
      references will be changed to use struct timespec64 in a separate patch.
      
      The current_fs_time() api is being changed to use vfs struct inode* as
      an argument instead of struct super_block*.
      
      Set the new mds client request r_stamp field using ktime_get_real_ts()
      instead of using current_fs_time().
      
      Also, since r_stamp is used as mtime on the server, use timespec_trunc()
      to truncate the timestamp, using the right granularity from the
      superblock.
      
      This api will be transitioned to be y2038 safe along with vfs.
      
      Link: http://lkml.kernel.org/r/1491613030-11599-5-git-send-email-deepa.kernel@gmail.com
      Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
      Reviewed-by: Arnd Bergmann <arnd@arndb.de>
      M:	Ilya Dryomov <idryomov@gmail.com>
      M:	"Yan, Zheng" <zyan@redhat.com>
      M:	Sage Weil <sage@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • format-security: move static strings to const · 06324664
      Committed by Kees Cook
      While examining output from trial builds with -Wformat-security enabled,
      many strings were found that should be defined as "const", or as a char
      array instead of char pointer.  This makes some static analysis easier,
      by producing fewer false positives.
      
      As these are all trivial changes, it seemed best to put them all in a
      single patch rather than chopping them up per maintainer.
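      
      The kind of change involved, in miniature (illustrative only, not taken
      from any of the touched files):
      
      #include <stdio.h>
      
      /* Before: a non-const char pointer used directly as a format string;
       * it could in principle be reassigned, so static analysis has to flag
       * calls like printf(banner). */
      static char *banner = "device ready\n";
      
      /* After: a const char array; the data can never change or be
       * reassigned, which lets the analysis drop the false positive. */
      static const char banner_const[] = "device ready\n";
      
      int main(void)
      {
              printf(banner);
              printf(banner_const);
              return 0;
      }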
      
      Link: http://lkml.kernel.org/r/20170405214711.GA5711@beast
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Acked-by: Jes Sorensen <jes@trained-monkey.org>	[runner.c]
      Cc: Tony Lindgren <tony@atomide.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: "Maciej W. Rozycki" <macro@linux-mips.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Cc: Daniel Vetter <daniel.vetter@intel.com>
      Cc: Jani Nikula <jani.nikula@linux.intel.com>
      Cc: Sean Paul <seanpaul@chromium.org>
      Cc: David Airlie <airlied@linux.ie>
      Cc: Yisen Zhuang <yisen.zhuang@huawei.com>
      Cc: Salil Mehta <salil.mehta@huawei.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Jiri Slaby <jslaby@suse.com>
      Cc: Patrice Chotard <patrice.chotard@st.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: James Hogan <james.hogan@imgtec.com>
      Cc: Paul Burton <paul.burton@imgtec.com>
      Cc: Matt Redfearn <matt.redfearn@imgtec.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
      Cc: Mugunthan V N <mugunthanvnm@ti.com>
      Cc: Felipe Balbi <felipe.balbi@linux.intel.com>
      Cc: Jarod Wilson <jarod@redhat.com>
      Cc: Florian Westphal <fw@strlen.de>
      Cc: Antonio Quartulli <a@unstable.cc>
      Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
      Cc: Kejian Yan <yankejian@huawei.com>
      Cc: Daode Huang <huangdaode@hisilicon.com>
      Cc: Qianqian Xie <xieqianqian@huawei.com>
      Cc: Philippe Reynes <tremyfr@gmail.com>
      Cc: Colin Ian King <colin.king@canonical.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Christian Gromm <christian.gromm@microchip.com>
      Cc: Andrey Shvetsov <andrey.shvetsov@k2l.de>
      Cc: Jason Litzinger <jlitzingerdev@gmail.com>
      Cc: WANG Cong <xiyou.wangcong@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, vmalloc: use __GFP_HIGHMEM implicitly · 19809c2d
      Committed by Michal Hocko
      __vmalloc* allows users to provide gfp flags for the underlying
      allocation.  This API is quite popular
      
        $ git grep "=[[:space:]]__vmalloc\|return[[:space:]]*__vmalloc" | wc -l
        77
      
      The only problem is that many people are not aware that they really want
      to give __GFP_HIGHMEM along with other flags, because there is really no
      reason to consume precious lowmem on CONFIG_HIGHMEM systems for pages
      which are mapped to the kernel vmalloc space.  About half of the users
      don't use this flag, though.  This signals that we have made the API
      unnecessarily complex.
      
      This patch simply uses __GFP_HIGHMEM implicitly when allocating pages to
      be mapped to the vmalloc space.  Current users which add __GFP_HIGHMEM
      are simplified and drop the flag.
      
      Link: http://lkml.kernel.org/r/20170307141020.29107-1-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Cristopher Lameter <cl@linux.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • net: use kvmalloc with __GFP_REPEAT rather than open coded variant · da6bc57a
      Committed by Michal Hocko
      fq_alloc_node, alloc_netdev_mqs and netif_alloc* open code kmalloc with
      vmalloc fallback.  Use the kvmalloc variant instead.  Keep the
      __GFP_REPEAT flag based on explanation from Eric:
      
       "At the time, tests on the hardware I had in my labs showed that
        vmalloc() could deliver pages spread all over the memory and that was
        a small penalty (once memory is fragmented enough, not at boot time)"
      
      The way the code is constructed means, however, that we prefer to go
      and hit the OOM killer before we fall back to the vmalloc for requests
      <=32kB (with 4kB pages) in the current code.  This is rather disruptive
      for something that can be achieved with the fallback.  On the other hand
      __GFP_REPEAT doesn't have any useful semantics for these requests.  So
      the effect of this patch is that requests which fit into 32kB will fall
      back to vmalloc more easily now.
      
      Link: http://lkml.kernel.org/r/20170306103327.2766-3-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: David Miller <davem@davemloft.net>
      Cc: Shakeel Butt <shakeelb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • treewide: use kv[mz]alloc* rather than opencoded variants · 752ade68
      Committed by Michal Hocko
      There are many code paths opencoding kvmalloc.  Let's use the helper
      instead.  The main difference from kvmalloc is that those users are
      usually not considering all the aspects of the memory allocator.  E.g.
      allocation requests <= 32kB (with 4kB pages) basically never fail
      and invoke the OOM killer to satisfy the allocation.  This sounds too
      disruptive for something that has a reasonable fallback - the vmalloc.
      On the other hand, those requests might fall back to vmalloc even when
      the memory allocator would have succeeded after several more
      reclaim/compaction attempts.  There is no guarantee something like that
      happens though.
      
      This patch converts many of those places to kv[mz]alloc* helpers because
      they are more conservative.
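      
      The before/after shape of such a conversion, roughly (kernel-context
      sketch rather than a specific call site from this patch; kvzalloc
      picked for illustration):
      
      #include <linux/mm.h>
      #include <linux/slab.h>
      #include <linux/vmalloc.h>
      
      /* Before: open-coded kmalloc-then-vmalloc fallback. */
      static void *table_alloc_open_coded(size_t size)
      {
              void *p = kmalloc(size, GFP_KERNEL | __GFP_NOWARN);
      
              return p ? p : vmalloc(size);
      }
      
      /* After: let the helper pick the right strategy; free with kvfree(). */
      static void *table_alloc(size_t size)
      {
              return kvzalloc(size, GFP_KERNEL);
      }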
      
      Link: http://lkml.kernel.org/r/20170306103327.2766-2-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> # Xen bits
      Acked-by: Kees Cook <keescook@chromium.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Andreas Dilger <andreas.dilger@intel.com> # Lustre
      Acked-by: Christian Borntraeger <borntraeger@de.ibm.com> # KVM/s390
      Acked-by: Dan Williams <dan.j.williams@intel.com> # nvdim
      Acked-by: David Sterba <dsterba@suse.com> # btrfs
      Acked-by: Ilya Dryomov <idryomov@gmail.com> # Ceph
      Acked-by: Tariq Toukan <tariqt@mellanox.com> # mlx4
      Acked-by: Leon Romanovsky <leonro@mellanox.com> # mlx5
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Anton Vorontsov <anton@enomsg.org>
      Cc: Colin Cross <ccross@android.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Ben Skeggs <bskeggs@redhat.com>
      Cc: Kent Overstreet <kent.overstreet@gmail.com>
      Cc: Santosh Raspatur <santosh@chelsio.com>
      Cc: Hariprasad S <hariprasad@chelsio.com>
      Cc: Yishai Hadas <yishaih@mellanox.com>
      Cc: Oleg Drokin <oleg.drokin@intel.com>
      Cc: "Yan, Zheng" <zyan@redhat.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: David Miller <davem@davemloft.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • net/ipv6/ila/ila_xlat.c: simplify a strange allocation pattern · 847f716f
      Committed by Michal Hocko
      alloc_ila_locks seems to have been copy-pasted from the alloc_bucket_locks
      allocation pattern, which is quite unusual.  The default allocation size
      is 320 * sizeof(spinlock_t), which is sub-page unless lockdep is enabled,
      in which case the performance benefit is really questionable and not
      worth the subtle code IMHO.  Also note that the context in which we call
      ila_init_net (modprobe or a task creating a net namespace) has to be
      properly configured.
      
      Let's just simplify the code and use kvmalloc helper which is a
      transparent way to use kmalloc with vmalloc fallback.
      
      Link: http://lkml.kernel.org/r/20170306103032.2540-5-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Tom Herbert <tom@herbertland.com>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: David Miller <davem@davemloft.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ipv6: reorder ip6_route_dev_notifier after ipv6_dev_notf · 242d3a49
      Committed by WANG Cong
      For each netns (except init_net), we initialize its null entry
      in 3 places:
      
      1) The template itself, as we use kmemdup()
      2) Code around dst_init_metrics() in ip6_route_net_init()
      3) ip6_route_dev_notify(), which is supposed to initialize it after
         loopback registers
      
      Unfortunately the last one still happens in the wrong order, because
      we expect to initialize net->ipv6.ip6_null_entry->rt6i_idev to
      net->loopback_dev's idev, and thus we have to do that after we add
      the idev to loopback. However, this notifier has priority == 0, the
      same as ipv6_dev_notf, and ipv6_dev_notf is registered after
      ip6_route_dev_notifier, so it is actually called after
      ip6_route_dev_notifier. This is similar to commit 2f460933
      ("ipv6: initialize route null entry in addrconf_init()") which
      fixes init_net.
      
      Fix it by picking a smaller priority for ip6_route_dev_notifier.
      Also, we have to release the refcnt accordingly when unregistering
      loopback_dev because device exit functions are called before subsys
      exit functions.
      Acked-by: David Ahern <dsahern@gmail.com>
      Tested-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • vti: check nla_put_* return value · 8ed508fd
      Committed by Hangbin Liu
      Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • vlan: Keep NETIF_F_HW_CSUM similar to other software devices · 8403debe
      Committed by Vlad Yasevich
      Vlan devices, like all other software devices, enable the
      NETIF_F_HW_CSUM feature.  However, unlike all the other
      software devices, vlans will switch to using the IP|IPV6_CSUM
      features if the underlying device uses them.  In these situations,
      checksum offload features on the vlan device can't be controlled
      via ethtool.
      
      This patch makes vlans keep HW_CSUM feature if the underlying
      device supports checksum offloading.  This makes vlan devices
      behave like other software devices, and restores control to the
      user.
      
      A side-effect is that some offload settings (typically UFO)
      may be enabled on the vlan device while being disabled on the HW.
      However, the GSO code will correctly process the packets. This
      actually results in slightly better raw throughput.
      Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
      Acked-by: Alexander Duyck <alexander.h.duyck@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: make congestion control optionally skip slow start after idle · 1b1fc3fd
      Committed by Wei Wang
      Congestion control modules that want full control over congestion
      control behavior do not want the cwnd modifications controlled by
      the sysctl_tcp_slow_start_after_idle code path.
      So skip those code paths for CC modules that use the cong_control()
      API.
      As an example, those cwnd effects are not desired for the BBR congestion
      control algorithm.
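      
      Conceptually the check is tiny: before applying the after-idle cwnd
      adjustment, bail out if the attached module implements cong_control().
      A hedged kernel-context sketch of that condition (not the literal patch):
      
      #include <net/tcp.h>
      
      /* Modules using the cong_control() hook own the cwnd entirely, so the
       * slow-start-after-idle adjustment is skipped for them. */
      static void maybe_cwnd_restart(struct sock *sk)
      {
              const struct tcp_congestion_ops *ca_ops = inet_csk(sk)->icsk_ca_ops;
      
              if (ca_ops->cong_control)
                      return;           /* e.g. BBR: leave cwnd alone */
              /* ... existing sysctl_tcp_slow_start_after_idle handling ... */
      }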
      
      Fixes: c0402760 ("tcp: new CC hook to set sending rate with rate_sample in any CA state")
      Signed-off-by: Wei Wang <weiwan@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv4: restore rt->fi for reference counting · 82486aa6
      Committed by WANG Cong
      An IPv4 dst can use fi->fib_metrics to store metrics, but fib_info
      itself is refcnt'ed, so without taking a refcnt, fi and
      fi->fib_metrics could be freed while the dst metrics still point to
      them. This triggers a use-after-free, as reported by Andrey twice.
      
      This patch reverts commit 2860583f ("ipv4: Kill rt->fi") to
      restore this reference counting. It is a quick fix for -net and
      -stable, for -net-next, as Eric suggested, we can consider doing
      reference counting for metrics itself instead of relying on fib_info.
      
      IPv6 is very different, it copies or steals the metrics from mx6_config
      in fib6_commit_metrics() so probably doesn't need a refcnt.
      
      Decnet has already done the refcnt'ing, see dn_fib_semantic_match().
      
      Fixes: 2860583f ("ipv4: Kill rt->fi")
      Reported-by: Andrey Konovalov <andreyknvl@google.com>
      Tested-by: Andrey Konovalov <andreyknvl@google.com>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  6. May 8, 2017 (3 commits)
  7. May 6, 2017 (1 commit)
    • tcp: randomize timestamps on syncookies · 84b114b9
      Committed by Eric Dumazet
      The whole point of randomization was to hide server uptime, but an
      attacker can simply start a syn flood and TCP generates 'old style'
      timestamps, directly revealing the server's jiffies value.
      
      Also, TSval values sent by the server to a particular remote address
      vary depending on whether syncookies are being sent or not, potentially
      triggering PAWS drops for innocent clients.
      
      Let's implement proper randomization, including for SYN cookies.
      
      Also we do not need to export sysctl_tcp_timestamps, since it is not
      used from a module.
      
      In v2, I added Florian feedback and contribution, adding tsoff to
      tcp_get_cookie_sock().
      
      v3 removed one unused variable in tcp_v4_connect() as Florian spotted.
      
      Fixes: 95a22cae ("tcp: randomize tcp timestamp offsets for each connection")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: Florian Westphal <fw@strlen.de>
      Tested-by: Florian Westphal <fw@strlen.de>
      Cc: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  8. May 5, 2017 (2 commits)
  9. May 4, 2017 (4 commits)
    • rtnetlink: NUL-terminate IFLA_PHYS_PORT_NAME string · 77ef033b
      Committed by Michal Schmidt
      IFLA_PHYS_PORT_NAME is a string attribute, so terminate it with \0.
      Otherwise libnl3 fails to validate netlink messages with this attribute.
      "ip -detail a" assumes too that the attribute is NUL-terminated when
      printing it. It often was, due to padding.
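      
      The difference is whether the terminating byte is part of the attribute
      payload; sketched below (kernel-context, not the exact rtnetlink hunk):
      
      #include <net/netlink.h>
      #include <linux/if_link.h>
      
      /* Emit the port name as a netlink string attribute. */
      static int put_port_name(struct sk_buff *skb, const char *name)
      {
              /* Before: payload sized without the trailing NUL, so userspace
               * could not rely on termination:
               *   nla_put(skb, IFLA_PHYS_PORT_NAME, strlen(name), name);
               */
      
              /* After: nla_put_string() includes the terminating NUL byte. */
              return nla_put_string(skb, IFLA_PHYS_PORT_NAME, name);
      }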
      
      I noticed this as libvirtd failing to start on a system with sfc driver
      after upgrading it to Linux 4.11, i.e. when sfc added support for
      phys_port_name.
      Signed-off-by: Michal Schmidt <mschmidt@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv4, ipv6: ensure raw socket message is big enough to hold an IP header · 86f4c90a
      Committed by Alexander Potapenko
      raw_send_hdrinc() and rawv6_send_hdrinc() expect that the buffer copied
      from the userspace contains the IPv4/IPv6 header, so if too few bytes are
      copied, parts of the header may remain uninitialized.
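      
      The fix boils down to a length sanity check before the header is parsed;
      roughly (kernel-context sketch, not the literal diff):
      
      #include <linux/ip.h>
      #include <linux/ipv6.h>
      #include <linux/socket.h>
      #include <linux/errno.h>
      
      /* Refuse HDRINCL-style sends that are too short to even contain the
       * IP header the stack is about to read. */
      static int check_hdrincl_len(int family, size_t len)
      {
              if (family == AF_INET && len < sizeof(struct iphdr))
                      return -EINVAL;
              if (family == AF_INET6 && len < sizeof(struct ipv6hdr))
                      return -EINVAL;
              return 0;
      }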
      
      This bug has been detected with KMSAN.
      
      For the record, the KMSAN report:
      
      ==================================================================
      BUG: KMSAN: use of unitialized memory in nf_ct_frag6_gather+0xf5a/0x44a0
      inter: 0
      CPU: 0 PID: 1036 Comm: probe Not tainted 4.11.0-rc5+ #2455
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:16
       dump_stack+0x143/0x1b0 lib/dump_stack.c:52
       kmsan_report+0x16b/0x1e0 mm/kmsan/kmsan.c:1078
       __kmsan_warning_32+0x5c/0xa0 mm/kmsan/kmsan_instr.c:510
       nf_ct_frag6_gather+0xf5a/0x44a0 net/ipv6/netfilter/nf_conntrack_reasm.c:577
       ipv6_defrag+0x1d9/0x280 net/ipv6/netfilter/nf_defrag_ipv6_hooks.c:68
       nf_hook_entry_hookfn ./include/linux/netfilter.h:102
       nf_hook_slow+0x13f/0x3c0 net/netfilter/core.c:310
       nf_hook ./include/linux/netfilter.h:212
       NF_HOOK ./include/linux/netfilter.h:255
       rawv6_send_hdrinc net/ipv6/raw.c:673
       rawv6_sendmsg+0x2fcb/0x41a0 net/ipv6/raw.c:919
       inet_sendmsg+0x3f8/0x6d0 net/ipv4/af_inet.c:762
       sock_sendmsg_nosec net/socket.c:633
       sock_sendmsg net/socket.c:643
       SYSC_sendto+0x6a5/0x7c0 net/socket.c:1696
       SyS_sendto+0xbc/0xe0 net/socket.c:1664
       do_syscall_64+0x72/0xa0 arch/x86/entry/common.c:285
       entry_SYSCALL64_slow_path+0x25/0x25 arch/x86/entry/entry_64.S:246
      RIP: 0033:0x436e03
      RSP: 002b:00007ffce48baf38 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
      RAX: ffffffffffffffda RBX: 00000000004002b0 RCX: 0000000000436e03
      RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000003
      RBP: 00007ffce48baf90 R08: 00007ffce48baf50 R09: 000000000000001c
      R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
      R13: 0000000000401790 R14: 0000000000401820 R15: 0000000000000000
      origin: 00000000d9400053
       save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59
       kmsan_save_stack_with_flags mm/kmsan/kmsan.c:362
       kmsan_internal_poison_shadow+0xb1/0x1a0 mm/kmsan/kmsan.c:257
       kmsan_poison_shadow+0x6d/0xc0 mm/kmsan/kmsan.c:270
       slab_alloc_node mm/slub.c:2735
       __kmalloc_node_track_caller+0x1f4/0x390 mm/slub.c:4341
       __kmalloc_reserve net/core/skbuff.c:138
       __alloc_skb+0x2cd/0x740 net/core/skbuff.c:231
       alloc_skb ./include/linux/skbuff.h:933
       alloc_skb_with_frags+0x209/0xbc0 net/core/skbuff.c:4678
       sock_alloc_send_pskb+0x9ff/0xe00 net/core/sock.c:1903
       sock_alloc_send_skb+0xe4/0x100 net/core/sock.c:1920
       rawv6_send_hdrinc net/ipv6/raw.c:638
       rawv6_sendmsg+0x2918/0x41a0 net/ipv6/raw.c:919
       inet_sendmsg+0x3f8/0x6d0 net/ipv4/af_inet.c:762
       sock_sendmsg_nosec net/socket.c:633
       sock_sendmsg net/socket.c:643
       SYSC_sendto+0x6a5/0x7c0 net/socket.c:1696
       SyS_sendto+0xbc/0xe0 net/socket.c:1664
       do_syscall_64+0x72/0xa0 arch/x86/entry/common.c:285
       return_from_SYSCALL_64+0x0/0x6a arch/x86/entry/entry_64.S:246
      ==================================================================
      
      , triggered by the following syscalls:
        socket(PF_INET6, SOCK_RAW, IPPROTO_RAW) = 3
        sendto(3, NULL, 0, 0, {sa_family=AF_INET6, sin6_port=htons(0), inet_pton(AF_INET6, "ff00::", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, 28) = -1 EPERM
      
      A similar report is triggered in net/ipv4/raw.c if we use a PF_INET socket
      instead of a PF_INET6 one.
      Signed-off-by: Alexander Potapenko <glider@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/sched: remove redundant null check on head · 985538ee
      Committed by Colin Ian King
      head is previously null checked and so the 2nd null check on head
      is redundant and therefore can be removed.
      
      Detected by CoverityScan, CID#1399505 ("Logically dead code")
      Signed-off-by: Colin Ian King <colin.king@canonical.com>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: do not inherit fastopen_req from parent · 8b485ce6
      Committed by Eric Dumazet
      Under fuzzer stress, it is possible that a child gets a non-NULL
      fastopen_req pointer from its parent at accept() time, when/if the
      parent morphs from a listener to an active session.
      
      We need to make sure this can not happen, by clearing the field after
      socket cloning.
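      
      The shape of the fix is a one-line reset when the child socket is
      created, so the parent's pointer is never shared; sketched below
      (kernel-context, not the literal hunk):
      
      #include <linux/tcp.h>
      
      /* After cloning the listener into a child socket, make sure the child
       * does not inherit the parent's fastopen_req pointer. */
      static void child_clear_fastopen(struct tcp_sock *newtp)
      {
              newtp->fastopen_req = NULL;
      }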
      
      BUG: Double free or freeing an invalid pointer
      Unexpected shadow byte: 0xFB
      CPU: 3 PID: 20933 Comm: syz-executor3 Not tainted 4.11.0+ #306
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs
      01/01/2011
      Call Trace:
       <IRQ>
       __dump_stack lib/dump_stack.c:16 [inline]
       dump_stack+0x292/0x395 lib/dump_stack.c:52
       kasan_object_err+0x1c/0x70 mm/kasan/report.c:164
       kasan_report_double_free+0x5c/0x70 mm/kasan/report.c:185
       kasan_slab_free+0x9d/0xc0 mm/kasan/kasan.c:580
       slab_free_hook mm/slub.c:1357 [inline]
       slab_free_freelist_hook mm/slub.c:1379 [inline]
       slab_free mm/slub.c:2961 [inline]
       kfree+0xe8/0x2b0 mm/slub.c:3882
       tcp_free_fastopen_req net/ipv4/tcp.c:1077 [inline]
       tcp_disconnect+0xc15/0x13e0 net/ipv4/tcp.c:2328
       inet_child_forget+0xb8/0x600 net/ipv4/inet_connection_sock.c:898
       inet_csk_reqsk_queue_add+0x1e7/0x250
      net/ipv4/inet_connection_sock.c:928
       tcp_get_cookie_sock+0x21a/0x510 net/ipv4/syncookies.c:217
       cookie_v4_check+0x1a19/0x28b0 net/ipv4/syncookies.c:384
       tcp_v4_cookie_check net/ipv4/tcp_ipv4.c:1384 [inline]
       tcp_v4_do_rcv+0x731/0x940 net/ipv4/tcp_ipv4.c:1421
       tcp_v4_rcv+0x2dc0/0x31c0 net/ipv4/tcp_ipv4.c:1715
       ip_local_deliver_finish+0x4cc/0xc20 net/ipv4/ip_input.c:216
       NF_HOOK include/linux/netfilter.h:257 [inline]
       ip_local_deliver+0x1ce/0x700 net/ipv4/ip_input.c:257
       dst_input include/net/dst.h:492 [inline]
       ip_rcv_finish+0xb1d/0x20b0 net/ipv4/ip_input.c:396
       NF_HOOK include/linux/netfilter.h:257 [inline]
       ip_rcv+0xd8c/0x19c0 net/ipv4/ip_input.c:487
       __netif_receive_skb_core+0x1ad1/0x3400 net/core/dev.c:4210
       __netif_receive_skb+0x2a/0x1a0 net/core/dev.c:4248
       process_backlog+0xe5/0x6c0 net/core/dev.c:4868
       napi_poll net/core/dev.c:5270 [inline]
       net_rx_action+0xe70/0x18e0 net/core/dev.c:5335
       __do_softirq+0x2fb/0xb99 kernel/softirq.c:284
       do_softirq_own_stack+0x1c/0x30 arch/x86/entry/entry_64.S:899
       </IRQ>
       do_softirq.part.17+0x1e8/0x230 kernel/softirq.c:328
       do_softirq kernel/softirq.c:176 [inline]
       __local_bh_enable_ip+0x1cf/0x1e0 kernel/softirq.c:181
       local_bh_enable include/linux/bottom_half.h:31 [inline]
       rcu_read_unlock_bh include/linux/rcupdate.h:931 [inline]
       ip_finish_output2+0x9ab/0x15e0 net/ipv4/ip_output.c:230
       ip_finish_output+0xa35/0xdf0 net/ipv4/ip_output.c:316
       NF_HOOK_COND include/linux/netfilter.h:246 [inline]
       ip_output+0x1f6/0x7b0 net/ipv4/ip_output.c:404
       dst_output include/net/dst.h:486 [inline]
       ip_local_out+0x95/0x160 net/ipv4/ip_output.c:124
       ip_queue_xmit+0x9a8/0x1a10 net/ipv4/ip_output.c:503
       tcp_transmit_skb+0x1ade/0x3470 net/ipv4/tcp_output.c:1057
       tcp_write_xmit+0x79e/0x55b0 net/ipv4/tcp_output.c:2265
       __tcp_push_pending_frames+0xfa/0x3a0 net/ipv4/tcp_output.c:2450
       tcp_push+0x4ee/0x780 net/ipv4/tcp.c:683
       tcp_sendmsg+0x128d/0x39b0 net/ipv4/tcp.c:1342
       inet_sendmsg+0x164/0x5b0 net/ipv4/af_inet.c:762
       sock_sendmsg_nosec net/socket.c:633 [inline]
       sock_sendmsg+0xca/0x110 net/socket.c:643
       SYSC_sendto+0x660/0x810 net/socket.c:1696
       SyS_sendto+0x40/0x50 net/socket.c:1664
       entry_SYSCALL_64_fastpath+0x1f/0xbe
      RIP: 0033:0x446059
      RSP: 002b:00007faa6761fb58 EFLAGS: 00000282 ORIG_RAX: 000000000000002c
      RAX: ffffffffffffffda RBX: 0000000000000017 RCX: 0000000000446059
      RDX: 0000000000000001 RSI: 0000000020ba3fcd RDI: 0000000000000017
      RBP: 00000000006e40a0 R08: 0000000020ba4ff0 R09: 0000000000000010
      R10: 0000000020000000 R11: 0000000000000282 R12: 0000000000708150
      R13: 0000000000000000 R14: 00007faa676209c0 R15: 00007faa67620700
      Object at ffff88003b5bbcb8, in cache kmalloc-64 size: 64
      Allocated:
      PID = 20909
       save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59
       save_stack+0x43/0xd0 mm/kasan/kasan.c:513
       set_track mm/kasan/kasan.c:525 [inline]
       kasan_kmalloc+0xad/0xe0 mm/kasan/kasan.c:616
       kmem_cache_alloc_trace+0x82/0x270 mm/slub.c:2745
       kmalloc include/linux/slab.h:490 [inline]
       kzalloc include/linux/slab.h:663 [inline]
       tcp_sendmsg_fastopen net/ipv4/tcp.c:1094 [inline]
       tcp_sendmsg+0x221a/0x39b0 net/ipv4/tcp.c:1139
       inet_sendmsg+0x164/0x5b0 net/ipv4/af_inet.c:762
       sock_sendmsg_nosec net/socket.c:633 [inline]
       sock_sendmsg+0xca/0x110 net/socket.c:643
       SYSC_sendto+0x660/0x810 net/socket.c:1696
       SyS_sendto+0x40/0x50 net/socket.c:1664
       entry_SYSCALL_64_fastpath+0x1f/0xbe
      Freed:
      PID = 20909
       save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59
       save_stack+0x43/0xd0 mm/kasan/kasan.c:513
       set_track mm/kasan/kasan.c:525 [inline]
       kasan_slab_free+0x73/0xc0 mm/kasan/kasan.c:589
       slab_free_hook mm/slub.c:1357 [inline]
       slab_free_freelist_hook mm/slub.c:1379 [inline]
       slab_free mm/slub.c:2961 [inline]
       kfree+0xe8/0x2b0 mm/slub.c:3882
       tcp_free_fastopen_req net/ipv4/tcp.c:1077 [inline]
       tcp_disconnect+0xc15/0x13e0 net/ipv4/tcp.c:2328
       __inet_stream_connect+0x20c/0xf90 net/ipv4/af_inet.c:593
       tcp_sendmsg_fastopen net/ipv4/tcp.c:1111 [inline]
       tcp_sendmsg+0x23a8/0x39b0 net/ipv4/tcp.c:1139
       inet_sendmsg+0x164/0x5b0 net/ipv4/af_inet.c:762
       sock_sendmsg_nosec net/socket.c:633 [inline]
       sock_sendmsg+0xca/0x110 net/socket.c:643
       SYSC_sendto+0x660/0x810 net/socket.c:1696
       SyS_sendto+0x40/0x50 net/socket.c:1664
       entry_SYSCALL_64_fastpath+0x1f/0xbe
      
      Fixes: e994b2f0 ("tcp: do not lock listener to process SYN packets")
      Fixes: 7db92362 ("tcp: fix potential double free issue for fastopen_req")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Andrey Konovalov <andreyknvl@google.com>
      Acked-by: Wei Wang <weiwan@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>