1. 19 November 2017, 2 commits
    • tcp: when scheduling TLP, time of RTO should account for current ACK · ed66dfaf
      Authored by Neal Cardwell
      Fix the TLP scheduling logic so that when scheduling a TLP probe, we
      ensure that the estimated time at which an RTO would fire accounts for
      the fact that ACKs indicating forward progress should push back RTO
      times.
      
      After the following fix:
      
      df92c839 ("tcp: fix xmit timer to only be reset if data ACKed/SACKed")
      
      we had an unintentional behavior change in the following kind of
      scenario: suppose the RTT variance has been very low recently. Then
      suppose we send out a flight of N packets and our RTT is 100ms:
      
      t=0: send a flight of N packets
      t=100ms: receive an ACK for N-1 packets
      
      The response before df92c839 was:
        -> schedule a TLP for now + RTO_interval
      
      The response after df92c839 is:
        -> schedule a TLP for t=0 + RTO_interval
      
      Since RTO_interval = srtt + RTT_variance, this means that we have
      scheduled a TLP timer at a point in the future that only accounts for
      RTT_variance. If the RTT_variance term is small, this means that the
      timer fires soon.
      
      Before df92c839 this would not happen, because in that code, when
      we received an ACK for a prefix of the flight, we did:
      
          1) Near the top of tcp_ack(), switch from TLP timer to RTO
         at write_queue_head->packet_tx_time + RTO_interval:
                  if (icsk->icsk_pending == ICSK_TIME_LOSS_PROBE)
                         tcp_rearm_rto(sk);
      
          2) In tcp_clean_rtx_queue(), update the RTO to now + RTO_interval:
                  if (flag & FLAG_ACKED) {
                         tcp_rearm_rto(sk);
      
          3) In tcp_ack() after tcp_fastretrans_alert() switch from RTO
             to TLP at now + RTO_interval:
                  if (icsk->icsk_pending == ICSK_TIME_RETRANS)
                         tcp_schedule_loss_probe(sk);
      
      In df92c839 we removed that 3-phase dance, and instead directly
      set the TLP timer once: we set the TLP timer in cases like this to
      write_queue_head->packet_tx_time + RTO_interval. So if the RTT
      variance is small, then this means that this is setting the TLP timer
      to fire quite soon. This means if the ACK for the tail of the flight
      takes longer than an RTT to arrive (often due to delayed ACKs), then
      the TLP timer fires too quickly.
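
      A sketch of the corrected idea in tcp_schedule_loss_probe() (hedged:
      tcp_rto_delta_us() is the helper added by df92c839, but the exact
      arithmetic below is an illustration of the description, not the
      verbatim patch; "advancing_rto" and "timeout_us" are assumed names):

          /* If the ACK being processed advances the RTO, measure a full
           * RTO_interval from now; otherwise keep measuring from the
           * transmit time of the head of the write queue.
           */
          s64 rto_delta_us;

          if (advancing_rto)
                  rto_delta_us = jiffies_to_usecs(inet_csk(sk)->icsk_rto);
          else
                  rto_delta_us = tcp_rto_delta_us(sk);
          timeout_us = min_t(s64, timeout_us, rto_delta_us);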
      
      Fixes: df92c839 ("tcp: fix xmit timer to only be reset if data ACKed/SACKed")
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • gre6: use log_ecn_error module parameter in ip6_tnl_rcv() · 981542c5
      Authored by Alexey Kodanev
      After commit 308edfdf ("gre6: Cleanup GREv6 receive path, call
      common GRE functions") the log_ecn_error module parameter is not used
      anywhere in the module, though it was previously used in ip6gre_rcv().
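
      A hedged sketch of the resulting call: the idea is to hand the module
      parameter to the common receive helper instead of a hard-coded value
      (treat the exact call site as an assumption):

          /* Pass log_ecn_error through to the shared GRE receive path. */
          return ip6_tnl_rcv(tunnel, skb, tpi, NULL, log_ecn_error);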
      
      Fixes: 308edfdf ("gre6: Cleanup GREv6 receive path, call common GRE functions")
      Signed-off-by: Alexey Kodanev <alexey.kodanev@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
2. 18 November 2017, 27 commits
3. 16 November 2017, 10 commits
    • net/sctp: Always set scope_id in sctp_inet6_skb_msgname · 7c8a61d9
      Authored by Eric W. Biederman
      Alexander Potapenko, while testing the kernel with KMSAN and syzkaller,
      discovered that in some configurations sctp would leak 4 bytes of
      kernel stack.
      
      Working with his reproducer I discovered that those 4 bytes that
      are leaked are the scope id of an ipv6 address returned by recvmsg.
      
      With a little code inspection and a shrewd guess I discovered that
      sctp_inet6_skb_msgname only initializes the scope_id field for link
      local ipv6 addresses to the interface index the link local address
      pertains to instead of initializing the scope_id field for all ipv6
      addresses.
      
      That is almost reasonable, as scope_ids are meaningful only for link
      local addresses.  Set the scope_id in all other cases to 0, which is
      not a valid interface index, to make it clear there is nothing useful
      in the scope_id field.
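
      A minimal sketch of the resulting assignment (helper names such as
      sctp_v6_skb_iif() are assumptions about the surrounding code):

          /* Always initialize sin6_scope_id: link-local addresses get the
           * interface index they pertain to, everything else gets 0.
           */
          if (ipv6_addr_type(&addr->v6.sin6_addr) & IPV6_ADDR_LINKLOCAL)
                  addr->v6.sin6_scope_id = sctp_v6_skb_iif(skb);
          else
                  addr->v6.sin6_scope_id = 0;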
      
      There should be no danger of breaking userspace as the stack leak
      guaranteed that previously meaningless random data was being returned.
      
      Fixes: 372f525b495c ("SCTP:  Resync with LKSCTP tree.")
      History-tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
      Reported-by: Alexander Potapenko <glider@google.com>
      Tested-by: Alexander Potapenko <glider@google.com>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mm: remove __GFP_COLD · 453f85d4
      Authored by Mel Gorman
      As the page free path makes no distinction between cache hot and cold
      pages, there is no real useful ordering of pages in the free list that
      allocation requests can take advantage of.  Judging from the users of
      __GFP_COLD, it is likely that a number of them are the result of copying
      other sites instead of actually measuring the impact.  Remove the
      __GFP_COLD parameter which simplifies a number of paths in the page
      allocator.
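
      For callers the conversion is mechanical; an illustrative
      (hypothetical) call site:

          /* Before: ask for a cache-cold page explicitly. */
          page = alloc_pages(GFP_KERNEL | __GFP_COLD, 0);

          /* After: the flag is gone; every allocation takes the same path. */
          page = alloc_pages(GFP_KERNEL, 0);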
      
      This is potentially controversial but bear in mind that the size of the
      per-cpu pagelists versus modern cache sizes means that the whole per-cpu
      list can often fit in the L3 cache.  Hence, there is only a potential
      benefit for microbenchmarks that alloc/free pages in a tight loop.  It's
      even worse when THP is taken into account which has little or no chance
      of getting a cache-hot page as the per-cpu list is bypassed and the
      zeroing of multiple pages will thrash the cache anyway.
      
      The truncate microbenchmarks are not shown as this patch affects the
      allocation path and not the free path.  A page fault microbenchmark was
      tested but it showed no significant difference which is not surprising
      given that the __GFP_COLD branches are a minuscule percentage of the
      fault path.
      
      Link: http://lkml.kernel.org/r/20171018075952.10627-9-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kmemcheck: remove annotations · 49502766
      Authored by Levin, Alexander (Sasha Levin)
      Patch series "kmemcheck: kill kmemcheck", v2.
      
      As discussed at LSF/MM, kill kmemcheck.
      
      KASan is a replacement that is able to work without the limitations of
      kmemcheck (single CPU, slow).  KASan is already upstream.
      
      We are also not aware of any users of kmemcheck (or users who don't
      consider KASan as a suitable replacement).
      
      The only objection was that since KASAN wasn't supported by all GCC
      versions provided by distros at that time we should hold off for 2
      years, and try again.
      
      Now that 2 years have passed, and all distros provide gcc that supports
      KASAN, kill kmemcheck again for the very same reasons.
      
      This patch (of 4):
      
      Remove kmemcheck annotations, and calls to kmemcheck from the kernel.
      
      [alexander.levin@verizon.com: correctly remove kmemcheck call from dma_map_sg_attrs]
        Link: http://lkml.kernel.org/r/20171012192151.26531-1-alexander.levin@verizon.com
      Link: http://lkml.kernel.org/r/20171007030159.22241-2-alexander.levin@verizon.com
      Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Tim Hansen <devtimhansen@gmail.com>
      Cc: Vegard Nossum <vegardno@ifi.uio.no>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • net/rds/ib_fmr.c: use kmalloc_array_node() · c413af87
      Authored by Johannes Thumshirn
      Now that we have a NUMA-aware version of kmalloc_array() we can use it
      instead of kmalloc_node(), which offered no overflow check on the size
      calculation.
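
      A hedged before/after sketch of the conversion (variable names are
      illustrative, not the exact fields in ib_fmr.c):

          /* Before: open-coded size math, unchecked for overflow. */
          sg = kmalloc_node(nents * sizeof(*sg), GFP_KERNEL, node);

          /* After: kmalloc_array_node() checks the multiplication. */
          sg = kmalloc_array_node(nents, sizeof(*sg), GFP_KERNEL, node);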
      
      Link: http://lkml.kernel.org/r/20170927082038.3782-7-jthumshirn@suse.de
      Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: Christoph Lameter <cl@linux.com>
      Cc: Santosh Shilimkar <santosh.shilimkar@oracle.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Damien Le Moal <damien.lemoal@wdc.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Doug Ledford <dledford@redhat.com>
      Cc: Hal Rosenstock <hal.rosenstock@gmail.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mike Marciniszyn <infinipath@intel.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Sean Hefty <sean.hefty@intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • tipc: enforce valid ratio between skb truesize and contents · d618d09a
      Authored by Jon Maloy
      The socket level flow control is based on the assumption that incoming
      buffers meet the condition (skb->truesize / roundup(skb->len) <= 4),
      where the latter value is rounded off upwards to the nearest 1k number.
      This does empirically hold true for the device drivers we know, but we
      cannot trust that it will always be so, e.g., in a system with jumbo
      frames and very small packets.
      
      We now introduce a check for this condition at packet arrival, and if
      we find it to be false, we copy the packet to a new, smaller buffer,
      where the condition will be true. We expect this to affect only a small
      fraction of all incoming packets, if at all.
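
      A sketch of the arrival-time check (buf_roundup_len() is the rounding
      helper implied above; the exact flow is an assumption, not the
      verbatim patch):

          /* If truesize exceeds 4x the rounded-up data length, copy the
           * data into a tighter buffer so the flow-control assumption holds.
           */
          if (unlikely(skb->truesize / buf_roundup_len(skb) > 4)) {
                  struct sk_buff *nskb = skb_copy(skb, GFP_ATOMIC);

                  if (!nskb)
                          return false;   /* drop on allocation failure */
                  kfree_skb(skb);
                  skb = nskb;
          }
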
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • netfilter: add ifdef around ctnetlink_proto_size · 8252fcea
      Authored by Arnd Bergmann
      This function is no longer marked 'inline', so we now get a warning
      when it is unused:
      
      net/netfilter/nf_conntrack_netlink.c:536:15: error: 'ctnetlink_proto_size' defined but not used [-Werror=unused-function]
      
      We could mark it inline again, mark it __maybe_unused, or add an #ifdef
      around the definition. I'm picking the third approach here since that
      seems to be what the rest of the file has.
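
      The shape of that third approach, hedged: the precise Kconfig symbols
      guarding the callers are an assumption based on the file's existing
      structure, and the body is elided:

          /* Compile the helper only when something in this file uses it. */
          #if defined(CONFIG_NETFILTER_NETLINK_GLUE_CT) || \
              defined(CONFIG_NF_CONNTRACK_EVENTS)
          static size_t ctnetlink_proto_size(const struct nf_conn *ct)
          {
                  /* ... existing body unchanged ... */
          }
          #endif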
      
      Fixes: 5caaed15 ("netfilter: conntrack: don't cache nlattr_tuple_size result in nla_size")
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • genetlink: fix genlmsg_nlhdr() · 0a833c29
      Authored by Michal Kubecek
      According to the description, the first argument of genlmsg_nlhdr()
      points to what genlmsg_put() returns, i.e. the beginning of the user
      header. Therefore we should only subtract the size of the genetlink
      header and the netlink message header, not the user header.

      This also means we don't need to pass the pointer to the genetlink
      family, and the same is true for genl_dump_check_consistent(), which
      is the only caller of genlmsg_nlhdr(). (Note that at the moment, these
      functions are only used for families which do not have a user header,
      so they are not affected.)
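
      The corrected arithmetic, roughly (a sketch that mirrors the
      description above rather than quoting the patch):

          /* user_hdr is what genlmsg_put() returned: step back over the
           * genetlink and netlink headers only, not the user header.
           */
          static inline struct nlmsghdr *genlmsg_nlhdr(void *user_hdr)
          {
                  return (struct nlmsghdr *)((char *)user_hdr -
                                             GENL_HDRLEN -
                                             NLMSG_HDRLEN);
          }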
      
      Fixes: 670dc283 ("netlink: advertise incomplete dumps")
      Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
      Reviewed-by: Johannes Berg <johannes@sipsolutions.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • sctp: check stream reset info len before making reconf chunk · 423852f8
      Authored by Xin Long
      Now when resetting streams, if both the in and out flags are set, the
      info len can reach:
        sizeof(struct sctp_strreset_outreq) + SCTP_MAX_STREAM(65535) +
        sizeof(struct sctp_strreset_inreq)  + SCTP_MAX_STREAM(65535)
      Even without duplicated stream numbers, this value is far greater than
      the chunk's max size.
      
      _sctp_make_chunk doesn't do any check for this, so the skb it
      allocates can be huge; syzbot even reported a crash due to this.
      
      This patch checks the stream reset info len before making the reconf
      chunk and returns -EINVAL if the len exceeds the chunk's capacity.
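
      A hedged sketch of the bound check in sctp_send_reset_streams (the
      SCTP_MAX_CHUNK_LEN name and the exact accounting are assumptions):

          /* Compute the info length the request would need and bail out
           * before trying to build a chunk that cannot fit it.
           */
          size_t outlen = 0, inlen = 0;

          if (out)
                  outlen = sizeof(struct sctp_strreset_outreq) +
                           str_nums * sizeof(__u16);
          if (in)
                  inlen = sizeof(struct sctp_strreset_inreq) +
                          str_nums * sizeof(__u16);
          if (outlen + inlen > SCTP_MAX_CHUNK_LEN - sizeof(struct sctp_chunkhdr))
                  return -EINVAL;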
      
      Thanks Marcelo and Neil for making this clear.
      
      v1->v2:
        - move the check into sctp_send_reset_streams instead.
      
      Fixes: cc16f00f ("sctp: add support for generating stream reconf ssn reset request chunk")
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Acked-by: Neil Horman <nhorman@tuxdriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • sctp: use the right sk after waking up from wait_buf sleep · cea0cc80
      Authored by Xin Long
      Commit dfcb9f4f ("sctp: deny peeloff operation on asocs with threads
      sleeping on it") fixed the race between peeloff and wait sndbuf by
      checking waitqueue_active(&asoc->wait) in sctp_do_peeloff().
      
      But it actually doesn't work: even if waitqueue_active returns false,
      the waiting sndbuf thread may still not hold the sk lock yet. After
      the asoc is peeled off, sk is no longer asoc->base.sk, so holding the
      old sk lock cannot make the asoc safe to access.
      
      This patch fixes it by holding the new sk lock if sk is not
      asoc->base.sk, and by also setting the sk in sctp_sendmsg to the
      new sk.
      
      With this fix there is no more race between peeloff and waitbuf, so
      the 'waitqueue_active' check in sctp_do_peeloff can be removed.
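
      A sketch of the relock step after the wait returns (hedged; the real
      patch threads the socket pointer through the callers as well):

          /* The asoc may have been peeled off to a new socket while we
           * slept; take the lock of the socket it belongs to now.
           */
          lock_sock(sk);
          if (sk != asoc->base.sk) {
                  release_sock(sk);
                  sk = asoc->base.sk;
                  lock_sock(sk);
          }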
      
      Thanks Marcelo and Neil for making this clear.
      
      v1->v2:
        fix it by changing to lock the new sock instead of adding a flag in asoc.
      Suggested-by: Neil Horman <nhorman@tuxdriver.com>
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Acked-by: Neil Horman <nhorman@tuxdriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • sctp: do not free asoc when it is already dead in sctp_sendmsg · ca3af4dd
      Authored by Xin Long
      Now in sctp_sendmsg, sctp_wait_for_sndbuf can schedule out without
      holding the lock on sock sk. It means the current asoc can be freed
      elsewhere, for example when receiving an abort packet.
      
      If the asoc was just created in sctp_sendmsg and sctp_wait_for_sndbuf
      returns an error, the asoc will be freed again because new_asoc is
      not nil. A use-after-free issue would be triggered by this.
      
      This patch fixes it by setting new_asoc to nil if the asoc is already
      dead when the cpu schedules back, so that it will not be freed again
      in sctp_sendmsg.
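
      A sketch of the guard after the wait returns (hedged; the -ESRCH
      convention for a dead asoc is an assumption about the error plumbing):

          /* If the asoc died while we slept, it was already freed; forget
           * new_asoc so the error path does not free it a second time.
           */
          if (err) {
                  if (err == -ESRCH)      /* asoc was dead on wakeup */
                          new_asoc = NULL;
                  goto out_free;
          }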
      
      v1->v2:
        set new_asoc as nil in sctp_sendmsg instead of sctp_wait_for_sndbuf.
      Suggested-by: Neil Horman <nhorman@tuxdriver.com>
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Acked-by: Neil Horman <nhorman@tuxdriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
4. 15 November 2017, 1 commit
    • tcp: highest_sack fix · 50895b9d
      Authored by Eric Dumazet
      syzbot easily found a regression added in our latest patches [1]
      
      No longer set tp->highest_sack to the head of the send queue, since
      this is not logical and is error-prone.
      
      Only sack processing should maintain the pointer to an skb in the rtx queue.
      
      We might in the future remember only the sequence instead of a pointer
      to the skb, since the rb-tree should allow a fast lookup.
      
      [1]
      BUG: KASAN: use-after-free in tcp_highest_sack_seq include/net/tcp.h:1706 [inline]
      BUG: KASAN: use-after-free in tcp_ack+0x42bb/0x4fd0 net/ipv4/tcp_input.c:3537
      Read of size 4 at addr ffff8801c154faa8 by task syz-executor4/12860
      
      CPU: 0 PID: 12860 Comm: syz-executor4 Not tainted 4.14.0-next-20171113+ #41
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:17 [inline]
       dump_stack+0x194/0x257 lib/dump_stack.c:53
       print_address_description+0x73/0x250 mm/kasan/report.c:252
       kasan_report_error mm/kasan/report.c:351 [inline]
       kasan_report+0x25b/0x340 mm/kasan/report.c:409
       __asan_report_load4_noabort+0x14/0x20 mm/kasan/report.c:429
       tcp_highest_sack_seq include/net/tcp.h:1706 [inline]
       tcp_ack+0x42bb/0x4fd0 net/ipv4/tcp_input.c:3537
       tcp_rcv_established+0x672/0x18a0 net/ipv4/tcp_input.c:5439
       tcp_v4_do_rcv+0x2ab/0x7d0 net/ipv4/tcp_ipv4.c:1468
       sk_backlog_rcv include/net/sock.h:909 [inline]
       __release_sock+0x124/0x360 net/core/sock.c:2264
       release_sock+0xa4/0x2a0 net/core/sock.c:2778
       tcp_sendmsg+0x3a/0x50 net/ipv4/tcp.c:1462
       inet_sendmsg+0x11f/0x5e0 net/ipv4/af_inet.c:763
       sock_sendmsg_nosec net/socket.c:632 [inline]
       sock_sendmsg+0xca/0x110 net/socket.c:642
       ___sys_sendmsg+0x75b/0x8a0 net/socket.c:2048
       __sys_sendmsg+0xe5/0x210 net/socket.c:2082
       SYSC_sendmsg net/socket.c:2093 [inline]
       SyS_sendmsg+0x2d/0x50 net/socket.c:2089
       entry_SYSCALL_64_fastpath+0x1f/0x96
      RIP: 0033:0x452879
      RSP: 002b:00007fc9761bfbe8 EFLAGS: 00000212 ORIG_RAX: 000000000000002e
      RAX: ffffffffffffffda RBX: 0000000000758020 RCX: 0000000000452879
      RDX: 0000000000000000 RSI: 0000000020917fc8 RDI: 0000000000000015
      RBP: 0000000000000086 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000212 R12: 00000000006ee3a0
      R13: 00000000ffffffff R14: 00007fc9761c06d4 R15: 0000000000000000
      
      Allocated by task 12860:
       save_stack+0x43/0xd0 mm/kasan/kasan.c:447
       set_track mm/kasan/kasan.c:459 [inline]
       kasan_kmalloc+0xad/0xe0 mm/kasan/kasan.c:551
       kasan_slab_alloc+0x12/0x20 mm/kasan/kasan.c:489
       kmem_cache_alloc_node+0x144/0x760 mm/slab.c:3638
       __alloc_skb+0xf1/0x780 net/core/skbuff.c:193
       alloc_skb_fclone include/linux/skbuff.h:1023 [inline]
       sk_stream_alloc_skb+0x11d/0x900 net/ipv4/tcp.c:870
       tcp_sendmsg_locked+0x1341/0x3b80 net/ipv4/tcp.c:1299
       tcp_sendmsg+0x2f/0x50 net/ipv4/tcp.c:1461
       inet_sendmsg+0x11f/0x5e0 net/ipv4/af_inet.c:763
       sock_sendmsg_nosec net/socket.c:632 [inline]
       sock_sendmsg+0xca/0x110 net/socket.c:642
       SYSC_sendto+0x358/0x5a0 net/socket.c:1749
       SyS_sendto+0x40/0x50 net/socket.c:1717
       entry_SYSCALL_64_fastpath+0x1f/0x96
      
      Freed by task 12860:
       save_stack+0x43/0xd0 mm/kasan/kasan.c:447
       set_track mm/kasan/kasan.c:459 [inline]
       kasan_slab_free+0x71/0xc0 mm/kasan/kasan.c:524
       __cache_free mm/slab.c:3492 [inline]
       kmem_cache_free+0x77/0x280 mm/slab.c:3750
       kfree_skbmem+0xdd/0x1d0 net/core/skbuff.c:603
       __kfree_skb+0x1d/0x20 net/core/skbuff.c:642
       sk_wmem_free_skb include/net/sock.h:1419 [inline]
       tcp_rtx_queue_unlink_and_free include/net/tcp.h:1682 [inline]
       tcp_clean_rtx_queue net/ipv4/tcp_input.c:3111 [inline]
       tcp_ack+0x1b17/0x4fd0 net/ipv4/tcp_input.c:3593
       tcp_rcv_established+0x672/0x18a0 net/ipv4/tcp_input.c:5439
       tcp_v4_do_rcv+0x2ab/0x7d0 net/ipv4/tcp_ipv4.c:1468
       sk_backlog_rcv include/net/sock.h:909 [inline]
       __release_sock+0x124/0x360 net/core/sock.c:2264
       release_sock+0xa4/0x2a0 net/core/sock.c:2778
       tcp_sendmsg+0x3a/0x50 net/ipv4/tcp.c:1462
       inet_sendmsg+0x11f/0x5e0 net/ipv4/af_inet.c:763
       sock_sendmsg_nosec net/socket.c:632 [inline]
       sock_sendmsg+0xca/0x110 net/socket.c:642
       ___sys_sendmsg+0x75b/0x8a0 net/socket.c:2048
       __sys_sendmsg+0xe5/0x210 net/socket.c:2082
       SYSC_sendmsg net/socket.c:2093 [inline]
       SyS_sendmsg+0x2d/0x50 net/socket.c:2089
       entry_SYSCALL_64_fastpath+0x1f/0x96
      
      The buggy address belongs to the object at ffff8801c154fa80
       which belongs to the cache skbuff_fclone_cache of size 456
      The buggy address is located 40 bytes inside of
       456-byte region [ffff8801c154fa80, ffff8801c154fc48)
      The buggy address belongs to the page:
      page:ffffea00070553c0 count:1 mapcount:0 mapping:ffff8801c154f080 index:0x0
      flags: 0x2fffc0000000100(slab)
      raw: 02fffc0000000100 ffff8801c154f080 0000000000000000 0000000100000006
      raw: ffffea00070a5a20 ffffea0006a18360 ffff8801d9ca0500 0000000000000000
      page dumped because: kasan: bad access detected
      
      Fixes: 737ff314 ("tcp: use sequence distance to detect reordering")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>