1. 20 November 2017, 1 commit
2. 19 November 2017, 2 commits
• tcp: when scheduling TLP, time of RTO should account for current ACK · ed66dfaf
  Neal Cardwell authored
      Fix the TLP scheduling logic so that when scheduling a TLP probe, we
      ensure that the estimated time at which an RTO would fire accounts for
      the fact that ACKs indicating forward progress should push back RTO
      times.
      
      After the following fix:
      
      df92c839 ("tcp: fix xmit timer to only be reset if data ACKed/SACKed")
      
      we had an unintentional behavior change in the following kind of
      scenario: suppose the RTT variance has been very low recently. Then
      suppose we send out a flight of N packets and our RTT is 100ms:
      
      t=0: send a flight of N packets
      t=100ms: receive an ACK for N-1 packets
      
The response before df92c839 was:
        -> schedule a TLP for now + RTO_interval
      
      The response after df92c839 is:
        -> schedule a TLP for t=0 + RTO_interval
      
      Since RTO_interval = srtt + RTT_variance, this means that we have
      scheduled a TLP timer at a point in the future that only accounts for
      RTT_variance. If the RTT_variance term is small, this means that the
      timer fires soon.
      
      Before df92c839 this would not happen, because in that code, when
we receive an ACK for a prefix of the flight, we did:
      
          1) Near the top of tcp_ack(), switch from TLP timer to RTO
   at write_queue_head->packet_tx_time + RTO_interval:
                  if (icsk->icsk_pending == ICSK_TIME_LOSS_PROBE)
                         tcp_rearm_rto(sk);
      
          2) In tcp_clean_rtx_queue(), update the RTO to now + RTO_interval:
                  if (flag & FLAG_ACKED) {
                         tcp_rearm_rto(sk);
      
          3) In tcp_ack() after tcp_fastretrans_alert() switch from RTO
             to TLP at now + RTO_interval:
                  if (icsk->icsk_pending == ICSK_TIME_RETRANS)
                         tcp_schedule_loss_probe(sk);
      
      In df92c839 we removed that 3-phase dance, and instead directly
      set the TLP timer once: we set the TLP timer in cases like this to
      write_queue_head->packet_tx_time + RTO_interval. So if the RTT
      variance is small, then this means that this is setting the TLP timer
      to fire quite soon. This means if the ACK for the tail of the flight
      takes longer than an RTT to arrive (often due to delayed ACKs), then
      the TLP timer fires too quickly.
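
A minimal sketch of the timing idea, in kernel-style C with hypothetical
local names (not the literal diff):

    /* Pick the baseline from which the pending RTO is measured when
     * (re)arming the TLP timer.  An ACK that acknowledged new data pushes
     * the baseline to "now"; otherwise it stays at the transmit time of
     * the oldest unacked skb.
     */
    u64 rto_base_us = ack_advanced_snd_una ? now_us : head_skb_tx_time_us;
    u64 rto_fire_us = rto_base_us + rto_interval_us; /* srtt + RTT_variance */

    /* The TLP probe must then be scheduled no later than rto_fire_us. */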
      
      Fixes: df92c839 ("tcp: fix xmit timer to only be reset if data ACKed/SACKed")
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
• gre6: use log_ecn_error module parameter in ip6_tnl_rcv() · 981542c5
  Alexey Kodanev authored
After commit 308edfdf ("gre6: Cleanup GREv6 receive path, call
common GRE functions"), the log_ecn_error module parameter is not used
anywhere in the module, though previously it was used in ip6gre_rcv().
      
      Fixes: 308edfdf ("gre6: Cleanup GREv6 receive path, call common GRE functions")
Signed-off-by: Alexey Kodanev <alexey.kodanev@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3. 18 November 2017, 27 commits
4. 16 November 2017, 10 commits
• net/sctp: Always set scope_id in sctp_inet6_skb_msgname · 7c8a61d9
  Eric W. Biederman authored
Alexander Potapenko, while testing the kernel with KMSAN and syzkaller,
discovered that in some configurations sctp would leak 4 bytes of
kernel stack.

Working with his reproducer I discovered that the 4 leaked bytes are
the scope_id of an ipv6 address returned by recvmsg.
      
With a little code inspection and a shrewd guess I discovered that
sctp_inet6_skb_msgname initializes the scope_id field only for
link-local ipv6 addresses, setting it to the interface index the
link-local address pertains to, instead of initializing the scope_id
field for all ipv6 addresses.
      
That is almost reasonable, as scope_ids are meaningful only for
link-local addresses.  Set the scope_id in all other cases to 0, which
is not a valid interface index, to make it clear there is nothing
useful in the scope_id field.
      
      There should be no danger of breaking userspace as the stack leak
      guaranteed that previously meaningless random data was being returned.
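
A minimal sketch of the resulting initialization, assuming the usual
struct sockaddr_in6 layout (illustrative, not the exact patch):

    struct sockaddr_in6 *sin6 = (struct sockaddr_in6 *)msgname;

    if (ipv6_addr_type(&sin6->sin6_addr) & IPV6_ADDR_LINKLOCAL)
            sin6->sin6_scope_id = ifindex;  /* only meaningful here */
    else
            sin6->sin6_scope_id = 0;        /* 0 is never a valid ifindex */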
      
      Fixes: 372f525b495c ("SCTP:  Resync with LKSCTP tree.")
History-tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
Reported-by: Alexander Potapenko <glider@google.com>
Tested-by: Alexander Potapenko <glider@google.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
• mm: remove __GFP_COLD · 453f85d4
  Mel Gorman authored
      As the page free path makes no distinction between cache hot and cold
      pages, there is no real useful ordering of pages in the free list that
allocation requests can take advantage of.  Judging from the users of
      __GFP_COLD, it is likely that a number of them are the result of copying
      other sites instead of actually measuring the impact.  Remove the
      __GFP_COLD parameter which simplifies a number of paths in the page
      allocator.
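
At a call site the removal amounts to something like this (hypothetical
caller, shown only to illustrate the interface change):

    /* before: the caller hinted that a cache-cold page was preferred */
    page = alloc_pages(GFP_KERNEL | __GFP_COLD, 0);

    /* after: the hint is gone; allocation is otherwise unchanged */
    page = alloc_pages(GFP_KERNEL, 0);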
      
      This is potentially controversial but bear in mind that the size of the
      per-cpu pagelists versus modern cache sizes means that the whole per-cpu
      list can often fit in the L3 cache.  Hence, there is only a potential
      benefit for microbenchmarks that alloc/free pages in a tight loop.  It's
      even worse when THP is taken into account which has little or no chance
      of getting a cache-hot page as the per-cpu list is bypassed and the
      zeroing of multiple pages will thrash the cache anyway.
      
      The truncate microbenchmarks are not shown as this patch affects the
      allocation path and not the free path.  A page fault microbenchmark was
tested but it showed no significant difference, which is not surprising
given that the __GFP_COLD branches are a minuscule percentage of the
      fault path.
      
Link: http://lkml.kernel.org/r/20171018075952.10627-9-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• kmemcheck: remove annotations · 49502766
  Levin, Alexander (Sasha Levin) authored
      Patch series "kmemcheck: kill kmemcheck", v2.
      
      As discussed at LSF/MM, kill kmemcheck.
      
      KASan is a replacement that is able to work without the limitation of
      kmemcheck (single CPU, slow).  KASan is already upstream.
      
      We are also not aware of any users of kmemcheck (or users who don't
      consider KASan as a suitable replacement).
      
      The only objection was that since KASAN wasn't supported by all GCC
      versions provided by distros at that time we should hold off for 2
      years, and try again.
      
      Now that 2 years have passed, and all distros provide gcc that supports
      KASAN, kill kmemcheck again for the very same reasons.
      
      This patch (of 4):
      
      Remove kmemcheck annotations, and calls to kmemcheck from the kernel.
      
      [alexander.levin@verizon.com: correctly remove kmemcheck call from dma_map_sg_attrs]
        Link: http://lkml.kernel.org/r/20171012192151.26531-1-alexander.levin@verizon.com
Link: http://lkml.kernel.org/r/20171007030159.22241-2-alexander.levin@verizon.com
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Tim Hansen <devtimhansen@gmail.com>
      Cc: Vegard Nossum <vegardno@ifi.uio.no>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• net/rds/ib_fmr.c: use kmalloc_array_node() · c413af87
  Johannes Thumshirn authored
      Now that we have a NUMA-aware version of kmalloc_array() we can use it
      instead of kmalloc_node() without an overflow check in the size
      calculation.
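
The conversion is roughly of this shape (illustrative variable names, not
the exact hunk):

    /* before: open-coded multiplication with no overflow check */
    ptr = kmalloc_node(nr * sizeof(struct entry), GFP_KERNEL, node);

    /* after: kmalloc_array_node() checks nr * size for overflow itself */
    ptr = kmalloc_array_node(nr, sizeof(struct entry), GFP_KERNEL, node);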
      
Link: http://lkml.kernel.org/r/20170927082038.3782-7-jthumshirn@suse.de
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Christoph Lameter <cl@linux.com>
      Cc: Santosh Shilimkar <santosh.shilimkar@oracle.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Damien Le Moal <damien.lemoal@wdc.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Doug Ledford <dledford@redhat.com>
      Cc: Hal Rosenstock <hal.rosenstock@gmail.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mike Marciniszyn <infinipath@intel.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Sean Hefty <sean.hefty@intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• tipc: enforce valid ratio between skb truesize and contents · d618d09a
  Jon Maloy authored
      The socket level flow control is based on the assumption that incoming
      buffers meet the condition (skb->truesize / roundup(skb->len) <= 4),
      where the latter value is rounded off upwards to the nearest 1k number.
      This does empirically hold true for the device drivers we know, but we
      cannot trust that it will always be so, e.g., in a system with jumbo
      frames and very small packets.
      
      We now introduce a check for this condition at packet arrival, and if
      we find it to be false, we copy the packet to a new, smaller buffer,
      where the condition will be true. We expect this to affect only a small
      fraction of all incoming packets, if at all.
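
A sketch of the kind of check being described, with made-up helper and
constant names (the real check lives in the TIPC receive path):

    /* Accept the buffer only if truesize is within 4x of its payload
     * rounded up to the next 1k; otherwise the caller copies the data
     * into a tighter skb.
     */
    static bool tipc_buf_ratio_ok(struct sk_buff *skb)
    {
            unsigned int limit = round_up(max(skb->len, 1U), SZ_1K) * 4;

            return skb->truesize <= limit;
    }
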
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
• netfilter: add ifdef around ctnetlink_proto_size · 8252fcea
  Arnd Bergmann authored
      This function is no longer marked 'inline', so we now get a warning
      when it is unused:
      
      net/netfilter/nf_conntrack_netlink.c:536:15: error: 'ctnetlink_proto_size' defined but not used [-Werror=unused-function]
      
      We could mark it inline again, mark it __maybe_unused, or add an #ifdef
      around the definition. I'm picking the third approach here since that
      seems to be what the rest of the file has.
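
The shape of the fix is roughly the following; the guard symbols shown are
the ones assumed from the file's other netlink-size helpers, not quoted
from the patch:

    #if defined(CONFIG_NETFILTER_NETLINK_GLUE_CT) || \
        defined(CONFIG_NF_CONNTRACK_EVENTS)
    static size_t ctnetlink_proto_size(const struct nf_conn *ct)
    {
            /* ... body unchanged ... */
    }
    #endif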
      
      Fixes: 5caaed15 ("netfilter: conntrack: don't cache nlattr_tuple_size result in nla_size")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
• genetlink: fix genlmsg_nlhdr() · 0a833c29
  Michal Kubecek authored
According to the description, the first argument of genlmsg_nlhdr() points
to what genlmsg_put() returns, i.e. the beginning of the user header.
Therefore we should only subtract the size of the genetlink header and the
netlink message header, not the user header.
      
This also means we don't need to pass the pointer to the genetlink family,
and the same is true for genl_dump_check_consistent(), which is the only
caller of genlmsg_nlhdr(). (Note that at the moment, these functions are
only used by families which do not have a user header, so they are not
affected.)
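
In pointer terms the helper then reduces to roughly this (sketch):

    /* user_hdr is what genlmsg_put() returned, i.e. the start of the user
     * header, so only the genetlink and netlink headers precede it.
     */
    static inline struct nlmsghdr *genlmsg_nlhdr(void *user_hdr)
    {
            return (struct nlmsghdr *)((char *)user_hdr -
                                       GENL_HDRLEN - NLMSG_HDRLEN);
    }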
      
      Fixes: 670dc283 ("netlink: advertise incomplete dumps")
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Reviewed-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
• sctp: check stream reset info len before making reconf chunk · 423852f8
  Xin Long authored
Now when resetting streams, if both in and out flags are set, the info
length can reach:
        sizeof(struct sctp_strreset_outreq) + SCTP_MAX_STREAM(65535) +
        sizeof(struct sctp_strreset_inreq)  + SCTP_MAX_STREAM(65535)
Even without duplicate stream numbers, this value is far greater than
the chunk's maximum size.
      
_sctp_make_chunk doesn't do any check for this, so the skb it allocates
can be huge; syzbot even reported a crash due to this.
      
This patch checks the stream reset info length before making the reconf
chunk and returns -EINVAL if the length exceeds the chunk's capacity.
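
A sketch of the kind of bound being enforced (illustrative variable names;
per the v2 note below, the check ended up in sctp_send_reset_streams):

    /* Refuse to build a reconf chunk whose variable-length stream lists
     * would not fit into a single chunk.
     */
    if (outlen + inlen > SCTP_MAX_CHUNK_LEN - sizeof(struct sctp_chunkhdr))
            return -EINVAL;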
      
      Thanks Marcelo and Neil for making this clear.
      
      v1->v2:
        - move the check into sctp_send_reset_streams instead.
      
      Fixes: cc16f00f ("sctp: add support for generating stream reconf ssn reset request chunk")
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
• sctp: use the right sk after waking up from wait_buf sleep · cea0cc80
  Xin Long authored
      Commit dfcb9f4f ("sctp: deny peeloff operation on asocs with threads
      sleeping on it") fixed the race between peeloff and wait sndbuf by
      checking waitqueue_active(&asoc->wait) in sctp_do_peeloff().
      
But it actually doesn't work: even if waitqueue_active returns false,
the waiting sndbuf thread may still not hold the sk lock yet. After the
asoc is peeled off, sk is no longer asoc->base.sk, so holding the old
sk lock cannot make the asoc safe to access.
      
This patch fixes it by holding the new sk lock if sk is no longer
asoc->base.sk, and by also setting the sk used in sctp_sendmsg to the
new sk.
      
With this fix there is no more race between peeloff and waitbuf, and
the 'waitqueue_active' check in sctp_do_peeloff can be removed.
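
The essence of the fix, sketched (not the literal hunk):

    /* After sleeping for sndbuf space, the asoc may have been peeled off
     * onto a new socket.  Re-take the lock of the socket the asoc belongs
     * to now, instead of the one we went to sleep on.
     */
    lock_sock(sk);
    if (sk != asoc->base.sk) {
            release_sock(sk);
            sk = asoc->base.sk;
            lock_sock(sk);
    }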
      
      Thanks Marcelo and Neil for making this clear.
      
      v1->v2:
        fix it by changing to lock the new sock instead of adding a flag in asoc.
Suggested-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
• sctp: do not free asoc when it is already dead in sctp_sendmsg · ca3af4dd
  Xin Long authored
Now in sctp_sendmsg, sctp_wait_for_sndbuf can schedule out without
holding the sock lock. This means the current asoc can be freed
elsewhere, for example when an abort packet is received.
      
If the asoc was just created in sctp_sendmsg and sctp_wait_for_sndbuf
returns an error, the asoc will be freed again because new_asoc is not
NULL. A use-after-free issue would be triggered by this.
      
This patch fixes it by setting new_asoc to NULL if the asoc is already
dead when the CPU schedules back, so that it will not be freed again
in sctp_sendmsg.
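
Sketched, the fix amounts to the following on the sendmsg error path
(illustrative placement; see the v2 note below):

    /* If the association died while we slept in sctp_wait_for_sndbuf(),
     * it has already been torn down elsewhere; drop our local reference
     * so the error path does not free it a second time.
     */
    if (err < 0) {
            if (asoc->base.dead)
                    new_asoc = NULL;
            goto out_free;
    }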
      
      v1->v2:
        set new_asoc as nil in sctp_sendmsg instead of sctp_wait_for_sndbuf.
Suggested-by: Neil Horman <nhorman@tuxdriver.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>