1. 07 1月, 2017 3 次提交
  2. 06 1月, 2017 1 次提交
    • S
      tcp: provide timestamps for partial writes · ad02c4f5
      Soheil Hassas Yeganeh 提交于
      For TCP sockets, TX timestamps are only captured when the user data
      is successfully and fully written to the socket. In many cases,
      however, TCP writes can be partial for which no timestamp is
      collected.
      
      Collect timestamps whenever any user data is (fully or partially)
      copied into the socket. Pass tcp_write_queue_tail to tcp_tx_timestamp
      instead of the local skb pointer since it can be set to NULL on
      the error path.
      
      Note that tcp_write_queue_tail can be NULL, even if bytes have been
      copied to the socket. This is because acknowledgements are being
      processed in tcp_sendmsg(), and by the time tcp_tx_timestamp is
      called tcp_write_queue_tail can be NULL. For such cases, this patch
      does not collect any timestamps (i.e., it is best-effort).
      
      This patch is written with suggestions from Willem de Bruijn and
      Eric Dumazet.
      
      Change-log V1 -> V2:
      	- Use sockc.tsflags instead of sk->sk_tsflags.
      	- Use the same code path for normal writes and errors.
      Signed-off-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Acked-by: NYuchung Cheng <ycheng@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Martin KaFai Lau <kafai@fb.com>
      Acked-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ad02c4f5
  3. 05 1月, 2017 1 次提交
  4. 03 1月, 2017 3 次提交
    • N
      ipmr, ip6mr: add RTNH_F_UNRESOLVED flag to unresolved cache entries · 1708ebc9
      Nikolay Aleksandrov 提交于
      While working with ipmr, we noticed that it is impossible to determine
      if an entry is actually unresolved or its IIF interface has disappeared
      (e.g. virtual interface got deleted). These entries look almost
      identical to user-space when dumping or receiving notifications. So in
      order to recognize them add a new RTNH_F_UNRESOLVED flag which is set when
      sending an unresolved cache entry to user-space.
      Suggested-by: NRoopa Prabhu <roopa@cumulusnetworks.com>
      Signed-off-by: NNikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1708ebc9
    • A
      ipv4: Do not allow MAIN to be alias for new LOCAL w/ custom rules · 5350d54f
      Alexander Duyck 提交于
      In the case of custom rules being present we need to handle the case of the
      LOCAL table being intialized after the new rule has been added.  To address
      that I am adding a new check so that we can make certain we don't use an
      alias of MAIN for LOCAL when allocating a new table.
      
      Fixes: 0ddcf43d ("ipv4: FIB Local/MAIN table collapse")
      Reported-by: NOliver Brunel <jjk@jjacky.com>
      Signed-off-by: NAlexander Duyck <alexander.h.duyck@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5350d54f
    • M
      igmp: Make igmp group member RFC 3376 compliant · 7ababb78
      Michal Tesar 提交于
      5.2. Action on Reception of a Query
      
       When a system receives a Query, it does not respond immediately.
       Instead, it delays its response by a random amount of time, bounded
       by the Max Resp Time value derived from the Max Resp Code in the
       received Query message.  A system may receive a variety of Queries on
       different interfaces and of different kinds (e.g., General Queries,
       Group-Specific Queries, and Group-and-Source-Specific Queries), each
       of which may require its own delayed response.
      
       Before scheduling a response to a Query, the system must first
       consider previously scheduled pending responses and in many cases
       schedule a combined response.  Therefore, the system must be able to
       maintain the following state:
      
       o A timer per interface for scheduling responses to General Queries.
      
       o A per-group and interface timer for scheduling responses to Group-
         Specific and Group-and-Source-Specific Queries.
      
       o A per-group and interface list of sources to be reported in the
         response to a Group-and-Source-Specific Query.
      
       When a new Query with the Router-Alert option arrives on an
       interface, provided the system has state to report, a delay for a
       response is randomly selected in the range (0, [Max Resp Time]) where
       Max Resp Time is derived from Max Resp Code in the received Query
       message.  The following rules are then used to determine if a Report
       needs to be scheduled and the type of Report to schedule.  The rules
       are considered in order and only the first matching rule is applied.
      
       1. If there is a pending response to a previous General Query
          scheduled sooner than the selected delay, no additional response
          needs to be scheduled.
      
       2. If the received Query is a General Query, the interface timer is
          used to schedule a response to the General Query after the
          selected delay.  Any previously pending response to a General
          Query is canceled.
      --8<--
      
      Currently the timer is rearmed with new random expiration time for
      every incoming query regardless of possibly already pending report.
      Which is not aligned with the above RFE.
      It also might happen that higher rate of incoming queries can
      postpone the report after the expiration time of the first query
      causing group membership loss.
      
      Now the per interface general query timer is rearmed only
      when there is no pending report already scheduled on that interface or
      the newly selected expiration time is before the already pending
      scheduled report.
      Signed-off-by: NMichal Tesar <mtesar@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7ababb78
  5. 31 12月, 2016 1 次提交
    • D
      net: Allow IP_MULTICAST_IF to set index to L3 slave · 7bb387c5
      David Ahern 提交于
      IP_MULTICAST_IF fails if sk_bound_dev_if is already set and the new index
      does not match it. e.g.,
      
          ntpd[15381]: setsockopt IP_MULTICAST_IF 192.168.1.23 fails: Invalid argument
      
      Relax the check in setsockopt to allow setting mc_index to an L3 slave if
      sk_bound_dev_if points to an L3 master.
      
      Make a similar change for IPv6. In this case change the device lookup to
      take the rcu_read_lock avoiding a refcnt. The rcu lock is also needed for
      the lookup of a potential L3 master device.
      
      This really only silences a setsockopt failure since uses of mc_index are
      secondary to sk_bound_dev_if if it is set. In both cases, if either index
      is an L3 slave or master, lookups are directed to the same FIB table so
      relaxing the check at setsockopt time causes no harm.
      
      Patch is based on a suggested change by Darwin for a problem noted in
      their code base.
      Suggested-by: NDarwin Dingel <darwin.dingel@alliedtelesis.co.nz>
      Signed-off-by: NDavid Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7bb387c5
  6. 30 12月, 2016 4 次提交
  7. 28 12月, 2016 1 次提交
  8. 26 12月, 2016 1 次提交
    • T
      ktime: Get rid of the union · 2456e855
      Thomas Gleixner 提交于
      ktime is a union because the initial implementation stored the time in
      scalar nanoseconds on 64 bit machine and in a endianess optimized timespec
      variant for 32bit machines. The Y2038 cleanup removed the timespec variant
      and switched everything to scalar nanoseconds. The union remained, but
      become completely pointless.
      
      Get rid of the union and just keep ktime_t as simple typedef of type s64.
      
      The conversion was done with coccinelle and some manual mopping up.
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      2456e855
  9. 25 12月, 2016 1 次提交
  10. 24 12月, 2016 1 次提交
  11. 23 12月, 2016 1 次提交
  12. 22 12月, 2016 1 次提交
  13. 20 12月, 2016 1 次提交
    • Z
      ipv4: Should use consistent conditional judgement for ip fragment in... · 0a28cfd5
      zheng li 提交于
      ipv4: Should use consistent conditional judgement for ip fragment in __ip_append_data and ip_finish_output
      
      There is an inconsistent conditional judgement in __ip_append_data and
      ip_finish_output functions, the variable length in __ip_append_data just
      include the length of application's payload and udp header, don't include
      the length of ip header, but in ip_finish_output use
      (skb->len > ip_skb_dst_mtu(skb)) as judgement, and skb->len include the
      length of ip header.
      
      That causes some particular application's udp payload whose length is
      between (MTU - IP Header) and MTU were fragmented by ip_fragment even
      though the rst->dev support UFO feature.
      
      Add the length of ip header to length in __ip_append_data to keep
      consistent conditional judgement as ip_finish_output for ip fragment.
      Signed-off-by: NZheng Li <james.z.li@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0a28cfd5
  14. 18 12月, 2016 2 次提交
    • T
      inet: Fix get port to handle zero port number with soreuseport set · 0643ee4f
      Tom Herbert 提交于
      A user may call listen with binding an explicit port with the intent
      that the kernel will assign an available port to the socket. In this
      case inet_csk_get_port does a port scan. For such sockets, the user may
      also set soreuseport with the intent a creating more sockets for the
      port that is selected. The problem is that the initial socket being
      opened could inadvertently choose an existing and unreleated port
      number that was already created with soreuseport.
      
      This patch adds a boolean parameter to inet_bind_conflict that indicates
      rather soreuseport is allowed for the check (in addition to
      sk->sk_reuseport). In calls to inet_bind_conflict from inet_csk_get_port
      the argument is set to true if an explicit port is being looked up (snum
      argument is nonzero), and is false if port scan is done.
      Signed-off-by: NTom Herbert <tom@herbertland.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0643ee4f
    • T
      inet: Don't go into port scan when looking for specific bind port · 9af7e923
      Tom Herbert 提交于
      inet_csk_get_port is called with port number (snum argument) that may be
      zero or nonzero. If it is zero, then the intent is to find an available
      ephemeral port number to bind to. If snum is non-zero then the caller
      is asking to allocate a specific port number. In the latter case we
      never want to perform the scan in ephemeral port range. It is
      conceivable that this can happen if the "goto again" in "tb_found:"
      is done. This patch adds a check that snum is zero before doing
      the "goto again".
      Signed-off-by: NTom Herbert <tom@herbertland.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9af7e923
  15. 10 12月, 2016 4 次提交
    • E
      udp: udp_rmem_release() should touch sk_rmem_alloc later · 02ab0d13
      Eric Dumazet 提交于
      In flood situations, keeping sk_rmem_alloc at a high value
      prevents producers from touching the socket.
      
      It makes sense to lower sk_rmem_alloc only at the end
      of udp_rmem_release() after the thread draining receive
      queue in udp_recvmsg() finished the writes to sk_forward_alloc.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      02ab0d13
    • E
      udp: add batching to udp_rmem_release() · 6b229cf7
      Eric Dumazet 提交于
      If udp_recvmsg() constantly releases sk_rmem_alloc
      for every read packet, it gives opportunity for
      producers to immediately grab spinlocks and desperatly
      try adding another packet, causing false sharing.
      
      We can add a simple heuristic to give the signal
      by batches of ~25 % of the queue capacity.
      
      This patch considerably increases performance under
      flood by about 50 %, since the thread draining the queue
      is no longer slowed by false sharing.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6b229cf7
    • E
      udp: copy skb->truesize in the first cache line · c84d9490
      Eric Dumazet 提交于
      In UDP RX handler, we currently clear skb->dev before skb
      is added to receive queue, because device pointer is no longer
      available once we exit from RCU section.
      
      Since this first cache line is always hot, lets reuse this space
      to store skb->truesize and thus avoid a cache line miss at
      udp_recvmsg()/udp_skb_destructor time while receive queue
      spinlock is held.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c84d9490
    • E
      udp: add busylocks in RX path · 4b272750
      Eric Dumazet 提交于
      Idea of busylocks is to let producers grab an extra spinlock
      to relieve pressure on the receive_queue spinlock shared by consumer.
      
      This behavior is requested only once socket receive queue is above
      half occupancy.
      
      Under flood, this means that only one producer can be in line
      trying to acquire the receive_queue spinlock.
      
      These busylock can be allocated on a per cpu manner, instead of a
      per socket one (that would consume a cache line per socket)
      
      This patch considerably improves UDP behavior under stress,
      depending on number of NIC RX queues and/or RPS spread.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4b272750
  16. 09 12月, 2016 2 次提交
    • E
      udp: under rx pressure, try to condense skbs · c8c8b127
      Eric Dumazet 提交于
      Under UDP flood, many softirq producers try to add packets to
      UDP receive queue, and one user thread is burning one cpu trying
      to dequeue packets as fast as possible.
      
      Two parts of the per packet cost are :
      - copying payload from kernel space to user space,
      - freeing memory pieces associated with skb.
      
      If socket is under pressure, softirq handler(s) can try to pull in
      skb->head the payload of the packet if it fits.
      
      Meaning the softirq handler(s) can free/reuse the page fragment
      immediately, instead of letting udp_recvmsg() do this hundreds of usec
      later, possibly from another node.
      
      Additional gains :
      - We reduce skb->truesize and thus can store more packets per SO_RCVBUF
      - We avoid cache line misses at copyout() time and consume_skb() time,
      and avoid one put_page() with potential alien freeing on NUMA hosts.
      
      This comes at the cost of a copy, bounded to available tail room, which
      is usually small. (We might have to fix GRO_MAX_HEAD which looks bigger
      than necessary)
      
      This patch gave me about 5 % increase in throughput in my tests.
      
      skb_condense() helper could probably used in other contexts.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c8c8b127
    • Z
      icmp: correct return value of icmp_rcv() · f91c58d6
      Zhang Shengju 提交于
      Currently, icmp_rcv() always return zero on a packet delivery upcall.
      
      To make its behavior more compliant with the way this API should be
      used, this patch changes this to let it return NET_RX_SUCCESS when the
      packet is proper handled, and NET_RX_DROP otherwise.
      Signed-off-by: NZhang Shengju <zhangshengju@cmss.chinamobile.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f91c58d6
  17. 07 12月, 2016 9 次提交
  18. 06 12月, 2016 3 次提交
    • A
      switch getfrag callbacks to ..._full() primitives · 0b62fca2
      Al Viro 提交于
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      0b62fca2
    • K
      net: ping: check minimum size on ICMP header length · 0eab121e
      Kees Cook 提交于
      Prior to commit c0371da6 ("put iov_iter into msghdr") in v3.19, there
      was no check that the iovec contained enough bytes for an ICMP header,
      and the read loop would walk across neighboring stack contents. Since the
      iov_iter conversion, bad arguments are noticed, but the returned error is
      EFAULT. Returning EINVAL is a clearer error and also solves the problem
      prior to v3.19.
      
      This was found using trinity with KASAN on v3.18:
      
      BUG: KASAN: stack-out-of-bounds in memcpy_fromiovec+0x60/0x114 at addr ffffffc071077da0
      Read of size 8 by task trinity-c2/9623
      page:ffffffbe034b9a08 count:0 mapcount:0 mapping:          (null) index:0x0
      flags: 0x0()
      page dumped because: kasan: bad access detected
      CPU: 0 PID: 9623 Comm: trinity-c2 Tainted: G    BU         3.18.0-dirty #15
      Hardware name: Google Tegra210 Smaug Rev 1,3+ (DT)
      Call trace:
      [<ffffffc000209c98>] dump_backtrace+0x0/0x1ac arch/arm64/kernel/traps.c:90
      [<ffffffc000209e54>] show_stack+0x10/0x1c arch/arm64/kernel/traps.c:171
      [<     inline     >] __dump_stack lib/dump_stack.c:15
      [<ffffffc000f18dc4>] dump_stack+0x7c/0xd0 lib/dump_stack.c:50
      [<     inline     >] print_address_description mm/kasan/report.c:147
      [<     inline     >] kasan_report_error mm/kasan/report.c:236
      [<ffffffc000373dcc>] kasan_report+0x380/0x4b8 mm/kasan/report.c:259
      [<     inline     >] check_memory_region mm/kasan/kasan.c:264
      [<ffffffc00037352c>] __asan_load8+0x20/0x70 mm/kasan/kasan.c:507
      [<ffffffc0005b9624>] memcpy_fromiovec+0x5c/0x114 lib/iovec.c:15
      [<     inline     >] memcpy_from_msg include/linux/skbuff.h:2667
      [<ffffffc000ddeba0>] ping_common_sendmsg+0x50/0x108 net/ipv4/ping.c:674
      [<ffffffc000dded30>] ping_v4_sendmsg+0xd8/0x698 net/ipv4/ping.c:714
      [<ffffffc000dc91dc>] inet_sendmsg+0xe0/0x12c net/ipv4/af_inet.c:749
      [<     inline     >] __sock_sendmsg_nosec net/socket.c:624
      [<     inline     >] __sock_sendmsg net/socket.c:632
      [<ffffffc000cab61c>] sock_sendmsg+0x124/0x164 net/socket.c:643
      [<     inline     >] SYSC_sendto net/socket.c:1797
      [<ffffffc000cad270>] SyS_sendto+0x178/0x1d8 net/socket.c:1761
      
      CVE-2016-8399
      Reported-by: NQidan He <i@flanker017.me>
      Fixes: c319b4d7 ("net: ipv4: add IPPROTO_ICMP socket kind")
      Cc: stable@vger.kernel.org
      Signed-off-by: NKees Cook <keescook@chromium.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0eab121e
    • E
      tcp: tsq: move tsq_flags close to sk_wmem_alloc · 7aa5470c
      Eric Dumazet 提交于
      tsq_flags being in the same cache line than sk_wmem_alloc
      makes a lot of sense. Both fields are changed from tcp_wfree()
      and more generally by various TSQ related functions.
      
      Prior patch made room in struct sock and added sk_tsq_flags,
      this patch deletes tsq_flags from struct tcp_sock.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7aa5470c