1. 23 Aug, 2017 4 commits
  2. 19 Aug, 2017 5 commits
    • net: inet: diag: expose sockets cgroup classid · 0888e372
      Committed by Levin, Alexander (Sasha Levin)
      This is useful for directly looking up a task based on class id rather than
      having to scan through all open file descriptors.
      Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: when rearming RTO, if RTO time is in past then fire RTO ASAP · cdbeb633
      Committed by Neal Cardwell
      In some situations tcp_send_loss_probe() can realize that it's unable
      to send a loss probe (TLP), and falls back to calling tcp_rearm_rto()
      to schedule an RTO timer. In such cases, sometimes tcp_rearm_rto()
      realizes that the RTO was eligible to fire immediately or at some
      point in the past (delta_us <= 0). Previously in such cases
      tcp_rearm_rto() was scheduling such "overdue" RTOs to happen at now +
      icsk_rto, which caused needless delays of hundreds of milliseconds
      (and non-linear behavior that made reproducible testing
      difficult). This commit changes the logic to schedule "overdue" RTOs
      ASAP, rather than at now + icsk_rto.
      
      Fixes: 6ba8a3b1 ("tcp: Tail loss probe (TLP)")
      Suggested-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
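The scheduling change described in this commit can be sketched as a small model. This is not the kernel code: the function name, the `fire_overdue_asap` switch, and the 1 microsecond minimum are assumptions chosen for illustration; only `delta_us` and `icsk_rto` come from the commit message.

```python
def rto_timeout_us(icsk_rto_us: int, delta_us: int, fire_overdue_asap: bool = True) -> int:
    """Return how far from now (in microseconds) the RTO timer is armed.

    delta_us is the time left until the RTO deadline; a value <= 0 means
    the RTO is already overdue.
    """
    if delta_us > 0:
        # Deadline is still in the future: wait for the remainder.
        return delta_us
    if fire_overdue_asap:
        # Behaviour after the commit: an overdue RTO fires as soon as possible.
        # The value 1 is an assumed stand-in for "the smallest timeout".
        return 1
    # Behaviour before the commit: an overdue RTO waited a full extra RTO.
    return icsk_rto_us


# An RTO of 300 ms that became due 50 ms ago.
old = rto_timeout_us(300_000, -50_000, fire_overdue_asap=False)
new = rto_timeout_us(300_000, -50_000)
print(old, new)
```

In this model the overdue case drops from a 300 ms wait to an immediate timeout, which is the "hundreds of milliseconds" of needless delay the message refers to.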
    • net: check and errout if res->fi is NULL when RTM_F_FIB_MATCH is set · bc3aae2b
      Committed by Roopa Prabhu
      Syzkaller hit a 'general protection fault in fib_dump_info' bug on
      4.13-rc5.
      
      Guilty file: net/ipv4/fib_semantics.c
      
      kasan: GPF could be caused by NULL-ptr deref or user memory access
      general protection fault: 0000 [#1] SMP KASAN
      Modules linked in:
      CPU: 0 PID: 2808 Comm: syz-executor0 Not tainted 4.13.0-rc5 #1
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
      Ubuntu-1.8.2-1ubuntu1 04/01/2014
      task: ffff880078562700 task.stack: ffff880078110000
      RIP: 0010:fib_dump_info+0x388/0x1170 net/ipv4/fib_semantics.c:1314
      RSP: 0018:ffff880078117010 EFLAGS: 00010206
      RAX: dffffc0000000000 RBX: 00000000000000fe RCX: 0000000000000002
      RDX: 0000000000000006 RSI: ffff880078117084 RDI: 0000000000000030
      RBP: ffff880078117268 R08: 000000000000000c R09: ffff8800780d80c8
      R10: 0000000058d629b4 R11: 0000000067fce681 R12: 0000000000000000
      R13: ffff8800784bd540 R14: ffff8800780d80b5 R15: ffff8800780d80a4
      FS:  00000000022fa940(0000) GS:ffff88007fc00000(0000)
      knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00000000004387d0 CR3: 0000000079135000 CR4: 00000000000006f0
      Call Trace:
        inet_rtm_getroute+0xc89/0x1f50 net/ipv4/route.c:2766
        rtnetlink_rcv_msg+0x288/0x680 net/core/rtnetlink.c:4217
        netlink_rcv_skb+0x340/0x470 net/netlink/af_netlink.c:2397
        rtnetlink_rcv+0x28/0x30 net/core/rtnetlink.c:4223
        netlink_unicast_kernel net/netlink/af_netlink.c:1265 [inline]
        netlink_unicast+0x4c4/0x6e0 net/netlink/af_netlink.c:1291
        netlink_sendmsg+0x8c4/0xca0 net/netlink/af_netlink.c:1854
        sock_sendmsg_nosec net/socket.c:633 [inline]
        sock_sendmsg+0xca/0x110 net/socket.c:643
        ___sys_sendmsg+0x779/0x8d0 net/socket.c:2035
        __sys_sendmsg+0xd1/0x170 net/socket.c:2069
        SYSC_sendmsg net/socket.c:2080 [inline]
        SyS_sendmsg+0x2d/0x50 net/socket.c:2076
        entry_SYSCALL_64_fastpath+0x1a/0xa5
        RIP: 0033:0x4512e9
        RSP: 002b:00007ffc75584cc8 EFLAGS: 00000216 ORIG_RAX:
        000000000000002e
        RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00000000004512e9
        RDX: 0000000000000000 RSI: 0000000020f2cfc8 RDI: 0000000000000003
        RBP: 000000000000000e R08: 0000000000000000 R09: 0000000000000000
        R10: 0000000000000000 R11: 0000000000000216 R12: fffffffffffffffe
        R13: 0000000000718000 R14: 0000000020c44ff0 R15: 0000000000000000
        Code: 00 0f b6 8d ec fd ff ff 48 8b 85 f0 fd ff ff 88 48 17 48 8b 45
        28 48 8d 78 30 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03
        <0f>
        b6 04 02 84 c0 74 08 3c 03 0f 8e cb 0c 00 00 48 8b 45 28 44
        RIP: fib_dump_info+0x388/0x1170 net/ipv4/fib_semantics.c:1314 RSP:
        ffff880078117010
      ---[ end trace 254a7af28348f88b ]---
      
      This patch adds a res->fi NULL check.
      
      example run:
      $ip route get 0.0.0.0 iif virt1-0
      broadcast 0.0.0.0 dev lo
          cache <local,brd> iif virt1-0
      
      $ip route get 0.0.0.0 iif virt1-0 fibmatch
      RTNETLINK answers: No route to host
      Reported-by: idaifish <idaifish@gmail.com>
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Fixes: b6179813 ("net: ipv4: RTM_GETROUTE: return matched fib result when requested")
      Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
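A minimal sketch of the added check, assuming a simplified reply path. `FibResult` and `getroute_reply` are hypothetical names, the flag value is illustrative, and the error code is inferred from the "No route to host" output in the example run rather than taken from the patch itself.

```python
import errno
from dataclasses import dataclass
from typing import Optional

# Illustrative flag bit; the real constant is defined in the kernel's
# rtnetlink UAPI header.
RTM_F_FIB_MATCH = 0x2000


@dataclass
class FibResult:
    # Hypothetical stand-in for res->fi; None models a NULL pointer.
    fi: Optional[dict]


def getroute_reply(res: FibResult, rtm_flags: int) -> int:
    """Return 0 on success or a negative errno, as kernel functions do."""
    if rtm_flags & RTM_F_FIB_MATCH:
        if res.fi is None:
            # Without this check the FIB dump would dereference NULL.
            return -errno.EHOSTUNREACH
        return 0  # the matched FIB entry would be dumped here
    return 0      # the cached route would be dumped here


print(getroute_reply(FibResult(fi=None), RTM_F_FIB_MATCH))
```

The two branches mirror the example run: without `fibmatch` the lookup still answers, while with `fibmatch` and no FIB info it returns an error instead of crashing.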
    • ipv4: convert dst_metrics.refcnt from atomic_t to refcount_t · 9620fef2
      Committed by Eric Dumazet
      The refcount_t type and its corresponding API should be
      used instead of atomic_t when the variable is used as
      a reference counter. This helps avoid accidental
      refcounter overflows that might lead to use-after-free
      situations.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • datagram: When peeking datagrams with offset < 0 don't skip empty skbs · a0917e0b
      Committed by Matthew Dawson
      Due to commit e6afc8ac ("udp: remove headers from UDP packets before
      queueing"), when UDP packets are being peeked the requested extra offset
      is always 0, as there is no need to skip the UDP header.  However, when
      the offset is 0 and the next skb is of length 0, it is only returned
      once.  The behaviour can be seen with the following Python script:
      
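      from socket import AF_INET6, MSG_PEEK, SOCK_DGRAM, SOCK_NONBLOCK, socket

      # Receiver and sender: non-blocking IPv6 UDP sockets.
      f = socket(AF_INET6, SOCK_DGRAM | SOCK_NONBLOCK, 0)
      g = socket(AF_INET6, SOCK_DGRAM | SOCK_NONBLOCK, 0)
      f.bind(('::', 0))
      addr = ('::1', f.getsockname()[1])
      # Queue an empty datagram followed by a one-byte datagram.
      g.sendto(b'', addr)
      g.sendto(b'b', addr)
      # Both peeks should return the empty datagram at the head of the queue.
      print(f.recvfrom(10, MSG_PEEK))
      print(f.recvfrom(10, MSG_PEEK))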
      
      The expected output is the empty string twice.

      To fix this, let sk_peek_offset return negative values, and pass those values
      to __skb_try_recv_datagram/__skb_try_recv_from_queue.  If the passed offset
      to __skb_try_recv_from_queue is negative, the checked skb is never skipped.
      __skb_try_recv_from_queue will then ensure the offset is reset back to 0
      if a peek is requested without an offset, unless no packets are found.
      
      Also simplify the if condition in __skb_try_recv_from_queue.  If _off is
      greater than 0, and off is greater than or equal to skb->len, then
      (_off || skb->len) must always be true assuming skb->len >= 0 is always
      true.
      
      Also remove a redundant check around a call to sk_peek_offset in af_unix.c,
      as it double checked if MSG_PEEK was set in the flags.
      
      V2:
       - Moved the negative fixup into __skb_try_recv_from_queue, and remove now
      redundant checks
       - Fix peeking in udp{,v6}_recvmsg to report the right value when the
      offset is 0
      
      V3:
       - Marked new branch in __skb_try_recv_from_queue as unlikely.
      Signed-off-by: Matthew Dawson <matthew@mjdsystems.ca>
      Acked-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
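The skip rule described in this commit can be modelled in a few lines. This is a simplified sketch, not the kernel function: `pick_skb` is a hypothetical name, the queue is reduced to `(length, already_peeked)` pairs, and the exact condition is an assumption based on the description above.

```python
from typing import List, Optional, Tuple


def pick_skb(queue: List[Tuple[int, bool]], off: int) -> Optional[int]:
    """Return the index of the skb a peek would return, or None.

    queue holds (length, already_peeked) pairs; off < 0 means the peek
    was requested without an offset.
    """
    for index, (length, peeked) in enumerate(queue):
        # Only a non-negative offset may skip an skb; a negative offset
        # never skips the skb being checked.
        if off >= 0 and off >= length and (off or peeked):
            off -= length
            continue
        return index
    return None


# An empty datagram that was already peeked once, followed by b'b'.
queue = [(0, True), (1, False)]
print(pick_skb(queue, -1))  # negative offset: stays on the empty datagram
print(pick_skb(queue, 0))   # offset 0: the peeked empty datagram is skipped
```

With offset 0 the model reproduces the reported behaviour (the empty datagram is returned only once), while a negative offset keeps returning it.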
  3. 17 Aug, 2017 3 commits
  4. 16 Aug, 2017 2 commits
  5. 15 Aug, 2017 3 commits
    • tcp: fix possible deadlock in TCP stack vs BPF filter · d624d276
      Committed by Eric Dumazet
      Filtering the ACK packet was not put at the right place.
      
      At this place, we already allocated a child and put it
      into accept queue.
      
      We absolutely need to call tcp_child_process() to release
      its spinlock, or we will deadlock at accept() or close() time.
      
      Found by the syzkaller team (thanks a lot!)
      
      Fixes: 8fac365f ("tcp: Add a tcp_filter hook before handle ack packet")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Cc: Chenbo Feng <fengc@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: ulp: avoid module refcnt leak in tcp_set_ulp · 539a06ba
      Committed by Sabrina Dubroca
      __tcp_ulp_find_autoload returns tcp_ulp_ops after taking a reference on
      the module. Then, if ->init fails, tcp_set_ulp propagates the error but
      nothing releases that reference.
      
      Fixes: 734942cc ("tcp: ULP infrastructure")
      Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
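The leak pattern and its fix can be illustrated with a toy reference counter. The `Module` class and `set_ulp` function are hypothetical stand-ins for the kernel objects named in the message; the sketch only shows that the reference taken by the lookup must be dropped when initialisation fails.

```python
from typing import Callable


class Module:
    """Hypothetical stand-in for a kernel module with a reference count."""

    def __init__(self) -> None:
        self.refcnt = 0

    def get(self) -> None:
        self.refcnt += 1

    def put(self) -> None:
        self.refcnt -= 1


def set_ulp(module: Module, init: Callable[[], int]) -> int:
    """Attach a ULP; return 0 on success or the negative error from init."""
    module.get()        # the lookup returns with a module reference held
    err = init()
    if err:
        module.put()    # the fix: release the reference when init fails
        return err
    return 0            # success: the reference stays held while attached


m = Module()
print(set_ulp(m, lambda: -22), m.refcnt)
```

Without the `put()` on the error path, every failed attach would leave the count one higher and the module could never be unloaded.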
    • ipv4: route: fix inet_rtm_getroute induced crash · 2c87d63a
      Committed by Florian Westphal
      "ip route get $daddr iif eth0 from $saddr" causes:
       BUG: KASAN: use-after-free in ip_route_input_rcu+0x1535/0x1b50
       Call Trace:
        ip_route_input_rcu+0x1535/0x1b50
        ip_route_input_noref+0xf9/0x190
        tcp_v4_early_demux+0x1a4/0x2b0
        ip_rcv+0xbcb/0xc05
        __netif_receive_skb+0x9c/0xd0
        netif_receive_skb_internal+0x5a8/0x890
      
      Problem is that inet_rtm_getroute calls either ip_route_input_rcu (if an
      iif was provided) or ip_route_output_key_hash_rcu.
      
      But ip_route_input_rcu, unlike ip_route_output_key_hash_rcu, already
      associates the dst_entry with the skb.  This clears the SKB_DST_NOREF
      bit (i.e. skb_dst_drop will release/free the entry while it should not).
      
      Thus only set the dst if we called ip_route_output_key_hash_rcu().
      
      I tested this patch by running:
       while true;do ip r get 10.0.1.2;done > /dev/null &
       while true;do ip r get 10.0.1.2 iif eth0  from 10.0.1.1;done > /dev/null &
      ... and saw no crash or memory leak.
      
      Cc: Roopa Prabhu <roopa@cumulusnetworks.com>
      Cc: David Ahern <dsahern@gmail.com>
      Fixes: ba52d61e ("ipv4: route: restore skb_dst_set in inet_rtm_getroute")
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  6. 14 Aug, 2017 2 commits
  7. 12 Aug, 2017 1 commit
    • net: ipv4: set orig_oif based on fib result for local traffic · 839da4d9
      Committed by David Ahern
      Attempts to connect to a local address with a socket bound
      to a device with the local address hang if there is no listener:
      
        $ ip addr sh dev eth1
        3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
          link/ether 02:e0:f9:1c:00:37 brd ff:ff:ff:ff:ff:ff
          inet 10.100.1.4/24 scope global eth1
             valid_lft forever preferred_lft forever
          inet6 2001:db8:1::4/120 scope global
             valid_lft forever preferred_lft forever
          inet6 fe80::e0:f9ff:fe1c:37/64 scope link
             valid_lft forever preferred_lft forever
      
        $ vrf-test -I eth1 -r 10.100.1.4
        <hangs when there is no server>
      
      (don't let the command name fool you; vrf-test works without vrfs.)
      
      The problem is that the original intended device, eth1 in this case, is
      lost when the tcp reset is sent, so the socket lookup does not find a
      match for the reset and the connect attempt hangs. Fix by adjusting
      orig_oif for local traffic to the device from the fib lookup result.
      
      With this patch you get the more user friendly:
        $ vrf-test -I eth1 -r 10.100.1.4
        connect failed: 111: Connection refused
      
      orig_oif is saved to the newly created rtable as rt_iif and when set
      it is used as the dif for socket lookups. It is set based on flowi4_oif
      passed in to ip_route_output_key_hash_rcu and will be set to either
      the loopback device, an l3mdev device, nothing (flowi4_oif = 0 which
      is the case in the example above) or a netdev index depending on the
      lookup path. In each case, resetting orig_oif to the device in the fib
      result for the RTN_LOCAL case allows the actual device to be preserved
      as the skb tx and rx is done over the loopback or VRF device.
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  8. 11 Aug, 2017 2 commits
    • net: xfrm: support setting an output mark. · 077fbac4
      Committed by Lorenzo Colitti
      On systems that use mark-based routing it may be necessary for
      routing lookups to use marks in order for packets to be routed
      correctly. An example of such a system is Android, which uses
      socket marks to route packets via different networks.
      
      Currently, routing lookups in tunnel mode always use a mark of
      zero, making routing incorrect on such systems.
      
      This patch adds a new output_mark element to the xfrm state and
      a corresponding XFRMA_OUTPUT_MARK netlink attribute. The output
      mark differs from the existing xfrm mark in two ways:
      
      1. The xfrm mark is used to match xfrm policies and states, while
         the xfrm output mark is used to set the mark (and influence
         the routing) of the packets emitted by those states.
      2. The existing mark is constrained to be a subset of the bits of
         the originating socket or transformed packet, but the output
         mark is arbitrary and depends only on the state.
      
      The use of a separate mark provides additional flexibility. For
      example:
      
      - A packet subject to two transforms (e.g., transport mode inside
        tunnel mode) can have two different output marks applied to it,
        one for the transport mode SA and one for the tunnel mode SA.
      - On a system where socket marks determine routing, the packets
        emitted by an IPsec tunnel can be routed based on a mark that
        is determined by the tunnel, not by the marks of the
        unencrypted packets.
      - Support for setting the output marks can be introduced without
        breaking any existing setups that employ both mark-based
        routing and xfrm tunnel mode. Simply changing the code to use
        the xfrm mark for routing output packets could change
        behaviour in a way that breaks these setups.
      
      If the output mark is unspecified or set to zero, the mark is not
      set or changed.
      
      Tested: make allyesconfig; make -j64
      Tested: https://android-review.googlesource.com/452776
      Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
      Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
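The rule for applying an output mark, and the nested-transform example from the message, can be sketched as follows. `apply_output_mark` is a hypothetical helper and the mark values are made up for the example; only the "zero means leave the mark alone" rule comes from the commit message.

```python
def apply_output_mark(packet_mark: int, output_mark: int) -> int:
    """Return the mark a packet carries after leaving an xfrm state."""
    # Zero means "unspecified": the packet's mark is left untouched.
    return output_mark if output_mark != 0 else packet_mark


# A packet marked 0x10 passes through a transport-mode SA and then a
# tunnel-mode SA, each with its own (possibly unset) output mark.
mark = 0x10
mark = apply_output_mark(mark, 0)       # transport SA: no output mark set
mark = apply_output_mark(mark, 0x200)   # tunnel SA: sets its own output mark
print(hex(mark))
```

The routing lookup for the emitted tunnel packet would then use the tunnel's mark rather than the mark of the unencrypted packet.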
    • udp: consistently apply ufo or fragmentation · 85f1bd9a
      Committed by Willem de Bruijn
      When iteratively building a UDP datagram with MSG_MORE and that
      datagram exceeds MTU, consistently choose UFO or fragmentation.
      
      Once skb_is_gso, always apply ufo. Conversely, once a datagram is
      split across multiple skbs, do not consider ufo.
      
      Sendpage already maintains the first invariant, only add the second.
      IPv6 does not have a sendpage implementation to modify.
      
      A gso skb must have a partial checksum, do not follow sk_no_check_tx
      in udp_send_skb.
      
      Found by syzkaller.
      
      Fixes: e89e9cf5 ("[IPv4/IPv6]: UFO Scatter-gather approach")
      Reported-by: Andrey Konovalov <andreyknvl@google.com>
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  9. 10 Aug, 2017 2 commits
  10. 09 Aug, 2017 2 commits
  11. 08 Aug, 2017 6 commits
  12. 07 Aug, 2017 6 commits
  13. 04 Aug, 2017 2 commits
    • tcp: enable MSG_ZEROCOPY · f214f915
      Committed by Willem de Bruijn
      Enable support for MSG_ZEROCOPY in the TCP stack. TSO and GSO are
      both supported. Only data sent to remote destinations is sent without
      copying. Packets looped onto a local destination have their payload
      copied to avoid unbounded latency.
      
      Tested:
        A 10x TCP_STREAM between two hosts showed a reduction in netserver
        process cycles by up to 70%, depending on packet size. Systemwide,
        savings are of course much less pronounced, at up to 20% best case.
      
        msg_zerocopy.sh 4 tcp:
      
        without zerocopy
          tx=121792 (7600 MB) txc=0 zc=n
          rx=60458 (7600 MB)
      
        with zerocopy
          tx=286257 (17863 MB) txc=286257 zc=y
          rx=140022 (17863 MB)
      
        This test opens a pair of sockets over veth. One calls send with
        64KB and optionally MSG_ZEROCOPY; the other reads the initial
        bytes. The receiver truncates, so this is strictly an upper bound on
        what is achievable. It is more representative of sending data out of
        a physical NIC (when payload is not touched, either).
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
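As a quick reading aid, the speed-up implied by the msg_zerocopy.sh figures above can be worked out directly. The numbers are copied from the test output in the message; the variable names are chosen here for illustration.

```python
# Figures from the "msg_zerocopy.sh 4 tcp" runs quoted in the message.
without_zc = {"tx_calls": 121792, "tx_mb": 7600}
with_zc = {"tx_calls": 286257, "tx_mb": 17863}

# Ratio of completed send calls and of megabytes sent, zerocopy vs. copy.
calls_ratio = with_zc["tx_calls"] / without_zc["tx_calls"]
bytes_ratio = with_zc["tx_mb"] / without_zc["tx_mb"]

print(f"{calls_ratio:.2f}x send calls, {bytes_ratio:.2f}x data sent")
```

Both ratios come out at roughly 2.35x, which the message itself describes as an upper bound because the receiver truncates.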
    • tcp: fix xmit timer to only be reset if data ACKed/SACKed · df92c839
      Committed by Neal Cardwell
      Fix a TCP loss recovery performance bug raised recently on the netdev
      list, in two threads:
      
      (i)  July 26, 2017: netdev thread "TCP fast retransmit issues"
      (ii) July 26, 2017: netdev thread:
           "[PATCH V2 net-next] TLP: Don't reschedule PTO when there's one
           outstanding TLP retransmission"
      
      The basic problem is that incoming TCP packets that did not indicate
      forward progress could cause the xmit timer (TLP or RTO) to be rearmed
      and pushed back in time. In certain corner cases this could result in
      the following problems noted in these threads:
      
       - Repeated ACKs coming in with bogus SACKs corrupted by middleboxes
         could cause TCP to repeatedly schedule TLPs forever. We kept
         sending TLPs after every ~200ms, which elicited bogus SACKs, which
         caused more TLPs, ad infinitum; we never fired an RTO to fill in
         the holes.
      
       - Incoming data segments could, in some cases, cause us to reschedule
         our RTO or TLP timer further out in time, for no good reason. This
         could cause repeated inbound data to result in stalls in outbound
         data, in the presence of packet loss.
      
      This commit fixes these bugs by changing the TLP and RTO ACK
      processing to:
      
       (a) Only reschedule the xmit timer once per ACK.
      
       (b) Only reschedule the xmit timer if tcp_clean_rtx_queue() deems the
           ACK indicates sufficient forward progress (a packet was
           cumulatively ACKed, or we got a SACK for a packet that was sent
           before the most recent retransmit of the write queue head).
      
      This brings us back into closer compliance with the RFCs, since, as
      the comment for tcp_rearm_rto() notes, we should only restart the RTO
      timer after forward progress on the connection. Previously we were
      restarting the xmit timer even in these cases where there was no
      forward progress.
      
      As a side benefit, this commit simplifies and speeds up the TCP timer
      arming logic. We had been calling inet_csk_reset_xmit_timer() three
      times on normal ACKs that cumulatively acknowledged some data:
      
      1) Once near the top of tcp_ack() to switch from TLP timer to RTO:
              if (icsk->icsk_pending == ICSK_TIME_LOSS_PROBE)
                     tcp_rearm_rto(sk);
      
      2) Once in tcp_clean_rtx_queue(), to update the RTO:
              if (flag & FLAG_ACKED) {
                     tcp_rearm_rto(sk);
      
      3) Once in tcp_ack() after tcp_fastretrans_alert() to switch from RTO
         to TLP:
              if (icsk->icsk_pending == ICSK_TIME_RETRANS)
                     tcp_schedule_loss_probe(sk);
      
      This commit, by only rescheduling the xmit timer once per ACK,
      simplifies the code and reduces CPU overhead.
      
      This commit was tested in an A/B test with Google web server
      traffic. SNMP stats and request latency metrics were within noise
      levels, substantiating that for normal web traffic patterns this is a
      rare issue. This commit was also tested with packetdrill tests to
      verify that it fixes the timer behavior in the corner cases discussed
      in the netdev threads mentioned above.
      
      This patch is a bug fix patch intended to be queued for -stable
      releases.
      
      Fixes: 6ba8a3b1 ("tcp: Tail loss probe (TLP)")
      Reported-by: Klavs Klavsen <kl@vsen.dk>
      Reported-by: Mao Wenan <maowenan@huawei.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Nandita Dukkipati <nanditad@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
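Rules (a) and (b) from this commit can be condensed into a small decision model. `AckInfo` and `timer_rearms_for_ack` are hypothetical names and the model ignores everything else the ACK path does; it only shows that an ACK rearms the xmit timer at most once, and only on forward progress.

```python
from dataclasses import dataclass


@dataclass
class AckInfo:
    # A packet was cumulatively ACKed by this ACK.
    cumulatively_acked: bool
    # This ACK SACKed a packet that was sent before the most recent
    # retransmit of the write queue head.
    sacked_before_last_rexmit: bool


def timer_rearms_for_ack(ack: AckInfo) -> int:
    """Return how many times this ACK reschedules the xmit timer (0 or 1)."""
    forward_progress = ack.cumulatively_acked or ack.sacked_before_last_rexmit
    return 1 if forward_progress else 0


bogus_sack = AckInfo(cumulatively_acked=False, sacked_before_last_rexmit=False)
normal_ack = AckInfo(cumulatively_acked=True, sacked_before_last_rexmit=False)
print(timer_rearms_for_ack(bogus_sack), timer_rearms_for_ack(normal_ack))
```

Under this model a stream of bogus SACKs never pushes the timer back, so the RTO can eventually fire, and a normal ACK rearms once instead of the three times listed in the message.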