1. 06 Dec, 2016 2 commits
  2. 03 Dec, 2016 1 commit
  3. 01 Dec, 2016 1 commit
  4. 30 Nov, 2016 2 commits
  5. 29 Nov, 2016 1 commit
  6. 25 Nov, 2016 1 commit
    • E
      udplite: call proper backlog handlers · 30c7be26
      Eric Dumazet committed
      In commits 93821778 ("udp: Fix rcv socket locking") and
      f7ad74fe ("net/ipv6/udp: UDP encapsulation: break backlog_rcv into
      __udpv6_queue_rcv_skb") UDP backlog handlers were renamed, but UDPlite
      was forgotten.
      
      This leads to crashes if UDPlite header is pulled twice, which happens
      starting from commit e6afc8ac ("udp: remove headers from UDP packets
      before queueing")
      
      Bug found by syzkaller team, thanks a lot guys !
      
      Note that backlog use in UDP/UDPlite is scheduled to be removed starting
      from linux-4.10, so this patch is only needed up to linux-4.9
      
      Fixes: 93821778 ("udp: Fix rcv socket locking")
      Fixes: f7ad74fe ("net/ipv6/udp: UDP encapsulation: break backlog_rcv into __udpv6_queue_rcv_skb")
      Fixes: e6afc8ac ("udp: remove headers from UDP packets before queueing")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Andrey Konovalov <andreyknvl@google.com>
      Cc: Benjamin LaHaise <bcrl@kvack.org>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  7. 24 Nov, 2016 1 commit
    • D
      netfilter: Update ip_route_me_harder to consider L3 domain · 6d8b49c3
      David Ahern committed
      ip_route_me_harder is not considering the L3 domain and sending lookups
      to the wrong table. For example consider the following output rule:
      
      iptables -I OUTPUT -p tcp --dport 12345 -j REJECT --reject-with tcp-reset
      
      using perf to analyze lookups via the fib_table_lookup tracepoint shows:
      
      vrf-test  1187 [001] 46887.295927: fib:fib_table_lookup: table 255 oif 0 iif 0 src 0.0.0.0 dst 10.100.1.254 tos 0 scope 0 flags 0
              ffffffff8143922c perf_trace_fib_table_lookup ([kernel.kallsyms])
              ffffffff81493aac fib_table_lookup ([kernel.kallsyms])
              ffffffff8148dda3 __inet_dev_addr_type ([kernel.kallsyms])
              ffffffff8148ddf6 inet_addr_type ([kernel.kallsyms])
              ffffffff8149e344 ip_route_me_harder ([kernel.kallsyms])
      
      and
      
      vrf-test  1187 [001] 46887.295933: fib:fib_table_lookup: table 255 oif 0 iif 1 src 10.100.1.254 dst 10.100.1.2 tos 0 scope 0 flags
              ffffffff8143922c perf_trace_fib_table_lookup ([kernel.kallsyms])
              ffffffff81493aac fib_table_lookup ([kernel.kallsyms])
              ffffffff814998ff fib4_rule_action ([kernel.kallsyms])
              ffffffff81437f35 fib_rules_lookup ([kernel.kallsyms])
              ffffffff81499758 __fib_lookup ([kernel.kallsyms])
              ffffffff8144f010 fib_lookup.constprop.34 ([kernel.kallsyms])
              ffffffff8144f759 __ip_route_output_key_hash ([kernel.kallsyms])
              ffffffff8144fc6a ip_route_output_flow ([kernel.kallsyms])
              ffffffff8149e39b ip_route_me_harder ([kernel.kallsyms])
      
      In both cases the lookups are directed to table 255 rather than the
      table associated with the device via the L3 domain. Update both
      lookups to pull the L3 domain from the dst currently attached to the
      skb.
      Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
  8. 22 Nov, 2016 1 commit
  9. 17 Nov, 2016 2 commits
  10. 16 Nov, 2016 2 commits
  11. 14 Nov, 2016 2 commits
    • E
      tcp: take care of truncations done by sk_filter() · ac6e7800
      Eric Dumazet committed
      With syzkaller's help, Marco Grassi found a bug in the TCP stack
      that crashes in tcp_collapse().
      
      The root cause is that sk_filter() can truncate the incoming skb,
      but the TCP stack was not expecting this to happen.
      It probably expected a simple DROP or ACCEPT behavior.
      
      We first need to make sure no part of the TCP header can be removed.
      Then we need to adjust TCP_SKB_CB(skb)->end_seq.
      
      Many thanks to syzkaller team and Marco for giving us a reproducer.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Marco Grassi <marco.gra@gmail.com>
      Reported-by: Vladis Dronov <vdronov@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
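The two steps named in the commit above (never trim into the TCP header, then recompute the end sequence number) can be modelled in userspace C. This is a minimal sketch under stated assumptions, not the kernel code: `struct seg`, `trim()` and `end_seq()` are hypothetical names, and the segment is reduced to the few fields that matter here.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flattened view of a received TCP segment (not struct sk_buff). */
struct seg {
    uint32_t seq;      /* sequence number of the first byte */
    uint32_t len;      /* bytes currently in the buffer, TCP header included */
    uint8_t  doff;     /* TCP header length in 32-bit words */
    uint8_t  syn, fin; /* each flag consumes one sequence number */
};

/* A filter may shorten the segment, but never below the TCP header. */
static void trim(struct seg *s, uint32_t filter_len)
{
    uint32_t hdr = (uint32_t)s->doff * 4;
    uint32_t keep = filter_len < hdr ? hdr : filter_len;

    assert(s->len >= hdr); /* a valid segment always holds its full header */
    if (keep < s->len)
        s->len = keep;
}

/* Sequence number just past the last byte still present after trimming. */
static uint32_t end_seq(const struct seg *s)
{
    return s->seq + s->syn + s->fin + s->len - (uint32_t)s->doff * 4;
}
```

In this model, a 120-byte segment with a 20-byte header and `seq` 1000 ends at 1100; after `trim(&s, 60)` it ends at 1040, so a stale `end_seq` of 1100 would claim 60 payload bytes that are no longer in the buffer.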
    • S
      ipv4: use new_gw for redirect neigh lookup · 969447f2
      Stephen Suryaputra Lin committed
      In v2.6, ip_rt_redirect() calls arp_bind_neighbour() which returns 0
      and then the state of the neigh for the new_gw is checked. If the state
      isn't valid then the redirected route is deleted. This behavior is
      maintained up to v3.5.7 by check_peer_redirect() because rt->rt_gateway
      is assigned to peer->redirect_learned.a4 before calling
      ipv4_neigh_lookup().
      
      After commit 5943634f ("ipv4: Maintain redirect and PMTU info in
      struct rtable again."), ipv4_neigh_lookup() is performed without the
      rt_gateway assigned to the new_gw. In the case when rt_gateway (old_gw)
      isn't zero, the function uses it as the key. The neigh is most likely
      valid since the old_gw is the one that sends the ICMP redirect message.
      Then the new_gw is assigned to fib_nh_exception. The problem is that
      the new_gw ARP may never get resolved, and the traffic is blackholed.
      
      So, use the new_gw for neigh lookup.
      
      Changes from v1:
       - use __ipv4_neigh_lookup instead (per Eric Dumazet).
      
      Fixes: 5943634f ("ipv4: Maintain redirect and PMTU info in struct rtable again.")
      Signed-off-by: Stephen Suryaputra Lin <ssurya@ieee.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  12. 11 Nov, 2016 1 commit
  13. 10 Nov, 2016 2 commits
  14. 08 Nov, 2016 1 commit
    • A
      fib_trie: Correct /proc/net/route off by one error · fd0285a3
      Alexander Duyck committed
      The display of /proc/net/route has had a couple issues due to the fact that
      when I originally rewrote most of fib_trie I made it so that the iterator
      was tracking the next value to use instead of the current.
      
      In addition it had an off by 1 error where I was tracking the first piece
      of data as position 0, even though in reality that belonged to the
      SEQ_START_TOKEN.
      
      This patch updates the code so the iterator tracks the last reported
      position and key instead of the next expected position and key.  In
      addition it shifts things so that all of the leaves start at 1 instead of
      trying to report leaves starting with offset 0 as being valid.  With these
      two issues addressed this should resolve any off by one errors that were
      present in the display of /proc/net/route.
      
      Fixes: 25b97c01 ("ipv4: off-by-one in continuation handling in /proc/net/route")
      Cc: Andy Whitcroft <apw@canonical.com>
      Reported-by: Jason Baron <jbaron@akamai.com>
      Tested-by: Jason Baron <jbaron@akamai.com>
      Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
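The position layout described in the commit above (position 0 belongs to the header token, leaves start at position 1) can be illustrated with a small userspace model. This is only a sketch under stated assumptions: `entry_at()` and `read_chunk()` are hypothetical helpers, `ENTRY_HEADER` stands in for SEQ_START_TOKEN, and leaves are plain integers rather than trie nodes.

```c
#include <assert.h>
#include <stddef.h>

enum { ENTRY_HEADER = -1, ENTRY_END = -2 };

/* Position 0 is the header; leaf i (0-based) lives at position i + 1. */
static int entry_at(long pos, long nleaves)
{
    if (pos == 0)
        return ENTRY_HEADER;
    if (pos < 0 || pos > nleaves)
        return ENTRY_END;
    return (int)(pos - 1);
}

/*
 * Emit up to max entries starting at *pos. On return, *pos is the next
 * unread position, so a later call resumes without skipping or repeating.
 */
static size_t read_chunk(long *pos, long nleaves, int *out, size_t max)
{
    size_t n = 0;

    assert(pos && out);
    while (n < max) {
        int e = entry_at(*pos, nleaves);

        if (e == ENTRY_END)
            break;
        out[n++] = e;
        (*pos)++;
    }
    return n;
}
```

Reading three leaves in chunks of two yields the header and leaf 0 first, then leaves 1 and 2, then nothing, which is the behavior a reader of a file such as /proc/net/route expects when a dump spans several read() calls.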
  15. 04 Nov, 2016 4 commits
    • E
      tcp: fix return value for partial writes · 79d8665b
      Eric Dumazet committed
      After my commit, tcp_sendmsg() might restart its loop after
      processing socket backlog.
      
      If sk_err is set, we blindly return an error, even though we
      copied data from user space before.
      
      We should instead return number of bytes that could be copied,
      otherwise user space might resend data and corrupt the stream.
      
      This might happen if another thread is using recvmsg(MSG_ERRQUEUE)
      to process timestamps.
      
      Issue was diagnosed by Soheil and Willem, big kudos to them !
      
      Fixes: d41a69f1 ("tcp: make tcp_sendmsg() aware of socket backlog")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Cc: Soheil Hassas Yeganeh <soheil@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Tested-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
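The return-value rule from the commit above (report bytes already accepted before reporting a pending error) can be sketched as a tiny decision function. This is an illustrative userspace model, not the tcp_sendmsg() code: `send_result()` is a hypothetical helper and `sk_err` is taken as a positive errno value such as EPIPE.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

/*
 * Decide what a send routine should return.
 * copied: bytes already queued for transmission in this call
 * sk_err: pending socket error as a positive errno value, or 0
 */
static ssize_t send_result(size_t copied, int sk_err)
{
    assert(sk_err >= 0);
    if (copied > 0)
        return (ssize_t)copied; /* partial write: report progress first */
    if (sk_err)
        return -sk_err;         /* nothing queued: surface the error */
    return 0;
}
```

Returning the error when `copied` is nonzero would tell the caller that nothing was sent, so a retry would queue the same bytes twice; the error is still reported on the next call, when `copied` is 0.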
    • L
      ipv4: allow local fragmentation in ip_finish_output_gso() · 9ee6c5dc
      Lance Richardson committed
      Some configurations (e.g. geneve interface with default
      MTU of 1500 over an ethernet interface with 1500 MTU) result
      in the transmission of packets that exceed the configured MTU.
      While this should be considered to be a "bad" configuration,
      it is still allowed and should not result in the sending
      of packets that exceed the configured MTU.
      
      Fix by dropping the assumption in ip_finish_output_gso() that
      locally originated gso packets will never need fragmentation.
      Basic testing using iperf (observing CPU usage and bandwidth)
      has shown no measurable performance impact for traffic not
      requiring fragmentation.
      
      Fixes: c7ba65d7 ("net: ip: push gso skb forwarding handling down the stack")
      Reported-by: Jan Tluka <jtluka@redhat.com>
      Signed-off-by: Lance Richardson <lrichard@redhat.com>
      Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • E
      tcp: fix potential memory corruption · ac9e70b1
      Eric Dumazet committed
      Imagine initial value of max_skb_frags is 17, and last
      skb in write queue has 15 frags.
      
      Then max_skb_frags is lowered to 14 or a smaller value.
      
      tcp_sendmsg() will then be allowed to add additional page frags
      and eventually go past MAX_SKB_FRAGS, overflowing struct
      skb_shared_info.
      
      Fixes: 5f74f82e ("net:Add sysctl_max_skb_frags")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Hans Westgaard Ry <hans.westgaard.ry@oracle.com>
      Cc: Håkon Bugge <haakon.bugge@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
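The scenario in the commit above (15 frags already attached, limit lowered to 14) can be reproduced in a small model. The commit text does not show the diff, so the two comparison styles below are an assumption chosen to illustrate how such an overflow can arise; `append_frags()` and both predicates are hypothetical names, and 17 is the array size used in the commit's example.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_SKB_FRAGS 17 /* capacity of the frags array in this model */

/* Fragile check: fires only when the count equals the limit exactly. */
static bool full_if_equal(int nr_frags, int max_frags)
{
    return nr_frags == max_frags;
}

/* Robust check: also fires when the limit was lowered below the count. */
static bool full_if_at_least(int nr_frags, int max_frags)
{
    return nr_frags >= max_frags;
}

/*
 * Keep appending frags until the predicate says the buffer is full.
 * The attempt cap only stops the model from looping forever.
 */
static int append_frags(int nr_frags, int max_frags, bool (*full)(int, int))
{
    int attempts = 0;

    assert(nr_frags >= 0);
    while (!full(nr_frags, max_frags) && attempts++ < 2 * MAX_SKB_FRAGS)
        nr_frags++;
    return nr_frags;
}
```

With 15 frags and a limit of 14, the equality check never fires and the count runs past `MAX_SKB_FRAGS`, while the at-least check stops immediately at 15.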
    • W
      inet: fix sleeping inside inet_wait_for_connect() · 14135f30
      WANG Cong committed
      Andrey reported this kernel warning:
      
        WARNING: CPU: 0 PID: 4608 at kernel/sched/core.c:7724
        __might_sleep+0x14c/0x1a0 kernel/sched/core.c:7719
        do not call blocking ops when !TASK_RUNNING; state=1 set at
        [<ffffffff811f5a5c>] prepare_to_wait+0xbc/0x210
        kernel/sched/wait.c:178
        Modules linked in:
        CPU: 0 PID: 4608 Comm: syz-executor Not tainted 4.9.0-rc2+ #320
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
         ffff88006625f7a0 ffffffff81b46914 ffff88006625f818 0000000000000000
         ffffffff84052960 0000000000000000 ffff88006625f7e8 ffffffff81111237
         ffff88006aceac00 ffffffff00001e2c ffffed000cc4beff ffffffff84052960
        Call Trace:
         [<     inline     >] __dump_stack lib/dump_stack.c:15
         [<ffffffff81b46914>] dump_stack+0xb3/0x10f lib/dump_stack.c:51
         [<ffffffff81111237>] __warn+0x1a7/0x1f0 kernel/panic.c:550
         [<ffffffff8111132c>] warn_slowpath_fmt+0xac/0xd0 kernel/panic.c:565
         [<ffffffff811922fc>] __might_sleep+0x14c/0x1a0 kernel/sched/core.c:7719
         [<     inline     >] slab_pre_alloc_hook mm/slab.h:393
         [<     inline     >] slab_alloc_node mm/slub.c:2634
         [<     inline     >] slab_alloc mm/slub.c:2716
         [<ffffffff81508da0>] __kmalloc_track_caller+0x150/0x2a0 mm/slub.c:4240
         [<ffffffff8146be14>] kmemdup+0x24/0x50 mm/util.c:113
         [<ffffffff8388b2cf>] dccp_feat_clone_sp_val.part.5+0x4f/0xe0 net/dccp/feat.c:374
         [<     inline     >] dccp_feat_clone_sp_val net/dccp/feat.c:1141
         [<     inline     >] dccp_feat_change_recv net/dccp/feat.c:1141
         [<ffffffff8388d491>] dccp_feat_parse_options+0xaa1/0x13d0 net/dccp/feat.c:1411
         [<ffffffff83894f01>] dccp_parse_options+0x721/0x1010 net/dccp/options.c:128
         [<ffffffff83891280>] dccp_rcv_state_process+0x200/0x15b0 net/dccp/input.c:644
         [<ffffffff838b8a94>] dccp_v4_do_rcv+0xf4/0x1a0 net/dccp/ipv4.c:681
         [<     inline     >] sk_backlog_rcv ./include/net/sock.h:872
         [<ffffffff82b7ceb6>] __release_sock+0x126/0x3a0 net/core/sock.c:2044
         [<ffffffff82b7d189>] release_sock+0x59/0x1c0 net/core/sock.c:2502
         [<     inline     >] inet_wait_for_connect net/ipv4/af_inet.c:547
         [<ffffffff8316b2a2>] __inet_stream_connect+0x5d2/0xbb0 net/ipv4/af_inet.c:617
         [<ffffffff8316b8d5>] inet_stream_connect+0x55/0xa0 net/ipv4/af_inet.c:656
         [<ffffffff82b705e4>] SYSC_connect+0x244/0x2f0 net/socket.c:1533
         [<ffffffff82b72dd4>] SyS_connect+0x24/0x30 net/socket.c:1514
         [<ffffffff83fbf701>] entry_SYSCALL_64_fastpath+0x1f/0xc2
        arch/x86/entry/entry_64.S:209
      
      Unlike the case in commit 26cabd31
      ("sched, net: Clean up sk_wait_event() vs. might_sleep()"), the
      sleeping function here is called before schedule_timeout(), so this
      is indeed a bug. Fix it by moving the wait logic to the new API,
      similar to commit ff960a73
      ("netdev, sched/wait: Fix sleeping inside wait event").
      Reported-by: Andrey Konovalov <andreyknvl@google.com>
      Cc: Andrey Konovalov <andreyknvl@google.com>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  16. 01 Nov, 2016 1 commit
    • F
      dctcp: avoid bogus doubling of cwnd after loss · ce6dd233
      Florian Westphal committed
      If a congestion control module doesn't provide .undo_cwnd function,
      tcp_undo_cwnd_reduction() will set cwnd to
      
         tp->snd_cwnd = max(tp->snd_cwnd, tp->snd_ssthresh << 1);
      
      ... which makes sense for reno (it sets ssthresh to half the current cwnd),
      but it makes no sense for dctcp, which sets ssthresh based on the current
      congestion estimate.
      
      This can cause severe growth of cwnd (eventually overflowing u32).
      
      Fix this by saving last cwnd on loss and restore cwnd based on that,
      similar to cubic and other algorithms.
      
      Fixes: e3118e83 ("net: tcp: add DCTCP congestion control algorithm")
      Cc: Lawrence Brakmo <brakmo@fb.com>
      Cc: Andrew Shewmaker <agshew@gmail.com>
      Cc: Glenn Judd <glenn.judd@morganstanley.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
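The difference between the fallback undo quoted in the commit above and the save-and-restore approach it describes can be modelled in a few lines. This is a minimal userspace sketch, not the kernel code: `struct cc_state`, `on_loss()`, `undo_cwnd_default()` and `undo_cwnd_saved()` are hypothetical names, and `on_loss()` simplifies the reaction to loss to a single assignment.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified congestion state (not struct tcp_sock). */
struct cc_state {
    uint32_t snd_cwnd;     /* current congestion window */
    uint32_t snd_ssthresh; /* slow start threshold */
    uint32_t loss_cwnd;    /* cwnd remembered when loss was detected */
};

static uint32_t max_u32(uint32_t a, uint32_t b)
{
    return a > b ? a : b;
}

/* Simplified loss reaction: remember cwnd, then shrink to the new ssthresh. */
static void on_loss(struct cc_state *s, uint32_t new_ssthresh)
{
    assert(new_ssthresh > 0);
    s->loss_cwnd = s->snd_cwnd;
    s->snd_ssthresh = new_ssthresh;
    s->snd_cwnd = new_ssthresh;
}

/* Fallback from the commit message: only right if ssthresh == cwnd / 2. */
static void undo_cwnd_default(struct cc_state *s)
{
    s->snd_cwnd = max_u32(s->snd_cwnd, s->snd_ssthresh << 1);
}

/* Restore from the value saved at loss time instead of guessing. */
static void undo_cwnd_saved(struct cc_state *s)
{
    s->snd_cwnd = max_u32(s->snd_cwnd, s->loss_cwnd);
}
```

If cwnd is 100 and a mild congestion estimate sets ssthresh to 90, the fallback undo yields 180 while the saved-value undo returns to 100. Repeating the fallback on every spurious loss is what lets cwnd grow without bound.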
  17. 31 Oct, 2016 1 commit
  18. 27 Oct, 2016 1 commit
    • E
      udp: fix IP_CHECKSUM handling · 10df8e61
      Eric Dumazet committed
      First bug was added in commit ad6f939a ("ip: Add offset parameter to
      ip_cmsg_recv") : Tom missed that ipv4 udp messages could be received on
      AF_INET6 socket. ip_cmsg_recv(msg, skb) should have been replaced by
      ip_cmsg_recv_offset(msg, skb, sizeof(struct udphdr));
      
      Then commit e6afc8ac ("udp: remove headers from UDP packets before
      queueing") forgot to adjust the offsets now that UDP headers are pulled
      before skbs are put in the receive queue.
      
      Fixes: ad6f939a ("ip: Add offset parameter to ip_cmsg_recv")
      Fixes: e6afc8ac ("udp: remove headers from UDP packets before queueing")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Sam Kumar <samanthakumar@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Tested-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  19. 23 Oct, 2016 1 commit
    • W
      ipv4: use the right lock for ping_group_range · 396a30cc
      WANG Cong committed
      This reverts commit a681574c
      ("ipv4: disable BH in set_ping_group_range()") because we never
      read ping_group_range in BH context (unlike local_port_range).
      
      Then, since we already have a lock for ping_group_range, the places
      that use ip_local_ports.lock for ping_group_range are clearly typos.
      
      We might consider sharing the same lock for both ping_group_range
      and local_port_range to save space, but that should be for
      net-next.
      
      Fixes: a681574c ("ipv4: disable BH in set_ping_group_range()")
      Fixes: ba6b918a ("ping: move ping_group_range out of CONFIG_SYSCTL")
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Eric Salo <salo@google.com>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  20. 21 Oct, 2016 3 commits
  21. 19 Oct, 2016 1 commit
  22. 18 Oct, 2016 2 commits
  23. 17 Oct, 2016 1 commit
    • D
      net: Require exact match for TCP socket lookups if dif is l3mdev · a04a480d
      David Ahern committed
      Currently, socket lookups for l3mdev (vrf) use cases can match a socket
      that is bound to a port but not a device (i.e., a global socket). If the
      sysctl tcp_l3mdev_accept is not set, this leads to ack packets going out
      based on the main table even though the packet came in from an L3 domain.
      The end result is that the connection is never established, which
      confuses users since the service is running and a socket shows in
      ss output. Fix by requiring an exact dif to sk_bound_dev_if match if the
      skb came through an interface enslaved to an l3mdev device and
      tcp_l3mdev_accept is not set.
      
      skb's through an l3mdev interface are marked by setting a flag in
      inet{6}_skb_parm. The IPv6 variant is already set; this patch adds the
      flag for IPv4. Using an skb flag avoids a device lookup on the dif. The
      flag is set in the VRF driver using the IP{6}CB macros. For IPv4, the
      inet_skb_parm struct is moved in the cb per commit 971f10ec, so the
      match function in the TCP stack needs to use TCP_SKB_CB. For IPv6, the
      move is done after the socket lookup, so IP6CB is used.
      
      The flags field in inet_skb_parm struct needs to be increased to add
      another flag. There is currently a 1-byte hole following the flags,
      so it can be expanded to u16 without increasing the size of the struct.
      
      Fixes: 193125db ("net: Introduce VRF device driver")
      Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
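The matching rule described in the commit above can be written down as a small predicate. This is a hedged userspace model, not the kernel's lookup code: `socket_matches()` and its parameters are hypothetical, with `skb_from_l3mdev` standing in for the skb flag and `tcp_l3mdev_accept` for the sysctl value.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Does a listening socket match a packet that arrived on interface dif?
 * sk_bound_dev_if == 0 means the socket is not bound to any device.
 */
static bool socket_matches(int sk_bound_dev_if, int dif,
                           bool skb_from_l3mdev, bool tcp_l3mdev_accept)
{
    bool exact = skb_from_l3mdev && !tcp_l3mdev_accept;

    assert(dif > 0); /* an incoming interface index is always positive */
    if (exact)
        return sk_bound_dev_if == dif;
    /* Otherwise a global socket, or one bound to dif, is acceptable. */
    return sk_bound_dev_if == 0 || sk_bound_dev_if == dif;
}
```

In this model a global socket no longer matches traffic from an l3mdev interface unless `tcp_l3mdev_accept` is set, while a socket bound to the incoming device matches in every case.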
  24. 14 Oct, 2016 1 commit
  25. 08 Oct, 2016 2 commits
  26. 04 Oct, 2016 1 commit
    • A
      skb_splice_bits(): get rid of callback · 25869262
      Al Viro committed
      since pipe_lock is the outermost now, we don't need to drop/regain
      socket locks around the call of splice_to_pipe() from skb_splice_bits(),
      which kills the need to have a socket-specific callback; we can just
      call splice_to_pipe() and be done with that.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  27. 30 Sep, 2016 1 commit