1. 08 10月, 2017 5 次提交
    • W
      ipv6: hook up exception table to store dst cache · 2b760fcf
      Wei Wang 提交于
      This commit makes use of the exception hash table implementation to
      store dst caches created by pmtu discovery and ip redirect into the hash
      table under the rt_info and no longer inserts these routes into fib6
      tree.
      This makes the fib6 tree only contain static configured routes and could
      now be protected by rcu instead of a rw lock.
      With this change, in the route lookup related functions, after finding
      the rt6_info with the longest prefix, we also need to search for the
      exception table before doing backtracking.
      In the route delete function, if the route being deleted is not a dst
      cache, deletion of this route also need to flush the whole hash table
      under it. If it is a dst cache, then only delete the cached dst in the
      hash table.
      
      Note: for fib6_walk_continue() function, w->root now is always pointing
      to a root node considering that fib6_prune_clones() is removed from the
      code. So we add a WARN_ON() msg to make sure w->root always points to a
      root node and also removed the update of w->root in fib6_repair_tree().
      This is a prerequisite for later patch because we don't need to make
      w->root as rcu protected when replacing rwlock with RCU.
      Also, we remove all prune related variables as it is no longer used.
      Signed-off-by: NWei Wang <weiwan@google.com>
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2b760fcf
    • W
      ipv6: prepare fib6_locate() for exception table · 38fbeeee
      Wei Wang 提交于
      fib6_locate() is used to find the fib6_node according to the passed in
      prefix address key. It currently tries to find the fib6_node with the
      exact match of the passed in key. However, when we move cached routes
      into the exception table, fib6_locate() will fail to find the fib6_node
      for it as the cached routes will be stored in the exception table under
      the fib6_node with the longest prefix match of the cache's dst addr key.
      This commit adds a new parameter to let the caller specify if it needs
      exact match or longest prefix match.
      Right now, all callers still does exact match when calling
      fib6_locate(). It will be changed in later commit where exception table
      is hooked up to store cached routes.
      Signed-off-by: NWei Wang <weiwan@google.com>
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      38fbeeee
    • W
      ipv6: prepare fib6_age() for exception table · c757faa8
      Wei Wang 提交于
      If all dst cache entries are stored in the exception table under the
      main route, we have to go through them during fib6_age() when doing
      garbage collecting.
      Introduce a new function rt6_age_exception() which goes through all dst
      entries in the exception table and remove those entries that are expired.
      This function is called in fib6_age() so that all dst caches are also
      garbage collected.
      Signed-off-by: NWei Wang <weiwan@google.com>
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c757faa8
    • W
      ipv6: introduce a hash table to store dst cache · 35732d01
      Wei Wang 提交于
      Add a hash table into struct rt6_info in order to store dst caches
      created by pmtu discovery and ip redirect in ipv6 routing code.
      APIs to add dst cache, delete dst cache, find dst cache and update
      dst cache in the hash table are implemented and will be used in later
      commits.
      This is a preparation work to move all cache routes into the exception
      table instead of getting inserted into the fib6 tree.
      Signed-off-by: NWei Wang <weiwan@google.com>
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      35732d01
    • W
      ipv6: introduce a new function fib6_update_sernum() · 180ca444
      Wei Wang 提交于
      This function takes a route as input and tries to update the sernum in
      the fib6_node this route is associated with. It will be used in later
      commit when adding a cached route into the exception table under that
      route.
      Signed-off-by: NWei Wang <weiwan@google.com>
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      180ca444
  2. 07 10月, 2017 3 次提交
    • E
      tcp: implement rb-tree based retransmit queue · 75c119af
      Eric Dumazet 提交于
      Using a linear list to store all skbs in write queue has been okay
      for quite a while : O(N) is not too bad when N < 500.
      
      Things get messy when N is the order of 100,000 : Modern TCP stacks
      want 10Gbit+ of throughput even with 200 ms RTT flows.
      
      40 ns per cache line miss means a full scan can use 4 ms,
      blowing away CPU caches.
      
      SACK processing often can use various hints to avoid parsing
      whole retransmit queue. But with high packet losses and/or high
      reordering, hints no longer work.
      
      Sender has to process thousands of unfriendly SACK, accumulating
      a huge socket backlog, burning a cpu and massively dropping packets.
      
      Using an rb-tree for retransmit queue has been avoided for years
      because it added complexity and overhead, but now is the time
      to be more resistant and say no to quadratic behavior.
      
      1) RTX queue is no longer part of the write queue : already sent skbs
      are stored in one rb-tree.
      
      2) Since reaching the head of write queue no longer needs
      sk->sk_send_head, we added an union of sk_send_head and tcp_rtx_queue
      
      Tested:
      
       On receiver :
       netem on ingress : delay 150ms 200us loss 1
       GRO disabled to force stress and SACK storms.
      
      for f in `seq 1 10`
      do
       ./netperf -H lpaa6 -l30 -- -K bbr -o THROUGHPUT|tail -1
      done | awk '{print $0} {sum += $0} END {printf "%7u\n",sum}'
      
      Before patch :
      
      323.87
      351.48
      339.59
      338.62
      306.72
      204.07
      304.93
      291.88
      202.47
      176.88
         2840
      
      After patch:
      
      1700.83
      2207.98
      2070.17
      1544.26
      2114.76
      2124.89
      1693.14
      1080.91
      2216.82
      1299.94
        18053
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      75c119af
    • E
      tcp: uninline tcp_write_queue_purge() · ac3f09ba
      Eric Dumazet 提交于
      Since the upcoming rtx rbtree will add some extra code,
      it is time to not inline this fat function anymore.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ac3f09ba
    • J
      net/ipv6: Convert icmpv6_push_pending_frames to void · 4e64b1ed
      Joe Perches 提交于
      commit cc71b7b0 ("net/ipv6: remove unused err variable on
      icmpv6_push_pending_frames") exposed icmpv6_push_pending_frames
      return value not being used.
      
      Remove now unnecessary int err declarations and uses.
      
      Miscellanea:
      
      o Remove unnecessary goto and out: labels
      o Realign arguments
      Signed-off-by: NJoe Perches <joe@perches.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4e64b1ed
  3. 06 10月, 2017 5 次提交
    • E
      tcp: new list for sent but unacked skbs for RACK recovery · e2080072
      Eric Dumazet 提交于
      This patch adds a new queue (list) that tracks the sent but not yet
      acked or SACKed skbs for a TCP connection. The list is chronologically
      ordered by skb->skb_mstamp (the head is the oldest sent skb).
      
      This list will be used to optimize TCP Rack recovery, which checks
      an skb's timestamp to judge if it has been lost and needs to be
      retransmitted. Since TCP write queue is ordered by sequence instead
      of sent time, RACK has to scan over the write queue to catch all
      eligible packets to detect lost retransmission, and iterates through
      SACKed skbs repeatedly.
      
      Special cares for rare events:
      1. TCP repair fakes skb transmission so the send queue needs adjusted
      2. SACK reneging would require re-inserting SACKed skbs into the
         send queue. For now I believe it's not worth the complexity to
         make RACK work perfectly on SACK reneging, so we do nothing here.
      3. Fast Open: currently for non-TFO, send-queue correctly queues
         the pure SYN packet. For TFO which queues a pure SYN and
         then a data packet, send-queue only queues the data packet but
         not the pure SYN due to the structure of TFO code. This is okay
         because the SYN receiver would never respond with a SACK on a
         missing SYN (i.e. SYN is never fast-retransmitted by SACK/RACK).
      
      In order to not grow sk_buff, we use an union for the new list and
      _skb_refdst/destructor fields. This is a bit complicated because
      we need to make sure _skb_refdst and destructor are properly zeroed
      before skb is cloned/copied at transmit, and before being freed.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e2080072
    • W
      tcp: uniform the set up of sockets after successful connection · 27204aaa
      Wei Wang 提交于
      Currently in the TCP code, the initialization sequence for cached
      metrics, congestion control, BPF, etc, after successful connection
      is very inconsistent. This introduces inconsistent bevhavior and is
      prone to bugs. The current call sequence is as follows:
      
      (1) for active case (tcp_finish_connect() case):
              tcp_mtup_init(sk);
              icsk->icsk_af_ops->rebuild_header(sk);
              tcp_init_metrics(sk);
              tcp_call_bpf(sk, BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB);
              tcp_init_congestion_control(sk);
              tcp_init_buffer_space(sk);
      
      (2) for passive case (tcp_rcv_state_process() TCP_SYN_RECV case):
              icsk->icsk_af_ops->rebuild_header(sk);
              tcp_call_bpf(sk, BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB);
              tcp_init_congestion_control(sk);
              tcp_mtup_init(sk);
              tcp_init_buffer_space(sk);
              tcp_init_metrics(sk);
      
      (3) for TFO passive case (tcp_fastopen_create_child()):
              inet_csk(child)->icsk_af_ops->rebuild_header(child);
              tcp_init_congestion_control(child);
              tcp_mtup_init(child);
              tcp_init_metrics(child);
              tcp_call_bpf(child, BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB);
              tcp_init_buffer_space(child);
      
      This commit uniforms the above functions to have the following sequence:
              tcp_mtup_init(sk);
              icsk->icsk_af_ops->rebuild_header(sk);
              tcp_init_metrics(sk);
              tcp_call_bpf(sk, BPF_SOCK_OPS_ACTIVE/PASSIVE_ESTABLISHED_CB);
              tcp_init_congestion_control(sk);
              tcp_init_buffer_space(sk);
      This sequence is the same as the (1) active case. We pick this sequence
      because this order correctly allows BPF to override the settings
      including congestion control module and initial cwnd, etc from
      the route, and then allows the CC module to see those settings.
      Suggested-by: NNeal Cardwell <ncardwell@google.com>
      Tested-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NWei Wang <weiwan@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Acked-by: NYuchung Cheng <ycheng@google.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      27204aaa
    • S
      VSOCK: use TCP state constants for sk_state · 3b4477d2
      Stefan Hajnoczi 提交于
      There are two state fields: socket->state and sock->sk_state.  The
      socket->state field uses SS_UNCONNECTED, SS_CONNECTED, etc while the
      sock->sk_state typically uses values that match TCP state constants
      (TCP_CLOSE, TCP_ESTABLISHED).  AF_VSOCK does not follow this convention
      and instead uses SS_* constants for both fields.
      
      The sk_state field will be exposed to userspace through the vsock_diag
      interface for ss(8), netstat(8), and other programs.
      
      This patch switches sk_state to TCP state constants so that the meaning
      of this field is consistent with other address families.  Not just
      AF_INET and AF_INET6 use the TCP constants, AF_UNIX and others do too.
      
      The following mapping was used to convert the code:
      
        SS_FREE -> TCP_CLOSE
        SS_UNCONNECTED -> TCP_CLOSE
        SS_CONNECTING -> TCP_SYN_SENT
        SS_CONNECTED -> TCP_ESTABLISHED
        SS_DISCONNECTING -> TCP_CLOSING
        VSOCK_SS_LISTEN -> TCP_LISTEN
      
      In __vsock_create() the sk_state initialization was dropped because
      sock_init_data() already initializes sk_state to TCP_CLOSE.
      Signed-off-by: NStefan Hajnoczi <stefanha@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3b4477d2
    • S
      VSOCK: move __vsock_in_bound/connected_table() to af_vsock.h · bf359b81
      Stefan Hajnoczi 提交于
      The vsock_diag.ko module will need to check socket table membership.
      Signed-off-by: NStefan Hajnoczi <stefanha@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      bf359b81
    • S
      VSOCK: export socket tables for sock_diag interface · 44f20980
      Stefan Hajnoczi 提交于
      The socket table symbols need to be exported from vsock.ko so that the
      vsock_diag.ko module will be able to traverse sockets.
      Signed-off-by: NStefan Hajnoczi <stefanha@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      44f20980
  4. 05 10月, 2017 3 次提交
  5. 04 10月, 2017 5 次提交
    • M
      sctp: introduce round robin stream scheduler · ac1ed8b8
      Marcelo Ricardo Leitner 提交于
      This patch introduces RFC Draft ndata section 3.2 Priority Based
      Scheduler (SCTP_SS_RR).
      
      Works by maintaining a list of enqueued streams and tracking the last
      one used to send data. When the datamsg is done, it switches to the next
      stream.
      
      See-also: https://tools.ietf.org/html/draft-ietf-tsvwg-sctp-ndata-13Tested-by: NXin Long <lucien.xin@gmail.com>
      Signed-off-by: NMarcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ac1ed8b8
    • M
      sctp: introduce priority based stream scheduler · 637784ad
      Marcelo Ricardo Leitner 提交于
      This patch introduces RFC Draft ndata section 3.4 Priority Based
      Scheduler (SCTP_SS_PRIO).
      
      It works by having a struct sctp_stream_priority for each priority
      configured. This struct is then enlisted on a queue ordered per priority
      if, and only if, there is a stream with data queued, so that dequeueing
      is very straightforward: either finish current datamsg or simply dequeue
      from the highest priority queued, which is the next stream pointed, and
      that's it.
      
      If there are multiple streams assigned with the same priority and with
      data queued, it will do round robin amongst them while respecting
      datamsgs boundaries (when not using idata chunks), to be reasonably
      fair.
      
      We intentionally don't maintain a list of priorities nor a list of all
      streams with the same priority to save memory. The first would mean at
      least 2 other pointers per priority (which, for 1000 priorities, that
      can mean 16kB) and the second would also mean 2 other pointers but per
      stream. As SCTP supports up to 65535 streams on a given asoc, that's
      1MB. This impacts when giving a priority to some stream, as we have to
      find out if the new priority is already being used and if we can free
      the old one, and also when tearing down.
      
      The new fields in struct sctp_stream_out_ext and sctp_stream are added
      under a union because that memory is to be shared with other schedulers.
      
      See-also: https://tools.ietf.org/html/draft-ietf-tsvwg-sctp-ndata-13Tested-by: NXin Long <lucien.xin@gmail.com>
      Signed-off-by: NMarcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      637784ad
    • M
      sctp: introduce stream scheduler foundations · 5bbbbe32
      Marcelo Ricardo Leitner 提交于
      This patch introduces the hooks necessary to do stream scheduling, as
      per RFC Draft ndata.  It also introduces the first scheduler, which is
      what we do today but now factored out: first come first served (FCFS).
      
      With stream scheduling now we have to track which chunk was enqueued on
      which stream and be able to select another other than the in front of
      the main outqueue. So we introduce a list on sctp_stream_out_ext
      structure for this purpose.
      
      We reuse sctp_chunk->transmitted_list space for the list above, as the
      chunk cannot belong to the two lists at the same time. By using the
      union in there, we can have distinct names for these moments.
      
      sctp_sched_ops are the operations expected to be implemented by each
      scheduler. The dequeueing is a bit particular to this implementation but
      it is to match how we dequeue packets today. We first dequeue and then
      check if it fits the packet and if not, we requeue it at head. Thus why
      we don't have a peek operation but have dequeue_done instead, which is
      called once the chunk can be safely considered as transmitted.
      
      The check removed from sctp_outq_flush is now performed by
      sctp_stream_outq_migrate, which is only called during assoc setup.
      (sctp_sendmsg() also checks for it)
      
      The only operation that is foreseen but not yet added here is a way to
      signalize that a new packet is starting or that the packet is done, for
      round robin scheduler per packet, but is intentionally left to the
      patch that actually implements it.
      
      Support for I-DATA chunks, also described in this RFC, with user message
      interleaving is straightforward as it just requires the schedulers to
      probe for the feature and ignore datamsg boundaries when dequeueing.
      
      See-also: https://tools.ietf.org/html/draft-ietf-tsvwg-sctp-ndata-13Tested-by: NXin Long <lucien.xin@gmail.com>
      Signed-off-by: NMarcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5bbbbe32
    • M
      sctp: introduce sctp_chunk_stream_no · 2fc019f7
      Marcelo Ricardo Leitner 提交于
      Add a helper to fetch the stream number from a given chunk.
      Tested-by: NXin Long <lucien.xin@gmail.com>
      Signed-off-by: NMarcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2fc019f7
    • M
      sctp: introduce struct sctp_stream_out_ext · f952be79
      Marcelo Ricardo Leitner 提交于
      With the stream schedulers, sctp_stream_out will become too big to be
      allocated by kmalloc and as we need to allocate with BH disabled, we
      cannot use __vmalloc in sctp_stream_init().
      
      This patch moves out the stats from sctp_stream_out to
      sctp_stream_out_ext, which will be allocated only when the application
      tries to sendmsg something on it.
      
      Just the introduction of sctp_stream_out_ext would already fix the issue
      described above by splitting the allocation in two. Moving the stats
      to it also reduces the pressure on the allocator as we will ask for less
      memory atomically when creating the socket and we will use GFP_KERNEL
      later.
      
      Then, for stream schedulers, we will just use sctp_stream_out_ext.
      Tested-by: NXin Long <lucien.xin@gmail.com>
      Signed-off-by: NMarcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f952be79
  6. 03 10月, 2017 1 次提交
  7. 02 10月, 2017 5 次提交
  8. 01 10月, 2017 5 次提交
  9. 29 9月, 2017 1 次提交
  10. 28 9月, 2017 2 次提交
  11. 27 9月, 2017 1 次提交
  12. 26 9月, 2017 3 次提交
    • A
      neigh: make strucrt neigh_table::entry_size unsigned int · 01ccdf12
      Alexey Dobriyan 提交于
      Key length can't be negative.
      
      Leave comparisons against nla_len() signed just in case truncated attribute
      can sneak in there.
      
      Space savings:
      
      	add/remove: 0/0 grow/shrink: 0/7 up/down: 0/-7 (-7)
      	function                                     old     new   delta
      	pneigh_delete                                273     272      -1
      	mlx5e_rep_netevent_event                    1415    1414      -1
      	mlx5e_create_encap_header_ipv6              1194    1193      -1
      	mlx5e_create_encap_header_ipv4              1071    1070      -1
      	cxgb4_l2t_get                               1104    1103      -1
      	__pneigh_lookup                               69      68      -1
      	__neigh_create                              2452    2451      -1
      Signed-off-by: NAlexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      01ccdf12
    • A
      neigh: make struct neigh_table::entry_size unsigned int · e451ae8e
      Alexey Dobriyan 提交于
      Neigh entry size can't be negative.
      
      Space savings:
      
      	add/remove: 0/0 grow/shrink: 0/5 up/down: 0/-7 (-7)
      	function                                     old     new   delta
      	lowpan_neigh_construct                        25      24      -1
      	clip_seq_sub_iter                            152     151      -1
      	clip_ioctl                                  1475    1474      -1
      	clip_constructor                              93      92      -1
      	__neigh_create                              2455    2452      -3
      Signed-off-by: NAlexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e451ae8e
    • A
      netlink: fix nla_put_{u8,u16,u32} for KASAN · b4391db4
      Arnd Bergmann 提交于
      When CONFIG_KASAN is enabled, the "--param asan-stack=1" causes rather large
      stack frames in some functions. This goes unnoticed normally because
      CONFIG_FRAME_WARN is disabled with CONFIG_KASAN by default as of commit
      3f181b4d ("lib/Kconfig.debug: disable -Wframe-larger-than warnings with
      KASAN=y").
      
      The kernelci.org build bot however has the warning enabled and that led
      me to investigate it a little further, as every build produces these warnings:
      
      net/wireless/nl80211.c:4389:1: warning: the frame size of 2240 bytes is larger than 2048 bytes [-Wframe-larger-than=]
      net/wireless/nl80211.c:1895:1: warning: the frame size of 3776 bytes is larger than 2048 bytes [-Wframe-larger-than=]
      net/wireless/nl80211.c:1410:1: warning: the frame size of 2208 bytes is larger than 2048 bytes [-Wframe-larger-than=]
      net/bridge/br_netlink.c:1282:1: warning: the frame size of 2544 bytes is larger than 2048 bytes [-Wframe-larger-than=]
      
      Most of this problem is now solved in gcc-8, which can consolidate
      the stack slots for the inline function arguments. On older compilers
      we can add a workaround by declaring a local variable in each function
      to pass the inline function argument.
      
      Cc: stable@vger.kernel.org
      Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81715Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b4391db4
  13. 22 9月, 2017 1 次提交
    • E
      net: prevent dst uses after free · 222d7dbd
      Eric Dumazet 提交于
      In linux-4.13, Wei worked hard to convert dst to a traditional
      refcounted model, removing GC.
      
      We now want to make sure a dst refcount can not transition from 0 back
      to 1.
      
      The problem here is that input path attached a not refcounted dst to an
      skb. Then later, because packet is forwarded and hits skb_dst_force()
      before exiting RCU section, we might try to take a refcount on one dst
      that is about to be freed, if another cpu saw 1 -> 0 transition in
      dst_release() and queued the dst for freeing after one RCU grace period.
      
      Lets unify skb_dst_force() and skb_dst_force_safe(), since we should
      always perform the complete check against dst refcount, and not assume
      it is not zero.
      
      Bugzilla : https://bugzilla.kernel.org/show_bug.cgi?id=197005
      
      [  989.919496]  skb_dst_force+0x32/0x34
      [  989.919498]  __dev_queue_xmit+0x1ad/0x482
      [  989.919501]  ? eth_header+0x28/0xc6
      [  989.919502]  dev_queue_xmit+0xb/0xd
      [  989.919504]  neigh_connected_output+0x9b/0xb4
      [  989.919507]  ip_finish_output2+0x234/0x294
      [  989.919509]  ? ipt_do_table+0x369/0x388
      [  989.919510]  ip_finish_output+0x12c/0x13f
      [  989.919512]  ip_output+0x53/0x87
      [  989.919513]  ip_forward_finish+0x53/0x5a
      [  989.919515]  ip_forward+0x2cb/0x3e6
      [  989.919516]  ? pskb_trim_rcsum.part.9+0x4b/0x4b
      [  989.919518]  ip_rcv_finish+0x2e2/0x321
      [  989.919519]  ip_rcv+0x26f/0x2eb
      [  989.919522]  ? vlan_do_receive+0x4f/0x289
      [  989.919523]  __netif_receive_skb_core+0x467/0x50b
      [  989.919526]  ? tcp_gro_receive+0x239/0x239
      [  989.919529]  ? inet_gro_receive+0x226/0x238
      [  989.919530]  __netif_receive_skb+0x4d/0x5f
      [  989.919532]  netif_receive_skb_internal+0x5c/0xaf
      [  989.919533]  napi_gro_receive+0x45/0x81
      [  989.919536]  ixgbe_poll+0xc8a/0xf09
      [  989.919539]  ? kmem_cache_free_bulk+0x1b6/0x1f7
      [  989.919540]  net_rx_action+0xf4/0x266
      [  989.919543]  __do_softirq+0xa8/0x19d
      [  989.919545]  irq_exit+0x5d/0x6b
      [  989.919546]  do_IRQ+0x9c/0xb5
      [  989.919548]  common_interrupt+0x93/0x93
      [  989.919548]  </IRQ>
      
      Similarly dst_clone() can use dst_hold() helper to have additional
      debugging, as a follow up to commit 44ebe791 ("net: add debug
      atomic_inc_not_zero() in dst_hold()")
      
      In net-next we will convert dst atomic_t to refcount_t for peace of
      mind.
      
      Fixes: a4c2fd7f ("net: remove DST_NOCACHE flag")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Wei Wang <weiwan@google.com>
      Reported-by: NPaweł Staszewski <pstaszewski@itcare.pl>
      Bisected-by: NPaweł Staszewski <pstaszewski@itcare.pl>
      Acked-by: NWei Wang <weiwan@google.com>
      Acked-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      222d7dbd