1. 18 Oct, 2017 2 commits
  2. 17 Oct, 2017 2 commits
  3. 15 Oct, 2017 1 commit
    • C
      tcp: add a tracepoint for tcp retransmission · e086101b
      Cong Wang committed
      We need a real-time notification for tcp retransmission
      for monitoring.
      
      Of course we could use ftrace to dynamically instrument this
      kernel function too, however we can't retrieve the connection
      information at the same time, for example perf-tools [1] reads
      /proc/net/tcp for socket details, which is slow when we have
      a lot of connections.
      
      Therefore, this patch adds a tracepoint for __tcp_retransmit_skb()
      and exposes src/dst IP addresses and ports of the connection.
      This also makes it easier to integrate into perf.
      
      Note, I expose both IPv4 and IPv6 addresses at the same time:
      for an IPv4 socket, the v4-mapped address is used as the IPv6 address;
      for an IPv6 socket, LOOPBACK4_IPV6 is already filled in by the kernel.
      Also, add sk and skb pointers as they are useful for BPF.
      
      1. https://github.com/brendangregg/perf-tools/blob/master/net/tcpretrans
      
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
      Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Cc: Brendan Gregg <brendan.d.gregg@gmail.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Brendan Gregg <bgregg@netflix.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e086101b
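
The v4-mapped form mentioned in the commit message above can be illustrated with a small userspace sketch. The helper names `v4_to_v4mapped` and `is_v4mapped` are hypothetical and are not taken from the patch; the byte layout (ten zero bytes, then 0xffff, then the IPv4 address) is the standard `::ffff:a.b.c.d` encoding.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: build the 16-byte IPv4-mapped IPv6 form
 * (::ffff:a.b.c.d) of an IPv4 address given as 4 bytes in network order. */
void v4_to_v4mapped(const uint8_t v4[4], uint8_t v6[16])
{
        assert(v4 != NULL && v6 != NULL);
        memset(v6, 0, 10);       /* bytes 0..9 are zero */
        v6[10] = 0xff;           /* bytes 10..11 hold the 0xffff marker */
        v6[11] = 0xff;
        memcpy(&v6[12], v4, 4);  /* bytes 12..15 carry the IPv4 address */
}

/* Hypothetical helper: return 1 if v6 holds an IPv4-mapped address. */
int is_v4mapped(const uint8_t v6[16])
{
        static const uint8_t prefix[12] = {
                0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0xff, 0xff
        };

        return memcmp(v6, prefix, sizeof(prefix)) == 0;
}
```

A consumer that receives both address fields could use a check like `is_v4mapped()` to decide whether to print the connection as IPv4 or IPv6.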
  4. 11 Oct, 2017 4 commits
  5. 05 Oct, 2017 8 commits
  6. 04 Oct, 2017 2 commits
  7. 03 Oct, 2017 2 commits
  8. 29 Sep, 2017 5 commits
    • C
      net: Set sk_prot_creator when cloning sockets to the right proto · 9d538fa6
      Christoph Paasch committed
      sk->sk_prot and sk->sk_prot_creator can differ when the app uses
      IPV6_ADDRFORM (transforming an IPv6-socket to an IPv4-one).
      That is why sk_prot_creator exists: it makes sure that sk_prot_free()
      does the kmem_cache_free() on the right kmem_cache slab.
      
      Now, if such a socket gets transformed back to a listening socket (using
      connect() with AF_UNSPEC) we will allocate an IPv4 tcp_sock through
      sk_clone_lock() when a new connection comes in. But sk_prot_creator will
      still point to the IPv6 kmem_cache (as everything got copied in
      sk_clone_lock()). When freeing, we will thus put this
      memory back into the IPv6 kmem_cache although it was allocated in the
      IPv4 cache. I have seen memory corruption happening because of this.
      
      With slub-debugging and MEMCG_KMEM enabled this gives the warning
      	"cache_from_obj: Wrong slab cache. TCPv6 but object is from TCP"
      
      A C-program to trigger this:
      
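      #include <arpa/inet.h>
      #include <netinet/in.h>
      #include <string.h>
      #include <sys/socket.h>
      #include <unistd.h>

      int main(void)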
      {
              int fd = socket(AF_INET6, SOCK_STREAM, IPPROTO_TCP);
              int new_fd, newest_fd, client_fd;
              struct sockaddr_in6 bind_addr;
              struct sockaddr_in bind_addr4, client_addr1, client_addr2;
              struct sockaddr unsp;
              int val;
      
              memset(&bind_addr, 0, sizeof(bind_addr));
              bind_addr.sin6_family = AF_INET6;
              bind_addr.sin6_port = htons(42424);
      
              memset(&client_addr1, 0, sizeof(client_addr1));
              client_addr1.sin_family = AF_INET;
              client_addr1.sin_port = htons(42424);
              client_addr1.sin_addr.s_addr = inet_addr("127.0.0.1");
      
              memset(&client_addr2, 0, sizeof(client_addr2));
              client_addr2.sin_family = AF_INET;
              client_addr2.sin_port = htons(42421);
              client_addr2.sin_addr.s_addr = inet_addr("127.0.0.1");
      
              memset(&unsp, 0, sizeof(unsp));
              unsp.sa_family = AF_UNSPEC;
      
              bind(fd, (struct sockaddr *)&bind_addr, sizeof(bind_addr));
      
              listen(fd, 5);
      
              client_fd = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
              connect(client_fd, (struct sockaddr *)&client_addr1, sizeof(client_addr1));
              new_fd = accept(fd, NULL, NULL);
              close(fd);
      
              val = AF_INET;
              setsockopt(new_fd, SOL_IPV6, IPV6_ADDRFORM, &val, sizeof(val));
      
              connect(new_fd, &unsp, sizeof(unsp));
      
              memset(&bind_addr4, 0, sizeof(bind_addr4));
              bind_addr4.sin_family = AF_INET;
              bind_addr4.sin_port = htons(42421);
              bind(new_fd, (struct sockaddr *)&bind_addr4, sizeof(bind_addr4));
      
              listen(new_fd, 5);
      
              client_fd = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
              connect(client_fd, (struct sockaddr *)&client_addr2, sizeof(client_addr2));
      
              newest_fd = accept(new_fd, NULL, NULL);
              close(new_fd);
      
              close(client_fd);
              close(newest_fd);
      }
      
      As far as I can see, this bug has been there since the beginning of the
      git-days.
      Signed-off-by: Christoph Paasch <cpaasch@apple.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9d538fa6
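
The mismatch described in the commit message above can be modelled in userspace. The following is only a toy sketch under stated assumptions: `toy_cache`, `toy_sock` and the `toy_*` functions are hypothetical stand-ins for kmem_cache, struct sock, and the allocate/clone/free paths, and the counters merely track how many objects each cache believes are live.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy stand-in for a kmem_cache: it only counts live objects. */
struct toy_cache { int live; };

/* Toy stand-in for struct sock with the two fields under discussion. */
struct toy_sock {
        struct toy_cache *prot;          /* protocol currently in use */
        struct toy_cache *prot_creator;  /* cache the object was taken from */
};

struct toy_sock *toy_alloc(struct toy_cache *c)
{
        struct toy_sock *sk = malloc(sizeof(*sk));

        assert(sk != NULL);
        sk->prot = c;
        sk->prot_creator = c;
        c->live++;
        return sk;
}

/* Clone: allocate from the parent's current prot, then copy the parent.
 * With fix_creator == 0 the parent's stale creator is kept (the bug). */
struct toy_sock *toy_clone(const struct toy_sock *parent, int fix_creator)
{
        struct toy_sock *sk = toy_alloc(parent->prot);

        *sk = *parent;                       /* copies prot_creator as well */
        if (fix_creator)
                sk->prot_creator = sk->prot; /* point at the real origin */
        return sk;
}

/* Free: the object is accounted back to whatever prot_creator says. */
void toy_free(struct toy_sock *sk)
{
        sk->prot_creator->live--;
        free(sk);
}
```

In this model, a listener allocated from the IPv6 cache and then switched to the IPv4 prot produces clones whose free is booked against the wrong cache unless `fix_creator` is set.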
    • F
      rtnetlink: rtnl_have_link_slave_info doesn't need rtnl · 4c82a95e
      Florian Westphal committed
      It can be switched to RCU.
      Reviewed-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4c82a95e
    • F
      b1e66b9a
    • F
      rtnetlink: add helpers to dump vf information · 250fc3df
      Florian Westphal committed
      Similar to earlier patches, split out more parts of this function to
      better see what is happening and where we assume rtnl is locked.
      Reviewed-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      250fc3df
    • F
      rtnetlink: add helper to put master and link ifindexes · 79110a04
      Florian Westphal committed
      rtnl_fill_ifinfo currently requires caller to hold the rtnl mutex.
      Unfortunately the function is quite large which makes it harder to see
      which spots require the lock, which spots assume it and which ones could
      do without.
      
      Add helpers to factor out the ifindex dumping; one of them can use RCU
      to avoid the rtnl dependency.
      Reviewed-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      79110a04
  9. 27 Sep, 2017 3 commits
    • D
      bpf: add meta pointer for direct access · de8f3a83
      Daniel Borkmann committed
      This work enables generic transfer of metadata from XDP into skb. The
      basic idea is that we can make use of the fact that the resulting skb
      must be linear and already comes with a larger headroom for supporting
      bpf_xdp_adjust_head(), which mangles xdp->data. Here, we base our work
      on a similar principle and introduce a small helper bpf_xdp_adjust_meta()
      for adjusting a new pointer called xdp->data_meta. Thus, the packet has
      a flexible and programmable room for meta data, followed by the actual
      packet data. struct xdp_buff is therefore laid out such that we first
      point to data_hard_start, then data_meta directly prepended to data,
      followed by data_end marking the end of the packet.
      bpf_xdp_adjust_head() takes into account whether we have meta data
      already prepended and, if so, memmove()s this along with the given
      offset, provided there's enough room.
      
      xdp->data_meta is optional and programs are not required to use it. The
      rationale is that when we process the packet in XDP (e.g. as DoS filter),
      we can push further meta data along with it for the XDP_PASS case, and
      give the guarantee that a clsact ingress BPF program on the same device
      can pick this up for further post-processing. Since we work with skb
      there, we can also set skb->mark, skb->priority or other skb meta data
      out of BPF, thus having this scratch space generic and programmable
      allows for more flexibility than defining a direct 1:1 transfer of
      potentially new XDP members into skb (it's also more efficient as we
      don't need to initialize/handle each of such new members). The facility
      also works together with GRO aggregation. The scratch space at the head
      of the packet can be a multiple of 4 bytes, up to 32 bytes large. Drivers not
      yet supporting xdp->data_meta can simply be set up with xdp->data_meta
      as xdp->data + 1 as bpf_xdp_adjust_meta() will detect this and bail out,
      such that the subsequent match against xdp->data for later access is
      guaranteed to fail.
      
      The verifier treats xdp->data_meta/xdp->data the same way as we treat
      xdp->data/xdp->data_end pointer comparisons. The requirement for doing
      the compare against xdp->data is that it hasn't been modified from its
      original address we got from ctx access. It may have a range marking
      already from prior successful xdp->data/xdp->data_end pointer comparisons
      though.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      de8f3a83
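
The pointer layout and the size rules stated in the commit message above (multiple of 4 bytes, at most 32 bytes, data_meta set to data + 1 on drivers without support) can be sketched as a simplified userspace model. `toy_xdp_buff` and `toy_adjust_meta` are hypothetical names, and the single -1 error return is an assumption made for brevity; the kernel helper's actual error codes are not reproduced here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_META_MAX 32  /* upper bound on metadata size, per the text */

/* Simplified model of the pointers:
 * data_hard_start <= data_meta <= data < data_end */
struct toy_xdp_buff {
        uint8_t *data_hard_start;
        uint8_t *data_meta;
        uint8_t *data;
        uint8_t *data_end;
};

/* Move data_meta by offset bytes (a negative offset grows the metadata
 * area). Returns 0 on success and -1 if the request is rejected. */
int toy_adjust_meta(struct toy_xdp_buff *xdp, int offset)
{
        ptrdiff_t cur, room, pos, metalen;

        assert(xdp != NULL);
        if (xdp->data_meta > xdp->data)   /* data + 1: driver lacks support */
                return -1;

        cur = xdp->data_meta - xdp->data_hard_start;
        room = xdp->data - xdp->data_hard_start;
        pos = cur + offset;
        if (pos < 0 || pos > room)        /* must stay inside the headroom */
                return -1;

        metalen = room - pos;
        if ((metalen & 3) || metalen > TOY_META_MAX)
                return -1;                /* multiple of 4, at most 32 bytes */

        xdp->data_meta = xdp->data_hard_start + pos;
        return 0;
}
```

Working with offsets relative to data_hard_start rather than with raw pointer arithmetic keeps the model free of out-of-bounds pointer computations.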
    • D
      bpf: rename bpf_compute_data_end into bpf_compute_data_pointers · 6aaae2b6
      Daniel Borkmann committed
      Just do the rename into bpf_compute_data_pointers() as we'll add
      one more pointer here to recompute.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6aaae2b6
    • T
      datagram: Remove redundant unlikely() · 98e4fcff
      Tobias Klauser committed
      IS_ERR() already implies unlikely(), so it can be omitted.
      Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      98e4fcff
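
Why an extra unlikely() around IS_ERR() is redundant can be seen from a userspace model of the error-pointer idiom. This is a sketch, not the kernel headers: the `toy_` names are hypothetical, the 4095 limit mirrors the kernel's MAX_ERRNO convention, and `__builtin_expect` is assumed to be available (GCC or Clang).

```c
#include <assert.h>
#include <stdint.h>

#define TOY_MAX_ERRNO 4095
/* Branch hint; assumes a GCC- or Clang-compatible compiler. */
#define toy_unlikely(x) __builtin_expect(!!(x), 0)

/* Encode a negative errno value in the top page of the address space. */
void *toy_err_ptr(long err)
{
        assert(err < 0 && err >= -TOY_MAX_ERRNO);
        return (void *)(intptr_t)err;
}

/* Recover the errno value from an error pointer. */
long toy_ptr_err(const void *p)
{
        return (long)(intptr_t)p;
}

/* The hint lives inside the check itself, so wrapping a call to this
 * function in toy_unlikely() again adds nothing. */
int toy_is_err(const void *p)
{
        return toy_unlikely((uintptr_t)p >= (uintptr_t)-TOY_MAX_ERRNO);
}
```

With this shape, `if (toy_is_err(p))` and `if (toy_unlikely(toy_is_err(p)))` carry the same branch hint.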
  10. 26 Sep, 2017 2 commits
    • A
      neigh: make struct neigh_table::entry_size unsigned int · 01ccdf12
      Alexey Dobriyan committed
      Key length can't be negative.
      
      Leave comparisons against nla_len() signed just in case truncated attribute
      can sneak in there.
      
      Space savings:
      
      	add/remove: 0/0 grow/shrink: 0/7 up/down: 0/-7 (-7)
      	function                                     old     new   delta
      	pneigh_delete                                273     272      -1
      	mlx5e_rep_netevent_event                    1415    1414      -1
      	mlx5e_create_encap_header_ipv6              1194    1193      -1
      	mlx5e_create_encap_header_ipv4              1071    1070      -1
      	cxgb4_l2t_get                               1104    1103      -1
      	__pneigh_lookup                               69      68      -1
      	__neigh_create                              2452    2451      -1
      Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      01ccdf12
    • E
      net: speed up skb_rbtree_purge() · 7c90584c
      Eric Dumazet committed
      As measured in my prior patch ("sch_netem: faster rb tree removal"),
      rbtree_postorder_for_each_entry_safe() is nice looking but much slower
      than using rb_next() directly, except when tree is small enough
      to fit in CPU caches (then the cost is the same)
      
      Also note that there is not even an increase of text size :
      $ size net/core/skbuff.o.before net/core/skbuff.o
         text	   data	    bss	    dec	    hex	filename
        40711	   1298	      0	  42009	   a419	net/core/skbuff.o.before
        40711	   1298	      0	  42009	   a419	net/core/skbuff.o
      
      From: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7c90584c
  11. 23 Sep, 2017 2 commits
  12. 22 Sep, 2017 1 commit
    • F
      net: ethtool: Add back transceiver type · 19cab887
      Florian Fainelli committed
      Commit 3f1ac7a7 ("net: ethtool: add new ETHTOOL_xLINKSETTINGS API")
      deprecated the ethtool_cmd::transceiver field, which was fine in
      premise, except that the PHY library was actually using it to report the
      type of transceiver: internal or external.
      
      Use the first word of the reserved field to put this __u8 transceiver
      field back in. It is made read-only, and we don't expect the
      ETHTOOL_xLINKSETTINGS API to be doing anything with this anyway, so this
      is mostly for the legacy path where we do:
      
      ethtool_get_settings()
      -> dev->ethtool_ops->get_link_ksettings()
         -> convert_link_ksettings_to_legacy_settings()
      
      to have no information loss compared to the legacy get_settings API.
      
      Fixes: 3f1ac7a7 ("net: ethtool: add new ETHTOOL_xLINKSETTINGS API")
      Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      19cab887
  13. 21 Sep, 2017 1 commit
  14. 20 Sep, 2017 1 commit
    • D
      bpf: fix ri->map_owner pointer on bpf_prog_realloc · 7c300131
      Daniel Borkmann committed
      Commit 109980b8 ("bpf: don't select potentially stale
      ri->map from buggy xdp progs") passed the pointer to the prog
      itself to be loaded into r4 prior on bpf_redirect_map() helper
      call, so that we can store the owner into ri->map_owner out of
      the helper.
      
      Issue with that is that the actual address of the prog is still
      subject to change when subsequent rewrites occur that require
      slow path in bpf_prog_realloc() to alloc more memory, e.g. from
      patching inlining helper functions or constant blinding. Thus,
      we really need to take prog->aux as the address we're holding,
      which also works with prog clones as they share the same aux
      object.
      
      Instead of then fetching aux->prog during runtime, which could
      potentially incur cache misses due to false sharing, we are
      going to just use aux for comparison on the map owner. This
      will also keep the patchlet of the same size, and later check
      in xdp_map_invalid() only accesses read-only aux pointer from
      the prog, it's also in the same cacheline already from prior
      access when calling bpf_func.
      
      Fixes: 109980b8 ("bpf: don't select potentially stale ri->map from buggy xdp progs")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7c300131
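
The reasoning in the commit message above (the prog address may change on reallocation, while the aux object stays put) can be illustrated with a toy userspace model. `toy_prog`, `toy_aux` and the functions below are hypothetical; the grow function always moves the object, which is an assumption chosen to model the slow path only.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy auxiliary object: allocated once, never moved. */
struct toy_aux { int id; };

/* Toy program: header plus a variable-length instruction area. */
struct toy_prog {
        struct toy_aux *aux;
        size_t len;
        unsigned char insns[];
};

struct toy_prog *toy_prog_alloc(size_t len)
{
        struct toy_prog *p = calloc(1, sizeof(*p) + len);

        assert(p != NULL);
        p->aux = calloc(1, sizeof(*p->aux));
        assert(p->aux != NULL);
        p->len = len;
        return p;
}

/* Grow the program by moving it to a fresh allocation. Any pointer to
 * the old program is stale afterwards; the aux pointer is carried over. */
struct toy_prog *toy_prog_grow(struct toy_prog *old, size_t new_len)
{
        struct toy_prog *p;

        assert(old != NULL && new_len >= old->len);
        p = calloc(1, sizeof(*p) + new_len);
        assert(p != NULL);
        memcpy(p, old, sizeof(*old) + old->len);
        p->len = new_len;
        free(old);
        return p;
}
```

An owner token recorded as `p->aux` before the grow still compares equal to `q->aux` afterwards, whereas a token recorded as `p` itself would dangle.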
  15. 14 Sep, 2017 1 commit
    • E
      net_sched: gen_estimator: fix scaling error in bytes/packets samples · ca558e18
      Eric Dumazet committed
      Denys reported wrong rate estimations with HTB classes.
      
      It appears the bug was added in linux-4.10, since my tests
      were using intervals of one second only.

      Since HTB uses 4-second rate estimators by default, reported rates
      were 4x higher.
      
      We need to properly scale the bytes/packets samples before
      integrating them in EWMA.
      
      Tested:
       echo 1 >/sys/module/sch_htb/parameters/htb_rate_est
      
       Set up HTB with one class with a rate/ceil of 5Gbit
      
       Generate traffic on this class
      
       tc -s -d cl sh dev eth0 classid 7002:11
      class htb 7002:11 parent 7002:1 prio 5 quantum 200000 rate 5Gbit ceil
      5Gbit linklayer ethernet burst 80000b/1 mpu 0b cburst 80000b/1 mpu 0b
      level 0 rate_handle 1
       Sent 1488215421648 bytes 982969243 pkt (dropped 0, overlimits 0
      requeues 0)
       rate 5Gbit 412814pps backlog 136260b 2p requeues 0
       TCP pkts/rtx 982969327/45 bytes 1488215557414/68130
       lended: 22732826 borrowed: 0 giants: 0
       tokens: -1684 ctokens: -1684
      
      Fixes: 1c0d32fd ("net_sched: gen_estimator: complete rewrite of rate estimators")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Denys Fedoryshchenko <nuclearcat@nuclearcat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ca558e18
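
The scaling issue described above can be shown with plain integer arithmetic. This is a sketch only: the function names are hypothetical, the interval is assumed to be a power of two seconds, and the kernel's own fixed-point representation is not reproduced. A 5Gbit class moves 625,000,000 bytes per second, so over a 4-second interval the byte counter advances by 2,500,000,000; feeding that delta into the average unscaled yields a 4x rate.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a byte-counter delta observed over 2^intvl_log seconds into a
 * per-second sample. Skipping this step is the scaling error. */
uint64_t toy_sample_rate(uint64_t delta_bytes, unsigned intvl_log)
{
        assert(intvl_log < 32);
        return delta_bytes >> intvl_log;
}

/* One EWMA step: avg moves toward sample by 1/2^ewma_log of the gap. */
uint64_t toy_ewma_step(uint64_t avg, uint64_t sample, unsigned ewma_log)
{
        assert(ewma_log < 32);
        if (sample >= avg)
                return avg + ((sample - avg) >> ewma_log);
        return avg - ((avg - sample) >> ewma_log);
}
```

With one-second intervals (`intvl_log == 0`) the scaled and unscaled samples coincide, which is consistent with the bug going unnoticed in tests that only used that interval.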
  16. 12 Sep, 2017 1 commit
    • J
      xdp: implement xdp_redirect_map for generic XDP · 96c5508e
      Jesper Dangaard Brouer committed
      Using bpf_redirect_map is allowed for generic XDP programs, but the
      appropriate map lookup was never performed in xdp_do_generic_redirect().
      
      Instead the map-index is directly used as the ifindex.  For the
      xdp_redirect_map sample in SKB-mode '-S', this resulted in trying
      to send on ifindex 0, which isn't valid, so SKB packets were
      dropped.  Thus, the reported performance numbers are wrong in
      commit 24251c26 ("samples/bpf: add option for native and skb mode
      for redirect apps") for the 'xdp_redirect_map -S' case.
      
      Before commit 109980b8 ("bpf: don't select potentially stale
      ri->map from buggy xdp progs") it could crash the kernel.  Like that
      commit, also check that the map_owner is correct before
      dereferencing the map pointer.  But make sure that this API misuse
      can be caught by a tracepoint, thus allowing userspace to detect
      misbehaving bpf_progs via tracepoints.
      
      Fixes: 6103aa96 ("net: implement XDP_REDIRECT for xdp generic")
      Fixes: 24251c26 ("samples/bpf: add option for native and skb mode for redirect apps")
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      96c5508e
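
The difference between using a map index directly as an ifindex and performing the map lookup first can be sketched with a toy device map. Everything here is hypothetical (`toy_devmap`, the slot count, and the convention that 0 means "no valid device"); it only models the lookup step, not the kernel's redirect path.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAP_SIZE 4

/* Toy device map: slot index -> ifindex, with 0 meaning an empty slot. */
struct toy_devmap { int ifindex[TOY_MAP_SIZE]; };

/* Buggy variant: the map index is treated as if it were the ifindex. */
int toy_redirect_buggy(const struct toy_devmap *map, unsigned int idx)
{
        (void)map;
        return (int)idx;
}

/* Fixed variant: resolve the slot first; 0 signals an invalid target. */
int toy_redirect_lookup(const struct toy_devmap *map, unsigned int idx)
{
        assert(map != NULL);
        if (idx >= TOY_MAP_SIZE)
                return 0;
        return map->ifindex[idx];
}
```

With a device of ifindex 7 stored in slot 0, the buggy variant targets ifindex 0 while the lookup variant targets ifindex 7.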
  17. 09 Sep, 2017 2 commits
    • D
      bpf: make error reporting in bpf_warn_invalid_xdp_action more clear · 9beb8bed
      Daniel Borkmann committed
      Differentiate between an illegal XDP action code and one that is
      just unsupported by the driver to provide better feedback when we throw
      a one-time warning here. Reason is that with 814abfab
      ("xdp: add bpf_redirect helper function") not all drivers
      support the new XDP return code yet and thus they will
      fall into their 'default' case when checking for return
      codes after program return, which then triggers a
      bpf_warn_invalid_xdp_action() stating that the return
      code is illegal, but from XDP perspective it's not.
      
      I decided not to place something like a XDP_ACT_MAX define
      into uapi i) given we don't have this either for all other
      program types, ii) future action codes could have further
      encoding there, which would render such define unsuitable
      and we wouldn't be able to rip it out again, and iii) we
      rarely add new action codes.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9beb8bed
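
The distinction drawn in the commit message above, between an action code that is not a legal XDP action at all and one that is legal but not handled by a given driver, can be sketched as follows. The enum mirrors the five action codes that existed once the redirect code was added; the `toy_` names and the `driver_max` parameter are hypothetical, and the model assumes a driver handles a contiguous range of codes starting at 0.

```c
#include <assert.h>

/* Toy copy of the action codes; TOY_XDP_REDIRECT is the highest one. */
enum toy_xdp_action {
        TOY_XDP_ABORTED = 0,
        TOY_XDP_DROP,
        TOY_XDP_PASS,
        TOY_XDP_TX,
        TOY_XDP_REDIRECT
};

enum toy_verdict {
        TOY_HANDLED,             /* driver knows this action */
        TOY_DRIVER_UNSUPPORTED,  /* legal action, driver lacks support */
        TOY_ILLEGAL              /* not an XDP action code at all */
};

/* Classify a program's return code for a driver that handles every
 * action code up to and including driver_max (an assumption of this
 * sketch). */
enum toy_verdict toy_classify(unsigned int act, unsigned int driver_max)
{
        assert(driver_max <= TOY_XDP_REDIRECT);
        if (act <= driver_max)
                return TOY_HANDLED;
        if (act <= TOY_XDP_REDIRECT)
                return TOY_DRIVER_UNSUPPORTED;
        return TOY_ILLEGAL;
}
```

A driver that stops at the TX action would report a redirect return code as unsupported rather than illegal, which is the more accurate feedback the commit aims for.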
    • J
      net: rcu lock and preempt disable missing around generic xdp · bbbe211c
      John Fastabend committed
      do_xdp_generic must be called inside rcu critical section with preempt
      disabled to ensure BPF programs are valid and per-cpu variables used
      for redirect operations are consistent. This patch ensures this is true
      and fixes the splat below.
      
      The netif_receive_skb_internal() code path is now broken into two rcu
      critical sections. I decided it was better to limit the preempt_enable/disable
      block to just the xdp static key portion and the fallout is more
      rcu_read_lock/unlock calls. Seems like the best option to me.
      
      [  607.596901] =============================
      [  607.596906] WARNING: suspicious RCU usage
      [  607.596912] 4.13.0-rc4+ #570 Not tainted
      [  607.596917] -----------------------------
      [  607.596923] net/core/dev.c:3948 suspicious rcu_dereference_check() usage!
      [  607.596927]
      [  607.596927] other info that might help us debug this:
      [  607.596927]
      [  607.596933]
      [  607.596933] rcu_scheduler_active = 2, debug_locks = 1
      [  607.596938] 2 locks held by pool/14624:
      [  607.596943]  #0:  (rcu_read_lock_bh){......}, at: [<ffffffff95445ffd>] ip_finish_output2+0x14d/0x890
      [  607.596973]  #1:  (rcu_read_lock_bh){......}, at: [<ffffffff953c8e3a>] __dev_queue_xmit+0x14a/0xfd0
      [  607.597000]
      [  607.597000] stack backtrace:
      [  607.597006] CPU: 5 PID: 14624 Comm: pool Not tainted 4.13.0-rc4+ #570
      [  607.597011] Hardware name: Dell Inc. Precision Tower 5810/0HHV7N, BIOS A17 03/01/2017
      [  607.597016] Call Trace:
      [  607.597027]  dump_stack+0x67/0x92
      [  607.597040]  lockdep_rcu_suspicious+0xdd/0x110
      [  607.597054]  do_xdp_generic+0x313/0xa50
      [  607.597068]  ? time_hardirqs_on+0x5b/0x150
      [  607.597076]  ? mark_held_locks+0x6b/0xc0
      [  607.597088]  ? netdev_pick_tx+0x150/0x150
      [  607.597117]  netif_rx_internal+0x205/0x3f0
      [  607.597127]  ? do_xdp_generic+0xa50/0xa50
      [  607.597144]  ? lock_downgrade+0x2b0/0x2b0
      [  607.597158]  ? __lock_is_held+0x93/0x100
      [  607.597187]  netif_rx+0x119/0x190
      [  607.597202]  loopback_xmit+0xfd/0x1b0
      [  607.597214]  dev_hard_start_xmit+0x127/0x4e0
      
      Fixes: d4455169 ("net: xdp: support xdp generic on virtual devices")
      Fixes: b5cdae32 ("net: Generic XDP")
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      bbbe211c