1. 23 4月, 2017 1 次提交
  2. 22 4月, 2017 4 次提交
    • T
      netpoll: Check for skb->queue_mapping · c70b17b7
      Tushar Dave 提交于
      Reducing real_num_tx_queues needs to be in sync with skb queue_mapping
      otherwise skbs with queue_mapping greater than real_num_tx_queues
      can be sent to the underlying driver and can result in kernel panic.
      
      One such event is running netconsole and enabling VF on the same
      device. Or running netconsole and changing number of tx queues via
      ethtool on same device.
      
      e.g.
      Unable to handle kernel NULL pointer dereference
      tsk->{mm,active_mm}->context = 0000000000001525
      tsk->{mm,active_mm}->pgd = fff800130ff9a000
                    \|/ ____ \|/
                    "@'/ .. \`@"
                    /_| \__/ |_\
                       \__U_/
      kworker/48:1(475): Oops [#1]
      CPU: 48 PID: 475 Comm: kworker/48:1 Tainted: G           OE
      4.11.0-rc3-davem-net+ #7
      Workqueue: events queue_process
      task: fff80013113299c0 task.stack: fff800131132c000
      TSTATE: 0000004480e01600 TPC: 00000000103f9e3c TNPC: 00000000103f9e40 Y:
      00000000    Tainted: G           OE
      TPC: <ixgbe_xmit_frame_ring+0x7c/0x6c0 [ixgbe]>
      g0: 0000000000000000 g1: 0000000000003fff g2: 0000000000000000 g3:
      0000000000000001
      g4: fff80013113299c0 g5: fff8001fa6808000 g6: fff800131132c000 g7:
      00000000000000c0
      o0: fff8001fa760c460 o1: fff8001311329a50 o2: fff8001fa7607504 o3:
      0000000000000003
      o4: fff8001f96e63a40 o5: fff8001311d77ec0 sp: fff800131132f0e1 ret_pc:
      000000000049ed94
      RPC: <set_next_entity+0x34/0xb80>
      l0: 0000000000000000 l1: 0000000000000800 l2: 0000000000000000 l3:
      0000000000000000
      l4: 000b2aa30e34b10d l5: 0000000000000000 l6: 0000000000000000 l7:
      fff8001fa7605028
      i0: fff80013111a8a00 i1: fff80013155a0780 i2: 0000000000000000 i3:
      0000000000000000
      i4: 0000000000000000 i5: 0000000000100000 i6: fff800131132f1a1 i7:
      00000000103fa4b0
      I7: <ixgbe_xmit_frame+0x30/0xa0 [ixgbe]>
      Call Trace:
       [00000000103fa4b0] ixgbe_xmit_frame+0x30/0xa0 [ixgbe]
       [0000000000998c74] netpoll_start_xmit+0xf4/0x200
       [0000000000998e10] queue_process+0x90/0x160
       [0000000000485fa8] process_one_work+0x188/0x480
       [0000000000486410] worker_thread+0x170/0x4c0
       [000000000048c6b8] kthread+0xd8/0x120
       [0000000000406064] ret_from_fork+0x1c/0x2c
       [0000000000000000]           (null)
      Disabling lock debugging due to kernel taint
      Caller[00000000103fa4b0]: ixgbe_xmit_frame+0x30/0xa0 [ixgbe]
      Caller[0000000000998c74]: netpoll_start_xmit+0xf4/0x200
      Caller[0000000000998e10]: queue_process+0x90/0x160
      Caller[0000000000485fa8]: process_one_work+0x188/0x480
      Caller[0000000000486410]: worker_thread+0x170/0x4c0
      Caller[000000000048c6b8]: kthread+0xd8/0x120
      Caller[0000000000406064]: ret_from_fork+0x1c/0x2c
      Caller[0000000000000000]:           (null)
      Signed-off-by: NTushar Dave <tushar.n.dave@oracle.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c70b17b7
    • D
      bpf: add napi_id read access to __sk_buff · b1d9fc41
      Daniel Borkmann 提交于
      Add napi_id access to __sk_buff for socket filter program types, tc
      program types and other bpf_convert_ctx_access() users. Having access
      to skb->napi_id is useful for per RX queue listener siloing, f.e.
      in combination with SO_ATTACH_REUSEPORT_EBPF and when busy polling is
      used, meaning SO_REUSEPORT enabled listeners can then select the
      corresponding socket at SYN time already [1]. The skb is marked via
      skb_mark_napi_id() early in the receive path (e.g., napi_gro_receive()).
      
      Currently, sockets can only use SO_INCOMING_NAPI_ID from 6d433902
      ("net: Introduce SO_INCOMING_NAPI_ID") as a socket option to look up
      the NAPI ID associated with the queue for steering, which requires a
      prior sk_mark_napi_id() after the socket was looked up.
      
      Semantics for the __sk_buff napi_id access are similar, meaning if
      skb->napi_id is < MIN_NAPI_ID (e.g. outgoing packets using sender_cpu),
      then an invalid napi_id of 0 is returned to the program, otherwise a
      valid non-zero napi_id.
      
        [1] http://netdevconf.org/2.1/slides/apr6/dumazet-BUSY-POLLING-Netdev-2.1.pdfSuggested-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b1d9fc41
    • I
      gso: Validate assumption of frag_list segementation · 43170c4e
      Ilan Tayari 提交于
      Commit 07b26c94 ("gso: Support partial splitting at the frag_list
      pointer") assumes that all SKBs in a frag_list (except maybe the last
      one) contain the same amount of GSO payload.
      
      This assumption is not always correct, resulting in the following
      warning message in the log:
          skb_segment: too many frags
      
      For example, mlx5 driver in Striding RQ mode creates some RX SKBs with
      one frag, and some with 2 frags.
      After GRO, the frag_list SKBs end up having different amounts of payload.
      If this frag_list SKB is then forwarded, the aforementioned assumption
      is violated.
      
      Validate the assumption, and fall back to software GSO if it not true.
      
      Change-Id: Ia03983f4a47b6534dd987d7a2aad96d54d46d212
      Fixes: 07b26c94 ("gso: Support partial splitting at the frag_list pointer")
      Signed-off-by: NIlan Tayari <ilant@mellanox.com>
      Signed-off-by: NIlya Lesokhin <ilyal@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      43170c4e
    • M
      Replace 2 jiffies with sysctl netdev_budget_usecs to enable softirq tuning · 7acf8a1e
      Matthew Whitehead 提交于
      Constants used for tuning are generally a bad idea, especially as hardware
      changes over time. Replace the constant 2 jiffies with sysctl variable
      netdev_budget_usecs to enable sysadmins to tune the softirq processing.
      Also document the variable.
      
      For example, a very fast machine might tune this to 1000 microseconds,
      while my regression testing 486DX-25 needs it to be 4000 microseconds on
      a nearly idle network to prevent time_squeeze from being incremented.
      
      Version 2: changed jiffies to microseconds for predictable units.
      Signed-off-by: NMatthew Whitehead <tedheadster@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7acf8a1e
  3. 21 4月, 2017 1 次提交
  4. 18 4月, 2017 4 次提交
    • D
      net: rtnetlink: plumb extended ack to doit function · c21ef3e3
      David Ahern 提交于
      Add netlink_ext_ack arg to rtnl_doit_func. Pass extack arg to nlmsg_parse
      for doit functions that call it directly.
      
      This is the first step to using extended error reporting in rtnetlink.
      >From here individual subsystems can be updated to set netlink_ext_ack as
      needed.
      Signed-off-by: NDavid Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c21ef3e3
    • I
      gso: Validate assumption of frag_list segementation · 7a7a9bd7
      Ilan Tayari 提交于
      Commit 07b26c94 ("gso: Support partial splitting at the frag_list
      pointer") assumes that all SKBs in a frag_list (except maybe the last
      one) contain the same amount of GSO payload.
      
      This assumption is not always correct, resulting in the following
      warning message in the log:
          skb_segment: too many frags
      
      For example, mlx5 driver in Striding RQ mode creates some RX SKBs with
      one frag, and some with 2 frags.
      After GRO, the frag_list SKBs end up having different amounts of payload.
      If this frag_list SKB is then forwarded, the aforementioned assumption
      is violated.
      
      Validate the assumption, and fall back to software GSO if it not true.
      
      Fixes: 07b26c94 ("gso: Support partial splitting at the frag_list pointer")
      Signed-off-by: NIlan Tayari <ilant@mellanox.com>
      Signed-off-by: NIlya Lesokhin <ilyal@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7a7a9bd7
    • C
      Add uid and cookie bpf helper to cg_skb_func_proto · 9fd0f315
      Chenbo Feng 提交于
      BPF helper functions get_socket_cookie and get_socket_uid can be
      used for network traffic classifications, among others. Expose
      them also to programs of type BPF_PROG_TYPE_CGROUP_SKB. As of
      commit 8f917bba ("bpf: pass sk to helper functions") the
      required skb->sk function is available at both cgroup bpf ingress
      and egress hooks. With these two new helper, cg_skb_func_proto is
      effectively the same as sk_filter_func_proto.
      
      Change since V1:
      Instead of add the helper to cg_skb_func_proto, redirect the
      cg_skb_func_proto to sk_filter_func_proto since all helper function
      in sk_filter_func_proto are applicable to cg_skb_func_proto now.
      Signed-off-by: NChenbo Feng <fengc@google.com>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9fd0f315
    • W
      net-timestamp: avoid use-after-free in ip_recv_error · 1862d620
      Willem de Bruijn 提交于
      Syzkaller reported a use-after-free in ip_recv_error at line
      
          info->ipi_ifindex = skb->dev->ifindex;
      
      This function is called on dequeue from the error queue, at which
      point the device pointer may no longer be valid.
      
      Save ifindex on enqueue in __skb_complete_tx_timestamp, when the
      pointer is valid or NULL. Store it in temporary storage skb->cb.
      
      It is safe to reference skb->dev here, as called from device drivers
      or dev_queue_xmit. The exception is when called from tcp_ack_tstamp;
      in that case it is NULL and ifindex is set to 0 (invalid).
      
      Do not return a pktinfo cmsg if ifindex is 0. This maintains the
      current behavior of not returning a cmsg if skb->dev was NULL.
      
      On dequeue, the ipv4 path will cast from sock_exterr_skb to
      in_pktinfo. Both have ifindex as their first element, so no explicit
      conversion is needed. This is by design, introduced in commit
      0b922b7a ("net: original ingress device index in PKTINFO"). For
      ipv6 ip6_datagram_support_cmsg converts to in6_pktinfo.
      
      Fixes: 829ae9d6 ("net-timestamp: allow reading recv cmsg on errqueue with origin tstamp")
      Reported-by: NAndrey Konovalov <andreyknvl@google.com>
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1862d620
  5. 14 4月, 2017 12 次提交
  6. 13 4月, 2017 1 次提交
    • I
      gso: Support frag_list splitting with head_frag · eaffadbb
      Ilan Tayari 提交于
      A driver may use build_skb() for received packets.
      These SKBs then have a head_frag.
      
      Since commit d7e8883c ("net: make GRO aware of
      skb->head_frag"), GRO may build frag_list SKBs out of
      head_frag received SKBs.
      In such a case, the chained SKBs end up with a head_frag.
      
      Commit 07b26c94 ("gso: Support partial splitting at
      the frag_list pointer") adds partial segmentation of frag_list
      SKB chains into individual SKBs.
      However, this is not done if the chained SKBs have any
      linear part, because the device may not be able to DMA
      the private linear buffer.
      
      A chained frag_list SKB with head_frag is wrongfully
      detected in this case as having a private linear part
      and thus falls back to software GSO, while in fact the
      linear part is backed by a DMA page just like any other frag.
      
      This causes low performance when forwarding those packets
      that were built with build_skb()
      
      Allow partial segmentation at the frag_list pointer for
      chained SKBs with head_frag.
      
      Note that such SKBs can only be created by GRO, when applied
      to received packets with head_frag.
      Also note that this change only affects the data path that
      performs the partial segmentation at frag_list pointer, and
      not any of the other more common data paths.
      Signed-off-by: NIlan Tayari <ilant@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      eaffadbb
  7. 12 4月, 2017 4 次提交
  8. 10 4月, 2017 1 次提交
  9. 08 4月, 2017 1 次提交
  10. 05 4月, 2017 2 次提交
    • V
      rtnl: Add support for netdev event to link messages · def12888
      Vlad Yasevich 提交于
      When netdev events happen, a rtnetlink_event() handler will send
      messages for every event in it's white list.  These messages contain
      current information about a particular device, but they do not include
      the iformation about which event just happened.  The consumer of
      the message has to try to infer this information.  In some cases
      (ex: NETDEV_NOTIFY_PEERS), that is not possible.
      
      This patch adds a new extension to RTM_NEWLINK message called IFLA_EVENT
      that would have an encoding of the which event triggered this
      message.  This would allow the the message consumer to easily determine
      if it is interested in a particular event or not.
      Signed-off-by: NVladislav Yasevich <vyasevic@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      def12888
    • V
      rtnetlink: Convert rtnetlink_event to white list · 5138e86f
      Vlad Yasevich 提交于
      The rtnetlink_event currently functions as a blacklist where
      we block cerntain netdev events from being sent to user space.
      As a result, events have been added to the system that userspace
      probably doesn't care about.
      
      This patch converts the implementation to the white list so that
      newly events would have to be specifically added to the list to
      be sent to userspace.  This would force new event implementers to
      consider whether a given event is usefull to user space or if it's
      just a kernel event.
      Signed-off-by: NVladislav Yasevich <vyasevic@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5138e86f
  11. 04 4月, 2017 5 次提交
  12. 03 4月, 2017 1 次提交
    • A
      make skb_copy_datagram_msg() et.al. preserve ->msg_iter on error · 32786821
      Al Viro 提交于
      Fixes the mess observed in e.g. rsync over a noisy link we'd been
      seeing since last Summer.  What happens is that we copy part of
      a datagram before noticing a checksum mismatch.  Datagram will be
      resent, all right, but we want the next try go into the same place,
      not after it...
      
      All this family of primitives (copy/checksum and copy a datagram
      into destination) is "all or nothing" sort of interface - either
      we get 0 (meaning that copy had been successful) or we get an
      error (and no way to tell how much had been copied before we ran
      into whatever error it had been).  Make all of them leave iterator
      unadvanced in case of errors - all callers must be able to cope
      with that (an error might've been caught before the iterator had
      been advanced), it costs very little to arrange, it's safer for
      callers and actually fixes at least one bug in said callers.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      32786821
  13. 02 4月, 2017 1 次提交
    • A
      bpf: introduce BPF_PROG_TEST_RUN command · 1cf1cae9
      Alexei Starovoitov 提交于
      development and testing of networking bpf programs is quite cumbersome.
      Despite availability of user space bpf interpreters the kernel is
      the ultimate authority and execution environment.
      Current test frameworks for TC include creation of netns, veth,
      qdiscs and use of various packet generators just to test functionality
      of a bpf program. XDP testing is even more complicated, since
      qemu needs to be started with gro/gso disabled and precise queue
      configuration, transferring of xdp program from host into guest,
      attaching to virtio/eth0 and generating traffic from the host
      while capturing the results from the guest.
      
      Moreover analyzing performance bottlenecks in XDP program is
      impossible in virtio environment, since cost of running the program
      is tiny comparing to the overhead of virtio packet processing,
      so performance testing can only be done on physical nic
      with another server generating traffic.
      
      Furthermore ongoing changes to user space control plane of production
      applications cannot be run on the test servers leaving bpf programs
      stubbed out for testing.
      
      Last but not least, the upstream llvm changes are validated by the bpf
      backend testsuite which has no ability to test the code generated.
      
      To improve this situation introduce BPF_PROG_TEST_RUN command
      to test and performance benchmark bpf programs.
      
      Joint work with Daniel Borkmann.
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1cf1cae9
  14. 31 3月, 2017 1 次提交
    • P
      sock: avoid dirtying sk_stamp, if possible · 6c7c98ba
      Paolo Abeni 提交于
      sock_recv_ts_and_drops() unconditionally set sk->sk_stamp for
      every packet, even if the SOCK_TIMESTAMP flag is not set in the
      related socket.
      If selinux is enabled, this cause a cache miss for every packet
      since sk->sk_stamp and sk->sk_security share the same cacheline.
      With this change sk_stamp is set only if the SOCK_TIMESTAMP
      flag is set, and is cleared for the first packet, so that the user
      perceived behavior is unchanged.
      
      This gives up to 5% speed-up under udp-flood with small packets.
      Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6c7c98ba
  15. 29 3月, 2017 1 次提交