1. 25 Dec, 2016 1 commit
  2. 24 Dec, 2016 1 commit
    • neigh: Send netevent after marking neigh as dead · 53f800e3
      Ido Schimmel committed
      neigh_cleanup_and_release() is always called after marking a neighbour
      as dead, but it only notifies user space and not in-kernel listeners of
      the netevent notification chain.
      
      This can cause multiple problems. In my specific use case, it causes the
      listener (a switch driver capable of L3 offloads) to believe a neighbour
      entry is still valid, and is thus erroneously kept in the device's
      table.
      
      Fix that by sending a netevent after marking the neighbour as dead.
      
      Fixes: a6bf9e93 ("mlxsw: spectrum_router: Offload neighbours based on NUD state change")
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
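The problem described in this commit can be illustrated with a small userspace model. This is only a sketch: `neigh_model`, `register_listener()`, `notify_listeners()` and `mark_dead_and_release()` are hypothetical stand-ins, not the kernel's neighbour or netevent APIs. It shows that a listener which caches offloaded entries only drops a dead neighbour if it is actually notified.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for a neighbour entry. */
struct neigh_model {
	bool dead;       /* neighbour has been marked dead */
	bool offloaded;  /* listener still keeps it in the device table */
};

typedef void (*listener_fn)(struct neigh_model *n);

#define MAX_LISTENERS 4
static listener_fn listeners[MAX_LISTENERS];
static size_t nr_listeners;

/* Add an in-process listener to the (model) notification chain. */
static int register_listener(listener_fn fn)
{
	if (nr_listeners == MAX_LISTENERS)
		return -1;
	listeners[nr_listeners++] = fn;
	return 0;
}

static void notify_listeners(struct neigh_model *n)
{
	for (size_t i = 0; i < nr_listeners; i++)
		listeners[i](n);
}

/* Models a switch driver: drop the offloaded entry once the neighbour is dead. */
static void switch_driver_listener(struct neigh_model *n)
{
	if (n->dead)
		n->offloaded = false;
}

/* send_event = false models the old behaviour, true the fixed one. */
static void mark_dead_and_release(struct neigh_model *n, bool send_event)
{
	n->dead = true;
	if (send_event)
		notify_listeners(n);
}
```

Without the event, the entry stays offloaded even though the neighbour is dead; with it, the listener removes the entry.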
  3. 18 Dec, 2016 1 commit
  4. 15 Dec, 2016 1 commit
  5. 10 Dec, 2016 1 commit
  6. 09 Dec, 2016 4 commits
    • bpf: xdp: Allow head adjustment in XDP prog · 17bedab2
      Martin KaFai Lau committed
      This patch allows XDP prog to extend/remove the packet
      data at the head (like adding or removing header).  It is
      done by adding a new XDP helper bpf_xdp_adjust_head().
      
      It also renames bpf_helper_changes_skb_data() to
      bpf_helper_changes_pkt_data() to better reflect
      that XDP prog does not work on skb.
      
      This patch adds one "xdp_adjust_head" bit to bpf_prog for the
      XDP-capable driver to check if the XDP prog requires
      bpf_xdp_adjust_head() support.  The driver can then decide
      to error out during XDP_SETUP_PROG.
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: John Fastabend <john.r.fastabend@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
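As a rough illustration of what "head adjustment" means, the following userspace model moves the start of the packet data inside a fixed buffer. It is a sketch under assumptions: `pkt_model`, `pkt_adjust_head()` and `MIN_FRAME_LEN` are hypothetical names, and the bounds checks are a plausible model rather than the exact checks done by bpf_xdp_adjust_head(). In this model a negative delta extends the packet into the headroom and a positive delta removes bytes from the front.

```c
#include <assert.h>
#include <stddef.h>

#define MIN_FRAME_LEN 14 /* assumption: always keep room for an Ethernet header */

/* Hypothetical packet buffer: offsets into buf rather than raw pointers. */
struct pkt_model {
	unsigned char buf[256];
	ptrdiff_t data;      /* offset of the first packet byte */
	ptrdiff_t data_end;  /* offset one past the last packet byte */
};

/* Move the packet start by delta bytes; returns 0 on success, -1 if out of bounds. */
static int pkt_adjust_head(struct pkt_model *p, int delta)
{
	ptrdiff_t new_data = p->data + delta;

	if (new_data < 0 || new_data > p->data_end - MIN_FRAME_LEN)
		return -1;
	p->data = new_data;
	return 0;
}

static ptrdiff_t pkt_len(const struct pkt_model *p)
{
	return p->data_end - p->data;
}
```

A program that pushes an 8-byte header would call `pkt_adjust_head(&p, -8)` and then write the new header at the new start of the data.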
    • udp: under rx pressure, try to condense skbs · c8c8b127
      Eric Dumazet committed
      Under UDP flood, many softirq producers try to add packets to
      UDP receive queue, and one user thread is burning one cpu trying
      to dequeue packets as fast as possible.
      
      Two parts of the per packet cost are :
      - copying payload from kernel space to user space,
      - freeing memory pieces associated with skb.
      
      If socket is under pressure, softirq handler(s) can try to pull in
      skb->head the payload of the packet if it fits.
      
      Meaning the softirq handler(s) can free/reuse the page fragment
      immediately, instead of letting udp_recvmsg() do this hundreds of usec
      later, possibly from another node.
      
      Additional gains :
      - We reduce skb->truesize and thus can store more packets per SO_RCVBUF
      - We avoid cache line misses at copyout() time and consume_skb() time,
      and avoid one put_page() with potential alien freeing on NUMA hosts.
      
      This comes at the cost of a copy, bounded to available tail room, which
      is usually small. (We might have to fix GRO_MAX_HEAD which looks bigger
      than necessary)
      
      This patch gave me about 5 % increase in throughput in my tests.
      
      skb_condense() helper could probably be used in other contexts.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
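The condensing idea can be sketched in userspace as follows. This is a simplified model, not the kernel's skb_condense(): `buf_model` and `buf_condense()` are hypothetical, a single heap fragment stands in for page fragments, and `truesize` is reduced to the size of the structure itself purely for illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical buffer: a linear head area plus one optional heap fragment. */
struct buf_model {
	unsigned char head[128];
	size_t head_len;      /* bytes used in head */
	unsigned char *frag;  /* external payload, or NULL */
	size_t frag_len;
	size_t truesize;      /* memory charged to the receive buffer */
};

/*
 * If the fragment fits in the tail room of the head, copy it there and
 * release the fragment right away. Returns 1 if condensed, 0 otherwise.
 */
static int buf_condense(struct buf_model *b)
{
	size_t tailroom = sizeof(b->head) - b->head_len;

	if (!b->frag || b->frag_len > tailroom)
		return 0;
	memcpy(b->head + b->head_len, b->frag, b->frag_len);
	b->head_len += b->frag_len;
	free(b->frag);            /* fragment is freed by the producer side */
	b->frag = NULL;
	b->frag_len = 0;
	b->truesize = sizeof(*b); /* less memory is now charged per packet */
	return 1;
}
```

The copy is bounded by the available tail room, which is the trade-off the commit message describes.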
    • net: rfs: add a jump label · 13bfff25
      Eric Dumazet committed
      RFS is not commonly used, so add a jump label to avoid some conditionals
      in fast path.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • flow dissector: ICMP support · 972d3876
      Simon Horman committed
      Allow dissection of ICMP(V6) type and code. This should only occur
      if a packet is ICMP(V6) and the dissector has FLOW_DISSECTOR_KEY_ICMP set.
      
      There are currently no users of FLOW_DISSECTOR_KEY_ICMP.
      A follow-up patch will allow FLOW_DISSECTOR_KEY_ICMP to be used by
      the flower classifier.
      Signed-off-by: Simon Horman <simon.horman@netronome.com>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
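The rule stated above (dissect type and code only for ICMP or ICMPv6 packets, and only when the key is requested) might look like this in a standalone model. `MODEL_KEY_ICMP`, `icmp_key_model` and `dissect_icmp()` are hypothetical names and do not mirror the flow dissector's real interface; the only facts relied on are the IP protocol numbers (1 and 58) and that type and code are the first two bytes of the ICMP header.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_KEY_ICMP (1u << 0) /* hypothetical "ICMP key requested" bit */
#define PROTO_ICMP     1         /* IPv4 protocol number for ICMP */
#define PROTO_ICMPV6   58        /* IPv6 next-header value for ICMPv6 */

struct icmp_key_model {
	uint8_t type;
	uint8_t code;
};

/* Fill key from the L4 header; returns false if nothing was dissected. */
static bool dissect_icmp(const uint8_t *l4, size_t len, uint8_t ip_proto,
			 unsigned int used_keys, struct icmp_key_model *key)
{
	if (!(used_keys & MODEL_KEY_ICMP))
		return false;   /* caller did not ask for ICMP fields */
	if (ip_proto != PROTO_ICMP && ip_proto != PROTO_ICMPV6)
		return false;   /* not an ICMP(V6) packet */
	if (len < 2)
		return false;   /* truncated header */
	key->type = l4[0];
	key->code = l4[1];
	return true;
}
```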
  7. 06 Dec, 2016 3 commits
    • net/udp: do not touch skb->peeked unless really needed · a297569f
      Eric Dumazet committed
      In the UDP recvmsg() path we currently access 3 cache lines from an skb
      while holding the receive queue lock, plus another one if the packet is
      dequeued, since we need to change skb->next->prev:
      
      1st cache line (contains ->next/prev pointers, offsets 0x00 and 0x08)
      2nd cache line (skb->len & skb->peeked, offsets 0x80 and 0x8e)
      3rd cache line (skb->truesize/users, offsets 0xe0 and 0xe4)
      
      skb->peeked is only needed to make sure 0-length packets are properly
      handled when MSG_PEEK is used.
      
      I first intended to remove skb->peeked, but the "MSG_PEEK at
      non-zero offset" support added by Sam Kumar makes this impossible.
      
      This patch avoids one cache line miss during the locked section, when
      skb->len and skb->peeked do not have to be read.
      
      It also avoids the skb_set_peeked() cost for non empty UDP datagrams.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: remove type arg from __is_valid_{,xdp_}access · 1afaf661
      Daniel Borkmann committed
      Commit d691f9e8 ("bpf: allow programs to write to certain skb
      fields") pushed access type check outside of __is_valid_access()
      to have different restrictions for socket filters and tc programs.
      type is thus not used anymore within __is_valid_access() and should
      be removed as a function argument. Same for __is_valid_xdp_access()
      introduced by 6a773a15 ("bpf: add XDP prog type for early driver
      filter").
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net_sched: gen_estimator: complete rewrite of rate estimators · 1c0d32fd
      Eric Dumazet committed
      1) Old code was hard to maintain, due to complex lock chains.
         (We probably will be able to remove some kfree_rcu() in callers)
      
      2) Using a single timer to update all estimators does not scale.
      
      3) Code was buggy on 32bit kernel (WRITE_ONCE() on 64bit quantity
         is not supposed to work well)
      
      In this rewrite :
      
      - I removed the RB tree that had to be scanned in
        gen_estimator_active(). qdisc dumps should be much faster.
      
      - Each estimator has its own timer.
      
      - Estimations are maintained in net_rate_estimator structure,
        instead of dirtying the qdisc. Minor, but part of the simplification.
      
      - Reading the estimator uses RCU and a seqcount to provide proper
        support for 32bit kernels.
      
      - We reduce memory need when estimators are not used, since
        we store a pointer, instead of the bytes/packets counters.
      
      - xt_rateest_mt() no longer has to grab a spinlock.
        (In the future, xt_rateest_tg() could be switched to per cpu counters)
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
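The point about reading a 64-bit quantity safely on 32-bit kernels can be illustrated with a sequence counter. The sketch below uses C11 atomics in userspace; it is not the kernel's seqcount API, and `rate_est_model`, `est_write()`, `est_try_read()` and `est_read()` are hypothetical names. The 64-bit rate is deliberately stored as two 32-bit halves to model a machine that cannot load it atomically.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model; not the kernel's struct net_rate_estimator. */
struct rate_est_model {
	atomic_uint seq;          /* even: stable, odd: write in progress */
	_Atomic uint32_t bps_lo;  /* 64-bit rate kept as two 32-bit halves */
	_Atomic uint32_t bps_hi;
};

/* Single writer: bump seq to odd, update both halves, bump seq to even. */
static void est_write(struct rate_est_model *e, uint64_t bps)
{
	unsigned int s = atomic_load_explicit(&e->seq, memory_order_relaxed);

	atomic_store_explicit(&e->seq, s + 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&e->bps_lo, (uint32_t)bps, memory_order_relaxed);
	atomic_store_explicit(&e->bps_hi, (uint32_t)(bps >> 32), memory_order_relaxed);
	atomic_store_explicit(&e->seq, s + 2, memory_order_release);
}

/* One read attempt; false if a write was in progress or intervened. */
static bool est_try_read(struct rate_est_model *e, uint64_t *bps)
{
	unsigned int s0 = atomic_load_explicit(&e->seq, memory_order_acquire);
	uint32_t lo = atomic_load_explicit(&e->bps_lo, memory_order_relaxed);
	uint32_t hi = atomic_load_explicit(&e->bps_hi, memory_order_relaxed);

	atomic_thread_fence(memory_order_acquire);
	if ((s0 & 1) || atomic_load_explicit(&e->seq, memory_order_relaxed) != s0)
		return false;
	*bps = ((uint64_t)hi << 32) | lo;
	return true;
}

/* Lock-free reader: retry until a consistent snapshot is seen. */
static uint64_t est_read(struct rate_est_model *e)
{
	uint64_t bps;

	while (!est_try_read(e, &bps))
		;
	return bps;
}
```

Readers never take a lock; they only retry in the rare case that a write overlaps the read.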
  8. 04 Dec, 2016 4 commits
    • net_sched: gen_estimator: account for timer drifts · 12efa1fa
      Eric Dumazet committed
      Under heavy stress, the timers used in estimators tend to be slowly delayed
      by a few jiffies, leading to inaccuracies.
      
      Let's remember the last scheduled jiffies value so that we get more
      precise estimations, without having to add a multiply/divide in the loop
      to account for the drifts.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
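A minimal sketch of the drift-avoidance idea: the next expiry is computed from the previously scheduled time rather than from the time the handler actually ran, so small delays do not accumulate. `est_timer_model` and `est_rearm()` are hypothetical, and the fallback of rescheduling one tick ahead when the timer is badly late is an assumption of this model, not something stated in the commit message.

```c
#include <assert.h>

/* Hypothetical periodic timer state, in jiffies. */
struct est_timer_model {
	unsigned long next_jiffies;  /* when the timer was last scheduled to fire */
	unsigned long interval;      /* period */
};

/* Wrap-safe "a is at or after b", in the spirit of the kernel's time_after_eq(). */
static int at_or_after(unsigned long a, unsigned long b)
{
	return (long)(a - b) >= 0;
}

/* Called from the timer handler at time now; returns the next expiry. */
static unsigned long est_rearm(struct est_timer_model *t, unsigned long now)
{
	/* Advance from the scheduled time, not from now: no accumulated drift. */
	t->next_jiffies += t->interval;
	/* Assumption of this model: if we are already past it, fire one tick later. */
	if (at_or_after(now, t->next_jiffies))
		t->next_jiffies = now + 1;
	return t->next_jiffies;
}
```

With a period of 50 and a handler that runs 3 jiffies late at 103, the next expiry is still 150 rather than 153.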
    • netns: fix net_generic() "id - 1" bloat · 6af2d5ff
      Alexey Dobriyan committed
      net_generic() function is both a) inline and b) used ~600 times.
      
      It has the following code inside
      
      		...
      	ptr = ng->ptr[id - 1];
      		...
      
      "id" is never a compile-time constant, so the compiler is forced to subtract 1.
      And those decrements or LEA [r32 - 1] instructions add up.
      
      We also start id'ing from 1 to catch bugs where a pernet subsystem id
      is not initialized and is 0. This is a quite pointless idea in general
      (nothing will work, or there is immediate interference with the first
      registered subsystem), but it hints at what needs to be done for code
      size reduction.
      
      Namely, overlaying the allocation of the pointer array and the fixed part
      of the structure at the beginning and using the usual base-0 addressing.
      
      Ids are just cookies, their exact values do not matter, so let's start
      with 3 on x86_64.
      
      Code size savings (oh boy): -4.2 KB
      
      As usual, ignore the initial compiler stupidity part of the table.
      
      	add/remove: 0/0 grow/shrink: 12/670 up/down: 89/-4297 (-4208)
      	function                                     old     new   delta
      	tipc_nametbl_insert_publ                    1250    1270     +20
      	nlmclnt_lookup_host                          686     703     +17
      	nfsd4_encode_fattr                          5930    5941     +11
      	nfs_get_client                              1050    1061     +11
      	register_pernet_operations                   333     342      +9
      	tcf_mirred_init                              843     849      +6
      	tcf_bpf_init                                1143    1149      +6
      	gss_setup_upcall                             990     994      +4
      	idmap_name_to_id                             432     434      +2
      	ops_init                                     274     275      +1
      	nfsd_inject_forget_client                    259     260      +1
      	nfs4_alloc_client                            612     613      +1
      	tunnel_key_walker                            164     163      -1
      
      		...
      
      	tipc_bcbase_select_primary                   392     360     -32
      	mac80211_hwsim_new_radio                    2808    2767     -41
      	ipip6_tunnel_ioctl                          2228    2186     -42
      	tipc_bcast_rcv                               715     672     -43
      	tipc_link_build_proto_msg                   1140    1089     -51
      	nfsd4_lock                                  3851    3796     -55
      	tipc_mon_rcv                                1012     956     -56
      	Total: Before=156643951, After=156639743, chg -0.00%
      Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
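The overlay trick can be sketched in portable userspace C. This is a model only: `gen_hdr`, `MIN_ID`, `gen_alloc()`, `gen_len()` and `gen_lookup()` are hypothetical, and the header layout (a length plus two pointer-sized words standing in for reclaim bookkeeping) is an assumption. The fixed header occupies the first few pointer slots of the array, so valid ids start after it and a lookup is a plain `slots[id]` with no subtraction. On a typical LP64 build this header would take 24 bytes, which gives a first id of 3.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical fixed part that is overlaid on the start of the pointer array. */
struct gen_hdr {
	unsigned int len;   /* total number of pointer slots, header included */
	void *reclaim[2];   /* assumption: stand-in for deferred-free bookkeeping */
};

/* First usable id: number of pointer slots covered by the header. */
#define MIN_ID ((sizeof(struct gen_hdr) + sizeof(void *) - 1) / sizeof(void *))

/* Allocate header plus nr_ids slots as one array of pointers. */
static void **gen_alloc(unsigned int nr_ids)
{
	struct gen_hdr hdr = { .len = (unsigned int)MIN_ID + nr_ids };
	void **slots = calloc(hdr.len, sizeof(void *));

	if (slots)
		memcpy(slots, &hdr, sizeof(hdr));
	return slots;
}

static unsigned int gen_len(void *const *slots)
{
	struct gen_hdr hdr;

	memcpy(&hdr, slots, sizeof(hdr));
	return hdr.len;
}

/* Base-0 addressing: no "id - 1" needed because ids start at MIN_ID. */
static void *gen_lookup(void *const *slots, unsigned int id)
{
	return slots[id];
}
```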
    • netns: add dummy struct inside "struct net_generic" · 9bfc7b99
      Alexey Dobriyan committed
      This is a precursor to fixing the "[id - 1]" bloat inside net_generic().
      
      Name "s" is chosen to complement name "u" often used for dummy unions.
      Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • netns: publish net_generic correctly · 1a9a0592
      Alexey Dobriyan committed
      Publishing the net_generic pointer is done with a silly mistake: the new array is
      published BEFORE setting the freshly acquired pernet subsystem pointer.
      
      	memcpy
      	rcu_assign_pointer
      	kfree_rcu
      	ng->ptr[id - 1] = data;
      
      This bug was introduced with commit dec827d1
      ("[NETNS]: The generic per-net pointers.") in the glorious days of
      chopping networking stack into containers proper 8.5 years ago (whee...)
      
      Why didn't it trigger for so long?
      Well, you need a quite specific set of conditions:
      
      *) race window opens once per pernet subsystem addition
         (read: modprobe or boot)
      
      *) not every pernet subsystem is eligible (need ->id and ->size)
      
      *) not every pernet subsystem is vulnerable (need incorrect or absent
         ordering of register_pernet_subsys() and actually using net_generic())
      
      *) to hide the bug even more, the default is to preallocate 13 pointers, which
         is actually quite a lot. You need IPv6, netfilter, bridging etc. loaded
         together to trigger reallocation in the first place. Trimmed-down
         configs are OK.
      Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
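The corrected ordering (fill the new slot first, publish the array afterwards) can be sketched with C11 atomics. This is a userspace model, not the kernel code: `ptr_array`, `publish_with_slot()` and `lookup_slot()` are hypothetical, a release store stands in for rcu_assign_pointer(), and the old array is freed immediately only because the model is single-threaded, whereas the kernel defers the free with kfree_rcu().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable array of per-subsystem pointers. */
struct ptr_array {
	unsigned int len;
	void *ptr[];
};

static _Atomic(struct ptr_array *) published;

/* Grow the array so that idx fits, store data at idx, then publish. */
static int publish_with_slot(unsigned int idx, void *data)
{
	struct ptr_array *old = atomic_load_explicit(&published, memory_order_relaxed);
	unsigned int old_len = old ? old->len : 0;
	unsigned int new_len = idx + 1 > old_len ? idx + 1 : old_len;
	struct ptr_array *ng = calloc(1, sizeof(*ng) + new_len * sizeof(void *));

	if (!ng)
		return -1;
	ng->len = new_len;
	if (old)
		memcpy(ng->ptr, old->ptr, old_len * sizeof(void *));
	ng->ptr[idx] = data;  /* 1. set the new pointer while the array is private */
	atomic_store_explicit(&published, ng, memory_order_release);  /* 2. publish */
	free(old);            /* model only: a real reader-safe free must be deferred */
	return 0;
}

/* Reader side: sees either the old array or a fully initialized new one. */
static void *lookup_slot(unsigned int idx)
{
	struct ptr_array *ng = atomic_load_explicit(&published, memory_order_acquire);

	return (ng && idx < ng->len) ? ng->ptr[idx] : NULL;
}
```

Had the store to `ng->ptr[idx]` come after the publishing store, a concurrent reader could observe the new array with an empty slot, which is the race the commit describes.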
  9. 03 Dec, 2016 5 commits
  10. 02 Dec, 2016 4 commits
  11. 01 Dec, 2016 1 commit
    • neigh: remove duplicate check for same neigh · 18502acd
      Zhang Shengju committed
      Currently the loop index 'idx' is used as the index in the neigh list of interest.
      It is increased only when the neigh is dumped, so it is not the absolute index in
      the list. Because there is no info recording which neighs have already been scanned
      by a previous loop, the filtered-out neighs are scanned multiple
      times.
      
      This patch makes idx the absolute index in the list; it increases no matter
      whether the neigh is filtered. This prevents the above problem.
      
      And this is in line with other dump functions.
      
      v2:
       - take David Ahern's advice to make a simple change
      Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
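A resumable dump loop with an absolute index might be sketched like this. The model is hypothetical: `dump_entries()` walks a plain array of interface indexes instead of the neighbour hash table, `budget` stands in for a netlink message filling up, and `resume_idx` plays the role of the saved position between calls.

```c
#include <assert.h>

/*
 * Emit the positions of entries whose ifindex matches want_ifindex, at most
 * budget per call. *resume_idx is the absolute position to continue from and
 * advances for every entry visited, whether it was filtered out or emitted.
 */
static int dump_entries(const int *ifindex, int n, int want_ifindex,
			int budget, int *resume_idx, int *out)
{
	int idx, emitted = 0;

	for (idx = 0; idx < n; idx++) {
		if (idx < *resume_idx)
			continue;   /* already handled by an earlier call */
		if (ifindex[idx] != want_ifindex)
			continue;   /* filtered out, but idx still advances */
		if (emitted == budget)
			break;      /* output full: resume exactly here next time */
		out[emitted++] = idx;
	}
	*resume_idx = idx;
	return emitted;
}
```

Because the saved position counts filtered entries too, a later call does not re-examine entries that an earlier call already skipped.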
  12. 30 Nov, 2016 2 commits
    • bpf, xdp: allow to pass flags to dev_change_xdp_fd · 85de8576
      Daniel Borkmann committed
      Add an IFLA_XDP_FLAGS attribute that can be passed for setting up
      XDP along with IFLA_XDP_FD, which eventually allows user space to
      implement typical add/replace/delete logic for programs. Right now,
      calling into dev_change_xdp_fd() will always replace previous programs.
      
      When passed XDP_FLAGS_UPDATE_IF_NOEXIST, we can handle this more
      gracefully when requested by returning -EBUSY in case we try to
      attach a new program, but we find that another one is already
      attached. This will be used by the upcoming front-end for iproute2 as
      well.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
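The add/replace/delete semantics described above can be modelled in a few lines. This is a sketch, not dev_change_xdp_fd() itself: `dev_model`, `model_change_xdp_fd()` and `MODEL_UPDATE_IF_NOEXIST` are hypothetical, a program is represented only by its file descriptor, and a negative fd is assumed to mean "detach".

```c
#include <assert.h>
#include <errno.h>

#define MODEL_UPDATE_IF_NOEXIST (1U << 0) /* hypothetical stand-in for the flag */

/* Hypothetical device state: fd of the attached program, or -1 if none. */
struct dev_model {
	int prog_fd;
};

/*
 * Attach fd (or detach when fd < 0). With MODEL_UPDATE_IF_NOEXIST set,
 * refuse to replace an already attached program and return -EBUSY.
 */
static int model_change_xdp_fd(struct dev_model *dev, int fd, unsigned int flags)
{
	if (fd >= 0 && (flags & MODEL_UPDATE_IF_NOEXIST) && dev->prog_fd >= 0)
		return -EBUSY;
	dev->prog_fd = fd;
	return 0;
}
```

An "add" operation would pass the flag, a "replace" would omit it, and a "delete" would pass a negative fd.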
    • tcp: SOF_TIMESTAMPING_OPT_STATS option for SO_TIMESTAMPING · 1c885808
      Francis Yan committed
      This patch exports the sender chronograph stats via the socket
      SO_TIMESTAMPING channel. Currently we can instrument how long a
      particular application unit of data was queued in TCP by tracking
      SOF_TIMESTAMPING_TX_SOFTWARE and SOF_TIMESTAMPING_TX_SCHED. Having
      these sender chronograph stats exported simultaneously along with
      these timestamps allows further breaking down the various sender
      limitations.  For example, a video server can tell if a particular
      chunk of video on a connection takes a long time to deliver because
      TCP was experiencing a small receive window. It is not possible to
      tell before this patch without packet traces.
      
      To prepare these stats, the user needs to set
      SOF_TIMESTAMPING_OPT_STATS and SOF_TIMESTAMPING_OPT_TSONLY flags
      while requesting other SOF_TIMESTAMPING TX timestamps. When the
      timestamps are available in the error queue, the stats are returned
      in a separate control message of type SCM_TIMESTAMPING_OPT_STATS,
      in a list of TLVs (struct nlattr) of types: TCP_NLA_BUSY_TIME,
      TCP_NLA_RWND_LIMITED, TCP_NLA_SNDBUF_LIMITED. Unit is microsecond.
      Signed-off-by: Francis Yan <francisyyan@gmail.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
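Since the stats arrive as a list of TLVs, a receiver has to walk that list. The sketch below is a self-contained model of such a walk: `tlv_hdr` mirrors the two 16-bit fields (length, then type) and 4-byte alignment of struct nlattr, but `tlv_put_u64()`, `tlv_get_u64()` and the numeric type values used with them are hypothetical, and the assumption that each stat is carried as a 64-bit value is mine, not stated in the commit message. Real code would take the payload from the control message and use the attribute types from the kernel headers.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Mirrors the layout of struct nlattr: total length, then type. */
struct tlv_hdr {
	uint16_t len;   /* header plus payload, without padding */
	uint16_t type;
};

#define TLV_ALIGN(n) (((n) + 3u) & ~3u) /* attributes are 4-byte aligned */

/* Append one u64 attribute at offset off; returns the new offset. */
static size_t tlv_put_u64(unsigned char *buf, size_t off, uint16_t type, uint64_t val)
{
	struct tlv_hdr h = {
		.len = (uint16_t)(sizeof(struct tlv_hdr) + sizeof(val)),
		.type = type,
	};

	memcpy(buf + off, &h, sizeof(h));
	memcpy(buf + off + sizeof(h), &val, sizeof(val));
	return off + TLV_ALIGN(h.len);
}

/* Find a u64 attribute of the given type; returns 1 and fills *val if found. */
static int tlv_get_u64(const unsigned char *buf, size_t buflen, uint16_t type, uint64_t *val)
{
	size_t off = 0;

	while (off <= buflen && buflen - off >= sizeof(struct tlv_hdr)) {
		struct tlv_hdr h;

		memcpy(&h, buf + off, sizeof(h));
		if (h.len < sizeof(h) || h.len > buflen - off)
			return 0;   /* malformed attribute: stop parsing */
		if (h.type == type && h.len == sizeof(h) + sizeof(*val)) {
			memcpy(val, buf + off + sizeof(h), sizeof(*val));
			return 1;
		}
		off += TLV_ALIGN(h.len);
	}
	return 0;
}
```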
  13. 28 Nov, 2016 1 commit
  14. 26 Nov, 2016 4 commits
  15. 25 Nov, 2016 4 commits
  16. 24 Nov, 2016 1 commit
  17. 23 Nov, 2016 2 commits