1. 18 May, 2015 5 commits
  2. 15 May, 2015 14 commits
    • F
      net: core: set qdisc pkt len before tc_classify · 3365495c
      Florian Westphal committed
      commit d2788d34 ("net: sched: further simplify handle_ing")
      removed the call to qdisc_enqueue_root().
      
      However, after this removal we no longer set qdisc pkt length.
      This breaks traffic policing on ingress.
      
      This is the minimum fix: set qdisc pkt length before tc_classify.
      
      Only setting the length does remove support for 'stab' on ingress, but
      as Alexei pointed out:
       "Though it was allowed to add qdisc_size_table to ingress, it's useless.
        Nothing takes advantage of recomputed qdisc_pkt_len".
      
      Jamal suggested using qdisc_pkt_len_init(), but as Eric mentioned, that
      would result in qdisc_pkt_len_init() no longer being inlined due to the
      additional second call site.
      
      Ingress policing is rare, and GRO doesn't really work that well with police
      on ingress, as we see packets > mtu and drop skbs that, without
      aggregation, would still have fitted the policer budget.
      Thus to have reliable/smooth ingress policing GRO has to be turned off.
      
      Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Fixes: d2788d34 ("net: sched: further simplify handle_ing")
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Alexei Starovoitov <ast@plumgrid.com>
      Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3365495c
    • N
      netns: fix unbalanced spin_lock on error · 0c58a2db
      Nicolas Dichtel committed
      Unlock was missing on error path.
      
      Fixes: 95f38411 ("netns: use a spin_lock to protect nsid management")
      Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0c58a2db
    • A
      ip_tunnel: Report Rx dropped in ip_tunnel_get_stats64 · c24a5964
      Alexander Duyck committed
      The rx_dropped stat wasn't being reported when ip_tunnel_get_stats64 was
      called.  This led to some confusing results while debugging, as I was
      seeing rx_errors increment but no other value that pointed me toward the
      type of error being seen.
      
      This change corrects that by using netdev_stats_to_stats64 to copy all
      available dev stats instead of just the few that were hand picked.
      Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c24a5964
    • W
      packet: fix warnings in rollover lock contention · 54d7c01d
      Willem de Bruijn committed
      Avoid two xchg calls whose return values were unused, causing a
      warning on some architectures.
      
      The relevant variable is a hint and read without mutual exclusion.
      This fix makes all writers hold the receive_queue lock.
      Suggested-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      54d7c01d
    • Y
      tipc: use sock_create_kern interface to create kernel socket · fa787ae0
      Ying Xue committed
      After commit eeb1bd5c ("net: Add a struct net parameter to
      sock_create_kern"), we should use sock_create_kern() to create kernel
      socket as the interface doesn't reference count struct net any more.
      Signed-off-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      fa787ae0
    • B
      cls_flower: Fix compile error · dd3aa3b5
      Brian Haley committed
      Fix compile error in net/sched/cls_flower.c
      
          net/sched/cls_flower.c: In function ‘fl_set_key’:
          net/sched/cls_flower.c:240:3: error: implicit declaration of
           function ‘tcf_change_indev’ [-Werror=implicit-function-declaration]
             err = tcf_change_indev(net, tb[TCA_FLOWER_INDEV]);
      
      Introduced in 77b9900e
      
      Fixes: 77b9900e ("tc: introduce Flower classifier")
      Signed-off-by: Brian Haley <brian.haley@hp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      dd3aa3b5
    • J
      tipc: add packet sequence number at instant of transmission · dd3f9e70
      Jon Paul Maloy committed
      Currently, the packet sequence number is updated and added to each
      packet at the moment a packet is added to the link backlog queue.
      This is wasteful, since it forces the code to traverse the send
      packet list packet by packet when adding them to the backlog queue.
      It would be better to just splice the whole packet list into the
      backlog queue when that is the right action to do.
      
      In this commit, we make this change. Also, since the sequence numbers
      can no longer be assigned to the packets at the moment they are added
      to the backlog queue, we instead calculate and add them at the moment
      of transmission, when the backlog queue has to be traversed anyway.
      We do this in the function tipc_link_push_packet().
      Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
      Reviewed-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      dd3f9e70
    • J
      tipc: improve link congestion algorithm · f21e897e
      Jon Paul Maloy committed
      The link congestion algorithm used until now has two problems.
      
      - It is too generous towards lower-level messages in situations of high
        load by giving "absolute" bandwidth guarantees to the different
        priority levels. LOW traffic is guaranteed 10%, MEDIUM is guaranteed
        20%, HIGH is guaranteed 30%, and CRITICAL is guaranteed 40% of the
        available bandwidth. But, in the absence of higher level traffic, the
        ratio between two distinct levels becomes unreasonable. E.g. if there
        is only LOW and MEDIUM traffic on a system, the former is guaranteed
        1/3 of the bandwidth, and the latter 2/3. This again means that if
        there is e.g. one LOW user and 10 MEDIUM users, the former will have
        33.3% of the bandwidth, and the others will have to compete for the
        remainder, i.e. each will end up with 6.7% of the capacity.
      
      - Packets of type MSG_BUNDLER are created at SYSTEM importance level,
        but only after the packets bundled into it have passed the congestion
        test for their own respective levels. Since bundled packets don't
        result in incrementing the level counter for their own importance,
        only occasionally for the SYSTEM level counter, they do in practice
        obtain SYSTEM level importance. Hence, the current implementation
        provides a gap in the congestion algorithm that in the worst case
        may lead to a link reset.
      
      We now refine the congestion algorithm as follows:
      
      - A message is accepted to the link backlog only if its own level
        counter, and all superior level counters, permit it.
      
      - The importance of a created bundle packet is set according to its
        contents. A bundle packet created from messages at levels LOW to
        CRITICAL is given importance level CRITICAL, while a bundle created
        from a SYSTEM level message is given importance SYSTEM. In the latter
        case only subsequent SYSTEM level messages are allowed to be bundled
        into it.
      
      This solves the first problem described above, by making the bandwidth
      guarantee relative to the total number of users at all levels; only
      the upper limit for each level remains absolute. In the example
      described above, the single LOW user would use 1/11th of the bandwidth,
      the same as each of the ten MEDIUM users, but he still has the same
      guarantee against starvation as the latter ones.
      
      The fix also solves the second problem. If the CRITICAL level is filled
      up by bundle packets of that level, no lower level packets will be
      accepted any more.
      Suggested-by: Gergely Kiss <gergely.kiss@ericsson.com>
      Reviewed-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f21e897e
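The refined admission rule in the commit above ("its own level counter, and all superior level counters, permit it") can be sketched in plain C. This is a minimal userspace sketch, not the TIPC code: the type names, the `backlog_admits()` helper, and the per-level limits used in any example are hypothetical, and the sketch assumes "permit" means the level's queue length is still below its limit.

```c
#include <assert.h>
#include <stdbool.h>

/* Importance levels named in the commit message, lowest first. */
enum importance { LOW, MEDIUM, HIGH, CRITICAL, SYSTEM, NUM_LEVELS };

/* Hypothetical per-level backlog accounting. */
struct backlog_level {
	unsigned int len;   /* messages currently queued at this level */
	unsigned int limit; /* absolute upper limit for this level */
};

/*
 * Admit a message only if its own level and every superior level are
 * still below their limits (assumed meaning of "permit").
 */
static bool backlog_admits(const struct backlog_level bl[NUM_LEVELS],
			   enum importance imp)
{
	assert(imp < NUM_LEVELS);
	for (int i = imp; i < NUM_LEVELS; i++)
		if (bl[i].len >= bl[i].limit)
			return false;
	return true;
}
```

With this rule, a full CRITICAL level blocks LOW, MEDIUM and HIGH messages as well, which matches the commit's statement that no lower level packets are accepted once CRITICAL is filled up.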
    • J
      tipc: simplify link supervision checkpointing · cd4eee3c
      Jon Paul Maloy committed
      We change the sequence number checkpointing that is performed
      by the timer in order to discover if the peer is active. Currently,
      we store a checkpoint of the next expected sequence number "rcv_nxt"
      at each timer expiration, and compare it to the current expected
      number at next timeout expiration. Instead, we now use the already
      existing field "silent_intv_cnt" for this task. We step the counter
      at each timeout expiration, and zero it at each valid received packet.
      If no valid packet has been received from the peer after "abort_limit"
      number of silent timer intervals, the link is declared faulty and reset.
      
      We also remove the multiple instances of timer activation from inside
      the FSM function "link_state_event()", and now do it at only one place;
      at the end of the timer function itself.
      Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
      Reviewed-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      cd4eee3c
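The supervision scheme described in the commit above (step a counter at each timer expiration, zero it on each valid packet, reset the link at "abort_limit") can be sketched as follows. This is a minimal userspace sketch: the struct and function names are hypothetical, only the field names `silent_intv_cnt` and `abort_limit` come from the commit message, and the exact comparison (`>=`) is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical container for the two fields named in the commit. */
struct link_sup {
	uint32_t silent_intv_cnt; /* timer intervals since last valid packet */
	uint32_t abort_limit;     /* silent intervals tolerated before reset */
};

/* Called for each valid packet received from the peer. */
static void link_sup_rcv(struct link_sup *l)
{
	l->silent_intv_cnt = 0;
}

/* Called at each timer expiration; returns true if the link should be reset. */
static bool link_sup_timeout(struct link_sup *l)
{
	assert(l->abort_limit > 0);
	l->silent_intv_cnt++;
	return l->silent_intv_cnt >= l->abort_limit;
}
```

Compared with storing and comparing an "rcv_nxt" checkpoint, this needs only one counter, and any valid packet between two timeouts is enough to keep the link alive.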
    • J
      tipc: rename fields in struct tipc_link · a97b9d3f
      Jon Paul Maloy committed
      We rename some fields in struct tipc_link, in order to give them more
      descriptive names:
      
      next_in_no         -> rcv_nxt
      next_out_no        -> snd_nxt
      fsm_msg_cnt        -> silent_intv_cnt
      cont_intv          -> keepalive_intv
      last_retransmitted -> last_retransm
      
      There are no functional changes in this commit.
      Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
      Reviewed-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a97b9d3f
    • J
      tipc: simplify packet sequence number handling · e4bf4f76
      Jon Paul Maloy committed
      Although the sequence number in the TIPC protocol is 16 bits, we have
      until now stored it internally as an unsigned 32-bit integer.
      We got around this by always doing explicit modulo-65536 operations
      whenever we need to access a sequence number.
      
      We now change the incoming and outgoing sequence numbers to unsigned
      16-bit integers, and remove the modulo operations where applicable.
      
      We also move the arithmetic inline functions for 16 bit integers
      to core.h, and the function buf_seqno() to msg.h, so they can easily
      be accessed from anywhere in the code.
      Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
      Reviewed-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e4bf4f76
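To illustrate why storing sequence numbers as unsigned 16-bit integers removes the need for explicit modulo operations, here is a minimal userspace sketch of wraparound-safe arithmetic. The helper names (`seq_diff`, `seq_less_eq`, `seq_less`) are hypothetical and are not claimed to match the inline functions the commit moves to core.h; the "half the sequence space" comparison window is also an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Distance from 'from' to 'to'; the cast makes it wrap at 2^16. */
static inline uint16_t seq_diff(uint16_t from, uint16_t to)
{
	return (uint16_t)(to - from);
}

/* True if 'left' is at or before 'right' within half the sequence space. */
static inline int seq_less_eq(uint16_t left, uint16_t right)
{
	return seq_diff(left, right) < 32768u;
}

/* True if 'left' is strictly before 'right'. */
static inline int seq_less(uint16_t left, uint16_t right)
{
	return left != right && seq_less_eq(left, right);
}

/* Usage example: comparisons stay correct across the 65535 -> 0 wrap. */
static void seq_example(void)
{
	assert(seq_less(65535, 0));
	assert(!seq_less(0, 65535));
	assert(seq_diff(65530, 4) == 10);
}
```

Because the subtraction result is truncated to 16 bits by the type itself, no separate masking step is needed at each access.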
    • J
      tipc: simplify include dependencies · a6bf70f7
      Jon Paul Maloy committed
      When we try to add new inline functions in the code, we sometimes
      run into circular include dependencies.
      
      The main problem is that the file core.h, which really should be at
      the root of the dependency chain, instead is a leaf. I.e., core.h
      includes a number of header files that themselves should be allowed
      to include core.h. In reality this is unnecessary, because core.h does
      not need to know the full signature of any of the structs it refers to,
      only their type declaration.
      
      In this commit, we remove all dependencies from core.h towards any
      other tipc header file.
      
      As a consequence of this change, we can now move the function
      tipc_own_addr(net) from addr.c to addr.h, and make it inline.
      
      There are no functional changes in this commit.
      Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
      Reviewed-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a6bf70f7
    • J
      tipc: simplify link timer handling · 75b44b01
      Jon Paul Maloy committed
      Prior to this commit, the link timer has been running at a "continuity
      interval" of configured link tolerance/4. When a timer wakes up and
      discovers that there has been no sign of life from the peer during the
      previous interval, it divides its own timer interval by another factor
      four, and starts sending one probe per new interval. When the configured
      link tolerance time has passed without answer, i.e. after 16 unacked
      probes, the link is declared faulty and reset.
      
      This is unnecessarily complex. It is sufficient to continue with the
      original continuity interval, and instead reset the link after four
      missed probe responses. This makes the timer handling in the link
      simpler, and opens up for some planned later changes in this area.
      This commit implements this change.
      Reviewed-by: Richard Alpe <richard.alpe@ericsson.com>
      Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
      Reviewed-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      75b44b01
    • J
      tipc: simplify resetting and disabling of bearers · b1c29f6b
      Jon Paul Maloy committed
      Since commit 4b475e3f2f8e4e241de101c8240f1d74d0470494
      ("tipc: eliminate delayed link deletion at link failover") the extra
      boolean parameter "shutting_down" is not any longer needed for the
      functions bearer_disable() and tipc_link_delete_list().
      
      Furthermore, the function tipc_link_reset_links(), called from
      bearer_reset(), is now unnecessary. We can just as well delete
      all the links, as we do in bearer_disable(), and start over with
      creating new links.
      
      This commit introduces those changes.
      Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
      Reviewed-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b1c29f6b
  3. 14 May, 2015 21 commits
    • P
      netfilter: add netfilter ingress hook after handle_ing() under unique static key · e687ad60
      Pablo Neira committed
      This patch adds the Netfilter ingress hook just after the existing tc ingress
      hook, which seems to be the consensus solution for this.
      
      Note that the Netfilter hook resides under the global static key that enables
      ingress filtering. Nonetheless, Netfilter still also has its own static key for
      minimal impact on the existing handle_ing().
      
      * Without this patch:
      
      Result: OK: 6216490(c6216338+d152) usec, 100000000 (60byte,0frags)
        16086246pps 7721Mb/sec (7721398080bps) errors: 100000000
      
          42.46%  kpktgend_0   [kernel.kallsyms]   [k] __netif_receive_skb_core
          25.92%  kpktgend_0   [kernel.kallsyms]   [k] kfree_skb
           7.81%  kpktgend_0   [pktgen]            [k] pktgen_thread_worker
           5.62%  kpktgend_0   [kernel.kallsyms]   [k] ip_rcv
           2.70%  kpktgend_0   [kernel.kallsyms]   [k] netif_receive_skb_internal
           2.34%  kpktgend_0   [kernel.kallsyms]   [k] netif_receive_skb_sk
           1.44%  kpktgend_0   [kernel.kallsyms]   [k] __build_skb
      
      * With this patch:
      
      Result: OK: 6214833(c6214731+d101) usec, 100000000 (60byte,0frags)
        16090536pps 7723Mb/sec (7723457280bps) errors: 100000000
      
          41.23%  kpktgend_0      [kernel.kallsyms]  [k] __netif_receive_skb_core
          26.57%  kpktgend_0      [kernel.kallsyms]  [k] kfree_skb
           7.72%  kpktgend_0      [pktgen]           [k] pktgen_thread_worker
           5.55%  kpktgend_0      [kernel.kallsyms]  [k] ip_rcv
           2.78%  kpktgend_0      [kernel.kallsyms]  [k] netif_receive_skb_internal
           2.06%  kpktgend_0      [kernel.kallsyms]  [k] netif_receive_skb_sk
           1.43%  kpktgend_0      [kernel.kallsyms]  [k] __build_skb
      
      * Without this patch + tc ingress:
      
              tc filter add dev eth4 parent ffff: protocol ip prio 1 \
                      u32 match ip dst 4.3.2.1/32
      
      Result: OK: 9269001(c9268821+d179) usec, 100000000 (60byte,0frags)
        10788648pps 5178Mb/sec (5178551040bps) errors: 100000000
      
          40.99%  kpktgend_0   [kernel.kallsyms]  [k] __netif_receive_skb_core
          17.50%  kpktgend_0   [kernel.kallsyms]  [k] kfree_skb
          11.77%  kpktgend_0   [cls_u32]          [k] u32_classify
           5.62%  kpktgend_0   [kernel.kallsyms]  [k] tc_classify_compat
           5.18%  kpktgend_0   [pktgen]           [k] pktgen_thread_worker
           3.23%  kpktgend_0   [kernel.kallsyms]  [k] tc_classify
           2.97%  kpktgend_0   [kernel.kallsyms]  [k] ip_rcv
           1.83%  kpktgend_0   [kernel.kallsyms]  [k] netif_receive_skb_internal
           1.50%  kpktgend_0   [kernel.kallsyms]  [k] netif_receive_skb_sk
           0.99%  kpktgend_0   [kernel.kallsyms]  [k] __build_skb
      
      * With this patch + tc ingress:
      
              tc filter add dev eth4 parent ffff: protocol ip prio 1 \
                      u32 match ip dst 4.3.2.1/32
      
      Result: OK: 9308218(c9308091+d126) usec, 100000000 (60byte,0frags)
        10743194pps 5156Mb/sec (5156733120bps) errors: 100000000
      
          42.01%  kpktgend_0   [kernel.kallsyms]   [k] __netif_receive_skb_core
          17.78%  kpktgend_0   [kernel.kallsyms]   [k] kfree_skb
          11.70%  kpktgend_0   [cls_u32]           [k] u32_classify
           5.46%  kpktgend_0   [kernel.kallsyms]   [k] tc_classify_compat
           5.16%  kpktgend_0   [pktgen]            [k] pktgen_thread_worker
           2.98%  kpktgend_0   [kernel.kallsyms]   [k] ip_rcv
           2.84%  kpktgend_0   [kernel.kallsyms]   [k] tc_classify
           1.96%  kpktgend_0   [kernel.kallsyms]   [k] netif_receive_skb_internal
           1.57%  kpktgend_0   [kernel.kallsyms]   [k] netif_receive_skb_sk
      
      Note that the results are very similar before and after.
      
      I can see gcc gets the code under the ingress static key out of the hot path.
      Then, on that cold branch, it generates the code to accommodate the netfilter
      ingress static key. My explanation for this is that this reduces the pressure
      on the instruction cache for non-users as the new code is out of the hot path,
      and it comes with minimal impact for tc ingress users.
      
      Using gcc version 4.8.4 on:
      
      Architecture:          x86_64
      CPU op-mode(s):        32-bit, 64-bit
      Byte Order:            Little Endian
      CPU(s):                8
      [...]
      L1d cache:             16K
      L1i cache:             64K
      L2 cache:              2048K
      L3 cache:              8192K
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      Acked-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e687ad60
    • P
      net: add CONFIG_NET_INGRESS to enable ingress filtering · 1cf51900
      Pablo Neira committed
      This new config switch enables the ingress filtering infrastructure that is
      controlled through the ingress_needed static key. This prepares the
      introduction of the Netfilter ingress hook that resides under this unique
      static key.
      
      Note that CONFIG_SCH_INGRESS automatically selects this, which should be no
      problem since it also depends on CONFIG_NET_CLS_ACT.
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      Acked-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1cf51900
    • P
      f7191483
    • A
      net: Reserve skb headroom and set skb->dev even if using __alloc_skb · a080e7bd
      Alexander Duyck committed
      When I had inlined __alloc_rx_skb into __netdev_alloc_skb and
      __napi_alloc_skb I had overlooked the fact that there was a return in the
      __alloc_rx_skb.  As a result we weren't reserving headroom or setting the
      skb->dev in certain cases.  This change corrects that by adding a couple of
      jump labels to jump to depending on __alloc_skb either succeeding or failing.
      
      Fixes: 9451980a ("net: Use cached copy of pfmemalloc to avoid accessing page")
      Reported-by: Felipe Balbi <balbi@ti.com>
      Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
      Tested-by: Kevin Hilman <khilman@linaro.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a080e7bd
    • J
      geneve: Rename support library as geneve_core · 11e1fa46
      John W. Linville committed
      net/ipv4/geneve.c -> net/ipv4/geneve_core.c
      
      This name better reflects the purpose of the module.
      Signed-off-by: John W. Linville <linville@tuxdriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      11e1fa46
    • J
      geneve: move definition of geneve_hdr() to geneve.h · 35d32e8f
      John W. Linville committed
      This is a static inline with identical definitions in multiple places...
      Signed-off-by: John W. Linville <linville@tuxdriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      35d32e8f
    • J
      geneve: remove MODULE_ALIAS_RTNL_LINK from net/ipv4/geneve.c · 125907ae
      John W. Linville committed
      This file is essentially a library for implementing the geneve
      encapsulation protocol.  The file does not register any rtnl_link_ops,
      so the MODULE_ALIAS_RTNL_LINK macro is inappropriate here.
      Signed-off-by: John W. Linville <linville@tuxdriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      125907ae
    • W
      packet: rollover statistics · a9b63918
      Willem de Bruijn committed
      Rollover indicates exceptional conditions. Export a counter to inform
      socket owners of this state.
      
      If no socket with sufficient room is found, rollover fails. Also count
      these events.
      
      Finally, also count when flows are rolled over early thanks to huge
      flow detection, to validate its correctness.
      
      Tested:
        Read counters in bench_rollover on all other tests in the patchset
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a9b63918
    • W
      packet: rollover huge flows before small flows · 3b3a5b0a
      Willem de Bruijn committed
      Migrate flows from a socket to another socket in the fanout group not
      only when the socket is full. Start migrating huge flows early, to
      divert possible 4-tuple attacks without affecting normal traffic.
      
      Introduce fanout_flow_is_huge(). This detects huge flows, which are
      defined as taking up more than half the load. It does so cheaply, by
      storing the rxhashes of the N most recent packets. If over half of
      these are the same rxhash as the current packet, then drop it. This
      only protects against 4-tuple attacks. N is chosen to fit all data in
      a single cache line.
      
      Tested:
        Ran bench_rollover for 10 sec with 1.5 Mpps of single flow input.
      
          lpbb5:/export/hda3/willemb# ./bench_rollover -l 1000 -r -s
          cpu         rx       rx.k     drop.k   rollover     r.huge   r.failed
            0         14         14          0          0          0          0
            1         20         20          0          0          0          0
            2         16         16          0          0          0          0
            3    6168824    6168824          0    4867721    4867721          0
            4    4867741    4867741          0          0          0          0
            5         12         12          0          0          0          0
            6         15         15          0          0          0          0
            7         17         17          0          0          0          0
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3b3a5b0a
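The huge-flow detection described in the commit above (remember the rxhashes of the N most recent packets and check whether more than half match the current packet) can be sketched in userspace C. This is a minimal sketch, not the kernel's fanout_flow_is_huge(): the struct, the value N = 16, and the round-robin replacement of history slots are assumptions chosen to keep the example deterministic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed N: sixteen 4-byte hashes fit in one 64-byte cache line. */
#define HIST_LEN 16

/* Hypothetical per-socket history of recent packet hashes. */
struct flow_history {
	uint32_t hash[HIST_LEN]; /* rxhashes of recent packets */
	unsigned int next;       /* slot to overwrite next */
};

/* Returns true if more than half of the remembered packets share rxhash. */
static bool flow_is_huge(struct flow_history *h, uint32_t rxhash)
{
	unsigned int count = 0;

	assert(h->next < HIST_LEN);
	for (unsigned int i = 0; i < HIST_LEN; i++)
		if (h->hash[i] == rxhash)
			count++;

	/* Remember the current packet, replacing the oldest slot. */
	h->hash[h->next] = rxhash;
	h->next = (h->next + 1) % HIST_LEN;

	return count > HIST_LEN / 2;
}
```

The check costs one pass over a single cache line per packet, which is why it is cheap enough to run before the socket is actually full.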
    • W
      packet: rollover lock contention avoidance · 2ccdbaa6
      Willem de Bruijn committed
      Rollover has to call packet_rcv_has_room on sockets in the fanout
      group to find a socket to migrate to. This operation is expensive
      especially if the packet sockets use rings, when a lock has to be
      acquired.
      
      Avoid pounding on the lock by all sockets by temporarily marking a
      socket as "under memory pressure" when such pressure is detected.
      While set, only the socket owner may call packet_rcv_has_room on the
      socket. Once it detects normal conditions, it clears the flag. The
      socket is not used as a victim by any other socket in the meantime.
      
      Under reasonably balanced load, each socket writer frequently calls
      packet_rcv_has_room and clears its own pressure field. As a backup
      for when the socket is rarely written to, also clear the flag on
      reading (packet_recvmsg, packet_poll) if this can be done cheaply
      (i.e., without calling packet_rcv_has_room). This is only for
      edge cases.
      
      Tested:
        Ran bench_rollover: a process with 8 sockets in a single fanout
        group, each pinned to a single cpu that receives one nic recv
        interrupt. RPS and RFS are disabled. The benchmark uses packet
        rx_ring, which has to take a lock when determining whether a
        socket has room.
      
        Sent 3.5 Mpps of UDP traffic with sufficient entropy to spread
        uniformly across the packet sockets (and inserted an iptables
        rule to drop in PREROUTING to avoid protocol stack processing).
      
        Without this patch, all sockets try to migrate traffic to
        neighbors, causing lock contention when searching for a non-
        empty neighbor. The lock is the top 9 entries.
      
          perf record -a -g sleep 5
      
          -  17.82%   bench_rollover  [kernel.kallsyms]    [k] _raw_spin_lock
             - _raw_spin_lock
                - 99.00% spin_lock
          	 + 81.77% packet_rcv_has_room.isra.41
          	 + 18.23% tpacket_rcv
                + 0.84% packet_rcv_has_room.isra.41
          +   5.20%      ksoftirqd/6  [kernel.kallsyms]    [k] _raw_spin_lock
          +   5.15%      ksoftirqd/1  [kernel.kallsyms]    [k] _raw_spin_lock
          +   5.14%      ksoftirqd/2  [kernel.kallsyms]    [k] _raw_spin_lock
          +   5.12%      ksoftirqd/7  [kernel.kallsyms]    [k] _raw_spin_lock
          +   5.12%      ksoftirqd/5  [kernel.kallsyms]    [k] _raw_spin_lock
          +   5.10%      ksoftirqd/4  [kernel.kallsyms]    [k] _raw_spin_lock
          +   4.66%      ksoftirqd/0  [kernel.kallsyms]    [k] _raw_spin_lock
          +   4.45%      ksoftirqd/3  [kernel.kallsyms]    [k] _raw_spin_lock
          +   1.55%   bench_rollover  [kernel.kallsyms]    [k] packet_rcv_has_room.isra.41
      
        On net-next with this patch, this lock contention is no longer a
        top entry. Most time is spent in the actual read function. Next up
        are other locks:
      
          +  15.52%  bench_rollover  bench_rollover     [.] reader
          +   4.68%         swapper  [kernel.kallsyms]  [k] memcpy_erms
          +   2.77%         swapper  [kernel.kallsyms]  [k] packet_lookup_frame.isra.51
          +   2.56%     ksoftirqd/1  [kernel.kallsyms]  [k] memcpy_erms
          +   2.16%         swapper  [kernel.kallsyms]  [k] tpacket_rcv
          +   1.93%         swapper  [kernel.kallsyms]  [k] mlx4_en_process_rx_cq
      
        Looking closer at the remaining _raw_spin_lock, the cost of probing
        in rollover is now comparable to the cost of taking the lock later
        in tpacket_rcv.
      
          -   1.51%         swapper  [kernel.kallsyms]  [k] _raw_spin_lock
             - _raw_spin_lock
                + 33.41% packet_rcv_has_room
                + 28.15% tpacket_rcv
                + 19.54% enqueue_to_backlog
                + 6.45% __free_pages_ok
                + 2.78% packet_rcv_fanout
                + 2.13% fanout_demux_rollover
                + 2.01% netif_receive_skb_internal
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2ccdbaa6
    • W
      packet: rollover only to socket with headroom · 9954729b
      Willem de Bruijn committed
      Only migrate flows to sockets that have sufficient headroom, where
      sufficient is defined as having at least 25% empty space.
      
      The kernel has three different buffer types: a regular socket, a ring
      with frames (TPACKET_V[12]) or a ring with blocks (TPACKET_V3). The
      latter two do not expose a read pointer to the kernel, so headroom is
      not computed easily. All three need a different implementation to
      estimate free space.
      
      Tested:
        Ran bench_rollover for 10 sec with 1.5 Mpps of single flow input.
      
        bench_rollover has as many sockets as there are NIC receive queues
        in the system. Each socket is owned by a process that is pinned to
        one of the receive cpus. RFS is disabled. RPS is enabled with an
        identity mapping (cpu x -> cpu x), to count drops with softnettop.
      
          lpbb5:/export/hda3/willemb# ./bench_rollover -r -l 1000 -s
          Press [Enter] to exit
      
          cpu         rx       rx.k     drop.k   rollover     r.huge   r.failed
            0         16         16          0          0          0          0
            1         21         21          0          0          0          0
            2    5227502    5227502          0          0          0          0
            3         18         18          0          0          0          0
            4    6083289    6083289          0    5227496          0          0
            5         22         22          0          0          0          0
            6         21         21          0          0          0          0
            7          9          9          0          0          0          0
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9954729b
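For the simplest of the three buffer types in the commit above, the regular socket, the "at least 25% empty space" test can be sketched as below. This is a minimal userspace sketch under stated assumptions: the function name and its two parameters (receive buffer size and bytes currently queued) are hypothetical, and the ring cases (TPACKET_V[12], TPACKET_V3) are not covered because they need different estimates.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Regular-socket case only: free space is the receive buffer size
 * minus the bytes already queued. Headroom is sufficient when at
 * least a quarter of the buffer is free.
 */
static bool has_headroom(unsigned int rcvbuf, unsigned int queued)
{
	assert(queued <= rcvbuf);
	unsigned int avail = rcvbuf - queued;

	/* Multiply instead of dividing to avoid rounding the threshold. */
	return (unsigned long long)avail * 4 >= rcvbuf;
}
```

A socket that fails this test would be skipped as a rollover target, so flows migrate only to sockets that can absorb them.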
    • W
      packet: rollover prepare: per-socket state · 0648ab70
      Willem de Bruijn committed
      Replace rollover state per fanout group with state per socket. Future
      patches will add fields to the new structure.
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0648ab70
    • W
      packet: rollover prepare: move code out of callsites · ad377cab
      Willem de Bruijn committed
      packet_rcv_fanout calls fanout_demux_rollover twice. Move all rollover
      logic into the callee to simplify these callsites, especially with
      upcoming changes.
      
      The main difference between the two callsites is that the FLAG
      variant tests whether the socket previously selected by another
      mode (RR, RND, HASH, ..) has room before migrating flows, whereas the
      rollover mode has no original socket to test.
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ad377cab
    • E
      ipv4: __ip_local_out_sk() is static · 7d771aaa
      Eric Dumazet committed
      __ip_local_out_sk() is only used from net/ipv4/ip_output.c
      
      net/ipv4/ip_output.c:94:5: warning: symbol '__ip_local_out_sk' was not
      declared. Should it be static?
      
      Fixes: 7026b1dd ("netfilter: Pass socket pointer down through okfn().")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7d771aaa
    • E
      tcp/dccp: tw_timer_handler() is static · 216f8bb9
      Eric Dumazet committed
      tw_timer_handler() is only used from net/ipv4/inet_timewait_sock.c
      
      Fixes: 789f558c ("tcp/dccp: get rid of central timewait timer")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      216f8bb9
    • J
      tc: introduce Flower classifier · 77b9900e
      Jiri Pirko committed
      This patch introduces a flow-based filter. So far, only the most essential
      packet fields are supported.
      
      This patch is only the first step. Many performance improvements are
      still possible, and many features are missing for now. They will be
      addressed in follow-up patches.
      Signed-off-by: Jiri Pirko <jiri@resnulli.us>
      Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      77b9900e
    • J
      59346afe
    • J
      67a900cc
    • J
      flow_dissector: introduce support for ipv6 addresses · b924933c
      Jiri Pirko committed
      So far, only hashes made out of ipv6 addresses could be dissected. This
      patch introduces support for dissection of full ipv6 addresses.
      Signed-off-by: Jiri Pirko <jiri@resnulli.us>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b924933c
    • J