1. 20 Dec, 2018 6 commits
  2. 18 Dec, 2018 4 commits
  3. 17 Dec, 2018 2 commits
  4. 16 Dec, 2018 3 commits
  5. 15 Dec, 2018 2 commits
  6. 13 Dec, 2018 3 commits
  7. 11 Dec, 2018 2 commits
  8. 10 Dec, 2018 1 commit
  9. 08 Dec, 2018 4 commits
    • neighbour: Avoid writing before skb->head in neigh_hh_output() · e6ac64d4
      Committed by Stefano Brivio
      While skb_push() makes the kernel panic if the skb headroom is less than
      the unaligned hardware header size, it will proceed normally in case we
      copy more than that because of alignment, and we'll silently corrupt
      adjacent slabs.
      
      In the case fixed by the previous patch,
      "ipv6: Check available headroom in ip6_xmit() even without options", we
      end up in neigh_hh_output() with 14 bytes headroom, 14 bytes hardware
      header and write 16 bytes, starting 2 bytes before the allocated buffer.
      
      Always check we're not writing before skb->head and, if the headroom is
      not enough, warn and drop the packet.
      
      v2:
       - instead of panicking with BUG_ON(), WARN_ON_ONCE() and drop the packet
         (Eric Dumazet)
       - if we avoid the panic, though, we need to explicitly check the headroom
         before the memcpy(), otherwise we'll have corrupted slabs on a running
         kernel, after we warn
       - use __skb_push() instead of skb_push(), as the headroom check is
         already implemented here explicitly (Eric Dumazet)
      Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
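
The headroom check described in this commit can be modelled in user space. The following is a minimal Python sketch, not the kernel code: `hh_data_align` and `push_cached_header` are hypothetical names, the 16-byte alignment constant is assumed to correspond to the kernel's cached-header alignment, and a `bytearray` stands in for the skb buffer, with offset 0 playing the role of `skb->head`.

```python
HH_ALIGN = 16  # assumed alignment unit for cached hardware headers


def hh_data_align(hh_len):
    """Round the hardware header length up to a multiple of HH_ALIGN."""
    return (hh_len + HH_ALIGN - 1) & ~(HH_ALIGN - 1)


def push_cached_header(buf, data_off, header):
    """Copy an alignment-padded header in front of the payload.

    buf      -- bytearray modelling the buffer; index 0 is the buffer start
    data_off -- current payload offset, i.e. the available headroom
    header   -- unpadded hardware header (for example 14 bytes for Ethernet)

    Returns the new payload offset, or None if the packet must be dropped
    because the aligned copy would start before the buffer.
    """
    hh_len = len(header)
    hh_alen = hh_data_align(hh_len)

    # Check against the *aligned* size: this is the amount actually copied.
    if data_off < hh_alen:
        return None  # warn and drop instead of corrupting adjacent memory

    # The cached header is stored right-aligned inside its padded slot.
    padded = bytes(hh_alen - hh_len) + header
    buf[data_off - hh_alen:data_off] = padded

    # Only the unpadded length becomes part of the packet.
    return data_off - hh_len
```

With 14 bytes of headroom and a 14-byte header, the aligned copy needs 16 bytes, so the model drops the packet rather than writing 2 bytes before the start of the buffer.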
    • neighbor: Improve garbage collection · 58956317
      Committed by David Ahern
      The existing garbage collection algorithm has a number of problems:
      
      1. The gc algorithm will not evict PERMANENT entries as those entries
         are managed by userspace, yet the existing algorithm walks the entire
         hash table which means it always considers PERMANENT entries when
         looking for entries to evict. In some use cases (e.g., EVPN) there
         can be tens of thousands of PERMANENT entries leading to wasted
         CPU cycles when gc kicks in. As an example, with 32k permanent
         entries, neigh_alloc has been observed taking more than 4 msec per
         invocation.
      
      2. Currently, when the number of neighbor entries hits gc_thresh2 and
         the last flush for the table was more than 5 seconds ago, gc kicks in and
         walks the entire hash table, evicting *all* entries not in PERMANENT
         or REACHABLE state and not marked as externally learned. There is no
         discriminator on when the neigh entry was created or if it just moved
         from REACHABLE to another NUD_VALID state (e.g., NUD_STALE).
      
         It is possible for entries to be created or for established neighbor
         entries to be moved to STALE (e.g., an external node sends an ARP
         request) right before the 5 second window lapses:
      
              -----|---------x|----------|-----
                  t-5         t         t+5
      
         If that happens those entries are evicted during gc causing unnecessary
         thrashing on neighbor entries and userspace caches trying to track them.
      
         Further, this contradicts the description of gc_thresh2 which says
         "Entries older than 5 seconds will be cleared".
      
         One workaround is to make gc_thresh2 == gc_thresh3 but that negates the
         whole point of having separate thresholds.
      
      3. Clearing *all* neigh entries that are not PERMANENT, REACHABLE, or
         externally learned when gc_thresh2 is exceeded is overkill and
         contributes to thrashing, especially during startup.
      
      This patch addresses these problems as follows:
      
      1. Use of a separate list_head to track entries that can be garbage
         collected along with a separate counter. PERMANENT entries are not
         added to this list.
      
         The gc_thresh parameters are only compared to the new counter, not the
         total entries in the table. The forced_gc function is updated to only
         walk this new gc_list looking for entries to evict.
      
      2. Entries are added at the tail of the list and removed from the
         front.
      
      3. Entries are only evicted if they were last updated more than 5 seconds
         ago, adhering to the original intent of gc_thresh2.
      
      4. Forced gc is stopped once the number of gc_entries drops below
         gc_thresh2.
      
      5. Since gc checks do not apply to PERMANENT entries, gc levels are skipped
         when allocating a new neighbor for a PERMANENT entry. By extension this
         means there are no explicit limits on the number of PERMANENT entries
         that can be created, but this is no different than FIB entries or FDB
         entries.
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
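
The revised eviction policy in this commit can be illustrated with a small user-space model. This is a sketch under stated assumptions, not the kernel implementation: `NeighTableModel`, `add`, and `forced_gc` are hypothetical names, timestamps are plain seconds, and the stop condition follows the commit message wording ("drops below gc_thresh2").

```python
from collections import OrderedDict

GC_MIN_AGE = 5.0  # seconds; entries updated more recently are never evicted


class NeighTableModel:
    """Toy model: PERMANENT entries live outside the gc list."""

    def __init__(self, gc_thresh2):
        self.gc_thresh2 = gc_thresh2
        self.permanent = {}           # never considered by forced gc
        self.gc_list = OrderedDict()  # key -> last update; oldest at front

    def add(self, key, now, permanent=False):
        if permanent:
            self.permanent[key] = now
        else:
            self.gc_list[key] = now   # appended at the tail

    def forced_gc(self, now):
        """Walk the gc list from the front and evict stale entries."""
        evicted = []
        for key, updated in list(self.gc_list.items()):
            # Stop as soon as the collectable count is below the threshold.
            if len(self.gc_list) < self.gc_thresh2:
                break
            # Evict only entries last updated more than 5 seconds ago.
            if now - updated > GC_MIN_AGE:
                del self.gc_list[key]
                evicted.append(key)
        return evicted
```

In this model, an entry created one second before gc runs survives even when the table is over the threshold, and PERMANENT entries add nothing to the cost of the walk.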
    • vxlan: Add vxlan_fdb_clear_offload() · e5ff4b19
      Committed by Petr Machata
      When a driver unoffloads all FDB entries en bloc, it's inefficient to
      send the switchdev notification one by one. Add a helper that walks the
      FDB table, unsetting the offload flag on RDST with a given VNI.
      Signed-off-by: Petr Machata <petrm@mellanox.com>
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • vxlan: Add vxlan_fdb_replay() · 4f89f5b5
      Committed by Petr Machata
      When a VXLAN device becomes relevant to a driver (such as when it is
      attached to an offloaded bridge), the driver will generally need to walk
      the existing FDB entries and offload them.
      
      Add a function vxlan_fdb_replay() to call a given notifier block for
      each FDB entry with a given VNI.
      Signed-off-by: Petr Machata <petrm@mellanox.com>
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
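
The two VXLAN helpers from the commits above share one pattern: walk the FDB once and act on every record that matches a VNI. The sketch below models that pattern in user space. It is an illustration only: `FdbRecord`, `fdb_replay`, and `fdb_clear_offload` are hypothetical names, the FDB is flattened into a plain list, and the notifier is an ordinary Python callable rather than a kernel notifier block.

```python
from dataclasses import dataclass


@dataclass
class FdbRecord:
    mac: str
    vni: int
    remote_ip: str
    offloaded: bool = False


def fdb_replay(fdb, vni, notifier):
    """Call notifier(record) for each record with the given VNI.

    Returns the number of records that were replayed.
    """
    count = 0
    for rec in fdb:
        if rec.vni == vni:
            notifier(rec)
            count += 1
    return count


def fdb_clear_offload(fdb, vni):
    """Unset the offload flag on every record with the given VNI.

    One walk replaces a per-entry notification round trip.
    """
    for rec in fdb:
        if rec.vni == vni:
            rec.offloaded = False
```

A driver model would call `fdb_replay` when the VXLAN device becomes relevant to it, for example with a notifier that marks each record as offloaded, and `fdb_clear_offload` when it unoffloads everything for that VNI at once.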
  10. 07 Dec, 2018 1 commit
  11. 06 Dec, 2018 1 commit
  12. 05 Dec, 2018 1 commit
    • tcp: reduce POLLOUT events caused by TCP_NOTSENT_LOWAT · a74f0fa0
      Committed by Eric Dumazet
      TCP_NOTSENT_LOWAT socket option or sysctl was added in linux-3.12
      as a step to enable bigger tcp sndbuf limits.
      
      It works reasonably well, but the following happens:
      
      Once the limit is reached, the TCP stack generates
      an [E]POLLOUT event for every incoming ACK packet.
      
      This causes a high number of context switches.
      
      This patch implements the strategy David Miller added
      in sock_def_write_space():
      
       - If a TCP socket has a notsent_lowat constraint of X bytes,
         allow sendmsg() to fill up to X bytes, but send [E]POLLOUT
         only if the number of notsent bytes is below X/2
      
      This considerably reduces TCP_NOTSENT_LOWAT overhead,
      while still keeping the pipe full.
      
      Tested:
       100 ms RTT netem testbed between A and B, 100 concurrent TCP_STREAM
      
      A:/# cat /proc/sys/net/ipv4/tcp_wmem
      4096	262144	64000000
      A:/# super_netperf 100 -H B -l 1000 -- -K bbr &
      
      A:/# grep TCP /proc/net/sockstat
      TCP: inuse 203 orphan 0 tw 19 alloc 414 mem 1364904 # This is about 54 MB of memory per flow :/
      
      A:/# vmstat 5 5
      procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
       r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
       0  0      0 256220672  13532 694976    0    0    10     0   28   14  0  1 99  0  0
       2  0      0 256320016  13532 698480    0    0   512     0 715901 5927  0 10 90  0  0
       0  0      0 256197232  13532 700992    0    0   735    13 771161 5849  0 11 89  0  0
       1  0      0 256233824  13532 703320    0    0   512    23 719650 6635  0 11 89  0  0
       2  0      0 256226880  13532 705780    0    0   642     4 775650 6009  0 12 88  0  0
      
      A:/# echo 2097152 >/proc/sys/net/ipv4/tcp_notsent_lowat
      
      A:/# grep TCP /proc/net/sockstat
      TCP: inuse 203 orphan 0 tw 19 alloc 414 mem 86411 # 3.5 MB per flow
      
      A:/# vmstat 5 5  # check that context switches have not inflated too much.
      procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
       r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
       2  0      0 260386512  13592 662148    0    0    10     0   17   14  0  1 99  0  0
       0  0      0 260519680  13592 604184    0    0   512    13 726843 12424  0 10 90  0  0
       1  1      0 260435424  13592 598360    0    0   512    25 764645 12925  0 10 90  0  0
       1  0      0 260855392  13592 578380    0    0   512     7 722943 13624  0 11 88  0  0
       1  0      0 260445008  13592 601176    0    0   614    34 772288 14317  0 10 90  0  0
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
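
The X versus X/2 rule in this commit is a form of hysteresis: writing is allowed up to the full limit, but the wakeup fires only once the backlog has drained to half of it. A minimal sketch of the two predicates follows. The function names are hypothetical and the code is a user-space model of the rule as stated in the commit message, not the kernel source.

```python
def can_queue_more(notsent_bytes, notsent_lowat):
    """sendmsg() side: keep filling until the unsent backlog reaches X."""
    return notsent_bytes < notsent_lowat


def should_signal_pollout(notsent_bytes, notsent_lowat):
    """Wakeup side: signal [E]POLLOUT only when the backlog is below X/2.

    Written as 2 * notsent < X to avoid integer-division rounding.
    """
    return 2 * notsent_bytes < notsent_lowat


def count_wakeups(backlog_after_each_ack, notsent_lowat):
    """Count how many ACKs in a trace would produce a wakeup."""
    return sum(
        1 for notsent in backlog_after_each_ack
        if should_signal_pollout(notsent, notsent_lowat)
    )
```

With the 2097152-byte limit used in the test above, a backlog hovering just under the limit no longer wakes the writer on every ACK. The writer is woken once the backlog falls below 1048576 bytes and can then refill up to the full limit in one go.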
  13. 04 Dec, 2018 4 commits
  14. 01 Dec, 2018 5 commits
  15. 27 Nov, 2018 1 commit