1. 11 Dec 2014, 1 commit
    • net: replace remaining users of arch_fast_hash with jhash · 87545899
      Daniel Borkmann authored
      This patch effectively reverts commit 500f8087 ("net: ovs: use CRC32
      accelerated flow hash if available") and also converts the other
      remaining arch_fast_hash() users, such as nfsd via commit 6282cd56
      ("NFSD: Don't hand out delegations for 30 seconds after recalling
      them."), where it has been used as a hash function for Bloom filtering.
      
      While we think that these users are actually not much of a concern, it
      has been requested to remove the arch_fast_hash() library bits that
      arose from [1] entirely, as per the recent discussion in [2]. The main
      argument is that using it as a hash may introduce bias due to its
      linearity (see the avalanche criterion), which makes it less clear
      (though we tried to document that) when this security/performance
      trade-off is actually acceptable for a general-purpose library function.
      
      Let's therefore avoid any further confusion on this matter and remove it
      to prevent any future accidental misuse. For the time being, this makes
      hashing of flow keys a bit more expensive in the ovs case, but future
      work could reevaluate a different hashing discipline; a sketch of the
      resulting jhash2()-based call site follows this commit entry.
      
        [1] https://patchwork.ozlabs.org/patch/299369/
        [2] https://patchwork.ozlabs.org/patch/418756/
      
      Cc: Neil Brown <neilb@suse.de>
      Cc: Francesco Fusco <fusco@ntop.org>
      Cc: Jesse Gross <jesse@nicira.com>
      Cc: Thomas Graf <tgraf@suug.ch>
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
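
      For illustration, a minimal sketch of the kind of call site this change
      leaves behind, assuming a simplified flow key. The struct and function
      names are hypothetical, not the actual OVS ones:

        #include <linux/jhash.h>
        #include <linux/kernel.h>

        /* Hypothetical flow-key container; the real OVS sw_flow_key differs. */
        struct demo_flow_key {
                u32 data[8];    /* key material, u32-aligned as jhash2() expects */
        };

        static u32 demo_flow_hash(const struct demo_flow_key *key, u32 seed)
        {
                /*
                 * This used to be arch_fast_hash2(), which could map to a
                 * CRC32-based hash.  CRC is linear over GF(2) and thus lacks
                 * the avalanche behavior expected of a general-purpose hash;
                 * jhash2() spends a few more cycles for better mixing.
                 */
                return jhash2(key->data, ARRAY_SIZE(key->data), seed);
        }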
  2. 10 Nov 2014, 1 commit
  3. 06 Nov 2014, 1 commit
  4. 01 Jul 2014, 1 commit
  5. 23 May 2014, 2 commits
  6. 17 May 2014, 3 commits
    • openvswitch: Per NUMA node flow stats. · 63e7959c
      Jarno Rajahalme authored
      Keep kernel flow stats for each NUMA node rather than for each
      (logical) CPU.  This avoids using the per-CPU allocator, removes most
      of the kernel-side OVS locking overhead otherwise at the top of perf
      reports, and allows OVS to scale better with a higher number of
      threads.
      
      With 9 handlers and 4 revalidators, the netperf TCP_CRR flow setup
      rate doubles on a server with two hyper-threaded physical CPUs (16
      logical cores each) compared to the current OVS master.  Tested with a
      non-trivial flow table with a TCP port match rule forcing all new
      connections with unique port numbers to OVS userspace.  The IP
      addresses are still wildcarded, so the kernel flows are not considered
      exact-match 5-tuple flows.  Flows of this type can be expected to
      appear in large numbers as the result of the more effective wildcarding
      made possible by improvements in the OVS userspace flow classifier.
      
      Perf results for this test (master):
      
      Events: 305K cycles
      +   8.43%     ovs-vswitchd  [kernel.kallsyms]   [k] mutex_spin_on_owner
      +   5.64%     ovs-vswitchd  [kernel.kallsyms]   [k] __ticket_spin_lock
      +   4.75%     ovs-vswitchd  ovs-vswitchd        [.] find_match_wc
      +   3.32%     ovs-vswitchd  libpthread-2.15.so  [.] pthread_mutex_lock
      +   2.61%     ovs-vswitchd  [kernel.kallsyms]   [k] pcpu_alloc_area
      +   2.19%     ovs-vswitchd  ovs-vswitchd        [.] flow_hash_in_minimask_range
      +   2.03%          swapper  [kernel.kallsyms]   [k] intel_idle
      +   1.84%     ovs-vswitchd  libpthread-2.15.so  [.] pthread_mutex_unlock
      +   1.64%     ovs-vswitchd  ovs-vswitchd        [.] classifier_lookup
      +   1.58%     ovs-vswitchd  libc-2.15.so        [.] 0x7f4e6
      +   1.07%     ovs-vswitchd  [kernel.kallsyms]   [k] memset
      +   1.03%          netperf  [kernel.kallsyms]   [k] __ticket_spin_lock
      +   0.92%          swapper  [kernel.kallsyms]   [k] __ticket_spin_lock
      ...
      
      And after this patch:
      
      Events: 356K cycles
      +   6.85%     ovs-vswitchd  ovs-vswitchd        [.] find_match_wc
      +   4.63%     ovs-vswitchd  libpthread-2.15.so  [.] pthread_mutex_lock
      +   3.06%     ovs-vswitchd  [kernel.kallsyms]   [k] __ticket_spin_lock
      +   2.81%     ovs-vswitchd  ovs-vswitchd        [.] flow_hash_in_minimask_range
      +   2.51%     ovs-vswitchd  libpthread-2.15.so  [.] pthread_mutex_unlock
      +   2.27%     ovs-vswitchd  ovs-vswitchd        [.] classifier_lookup
      +   1.84%     ovs-vswitchd  libc-2.15.so        [.] 0x15d30f
      +   1.74%     ovs-vswitchd  [kernel.kallsyms]   [k] mutex_spin_on_owner
      +   1.47%          swapper  [kernel.kallsyms]   [k] intel_idle
      +   1.34%     ovs-vswitchd  ovs-vswitchd        [.] flow_hash_in_minimask
      +   1.33%     ovs-vswitchd  ovs-vswitchd        [.] rule_actions_unref
      +   1.16%     ovs-vswitchd  ovs-vswitchd        [.] hindex_node_with_hash
      +   1.16%     ovs-vswitchd  ovs-vswitchd        [.] do_xlate_actions
      +   1.09%     ovs-vswitchd  ovs-vswitchd        [.] ofproto_rule_ref
      +   1.01%          netperf  [kernel.kallsyms]   [k] __ticket_spin_lock
      ...
      
      There is a small increase in kernel spinlock overhead due to the same
      spinlock being shared between multiple cores of the same physical CPU,
      but that is barely visible in the netperf TCP_CRR test performance
      (maybe a ~1% drop, hard to tell exactly due to variance in the test
      results) when testing kernel module throughput (no userspace activity,
      a handful of kernel flows).
      
      On flow setup, a single stats instance is allocated (for NUMA node 0).
      As CPUs from multiple NUMA nodes start updating stats, new
      NUMA-node-specific stats instances are allocated.  This allocation on
      the packet processing code path never blocks or looks for emergency
      memory pools, minimizing the allocation latency.  If the allocation
      fails, the existing preallocated stats instance is used.  Also, if only
      CPUs from one NUMA node are updating the preallocated stats instance,
      no additional stats instances are allocated.  This eliminates the need
      to pre-allocate stats instances that will not be used, and also
      relieves the stats reader from the burden of reading stats that are
      never used.  A sketch of this allocation scheme follows this commit
      entry.
      Signed-off-by: Jarno Rajahalme <jrajahalme@nicira.com>
      Acked-by: Pravin B Shelar <pshelar@nicira.com>
      Signed-off-by: Jesse Gross <jesse@nicira.com>
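
      For illustration, a rough sketch of the allocation discipline described
      above. All names are hypothetical, and the RCU publishing and
      concurrent-allocation handling of the real code are only hinted at in
      comments:

        #include <linux/gfp.h>
        #include <linux/numa.h>
        #include <linux/slab.h>
        #include <linux/spinlock.h>
        #include <linux/topology.h>

        /* Hypothetical per-node stats slot; the real OVS struct differs. */
        struct demo_flow_stats {
                spinlock_t lock;
                u64 packet_count;
                u64 byte_count;
        };

        struct demo_flow {
                /* One slot per possible NUMA node; slot 0 is preallocated
                 * at flow setup time, all others start out NULL. */
                struct demo_flow_stats *stats[MAX_NUMNODES];
        };

        static void demo_flow_stats_update(struct demo_flow *flow,
                                           unsigned int len)
        {
                int node = numa_node_id();
                struct demo_flow_stats *stats = flow->stats[node];

                if (unlikely(!stats)) {
                        /*
                         * Allocation on the packet path: GFP_NOWAIT never
                         * blocks, and __GFP_NOMEMALLOC stays away from the
                         * emergency reserves, keeping the latency bounded.
                         */
                        stats = kmalloc_node(sizeof(*stats),
                                             GFP_NOWAIT | __GFP_NOMEMALLOC,
                                             node);
                        if (stats) {
                                spin_lock_init(&stats->lock);
                                stats->packet_count = 0;
                                stats->byte_count = 0;
                                /* The real code publishes this via RCU and
                                 * copes with two CPUs racing to allocate. */
                                flow->stats[node] = stats;
                        } else {
                                /* Fall back to the preallocated slot 0. */
                                stats = flow->stats[0];
                        }
                }

                spin_lock(&stats->lock);
                stats->packet_count++;
                stats->byte_count += len;
                spin_unlock(&stats->lock);
        }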
    • openvswitch: Remove 5-tuple optimization. · 23dabf88
      Jarno Rajahalme authored
      The 5-tuple optimization becomes unnecessary with the later
      per-NUMA-node stats patch.  Remove it first to make those changes
      easier to grasp.
      Signed-off-by: Jarno Rajahalme <jrajahalme@nicira.com>
      Signed-off-by: Jesse Gross <jesse@nicira.com>
    • openvswitch: use const in some local vars and casts · 7085130b
      Daniele Di Proietto authored
      In a few functions, const formal parameters are assigned or cast to
      non-const.  These changes suppress the warnings emitted when compiling
      with -Wcast-qual; an illustrative example follows this commit entry.
      Signed-off-by: Daniele Di Proietto <daniele.di.proietto@gmail.com>
      Signed-off-by: Jesse Gross <jesse@nicira.com>
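
      For illustration, the general shape of such a fix; the functions below
      are hypothetical and not taken from the patch itself:

        /* Before: the cast silently drops the qualifier, and building
         * with gcc -Wcast-qual warns about it. */
        static u32 key_word_bad(const struct sw_flow_key *key, int i)
        {
                u32 *data = (u32 *)key; /* warning: cast discards 'const' */

                return data[i];
        }

        /* After: the qualifier is carried through the cast, so the
         * compiler stays quiet and the constness contract is kept. */
        static u32 key_word(const struct sw_flow_key *key, int i)
        {
                const u32 *data = (const u32 *)key;

                return data[i];
        }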
  7. 05 Feb 2014, 2 commits
  8. 10 Jan 2014, 1 commit
  9. 07 Jan 2014, 6 commits
  10. 18 Dec 2013, 1 commit
    • net: ovs: use CRC32 accelerated flow hash if available · 500f8087
      Francesco Fusco authored
      Currently OVS uses jhash2() for calculating flow hashes in its
      internal flow_hash() function. The performance of the flow_hash()
      function is critical, as the input data can be hundreds of bytes
      long.
      
      OVS is largely deployed in x86_64-based datacenters.  Therefore, we
      argue that the performance-critical fast path of OVS should exploit
      underlying CPU features in order to reduce the per-packet processing
      costs.  We replace jhash2 with the hash implementation provided by the
      kernel hash lib, which exploits the crc32l instruction to achieve high
      performance.
      
      Our patch greatly reduces the hash footprint from ~200 cycles of
      jhash2() to around ~90 cycles in the case of ovs_flow_hash_crc()
      (measured with rdtsc over maximum-length flow keys on an Intel i7
      CPU).
      
      Additionally, we wrote a microbenchmark to stress the flow table
      performance.  The benchmark inserts random flows into the flow hash
      and then performs lookups.  Deployed on a CRC32-capable CPU, our hash
      reduces the lookup time for 1000 flows and 100 masks from ~10,100us to
      ~6,700us, for example.
      
      Thus, simply use the newly introduced arch_fast_hash2() as a drop-in
      replacement; a sketch of the idea follows this commit entry.
      Signed-off-by: Francesco Fusco <ffusco@redhat.com>
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: Thomas Graf <tgraf@redhat.com>
      Acked-by: Jesse Gross <jesse@nicira.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
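
      For illustration, a userspace sketch of the same technique; this is
      not the kernel's arch_fast_hash2() implementation, just the idea of
      folding a u32-word key through the SSE 4.2 crc32 instruction:

        #include <stddef.h>
        #include <stdint.h>

        #ifdef __SSE4_2__
        #include <nmmintrin.h>          /* _mm_crc32_u32() */

        /* Hash a flow key, viewed as u32 words, with the hardware crc32
         * instruction; non-SSE 4.2 builds would fall back to jhash2(). */
        static uint32_t flow_hash_crc32(const uint32_t *key, size_t words,
                                        uint32_t seed)
        {
                uint32_t crc = seed;
                size_t i;

                for (i = 0; i < words; i++)
                        crc = _mm_crc32_u32(crc, key[i]);
                return crc;
        }
        #endif /* __SSE4_2__ */

      Roughly one crc32 instruction per 32-bit word is what the cycle counts
      quoted above reflect. The flip side is the linearity that commit
      87545899 at the top of this log later objects to: for equal-length
      inputs, the hash of (a ^ b) is predictable from the hashes of a and b,
      something a well-mixed hash such as jhash2() avoids.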
  11. 02 Nov 2013, 1 commit
  12. 23 Oct 2013, 1 commit
  13. 04 Oct 2013, 3 commits