1. 10 2月, 2017 2 次提交
    • J
      openvswitch: Pack struct sw_flow_key. · 316d4d78
      Jarno Rajahalme 提交于
      struct sw_flow_key has two 16-bit holes. Move the most matched
      conntrack match fields there.  In some typical cases this reduces the
      size of the key that needs to be hashed into half and into one cache
      line.
      Signed-off-by: NJarno Rajahalme <jarno@ovn.org>
      Acked-by: NJoe Stringer <joe@ovn.org>
      Acked-by: NPravin B Shelar <pshelar@ovn.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      316d4d78
    • J
      openvswitch: Add original direction conntrack tuple to sw_flow_key. · 9dd7f890
      Jarno Rajahalme 提交于
      Add the fields of the conntrack original direction 5-tuple to struct
      sw_flow_key.  The new fields are initially marked as non-existent, and
      are populated whenever a conntrack action is executed and either finds
      or generates a conntrack entry.  This means that these fields exist
      for all packets that were not rejected by conntrack as untrackable.
      
      The original tuple fields in the sw_flow_key are filled from the
      original direction tuple of the conntrack entry relating to the
      current packet, or from the original direction tuple of the master
      conntrack entry, if the current conntrack entry has a master.
      Generally, expected connections of connections having an assigned
      helper (e.g., FTP), have a master conntrack entry.
      
      The main purpose of the new conntrack original tuple fields is to
      allow matching on them for policy decision purposes, with the premise
      that the admissibility of tracked connections reply packets (as well
      as original direction packets), and both direction packets of any
      related connections may be based on ACL rules applying to the master
      connection's original direction 5-tuple.  This also makes it easier to
      make policy decisions when the actual packet headers might have been
      transformed by NAT, as the original direction 5-tuple represents the
      packet headers before any such transformation.
      
      When using the original direction 5-tuple the admissibility of return
      and/or related packets need not be based on the mere existence of a
      conntrack entry, allowing separation of admission policy from the
      established conntrack state.  While existence of a conntrack entry is
      required for admission of the return or related packets, policy
      changes can render connections that were initially admitted to be
      rejected or dropped afterwards.  If the admission of the return and
      related packets was based on mere conntrack state (e.g., connection
      being in an established state), a policy change that would make the
      connection rejected or dropped would need to find and delete all
      conntrack entries affected by such a change.  When using the original
      direction 5-tuple matching the affected conntrack entries can be
      allowed to time out instead, as the established state of the
      connection would not need to be the basis for packet admission any
      more.
      
      It should be noted that the directionality of related connections may
      be the same or different than that of the master connection, and
      neither the original direction 5-tuple nor the conntrack state bits
      carry this information.  If needed, the directionality of the master
      connection can be stored in master's conntrack mark or labels, which
      are automatically inherited by the expected related connections.
      
      The fact that neither ARP nor ND packets are trackable by conntrack
      allows mutual exclusion between ARP/ND and the new conntrack original
      tuple fields.  Hence, the IP addresses are overlaid in union with ARP
      and ND fields.  This allows the sw_flow_key to not grow much due to
      this patch, but it also means that we must be careful to never use the
      new key fields with ARP or ND packets.  ARP is easy to distinguish and
      keep mutually exclusive based on the ethernet type, but ND being an
      ICMPv6 protocol requires a bit more attention.
      Signed-off-by: NJarno Rajahalme <jarno@ovn.org>
      Acked-by: NJoe Stringer <joe@ovn.org>
      Acked-by: NPravin B Shelar <pshelar@ovn.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9dd7f890
  2. 13 11月, 2016 1 次提交
    • J
      openvswitch: add mac_proto field to the flow key · 329f45bc
      Jiri Benc 提交于
      Use a hole in the structure. We support only Ethernet so far and will add
      a support for L2-less packets shortly. We could use a bool to indicate
      whether the Ethernet header is present or not but the approach with the
      mac_proto field is more generic and occupies the same number of bytes in the
      struct, while allowing later extensibility. It also makes the code in the
      next patches more self explaining.
      
      It would be nice to use ARPHRD_ constants but those are u16 which would be
      waste. Thus define our own constants.
      
      Another upside of this is that we can overload this new field to also denote
      whether the flow key is valid. This has the advantage that on
      refragmentation, we don't have to reparse the packet but can rely on the
      stored eth.type. This is especially important for the next patches in this
      series - instead of adding another branch for L2-less packets before calling
      ovs_fragment, we can just remove all those branches completely.
      Signed-off-by: NJiri Benc <jbenc@redhat.com>
      Acked-by: NPravin B Shelar <pshelar@ovn.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      329f45bc
  3. 19 9月, 2016 1 次提交
  4. 09 9月, 2016 1 次提交
  5. 19 3月, 2016 1 次提交
  6. 07 10月, 2015 1 次提交
  7. 05 10月, 2015 1 次提交
  8. 28 8月, 2015 4 次提交
    • J
      openvswitch: Allow matching on conntrack label · c2ac6673
      Joe Stringer 提交于
      Allow matching and setting the ct_label field. As with ct_mark, this is
      populated by executing the CT action. The label field may be modified by
      specifying a label and mask nested under the CT action. It is stored as
      metadata attached to the connection. Label modification occurs after
      lookup, and will only persist when the conntrack entry is committed by
      providing the COMMIT flag to the CT action. Labels are currently fixed
      to 128 bits in size.
      Signed-off-by: NJoe Stringer <joestringer@nicira.com>
      Acked-by: NThomas Graf <tgraf@suug.ch>
      Acked-by: NPravin B Shelar <pshelar@nicira.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c2ac6673
    • J
      openvswitch: Allow matching on conntrack mark · 182e3042
      Joe Stringer 提交于
      Allow matching and setting the ct_mark field. As with ct_state and
      ct_zone, these fields are populated when the CT action is executed. To
      write to this field, a value and mask can be specified as a nested
      attribute under the CT action. This data is stored with the conntrack
      entry, and is executed after the lookup occurs for the CT action. The
      conntrack entry itself must be committed using the COMMIT flag in the CT
      action flags for this change to persist.
      Signed-off-by: NJustin Pettit <jpettit@nicira.com>
      Signed-off-by: NJoe Stringer <joestringer@nicira.com>
      Acked-by: NThomas Graf <tgraf@suug.ch>
      Acked-by: NPravin B Shelar <pshelar@nicira.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      182e3042
    • J
      openvswitch: Add conntrack action · 7f8a436e
      Joe Stringer 提交于
      Expose the kernel connection tracker via OVS. Userspace components can
      make use of the CT action to populate the connection state (ct_state)
      field for a flow. This state can be subsequently matched.
      
      Exposed connection states are OVS_CS_F_*:
      - NEW (0x01) - Beginning of a new connection.
      - ESTABLISHED (0x02) - Part of an existing connection.
      - RELATED (0x04) - Related to an established connection.
      - INVALID (0x20) - Could not track the connection for this packet.
      - REPLY_DIR (0x40) - This packet is in the reply direction for the flow.
      - TRACKED (0x80) - This packet has been sent through conntrack.
      
      When the CT action is executed by itself, it will send the packet
      through the connection tracker and populate the ct_state field with one
      or more of the connection state flags above. The CT action will always
      set the TRACKED bit.
      
      When the COMMIT flag is passed to the conntrack action, this specifies
      that information about the connection should be stored. This allows
      subsequent packets for the same (or related) connections to be
      correlated with this connection. Sending subsequent packets for the
      connection through conntrack allows the connection tracker to consider
      the packets as ESTABLISHED, RELATED, and/or REPLY_DIR.
      
      The CT action may optionally take a zone to track the flow within. This
      allows connections with the same 5-tuple to be kept logically separate
      from connections in other zones. If the zone is specified, then the
      "ct_zone" match field will be subsequently populated with the zone id.
      
      IP fragments are handled by transparently assembling them as part of the
      CT action. The maximum received unit (MRU) size is tracked so that
      refragmentation can occur during output.
      
      IP frag handling contributed by Andy Zhou.
      
      Based on original design by Justin Pettit.
      Signed-off-by: NJoe Stringer <joestringer@nicira.com>
      Signed-off-by: NJustin Pettit <jpettit@nicira.com>
      Signed-off-by: NAndy Zhou <azhou@nicira.com>
      Acked-by: NThomas Graf <tgraf@suug.ch>
      Acked-by: NPravin B Shelar <pshelar@nicira.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7f8a436e
    • J
      openvswitch: Serialize acts with original netlink len · 8e2fed1c
      Joe Stringer 提交于
      Previously, we used the kernel-internal netlink actions length to
      calculate the size of messages to serialize back to userspace.
      However,the sw_flow_actions may not be formatted exactly the same as the
      actions on the wire, so store the original actions length when
      de-serializing and re-use the original length when serializing.
      Signed-off-by: NJoe Stringer <joestringer@nicira.com>
      Acked-by: NPravin B Shelar <pshelar@nicira.com>
      Acked-by: NThomas Graf <tgraf@suug.ch>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8e2fed1c
  9. 22 7月, 2015 2 次提交
  10. 27 1月, 2015 1 次提交
    • J
      openvswitch: Add support for unique flow IDs. · 74ed7ab9
      Joe Stringer 提交于
      Previously, flows were manipulated by userspace specifying a full,
      unmasked flow key. This adds significant burden onto flow
      serialization/deserialization, particularly when dumping flows.
      
      This patch adds an alternative way to refer to flows using a
      variable-length "unique flow identifier" (UFID). At flow setup time,
      userspace may specify a UFID for a flow, which is stored with the flow
      and inserted into a separate table for lookup, in addition to the
      standard flow table. Flows created using a UFID must be fetched or
      deleted using the UFID.
      
      All flow dump operations may now be made more terse with OVS_UFID_F_*
      flags. For example, the OVS_UFID_F_OMIT_KEY flag allows responses to
      omit the flow key from a datapath operation if the flow has a
      corresponding UFID. This significantly reduces the time spent assembling
      and transacting netlink messages. With all OVS_UFID_F_OMIT_* flags
      enabled, the datapath only returns the UFID and statistics for each flow
      during flow dump, increasing ovs-vswitchd revalidator performance by 40%
      or more.
      Signed-off-by: NJoe Stringer <joestringer@nicira.com>
      Acked-by: NPravin B Shelar <pshelar@nicira.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      74ed7ab9
  11. 15 1月, 2015 1 次提交
  12. 10 11月, 2014 3 次提交
  13. 06 11月, 2014 1 次提交
  14. 06 10月, 2014 2 次提交
  15. 16 9月, 2014 3 次提交
  16. 30 6月, 2014 1 次提交
  17. 23 5月, 2014 2 次提交
    • J
      openvswitch: Fix ovs_flow_stats_get/clear RCU dereference. · 86ec8dba
      Jarno Rajahalme 提交于
      For ovs_flow_stats_get() using ovsl_dereference() was wrong, since
      flow dumps call this with RCU read lock.
      
      ovs_flow_stats_clear() is always called with ovs_mutex, so can use
      ovsl_dereference().
      
      Also, make the ovs_flow_stats_get() 'flow' argument const to make
      later patches cleaner.
      Signed-off-by: NJarno Rajahalme <jrajahalme@nicira.com>
      Signed-off-by: NPravin B Shelar <pshelar@nicira.com>
      86ec8dba
    • J
      openvswitch: Compact sw_flow_key. · 1139e241
      Jarno Rajahalme 提交于
      Minimize padding in sw_flow_key and move 'tp' top the main struct.
      These changes simplify code when accessing the transport port numbers
      and the tcp flags, and makes the sw_flow_key 8 bytes smaller on 64-bit
      systems (128->120 bytes).  These changes also make the keys for IPv4
      packets to fit in one cache line.
      
      There is a valid concern for safety of packing the struct
      ovs_key_ipv4_tunnel, as it would be possible to take the address of
      the tun_id member as a __be64 * which could result in unaligned access
      in some systems. However:
      
      - sw_flow_key itself is 64-bit aligned, so the tun_id within is
        always
        64-bit aligned.
      - We never make arrays of ovs_key_ipv4_tunnel (which would force
        every
        second tun_key to be misaligned).
      - We never take the address of the tun_id in to a __be64 *.
      - Whereever we use struct ovs_key_ipv4_tunnel outside the
        sw_flow_key,
        it is in stack (on tunnel input functions), where compiler has full
        control of the alignment.
      Signed-off-by: NJarno Rajahalme <jrajahalme@nicira.com>
      Signed-off-by: NPravin B Shelar <pshelar@nicira.com>
      1139e241
  18. 17 5月, 2014 2 次提交
    • J
      openvswitch: Per NUMA node flow stats. · 63e7959c
      Jarno Rajahalme 提交于
      Keep kernel flow stats for each NUMA node rather than each (logical)
      CPU.  This avoids using the per-CPU allocator and removes most of the
      kernel-side OVS locking overhead otherwise on the top of perf reports
      and allows OVS to scale better with higher number of threads.
      
      With 9 handlers and 4 revalidators netperf TCP_CRR test flow setup
      rate doubles on a server with two hyper-threaded physical CPUs (16
      logical cores each) compared to the current OVS master.  Tested with
      non-trivial flow table with a TCP port match rule forcing all new
      connections with unique port numbers to OVS userspace.  The IP
      addresses are still wildcarded, so the kernel flows are not considered
      as exact match 5-tuple flows.  This type of flows can be expected to
      appear in large numbers as the result of more effective wildcarding
      made possible by improvements in OVS userspace flow classifier.
      
      Perf results for this test (master):
      
      Events: 305K cycles
      +   8.43%     ovs-vswitchd  [kernel.kallsyms]   [k] mutex_spin_on_owner
      +   5.64%     ovs-vswitchd  [kernel.kallsyms]   [k] __ticket_spin_lock
      +   4.75%     ovs-vswitchd  ovs-vswitchd        [.] find_match_wc
      +   3.32%     ovs-vswitchd  libpthread-2.15.so  [.] pthread_mutex_lock
      +   2.61%     ovs-vswitchd  [kernel.kallsyms]   [k] pcpu_alloc_area
      +   2.19%     ovs-vswitchd  ovs-vswitchd        [.] flow_hash_in_minimask_range
      +   2.03%          swapper  [kernel.kallsyms]   [k] intel_idle
      +   1.84%     ovs-vswitchd  libpthread-2.15.so  [.] pthread_mutex_unlock
      +   1.64%     ovs-vswitchd  ovs-vswitchd        [.] classifier_lookup
      +   1.58%     ovs-vswitchd  libc-2.15.so        [.] 0x7f4e6
      +   1.07%     ovs-vswitchd  [kernel.kallsyms]   [k] memset
      +   1.03%          netperf  [kernel.kallsyms]   [k] __ticket_spin_lock
      +   0.92%          swapper  [kernel.kallsyms]   [k] __ticket_spin_lock
      ...
      
      And after this patch:
      
      Events: 356K cycles
      +   6.85%     ovs-vswitchd  ovs-vswitchd        [.] find_match_wc
      +   4.63%     ovs-vswitchd  libpthread-2.15.so  [.] pthread_mutex_lock
      +   3.06%     ovs-vswitchd  [kernel.kallsyms]   [k] __ticket_spin_lock
      +   2.81%     ovs-vswitchd  ovs-vswitchd        [.] flow_hash_in_minimask_range
      +   2.51%     ovs-vswitchd  libpthread-2.15.so  [.] pthread_mutex_unlock
      +   2.27%     ovs-vswitchd  ovs-vswitchd        [.] classifier_lookup
      +   1.84%     ovs-vswitchd  libc-2.15.so        [.] 0x15d30f
      +   1.74%     ovs-vswitchd  [kernel.kallsyms]   [k] mutex_spin_on_owner
      +   1.47%          swapper  [kernel.kallsyms]   [k] intel_idle
      +   1.34%     ovs-vswitchd  ovs-vswitchd        [.] flow_hash_in_minimask
      +   1.33%     ovs-vswitchd  ovs-vswitchd        [.] rule_actions_unref
      +   1.16%     ovs-vswitchd  ovs-vswitchd        [.] hindex_node_with_hash
      +   1.16%     ovs-vswitchd  ovs-vswitchd        [.] do_xlate_actions
      +   1.09%     ovs-vswitchd  ovs-vswitchd        [.] ofproto_rule_ref
      +   1.01%          netperf  [kernel.kallsyms]   [k] __ticket_spin_lock
      ...
      
      There is a small increase in kernel spinlock overhead due to the same
      spinlock being shared between multiple cores of the same physical CPU,
      but that is barely visible in the netperf TCP_CRR test performance
      (maybe ~1% performance drop, hard to tell exactly due to variance in
      the test results), when testing for kernel module throughput (with no
      userspace activity, handful of kernel flows).
      
      On flow setup, a single stats instance is allocated (for the NUMA node
      0).  As CPUs from multiple NUMA nodes start updating stats, new
      NUMA-node specific stats instances are allocated.  This allocation on
      the packet processing code path is made to never block or look for
      emergency memory pools, minimizing the allocation latency.  If the
      allocation fails, the existing preallocated stats instance is used.
      Also, if only CPUs from one NUMA-node are updating the preallocated
      stats instance, no additional stats instances are allocated.  This
      eliminates the need to pre-allocate stats instances that will not be
      used, also relieving the stats reader from the burden of reading stats
      that are never used.
      Signed-off-by: NJarno Rajahalme <jrajahalme@nicira.com>
      Acked-by: NPravin B Shelar <pshelar@nicira.com>
      Signed-off-by: NJesse Gross <jesse@nicira.com>
      63e7959c
    • J
      openvswitch: Remove 5-tuple optimization. · 23dabf88
      Jarno Rajahalme 提交于
      The 5-tuple optimization becomes unnecessary with a later per-NUMA
      node stats patch.  Remove it first to make the changes easier to
      grasp.
      Signed-off-by: NJarno Rajahalme <jrajahalme@nicira.com>
      Signed-off-by: NJesse Gross <jesse@nicira.com>
      23dabf88
  19. 07 1月, 2014 2 次提交
  20. 02 11月, 2013 2 次提交
    • J
      openvswitch: TCP flags matching support. · 5eb26b15
      Jarno Rajahalme 提交于
          tcp_flags=flags/mask
              Bitwise  match on TCP flags.  The flags and mask are 16-bit num‐
              bers written in decimal or in hexadecimal prefixed by 0x.   Each
              1-bit  in  mask requires that the corresponding bit in port must
              match.  Each 0-bit in mask causes the corresponding  bit  to  be
              ignored.
      
              TCP  protocol  currently  defines  9 flag bits, and additional 3
              bits are reserved (must be transmitted as zero), see  RFCs  793,
              3168, and 3540.  The flag bits are, numbering from the least
              significant bit:
      
              0: FIN No more data from sender.
      
              1: SYN Synchronize sequence numbers.
      
              2: RST Reset the connection.
      
              3: PSH Push function.
      
              4: ACK Acknowledgement field significant.
      
              5: URG Urgent pointer field significant.
      
              6: ECE ECN Echo.
      
              7: CWR Congestion Windows Reduced.
      
              8: NS  Nonce Sum.
      
              9-11:  Reserved.
      
              12-15: Not matchable, must be zero.
      Signed-off-by: NJarno Rajahalme <jrajahalme@nicira.com>
      Signed-off-by: NJesse Gross <jesse@nicira.com>
      5eb26b15
    • J
      openvswitch: Widen TCP flags handling. · df23e9f6
      Jarno Rajahalme 提交于
      Widen TCP flags handling from 7 bits (uint8_t) to 12 bits (uint16_t).
      The kernel interface remains at 8 bits, which makes no functional
      difference now, as none of the higher bits is currently of interest
      to the userspace.
      Signed-off-by: NJarno Rajahalme <jrajahalme@nicira.com>
      Signed-off-by: NJesse Gross <jesse@nicira.com>
      df23e9f6
  21. 04 10月, 2013 1 次提交
  22. 06 9月, 2013 1 次提交
  23. 28 8月, 2013 1 次提交
    • A
      openvswitch: optimize flow compare and mask functions · 5828cd9a
      Andy Zhou 提交于
      Make sure the sw_flow_key structure and valid mask boundaries are always
      machine word aligned. Optimize the flow compare and mask operations
      using machine word size operations. This patch improves throughput on
      average by 15% when CPU is the bottleneck of forwarding packets.
      
      This patch is inspired by ideas and code from a patch submitted by Peter
      Klausler titled "replace memcmp() with specialized comparator".
      However, The original patch only optimizes for architectures
      support unaligned machine word access. This patch optimizes for all
      architectures.
      Signed-off-by: NAndy Zhou <azhou@nicira.com>
      Signed-off-by: NJesse Gross <jesse@nicira.com>
      5828cd9a
  24. 27 8月, 2013 2 次提交
  25. 24 8月, 2013 1 次提交
    • A
      openvswitch: Mega flow implementation · 03f0d916
      Andy Zhou 提交于
      Add wildcarded flow support in kernel datapath.
      
      Wildcarded flow can improve OVS flow set up performance by avoid sending
      matching new flows to the user space program. The exact performance boost
      will largely dependent on wildcarded flow hit rate.
      
      In case all new flows hits wildcard flows, the flow set up rate is
      within 5% of that of linux bridge module.
      
      Pravin has made significant contributions to this patch. Including API
      clean ups and bug fixes.
      Signed-off-by: NPravin B Shelar <pshelar@nicira.com>
      Signed-off-by: NAndy Zhou <azhou@nicira.com>
      Signed-off-by: NJesse Gross <jesse@nicira.com>
      03f0d916