1. 24 11月, 2017 1 次提交
    • W
      net: accept UFO datagrams from tuntap and packet · 0c19f846
      Willem de Bruijn 提交于
      Tuntap and similar devices can inject GSO packets. Accept type
      VIRTIO_NET_HDR_GSO_UDP, even though not generating UFO natively.
      
      Processes are expected to use feature negotiation such as TUNSETOFFLOAD
      to detect supported offload types and refrain from injecting other
      packets. This process breaks down with live migration: guest kernels
      do not renegotiate flags, so destination hosts need to expose all
      features that the source host does.
      
      Partially revert the UFO removal from 182e0b6b~1..d9d30adf.
      This patch introduces nearly(*) no new code to simplify verification.
      It brings back verbatim tuntap UFO negotiation, VIRTIO_NET_HDR_GSO_UDP
      insertion and software UFO segmentation.
      
      It does not reinstate protocol stack support, hardware offload
      (NETIF_F_UFO), SKB_GSO_UDP tunneling in SKB_GSO_SOFTWARE or reception
      of VIRTIO_NET_HDR_GSO_UDP packets in tuntap.
      
      To support SKB_GSO_UDP reappearing in the stack, also reinstate
      logic in act_csum and openvswitch. Achieve equivalence with v4.13 HEAD
      by squashing in commit 93991221 ("net: skb_needs_check() removes
      CHECKSUM_UNNECESSARY check for tx.") and reverting commit 8d63bee6
      ("net: avoid skb_warn_bad_offload false positives on UFO").
      
      (*) To avoid having to bring back skb_shinfo(skb)->ip6_frag_id,
      ipv6_proxy_select_ident is changed to return a __be32 and this is
      assigned directly to the frag_hdr. Also, SKB_GSO_UDP is inserted
      at the end of the enum to minimize code churn.
      
      Tested
        Booted a v4.13 guest kernel with QEMU. On a host kernel before this
        patch `ethtool -k eth0` shows UFO disabled. After the patch, it is
        enabled, same as on a v4.13 host kernel.
      
        A UFO packet sent from the guest appears on the tap device:
          host:
            nc -l -p -u 8000 &
            tcpdump -n -i tap0
      
          guest:
            dd if=/dev/zero of=payload.txt bs=1 count=2000
            nc -u 192.16.1.1 8000 < payload.txt
      
        Direct tap to tap transmission of VIRTIO_NET_HDR_GSO_UDP succeeds,
        packets arriving fragmented:
      
          ./with_tap_pair.sh ./tap_send_ufo tap0 tap1
          (from https://github.com/wdebruij/kerneltools/tree/master/tests)
      
      Changes
        v1 -> v2
          - simplified set_offload change (review comment)
          - documented test procedure
      
      Link: http://lkml.kernel.org/r/<CAF=yD-LuUeDuL9YWPJD9ykOZ0QCjNeznPDr6whqZ9NGMNF12Mw@mail.gmail.com>
      Fixes: fb652fdf ("macvlan/macvtap: Remove NETIF_F_UFO advertisement.")
      Reported-by: NMichal Kubecek <mkubecek@suse.cz>
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      Acked-by: NJason Wang <jasowang@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0c19f846
  2. 08 11月, 2017 1 次提交
    • Y
      openvswitch: enable NSH support · b2d0f5d5
      Yi Yang 提交于
      v16->17
       - Fixed disputed check code: keep them in nsh_push and nsh_pop
         but also add them in __ovs_nla_copy_actions
      
      v15->v16
       - Add csum recalculation for nsh_push, nsh_pop and set_nsh
         pointed out by Pravin
       - Move nsh key into the union with ipv4 and ipv6 and add
         check for nsh key in match_validate pointed out by Pravin
       - Add nsh check in validate_set and __ovs_nla_copy_actions
      
      v14->v15
       - Check size in nsh_hdr_from_nlattr
       - Fixed four small issues pointed out By Jiri and Eric
      
      v13->v14
       - Rename skb_push_nsh to nsh_push per Dave's comment
       - Rename skb_pop_nsh to nsh_pop per Dave's comment
      
      v12->v13
       - Fix NSH header length check in set_nsh
      
      v11->v12
       - Fix missing changes old comments pointed out
       - Fix new comments for v11
      
      v10->v11
       - Fix the left three disputable comments for v9
         but not fixed in v10.
      
      v9->v10
       - Change struct ovs_key_nsh to
             struct ovs_nsh_key_base base;
             __be32 context[NSH_MD1_CONTEXT_SIZE];
       - Fix new comments for v9
      
      v8->v9
       - Fix build error reported by daily intel build
         because nsh module isn't selected by openvswitch
      
      v7->v8
       - Rework nested value and mask for OVS_KEY_ATTR_NSH
       - Change pop_nsh to adapt to nsh kernel module
       - Fix many issues per comments from Jiri Benc
      
      v6->v7
       - Remove NSH GSO patches in v6 because Jiri Benc
         reworked it as another patch series and they have
         been merged.
       - Change it to adapt to nsh kernel module added by NSH
         GSO patch series
      
      v5->v6
       - Fix the rest comments for v4.
       - Add NSH GSO support for VxLAN-gpe + NSH and
         Eth + NSH.
      
      v4->v5
       - Fix many comments by Jiri Benc and Eric Garver
         for v4.
      
      v3->v4
       - Add new NSH match field ttl
       - Update NSH header to the latest format
         which will be final format and won't change
         per its author's confirmation.
       - Fix comments for v3.
      
      v2->v3
       - Change OVS_KEY_ATTR_NSH to nested key to handle
         length-fixed attributes and length-variable
         attriubte more flexibly.
       - Remove struct ovs_action_push_nsh completely
       - Add code to handle nested attribute for SET_MASKED
       - Change PUSH_NSH to use the nested OVS_KEY_ATTR_NSH
         to transfer NSH header data.
       - Fix comments and coding style issues by Jiri and Eric
      
      v1->v2
       - Change encap_nsh and decap_nsh to push_nsh and pop_nsh
       - Dynamically allocate struct ovs_action_push_nsh for
         length-variable metadata.
      
      OVS master and 2.8 branch has merged NSH userspace
      patch series, this patch is to enable NSH support
      in kernel data path in order that OVS can support
      NSH in compat mode by porting this.
      Signed-off-by: NYi Yang <yi.y.yang@intel.com>
      Acked-by: NJiri Benc <jbenc@redhat.com>
      Acked-by: NEric Garver <e@erig.me>
      Acked-by: NPravin Shelar <pshelar@ovn.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b2d0f5d5
  3. 20 7月, 2017 2 次提交
    • T
      openvswitch: Optimize operations for OvS flow_stats. · c4b2bf6b
      Tonghao Zhang 提交于
      When calling the flow_free() to free the flow, we call many times
      (cpu_possible_mask, eg. 128 as default) cpumask_next(). That will
      take up our CPU usage if we call the flow_free() frequently.
      When we put all packets to userspace via upcall, and OvS will send
      them back via netlink to ovs_packet_cmd_execute(will call flow_free).
      
      The test topo is shown as below. VM01 sends TCP packets to VM02,
      and OvS forward packtets. When testing, we use perf to report the
      system performance.
      
      VM01 --- OvS-VM --- VM02
      
      Without this patch, perf-top show as below: The flow_free() is
      3.02% CPU usage.
      
      	4.23%  [kernel]            [k] _raw_spin_unlock_irqrestore
      	3.62%  [kernel]            [k] __do_softirq
      	3.16%  [kernel]            [k] __memcpy
      	3.02%  [kernel]            [k] flow_free
      	2.42%  libc-2.17.so        [.] __memcpy_ssse3_back
      	2.18%  [kernel]            [k] copy_user_generic_unrolled
      	2.17%  [kernel]            [k] find_next_bit
      
      When applied this patch, perf-top show as below: Not shown on
      the list anymore.
      
      	4.11%  [kernel]            [k] _raw_spin_unlock_irqrestore
      	3.79%  [kernel]            [k] __do_softirq
      	3.46%  [kernel]            [k] __memcpy
      	2.73%  libc-2.17.so        [.] __memcpy_ssse3_back
      	2.25%  [kernel]            [k] copy_user_generic_unrolled
      	1.89%  libc-2.17.so        [.] _int_malloc
      	1.53%  ovs-vswitchd        [.] xlate_actions
      
      With this patch, the TCP throughput(we dont use Megaflow Cache
      + Microflow Cache) between VMs is 1.18Gbs/sec up to 1.30Gbs/sec
      (maybe ~10% performance imporve).
      
      This patch adds cpumask struct, the cpu_used_mask stores the cpu_id
      that the flow used. And we only check the flow_stats on the cpu we
      used, and it is unncessary to check all possible cpu when getting,
      cleaning, and updating the flow_stats. Adding the cpu_used_mask to
      sw_flow struct does’t increase the cacheline number.
      Signed-off-by: NTonghao Zhang <xiangxia.m.yue@gmail.com>
      Acked-by: NPravin B Shelar <pshelar@ovn.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c4b2bf6b
    • T
      openvswitch: Optimize updating for OvS flow_stats. · c57c054e
      Tonghao Zhang 提交于
      In the ovs_flow_stats_update(), we only use the node
      var to alloc flow_stats struct. But this is not a
      common case, it is unnecessary to call the numa_node_id()
      everytime. This patch is not a bugfix, but there maybe
      a small increase.
      Signed-off-by: NTonghao Zhang <xiangxia.m.yue@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c57c054e
  4. 18 7月, 2017 1 次提交
  5. 02 4月, 2017 1 次提交
  6. 10 2月, 2017 1 次提交
    • J
      openvswitch: Add original direction conntrack tuple to sw_flow_key. · 9dd7f890
      Jarno Rajahalme 提交于
      Add the fields of the conntrack original direction 5-tuple to struct
      sw_flow_key.  The new fields are initially marked as non-existent, and
      are populated whenever a conntrack action is executed and either finds
      or generates a conntrack entry.  This means that these fields exist
      for all packets that were not rejected by conntrack as untrackable.
      
      The original tuple fields in the sw_flow_key are filled from the
      original direction tuple of the conntrack entry relating to the
      current packet, or from the original direction tuple of the master
      conntrack entry, if the current conntrack entry has a master.
      Generally, expected connections of connections having an assigned
      helper (e.g., FTP), have a master conntrack entry.
      
      The main purpose of the new conntrack original tuple fields is to
      allow matching on them for policy decision purposes, with the premise
      that the admissibility of tracked connections reply packets (as well
      as original direction packets), and both direction packets of any
      related connections may be based on ACL rules applying to the master
      connection's original direction 5-tuple.  This also makes it easier to
      make policy decisions when the actual packet headers might have been
      transformed by NAT, as the original direction 5-tuple represents the
      packet headers before any such transformation.
      
      When using the original direction 5-tuple the admissibility of return
      and/or related packets need not be based on the mere existence of a
      conntrack entry, allowing separation of admission policy from the
      established conntrack state.  While existence of a conntrack entry is
      required for admission of the return or related packets, policy
      changes can render connections that were initially admitted to be
      rejected or dropped afterwards.  If the admission of the return and
      related packets was based on mere conntrack state (e.g., connection
      being in an established state), a policy change that would make the
      connection rejected or dropped would need to find and delete all
      conntrack entries affected by such a change.  When using the original
      direction 5-tuple matching the affected conntrack entries can be
      allowed to time out instead, as the established state of the
      connection would not need to be the basis for packet admission any
      more.
      
      It should be noted that the directionality of related connections may
      be the same or different than that of the master connection, and
      neither the original direction 5-tuple nor the conntrack state bits
      carry this information.  If needed, the directionality of the master
      connection can be stored in master's conntrack mark or labels, which
      are automatically inherited by the expected related connections.
      
      The fact that neither ARP nor ND packets are trackable by conntrack
      allows mutual exclusion between ARP/ND and the new conntrack original
      tuple fields.  Hence, the IP addresses are overlaid in union with ARP
      and ND fields.  This allows the sw_flow_key to not grow much due to
      this patch, but it also means that we must be careful to never use the
      new key fields with ARP or ND packets.  ARP is easy to distinguish and
      keep mutually exclusive based on the ethernet type, but ND being an
      ICMPv6 protocol requires a bit more attention.
      Signed-off-by: NJarno Rajahalme <jarno@ovn.org>
      Acked-by: NJoe Stringer <joe@ovn.org>
      Acked-by: NPravin B Shelar <pshelar@ovn.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9dd7f890
  7. 28 12月, 2016 1 次提交
    • P
      openvswitch: upcall: Fix vlan handling. · df30f740
      pravin shelar 提交于
      Networking stack accelerate vlan tag handling by
      keeping topmost vlan header in skb. This works as
      long as packet remains in OVS datapath. But during
      OVS upcall vlan header is pushed on to the packet.
      When such packet is sent back to OVS datapath, core
      networking stack might not handle it correctly. Following
      patch avoids this issue by accelerating the vlan tag
      during flow key extract. This simplifies datapath by
      bringing uniform packet processing for packets from
      all code paths.
      
      Fixes: 5108bbad ("openvswitch: add processing of L3 packets").
      CC: Jarno Rajahalme <jarno@ovn.org>
      CC: Jiri Benc <jbenc@redhat.com>
      Signed-off-by: NPravin B Shelar <pshelar@ovn.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      df30f740
  8. 13 11月, 2016 2 次提交
    • J
      openvswitch: add processing of L3 packets · 5108bbad
      Jiri Benc 提交于
      Support receiving, extracting flow key and sending of L3 packets (packets
      without an Ethernet header).
      
      Note that even after this patch, non-Ethernet interfaces are still not
      allowed to be added to bridges. Similarly, netlink interface for sending and
      receiving L3 packets to/from user space is not in place yet.
      
      Based on previous versions by Lorand Jakab and Simon Horman.
      Signed-off-by: NLorand Jakab <lojakab@cisco.com>
      Signed-off-by: NSimon Horman <simon.horman@netronome.com>
      Signed-off-by: NJiri Benc <jbenc@redhat.com>
      Acked-by: NPravin B Shelar <pshelar@ovn.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5108bbad
    • J
      openvswitch: add mac_proto field to the flow key · 329f45bc
      Jiri Benc 提交于
      Use a hole in the structure. We support only Ethernet so far and will add
      a support for L2-less packets shortly. We could use a bool to indicate
      whether the Ethernet header is present or not but the approach with the
      mac_proto field is more generic and occupies the same number of bytes in the
      struct, while allowing later extensibility. It also makes the code in the
      next patches more self explaining.
      
      It would be nice to use ARPHRD_ constants but those are u16 which would be
      waste. Thus define our own constants.
      
      Another upside of this is that we can overload this new field to also denote
      whether the flow key is valid. This has the advantage that on
      refragmentation, we don't have to reparse the packet but can rely on the
      stored eth.type. This is especially important for the next patches in this
      series - instead of adding another branch for L2-less packets before calling
      ovs_fragment, we can just remove all those branches completely.
      Signed-off-by: NJiri Benc <jbenc@redhat.com>
      Acked-by: NPravin B Shelar <pshelar@ovn.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      329f45bc
  9. 13 10月, 2016 1 次提交
  10. 03 10月, 2016 1 次提交
  11. 21 9月, 2016 1 次提交
  12. 19 9月, 2016 2 次提交
  13. 09 9月, 2016 1 次提交
  14. 07 10月, 2015 1 次提交
  15. 01 9月, 2015 1 次提交
  16. 30 8月, 2015 3 次提交
  17. 28 8月, 2015 2 次提交
    • J
      openvswitch: Allow matching on conntrack label · c2ac6673
      Joe Stringer 提交于
      Allow matching and setting the ct_label field. As with ct_mark, this is
      populated by executing the CT action. The label field may be modified by
      specifying a label and mask nested under the CT action. It is stored as
      metadata attached to the connection. Label modification occurs after
      lookup, and will only persist when the conntrack entry is committed by
      providing the COMMIT flag to the CT action. Labels are currently fixed
      to 128 bits in size.
      Signed-off-by: NJoe Stringer <joestringer@nicira.com>
      Acked-by: NThomas Graf <tgraf@suug.ch>
      Acked-by: NPravin B Shelar <pshelar@nicira.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c2ac6673
    • J
      openvswitch: Add conntrack action · 7f8a436e
      Joe Stringer 提交于
      Expose the kernel connection tracker via OVS. Userspace components can
      make use of the CT action to populate the connection state (ct_state)
      field for a flow. This state can be subsequently matched.
      
      Exposed connection states are OVS_CS_F_*:
      - NEW (0x01) - Beginning of a new connection.
      - ESTABLISHED (0x02) - Part of an existing connection.
      - RELATED (0x04) - Related to an established connection.
      - INVALID (0x20) - Could not track the connection for this packet.
      - REPLY_DIR (0x40) - This packet is in the reply direction for the flow.
      - TRACKED (0x80) - This packet has been sent through conntrack.
      
      When the CT action is executed by itself, it will send the packet
      through the connection tracker and populate the ct_state field with one
      or more of the connection state flags above. The CT action will always
      set the TRACKED bit.
      
      When the COMMIT flag is passed to the conntrack action, this specifies
      that information about the connection should be stored. This allows
      subsequent packets for the same (or related) connections to be
      correlated with this connection. Sending subsequent packets for the
      connection through conntrack allows the connection tracker to consider
      the packets as ESTABLISHED, RELATED, and/or REPLY_DIR.
      
      The CT action may optionally take a zone to track the flow within. This
      allows connections with the same 5-tuple to be kept logically separate
      from connections in other zones. If the zone is specified, then the
      "ct_zone" match field will be subsequently populated with the zone id.
      
      IP fragments are handled by transparently assembling them as part of the
      CT action. The maximum received unit (MRU) size is tracked so that
      refragmentation can occur during output.
      
      IP frag handling contributed by Andy Zhou.
      
      Based on original design by Justin Pettit.
      Signed-off-by: NJoe Stringer <joestringer@nicira.com>
      Signed-off-by: NJustin Pettit <jpettit@nicira.com>
      Signed-off-by: NAndy Zhou <azhou@nicira.com>
      Acked-by: NThomas Graf <tgraf@suug.ch>
      Acked-by: NPravin B Shelar <pshelar@nicira.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7f8a436e
  18. 22 7月, 2015 1 次提交
  19. 06 5月, 2015 1 次提交
  20. 15 4月, 2015 1 次提交
    • D
      mm: remove GFP_THISNODE · 4167e9b2
      David Rientjes 提交于
      NOTE: this is not about __GFP_THISNODE, this is only about GFP_THISNODE.
      
      GFP_THISNODE is a secret combination of gfp bits that have different
      behavior than expected.  It is a combination of __GFP_THISNODE,
      __GFP_NORETRY, and __GFP_NOWARN and is special-cased in the page
      allocator slowpath to fail without trying reclaim even though it may be
      used in combination with __GFP_WAIT.
      
      An example of the problem this creates: commit e97ca8e5 ("mm: fix
      GFP_THISNODE callers and clarify") fixed up many users of GFP_THISNODE
      that really just wanted __GFP_THISNODE.  The problem doesn't end there,
      however, because even it was a no-op for alloc_misplaced_dst_page(),
      which also sets __GFP_NORETRY and __GFP_NOWARN, and
      migrate_misplaced_transhuge_page(), where __GFP_NORETRY and __GFP_NOWAIT
      is set in GFP_TRANSHUGE.  Converting GFP_THISNODE to __GFP_THISNODE is a
      no-op in these cases since the page allocator special-cases
      __GFP_THISNODE && __GFP_NORETRY && __GFP_NOWARN.
      
      It's time to just remove GFP_THISNODE entirely.  We leave __GFP_THISNODE
      to restrict an allocation to a local node, but remove GFP_THISNODE and
      its obscurity.  Instead, we require that a caller clear __GFP_WAIT if it
      wants to avoid reclaim.
      
      This allows the aforementioned functions to actually reclaim as they
      should.  It also enables any future callers that want to do
      __GFP_THISNODE but also __GFP_NORETRY && __GFP_NOWARN to reclaim.  The
      rule is simple: if you don't want to reclaim, then don't set __GFP_WAIT.
      
      Aside: ovs_flow_stats_update() really wants to avoid reclaim as well, so
      it is unchanged.
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Acked-by: NPekka Enberg <penberg@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Pravin Shelar <pshelar@nicira.com>
      Cc: Jarno Rajahalme <jrajahalme@nicira.com>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4167e9b2
  21. 12 2月, 2015 1 次提交
  22. 15 1月, 2015 1 次提交
  23. 14 1月, 2015 1 次提交
  24. 03 1月, 2015 1 次提交
  25. 10 11月, 2014 2 次提交
  26. 06 11月, 2014 1 次提交
  27. 18 10月, 2014 2 次提交
  28. 06 10月, 2014 3 次提交
  29. 16 9月, 2014 2 次提交