1. 10 2月, 2017 10 次提交
    • J
      openvswitch: Pack struct sw_flow_key. · 316d4d78
      Jarno Rajahalme 提交于
      struct sw_flow_key has two 16-bit holes. Move the most matched
      conntrack match fields there.  In some typical cases this reduces the
      size of the key that needs to be hashed into half and into one cache
      line.
      Signed-off-by: NJarno Rajahalme <jarno@ovn.org>
      Acked-by: NJoe Stringer <joe@ovn.org>
      Acked-by: NPravin B Shelar <pshelar@ovn.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      316d4d78
    • J
      openvswitch: Add force commit. · dd41d33f
      Jarno Rajahalme 提交于
      Stateful network admission policy may allow connections to one
      direction and reject connections initiated in the other direction.
      After policy change it is possible that for a new connection an
      overlapping conntrack entry already exists, where the original
      direction of the existing connection is opposed to the new
      connection's initial packet.
      
      Most importantly, conntrack state relating to the current packet gets
      the "reply" designation based on whether the original direction tuple
      or the reply direction tuple matched.  If this "directionality" is
      wrong w.r.t. to the stateful network admission policy it may happen
      that packets in neither direction are correctly admitted.
      
      This patch adds a new "force commit" option to the OVS conntrack
      action that checks the original direction of an existing conntrack
      entry.  If that direction is opposed to the current packet, the
      existing conntrack entry is deleted and a new one is subsequently
      created in the correct direction.
      Signed-off-by: NJarno Rajahalme <jarno@ovn.org>
      Acked-by: NPravin B Shelar <pshelar@ovn.org>
      Acked-by: NJoe Stringer <joe@ovn.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      dd41d33f
    • J
      openvswitch: Add original direction conntrack tuple to sw_flow_key. · 9dd7f890
      Jarno Rajahalme 提交于
      Add the fields of the conntrack original direction 5-tuple to struct
      sw_flow_key.  The new fields are initially marked as non-existent, and
      are populated whenever a conntrack action is executed and either finds
      or generates a conntrack entry.  This means that these fields exist
      for all packets that were not rejected by conntrack as untrackable.
      
      The original tuple fields in the sw_flow_key are filled from the
      original direction tuple of the conntrack entry relating to the
      current packet, or from the original direction tuple of the master
      conntrack entry, if the current conntrack entry has a master.
      Generally, expected connections of connections having an assigned
      helper (e.g., FTP), have a master conntrack entry.
      
      The main purpose of the new conntrack original tuple fields is to
      allow matching on them for policy decision purposes, with the premise
      that the admissibility of tracked connections reply packets (as well
      as original direction packets), and both direction packets of any
      related connections may be based on ACL rules applying to the master
      connection's original direction 5-tuple.  This also makes it easier to
      make policy decisions when the actual packet headers might have been
      transformed by NAT, as the original direction 5-tuple represents the
      packet headers before any such transformation.
      
      When using the original direction 5-tuple the admissibility of return
      and/or related packets need not be based on the mere existence of a
      conntrack entry, allowing separation of admission policy from the
      established conntrack state.  While existence of a conntrack entry is
      required for admission of the return or related packets, policy
      changes can render connections that were initially admitted to be
      rejected or dropped afterwards.  If the admission of the return and
      related packets was based on mere conntrack state (e.g., connection
      being in an established state), a policy change that would make the
      connection rejected or dropped would need to find and delete all
      conntrack entries affected by such a change.  When using the original
      direction 5-tuple matching the affected conntrack entries can be
      allowed to time out instead, as the established state of the
      connection would not need to be the basis for packet admission any
      more.
      
      It should be noted that the directionality of related connections may
      be the same or different than that of the master connection, and
      neither the original direction 5-tuple nor the conntrack state bits
      carry this information.  If needed, the directionality of the master
      connection can be stored in master's conntrack mark or labels, which
      are automatically inherited by the expected related connections.
      
      The fact that neither ARP nor ND packets are trackable by conntrack
      allows mutual exclusion between ARP/ND and the new conntrack original
      tuple fields.  Hence, the IP addresses are overlaid in union with ARP
      and ND fields.  This allows the sw_flow_key to not grow much due to
      this patch, but it also means that we must be careful to never use the
      new key fields with ARP or ND packets.  ARP is easy to distinguish and
      keep mutually exclusive based on the ethernet type, but ND being an
      ICMPv6 protocol requires a bit more attention.
      Signed-off-by: NJarno Rajahalme <jarno@ovn.org>
      Acked-by: NJoe Stringer <joe@ovn.org>
      Acked-by: NPravin B Shelar <pshelar@ovn.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9dd7f890
    • J
      openvswitch: Inherit master's labels. · 09aa98ad
      Jarno Rajahalme 提交于
      We avoid calling into nf_conntrack_in() for expected connections, as
      that would remove the expectation that we want to stick around until
      we are ready to commit the connection.  Instead, we do a lookup in the
      expectation table directly.  However, after a successful expectation
      lookup we have set the flow key label field from the master
      connection, whereas nf_conntrack_in() does not do this.  This leads to
      master's labels being inherited after an expectation lookup, but those
      labels not being inherited after the corresponding conntrack action
      with a commit flag.
      
      This patch resolves the problem by changing the commit code path to
      also inherit the master's labels to the expected connection.
      Resolving this conflict in favor of inheriting the labels allows more
      information be passed from the master connection to related
      connections, which would otherwise be much harder if the 32 bits in
      the connmark are not enough.  Labels can still be set explicitly, so
      this change only affects the default values of the labels in presense
      of a master connection.
      
      Fixes: 7f8a436e ("openvswitch: Add conntrack action")
      Signed-off-by: NJarno Rajahalme <jarno@ovn.org>
      Acked-by: NPravin B Shelar <pshelar@ovn.org>
      Acked-by: NJoe Stringer <joe@ovn.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      09aa98ad
    • J
      openvswitch: Refactor labels initialization. · 6ffcea79
      Jarno Rajahalme 提交于
      Refactoring conntrack labels initialization makes changes in later
      patches easier to review.
      Signed-off-by: NJarno Rajahalme <jarno@ovn.org>
      Acked-by: NPravin B Shelar <pshelar@ovn.org>
      Acked-by: NJoe Stringer <joe@ovn.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6ffcea79
    • J
      openvswitch: Simplify labels length logic. · b87cec38
      Jarno Rajahalme 提交于
      Since 23014011 ("netfilter: conntrack: support a fixed size of 128
      distinct labels"), the size of conntrack labels extension has fixed to
      128 bits, so we do not need to check for labels sizes shorter than 128
      at run-time.  This patch simplifies labels length logic accordingly,
      but allows the conntrack labels size to be increased in the future
      without breaking the build.  In the event of conntrack labels
      increasing in size OVS would still be able to deal with the 128 first
      label bits.
      Suggested-by: NJoe Stringer <joe@ovn.org>
      Signed-off-by: NJarno Rajahalme <jarno@ovn.org>
      Acked-by: NPravin B Shelar <pshelar@ovn.org>
      Acked-by: NJoe Stringer <joe@ovn.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b87cec38
    • J
      openvswitch: Unionize ovs_key_ct_label with a u32 array. · cb80d58f
      Jarno Rajahalme 提交于
      Make the array of labels in struct ovs_key_ct_label an union, adding a
      u32 array of the same byte size as the existing u8 array.  It is
      faster to loop through the labels 32 bits at the time, which is also
      the alignment of netlink attributes.
      Signed-off-by: NJarno Rajahalme <jarno@ovn.org>
      Acked-by: NJoe Stringer <joe@ovn.org>
      Acked-by: NPravin B Shelar <pshelar@ovn.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cb80d58f
    • J
      openvswitch: Do not trigger events for unconfirmed connections. · 193e3096
      Jarno Rajahalme 提交于
      Receiving change events before the 'new' event for the connection has
      been received can be confusing.  Avoid triggering change events for
      setting conntrack mark or labels before the conntrack entry has been
      confirmed.
      
      Fixes: 182e3042 ("openvswitch: Allow matching on conntrack mark")
      Fixes: c2ac6673 ("openvswitch: Allow matching on conntrack label")
      Signed-off-by: NJarno Rajahalme <jarno@ovn.org>
      Acked-by: NJoe Stringer <joe@ovn.org>
      Acked-by: NPravin B Shelar <pshelar@ovn.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      193e3096
    • J
      openvswitch: Use inverted tuple in ovs_ct_find_existing() if NATted. · 9ff464db
      Jarno Rajahalme 提交于
      The conntrack lookup for existing connections fails to invert the
      packet 5-tuple for NATted packets, and therefore fails to find the
      existing conntrack entry.  Conntrack only stores 5-tuples for incoming
      packets, and there are various situations where a lookup on a packet
      that has already been transformed by NAT needs to be made.  Looking up
      an existing conntrack entry upon executing packet received from the
      userspace is one of them.
      
      This patch fixes ovs_ct_find_existing() to invert the packet 5-tuple
      for the conntrack lookup whenever the packet has already been
      transformed by conntrack from its input form as evidenced by one of
      the NAT flags being set in the conntrack state metadata.
      
      Fixes: 05752523 ("openvswitch: Interface with NAT.")
      Signed-off-by: NJarno Rajahalme <jarno@ovn.org>
      Acked-by: NJoe Stringer <joe@ovn.org>
      Acked-by: NPravin B Shelar <pshelar@ovn.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9ff464db
    • J
      openvswitch: Fix comments for skb->_nfct · 5e17da63
      Jarno Rajahalme 提交于
      Fix comments referring to skb 'nfct' and 'nfctinfo' fields now that
      they are combined into '_nfct'.
      Signed-off-by: NJarno Rajahalme <jarno@ovn.org>
      Acked-by: NPravin B Shelar <pshelar@ovn.org>
      Acked-by: NJoe Stringer <joe@ovn.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5e17da63
  2. 02 2月, 2017 2 次提交
  3. 30 1月, 2017 1 次提交
  4. 16 1月, 2017 1 次提交
    • L
      openvswitch: maintain correct checksum state in conntrack actions · 75f01a4c
      Lance Richardson 提交于
      When executing conntrack actions on skbuffs with checksum mode
      CHECKSUM_COMPLETE, the checksum must be updated to account for
      header pushes and pulls. Otherwise we get "hw csum failure"
      logs similar to this (ICMP packet received on geneve tunnel
      via ixgbe NIC):
      
      [  405.740065] genev_sys_6081: hw csum failure
      [  405.740106] CPU: 3 PID: 0 Comm: swapper/3 Tainted: G          I     4.10.0-rc3+ #1
      [  405.740108] Call Trace:
      [  405.740110]  <IRQ>
      [  405.740113]  dump_stack+0x63/0x87
      [  405.740116]  netdev_rx_csum_fault+0x3a/0x40
      [  405.740118]  __skb_checksum_complete+0xcf/0xe0
      [  405.740120]  nf_ip_checksum+0xc8/0xf0
      [  405.740124]  icmp_error+0x1de/0x351 [nf_conntrack_ipv4]
      [  405.740132]  nf_conntrack_in+0xe1/0x550 [nf_conntrack]
      [  405.740137]  ? find_bucket.isra.2+0x62/0x70 [openvswitch]
      [  405.740143]  __ovs_ct_lookup+0x95/0x980 [openvswitch]
      [  405.740145]  ? netif_rx_internal+0x44/0x110
      [  405.740149]  ovs_ct_execute+0x147/0x4b0 [openvswitch]
      [  405.740153]  do_execute_actions+0x22e/0xa70 [openvswitch]
      [  405.740157]  ovs_execute_actions+0x40/0x120 [openvswitch]
      [  405.740161]  ovs_dp_process_packet+0x84/0x120 [openvswitch]
      [  405.740166]  ovs_vport_receive+0x73/0xd0 [openvswitch]
      [  405.740168]  ? udp_rcv+0x1a/0x20
      [  405.740170]  ? ip_local_deliver_finish+0x93/0x1e0
      [  405.740172]  ? ip_local_deliver+0x6f/0xe0
      [  405.740174]  ? ip_rcv_finish+0x3a0/0x3a0
      [  405.740176]  ? ip_rcv_finish+0xdb/0x3a0
      [  405.740177]  ? ip_rcv+0x2a7/0x400
      [  405.740180]  ? __netif_receive_skb_core+0x970/0xa00
      [  405.740185]  netdev_frame_hook+0xd3/0x160 [openvswitch]
      [  405.740187]  __netif_receive_skb_core+0x1dc/0xa00
      [  405.740194]  ? ixgbe_clean_rx_irq+0x46d/0xa20 [ixgbe]
      [  405.740197]  __netif_receive_skb+0x18/0x60
      [  405.740199]  netif_receive_skb_internal+0x40/0xb0
      [  405.740201]  napi_gro_receive+0xcd/0x120
      [  405.740204]  gro_cell_poll+0x57/0x80 [geneve]
      [  405.740206]  net_rx_action+0x260/0x3c0
      [  405.740209]  __do_softirq+0xc9/0x28c
      [  405.740211]  irq_exit+0xd9/0xf0
      [  405.740213]  do_IRQ+0x51/0xd0
      [  405.740215]  common_interrupt+0x93/0x93
      
      Fixes: 7f8a436e ("openvswitch: Add conntrack action")
      Signed-off-by: NLance Richardson <lrichard@redhat.com>
      Acked-by: NPravin B Shelar <pshelar@ovn.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      75f01a4c
  5. 09 1月, 2017 1 次提交
  6. 28 12月, 2016 1 次提交
    • P
      openvswitch: upcall: Fix vlan handling. · df30f740
      pravin shelar 提交于
      Networking stack accelerate vlan tag handling by
      keeping topmost vlan header in skb. This works as
      long as packet remains in OVS datapath. But during
      OVS upcall vlan header is pushed on to the packet.
      When such packet is sent back to OVS datapath, core
      networking stack might not handle it correctly. Following
      patch avoids this issue by accelerating the vlan tag
      during flow key extract. This simplifies datapath by
      bringing uniform packet processing for packets from
      all code paths.
      
      Fixes: 5108bbad ("openvswitch: add processing of L3 packets").
      CC: Jarno Rajahalme <jarno@ovn.org>
      CC: Jiri Benc <jbenc@redhat.com>
      Signed-off-by: NPravin B Shelar <pshelar@ovn.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      df30f740
  7. 21 12月, 2016 1 次提交
  8. 01 12月, 2016 1 次提交
  9. 18 11月, 2016 1 次提交
    • A
      netns: make struct pernet_operations::id unsigned int · c7d03a00
      Alexey Dobriyan 提交于
      Make struct pernet_operations::id unsigned.
      
      There are 2 reasons to do so:
      
      1)
      This field is really an index into an zero based array and
      thus is unsigned entity. Using negative value is out-of-bound
      access by definition.
      
      2)
      On x86_64 unsigned 32-bit data which are mixed with pointers
      via array indexing or offsets added or subtracted to pointers
      are preffered to signed 32-bit data.
      
      "int" being used as an array index needs to be sign-extended
      to 64-bit before being used.
      
      	void f(long *p, int i)
      	{
      		g(p[i]);
      	}
      
        roughly translates to
      
      	movsx	rsi, esi
      	mov	rdi, [rsi+...]
      	call 	g
      
      MOVSX is 3 byte instruction which isn't necessary if the variable is
      unsigned because x86_64 is zero extending by default.
      
      Now, there is net_generic() function which, you guessed it right, uses
      "int" as an array index:
      
      	static inline void *net_generic(const struct net *net, int id)
      	{
      		...
      		ptr = ng->ptr[id - 1];
      		...
      	}
      
      And this function is used a lot, so those sign extensions add up.
      
      Patch snipes ~1730 bytes on allyesconfig kernel (without all junk
      messing with code generation):
      
      	add/remove: 0/0 grow/shrink: 70/598 up/down: 396/-2126 (-1730)
      
      Unfortunately some functions actually grow bigger.
      This is a semmingly random artefact of code generation with register
      allocator being used differently. gcc decides that some variable
      needs to live in new r8+ registers and every access now requires REX
      prefix. Or it is shifted into r12, so [r12+0] addressing mode has to be
      used which is longer than [r8]
      
      However, overall balance is in negative direction:
      
      	add/remove: 0/0 grow/shrink: 70/598 up/down: 396/-2126 (-1730)
      	function                                     old     new   delta
      	nfsd4_lock                                  3886    3959     +73
      	tipc_link_build_proto_msg                   1096    1140     +44
      	mac80211_hwsim_new_radio                    2776    2808     +32
      	tipc_mon_rcv                                1032    1058     +26
      	svcauth_gss_legacy_init                     1413    1429     +16
      	tipc_bcbase_select_primary                   379     392     +13
      	nfsd4_exchange_id                           1247    1260     +13
      	nfsd4_setclientid_confirm                    782     793     +11
      		...
      	put_client_renew_locked                      494     480     -14
      	ip_set_sockfn_get                            730     716     -14
      	geneve_sock_add                              829     813     -16
      	nfsd4_sequence_done                          721     703     -18
      	nlmclnt_lookup_host                          708     686     -22
      	nfsd4_lockt                                 1085    1063     -22
      	nfs_get_client                              1077    1050     -27
      	tcf_bpf_init                                1106    1076     -30
      	nfsd4_encode_fattr                          5997    5930     -67
      	Total: Before=154856051, After=154854321, chg -0.00%
      Signed-off-by: NAlexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c7d03a00
  10. 13 11月, 2016 8 次提交
  11. 03 11月, 2016 1 次提交
  12. 28 10月, 2016 3 次提交
    • J
      genetlink: mark families as __ro_after_init · 56989f6d
      Johannes Berg 提交于
      Now genl_register_family() is the only thing (other than the
      users themselves, perhaps, but I didn't find any doing that)
      writing to the family struct.
      
      In all families that I found, genl_register_family() is only
      called from __init functions (some indirectly, in which case
      I've add __init annotations to clarifly things), so all can
      actually be marked __ro_after_init.
      
      This protects the data structure from accidental corruption.
      Signed-off-by: NJohannes Berg <johannes.berg@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      56989f6d
    • J
      genetlink: statically initialize families · 489111e5
      Johannes Berg 提交于
      Instead of providing macros/inline functions to initialize
      the families, make all users initialize them statically and
      get rid of the macros.
      
      This reduces the kernel code size by about 1.6k on x86-64
      (with allyesconfig).
      Signed-off-by: NJohannes Berg <johannes.berg@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      489111e5
    • J
      genetlink: no longer support using static family IDs · a07ea4d9
      Johannes Berg 提交于
      Static family IDs have never really been used, the only
      use case was the workaround I introduced for those users
      that assumed their family ID was also their multicast
      group ID.
      
      Additionally, because static family IDs would never be
      reserved by the generic netlink code, using a relatively
      low ID would only work for built-in families that can be
      registered immediately after generic netlink is started,
      which is basically only the control family (apart from
      the workaround code, which I also had to add code for so
      it would reserve those IDs)
      
      Thus, anything other than GENL_ID_GENERATE is flawed and
      luckily not used except in the cases I mentioned. Move
      those workarounds into a few lines of code, and then get
      rid of GENL_ID_GENERATE entirely, making it more robust.
      Signed-off-by: NJohannes Berg <johannes.berg@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a07ea4d9
  13. 21 10月, 2016 1 次提交
    • J
      net: use core MTU range checking in core net infra · 91572088
      Jarod Wilson 提交于
      geneve:
      - Merge __geneve_change_mtu back into geneve_change_mtu, set max_mtu
      - This one isn't quite as straight-forward as others, could use some
        closer inspection and testing
      
      macvlan:
      - set min/max_mtu
      
      tun:
      - set min/max_mtu, remove tun_net_change_mtu
      
      vxlan:
      - Merge __vxlan_change_mtu back into vxlan_change_mtu
      - Set max_mtu to IP_MAX_MTU and retain dynamic MTU range checks in
        change_mtu function
      - This one is also not as straight-forward and could use closer inspection
        and testing from vxlan folks
      
      bridge:
      - set max_mtu of IP_MAX_MTU and retain dynamic MTU range checks in
        change_mtu function
      
      openvswitch:
      - set min/max_mtu, remove internal_dev_change_mtu
      - note: max_mtu wasn't checked previously, it's been set to 65535, which
        is the largest possible size supported
      
      sch_teql:
      - set min/max_mtu (note: max_mtu previously unchecked, used max of 65535)
      
      macsec:
      - min_mtu = 0, max_mtu = 65535
      
      macvlan:
      - min_mtu = 0, max_mtu = 65535
      
      ntb_netdev:
      - min_mtu = 0, max_mtu = 65535
      
      veth:
      - min_mtu = 68, max_mtu = 65535
      
      8021q:
      - min_mtu = 0, max_mtu = 65535
      
      CC: netdev@vger.kernel.org
      CC: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      CC: Hannes Frederic Sowa <hannes@stressinduktion.org>
      CC: Tom Herbert <tom@herbertland.com>
      CC: Daniel Borkmann <daniel@iogearbox.net>
      CC: Alexander Duyck <alexander.h.duyck@intel.com>
      CC: Paolo Abeni <pabeni@redhat.com>
      CC: Jiri Benc <jbenc@redhat.com>
      CC: WANG Cong <xiyou.wangcong@gmail.com>
      CC: Roopa Prabhu <roopa@cumulusnetworks.com>
      CC: Pravin B Shelar <pshelar@ovn.org>
      CC: Sabrina Dubroca <sd@queasysnail.net>
      CC: Patrick McHardy <kaber@trash.net>
      CC: Stephen Hemminger <stephen@networkplumber.org>
      CC: Pravin Shelar <pshelar@nicira.com>
      CC: Maxim Krasnyansky <maxk@qti.qualcomm.com>
      Signed-off-by: NJarod Wilson <jarod@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      91572088
  14. 20 10月, 2016 2 次提交
  15. 13 10月, 2016 3 次提交
  16. 12 10月, 2016 1 次提交
  17. 03 10月, 2016 2 次提交