提交 78a8a36f 编写于 作者: D David S. Miller
......@@ -144,6 +144,8 @@ nfc.txt
- The Linux Near Field Communication (NFS) subsystem.
olympic.txt
- IBM PCI Pit/Pit-Phy/Olympic Token Ring driver info.
openvswitch.txt
- Open vSwitch developer documentation.
operstates.txt
- Overview of network interface operational states.
packet_mmap.txt
......
Open vSwitch datapath developer documentation
=============================================
The Open vSwitch kernel module allows flexible userspace control over
flow-level packet processing on selected network devices. It can be
used to implement a plain Ethernet switch, network device bonding,
VLAN processing, network access control, flow-based network control,
and so on.
The kernel module implements multiple "datapaths" (analogous to
bridges), each of which can have multiple "vports" (analogous to ports
within a bridge). Each datapath also has associated with it a "flow
table" that userspace populates with "flows" that map from keys based
on packet headers and metadata to sets of actions. The most common
action forwards the packet to another vport; other actions are also
implemented.
When a packet arrives on a vport, the kernel module processes it by
extracting its flow key and looking it up in the flow table. If there
is a matching flow, it executes the associated actions. If there is
no match, it queues the packet to userspace for processing (as part of
its processing, userspace will likely set up a flow to handle further
packets of the same type entirely in-kernel).
Flow key compatibility
----------------------
Network protocols evolve over time. New protocols become important
and existing protocols lose their prominence. For the Open vSwitch
kernel module to remain relevant, it must be possible for newer
versions to parse additional protocols as part of the flow key. It
might even be desirable, someday, to drop support for parsing
protocols that have become obsolete. Therefore, the Netlink interface
to Open vSwitch is designed to allow carefully written userspace
applications to work with any version of the flow key, past or future.
To support this forward and backward compatibility, whenever the
kernel module passes a packet to userspace, it also passes along the
flow key that it parsed from the packet. Userspace then extracts its
own notion of a flow key from the packet and compares it against the
kernel-provided version:
- If userspace's notion of the flow key for the packet matches the
kernel's, then nothing special is necessary.
- If the kernel's flow key includes more fields than the userspace
version of the flow key, for example if the kernel decoded IPv6
headers but userspace stopped at the Ethernet type (because it
does not understand IPv6), then again nothing special is
necessary. Userspace can still set up a flow in the usual way,
as long as it uses the kernel-provided flow key to do it.
- If the userspace flow key includes more fields than the
kernel's, for example if userspace decoded an IPv6 header but
the kernel stopped at the Ethernet type, then userspace can
forward the packet manually, without setting up a flow in the
kernel. This case is bad for performance because every packet
that the kernel considers part of the flow must go to userspace,
but the forwarding behavior is correct. (If userspace can
determine that the values of the extra fields would not affect
forwarding behavior, then it could set up a flow anyway.)
How flow keys evolve over time is important to making this work, so
the following sections go into detail.
Flow key format
---------------
A flow key is passed over a Netlink socket as a sequence of Netlink
attributes. Some attributes represent packet metadata, defined as any
information about a packet that cannot be extracted from the packet
itself, e.g. the vport on which the packet was received. Most
attributes, however, are extracted from headers within the packet,
e.g. source and destination addresses from Ethernet, IP, or TCP
headers.
The <linux/openvswitch.h> header file defines the exact format of the
flow key attributes. For informal explanatory purposes here, we write
them as comma-separated strings, with parentheses indicating arguments
and nesting. For example, the following could represent a flow key
corresponding to a TCP packet that arrived on vport 1:
in_port(1), eth(src=e0:91:f5:21:d0:b2, dst=00:02:e3:0f:80:a4),
eth_type(0x0800), ipv4(src=172.16.0.20, dst=172.18.0.52, proto=17, tos=0,
frag=no), tcp(src=49163, dst=80)
Often we ellipsize arguments not important to the discussion, e.g.:
in_port(1), eth(...), eth_type(0x0800), ipv4(...), tcp(...)
Basic rule for evolving flow keys
---------------------------------
Some care is needed to really maintain forward and backward
compatibility for applications that follow the rules listed under
"Flow key compatibility" above.
The basic rule is obvious:
------------------------------------------------------------------
New network protocol support must only supplement existing flow
key attributes. It must not change the meaning of already defined
flow key attributes.
------------------------------------------------------------------
This rule does have less-obvious consequences so it is worth working
through a few examples. Suppose, for example, that the kernel module
did not already implement VLAN parsing. Instead, it just interpreted
the 802.1Q TPID (0x8100) as the Ethertype then stopped parsing the
packet. The flow key for any packet with an 802.1Q header would look
essentially like this, ignoring metadata:
eth(...), eth_type(0x8100)
Naively, to add VLAN support, it makes sense to add a new "vlan" flow
key attribute to contain the VLAN tag, then continue to decode the
encapsulated headers beyond the VLAN tag using the existing field
definitions. With this change, an TCP packet in VLAN 10 would have a
flow key much like this:
eth(...), vlan(vid=10, pcp=0), eth_type(0x0800), ip(proto=6, ...), tcp(...)
But this change would negatively affect a userspace application that
has not been updated to understand the new "vlan" flow key attribute.
The application could, following the flow compatibility rules above,
ignore the "vlan" attribute that it does not understand and therefore
assume that the flow contained IP packets. This is a bad assumption
(the flow only contains IP packets if one parses and skips over the
802.1Q header) and it could cause the application's behavior to change
across kernel versions even though it follows the compatibility rules.
The solution is to use a set of nested attributes. This is, for
example, why 802.1Q support uses nested attributes. A TCP packet in
VLAN 10 is actually expressed as:
eth(...), eth_type(0x8100), vlan(vid=10, pcp=0), encap(eth_type(0x0800),
ip(proto=6, ...), tcp(...)))
Notice how the "eth_type", "ip", and "tcp" flow key attributes are
nested inside the "encap" attribute. Thus, an application that does
not understand the "vlan" key will not see either of those attributes
and therefore will not misinterpret them. (Also, the outer eth_type
is still 0x8100, not changed to 0x0800.)
Handling malformed packets
--------------------------
Don't drop packets in the kernel for malformed protocol headers, bad
checksums, etc. This would prevent userspace from implementing a
simple Ethernet switch that forwards every packet.
Instead, in such a case, include an attribute with "empty" content.
It doesn't matter if the empty content could be valid protocol values,
as long as those values are rarely seen in practice, because userspace
can always forward all packets with those values to userspace and
handle them individually.
For example, consider a packet that contains an IP header that
indicates protocol 6 for TCP, but which is truncated just after the IP
header, so that the TCP header is missing. The flow key for this
packet would include a tcp attribute with all-zero src and dst, like
this:
eth(...), eth_type(0x0800), ip(proto=6, ...), tcp(src=0, dst=0)
As another example, consider a packet with an Ethernet type of 0x8100,
indicating that a VLAN TCI should follow, but which is truncated just
after the Ethernet type. The flow key for this packet would include
an all-zero-bits vlan and an empty encap attribute, like this:
eth(...), eth_type(0x8100), vlan(0), encap()
Unlike a TCP packet with source and destination ports 0, an
all-zero-bits VLAN TCI is not that rare, so the CFI bit (aka
VLAN_TAG_PRESENT inside the kernel) is ordinarily set in a vlan
attribute expressly to allow this situation to be distinguished.
Thus, the flow key in this second example unambiguously indicates a
missing or malformed VLAN TCI.
Other rules
-----------
The other rules for flow keys are much less subtle:
- Duplicate attributes are not allowed at a given nesting level.
- Ordering of attributes is not significant.
- When the kernel sends a given flow key to userspace, it always
composes it the same way. This allows userspace to hash and
compare entire flow keys that it may not be able to fully
interpret.
......@@ -4868,6 +4868,14 @@ S: Maintained
T: git git://openrisc.net/~jonas/linux
F: arch/openrisc
OPENVSWITCH
M: Jesse Gross <jesse@nicira.com>
L: dev@openvswitch.org
W: http://openvswitch.org
T: git git://git.kernel.org/pub/scm/linux/kernel/git/jesse/openvswitch.git
S: Maintained
F: net/openvswitch/
OPL4 DRIVER
M: Clemens Ladisch <clemens@ladisch.de>
L: alsa-devel@alsa-project.org (moderated for non-subscribers)
......
......@@ -85,6 +85,30 @@ enum {
/* All generic netlink requests are serialized by a global lock. */
extern void genl_lock(void);
extern void genl_unlock(void);
#ifdef CONFIG_PROVE_LOCKING
extern int lockdep_genl_is_held(void);
#endif
/**
* rcu_dereference_genl - rcu_dereference with debug checking
* @p: The pointer to read, prior to dereferencing
*
* Do an rcu_dereference(p), but check caller either holds rcu_read_lock()
* or genl mutex. Note : Please prefer genl_dereference() or rcu_dereference()
*/
#define rcu_dereference_genl(p) \
rcu_dereference_check(p, lockdep_genl_is_held())
/**
* genl_dereference - fetch RCU pointer when updates are prevented by genl mutex
* @p: The pointer to read, prior to dereferencing
*
* Return the value of the specified RCU-protected pointer, but omit
* both the smp_read_barrier_depends() and the ACCESS_ONCE(), because
* caller holds genl mutex.
*/
#define genl_dereference(p) \
rcu_dereference_protected(p, lockdep_genl_is_held())
#endif /* __KERNEL__ */
......
......@@ -310,6 +310,40 @@ static inline __be16 vlan_get_protocol(const struct sk_buff *skb)
return protocol;
}
static inline void vlan_set_encap_proto(struct sk_buff *skb,
struct vlan_hdr *vhdr)
{
__be16 proto;
unsigned char *rawp;
/*
* Was a VLAN packet, grab the encapsulated protocol, which the layer
* three protocols care about.
*/
proto = vhdr->h_vlan_encapsulated_proto;
if (ntohs(proto) >= 1536) {
skb->protocol = proto;
return;
}
rawp = skb->data;
if (*(unsigned short *) rawp == 0xFFFF)
/*
* This is a magic hack to spot IPX packets. Older Novell
* breaks the protocol design and runs IPX over 802.3 without
* an 802.2 LLC layer. We look for FFFF which isn't a used
* 802.2 SSAP/DSAP. This won't work for fault tolerant netware
* but does for the rest.
*/
skb->protocol = htons(ETH_P_802_3);
else
/*
* Real 802.2 LLC
*/
skb->protocol = htons(ETH_P_802_2);
}
#endif /* __KERNEL__ */
/* VLAN IOCTLs are found in sockios.h */
......
/*
* Copyright (c) 2007-2011 Nicira Networks.
*
* This program is free software; you can redistribute it and/or
* modify it under the terms of version 2 of the GNU General Public
* License as published by the Free Software Foundation.
*
* This program is distributed in the hope that it will be useful, but
* WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
* General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with this program; if not, write to the Free Software
* Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA
* 02110-1301, USA
*/
#ifndef _LINUX_OPENVSWITCH_H
#define _LINUX_OPENVSWITCH_H 1
#include <linux/types.h>
/**
* struct ovs_header - header for OVS Generic Netlink messages.
* @dp_ifindex: ifindex of local port for datapath (0 to make a request not
* specific to a datapath).
*
* Attributes following the header are specific to a particular OVS Generic
* Netlink family, but all of the OVS families use this header.
*/
struct ovs_header {
int dp_ifindex;
};
/* Datapaths. */
#define OVS_DATAPATH_FAMILY "ovs_datapath"
#define OVS_DATAPATH_MCGROUP "ovs_datapath"
#define OVS_DATAPATH_VERSION 0x1
enum ovs_datapath_cmd {
OVS_DP_CMD_UNSPEC,
OVS_DP_CMD_NEW,
OVS_DP_CMD_DEL,
OVS_DP_CMD_GET,
OVS_DP_CMD_SET
};
/**
* enum ovs_datapath_attr - attributes for %OVS_DP_* commands.
* @OVS_DP_ATTR_NAME: Name of the network device that serves as the "local
* port". This is the name of the network device whose dp_ifindex is given in
* the &struct ovs_header. Always present in notifications. Required in
* %OVS_DP_NEW requests. May be used as an alternative to specifying
* dp_ifindex in other requests (with a dp_ifindex of 0).
* @OVS_DP_ATTR_UPCALL_PID: The Netlink socket in userspace that is initially
* set on the datapath port (for OVS_ACTION_ATTR_MISS). Only valid on
* %OVS_DP_CMD_NEW requests. A value of zero indicates that upcalls should
* not be sent.
* @OVS_DP_ATTR_STATS: Statistics about packets that have passed through the
* datapath. Always present in notifications.
*
* These attributes follow the &struct ovs_header within the Generic Netlink
* payload for %OVS_DP_* commands.
*/
enum ovs_datapath_attr {
OVS_DP_ATTR_UNSPEC,
OVS_DP_ATTR_NAME, /* name of dp_ifindex netdev */
OVS_DP_ATTR_UPCALL_PID, /* Netlink PID to receive upcalls */
OVS_DP_ATTR_STATS, /* struct ovs_dp_stats */
__OVS_DP_ATTR_MAX
};
#define OVS_DP_ATTR_MAX (__OVS_DP_ATTR_MAX - 1)
struct ovs_dp_stats {
__u64 n_hit; /* Number of flow table matches. */
__u64 n_missed; /* Number of flow table misses. */
__u64 n_lost; /* Number of misses not sent to userspace. */
__u64 n_flows; /* Number of flows present */
};
struct ovs_vport_stats {
__u64 rx_packets; /* total packets received */
__u64 tx_packets; /* total packets transmitted */
__u64 rx_bytes; /* total bytes received */
__u64 tx_bytes; /* total bytes transmitted */
__u64 rx_errors; /* bad packets received */
__u64 tx_errors; /* packet transmit problems */
__u64 rx_dropped; /* no space in linux buffers */
__u64 tx_dropped; /* no space available in linux */
};
/* Fixed logical ports. */
#define OVSP_LOCAL ((__u16)0)
/* Packet transfer. */
#define OVS_PACKET_FAMILY "ovs_packet"
#define OVS_PACKET_VERSION 0x1
enum ovs_packet_cmd {
OVS_PACKET_CMD_UNSPEC,
/* Kernel-to-user notifications. */
OVS_PACKET_CMD_MISS, /* Flow table miss. */
OVS_PACKET_CMD_ACTION, /* OVS_ACTION_ATTR_USERSPACE action. */
/* Userspace commands. */
OVS_PACKET_CMD_EXECUTE /* Apply actions to a packet. */
};
/**
* enum ovs_packet_attr - attributes for %OVS_PACKET_* commands.
* @OVS_PACKET_ATTR_PACKET: Present for all notifications. Contains the entire
* packet as received, from the start of the Ethernet header onward. For
* %OVS_PACKET_CMD_ACTION, %OVS_PACKET_ATTR_PACKET reflects changes made by
* actions preceding %OVS_ACTION_ATTR_USERSPACE, but %OVS_PACKET_ATTR_KEY is
* the flow key extracted from the packet as originally received.
* @OVS_PACKET_ATTR_KEY: Present for all notifications. Contains the flow key
* extracted from the packet as nested %OVS_KEY_ATTR_* attributes. This allows
* userspace to adapt its flow setup strategy by comparing its notion of the
* flow key against the kernel's.
* @OVS_PACKET_ATTR_ACTIONS: Contains actions for the packet. Used
* for %OVS_PACKET_CMD_EXECUTE. It has nested %OVS_ACTION_ATTR_* attributes.
* @OVS_PACKET_ATTR_USERDATA: Present for an %OVS_PACKET_CMD_ACTION
* notification if the %OVS_ACTION_ATTR_USERSPACE action specified an
* %OVS_USERSPACE_ATTR_USERDATA attribute.
*
* These attributes follow the &struct ovs_header within the Generic Netlink
* payload for %OVS_PACKET_* commands.
*/
enum ovs_packet_attr {
OVS_PACKET_ATTR_UNSPEC,
OVS_PACKET_ATTR_PACKET, /* Packet data. */
OVS_PACKET_ATTR_KEY, /* Nested OVS_KEY_ATTR_* attributes. */
OVS_PACKET_ATTR_ACTIONS, /* Nested OVS_ACTION_ATTR_* attributes. */
OVS_PACKET_ATTR_USERDATA, /* u64 OVS_ACTION_ATTR_USERSPACE arg. */
__OVS_PACKET_ATTR_MAX
};
#define OVS_PACKET_ATTR_MAX (__OVS_PACKET_ATTR_MAX - 1)
/* Virtual ports. */
#define OVS_VPORT_FAMILY "ovs_vport"
#define OVS_VPORT_MCGROUP "ovs_vport"
#define OVS_VPORT_VERSION 0x1
enum ovs_vport_cmd {
OVS_VPORT_CMD_UNSPEC,
OVS_VPORT_CMD_NEW,
OVS_VPORT_CMD_DEL,
OVS_VPORT_CMD_GET,
OVS_VPORT_CMD_SET
};
enum ovs_vport_type {
OVS_VPORT_TYPE_UNSPEC,
OVS_VPORT_TYPE_NETDEV, /* network device */
OVS_VPORT_TYPE_INTERNAL, /* network device implemented by datapath */
__OVS_VPORT_TYPE_MAX
};
#define OVS_VPORT_TYPE_MAX (__OVS_VPORT_TYPE_MAX - 1)
/**
* enum ovs_vport_attr - attributes for %OVS_VPORT_* commands.
* @OVS_VPORT_ATTR_PORT_NO: 32-bit port number within datapath.
* @OVS_VPORT_ATTR_TYPE: 32-bit %OVS_VPORT_TYPE_* constant describing the type
* of vport.
* @OVS_VPORT_ATTR_NAME: Name of vport. For a vport based on a network device
* this is the name of the network device. Maximum length %IFNAMSIZ-1 bytes
* plus a null terminator.
* @OVS_VPORT_ATTR_OPTIONS: Vport-specific configuration information.
* @OVS_VPORT_ATTR_UPCALL_PID: The Netlink socket in userspace that
* OVS_PACKET_CMD_MISS upcalls will be directed to for packets received on
* this port. A value of zero indicates that upcalls should not be sent.
* @OVS_VPORT_ATTR_STATS: A &struct ovs_vport_stats giving statistics for
* packets sent or received through the vport.
*
* These attributes follow the &struct ovs_header within the Generic Netlink
* payload for %OVS_VPORT_* commands.
*
* For %OVS_VPORT_CMD_NEW requests, the %OVS_VPORT_ATTR_TYPE and
* %OVS_VPORT_ATTR_NAME attributes are required. %OVS_VPORT_ATTR_PORT_NO is
* optional; if not specified a free port number is automatically selected.
* Whether %OVS_VPORT_ATTR_OPTIONS is required or optional depends on the type
* of vport.
* and other attributes are ignored.
*
* For other requests, if %OVS_VPORT_ATTR_NAME is specified then it is used to
* look up the vport to operate on; otherwise dp_idx from the &struct
* ovs_header plus %OVS_VPORT_ATTR_PORT_NO determine the vport.
*/
enum ovs_vport_attr {
OVS_VPORT_ATTR_UNSPEC,
OVS_VPORT_ATTR_PORT_NO, /* u32 port number within datapath */
OVS_VPORT_ATTR_TYPE, /* u32 OVS_VPORT_TYPE_* constant. */
OVS_VPORT_ATTR_NAME, /* string name, up to IFNAMSIZ bytes long */
OVS_VPORT_ATTR_OPTIONS, /* nested attributes, varies by vport type */
OVS_VPORT_ATTR_UPCALL_PID, /* u32 Netlink PID to receive upcalls */
OVS_VPORT_ATTR_STATS, /* struct ovs_vport_stats */
__OVS_VPORT_ATTR_MAX
};
#define OVS_VPORT_ATTR_MAX (__OVS_VPORT_ATTR_MAX - 1)
/* Flows. */
#define OVS_FLOW_FAMILY "ovs_flow"
#define OVS_FLOW_MCGROUP "ovs_flow"
#define OVS_FLOW_VERSION 0x1
enum ovs_flow_cmd {
OVS_FLOW_CMD_UNSPEC,
OVS_FLOW_CMD_NEW,
OVS_FLOW_CMD_DEL,
OVS_FLOW_CMD_GET,
OVS_FLOW_CMD_SET
};
struct ovs_flow_stats {
__u64 n_packets; /* Number of matched packets. */
__u64 n_bytes; /* Number of matched bytes. */
};
enum ovs_key_attr {
OVS_KEY_ATTR_UNSPEC,
OVS_KEY_ATTR_ENCAP, /* Nested set of encapsulated attributes. */
OVS_KEY_ATTR_PRIORITY, /* u32 skb->priority */
OVS_KEY_ATTR_IN_PORT, /* u32 OVS dp port number */
OVS_KEY_ATTR_ETHERNET, /* struct ovs_key_ethernet */
OVS_KEY_ATTR_VLAN, /* be16 VLAN TCI */
OVS_KEY_ATTR_ETHERTYPE, /* be16 Ethernet type */
OVS_KEY_ATTR_IPV4, /* struct ovs_key_ipv4 */
OVS_KEY_ATTR_IPV6, /* struct ovs_key_ipv6 */
OVS_KEY_ATTR_TCP, /* struct ovs_key_tcp */
OVS_KEY_ATTR_UDP, /* struct ovs_key_udp */
OVS_KEY_ATTR_ICMP, /* struct ovs_key_icmp */
OVS_KEY_ATTR_ICMPV6, /* struct ovs_key_icmpv6 */
OVS_KEY_ATTR_ARP, /* struct ovs_key_arp */
OVS_KEY_ATTR_ND, /* struct ovs_key_nd */
__OVS_KEY_ATTR_MAX
};
#define OVS_KEY_ATTR_MAX (__OVS_KEY_ATTR_MAX - 1)
/**
* enum ovs_frag_type - IPv4 and IPv6 fragment type
* @OVS_FRAG_TYPE_NONE: Packet is not a fragment.
* @OVS_FRAG_TYPE_FIRST: Packet is a fragment with offset 0.
* @OVS_FRAG_TYPE_LATER: Packet is a fragment with nonzero offset.
*
* Used as the @ipv4_frag in &struct ovs_key_ipv4 and as @ipv6_frag &struct
* ovs_key_ipv6.
*/
enum ovs_frag_type {
OVS_FRAG_TYPE_NONE,
OVS_FRAG_TYPE_FIRST,
OVS_FRAG_TYPE_LATER,
__OVS_FRAG_TYPE_MAX
};
#define OVS_FRAG_TYPE_MAX (__OVS_FRAG_TYPE_MAX - 1)
struct ovs_key_ethernet {
__u8 eth_src[6];
__u8 eth_dst[6];
};
struct ovs_key_ipv4 {
__be32 ipv4_src;
__be32 ipv4_dst;
__u8 ipv4_proto;
__u8 ipv4_tos;
__u8 ipv4_ttl;
__u8 ipv4_frag; /* One of OVS_FRAG_TYPE_*. */
};
struct ovs_key_ipv6 {
__be32 ipv6_src[4];
__be32 ipv6_dst[4];
__be32 ipv6_label; /* 20-bits in least-significant bits. */
__u8 ipv6_proto;
__u8 ipv6_tclass;
__u8 ipv6_hlimit;
__u8 ipv6_frag; /* One of OVS_FRAG_TYPE_*. */
};
struct ovs_key_tcp {
__be16 tcp_src;
__be16 tcp_dst;
};
struct ovs_key_udp {
__be16 udp_src;
__be16 udp_dst;
};
struct ovs_key_icmp {
__u8 icmp_type;
__u8 icmp_code;
};
struct ovs_key_icmpv6 {
__u8 icmpv6_type;
__u8 icmpv6_code;
};
struct ovs_key_arp {
__be32 arp_sip;
__be32 arp_tip;
__be16 arp_op;
__u8 arp_sha[6];
__u8 arp_tha[6];
};
struct ovs_key_nd {
__u32 nd_target[4];
__u8 nd_sll[6];
__u8 nd_tll[6];
};
/**
* enum ovs_flow_attr - attributes for %OVS_FLOW_* commands.
* @OVS_FLOW_ATTR_KEY: Nested %OVS_KEY_ATTR_* attributes specifying the flow
* key. Always present in notifications. Required for all requests (except
* dumps).
* @OVS_FLOW_ATTR_ACTIONS: Nested %OVS_ACTION_ATTR_* attributes specifying
* the actions to take for packets that match the key. Always present in
* notifications. Required for %OVS_FLOW_CMD_NEW requests, optional for
* %OVS_FLOW_CMD_SET requests.
* @OVS_FLOW_ATTR_STATS: &struct ovs_flow_stats giving statistics for this
* flow. Present in notifications if the stats would be nonzero. Ignored in
* requests.
* @OVS_FLOW_ATTR_TCP_FLAGS: An 8-bit value giving the OR'd value of all of the
* TCP flags seen on packets in this flow. Only present in notifications for
* TCP flows, and only if it would be nonzero. Ignored in requests.
* @OVS_FLOW_ATTR_USED: A 64-bit integer giving the time, in milliseconds on
* the system monotonic clock, at which a packet was last processed for this
* flow. Only present in notifications if a packet has been processed for this
* flow. Ignored in requests.
* @OVS_FLOW_ATTR_CLEAR: If present in a %OVS_FLOW_CMD_SET request, clears the
* last-used time, accumulated TCP flags, and statistics for this flow.
* Otherwise ignored in requests. Never present in notifications.
*
* These attributes follow the &struct ovs_header within the Generic Netlink
* payload for %OVS_FLOW_* commands.
*/
enum ovs_flow_attr {
OVS_FLOW_ATTR_UNSPEC,
OVS_FLOW_ATTR_KEY, /* Sequence of OVS_KEY_ATTR_* attributes. */
OVS_FLOW_ATTR_ACTIONS, /* Nested OVS_ACTION_ATTR_* attributes. */
OVS_FLOW_ATTR_STATS, /* struct ovs_flow_stats. */
OVS_FLOW_ATTR_TCP_FLAGS, /* 8-bit OR'd TCP flags. */
OVS_FLOW_ATTR_USED, /* u64 msecs last used in monotonic time. */
OVS_FLOW_ATTR_CLEAR, /* Flag to clear stats, tcp_flags, used. */
__OVS_FLOW_ATTR_MAX
};
#define OVS_FLOW_ATTR_MAX (__OVS_FLOW_ATTR_MAX - 1)
/**
* enum ovs_sample_attr - Attributes for %OVS_ACTION_ATTR_SAMPLE action.
* @OVS_SAMPLE_ATTR_PROBABILITY: 32-bit fraction of packets to sample with
* @OVS_ACTION_ATTR_SAMPLE. A value of 0 samples no packets, a value of
* %UINT32_MAX samples all packets and intermediate values sample intermediate
* fractions of packets.
* @OVS_SAMPLE_ATTR_ACTIONS: Set of actions to execute in sampling event.
* Actions are passed as nested attributes.
*
* Executes the specified actions with the given probability on a per-packet
* basis.
*/
enum ovs_sample_attr {
OVS_SAMPLE_ATTR_UNSPEC,
OVS_SAMPLE_ATTR_PROBABILITY, /* u32 number */
OVS_SAMPLE_ATTR_ACTIONS, /* Nested OVS_ACTION_ATTR_* attributes. */
__OVS_SAMPLE_ATTR_MAX,
};
#define OVS_SAMPLE_ATTR_MAX (__OVS_SAMPLE_ATTR_MAX - 1)
/**
* enum ovs_userspace_attr - Attributes for %OVS_ACTION_ATTR_USERSPACE action.
* @OVS_USERSPACE_ATTR_PID: u32 Netlink PID to which the %OVS_PACKET_CMD_ACTION
* message should be sent. Required.
* @OVS_USERSPACE_ATTR_USERDATA: If present, its u64 argument is copied to the
* %OVS_PACKET_CMD_ACTION message as %OVS_PACKET_ATTR_USERDATA,
*/
enum ovs_userspace_attr {
OVS_USERSPACE_ATTR_UNSPEC,
OVS_USERSPACE_ATTR_PID, /* u32 Netlink PID to receive upcalls. */
OVS_USERSPACE_ATTR_USERDATA, /* u64 optional user-specified cookie. */
__OVS_USERSPACE_ATTR_MAX
};
#define OVS_USERSPACE_ATTR_MAX (__OVS_USERSPACE_ATTR_MAX - 1)
/**
* struct ovs_action_push_vlan - %OVS_ACTION_ATTR_PUSH_VLAN action argument.
* @vlan_tpid: Tag protocol identifier (TPID) to push.
* @vlan_tci: Tag control identifier (TCI) to push. The CFI bit must be set
* (but it will not be set in the 802.1Q header that is pushed).
*
* The @vlan_tpid value is typically %ETH_P_8021Q. The only acceptable TPID
* values are those that the kernel module also parses as 802.1Q headers, to
* prevent %OVS_ACTION_ATTR_PUSH_VLAN followed by %OVS_ACTION_ATTR_POP_VLAN
* from having surprising results.
*/
struct ovs_action_push_vlan {
__be16 vlan_tpid; /* 802.1Q TPID. */
__be16 vlan_tci; /* 802.1Q TCI (VLAN ID and priority). */
};
/**
* enum ovs_action_attr - Action types.
*
* @OVS_ACTION_ATTR_OUTPUT: Output packet to port.
* @OVS_ACTION_ATTR_USERSPACE: Send packet to userspace according to nested
* %OVS_USERSPACE_ATTR_* attributes.
* @OVS_ACTION_ATTR_SET: Replaces the contents of an existing header. The
* single nested %OVS_KEY_ATTR_* attribute specifies a header to modify and its
* value.
* @OVS_ACTION_ATTR_PUSH_VLAN: Push a new outermost 802.1Q header onto the
* packet.
* @OVS_ACTION_ATTR_POP_VLAN: Pop the outermost 802.1Q header off the packet.
* @OVS_ACTION_ATTR_SAMPLE: Probabilitically executes actions, as specified in
* the nested %OVS_SAMPLE_ATTR_* attributes.
*
* Only a single header can be set with a single %OVS_ACTION_ATTR_SET. Not all
* fields within a header are modifiable, e.g. the IPv4 protocol and fragment
* type may not be changed.
*/
enum ovs_action_attr {
OVS_ACTION_ATTR_UNSPEC,
OVS_ACTION_ATTR_OUTPUT, /* u32 port number. */
OVS_ACTION_ATTR_USERSPACE, /* Nested OVS_USERSPACE_ATTR_*. */
OVS_ACTION_ATTR_SET, /* One nested OVS_KEY_ATTR_*. */
OVS_ACTION_ATTR_PUSH_VLAN, /* struct ovs_action_push_vlan. */
OVS_ACTION_ATTR_POP_VLAN, /* No argument. */
OVS_ACTION_ATTR_SAMPLE, /* Nested OVS_SAMPLE_ATTR_*. */
__OVS_ACTION_ATTR_MAX
};
#define OVS_ACTION_ATTR_MAX (__OVS_ACTION_ATTR_MAX - 1)
#endif /* _LINUX_OPENVSWITCH_H */
......@@ -128,6 +128,8 @@ extern int genl_register_mc_group(struct genl_family *family,
struct genl_multicast_group *grp);
extern void genl_unregister_mc_group(struct genl_family *family,
struct genl_multicast_group *grp);
extern void genl_notify(struct sk_buff *skb, struct net *net, u32 pid,
u32 group, struct nlmsghdr *nlh, gfp_t flags);
/**
* genlmsg_put - Add generic netlink header to netlink message
......
......@@ -558,7 +558,7 @@ extern void ipv6_push_frag_opts(struct sk_buff *skb,
u8 *proto);
extern int ipv6_skip_exthdr(const struct sk_buff *, int start,
u8 *nexthdrp);
u8 *nexthdrp, __be16 *frag_offp);
extern int ipv6_ext_hdr(u8 nexthdr);
......
......@@ -110,39 +110,6 @@ static struct sk_buff *vlan_reorder_header(struct sk_buff *skb)
return skb;
}
static void vlan_set_encap_proto(struct sk_buff *skb, struct vlan_hdr *vhdr)
{
__be16 proto;
unsigned char *rawp;
/*
* Was a VLAN packet, grab the encapsulated protocol, which the layer
* three protocols care about.
*/
proto = vhdr->h_vlan_encapsulated_proto;
if (ntohs(proto) >= 1536) {
skb->protocol = proto;
return;
}
rawp = skb->data;
if (*(unsigned short *) rawp == 0xFFFF)
/*
* This is a magic hack to spot IPX packets. Older Novell
* breaks the protocol design and runs IPX over 802.3 without
* an 802.2 LLC layer. We look for FFFF which isn't a used
* 802.2 SSAP/DSAP. This won't work for fault tolerant netware
* but does for the rest.
*/
skb->protocol = htons(ETH_P_802_3);
else
/*
* Real 802.2 LLC
*/
skb->protocol = htons(ETH_P_802_2);
}
struct sk_buff *vlan_untag(struct sk_buff *skb)
{
struct vlan_hdr *vhdr;
......
......@@ -215,6 +215,7 @@ source "net/sched/Kconfig"
source "net/dcb/Kconfig"
source "net/dns_resolver/Kconfig"
source "net/batman-adv/Kconfig"
source "net/openvswitch/Kconfig"
config RPS
boolean
......
......@@ -69,3 +69,4 @@ obj-$(CONFIG_DNS_RESOLVER) += dns_resolver/
obj-$(CONFIG_CEPH_LIB) += ceph/
obj-$(CONFIG_BATMAN_ADV) += batman-adv/
obj-$(CONFIG_NFC) += nfc/
obj-$(CONFIG_OPENVSWITCH) += openvswitch/
......@@ -1458,6 +1458,7 @@ static int br_multicast_ipv6_rcv(struct net_bridge *br,
const struct ipv6hdr *ip6h;
u8 icmp6_type;
u8 nexthdr;
__be16 frag_off;
unsigned len;
int offset;
int err;
......@@ -1483,7 +1484,7 @@ static int br_multicast_ipv6_rcv(struct net_bridge *br,
return -EINVAL;
nexthdr = ip6h->nexthdr;
offset = ipv6_skip_exthdr(skb, sizeof(*ip6h), &nexthdr);
offset = ipv6_skip_exthdr(skb, sizeof(*ip6h), &nexthdr, &frag_off);
if (offset < 0 || nexthdr != IPPROTO_ICMPV6)
return 0;
......
......@@ -55,9 +55,10 @@ ebt_ip6_mt(const struct sk_buff *skb, struct xt_action_param *par)
return false;
if (info->bitmask & EBT_IP6_PROTO) {
uint8_t nexthdr = ih6->nexthdr;
__be16 frag_off;
int offset_ph;
offset_ph = ipv6_skip_exthdr(skb, sizeof(_ip6h), &nexthdr);
offset_ph = ipv6_skip_exthdr(skb, sizeof(_ip6h), &nexthdr, &frag_off);
if (offset_ph == -1)
return false;
if (FWINV(info->protocol != nexthdr, EBT_IP6_PROTO))
......
......@@ -113,6 +113,7 @@ ebt_log_packet(u_int8_t pf, unsigned int hooknum,
const struct ipv6hdr *ih;
struct ipv6hdr _iph;
uint8_t nexthdr;
__be16 frag_off;
int offset_ph;
ih = skb_header_pointer(skb, 0, sizeof(_iph), &_iph);
......@@ -123,7 +124,7 @@ ebt_log_packet(u_int8_t pf, unsigned int hooknum,
printk(" IPv6 SRC=%pI6 IPv6 DST=%pI6, IPv6 priority=0x%01X, Next Header=%d",
&ih->saddr, &ih->daddr, ih->priority, ih->nexthdr);
nexthdr = ih->nexthdr;
offset_ph = ipv6_skip_exthdr(skb, sizeof(_iph), &nexthdr);
offset_ph = ipv6_skip_exthdr(skb, sizeof(_iph), &nexthdr, &frag_off);
if (offset_ph == -1)
goto out;
print_ports(skb, nexthdr, offset_ph);
......
......@@ -57,6 +57,9 @@ int ipv6_ext_hdr(u8 nexthdr)
* it returns NULL.
* - First fragment header is skipped, not-first ones
* are considered as unparsable.
* - Reports the offset field of the final fragment header so it is
* possible to tell whether this is a first fragment, later fragment,
* or not fragmented.
* - ESP is unparsable for now and considered like
* normal payload protocol.
* - Note also special handling of AUTH header. Thanks to IPsec wizards.
......@@ -64,10 +67,13 @@ int ipv6_ext_hdr(u8 nexthdr)
* --ANK (980726)
*/
int ipv6_skip_exthdr(const struct sk_buff *skb, int start, u8 *nexthdrp)
int ipv6_skip_exthdr(const struct sk_buff *skb, int start, u8 *nexthdrp,
__be16 *frag_offp)
{
u8 nexthdr = *nexthdrp;
*frag_offp = 0;
while (ipv6_ext_hdr(nexthdr)) {
struct ipv6_opt_hdr _hdr, *hp;
int hdrlen;
......@@ -87,7 +93,8 @@ int ipv6_skip_exthdr(const struct sk_buff *skb, int start, u8 *nexthdrp)
if (fp == NULL)
return -1;
if (ntohs(*fp) & ~0x7)
*frag_offp = *fp;
if (ntohs(*frag_offp) & ~0x7)
break;
hdrlen = 8;
} else if (nexthdr == NEXTHDR_AUTH)
......
......@@ -135,11 +135,12 @@ static int is_ineligible(struct sk_buff *skb)
int ptr = (u8 *)(ipv6_hdr(skb) + 1) - skb->data;
int len = skb->len - ptr;
__u8 nexthdr = ipv6_hdr(skb)->nexthdr;
__be16 frag_off;
if (len < 0)
return 1;
ptr = ipv6_skip_exthdr(skb, ptr, &nexthdr);
ptr = ipv6_skip_exthdr(skb, ptr, &nexthdr, &frag_off);
if (ptr < 0)
return 0;
if (nexthdr == IPPROTO_ICMPV6) {
......@@ -596,6 +597,7 @@ static void icmpv6_notify(struct sk_buff *skb, u8 type, u8 code, __be32 info)
int inner_offset;
int hash;
u8 nexthdr;
__be16 frag_off;
if (!pskb_may_pull(skb, sizeof(struct ipv6hdr)))
return;
......@@ -603,7 +605,8 @@ static void icmpv6_notify(struct sk_buff *skb, u8 type, u8 code, __be32 info)
nexthdr = ((struct ipv6hdr *)skb->data)->nexthdr;
if (ipv6_ext_hdr(nexthdr)) {
/* now skip over extension headers */
inner_offset = ipv6_skip_exthdr(skb, sizeof(struct ipv6hdr), &nexthdr);
inner_offset = ipv6_skip_exthdr(skb, sizeof(struct ipv6hdr),
&nexthdr, &frag_off);
if (inner_offset<0)
return;
} else {
......
......@@ -280,6 +280,7 @@ int ip6_mc_input(struct sk_buff *skb)
u8 *ptr = skb_network_header(skb) + opt->ra;
struct icmp6hdr *icmp6;
u8 nexthdr = hdr->nexthdr;
__be16 frag_off;
int offset;
/* Check if the value of Router Alert
......@@ -293,7 +294,7 @@ int ip6_mc_input(struct sk_buff *skb)
goto out;
}
offset = ipv6_skip_exthdr(skb, sizeof(*hdr),
&nexthdr);
&nexthdr, &frag_off);
if (offset < 0)
goto out;
......
......@@ -329,10 +329,11 @@ static int ip6_forward_proxy_check(struct sk_buff *skb)
{
struct ipv6hdr *hdr = ipv6_hdr(skb);
u8 nexthdr = hdr->nexthdr;
__be16 frag_off;
int offset;
if (ipv6_ext_hdr(nexthdr)) {
offset = ipv6_skip_exthdr(skb, sizeof(*hdr), &nexthdr);
offset = ipv6_skip_exthdr(skb, sizeof(*hdr), &nexthdr, &frag_off);
if (offset < 0)
return 0;
} else
......
......@@ -49,6 +49,7 @@ static void send_reset(struct net *net, struct sk_buff *oldskb)
const __u8 tclass = DEFAULT_TOS_VALUE;
struct dst_entry *dst = NULL;
u8 proto;
__be16 frag_off;
struct flowi6 fl6;
if ((!(ipv6_addr_type(&oip6h->saddr) & IPV6_ADDR_UNICAST)) ||
......@@ -58,7 +59,7 @@ static void send_reset(struct net *net, struct sk_buff *oldskb)
}
proto = oip6h->nexthdr;
tcphoff = ipv6_skip_exthdr(oldskb, ((u8*)(oip6h+1) - oldskb->data), &proto);
tcphoff = ipv6_skip_exthdr(oldskb, ((u8*)(oip6h+1) - oldskb->data), &proto, &frag_off);
if ((tcphoff < 0) || (tcphoff > oldskb->len)) {
pr_debug("Cannot get TCP header.\n");
......
......@@ -116,9 +116,11 @@ ip_set_get_ip6_port(const struct sk_buff *skb, bool src,
{
int protoff;
u8 nexthdr;
__be16 frag_off;
nexthdr = ipv6_hdr(skb)->nexthdr;
protoff = ipv6_skip_exthdr(skb, sizeof(struct ipv6hdr), &nexthdr);
protoff = ipv6_skip_exthdr(skb, sizeof(struct ipv6hdr), &nexthdr,
&frag_off);
if (protoff < 0)
return false;
......
......@@ -98,6 +98,7 @@ static void audit_ip6(struct audit_buffer *ab, struct sk_buff *skb)
struct ipv6hdr _ip6h;
const struct ipv6hdr *ih;
u8 nexthdr;
__be16 frag_off;
int offset;
ih = skb_header_pointer(skb, skb_network_offset(skb), sizeof(_ip6h), &_ip6h);
......@@ -108,7 +109,7 @@ static void audit_ip6(struct audit_buffer *ab, struct sk_buff *skb)
nexthdr = ih->nexthdr;
offset = ipv6_skip_exthdr(skb, skb_network_offset(skb) + sizeof(_ip6h),
&nexthdr);
&nexthdr, &frag_off);
audit_log_format(ab, " saddr=%pI6c daddr=%pI6c proto=%hhu",
&ih->saddr, &ih->daddr, nexthdr);
......
......@@ -204,11 +204,12 @@ tcpmss_tg6(struct sk_buff *skb, const struct xt_action_param *par)
{
struct ipv6hdr *ipv6h = ipv6_hdr(skb);
u8 nexthdr;
__be16 frag_off;
int tcphoff;
int ret;
nexthdr = ipv6h->nexthdr;
tcphoff = ipv6_skip_exthdr(skb, sizeof(*ipv6h), &nexthdr);
tcphoff = ipv6_skip_exthdr(skb, sizeof(*ipv6h), &nexthdr, &frag_off);
if (tcphoff < 0)
return NF_DROP;
ret = tcpmss_mangle_packet(skb, par->targinfo,
......
......@@ -87,9 +87,10 @@ tcpoptstrip_tg6(struct sk_buff *skb, const struct xt_action_param *par)
struct ipv6hdr *ipv6h = ipv6_hdr(skb);
int tcphoff;
u_int8_t nexthdr;
__be16 frag_off;
nexthdr = ipv6h->nexthdr;
tcphoff = ipv6_skip_exthdr(skb, sizeof(*ipv6h), &nexthdr);
tcphoff = ipv6_skip_exthdr(skb, sizeof(*ipv6h), &nexthdr, &frag_off);
if (tcphoff < 0)
return NF_DROP;
......
......@@ -445,6 +445,7 @@ hashlimit_init_dst(const struct xt_hashlimit_htable *hinfo,
{
__be16 _ports[2], *ports;
u8 nexthdr;
__be16 frag_off;
int poff;
memset(dst, 0, sizeof(*dst));
......@@ -480,7 +481,7 @@ hashlimit_init_dst(const struct xt_hashlimit_htable *hinfo,
(XT_HASHLIMIT_HASH_DPT | XT_HASHLIMIT_HASH_SPT)))
return 0;
nexthdr = ipv6_hdr(skb)->nexthdr;
protoff = ipv6_skip_exthdr(skb, sizeof(struct ipv6hdr), &nexthdr);
protoff = ipv6_skip_exthdr(skb, sizeof(struct ipv6hdr), &nexthdr, &frag_off);
if ((int)protoff < 0)
return -1;
break;
......
......@@ -214,6 +214,7 @@ extract_icmp6_fields(const struct sk_buff *skb,
struct icmp6hdr *icmph, _icmph;
__be16 *ports, _ports[2];
u8 inside_nexthdr;
__be16 inside_fragoff;
int inside_hdrlen;
icmph = skb_header_pointer(skb, outside_hdrlen,
......@@ -229,7 +230,8 @@ extract_icmp6_fields(const struct sk_buff *skb,
return 1;
inside_nexthdr = inside_iph->nexthdr;
inside_hdrlen = ipv6_skip_exthdr(skb, outside_hdrlen + sizeof(_icmph) + sizeof(_inside_iph), &inside_nexthdr);
inside_hdrlen = ipv6_skip_exthdr(skb, outside_hdrlen + sizeof(_icmph) + sizeof(_inside_iph),
&inside_nexthdr, &inside_fragoff);
if (inside_hdrlen < 0)
return 1; /* hjm: Packet has no/incomplete transport layer headers. */
......
......@@ -33,6 +33,14 @@ void genl_unlock(void)
}
EXPORT_SYMBOL(genl_unlock);
#ifdef CONFIG_PROVE_LOCKING
int lockdep_genl_is_held(void)
{
return lockdep_is_held(&genl_mutex);
}
EXPORT_SYMBOL(lockdep_genl_is_held);
#endif
#define GENL_FAM_TAB_SIZE 16
#define GENL_FAM_TAB_MASK (GENL_FAM_TAB_SIZE - 1)
......@@ -946,3 +954,16 @@ int genlmsg_multicast_allns(struct sk_buff *skb, u32 pid, unsigned int group,
return genlmsg_mcast(skb, pid, group, flags);
}
EXPORT_SYMBOL(genlmsg_multicast_allns);
void genl_notify(struct sk_buff *skb, struct net *net, u32 pid, u32 group,
struct nlmsghdr *nlh, gfp_t flags)
{
struct sock *sk = net->genl_sock;
int report = 0;
if (nlh)
report = nlmsg_report(nlh);
nlmsg_notify(sk, skb, pid, group, report, flags);
}
EXPORT_SYMBOL(genl_notify);
#
# Open vSwitch
#
config OPENVSWITCH
tristate "Open vSwitch"
---help---
Open vSwitch is a multilayer Ethernet switch targeted at virtualized
environments. In addition to supporting a variety of features
expected in a traditional hardware switch, it enables fine-grained
programmatic extension and flow-based control of the network. This
control is useful in a wide variety of applications but is
particularly important in multi-server virtualization deployments,
which are often characterized by highly dynamic endpoints and the
need to maintain logical abstractions for multiple tenants.
The Open vSwitch datapath provides an in-kernel fast path for packet
forwarding. It is complemented by a userspace daemon, ovs-vswitchd,
which is able to accept configuration from a variety of sources and
translate it into packet processing rules.
See http://openvswitch.org for more information and userspace
utilities.
To compile this code as a module, choose M here: the module will be
called openvswitch.
If unsure, say N.
#
# Makefile for Open vSwitch.
#
obj-$(CONFIG_OPENVSWITCH) += openvswitch.o
openvswitch-y := \
actions.o \
datapath.o \
dp_notify.o \
flow.o \
vport.o \
vport-internal_dev.o \
vport-netdev.o \
/*
* Copyright (c) 2007-2011 Nicira Networks.
*
* This program is free software; you can redistribute it and/or
* modify it under the terms of version 2 of the GNU General Public
* License as published by the Free Software Foundation.
*
* This program is distributed in the hope that it will be useful, but
* WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
* General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with this program; if not, write to the Free Software
* Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA
* 02110-1301, USA
*/
#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
#include <linux/skbuff.h>
#include <linux/in.h>
#include <linux/ip.h>
#include <linux/openvswitch.h>
#include <linux/tcp.h>
#include <linux/udp.h>
#include <linux/in6.h>
#include <linux/if_arp.h>
#include <linux/if_vlan.h>
#include <net/ip.h>
#include <net/checksum.h>
#include <net/dsfield.h>
#include "datapath.h"
#include "vport.h"
static int do_execute_actions(struct datapath *dp, struct sk_buff *skb,
const struct nlattr *attr, int len, bool keep_skb);
static int make_writable(struct sk_buff *skb, int write_len)
{
if (!skb_cloned(skb) || skb_clone_writable(skb, write_len))
return 0;
return pskb_expand_head(skb, 0, 0, GFP_ATOMIC);
}
/* remove VLAN header from packet and update csum accrodingly. */
static int __pop_vlan_tci(struct sk_buff *skb, __be16 *current_tci)
{
struct vlan_hdr *vhdr;
int err;
err = make_writable(skb, VLAN_ETH_HLEN);
if (unlikely(err))
return err;
if (skb->ip_summed == CHECKSUM_COMPLETE)
skb->csum = csum_sub(skb->csum, csum_partial(skb->data
+ ETH_HLEN, VLAN_HLEN, 0));
vhdr = (struct vlan_hdr *)(skb->data + ETH_HLEN);
*current_tci = vhdr->h_vlan_TCI;
memmove(skb->data + VLAN_HLEN, skb->data, 2 * ETH_ALEN);
__skb_pull(skb, VLAN_HLEN);
vlan_set_encap_proto(skb, vhdr);
skb->mac_header += VLAN_HLEN;
skb_reset_mac_len(skb);
return 0;
}
static int pop_vlan(struct sk_buff *skb)
{
__be16 tci;
int err;
if (likely(vlan_tx_tag_present(skb))) {
skb->vlan_tci = 0;
} else {
if (unlikely(skb->protocol != htons(ETH_P_8021Q) ||
skb->len < VLAN_ETH_HLEN))
return 0;
err = __pop_vlan_tci(skb, &tci);
if (err)
return err;
}
/* move next vlan tag to hw accel tag */
if (likely(skb->protocol != htons(ETH_P_8021Q) ||
skb->len < VLAN_ETH_HLEN))
return 0;
err = __pop_vlan_tci(skb, &tci);
if (unlikely(err))
return err;
__vlan_hwaccel_put_tag(skb, ntohs(tci));
return 0;
}
static int push_vlan(struct sk_buff *skb, const struct ovs_action_push_vlan *vlan)
{
if (unlikely(vlan_tx_tag_present(skb))) {
u16 current_tag;
/* push down current VLAN tag */
current_tag = vlan_tx_tag_get(skb);
if (!__vlan_put_tag(skb, current_tag))
return -ENOMEM;
if (skb->ip_summed == CHECKSUM_COMPLETE)
skb->csum = csum_add(skb->csum, csum_partial(skb->data
+ ETH_HLEN, VLAN_HLEN, 0));
}
__vlan_hwaccel_put_tag(skb, ntohs(vlan->vlan_tci) & ~VLAN_TAG_PRESENT);
return 0;
}
static int set_eth_addr(struct sk_buff *skb,
const struct ovs_key_ethernet *eth_key)
{
int err;
err = make_writable(skb, ETH_HLEN);
if (unlikely(err))
return err;
memcpy(eth_hdr(skb)->h_source, eth_key->eth_src, ETH_ALEN);
memcpy(eth_hdr(skb)->h_dest, eth_key->eth_dst, ETH_ALEN);
return 0;
}
static void set_ip_addr(struct sk_buff *skb, struct iphdr *nh,
__be32 *addr, __be32 new_addr)
{
int transport_len = skb->len - skb_transport_offset(skb);
if (nh->protocol == IPPROTO_TCP) {
if (likely(transport_len >= sizeof(struct tcphdr)))
inet_proto_csum_replace4(&tcp_hdr(skb)->check, skb,
*addr, new_addr, 1);
} else if (nh->protocol == IPPROTO_UDP) {
if (likely(transport_len >= sizeof(struct udphdr)))
inet_proto_csum_replace4(&udp_hdr(skb)->check, skb,
*addr, new_addr, 1);
}
csum_replace4(&nh->check, *addr, new_addr);
skb->rxhash = 0;
*addr = new_addr;
}
static void set_ip_ttl(struct sk_buff *skb, struct iphdr *nh, u8 new_ttl)
{
csum_replace2(&nh->check, htons(nh->ttl << 8), htons(new_ttl << 8));
nh->ttl = new_ttl;
}
static int set_ipv4(struct sk_buff *skb, const struct ovs_key_ipv4 *ipv4_key)
{
struct iphdr *nh;
int err;
err = make_writable(skb, skb_network_offset(skb) +
sizeof(struct iphdr));
if (unlikely(err))
return err;
nh = ip_hdr(skb);
if (ipv4_key->ipv4_src != nh->saddr)
set_ip_addr(skb, nh, &nh->saddr, ipv4_key->ipv4_src);
if (ipv4_key->ipv4_dst != nh->daddr)
set_ip_addr(skb, nh, &nh->daddr, ipv4_key->ipv4_dst);
if (ipv4_key->ipv4_tos != nh->tos)
ipv4_change_dsfield(nh, 0, ipv4_key->ipv4_tos);
if (ipv4_key->ipv4_ttl != nh->ttl)
set_ip_ttl(skb, nh, ipv4_key->ipv4_ttl);
return 0;
}
/* Must follow make_writable() since that can move the skb data. */
static void set_tp_port(struct sk_buff *skb, __be16 *port,
__be16 new_port, __sum16 *check)
{
inet_proto_csum_replace2(check, skb, *port, new_port, 0);
*port = new_port;
skb->rxhash = 0;
}
static int set_udp_port(struct sk_buff *skb,
const struct ovs_key_udp *udp_port_key)
{
struct udphdr *uh;
int err;
err = make_writable(skb, skb_transport_offset(skb) +
sizeof(struct udphdr));
if (unlikely(err))
return err;
uh = udp_hdr(skb);
if (udp_port_key->udp_src != uh->source)
set_tp_port(skb, &uh->source, udp_port_key->udp_src, &uh->check);
if (udp_port_key->udp_dst != uh->dest)
set_tp_port(skb, &uh->dest, udp_port_key->udp_dst, &uh->check);
return 0;
}
static int set_tcp_port(struct sk_buff *skb,
const struct ovs_key_tcp *tcp_port_key)
{
struct tcphdr *th;
int err;
err = make_writable(skb, skb_transport_offset(skb) +
sizeof(struct tcphdr));
if (unlikely(err))
return err;
th = tcp_hdr(skb);
if (tcp_port_key->tcp_src != th->source)
set_tp_port(skb, &th->source, tcp_port_key->tcp_src, &th->check);
if (tcp_port_key->tcp_dst != th->dest)
set_tp_port(skb, &th->dest, tcp_port_key->tcp_dst, &th->check);
return 0;
}
static int do_output(struct datapath *dp, struct sk_buff *skb, int out_port)
{
struct vport *vport;
if (unlikely(!skb))
return -ENOMEM;
vport = rcu_dereference(dp->ports[out_port]);
if (unlikely(!vport)) {
kfree_skb(skb);
return -ENODEV;
}
ovs_vport_send(vport, skb);
return 0;
}
static int output_userspace(struct datapath *dp, struct sk_buff *skb,
const struct nlattr *attr)
{
struct dp_upcall_info upcall;
const struct nlattr *a;
int rem;
upcall.cmd = OVS_PACKET_CMD_ACTION;
upcall.key = &OVS_CB(skb)->flow->key;
upcall.userdata = NULL;
upcall.pid = 0;
for (a = nla_data(attr), rem = nla_len(attr); rem > 0;
a = nla_next(a, &rem)) {
switch (nla_type(a)) {
case OVS_USERSPACE_ATTR_USERDATA:
upcall.userdata = a;
break;
case OVS_USERSPACE_ATTR_PID:
upcall.pid = nla_get_u32(a);
break;
}
}
return ovs_dp_upcall(dp, skb, &upcall);
}
static int sample(struct datapath *dp, struct sk_buff *skb,
const struct nlattr *attr)
{
const struct nlattr *acts_list = NULL;
const struct nlattr *a;
int rem;
for (a = nla_data(attr), rem = nla_len(attr); rem > 0;
a = nla_next(a, &rem)) {
switch (nla_type(a)) {
case OVS_SAMPLE_ATTR_PROBABILITY:
if (net_random() >= nla_get_u32(a))
return 0;
break;
case OVS_SAMPLE_ATTR_ACTIONS:
acts_list = a;
break;
}
}
return do_execute_actions(dp, skb, nla_data(acts_list),
nla_len(acts_list), true);
}
static int execute_set_action(struct sk_buff *skb,
const struct nlattr *nested_attr)
{
int err = 0;
switch (nla_type(nested_attr)) {
case OVS_KEY_ATTR_PRIORITY:
skb->priority = nla_get_u32(nested_attr);
break;
case OVS_KEY_ATTR_ETHERNET:
err = set_eth_addr(skb, nla_data(nested_attr));
break;
case OVS_KEY_ATTR_IPV4:
err = set_ipv4(skb, nla_data(nested_attr));
break;
case OVS_KEY_ATTR_TCP:
err = set_tcp_port(skb, nla_data(nested_attr));
break;
case OVS_KEY_ATTR_UDP:
err = set_udp_port(skb, nla_data(nested_attr));
break;
}
return err;
}
/* Execute a list of actions against 'skb'. */
static int do_execute_actions(struct datapath *dp, struct sk_buff *skb,
const struct nlattr *attr, int len, bool keep_skb)
{
/* Every output action needs a separate clone of 'skb', but the common
* case is just a single output action, so that doing a clone and
* then freeing the original skbuff is wasteful. So the following code
* is slightly obscure just to avoid that. */
int prev_port = -1;
const struct nlattr *a;
int rem;
for (a = attr, rem = len; rem > 0;
a = nla_next(a, &rem)) {
int err = 0;
if (prev_port != -1) {
do_output(dp, skb_clone(skb, GFP_ATOMIC), prev_port);
prev_port = -1;
}
switch (nla_type(a)) {
case OVS_ACTION_ATTR_OUTPUT:
prev_port = nla_get_u32(a);
break;
case OVS_ACTION_ATTR_USERSPACE:
output_userspace(dp, skb, a);
break;
case OVS_ACTION_ATTR_PUSH_VLAN:
err = push_vlan(skb, nla_data(a));
if (unlikely(err)) /* skb already freed. */
return err;
break;
case OVS_ACTION_ATTR_POP_VLAN:
err = pop_vlan(skb);
break;
case OVS_ACTION_ATTR_SET:
err = execute_set_action(skb, nla_data(a));
break;
case OVS_ACTION_ATTR_SAMPLE:
err = sample(dp, skb, a);
break;
}
if (unlikely(err)) {
kfree_skb(skb);
return err;
}
}
if (prev_port != -1) {
if (keep_skb)
skb = skb_clone(skb, GFP_ATOMIC);
do_output(dp, skb, prev_port);
} else if (!keep_skb)
consume_skb(skb);
return 0;
}
/* Execute a list of actions against 'skb'. */
int ovs_execute_actions(struct datapath *dp, struct sk_buff *skb)
{
struct sw_flow_actions *acts = rcu_dereference(OVS_CB(skb)->flow->sf_acts);
return do_execute_actions(dp, skb, acts->actions,
acts->actions_len, false);
}
此差异已折叠。
/*
* Copyright (c) 2007-2011 Nicira Networks.
*
* This program is free software; you can redistribute it and/or
* modify it under the terms of version 2 of the GNU General Public
* License as published by the Free Software Foundation.
*
* This program is distributed in the hope that it will be useful, but
* WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
* General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with this program; if not, write to the Free Software
* Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA
* 02110-1301, USA
*/
#ifndef DATAPATH_H
#define DATAPATH_H 1
#include <asm/page.h>
#include <linux/kernel.h>
#include <linux/mutex.h>
#include <linux/netdevice.h>
#include <linux/skbuff.h>
#include <linux/u64_stats_sync.h>
#include <linux/version.h>
#include "flow.h"
struct vport;
#define DP_MAX_PORTS 1024
#define SAMPLE_ACTION_DEPTH 3
/**
* struct dp_stats_percpu - per-cpu packet processing statistics for a given
* datapath.
* @n_hit: Number of received packets for which a matching flow was found in
* the flow table.
* @n_miss: Number of received packets that had no matching flow in the flow
* table. The sum of @n_hit and @n_miss is the number of packets that have
* been received by the datapath.
* @n_lost: Number of received packets that had no matching flow in the flow
* table that could not be sent to userspace (normally due to an overflow in
* one of the datapath's queues).
*/
struct dp_stats_percpu {
u64 n_hit;
u64 n_missed;
u64 n_lost;
struct u64_stats_sync sync;
};
/**
* struct datapath - datapath for flow-based packet switching
* @rcu: RCU callback head for deferred destruction.
* @list_node: Element in global 'dps' list.
* @n_flows: Number of flows currently in flow table.
* @table: Current flow table. Protected by genl_lock and RCU.
* @ports: Map from port number to &struct vport. %OVSP_LOCAL port
* always exists, other ports may be %NULL. Protected by RTNL and RCU.
* @port_list: List of all ports in @ports in arbitrary order. RTNL required
* to iterate or modify.
* @stats_percpu: Per-CPU datapath statistics.
*
* Context: See the comment on locking at the top of datapath.c for additional
* locking information.
*/
struct datapath {
struct rcu_head rcu;
struct list_head list_node;
/* Flow table. */
struct flow_table __rcu *table;
/* Switch ports. */
struct vport __rcu *ports[DP_MAX_PORTS];
struct list_head port_list;
/* Stats. */
struct dp_stats_percpu __percpu *stats_percpu;
};
/**
* struct ovs_skb_cb - OVS data in skb CB
* @flow: The flow associated with this packet. May be %NULL if no flow.
*/
struct ovs_skb_cb {
struct sw_flow *flow;
};
#define OVS_CB(skb) ((struct ovs_skb_cb *)(skb)->cb)
/**
* struct dp_upcall - metadata to include with a packet to send to userspace
* @cmd: One of %OVS_PACKET_CMD_*.
* @key: Becomes %OVS_PACKET_ATTR_KEY. Must be nonnull.
* @userdata: If nonnull, its u64 value is extracted and passed to userspace as
* %OVS_PACKET_ATTR_USERDATA.
* @pid: Netlink PID to which packet should be sent. If @pid is 0 then no
* packet is sent and the packet is accounted in the datapath's @n_lost
* counter.
*/
struct dp_upcall_info {
u8 cmd;
const struct sw_flow_key *key;
const struct nlattr *userdata;
u32 pid;
};
extern struct notifier_block ovs_dp_device_notifier;
extern struct genl_multicast_group ovs_dp_vport_multicast_group;
void ovs_dp_process_received_packet(struct vport *, struct sk_buff *);
void ovs_dp_detach_port(struct vport *);
int ovs_dp_upcall(struct datapath *, struct sk_buff *,
const struct dp_upcall_info *);
const char *ovs_dp_name(const struct datapath *dp);
struct sk_buff *ovs_vport_cmd_build_info(struct vport *, u32 pid, u32 seq,
u8 cmd);
int ovs_execute_actions(struct datapath *dp, struct sk_buff *skb);
#endif /* datapath.h */
/*
* Copyright (c) 2007-2011 Nicira Networks.
*
* This program is free software; you can redistribute it and/or
* modify it under the terms of version 2 of the GNU General Public
* License as published by the Free Software Foundation.
*
* This program is distributed in the hope that it will be useful, but
* WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
* General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with this program; if not, write to the Free Software
* Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA
* 02110-1301, USA
*/
#include <linux/netdevice.h>
#include <net/genetlink.h>
#include "datapath.h"
#include "vport-internal_dev.h"
#include "vport-netdev.h"
static int dp_device_event(struct notifier_block *unused, unsigned long event,
void *ptr)
{
struct net_device *dev = ptr;
struct vport *vport;
if (ovs_is_internal_dev(dev))
vport = ovs_internal_dev_get_vport(dev);
else
vport = ovs_netdev_get_vport(dev);
if (!vport)
return NOTIFY_DONE;
switch (event) {
case NETDEV_UNREGISTER:
if (!ovs_is_internal_dev(dev)) {
struct sk_buff *notify;
notify = ovs_vport_cmd_build_info(vport, 0, 0,
OVS_VPORT_CMD_DEL);
ovs_dp_detach_port(vport);
if (IS_ERR(notify)) {
netlink_set_err(init_net.genl_sock, 0,
ovs_dp_vport_multicast_group.id,
PTR_ERR(notify));
break;
}
genlmsg_multicast(notify, 0, ovs_dp_vport_multicast_group.id,
GFP_KERNEL);
}
break;
}
return NOTIFY_DONE;
}
struct notifier_block ovs_dp_device_notifier = {
.notifier_call = dp_device_event
};
此差异已折叠。
此差异已折叠。
此差异已折叠。
/*
* Copyright (c) 2007-2011 Nicira Networks.
*
* This program is free software; you can redistribute it and/or
* modify it under the terms of version 2 of the GNU General Public
* License as published by the Free Software Foundation.
*
* This program is distributed in the hope that it will be useful, but
* WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
* General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with this program; if not, write to the Free Software
* Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA
* 02110-1301, USA
*/
#ifndef VPORT_INTERNAL_DEV_H
#define VPORT_INTERNAL_DEV_H 1
#include "datapath.h"
#include "vport.h"
int ovs_is_internal_dev(const struct net_device *);
struct vport *ovs_internal_dev_get_vport(struct net_device *);
#endif /* vport-internal_dev.h */
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
Markdown is supported
0% .
You are about to add 0 people to the discussion. Proceed with caution.
先完成此消息的编辑!
想要评论请 注册