提交 · 0ca743a5599199152a31a7146b83213c786c2eb2 · openanolis / cloud-kernel

15 10月, 2013 1 次提交

netfilter: nf_tables: add compatibility layer for x_tables · 0ca743a5

由 Pablo Neira Ayuso 提交于 10月 14, 2013

This patch adds the x_tables compatibility layer. This allows you
to use existing x_tables matches and targets from nf_tables.

This compatibility later allows us to use existing matches/targets
for features that are still missing in nf_tables. We can progressively
replace them with native nf_tables extensions. It also provides the
userspace compatibility software that allows you to express the
rule-set using the iptables syntax but using the nf_tables kernel
components.

In order to get this compatibility layer working, I've done the
following things:

* add NFNL_SUBSYS_NFT_COMPAT: this new nfnetlink subsystem is used
to query the x_tables match/target revision, so we don't need to
use the native x_table getsockopt interface.

* emulate xt structures: this required extending the struct nft_pktinfo
to include the fragment offset, which is already obtained from
ip[6]_tables and that is used by some matches/targets.

* add support for default policy to base chains, required to emulate
  x_tables.

* add NFTA_CHAIN_USE attribute to obtain the number of references to
  chains, required by x_tables emulation.

* add chain packet/byte counters using per-cpu.

* support 32-64 bits compat.

For historical reasons, this patch includes the following patches
that were posted in the netfilter-devel mailing list.

From Pablo Neira Ayuso:
* nf_tables: add default policy to base chains
* netfilter: nf_tables: add NFTA_CHAIN_USE attribute
* nf_tables: nft_compat: private data of target and matches in contiguous area
* nf_tables: validate hooks for compat match/target
* nf_tables: nft_compat: release cached matches/targets
* nf_tables: x_tables support as a compile time option
* nf_tables: fix alias for xtables over nftables module
* nf_tables: add packet and byte counters per chain
* nf_tables: fix per-chain counter stats if no counters are passed
* nf_tables: don't bump chain stats
* nf_tables: add protocol and flags for xtables over nf_tables
* nf_tables: add ip[6]t_entry emulation
* nf_tables: move specific layer 3 compat code to nf_tables_ipv[4|6]
* nf_tables: support 32bits-64bits x_tables compat
* nf_tables: fix compilation if CONFIG_COMPAT is disabled

From Patrick McHardy:
* nf_tables: move policy to struct nft_base_chain
* nf_tables: send notifications for base chain policy changes

From Alexander Primak:
* nf_tables: remove the duplicate NF_INET_LOCAL_OUT

From Nicolas Dichtel:
* nf_tables: fix compilation when nf-netlink is a module
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>

0ca743a5

14 10月, 2013 8 次提交

netfilter: nf_tables: convert built-in tables/chains to chain types · 9370761c

由 Pablo Neira Ayuso 提交于 10月 10, 2013

This patch converts built-in tables/chains to chain types that
allows you to deploy customized table and chain configurations from
userspace.

After this patch, you have to specify the chain type when
creating a new chain:

 add chain ip filter output { type filter hook input priority 0; }
                              ^^^^ ------

The existing chain types after this patch are: filter, route and
nat. Note that tables are just containers of chains with no specific
semantics, which is a significant change with regards to iptables.
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>

9370761c

netfilter: nft_payload: add optimized payload implementation for small loads · c29b72e0

由 Patrick McHardy 提交于 10月 10, 2013

Add an optimized payload expression implementation for small (up to 4 bytes)
aligned data loads from the linear packet area.

This patch also includes original Patrick McHardy's entitled (nf_tables:
inline nft_payload_fast_eval() into main evaluation loop).
Signed-off-by: NPatrick McHardy <kaber@trash.net>
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>

c29b72e0

netfilter: nf_tables: add optimized data comparison for small values · cb7dbfd0

由 Patrick McHardy 提交于 10月 10, 2013

Add an optimized version of nft_data_cmp() that only handles values of to
4 bytes length.

This patch includes original Patrick McHardy's patch entitled (nf_tables:
inline nft_cmp_fast_eval() into main evaluation loop).
Signed-off-by: NPatrick McHardy <kaber@trash.net>
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>

cb7dbfd0

netfilter: nf_tables: expression ops overloading · ef1f7df9

由 Patrick McHardy 提交于 10月 10, 2013

Split the expression ops into two parts and support overloading of
the runtime expression ops based on the requested function through
a ->select_ops() callback.

This can be used to provide optimized implementations, for instance
for loading small aligned amounts of data from the packet or inlining
frequently used operations into the main evaluation loop.
Signed-off-by: NPatrick McHardy <kaber@trash.net>
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>

ef1f7df9

netfilter: nf_tables: add netlink set API · 20a69341

由 Patrick McHardy 提交于 10月 11, 2013

This patch adds the new netlink API for maintaining nf_tables sets
independently of the ruleset. The API supports the following operations:

- creation of sets
- deletion of sets
- querying of specific sets
- dumping of all sets

- addition of set elements
- removal of set elements
- dumping of all set elements

Sets are identified by name, each table defines an individual namespace.
The name of a set may be allocated automatically, this is mostly useful
in combination with the NFT_SET_ANONYMOUS flag, which destroys a set
automatically once the last reference has been released.

Sets can be marked constant, meaning they're not allowed to change while
linked to a rule. This allows to perform lockless operation for set
types that would otherwise require locking.

Additionally, if the implementation supports it, sets can (as before) be
used as maps, associating a data value with each key (or range), by
specifying the NFT_SET_MAP flag and can be used for interval queries by
specifying the NFT_SET_INTERVAL flag.

Set elements are added and removed incrementally. All element operations
support batching, reducing netlink message and set lookup overhead.

The old "set" and "hash" expressions are replaced by a generic "lookup"
expression, which binds to the specified set. Userspace is not aware
of the actual set implementation used by the kernel anymore, all
configuration options are generic.

Currently the implementation selection logic is largely missing and the
kernel will simply use the first registered implementation supporting the
requested operation. Eventually, the plan is to have userspace supply a
description of the data characteristics and select the implementation
based on expected performance and memory use.

This patch includes the new 'lookup' expression to look up for element
matching in the set.

This patch includes kernel-doc descriptions for this set API and it
also includes the following fixes.

From Patrick McHardy:
* netfilter: nf_tables: fix set element data type in dumps
* netfilter: nf_tables: fix indentation of struct nft_set_elem comments
* netfilter: nf_tables: fix oops in nft_validate_data_load()
* netfilter: nf_tables: fix oops while listing sets of built-in tables
* netfilter: nf_tables: destroy anonymous sets immediately if binding fails
* netfilter: nf_tables: propagate context to set iter callback
* netfilter: nf_tables: add loop detection

From Pablo Neira Ayuso:
* netfilter: nf_tables: allow to dump all existing sets
* netfilter: nf_tables: fix wrong type for flags variable in newelem
Signed-off-by: NPatrick McHardy <kaber@trash.net>
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>

20a69341

netfilter: add nftables · 96518518

由 Patrick McHardy 提交于 10月 14, 2013

This patch adds nftables which is the intended successor of iptables.
This packet filtering framework reuses the existing netfilter hooks,
the connection tracking system, the NAT subsystem, the transparent
proxying engine, the logging infrastructure and the userspace packet
queueing facilities.

In a nutshell, nftables provides a pseudo-state machine with 4 general
purpose registers of 128 bits and 1 specific purpose register to store
verdicts. This pseudo-machine comes with an extensible instruction set,
a.k.a. "expressions" in the nftables jargon. The expressions included
in this patch provide the basic functionality, they are:

* bitwise: to perform bitwise operations.
* byteorder: to change from host/network endianess.
* cmp: to compare data with the content of the registers.
* counter: to enable counters on rules.
* ct: to store conntrack keys into register.
* exthdr: to match IPv6 extension headers.
* immediate: to load data into registers.
* limit: to limit matching based on packet rate.
* log: to log packets.
* meta: to match metainformation that usually comes with the skbuff.
* nat: to perform Network Address Translation.
* payload: to fetch data from the packet payload and store it into
  registers.
* reject (IPv4 only): to explicitly close connection, eg. TCP RST.

Using this instruction-set, the userspace utility 'nft' can transform
the rules expressed in human-readable text representation (using a
new syntax, inspired by tcpdump) to nftables bytecode.

nftables also inherits the table, chain and rule objects from
iptables, but in a more configurable way, and it also includes the
original datatype-agnostic set infrastructure with mapping support.
This set infrastructure is enhanced in the follow up patch (netfilter:
nf_tables: add netlink set API).

This patch includes the following components:

* the netlink API: net/netfilter/nf_tables_api.c and
  include/uapi/netfilter/nf_tables.h
* the packet filter core: net/netfilter/nf_tables_core.c
* the expressions (described above): net/netfilter/nft_*.c
* the filter tables: arp, IPv4, IPv6 and bridge:
  net/ipv4/netfilter/nf_tables_ipv4.c
  net/ipv6/netfilter/nf_tables_ipv6.c
  net/ipv4/netfilter/nf_tables_arp.c
  net/bridge/netfilter/nf_tables_bridge.c
* the NAT table (IPv4 only):
  net/ipv4/netfilter/nf_table_nat_ipv4.c
* the route table (similar to mangle):
  net/ipv4/netfilter/nf_table_route_ipv4.c
  net/ipv6/netfilter/nf_table_route_ipv6.c
* internal definitions under:
  include/net/netfilter/nf_tables.h
  include/net/netfilter/nf_tables_core.h
* It also includes an skeleton expression:
  net/netfilter/nft_expr_template.c
  and the preliminary implementation of the meta target
  net/netfilter/nft_meta_target.c

It also includes a change in struct nf_hook_ops to add a new
pointer to store private data to the hook, that is used to store
the rule list per chain.

This patch is based on the patch from Patrick McHardy, plus merged
accumulated cleanups, fixes and small enhancements to the nftables
code that has been done since 2009, which are:

From Patrick McHardy:
* nf_tables: adjust netlink handler function signatures
* nf_tables: only retry table lookup after successful table module load
* nf_tables: fix event notification echo and avoid unnecessary messages
* nft_ct: add l3proto support
* nf_tables: pass expression context to nft_validate_data_load()
* nf_tables: remove redundant definition
* nft_ct: fix maxattr initialization
* nf_tables: fix invalid event type in nf_tables_getrule()
* nf_tables: simplify nft_data_init() usage
* nf_tables: build in more core modules
* nf_tables: fix double lookup expression unregistation
* nf_tables: move expression initialization to nf_tables_core.c
* nf_tables: build in payload module
* nf_tables: use NFPROTO constants
* nf_tables: rename pid variables to portid
* nf_tables: save 48 bits per rule
* nf_tables: introduce chain rename
* nf_tables: check for duplicate names on chain rename
* nf_tables: remove ability to specify handles for new rules
* nf_tables: return error for rule change request
* nf_tables: return error for NLM_F_REPLACE without rule handle
* nf_tables: include NLM_F_APPEND/NLM_F_REPLACE flags in rule notification
* nf_tables: fix NLM_F_MULTI usage in netlink notifications
* nf_tables: include NLM_F_APPEND in rule dumps

From Pablo Neira Ayuso:
* nf_tables: fix stack overflow in nf_tables_newrule
* nf_tables: nft_ct: fix compilation warning
* nf_tables: nft_ct: fix crash with invalid packets
* nft_log: group and qthreshold are 2^16
* nf_tables: nft_meta: fix socket uid,gid handling
* nft_counter: allow to restore counters
* nf_tables: fix module autoload
* nf_tables: allow to remove all rules placed in one chain
* nf_tables: use 64-bits rule handle instead of 16-bits
* nf_tables: fix chain after rule deletion
* nf_tables: improve deletion performance
* nf_tables: add missing code in route chain type
* nf_tables: rise maximum number of expressions from 12 to 128
* nf_tables: don't delete table if in use
* nf_tables: fix basechain release

From Tomasz Bursztyka:
* nf_tables: Add support for changing users chain's name
* nf_tables: Change chain's name to be fixed sized
* nf_tables: Add support for replacing a rule by another one
* nf_tables: Update uapi nftables netlink header documentation

From Florian Westphal:
* nft_log: group is u16, snaplen u32

From Phil Oester:
* nf_tables: operational limit match
Signed-off-by: NPatrick McHardy <kaber@trash.net>
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>

96518518

netfilter: nf_nat: move alloc_null_binding to nf_nat_core.c · f59cb045

由 Pablo Neira Ayuso 提交于 10月 14, 2013

Similar to nat_decode_session, alloc_null_binding is needed for both
ip_tables and nf_tables, so move it to nf_nat_core.c. This change
is required by nf_tables.

This is an adapted version of the original patch from Patrick McHardy.
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>

f59cb045

netfilter: pass hook ops to hookfn · 795aa6ef

由 Patrick McHardy 提交于 10月 10, 2013

Pass the hook ops to the hookfn to allow for generic hook
functions. This change is required by nf_tables.
Signed-off-by: NPatrick McHardy <kaber@trash.net>
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>

795aa6ef

12 10月, 2013 2 次提交

tcp: tcp_transmit_skb() optimizations · ccdbb6e9

由 Eric Dumazet 提交于 10月 10, 2013

1) We need to take a timestamp only for skb that should be cloned.

Other skbs are not in write queue and no rtt estimation is done on them.

2) the unlikely() hint is wrong for receivers (they send pure ACK)
Signed-off-by: NEric Dumazet <edumazet@google.com>
Cc: MF Nowlan <fitz@cs.yale.edu>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Acked-By: NYuchung Cheng <ycheng@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

ccdbb6e9

Merge tag 'batman-adv-for-davem' of git://git.open-mesh.org/linux-merge · 29b67c39

由 David S. Miller 提交于 10月 11, 2013

Included changes:
- update emails for A. Quartulli and M. Lindner in MAINTAINERS
- switch to the next on-the-wire protocol version
- introduce the T(ype) V(ersion) L(ength) V(alue) framework
- adjust the existing components to make them use the new TVLV code
- make the TT component use CRC32 instead of CRC16
- totally remove the VIS functionality (has been moved to userspace)
- reorder packet types and flags
- add static checks on packet format
- remove __packed from batadv_ogm_packet

29b67c39

11 10月, 2013 2 次提交

Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-next · 58308451

由 David S. Miller 提交于 10月 10, 2013

Jeff Kirsher says:

====================
This series contains updates to i40e only.

Alex provides the majority of the patches against i40e, where he does
cleanup of the Tx and RX queues and to align the code with the known
good Tx/Rx queue code in the ixgbe driver.

Anjali provides an i40e patch to update link events to not print to
the log until the device is administratively up.

Catherine provides a patch to update the driver version.
====================
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

58308451

inet: rename ir_loc_port to ir_num · b44084c2

由 Eric Dumazet 提交于 10月 10, 2013

In commit 634fb979 ("inet: includes a sock_common in request_sock")
I forgot that the two ports in sock_common do not have same byte order :

skc_dport is __be16 (network order), but skc_num is __u16 (host order)

So sparse complains because ir_loc_port (mapped into skc_num) is
considered as __u16 while it should be __be16

Let rename ir_loc_port to ireq->ir_num (analogy with inet->inet_num),
and perform appropriate htons/ntohs conversions.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Reported-by: NWu Fengguang <fengguang.wu@intel.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

b44084c2

10 10月, 2013 27 次提交

i40e: Bump version · d04795d6

由 Catherine Sullivan 提交于 9月 28, 2013

Update the version number of the driver.
Signed-off-by: NCatherine Sullivan <catherine.sullivan@intel.com>
Signed-off-by: NJesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: NKavindya Deegala <kavindya.s.deegala@intel.com>
Signed-off-by: NJeff Kirsher <jeffrey.t.kirsher@intel.com>

d04795d6

i40e: Add support for 64 bit netstats · 980e9b11

由 Alexander Duyck 提交于 9月 28, 2013

This change brings support for 64 bit netstats to the driver. Previously
the stats were 64 bit but highly racy due to the fact that 64 bit
transactions are not atomic on 32 bit systems. This change makes is so
that the 64 bit byte and packet stats are reliable on all architectures.
Signed-off-by: NAlexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: NJesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: NKavindya Deegala <kavindya.s.deegala@intel.com>
Signed-off-by: NJeff Kirsher <jeffrey.t.kirsher@intel.com>

980e9b11

i40e: Move rings from pointer to array to array of pointers · 9f65e15b

由 Alexander Duyck 提交于 9月 28, 2013

Allocate the queue pairs individually instead of as a group. This
allows for much easier queue management as it is possible to dynamically
resize the queues without having to free and allocate the entire block.

Ease statistic collection by treating Tx/Rx queue pairs as a single
unit. Each pair is allocated together and starts with a Tx queue and
ends with an Rx queue. By ordering them this way it is possible to know
the Rx offset based on a pointer to the Tx queue.
Signed-off-by: NAlexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: NJesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: NKavindya Deegala <kavindya.s.deegala@intel.com>
Signed-off-by: NJeff Kirsher <jeffrey.t.kirsher@intel.com>

9f65e15b

i40e: Replace ring container array with linked list · cd0b6fa6

由 Alexander Duyck 提交于 9月 28, 2013

This replaces the ring container array with a linked list.  The idea is
to make the logic much easier to deal with since this will allow us to
call a simple helper function from the q_vectors to go through the
entire list.
Signed-off-by: NAlexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: NJesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: NKavindya Deegala <kavindya.s.deegala@intel.com>
Signed-off-by: NJeff Kirsher <jeffrey.t.kirsher@intel.com>

cd0b6fa6

i40e: Move q_vectors from pointer to array to array of pointers · 493fb300

由 Alexander Duyck 提交于 9月 28, 2013

Allocate the q_vectors individually. The advantage to this is that it
allows for easier freeing and allocation. In addition it makes it so
that we could do node specific allocations at some point in the future.
Signed-off-by: NAlexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: NJesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: NJeff Kirsher <jeffrey.t.kirsher@intel.com>

493fb300

i40e: Split bytes and packets from Rx/Tx stats · a114d0a6

由 Alexander Duyck 提交于 9月 28, 2013

This makes it so that the Tx and Rx byte and packet counts are
separated from the rest of the statistics. This allows for better
isolation of these stats when we move them into the 64 bit statistics.

Simplify things by re-ordering how the stats display in ethtool.
Instead of displaying all of the Tx queues as a block, followed by all
the Rx queues, the new order is Tx[0], Rx[0], Tx[1], Rx[1], ..., Tx[n],
Rx[n]. This reduces the loops and cleans up the display for testing
purposes since it is very easy to verify if flow director is doing the
right thing as the Tx and Rx queue pair are shown in pairs.
Signed-off-by: NAlexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: NJesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: NKavindya Deegala <kavindya.s.deegala@intel.com>
Signed-off-by: NJeff Kirsher <jeffrey.t.kirsher@intel.com>

a114d0a6

i40e: Add support for Tx byte queue limits · 7070ce0a

由 Alexander Duyck 提交于 9月 28, 2013

Implement BQL (byte queue limit) support in i40e.
Signed-off-by: NAlexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: NJesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: NKavindya Deegala <kavindya.s.deegala@intel.com>
Signed-off-by: NJeff Kirsher <jeffrey.t.kirsher@intel.com>

7070ce0a

i40e: Drop dead code and flags from Tx hotpath · b1941306

由 Alexander Duyck 提交于 9月 28, 2013

Drop Tx flag and TXSW which is tested but never set.

As a result of this change we can drop a complicated check that always
resulted in the final result of i40e_tx_csum being equal to the
CHECKSUM_PARTIAL value. As such we can replace the entire function call
with just a check for skb->summed == CHECKSUM_PARTIAL.
Signed-off-by: NAlexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: NJesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: NKavindya Deegala <kavindya.s.deegala@intel.com>
Signed-off-by: NJeff Kirsher <jeffrey.t.kirsher@intel.com>

b1941306

i40e: clean up Tx fast path · a5e9c572

由 Alexander Duyck 提交于 9月 28, 2013

Sync the fast path for i40e_tx_map and i40e_clean_tx_irq so that they
are similar to igb and ixgbe.

- Only update the Tx descriptor ring in tx_map
- Make skb mapping always on the first buffer in the chain
- Drop the use of MAPPED_AS_PAGE Tx flag
- Only store flags on the first buffer_info structure
Signed-off-by: NAlexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: NJesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: NKavindya Deegala <kavindya.s.deegala@intel.com>
Signed-off-by: NJeff Kirsher <jeffrey.t.kirsher@intel.com>

a5e9c572

i40e: Do not directly increment Tx next_to_use · fc4ac67b

由 Alexander Duyck 提交于 9月 28, 2013

Avoid directly incrementing next_to_use for multiple reasons. The main
reason being that if we directly increment it then it can attain a state
where it is equal to the ring count. Technically this is a state it
should not be able to reach but the way this is written it now can.

This patch pulls the value off into a register and then increments it
and writes back either the value or 0 depending on if the value is equal
to the ring count.
Signed-off-by: NAlexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: NJesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: NKavindya Deegala <kavindya.s.deegala@intel.com>
Signed-off-by: NJeff Kirsher <jeffrey.t.kirsher@intel.com>

fc4ac67b

i40e: Cleanup Tx buffer info layout · 35a1e2ad

由 Alexander Duyck 提交于 9月 28, 2013

- drop the mapped_as_page u8 from the Tx buffer info as it was unused
- use the DMA unmap accessors for Tx DMA
- replace checks of DMA with checks of the unmap length to verify if an
  unmap is needed
- update the Tx buffer layout to make it consistent with igb, ixgbe
Signed-off-by: NAlexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: NJesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: NKavindya Deegala <kavindya.s.deegala@intel.com>
Signed-off-by: NJeff Kirsher <jeffrey.t.kirsher@intel.com>

35a1e2ad

tcp: use ACCESS_ONCE() in tcp_update_pacing_rate() · ba537427

由 Eric Dumazet 提交于 10月 09, 2013

sk_pacing_rate is read by sch_fq packet scheduler at any time,
with no synchronization, so make sure we update it in a
sensible way. ACCESS_ONCE() is how we instruct compiler
to not do stupid things, like using the memory location
as a temporary variable.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

ba537427

inet: includes a sock_common in request_sock · 634fb979

由 Eric Dumazet 提交于 10月 09, 2013

TCP listener refactoring, part 5 :

We want to be able to insert request sockets (SYN_RECV) into main
ehash table instead of the per listener hash table to allow RCU
lookups and remove listener lock contention.

This patch includes the needed struct sock_common in front
of struct request_sock

This means there is no more inet6_request_sock IPv6 specific
structure.

Following inet_request_sock fields were renamed as they became
macros to reference fields from struct sock_common.
Prefix ir_ was chosen to avoid name collisions.

loc_port   -> ir_loc_port
loc_addr   -> ir_loc_addr
rmt_addr   -> ir_rmt_addr
rmt_port   -> ir_rmt_port
iif        -> ir_iif
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

634fb979

net: gro: allow to build full sized skb · 8a29111c

由 Eric Dumazet 提交于 10月 08, 2013

skb_gro_receive() is currently limited to 16 or 17 MSS per GRO skb,
typically 24616 bytes, because it fills up to MAX_SKB_FRAGS frags.

It's relatively easy to extend the skb using frag_list to allow
more frags to be appended into the last sk_buff.

This still builds very efficient skbs, and allows reaching 45 MSS per
skb.

(45 MSS GRO packet uses one skb plus a frag_list containing 2 additional
sk_buff)

High speed TCP flows benefit from this extension by lowering TCP stack
cpu usage (less packets stored in receive queue, less ACK packets
processed)

Forwarding setups could be hurt, as such skbs will need to be
linearized, although its not a new problem, as GRO could already
provide skbs with a frag_list.

We could make the 65536 bytes threshold a tunable to mitigate this.

(First time we need to linearize skb in skb_needs_linearize(), we could
lower the tunable to ~16*1460 so that following skb_gro_receive() calls
build smaller skbs)
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

8a29111c

fib_trie: only calc for the un-first node · 4c60f1d6

由 baker.zhang 提交于 10月 08, 2013

This is a enhancement.

for the first node in fib_trie, newpos is 0, bit is 1.
Only for the leaf or node with unmatched key need calc pos.
Signed-off-by: Nbaker.zhang <baker.kernel@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

4c60f1d6

veth: allow to setup multicast address for veth device · 5c70ef85

由 Gao feng 提交于 10月 04, 2013

We can only setup multicast address for network device when
net_device_ops->ndo_set_rx_mode is not null.

Some configurations need to add multicast address for net
device, such as netfilter cluster match module.

Add a fake ndo_set_rx_mode function to allow this operation.
Signed-off-by: NGao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

5c70ef85

i40e: Drop unused completed stat · c304fdac

由 Alexander Duyck 提交于 9月 28, 2013

The Tx "completed" stat was part of the original rewrite for detecting
Tx hangs.  However some time ago in ixgbe I determined that we could
just use the packets stat instead.  Since then this stat was
removed from ixgbe and it serves no purpose in i40e so it can be
dropped.
Signed-off-by: NAlexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: NJesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: NKavindya Deegala <kavindya.s.deegala@intel.com>
Signed-off-by: NJeff Kirsher <jeffrey.t.kirsher@intel.com>

c304fdac

i40e: Link code updates · 6d779b41

由 Anjali Singhai 提交于 9月 28, 2013

Link events should not print to the log until the device is
administratively up.
Signed-off-by: NAnjali Singhai <anjali.singhai@intel.com>
Signed-off-by: NJesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: NKavindya Deegala <kavindya.s.deegala@intel.com>
Signed-off-by: NJeff Kirsher <jeffrey.t.kirsher@intel.com>

6d779b41

be2net: change the driver version number to 4.9.224.0 · b68656b2

由 Ajit Khaparde 提交于 10月 03, 2013

Signed-off-by: NAjit Khaparde <ajit.khaparde@emulex.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

b68656b2

be2net: Display RoCE specific counters in ethtool -S · 461ae379

由 Ajit Khaparde 提交于 10月 03, 2013

SkyHawk-R can support RoCE. Add code to display RoCE specific
counters maintained in hardware.
Signed-off-by: NAjit Khaparde <ajit.khaparde@emulex.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

461ae379

be2net: Call version 2 of GET_STATS ioctl for Skyhawk-R · 61000861

由 Ajit Khaparde 提交于 10月 03, 2013

Moving to version 2 of GET_STATS command as SkyHawk-R supports
higher number of rings.
Signed-off-by: NAjit Khaparde <ajit.khaparde@emulex.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

61000861

batman-adv: reorder batadv_iv_flags · 18c68d59

由 Simon Wunderlich 提交于 4月 25, 2013

The vis flag is not needed anymore, and since we do a compat bump we
can start with the first bit again
Signed-off-by: NSimon Wunderlich <siwu@hrz.tu-chemnitz.de>
Signed-off-by: NMarek Lindner <lindner_marek@yahoo.de>
Signed-off-by: NAntonio Quartulli <antonio@meshcoding.com>

18c68d59

batman-adv: remove packed from batadv_ogm_packet · 9284a47e

由 Simon Wunderlich 提交于 4月 25, 2013

As we decreased the struct size from 26 to 24 byte, we can remove
__packed as the compiler will not add any more padding.
Signed-off-by: NSimon Wunderlich <siwu@hrz.tu-chemnitz.de>
Signed-off-by: NMarek Lindner <lindner_marek@yahoo.de>
Signed-off-by: NAntonio Quartulli <antonio@meshcoding.com>

9284a47e

batman-adv: reorder packet types · a1f1ac5c

由 Simon Wunderlich 提交于 4月 25, 2013

Reordering the packet type numbers allows us to handle unicast
packets in a general way - even if we don't know the specific packet
type, we can still forward it. There was already code handling
this for a couple of unicast packets, and this is the more
generalized version to do that.
Signed-off-by: NSimon Wunderlich <siwu@hrz.tu-chemnitz.de>
Signed-off-by: NMarek Lindner <lindner_marek@yahoo.de>
Signed-off-by: NAntonio Quartulli <antonio@meshcoding.com>

a1f1ac5c

batman-adv: add build check macros for packet member offset · 80067c83

由 Simon Wunderlich 提交于 4月 25, 2013

Since we removed the __packed from most of the packets, we should
make sure that the offset generated by the compiler are correct for
sent/received data.
Signed-off-by: NSimon Wunderlich <siwu@hrz.tu-chemnitz.de>
Signed-off-by: NMarek Lindner <lindner_marek@yahoo.de>
Signed-off-by: NAntonio Quartulli <antonio@meshcoding.com>

80067c83

batman-adv: remove vis functionality · 9f4980e6

由 Simon Wunderlich 提交于 4月 25, 2013

This is replaced by a userspace program, we don't need this
functionality to bloat the kernel.
Signed-off-by: NSimon Wunderlich <siwu@hrz.tu-chemnitz.de>
Signed-off-by: NMarek Lindner <lindner_marek@yahoo.de>
Signed-off-by: NAntonio Quartulli <antonio@meshcoding.com>

9f4980e6

batman-adv: move BATADV_TT_CLIENT_TEMP to higher bit · 0035f97e

由 Antonio Quartulli 提交于 4月 24, 2013

Client flags from bit 0 to 7 are sent over the wire.
BATADV_TT_CLIENT_TEMP is a local flag and is not supposed
to be sent to the network. Therefore it has occupy a
higher bit.
Signed-off-by: NAntonio Quartulli <ordex@autistici.org>
Signed-off-by: NMarek Lindner <lindner_marek@yahoo.de>

0035f97e

openanolis / cloud-kernel 1 年多 前同步成功

openanolis / cloud-kernel
1 年多前同步成功