1. 16 Jan 2018, 1 commit
  2. 15 Jan 2018, 1 commit
  3. 11 Jan 2018, 7 commits
  4. 10 Jan 2018, 5 commits
    • bpf: export function to write into verifier log buffer · 430e68d1
      Quentin Monnet committed
      Rename the BPF verifier `verbose()` to `bpf_verifier_log_write()` and
      export it, so that other components (in particular, drivers for BPF
      offload) can reuse the user buffer log to dump error messages at
      verification time.
      
      Renaming `verbose()` was necessary in order to avoid exporting such a
      generic name to the global namespace. However, to prevent too much pain
      for backports, the calls to `verbose()` in the kernel BPF verifier were
      not changed. Instead, use function aliasing to make `verbose` point to
      `bpf_verifier_log_write`. Another solution would have been a wrapper
      around `verbose()`, but since it is a variadic function, I don't see a
      clean way to do that without creating two identical wrappers, one for
      the verifier and one to export.
      Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
      Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      430e68d1
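      The aliasing technique mentioned above can be expressed with the GCC/Clang
      alias attribute. The following is only a minimal userspace sketch of that
      mechanism, not the actual verifier code; the printf body stands in for
      writing into the verifier's log buffer.

        /* Sketch of function aliasing: the exported symbol carries the new
         * name, while the old internal name resolves to the same body, so
         * existing call sites need no changes. */
        #include <stdarg.h>
        #include <stdio.h>

        void bpf_verifier_log_write(const char *fmt, ...)
        {
                va_list args;

                va_start(args, fmt);
                vprintf(fmt, args);   /* stand-in for the log buffer write */
                va_end(args);
        }

        void verbose(const char *fmt, ...)
                __attribute__((alias("bpf_verifier_log_write")));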
    • bpf: avoid false sharing of map refcount with max_entries · be95a845
      Daniel Borkmann committed
      In addition to commit b2157399 ("bpf: prevent out-of-bounds
      speculation"), also change the layout of struct bpf_map such that
      false sharing of fast-path members like max_entries is avoided
      when the map's reference counter is altered. Therefore, enforce
      that they are placed in separate cachelines.
      
      pahole dump after change:
      
        struct bpf_map {
              const struct bpf_map_ops  * ops;                 /*     0     8 */
              struct bpf_map *           inner_map_meta;       /*     8     8 */
              void *                     security;             /*    16     8 */
              enum bpf_map_type          map_type;             /*    24     4 */
              u32                        key_size;             /*    28     4 */
              u32                        value_size;           /*    32     4 */
              u32                        max_entries;          /*    36     4 */
              u32                        map_flags;            /*    40     4 */
              u32                        pages;                /*    44     4 */
              u32                        id;                   /*    48     4 */
              int                        numa_node;            /*    52     4 */
              bool                       unpriv_array;         /*    56     1 */
      
              /* XXX 7 bytes hole, try to pack */
      
              /* --- cacheline 1 boundary (64 bytes) --- */
              struct user_struct *       user;                 /*    64     8 */
              atomic_t                   refcnt;               /*    72     4 */
              atomic_t                   usercnt;              /*    76     4 */
              struct work_struct         work;                 /*    80    32 */
              char                       name[16];             /*   112    16 */
              /* --- cacheline 2 boundary (128 bytes) --- */
      
              /* size: 128, cachelines: 2, members: 17 */
              /* sum members: 121, holes: 1, sum holes: 7 */
        };
      
      Now all members in the first cacheline are read-only throughout the
      lifetime of the map and are set up once during map creation. The
      overall struct size and number of cachelines do not change from the
      reordering. struct bpf_map is usually the first member embedded in
      the map structs of specific map implementations, so also avoid
      letting these mutable members sit at the end, where they could share
      a cacheline with the first map values, e.g. in the array map: remote
      CPUs can trigger map updates just as well for those maps (easily and
      intentionally dirtying members like max_entries) while having
      subsequent values in cache.
      
      Quoting from Google's Project Zero blog [1]:
      
        Additionally, at least on the Intel machine on which this was
        tested, bouncing modified cache lines between cores is slow,
        apparently because the MESI protocol is used for cache coherence
        [8]. Changing the reference counter of an eBPF array on one
        physical CPU core causes the cache line containing the reference
        counter to be bounced over to that CPU core, making reads of the
        reference counter on all other CPU cores slow until the changed
        reference counter has been written back to memory. Because the
        length and the reference counter of an eBPF array are stored in
        the same cache line, this also means that changing the reference
        counter on one physical CPU core causes reads of the eBPF array's
        length to be slow on other physical CPU cores (intentional false
        sharing).
      
      While this doesn't 'control' the out-of-bounds speculation through
      masking the index as in commit b2157399, triggering a manipulation
      of the map's reference counter is really trivial, so let's not allow
      it to easily affect max_entries.
      
      Splitting into separate cachelines also generally makes sense from
      a performance perspective, in that the fast path won't take a cache
      miss if the map gets pinned, reused in other progs, etc. from the
      control path; this also avoids unintentional false sharing.
      
        [1] https://googleprojectzero.blogspot.ch/2018/01/reading-privileged-memory-with-side.html
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      be95a845
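      A minimal userspace sketch of the cacheline split described above,
      assuming a 64-byte cacheline; the field names only mirror the commit
      message and this is not the kernel definition of struct bpf_map.

        #include <stdint.h>

        #define CACHELINE_SIZE 64

        struct example_map {
                /* Read-mostly members, set once at map creation time. */
                uint32_t key_size;
                uint32_t value_size;
                uint32_t max_entries;
                uint32_t map_flags;

                /* Mutable counters start on a fresh cacheline, so bouncing
                 * the refcount between CPUs cannot falsely share a line
                 * with max_entries. */
                _Alignas(CACHELINE_SIZE) int refcnt;
                int usercnt;
        };

        /* offsetof(struct example_map, refcnt) is now 64. */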
    • tipc: improve groupcast scope handling · 232d07b7
      Jon Maloy committed
      When a member joins a group, it also indicates a binding scope. This
      makes it possible to create both node local groups, invisible to other
      nodes, as well as cluster global groups, visible everywhere.
      
      In order to avoid different members ending up with permanently
      differing views of group size and membership, we must inhibit locally
      and globally bound members from joining the same group.
      
      We do this by using the binding scope as an additional separator between
      groups. I.e., a member must ignore all membership events from sockets
      using a different scope than itself, and all lookups for message
      destinations must require an exact match between the message's lookup
      scope and the potential target's binding scope.
      
      Apart from making it possible to create local groups using the same
      identity on different nodes, a side effect of this is that it now also
      becomes possible to create a cluster global group with the same identity
      across the same nodes, without interfering with the local groups.
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      232d07b7
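      A hedged sketch of the scope-matching rule described above; the types
      and field names are illustrative, not TIPC's internal structures.

        #include <stdbool.h>

        enum bind_scope { NODE_SCOPE, CLUSTER_SCOPE };

        struct member_id {
                unsigned int group;      /* group identity */
                enum bind_scope scope;   /* scope chosen at join time */
        };

        /* A member only accepts membership events from peers bound with
         * the same scope as its own. */
        static bool accept_membership_event(const struct member_id *self,
                                            const struct member_id *peer)
        {
                return self->group == peer->group && self->scope == peer->scope;
        }

        /* Destination lookup requires an exact match between the message's
         * lookup scope and the candidate binding's scope. */
        static bool dest_matches(const struct member_id *binding,
                                 unsigned int group,
                                 enum bind_scope lookup_scope)
        {
                return binding->group == group && binding->scope == lookup_scope;
        }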
    • virtio_net: propagate linkspeed/duplex settings from the hypervisor · faa9b39f
      Jason Baron committed
      The ability to set speed and duplex for virtio_net is useful in various
      scenarios as described here:
      
      16032be5 virtio_net: add ethtool support for set and get of settings
      
      However, it would be nice to be able to set this from the hypervisor,
      such that virtio_net doesn't require custom guest ethtool commands.
      
      Introduce a new feature flag, VIRTIO_NET_F_SPEED_DUPLEX, which allows
      the hypervisor to export a linkspeed and duplex setting. The user can
      subsequently override it if desired via 'ethtool -s'.
      
      Note that VIRTIO_NET_F_SPEED_DUPLEX is defined as bit 63; the intention
      is that device feature bits grow down from bit 63, since the transport
      feature bits start from bit 24 and grow up.
      Signed-off-by: Jason Baron <jbaron@akamai.com>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: virtio-dev@lists.oasis-open.org
      Signed-off-by: David S. Miller <davem@davemloft.net>
      faa9b39f
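      A hedged sketch of the negotiation logic described above. Only the
      feature bit number (63) comes from the commit message; the config
      struct layout and duplex encoding here are illustrative assumptions,
      not the virtio specification.

        #include <stdint.h>

        #define VIRTIO_NET_F_SPEED_DUPLEX 63   /* device exports speed/duplex */

        struct example_net_config {
                uint32_t speed;    /* link speed in Mbit/s (illustrative) */
                uint8_t  duplex;   /* 0 = half, 1 = full (illustrative) */
        };

        static int feature_offered(uint64_t features, unsigned int bit)
        {
                return (features >> bit) & 1ULL;
        }

        /* Only trust hypervisor-provided values when the feature bit was
         * offered; 'ethtool -s' can still override them later. */
        static void read_link_settings(uint64_t features,
                                       const struct example_net_config *cfg,
                                       uint32_t *speed, uint8_t *duplex)
        {
                if (feature_offered(features, VIRTIO_NET_F_SPEED_DUPLEX)) {
                        *speed = cfg->speed;
                        *duplex = cfg->duplex;
                }
        }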
    • macsec: Add support for GCM-AES-256 cipher suite · ccfdec90
      Felix Walter committed
      This adds support for the GCM-AES-256 cipher suite as specified in
      IEEE 802.1AEbn-2011. The prepared cipher suite selection mechanism is used,
      with GCM-AES-128 being the default cipher suite as defined in the standard.
      Signed-off-by: Felix Walter <felix.walter@cloudandheat.com>
      Cc: Sabrina Dubroca <sd@queasysnail.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ccfdec90
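      A tiny hedged sketch of the cipher-suite-dependent key length implied
      above; the enum names are illustrative, not the 802.1AE identifiers.

        #include <stddef.h>

        enum cipher_suite { GCM_AES_128, GCM_AES_256 };

        /* GCM-AES-128 uses a 16-byte SAK, GCM-AES-256 a 32-byte SAK;
         * GCM-AES-128 remains the default suite. */
        static size_t sak_len(enum cipher_suite cs)
        {
                return cs == GCM_AES_256 ? 32 : 16;
        }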
  5. 09 Jan 2018, 26 commits