提交 · c7d39e32632e5db9dc4da51198b76d8c315946ff · openeuler / Kernel

13 10月, 2015 1 次提交

sock: support per-packet fwmark · f28ea365

由 Edward Jee 提交于 10月 08, 2015

It's useful to allow users to set fwmark for an individual packet,
without changing the socket state. The function this patch adds in
sock layer can be used by the protocols that need such a feature.
Signed-off-by: NEdward Hyunkoo Jee <edjee@google.com>
Signed-off-by: NEric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

f28ea365

04 10月, 2015 1 次提交

tcp/dccp: add SLAB_DESTROY_BY_RCU flag for request sockets · e96f78ab

由 Eric Dumazet 提交于 10月 03, 2015

Before letting request sockets being put in TCP/DCCP regular
ehash table, we need to add either :

- SLAB_DESTROY_BY_RCU flag to their kmem_cache
- add RCU grace period before freeing them.

Since we carefully respected the SLAB_DESTROY_BY_RCU protocol
like ESTABLISH and TIMEWAIT sockets, use it here.

req_prot_init() being only used by TCP and DCCP, I did not add
a new slab_flags into their rsk_prot, but reuse prot->slab_flags

Since all reqsk_alloc() users are correctly dealing with a failure,
add the __GFP_NOWARN flag to avoid traces under pressure.

Fixes: 079096f1 ("tcp/dccp: install syn_recv requests into ehash table")
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

e96f78ab

16 9月, 2015 1 次提交

net: core: drop null test before destroy functions · adf78eda

由 Julia Lawall 提交于 9月 13, 2015

Remove unneeded NULL test.

The semantic patch that makes this change is as follows:
(http://coccinelle.lip6.fr/)

// <smpl>
@@ expression x; @@
-if (x != NULL) {
  \(kmem_cache_destroy\|mempool_destroy\|dma_pool_destroy\)(x);
  x = NULL;
-}
// </smpl>
Signed-off-by: NJulia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

adf78eda

28 8月, 2015 1 次提交

sock: fix kernel doc error · 69dba9bb

由 Jean Sacren 提交于 8月 27, 2015

The symbol '__sk_reclaim' is not present in the current tree. Apparently
'__sk_reclaim' was meant to be '__sk_mem_reclaim', so fix it with the
right symbol name for the kernel doc.
Signed-off-by: NJean Sacren <sakiwit@gmail.com>
Cc: Hideo Aoki <haoki@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

69dba9bb

31 7月, 2015 1 次提交

net: sk_clone_lock() should only do get_net() if the parent is not a kernel socket · 8a681736

由 Sowmini Varadhan 提交于 7月 30, 2015

The newsk returned by sk_clone_lock should hold a get_net()
reference if, and only if, the parent is not a kernel socket
(making this similar to sk_alloc()).

E.g,. for the SYN_RECV path, tcp_v4_syn_recv_sock->..inet_csk_clone_lock
sets up the syn_recv newsk from sk_clone_lock. When the parent (listen)
socket is a kernel socket (defined in sk_alloc() as having
sk_net_refcnt == 0), then the newsk should also have a 0 sk_net_refcnt
and should not hold a get_net() reference.

Fixes: 26abe143 ("net: Modify sk_alloc to not reference count the
      netns of kernel sockets.")
Acked-by: NEric Dumazet <edumazet@google.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: NSowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

8a681736

27 7月, 2015 1 次提交

tcp: fix recv with flags MSG_WAITALL | MSG_PEEK · dfbafc99

由 Sabrina Dubroca 提交于 7月 24, 2015

Currently, tcp_recvmsg enters a busy loop in sk_wait_data if called
with flags = MSG_WAITALL | MSG_PEEK.

sk_wait_data waits for sk_receive_queue not empty, but in this case,
the receive queue is not empty, but does not contain any skb that we
can use.

Add a "last skb seen on receive queue" argument to sk_wait_data, so
that it sleeps until the receive queue has new skbs.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=99461
Link: https://sourceware.org/bugzilla/show_bug.cgi?id=18493
Link: https://bugzilla.redhat.com/show_bug.cgi?id=1205258Reported-by: NEnrico Scholz <rh-bugzilla@ensc.de>
Reported-by: NDan Searle <dan@censornet.com>
Signed-off-by: NSabrina Dubroca <sd@queasysnail.net>
Acked-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

dfbafc99

01 7月, 2015 1 次提交

sock_diag: don't broadcast kernel sockets · b922622e

由 Craig Gallek 提交于 6月 30, 2015

Kernel sockets do not hold a reference for the network namespace to
which they point.  Socket destruction broadcasting relies on the
network namespace and will cause the splat below when a kernel socket
is destroyed.

This fix simply ignores kernel sockets when they are destroyed.

Reported as:
general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
CPU: 1 PID: 9130 Comm: kworker/1:1 Not tainted 4.1.0-gelk-debug+ #1
Workqueue: sock_diag_events sock_diag_broadcast_destroy_work
Stack:
 ffff8800b9c586c0 ffff8800b9c586c0 ffff8800ac4692c0 ffff8800936d4a90
 ffff8800352efd38 ffffffff8469a93e ffff8800352efd98 ffffffffc09b9b90
 ffff8800352efd78 ffff8800ac4692c0 ffff8800b9c586c0 ffff8800831b6ab8
Call Trace:
 [<ffffffff8469a93e>] ? mutex_unlock+0xe/0x10
 [<ffffffffc09b9b90>] ? inet_diag_handler_get_info+0x110/0x1fb [inet_diag]
 [<ffffffff845c868d>] netlink_broadcast+0x1d/0x20
 [<ffffffff8469a93e>] ? mutex_unlock+0xe/0x10
 [<ffffffff845b2bf5>] sock_diag_broadcast_destroy_work+0xd5/0x160
 [<ffffffff8408ea97>] process_one_work+0x147/0x420
 [<ffffffff8408f0f9>] worker_thread+0x69/0x470
 [<ffffffff8409fda3>] ? preempt_count_sub+0xa3/0xf0
 [<ffffffff8408f090>] ? rescuer_thread+0x320/0x320
 [<ffffffff84093cd7>] kthread+0x107/0x120
 [<ffffffff84093bd0>] ? kthread_create_on_node+0x1b0/0x1b0
 [<ffffffff8469d31f>] ret_from_fork+0x3f/0x70
 [<ffffffff84093bd0>] ? kthread_create_on_node+0x1b0/0x1b0

Tested:
  Using a debug kernel while 'ss -E' is running:
  ip netns add test-ns
  ip netns delete test-ns

Fixes: eb4cb008 sock_diag: define destruction multicast groups
Fixes: 26abe143 net: Modify sk_alloc to not reference count the
  netns of kernel sockets.
Reported-by: NDave Jones <davej@codemonkey.org.uk>
Suggested-by: NEric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: NCraig Gallek <kraig@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

b922622e

29 6月, 2015 1 次提交

net: Kill sock->sk_protinfo · 1830fcea

由 David Miller 提交于 6月 25, 2015

No more users, so it can now be removed.
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

1830fcea

16 6月, 2015 1 次提交

sock_diag: define destruction multicast groups · eb4cb008

由 Craig Gallek 提交于 6月 15, 2015

These groups will contain socket-destruction events for
AF_INET/AF_INET6, IPPROTO_TCP/IPPROTO_UDP.

Near the end of socket destruction, a check for listeners is
performed.  In the presence of a listener, rather than completely
cleanup the socket, a unit of work will be added to a private
work queue which will first broadcast information about the socket
and then finish the cleanup operation.
Signed-off-by: NCraig Gallek <kraig@google.com>
Acked-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

eb4cb008

12 6月, 2015 1 次提交

net: don't wait for order-3 page allocation · fb05e7a8

由 Shaohua Li 提交于 6月 11, 2015

We saw excessive direct memory compaction triggered by skb_page_frag_refill.
This causes performance issues and add latency. Commit 5640f768
introduces the order-3 allocation. According to the changelog, the order-3
allocation isn't a must-have but to improve performance. But direct memory
compaction has high overhead. The benefit of order-3 allocation can't
compensate the overhead of direct memory compaction.

This patch makes the order-3 page allocation atomic. If there is no memory
pressure and memory isn't fragmented, the alloction will still success, so we
don't sacrifice the order-3 benefit here. If the atomic allocation fails,
direct memory compaction will not be triggered, skb_page_frag_refill will
fallback to order-0 immediately, hence the direct memory compaction overhead is
avoided. In the allocation failure case, kswapd is waken up and doing
compaction, so chances are allocation could success next time.

alloc_skb_with_frags is the same.

The mellanox driver does similar thing, if this is accepted, we must fix
the driver too.

V3: fix the same issue in alloc_skb_with_frags as pointed out by Eric
V2: make the changelog clearer

Cc: Eric Dumazet <edumazet@google.com>
Cc: Chris Mason <clm@fb.com>
Cc: Debabrata Banerjee <dbavatar@gmail.com>
Signed-off-by: NShaohua Li <shli@fb.com>
Acked-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

fb05e7a8

11 6月, 2015 1 次提交

net, swap: Remove a warning and clarify why sk_mem_reclaim is required when deactivating swap · 5d753610

由 Mel Gorman 提交于 6月 10, 2015

Jeff Layton reported the following;

 [   74.232485] ------------[ cut here ]------------
 [   74.233354] WARNING: CPU: 2 PID: 754 at net/core/sock.c:364 sk_clear_memalloc+0x51/0x80()
 [   74.234790] Modules linked in: cts rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache xfs libcrc32c snd_hda_codec_generic snd_hda_intel snd_hda_controller snd_hda_codec snd_hda_core snd_hwdep snd_seq snd_seq_device nfsd snd_pcm snd_timer snd e1000 ppdev parport_pc joydev parport pvpanic soundcore floppy serio_raw i2c_piix4 pcspkr nfs_acl lockd virtio_balloon acpi_cpufreq auth_rpcgss grace sunrpc qxl drm_kms_helper ttm drm virtio_console virtio_blk virtio_pci ata_generic virtio_ring pata_acpi virtio
 [   74.243599] CPU: 2 PID: 754 Comm: swapoff Not tainted 4.1.0-rc6+ #5
 [   74.244635] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
 [   74.245546]  0000000000000000 0000000079e69e31 ffff8800d066bde8 ffffffff8179263d
 [   74.246786]  0000000000000000 0000000000000000 ffff8800d066be28 ffffffff8109e6fa
 [   74.248175]  0000000000000000 ffff880118d48000 ffff8800d58f5c08 ffff880036e380a8
 [   74.249483] Call Trace:
 [   74.249872]  [<ffffffff8179263d>] dump_stack+0x45/0x57
 [   74.250703]  [<ffffffff8109e6fa>] warn_slowpath_common+0x8a/0xc0
 [   74.251655]  [<ffffffff8109e82a>] warn_slowpath_null+0x1a/0x20
 [   74.252585]  [<ffffffff81661241>] sk_clear_memalloc+0x51/0x80
 [   74.253519]  [<ffffffffa0116c72>] xs_disable_swap+0x42/0x80 [sunrpc]
 [   74.254537]  [<ffffffffa01109de>] rpc_clnt_swap_deactivate+0x7e/0xc0 [sunrpc]
 [   74.255610]  [<ffffffffa03e4fd7>] nfs_swap_deactivate+0x27/0x30 [nfs]
 [   74.256582]  [<ffffffff811e99d4>] destroy_swap_extents+0x74/0x80
 [   74.257496]  [<ffffffff811ecb52>] SyS_swapoff+0x222/0x5c0
 [   74.258318]  [<ffffffff81023f27>] ? syscall_trace_leave+0xc7/0x140
 [   74.259253]  [<ffffffff81798dae>] system_call_fastpath+0x12/0x71
 [   74.260158] ---[ end trace 2530722966429f10 ]---

The warning in question was unnecessary but with Jeff's series the rules
are also clearer.  This patch removes the warning and updates the comment
to explain why sk_mem_reclaim() may still be called.

[jlayton: remove if (sk->sk_forward_alloc) conditional. As Leon
          points out that it's not needed.]

Cc: Leon Romanovsky <leon@leon.nu>
Signed-off-by: NMel Gorman <mgorman@suse.de>
Signed-off-by: NJeff Layton <jeff.layton@primarydata.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

5d753610

27 5月, 2015 1 次提交

tcp: tcp_tso_autosize() minimum is one packet · d6a4e26a

由 Eric Dumazet 提交于 5月 26, 2015

By making sure sk->sk_gso_max_segs minimal value is one,
and sysctl_tcp_min_tso_segs minimal value is one as well,
tcp_tso_autosize() will return a non zero value.

We can then revert 843925f3
("tcp: Do not apply TSO segment limit to non-TSO packets")
and save few cpu cycles in fast path.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: NNeal Cardwell <ncardwell@google.com>
Acked-by: NHerbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

d6a4e26a

18 5月, 2015 1 次提交

net: fix sk_mem_reclaim_partial() · 1a24e04e

由 Eric Dumazet 提交于 5月 15, 2015

sk_mem_reclaim_partial() goal is to ensure each socket has
one SK_MEM_QUANTUM forward allocation. This is needed both for
performance and better handling of memory pressure situations in
follow up patches.

SK_MEM_QUANTUM is currently a page, but might be reduced to 4096 bytes
as some arches have 64KB pages.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

1a24e04e

11 5月, 2015 3 次提交

net: kill sk_change_net and sk_release_kernel · affb9792

由 Eric W. Biederman 提交于 5月 08, 2015

These functions are no longer needed and no longer used kill them.
Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

affb9792

net: Modify sk_alloc to not reference count the netns of kernel sockets. · 26abe143

由 Eric W. Biederman 提交于 5月 08, 2015

Now that sk_alloc knows when a kernel socket is being allocated modify
it to not reference count the network namespace of kernel sockets.

Keep track of if a socket needs reference counting by adding a flag to
struct sock called sk_net_refcnt.

Update all of the callers of sock_create_kern to stop using
sk_change_net and sk_release_kernel as those hacks are no longer
needed, to avoid reference counting a kernel socket.
Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

26abe143

net: Pass kern from net_proto_family.create to sk_alloc · 11aa9c28

由 Eric W. Biederman 提交于 5月 08, 2015

In preparation for changing how struct net is refcounted
on kernel sockets pass the knowledge that we are creating
a kernel socket from sock_create_kern through to sk_alloc.
Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

11aa9c28

04 5月, 2015 1 次提交

Revert "net: kernel socket should be released in init_net namespace" · 2e70aedd

由 Herbert Xu 提交于 5月 03, 2015

This reverts commit c243d7e2.

That patch is solving a non-existant problem while creating a
real problem.  Just because a socket is allocated in the init
name space doesn't mean that it gets hashed in the init name space.

When we unhash it the name space must be the same as the one
we had when we hashed it.  So this patch is completely bogus
and causes socket leaks.
Reported-by: NAndrey Wagin <avagin@gmail.com>
Signed-off-by: NHerbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

2e70aedd

13 4月, 2015 1 次提交

tcp: do not cache align timewait sockets · 52db70dc

由 Eric Dumazet 提交于 4月 10, 2015

With recent adoption of skc_cookie in struct sock_common,
struct tcp_timewait_sock size increased from 192 to 200 bytes
on 64bit arches. SLAB rounds then to 256 bytes.

It is time to drop SLAB_HWCACHE_ALIGN constraint for twsk_slab.

This saves about 12 MB of memory on typical configuration reaching
262144 timewait sockets, and has no noticeable impact on performance.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

52db70dc

07 4月, 2015 1 次提交

ipv6: protect skb->sk accesses from recursive dereference inside the stack · f60e5990

由 hannes@stressinduktion.org 提交于 4月 01, 2015

We should not consult skb->sk for output decisions in xmit recursion
levels > 0 in the stack. Otherwise local socket settings could influence
the result of e.g. tunnel encapsulation process.

ipv6 does not conform with this in three places:

1) ip6_fragment: we do consult ipv6_npinfo for frag_size

2) sk_mc_loop in ipv6 uses skb->sk and checks if we should
   loop the packet back to the local socket

3) ip6_skb_dst_mtu could query the settings from the user socket and
   force a wrong MTU

Furthermore:
In sk_mc_loop we could potentially land in WARN_ON(1) if we use a
PF_PACKET socket ontop of an IPv6-backed vxlan device.

Reuse xmit_recursion as we are currently only interested in protecting
tunnel devices.

Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

f60e5990

24 3月, 2015 1 次提交

net: Move the comment about unsettable socket-level options to default clause... · 443b5991

由 YOSHIFUJI Hideaki/吉藤英明提交于 3月 23, 2015

net: Move the comment about unsettable socket-level options to default clause and update its reference.

We implement the SO_SNDLOWAT etc not to be settable and return
ENOPROTOOPT per 1003.1g 7. Move the comment to appropriate
position and update the reference.
Signed-off-by: NYOSHIFUJI Hideaki <hideaki.yoshifuji@miraclelinux.com>
Signed-off-by: NYOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

443b5991

21 3月, 2015 1 次提交

inet: get rid of central tcp/dccp listener timer · fa76ce73

由 Eric Dumazet 提交于 3月 19, 2015

One of the major issue for TCP is the SYNACK rtx handling,
done by inet_csk_reqsk_queue_prune(), fired by the keepalive
timer of a TCP_LISTEN socket.

This function runs for awful long times, with socket lock held,
meaning that other cpus needing this lock have to spin for hundred of ms.

SYNACK are sent in huge bursts, likely to cause severe drops anyway.

This model was OK 15 years ago when memory was very tight.

We now can afford to have a timer per request sock.

Timer invocations no longer need to lock the listener,
and can be run from all cpus in parallel.

With following patch increasing somaxconn width to 32 bits,
I tested a listener with more than 4 million active request sockets,
and a steady SYNFLOOD of ~200,000 SYN per second.
Host was sending ~830,000 SYNACK per second.

This is ~100 times more what we could achieve before this patch.

Later, we will get rid of the listener hash and use ehash instead.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

fa76ce73

17 3月, 2015 2 次提交

net: kernel socket should be released in init_net namespace · c243d7e2

由 Ying Xue 提交于 3月 16, 2015

Creating a kernel socket with sock_create_kern() happens in "init_net"
namespace, however, releasing it with sk_release_kernel() occurs in
the current namespace which may be different with "init_net" namespace.
Therefore, we should guarantee that the namespace in which a kernel
socket is created is same as the socket is created.
Signed-off-by: NYing Xue <ying.xue@windriver.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

c243d7e2

inet: factorize sock_edemux()/sock_gen_put() code · 2c13270b

由 Eric Dumazet 提交于 3月 15, 2015

sock_edemux() is not used in fast path, and should
really call sock_gen_put() to save some code.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

2c13270b

13 3月, 2015 3 次提交

inet: prepare sock_edemux() & sock_gen_put() for new SYN_RECV state · 41b822c5

由 Eric Dumazet 提交于 3月 12, 2015

sock_edemux() & sock_gen_put() should be ready to cope with request socks.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

41b822c5

net: add req_prot_cleanup() & req_prot_init() helpers · 0159dfd3

由 Eric Dumazet 提交于 3月 12, 2015

Make proto_register() & proto_unregister() a bit nicer.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

0159dfd3

net: Kill hold_net release_net · efd7ef1c

由 Eric W. Biederman 提交于 3月 11, 2015

hold_net and release_net were an idea that turned out to be useless.
The code has been disabled since 2008.  Kill the code it is long past due.
Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

efd7ef1c

12 3月, 2015 1 次提交

net: add real socket cookies · 33cf7c90

由 Eric Dumazet 提交于 3月 11, 2015

A long standing problem in netlink socket dumps is the use
of kernel socket addresses as cookies.

1) It is a security concern.

2) Sockets can be reused quite quickly, so there is
   no guarantee a cookie is used once and identify
   a flow.

3) request sock, establish sock, and timewait socks
   for a given flow have different cookies.

Part of our effort to bring better TCP statistics requires
to switch to a different allocator.

In this patch, I chose to use a per network namespace 64bit generator,
and to use it only in the case a socket needs to be dumped to netlink.
(This might be refined later if needed)

Note that I tried to carry cookies from request sock, to establish sock,
then timewait sockets.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Cc: Eric Salo <salo@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

33cf7c90

11 3月, 2015 1 次提交

net: add comment for sock_efree() usage · 7768eed8

由 Oliver Hartkopp 提交于 3月 10, 2015

Signed-off-by: NOliver Hartkopp <socketcan@hartkopp.net>
Acked-by: NAlexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

7768eed8

03 3月, 2015 1 次提交

net: Remove iocb argument from sendmsg and recvmsg · 1b784140

由 Ying Xue 提交于 3月 02, 2015

After TIPC doesn't depend on iocb argument in its internal
implementations of sendmsg() and recvmsg() hooks defined in proto
structure, no any user is using iocb argument in them at all now.
Then we can drop the redundant iocb argument completely from kinds of
implementations of both sendmsg() and recvmsg() in the entire
networking stack.

Cc: Christoph Hellwig <hch@lst.de>
Suggested-by: NAl Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: NYing Xue <ying.xue@windriver.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

1b784140

02 3月, 2015 1 次提交

net: add common accessor for setting dropcount on packets · 3bc3b96f

由 Eyal Birger 提交于 3月 01, 2015

As part of an effort to move skb->dropcount to skb->cb[], use
a common function in order to set dropcount in struct sk_buff.
Signed-off-by: NEyal Birger <eyal.birger@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

3bc3b96f

03 2月, 2015 1 次提交

net-timestamp: no-payload only sysctl · b245be1f

由 Willem de Bruijn 提交于 1月 30, 2015

Tx timestamps are looped onto the error queue on top of an skb. This
mechanism leaks packet headers to processes unless the no-payload
options SOF_TIMESTAMPING_OPT_TSONLY is set.

Add a sysctl that optionally drops looped timestamp with data. This
only affects processes without CAP_NET_RAW.

The policy is checked when timestamps are generated in the stack.
It is possible for timestamps with data to be reported after the
sysctl is set, if these were queued internally earlier.

No vulnerability is immediately known that exploits knowledge
gleaned from packet headers, but it may still be preferable to allow
administrators to lock down this path at the cost of possible
breakage of legacy applications.
Signed-off-by: NWillem de Bruijn <willemb@google.com>

----

Changes
  (v1 -> v2)
  - test socket CAP_NET_RAW instead of capable(CAP_NET_RAW)
  (rfc -> v1)
  - document the sysctl in Documentation/sysctl/net.txt
  - fix access control race: read .._OPT_TSONLY only once,
        use same value for permission check and skb generation.
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

b245be1f

06 12月, 2014 1 次提交

net: sock: allow eBPF programs to be attached to sockets · 89aa0758

由 Alexei Starovoitov 提交于 12月 01, 2014

introduce new setsockopt() command:

setsockopt(sock, SOL_SOCKET, SO_ATTACH_BPF, &prog_fd, sizeof(prog_fd))

where prog_fd was received from syscall bpf(BPF_PROG_LOAD, attr, ...)
and attr->prog_type == BPF_PROG_TYPE_SOCKET_FILTER

setsockopt() calls bpf_prog_get() which increments refcnt of the program,
so it doesn't get unloaded while socket is using the program.

The same eBPF program can be attached to multiple sockets.

User task exit automatically closes socket which calls sk_filter_uncharge()
which decrements refcnt of eBPF program
Signed-off-by: NAlexei Starovoitov <ast@plumgrid.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

89aa0758

25 11月, 2014 1 次提交

crypto: algif - add and use sock_kzfree_s() instead of memzero_explicit() · 79e88659

由 Daniel Borkmann 提交于 11月 19, 2014

Commit e1bd95bf ("crypto: algif - zeroize IV buffer") and
2a6af25b ("crypto: algif - zeroize message digest buffer")
added memzero_explicit() calls on buffers that are later on
passed back to sock_kfree_s().

This is a discussed follow-up that, instead, extends the sock
API and adds sock_kzfree_s(), which internally uses kzfree()
instead of kfree() for passing the buffers back to slab.

Having sock_kzfree_s() allows to keep the changes more minimal
by just having a drop-in replacement instead of adding
memzero_explicit() calls everywhere before sock_kfree_s().

In kzfree(), the compiler is not allowed to optimize the memset()
away and thus there's no need for memzero_explicit(). Both,
sock_kfree_s() and sock_kzfree_s() are wrappers for
__sock_kfree_s() and call into kfree() resp. kzfree(); here,
__sock_kfree_s() needs to be explicitly inlined as we want the
compiler to optimize the call and condition away and thus it
produces e.g. on x86_64 the _same_ assembler output for
sock_kfree_s() before and after, and thus also allows for
avoiding code duplication.

Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: NDaniel Borkmann <dborkman@redhat.com>
Signed-off-by: NHerbert Xu <herbert@gondor.apana.org.au>

79e88659

12 11月, 2014 1 次提交

net: introduce SO_INCOMING_CPU · 2c8c56e1

由 Eric Dumazet 提交于 11月 11, 2014

Alternative to RPS/RFS is to use hardware support for multiple
queues.

Then split a set of million of sockets into worker threads, each
one using epoll() to manage events on its own socket pool.

Ideally, we want one thread per RX/TX queue/cpu, but we have no way to
know after accept() or connect() on which queue/cpu a socket is managed.

We normally use one cpu per RX queue (IRQ smp_affinity being properly
set), so remembering on socket structure which cpu delivered last packet
is enough to solve the problem.

After accept(), connect(), or even file descriptor passing around
processes, applications can use :

 int cpu;
 socklen_t len = sizeof(cpu);

 getsockopt(fd, SOL_SOCKET, SO_INCOMING_CPU, &cpu, &len);

And use this information to put the socket into the right silo
for optimal performance, as all networking stack should run
on the appropriate cpu, without need to send IPI (RPS/RFS).
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

2c8c56e1

06 11月, 2014 1 次提交

net: Add and use skb_copy_datagram_msg() helper. · 51f3d02b

由 David S. Miller 提交于 11月 05, 2014

This encapsulates all of the skb_copy_datagram_iovec() callers
with call argument signature "skb, offset, msghdr->msg_iov, length".

When we move to iov_iters in the networking, the iov_iter object will
sit in the msghdr.

Having a helper like this means there will be less places to touch
during that transformation.

Based upon descriptions and patch from Al Viro.
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

51f3d02b

15 10月, 2014 1 次提交

net: Trap attempts to call sock_kfree_s() with a NULL pointer. · e53da5fb

由 David S. Miller 提交于 10月 14, 2014

Unlike normal kfree() it is never right to call sock_kfree_s() with
a NULL pointer, because sock_kfree_s() also has the side effect of
discharging the memory from the sockets quota.
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

e53da5fb

28 9月, 2014 1 次提交

net_dma: simple removal · 7bced397

由 Dan Williams 提交于 12月 30, 2013

Per commit "77873803 net_dma: mark broken" net_dma is no longer used
and there is no plan to fix it.

This is the mechanical removal of bits in CONFIG_NET_DMA ifdef guards.
Reverting the remainder of the net_dma induced changes is deferred to
subsequent patches.

Marked for stable due to Roman's report of a memory leak in
dma_pin_iovec_pages():

    https://lkml.org/lkml/2014/9/3/177

Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Vinod Koul <vinod.koul@intel.com>
Cc: David Whipple <whipple@securedatainnovations.ch>
Cc: Alexander Duyck <alexander.h.duyck@intel.com>
Cc: <stable@vger.kernel.org>
Reported-by: NRoman Gushchin <klamm@yandex-team.ru>
Acked-by: NDavid S. Miller <davem@davemloft.net>
Signed-off-by: NDan Williams <dan.j.williams@intel.com>

7bced397

20 9月, 2014 1 次提交

net: add alloc_skb_with_frags() helper · 2e4e4410

由 Eric Dumazet 提交于 9月 17, 2014

Extract from sock_alloc_send_pskb() code building skb with frags,
so that we can reuse this in other contexts.

Intent is to use it from tcp_send_rcvq(), tcp_collapse(), ...

We also want to replace some skb_linearize() calls to a more reliable
strategy in pathological cases where we need to reduce number of frags.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

2e4e4410

09 9月, 2014 1 次提交

net: fix skb_page_frag_refill() kerneldoc · 82d5e2b8

由 Eric Dumazet 提交于 9月 08, 2014

In commit d9b2938a ("net: attempt a single high order allocation)
I forgot to update kerneldoc, as @prio parameter was renamed to @gfp
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

82d5e2b8

06 9月, 2014 1 次提交

net: merge cases where sock_efree and sock_edemux are the same function · 82eabd9e

由 Alexander Duyck 提交于 9月 04, 2014

Since sock_efree and sock_demux are essentially the same code for non-TCP
sockets and the case where CONFIG_INET is not defined we can combine the
code or replace the call to sock_edemux in several spots. As a result we
can avoid a bit of unnecessary code or code duplication.
Signed-off-by: NAlexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

82eabd9e

openeuler / Kernel 大约 1 年 前同步成功

openeuler / Kernel
大约 1 年前同步成功