1. 07 12月, 2016 1 次提交
    • E
      net: sock_rps_record_flow() is for connected sockets · 5b8e2f61
      Eric Dumazet 提交于
      Paolo noticed a cache line miss in UDP recvmsg() to access
      sk_rxhash, sharing a cache line with sk_drops.
      
      sk_drops might be heavily incremented by cpus handling a flood targeting
      this socket.
      
      We might place sk_drops on a separate cache line, but lets try
      to avoid wasting 64 bytes per socket just for this, since we have
      other bottlenecks to take care of.
      
      sock_rps_record_flow() should only access sk_rxhash for connected
      flows.
      
      Testing sk_state for TCP_ESTABLISHED covers most of the cases for
      connected sockets, for a zero cost, since system calls using
      sock_rps_record_flow() also access sk->sk_prot which is on the
      same cache line.
      
      A follow up patch will provide a static_key (Jump Label) since most
      hosts do not even use RFS.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NPaolo Abeni <pabeni@redhat.com>
      Acked-by: NPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5b8e2f61
  2. 06 12月, 2016 1 次提交
    • E
      net: reorganize struct sock for better data locality · 9115e8cd
      Eric Dumazet 提交于
      Group fields used in TX path, and keep some cache lines mostly read
      to permit sharing among cpus.
      
      Gained two 4 bytes holes on 64bit arches.
      
      Added a place holder for tcp tsq_flags, next to sk_wmem_alloc
      to speed up tcp_wfree() in the following patch.
      
      I have not added ____cacheline_aligned_in_smp, this might be done later.
      I prefer doing this once inet and tcp/udp sockets reorg is also done.
      
      Tested with both TCP and UDP.
      
      UDP receiver performance under flood increased by ~20 % :
      Accessing sk_filter/sk_wq/sk_napi_id no longer stalls because sk_drops
      was moved away from a critical cache line, now mostly read and shared.
      
      	/* --- cacheline 4 boundary (256 bytes) --- */
      	unsigned int               sk_napi_id;           /* 0x100   0x4 */
      	int                        sk_rcvbuf;            /* 0x104   0x4 */
      	struct sk_filter *         sk_filter;            /* 0x108   0x8 */
      	union {
      		struct socket_wq * sk_wq;                /*         0x8 */
      		struct socket_wq * sk_wq_raw;            /*         0x8 */
      	};                                               /* 0x110   0x8 */
      	struct xfrm_policy *       sk_policy[2];         /* 0x118  0x10 */
      	struct dst_entry *         sk_rx_dst;            /* 0x128   0x8 */
      	struct dst_entry *         sk_dst_cache;         /* 0x130   0x8 */
      	atomic_t                   sk_omem_alloc;        /* 0x138   0x4 */
      	int                        sk_sndbuf;            /* 0x13c   0x4 */
      	/* --- cacheline 5 boundary (320 bytes) --- */
      	int                        sk_wmem_queued;       /* 0x140   0x4 */
      	atomic_t                   sk_wmem_alloc;        /* 0x144   0x4 */
      	long unsigned int          sk_tsq_flags;         /* 0x148   0x8 */
      	struct sk_buff *           sk_send_head;         /* 0x150   0x8 */
      	struct sk_buff_head        sk_write_queue;       /* 0x158  0x18 */
      	__s32                      sk_peek_off;          /* 0x170   0x4 */
      	int                        sk_write_pending;     /* 0x174   0x4 */
      	long int                   sk_sndtimeo;          /* 0x178   0x8 */
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Tested-by: NPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9115e8cd
  3. 03 12月, 2016 1 次提交
  4. 15 11月, 2016 1 次提交
  5. 05 11月, 2016 1 次提交
    • L
      net: core: Add a UID field to struct sock. · 86741ec2
      Lorenzo Colitti 提交于
      Protocol sockets (struct sock) don't have UIDs, but most of the
      time, they map 1:1 to userspace sockets (struct socket) which do.
      
      Various operations such as the iptables xt_owner match need
      access to the "UID of a socket", and do so by following the
      backpointer to the struct socket. This involves taking
      sk_callback_lock and doesn't work when there is no socket
      because userspace has already called close().
      
      Simplify this by adding a sk_uid field to struct sock whose value
      matches the UID of the corresponding struct socket. The semantics
      are as follows:
      
      1. Whenever sk_socket is non-null: sk_uid is the same as the UID
         in sk_socket, i.e., matches the return value of sock_i_uid.
         Specifically, the UID is set when userspace calls socket(),
         fchown(), or accept().
      2. When sk_socket is NULL, sk_uid is defined as follows:
         - For a socket that no longer has a sk_socket because
           userspace has called close(): the previous UID.
         - For a cloned socket (e.g., an incoming connection that is
           established but on which userspace has not yet called
           accept): the UID of the socket it was cloned from.
         - For a socket that has never had an sk_socket: UID 0 inside
           the user namespace corresponding to the network namespace
           the socket belongs to.
      
      Kernel sockets created by sock_create_kern are a special case
      of #1 and sk_uid is the user that created them. For kernel
      sockets created at network namespace creation time, such as the
      per-processor ICMP and TCP sockets, this is the user that created
      the network namespace.
      Signed-off-by: NLorenzo Colitti <lorenzo@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      86741ec2
  6. 04 11月, 2016 1 次提交
    • E
      dccp: do not release listeners too soon · c3f24cfb
      Eric Dumazet 提交于
      Andrey Konovalov reported following error while fuzzing with syzkaller :
      
      IPv4: Attempt to release alive inet socket ffff880068e98940
      kasan: CONFIG_KASAN_INLINE enabled
      kasan: GPF could be caused by NULL-ptr deref or user memory access
      general protection fault: 0000 [#1] SMP KASAN
      Modules linked in:
      CPU: 1 PID: 3905 Comm: a.out Not tainted 4.9.0-rc3+ #333
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
      task: ffff88006b9e0000 task.stack: ffff880068770000
      RIP: 0010:[<ffffffff819ead5f>]  [<ffffffff819ead5f>]
      selinux_socket_sock_rcv_skb+0xff/0x6a0 security/selinux/hooks.c:4639
      RSP: 0018:ffff8800687771c8  EFLAGS: 00010202
      RAX: ffff88006b9e0000 RBX: 1ffff1000d0eee3f RCX: 1ffff1000d1d312a
      RDX: 1ffff1000d1d31a6 RSI: dffffc0000000000 RDI: 0000000000000010
      RBP: ffff880068777360 R08: 0000000000000000 R09: 0000000000000002
      R10: dffffc0000000000 R11: 0000000000000006 R12: ffff880068e98940
      R13: 0000000000000002 R14: ffff880068777338 R15: 0000000000000000
      FS:  00007f00ff760700(0000) GS:ffff88006cd00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000020008000 CR3: 000000006a308000 CR4: 00000000000006e0
      Stack:
       ffff8800687771e0 ffffffff812508a5 ffff8800686f3168 0000000000000007
       ffff88006ac8cdfc ffff8800665ea500 0000000041b58ab3 ffffffff847b5480
       ffffffff819eac60 ffff88006b9e0860 ffff88006b9e0868 ffff88006b9e07f0
      Call Trace:
       [<ffffffff819c8dd5>] security_sock_rcv_skb+0x75/0xb0 security/security.c:1317
       [<ffffffff82c2a9e7>] sk_filter_trim_cap+0x67/0x10e0 net/core/filter.c:81
       [<ffffffff82b81e60>] __sk_receive_skb+0x30/0xa00 net/core/sock.c:460
       [<ffffffff838bbf12>] dccp_v4_rcv+0xdb2/0x1910 net/dccp/ipv4.c:873
       [<ffffffff83069d22>] ip_local_deliver_finish+0x332/0xad0
      net/ipv4/ip_input.c:216
       [<     inline     >] NF_HOOK_THRESH ./include/linux/netfilter.h:232
       [<     inline     >] NF_HOOK ./include/linux/netfilter.h:255
       [<ffffffff8306abd2>] ip_local_deliver+0x1c2/0x4b0 net/ipv4/ip_input.c:257
       [<     inline     >] dst_input ./include/net/dst.h:507
       [<ffffffff83068500>] ip_rcv_finish+0x750/0x1c40 net/ipv4/ip_input.c:396
       [<     inline     >] NF_HOOK_THRESH ./include/linux/netfilter.h:232
       [<     inline     >] NF_HOOK ./include/linux/netfilter.h:255
       [<ffffffff8306b82f>] ip_rcv+0x96f/0x12f0 net/ipv4/ip_input.c:487
       [<ffffffff82bd9fb7>] __netif_receive_skb_core+0x1897/0x2a50 net/core/dev.c:4213
       [<ffffffff82bdb19a>] __netif_receive_skb+0x2a/0x170 net/core/dev.c:4251
       [<ffffffff82bdb493>] netif_receive_skb_internal+0x1b3/0x390 net/core/dev.c:4279
       [<ffffffff82bdb6b8>] netif_receive_skb+0x48/0x250 net/core/dev.c:4303
       [<ffffffff8241fc75>] tun_get_user+0xbd5/0x28a0 drivers/net/tun.c:1308
       [<ffffffff82421b5a>] tun_chr_write_iter+0xda/0x190 drivers/net/tun.c:1332
       [<     inline     >] new_sync_write fs/read_write.c:499
       [<ffffffff8151bd44>] __vfs_write+0x334/0x570 fs/read_write.c:512
       [<ffffffff8151f85b>] vfs_write+0x17b/0x500 fs/read_write.c:560
       [<     inline     >] SYSC_write fs/read_write.c:607
       [<ffffffff81523184>] SyS_write+0xd4/0x1a0 fs/read_write.c:599
       [<ffffffff83fc02c1>] entry_SYSCALL_64_fastpath+0x1f/0xc2
      
      It turns out DCCP calls __sk_receive_skb(), and this broke when
      lookups no longer took a reference on listeners.
      
      Fix this issue by adding a @refcounted parameter to __sk_receive_skb(),
      so that sock_put() is used only when needed.
      
      Fixes: 3b24d854 ("tcp/dccp: do not touch listener sk_refcnt under synflood")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NAndrey Konovalov <andreyknvl@google.com>
      Tested-by: NAndrey Konovalov <andreyknvl@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c3f24cfb
  7. 01 11月, 2016 1 次提交
    • E
      net: set SK_MEM_QUANTUM to 4096 · bd68a2a8
      Eric Dumazet 提交于
      Systems with large pages (64KB pages for example) do not always have
      huge quantity of memory.
      
      A big SK_MEM_QUANTUM value leads to fewer interactions with the
      global counters (like tcp_memory_allocated) but might trigger
      memory pressure much faster, giving suboptimal TCP performance
      since windows are lowered to ridiculous values.
      
      Note that sysctl_mem units being in pages and in ABI, we also need
      to change sk_prot_mem_limits() accordingly.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      bd68a2a8
  8. 27 10月, 2016 1 次提交
  9. 23 10月, 2016 1 次提交
  10. 17 9月, 2016 1 次提交
    • E
      net: avoid sk_forward_alloc overflows · 20c64d5c
      Eric Dumazet 提交于
      A malicious TCP receiver, sending SACK, can force the sender to split
      skbs in write queue and increase its memory usage.
      
      Then, when socket is closed and its write queue purged, we might
      overflow sk_forward_alloc (It becomes negative)
      
      sk_mem_reclaim() does nothing in this case, and more than 2GB
      are leaked from TCP perspective (tcp_memory_allocated is not changed)
      
      Then warnings trigger from inet_sock_destruct() and
      sk_stream_kill_queues() seeing a not zero sk_forward_alloc
      
      All TCP stack can be stuck because TCP is under memory pressure.
      
      A simple fix is to preemptively reclaim from sk_mem_uncharge().
      
      This makes sure a socket wont have more than 2 MB forward allocated,
      after burst and idle period.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      20c64d5c
  11. 24 8月, 2016 2 次提交
  12. 19 8月, 2016 1 次提交
  13. 14 7月, 2016 1 次提交
    • W
      dccp: limit sk_filter trim to payload · 4f0c40d9
      Willem de Bruijn 提交于
      Dccp verifies packet integrity, including length, at initial rcv in
      dccp_invalid_packet, later pulls headers in dccp_enqueue_skb.
      
      A call to sk_filter in-between can cause __skb_pull to wrap skb->len.
      skb_copy_datagram_msg interprets this as a negative value, so
      (correctly) fails with EFAULT. The negative length is reported in
      ioctl SIOCINQ or possibly in a DCCP_WARN in dccp_close.
      
      Introduce an sk_receive_skb variant that caps how small a filter
      program can trim packets, and call this in dccp with the header
      length. Excessively trimmed packets are now processed normally and
      queued for reception as 0B payloads.
      
      Fixes: 7c657876 ("[DCCP]: Initial implementation")
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4f0c40d9
  14. 21 5月, 2016 1 次提交
  15. 05 5月, 2016 1 次提交
  16. 04 5月, 2016 1 次提交
    • E
      net: add __sock_wfree() helper · 1d2077ac
      Eric Dumazet 提交于
      Hosts sending lot of ACK packets exhibit high sock_wfree() cost
      because of cache line miss to test SOCK_USE_WRITE_QUEUE
      
      We could move this flag close to sk_wmem_alloc but it is better
      to perform the atomic_sub_and_test() on a clean cache line,
      as it avoid one extra bus transaction.
      
      skb_orphan_partial() can also have a fast track for packets that either
      are TCP acks, or already went through another skb_orphan_partial()
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1d2077ac
  17. 03 5月, 2016 1 次提交
    • E
      tcp: make tcp_sendmsg() aware of socket backlog · d41a69f1
      Eric Dumazet 提交于
      Large sendmsg()/write() hold socket lock for the duration of the call,
      unless sk->sk_sndbuf limit is hit. This is bad because incoming packets
      are parked into socket backlog for a long time.
      Critical decisions like fast retransmit might be delayed.
      Receivers have to maintain a big out of order queue with additional cpu
      overhead, and also possible stalls in TX once windows are full.
      
      Bidirectional flows are particularly hurt since the backlog can become
      quite big if the copy from user space triggers IO (page faults)
      
      Some applications learnt to use sendmsg() (or sendmmsg()) with small
      chunks to avoid this issue.
      
      Kernel should know better, right ?
      
      Add a generic sk_flush_backlog() helper and use it right
      before a new skb is allocated. Typically we put 64KB of payload
      per skb (unless MSG_EOR is requested) and checking socket backlog
      every 64KB gives good results.
      
      As a matter of fact, tests with TSO/GSO disabled give very nice
      results, as we manage to keep a small write queue and smaller
      perceived rtt.
      
      Note that sk_flush_backlog() maintains socket ownership,
      so is not equivalent to a {release_sock(sk); lock_sock(sk);},
      to ensure implicit atomicity rules that sendmsg() was
      giving to (possibly buggy) applications.
      
      In this simple implementation, I chose to not call tcp_release_cb(),
      but we might consider this later.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Alexei Starovoitov <ast@fb.com>
      Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Acked-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d41a69f1
  18. 28 4月, 2016 2 次提交
  19. 26 4月, 2016 1 次提交
  20. 25 4月, 2016 1 次提交
  21. 15 4月, 2016 1 次提交
    • C
      soreuseport: fix ordering for mixed v4/v6 sockets · d894ba18
      Craig Gallek 提交于
      With the SO_REUSEPORT socket option, it is possible to create sockets
      in the AF_INET and AF_INET6 domains which are bound to the same IPv4 address.
      This is only possible with SO_REUSEPORT and when not using IPV6_V6ONLY on
      the AF_INET6 sockets.
      
      Prior to the commits referenced below, an incoming IPv4 packet would
      always be routed to a socket of type AF_INET when this mixed-mode was used.
      After those changes, the same packet would be routed to the most recently
      bound socket (if this happened to be an AF_INET6 socket, it would
      have an IPv4 mapped IPv6 address).
      
      The change in behavior occurred because the recent SO_REUSEPORT optimizations
      short-circuit the socket scoring logic as soon as they find a match.  They
      did not take into account the scoring logic that favors AF_INET sockets
      over AF_INET6 sockets in the event of a tie.
      
      To fix this problem, this patch changes the insertion order of AF_INET
      and AF_INET6 addresses in the TCP and UDP socket lists when the sockets
      have SO_REUSEPORT set.  AF_INET sockets will be inserted at the head of the
      list and AF_INET6 sockets with SO_REUSEPORT set will always be inserted at
      the tail of the list.  This will force AF_INET sockets to always be
      considered first.
      
      Fixes: e32ea7e7 ("soreuseport: fast reuseport UDP socket selection")
      Fixes: 125e80b88687 ("soreuseport: fast reuseport TCP socket selection")
      Reported-by: NMaciej Żenczykowski <maze@google.com>
      Signed-off-by: NCraig Gallek <kraig@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d894ba18
  22. 14 4月, 2016 2 次提交
    • D
      net: force inlining of netif_tx_start/stop_queue, sock_hold, __sock_put · f9a7cbbf
      Denys Vlasenko 提交于
      Sometimes gcc mysteriously doesn't inline
      very small functions we expect to be inlined. See
          https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66122
      Arguably, gcc should do better, but gcc people aren't willing
      to invest time into it, asking to use __always_inline instead.
      
      With this .config:
      http://busybox.net/~vda/kernel_config_OPTIMIZE_INLINING_and_Os,
      the following functions get deinlined many times.
      
      netif_tx_stop_queue: 207 copies, 590 calls:
      	55                      push   %rbp
      	48 89 e5                mov    %rsp,%rbp
      	f0 80 8f e0 01 00 00 01 lock orb $0x1,0x1e0(%rdi)
      	5d                      pop    %rbp
      	c3                      retq
      
      netif_tx_start_queue: 47 copies, 111 calls
      	55                      push   %rbp
      	48 89 e5                mov    %rsp,%rbp
      	f0 80 a7 e0 01 00 00 fe lock andb $0xfe,0x1e0(%rdi)
      	5d                      pop    %rbp
      	c3                      retq
      
      sock_hold: 39 copies, 124 calls
      	55                      push   %rbp
      	48 89 e5                mov    %rsp,%rbp
      	f0 ff 87 80 00 00 00    lock incl 0x80(%rdi)
      	5d                      pop    %rbp
      	c3                      retq
      
      __sock_put: 6 copies, 13 calls
      	55                      push   %rbp
      	48 89 e5                mov    %rsp,%rbp
      	f0 ff 8f 80 00 00 00    lock decl 0x80(%rdi)
      	5d                      pop    %rbp
      	c3                      retq
      
      This patch fixes this via s/inline/__always_inline/.
      
      Code size decrease after the patch is ~2.5k:
      
          text      data      bss       dec     hex filename
      56719876  56364551 36196352 149280779 8e5d80b vmlinux_before
      56717440  56364551 36196352 149278343 8e5ce87 vmlinux
      Signed-off-by: NDenys Vlasenko <dvlasenk@redhat.com>
      CC: David S. Miller <davem@davemloft.net>
      CC: linux-kernel@vger.kernel.org
      CC: netdev@vger.kernel.org
      CC: netfilter-devel@vger.kernel.org
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f9a7cbbf
    • H
      sock: tigthen lockdep checks for sock_owned_by_user · fafc4e1e
      Hannes Frederic Sowa 提交于
      sock_owned_by_user should not be used without socket lock held. It seems
      to be a common practice to check .owned before lock reclassification, so
      provide a little help to abstract this check away.
      
      Cc: linux-cifs@vger.kernel.org
      Cc: linux-bluetooth@vger.kernel.org
      Cc: linux-nfs@vger.kernel.org
      Signed-off-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      fafc4e1e
  23. 08 4月, 2016 4 次提交
  24. 06 4月, 2016 3 次提交
  25. 05 4月, 2016 6 次提交
    • E
      tcp: increment sk_drops for dropped rx packets · 532182cd
      Eric Dumazet 提交于
      Now ss can report sk_drops, we can instruct TCP to increment
      this per socket counter when it drops an incoming frame, to refine
      monitoring and debugging.
      
      Following patch takes care of listeners drops.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      532182cd
    • E
      udp: no longer use SLAB_DESTROY_BY_RCU · ca065d0c
      Eric Dumazet 提交于
      Tom Herbert would like not touching UDP socket refcnt for encapsulated
      traffic. For this to happen, we need to use normal RCU rules, with a grace
      period before freeing a socket. UDP sockets are not short lived in the
      high usage case, so the added cost of call_rcu() should not be a concern.
      
      This actually removes a lot of complexity in UDP stack.
      
      Multicast receives no longer need to hold a bucket spinlock.
      
      Note that ip early demux still needs to take a reference on the socket.
      
      Same remark for functions used by xt_socket and xt_PROXY netfilter modules,
      but this might be changed later.
      
      Performance for a single UDP socket receiving flood traffic from
      many RX queues/cpus.
      
      Simple udp_rx using simple recvfrom() loop :
      438 kpps instead of 374 kpps : 17 % increase of the peak rate.
      
      v2: Addressed Willem de Bruijn feedback in multicast handling
       - keep early demux break in __udp4_lib_demux_lookup()
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Tom Herbert <tom@herbertland.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Tested-by: NTom Herbert <tom@herbertland.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ca065d0c
    • E
      net: add SOCK_RCU_FREE socket flag · a4298e45
      Eric Dumazet 提交于
      We want a generic way to insert an RCU grace period before socket
      freeing for cases where RCU_SLAB_DESTROY_BY_RCU is adding too
      much overhead.
      
      SLAB_DESTROY_BY_RCU strict rules force us to take a reference
      on the socket sk_refcnt, and it is a performance problem for UDP
      encapsulation, or TCP synflood behavior, as many CPUs might
      attempt the atomic operations on a shared sk_refcnt
      
      UDP sockets and TCP listeners can set SOCK_RCU_FREE so that their
      lookup can use traditional RCU rules, without refcount changes.
      They can set the flag only once hashed and visible by other cpus.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Tom Herbert <tom@herbertland.com>
      Tested-by: NTom Herbert <tom@herbertland.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a4298e45
    • S
      sock: enable timestamping using control messages · c14ac945
      Soheil Hassas Yeganeh 提交于
      Currently, SOL_TIMESTAMPING can only be enabled using setsockopt.
      This is very costly when users want to sample writes to gather
      tx timestamps.
      
      Add support for enabling SO_TIMESTAMPING via control messages by
      using tsflags added in `struct sockcm_cookie` (added in the previous
      patches in this series) to set the tx_flags of the last skb created in
      a sendmsg. With this patch, the timestamp recording bits in tx_flags
      of the skbuff is overridden if SO_TIMESTAMPING is passed in a cmsg.
      
      Please note that this is only effective for overriding the recording
      timestamps flags. Users should enable timestamp reporting (e.g.,
      SOF_TIMESTAMPING_SOFTWARE | SOF_TIMESTAMPING_OPT_ID) using
      socket options and then should ask for SOF_TIMESTAMPING_TX_*
      using control messages per sendmsg to sample timestamps for each
      write.
      Signed-off-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Acked-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c14ac945
    • S
      sock: accept SO_TIMESTAMPING flags in socket cmsg · 3dd17e63
      Soheil Hassas Yeganeh 提交于
      Accept SO_TIMESTAMPING in control messages of the SOL_SOCKET level
      as a basis to accept timestamping requests per write.
      
      This implementation only accepts TX recording flags (i.e.,
      SOF_TIMESTAMPING_TX_HARDWARE, SOF_TIMESTAMPING_TX_SOFTWARE,
      SOF_TIMESTAMPING_TX_SCHED, and SOF_TIMESTAMPING_TX_ACK) in
      control messages. Users need to set reporting flags (e.g.,
      SOF_TIMESTAMPING_OPT_ID) per socket via socket options.
      
      This commit adds a tsflags field in sockcm_cookie which is
      set in __sock_cmsg_send. It only override the SOF_TIMESTAMPING_TX_*
      bits in sockcm_cookie.tsflags allowing the control message
      to override the recording behavior per write, yet maintaining
      the value of other flags.
      
      This patch implements validating the control message and setting
      tsflags in struct sockcm_cookie. Next commits in this series will
      actually implement timestamping per write for different protocols.
      Signed-off-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Acked-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3dd17e63
    • W
      sock: break up sock_cmsg_snd into __sock_cmsg_snd and loop · 39771b12
      Willem de Bruijn 提交于
      To process cmsg's of the SOL_SOCKET level in addition to
      cmsgs of another level, protocols can call sock_cmsg_send().
      This causes a double walk on the cmsghdr list, one for SOL_SOCKET
      and one for the other level.
      
      Extract the inner demultiplex logic from the loop that walks the list,
      to allow having this called directly from a walker in the protocol
      specific code.
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      39771b12
  26. 11 2月, 2016 1 次提交
  27. 22 1月, 2016 1 次提交