1. 18 11月, 2016 1 次提交
    • E
      udp: enable busy polling for all sockets · e68b6e50
      Eric Dumazet 提交于
      UDP busy polling is restricted to connected UDP sockets.
      
      This is because sk_busy_loop() only takes care of one NAPI context.
      
      There are cases where it could be extended.
      
      1) Some hosts receive traffic on a single NIC, with one RX queue.
      
      2) Some applications use SO_REUSEPORT and associated BPF filter
         to split the incoming traffic on one UDP socket per RX
      queue/thread/cpu
      
      3) Some UDP sockets are used to send/receive traffic for one flow, but
      they do not bother with connect()
      
      This patch records the napi_id of first received skb, giving more
      reach to busy polling.
      
      Tested:
      
      lpaa23:~# echo 70 >/proc/sys/net/core/busy_read
      lpaa24:~# echo 70 >/proc/sys/net/core/busy_read
      
      lpaa23:~# for f in `seq 1 10`; do ./super_netperf 1 -H lpaa24 -t UDP_RR -l 5; done
      
      Before patch :
         27867   28870   37324   41060   41215
         36764   36838   44455   41282   43843
      After patch :
         73920   73213   70147   74845   71697
         68315   68028   75219   70082   73707
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e68b6e50
  2. 17 11月, 2016 3 次提交
  3. 16 11月, 2016 2 次提交
  4. 15 11月, 2016 1 次提交
  5. 14 11月, 2016 3 次提交
  6. 10 11月, 2016 12 次提交
  7. 08 11月, 2016 3 次提交
  8. 05 11月, 2016 3 次提交
    • L
      net: inet: Support UID-based routing in IP protocols. · e2d118a1
      Lorenzo Colitti 提交于
      - Use the UID in routing lookups made by protocol connect() and
        sendmsg() functions.
      - Make sure that routing lookups triggered by incoming packets
        (e.g., Path MTU discovery) take the UID of the socket into
        account.
      - For packets not associated with a userspace socket, (e.g., ping
        replies) use UID 0 inside the user namespace corresponding to
        the network namespace the socket belongs to. This allows
        all namespaces to apply routing and iptables rules to
        kernel-originated traffic in that namespaces by matching UID 0.
        This is better than using the UID of the kernel socket that is
        sending the traffic, because the UID of kernel sockets created
        at namespace creation time (e.g., the per-processor ICMP and
        TCP sockets) is the UID of the user that created the socket,
        which might not be mapped in the namespace.
      
      Tested: compiles allnoconfig, allyesconfig, allmodconfig
      Tested: https://android-review.googlesource.com/253302Signed-off-by: NLorenzo Colitti <lorenzo@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e2d118a1
    • L
      net: core: add UID to flows, rules, and routes · 622ec2c9
      Lorenzo Colitti 提交于
      - Define a new FIB rule attributes, FRA_UID_RANGE, to describe a
        range of UIDs.
      - Define a RTA_UID attribute for per-UID route lookups and dumps.
      - Support passing these attributes to and from userspace via
        rtnetlink. The value INVALID_UID indicates no UID was
        specified.
      - Add a UID field to the flow structures.
      Signed-off-by: NLorenzo Colitti <lorenzo@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      622ec2c9
    • L
      net: core: Add a UID field to struct sock. · 86741ec2
      Lorenzo Colitti 提交于
      Protocol sockets (struct sock) don't have UIDs, but most of the
      time, they map 1:1 to userspace sockets (struct socket) which do.
      
      Various operations such as the iptables xt_owner match need
      access to the "UID of a socket", and do so by following the
      backpointer to the struct socket. This involves taking
      sk_callback_lock and doesn't work when there is no socket
      because userspace has already called close().
      
      Simplify this by adding a sk_uid field to struct sock whose value
      matches the UID of the corresponding struct socket. The semantics
      are as follows:
      
      1. Whenever sk_socket is non-null: sk_uid is the same as the UID
         in sk_socket, i.e., matches the return value of sock_i_uid.
         Specifically, the UID is set when userspace calls socket(),
         fchown(), or accept().
      2. When sk_socket is NULL, sk_uid is defined as follows:
         - For a socket that no longer has a sk_socket because
           userspace has called close(): the previous UID.
         - For a cloned socket (e.g., an incoming connection that is
           established but on which userspace has not yet called
           accept): the UID of the socket it was cloned from.
         - For a socket that has never had an sk_socket: UID 0 inside
           the user namespace corresponding to the network namespace
           the socket belongs to.
      
      Kernel sockets created by sock_create_kern are a special case
      of #1 and sk_uid is the user that created them. For kernel
      sockets created at network namespace creation time, such as the
      per-processor ICMP and TCP sockets, this is the user that created
      the network namespace.
      Signed-off-by: NLorenzo Colitti <lorenzo@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      86741ec2
  9. 04 11月, 2016 4 次提交
    • E
      dccp: do not release listeners too soon · c3f24cfb
      Eric Dumazet 提交于
      Andrey Konovalov reported following error while fuzzing with syzkaller :
      
      IPv4: Attempt to release alive inet socket ffff880068e98940
      kasan: CONFIG_KASAN_INLINE enabled
      kasan: GPF could be caused by NULL-ptr deref or user memory access
      general protection fault: 0000 [#1] SMP KASAN
      Modules linked in:
      CPU: 1 PID: 3905 Comm: a.out Not tainted 4.9.0-rc3+ #333
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
      task: ffff88006b9e0000 task.stack: ffff880068770000
      RIP: 0010:[<ffffffff819ead5f>]  [<ffffffff819ead5f>]
      selinux_socket_sock_rcv_skb+0xff/0x6a0 security/selinux/hooks.c:4639
      RSP: 0018:ffff8800687771c8  EFLAGS: 00010202
      RAX: ffff88006b9e0000 RBX: 1ffff1000d0eee3f RCX: 1ffff1000d1d312a
      RDX: 1ffff1000d1d31a6 RSI: dffffc0000000000 RDI: 0000000000000010
      RBP: ffff880068777360 R08: 0000000000000000 R09: 0000000000000002
      R10: dffffc0000000000 R11: 0000000000000006 R12: ffff880068e98940
      R13: 0000000000000002 R14: ffff880068777338 R15: 0000000000000000
      FS:  00007f00ff760700(0000) GS:ffff88006cd00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000020008000 CR3: 000000006a308000 CR4: 00000000000006e0
      Stack:
       ffff8800687771e0 ffffffff812508a5 ffff8800686f3168 0000000000000007
       ffff88006ac8cdfc ffff8800665ea500 0000000041b58ab3 ffffffff847b5480
       ffffffff819eac60 ffff88006b9e0860 ffff88006b9e0868 ffff88006b9e07f0
      Call Trace:
       [<ffffffff819c8dd5>] security_sock_rcv_skb+0x75/0xb0 security/security.c:1317
       [<ffffffff82c2a9e7>] sk_filter_trim_cap+0x67/0x10e0 net/core/filter.c:81
       [<ffffffff82b81e60>] __sk_receive_skb+0x30/0xa00 net/core/sock.c:460
       [<ffffffff838bbf12>] dccp_v4_rcv+0xdb2/0x1910 net/dccp/ipv4.c:873
       [<ffffffff83069d22>] ip_local_deliver_finish+0x332/0xad0
      net/ipv4/ip_input.c:216
       [<     inline     >] NF_HOOK_THRESH ./include/linux/netfilter.h:232
       [<     inline     >] NF_HOOK ./include/linux/netfilter.h:255
       [<ffffffff8306abd2>] ip_local_deliver+0x1c2/0x4b0 net/ipv4/ip_input.c:257
       [<     inline     >] dst_input ./include/net/dst.h:507
       [<ffffffff83068500>] ip_rcv_finish+0x750/0x1c40 net/ipv4/ip_input.c:396
       [<     inline     >] NF_HOOK_THRESH ./include/linux/netfilter.h:232
       [<     inline     >] NF_HOOK ./include/linux/netfilter.h:255
       [<ffffffff8306b82f>] ip_rcv+0x96f/0x12f0 net/ipv4/ip_input.c:487
       [<ffffffff82bd9fb7>] __netif_receive_skb_core+0x1897/0x2a50 net/core/dev.c:4213
       [<ffffffff82bdb19a>] __netif_receive_skb+0x2a/0x170 net/core/dev.c:4251
       [<ffffffff82bdb493>] netif_receive_skb_internal+0x1b3/0x390 net/core/dev.c:4279
       [<ffffffff82bdb6b8>] netif_receive_skb+0x48/0x250 net/core/dev.c:4303
       [<ffffffff8241fc75>] tun_get_user+0xbd5/0x28a0 drivers/net/tun.c:1308
       [<ffffffff82421b5a>] tun_chr_write_iter+0xda/0x190 drivers/net/tun.c:1332
       [<     inline     >] new_sync_write fs/read_write.c:499
       [<ffffffff8151bd44>] __vfs_write+0x334/0x570 fs/read_write.c:512
       [<ffffffff8151f85b>] vfs_write+0x17b/0x500 fs/read_write.c:560
       [<     inline     >] SYSC_write fs/read_write.c:607
       [<ffffffff81523184>] SyS_write+0xd4/0x1a0 fs/read_write.c:599
       [<ffffffff83fc02c1>] entry_SYSCALL_64_fastpath+0x1f/0xc2
      
      It turns out DCCP calls __sk_receive_skb(), and this broke when
      lookups no longer took a reference on listeners.
      
      Fix this issue by adding a @refcounted parameter to __sk_receive_skb(),
      so that sock_put() is used only when needed.
      
      Fixes: 3b24d854 ("tcp/dccp: do not touch listener sk_refcnt under synflood")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NAndrey Konovalov <andreyknvl@google.com>
      Tested-by: NAndrey Konovalov <andreyknvl@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c3f24cfb
    • L
      ipv4: allow local fragmentation in ip_finish_output_gso() · 9ee6c5dc
      Lance Richardson 提交于
      Some configurations (e.g. geneve interface with default
      MTU of 1500 over an ethernet interface with 1500 MTU) result
      in the transmission of packets that exceed the configured MTU.
      While this should be considered to be a "bad" configuration,
      it is still allowed and should not result in the sending
      of packets that exceed the configured MTU.
      
      Fix by dropping the assumption in ip_finish_output_gso() that
      locally originated gso packets will never need fragmentation.
      Basic testing using iperf (observing CPU usage and bandwidth)
      have shown no measurable performance impact for traffic not
      requiring fragmentation.
      
      Fixes: c7ba65d7 ("net: ip: push gso skb forwarding handling down the stack")
      Reported-by: NJan Tluka <jtluka@redhat.com>
      Signed-off-by: NLance Richardson <lrichard@redhat.com>
      Acked-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9ee6c5dc
    • D
      net: tcp: check skb is non-NULL for exact match on lookups · da96786e
      David Ahern 提交于
      Andrey reported the following error report while running the syzkaller
      fuzzer:
      
      general protection fault: 0000 [#1] SMP KASAN
      Dumping ftrace buffer:
         (ftrace buffer empty)
      Modules linked in:
      CPU: 0 PID: 648 Comm: syz-executor Not tainted 4.9.0-rc3+ #333
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
      task: ffff8800398c4480 task.stack: ffff88003b468000
      RIP: 0010:[<ffffffff83091106>]  [<     inline     >]
      inet_exact_dif_match include/net/tcp.h:808
      RIP: 0010:[<ffffffff83091106>]  [<ffffffff83091106>]
      __inet_lookup_listener+0xb6/0x500 net/ipv4/inet_hashtables.c:219
      RSP: 0018:ffff88003b46f270  EFLAGS: 00010202
      RAX: 0000000000000004 RBX: 0000000000004242 RCX: 0000000000000001
      RDX: 0000000000000000 RSI: ffffc90000e3c000 RDI: 0000000000000054
      RBP: ffff88003b46f2d8 R08: 0000000000004000 R09: ffffffff830910e7
      R10: 0000000000000000 R11: 000000000000000a R12: ffffffff867fa0c0
      R13: 0000000000004242 R14: 0000000000000003 R15: dffffc0000000000
      FS:  00007fb135881700(0000) GS:ffff88003ec00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000020cc3000 CR3: 000000006d56a000 CR4: 00000000000006f0
      Stack:
       0000000000000000 000000000601a8c0 0000000000000000 ffffffff00004242
       424200003b9083c2 ffff88003def4041 ffffffff84e7e040 0000000000000246
       ffff88003a0911c0 0000000000000000 ffff88003a091298 ffff88003b9083ae
      Call Trace:
       [<ffffffff831100f4>] tcp_v4_send_reset+0x584/0x1700 net/ipv4/tcp_ipv4.c:643
       [<ffffffff83115b1b>] tcp_v4_rcv+0x198b/0x2e50 net/ipv4/tcp_ipv4.c:1718
       [<ffffffff83069d22>] ip_local_deliver_finish+0x332/0xad0
      net/ipv4/ip_input.c:216
      ...
      
      MD5 has a code path that calls __inet_lookup_listener with a null skb,
      so inet{6}_exact_dif_match needs to check skb against null before pulling
      the flag.
      
      Fixes: a04a480d ("net: Require exact match for TCP socket lookups if
             dif is l3mdev")
      Reported-by: NAndrey Konovalov <andreyknvl@google.com>
      Signed-off-by: NDavid Ahern <dsa@cumulusnetworks.com>
      Tested-by: NAndrey Konovalov <andreyknvl@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      da96786e
    • W
      ipv4: add IP_RECVFRAGSIZE cmsg · 70ecc248
      Willem de Bruijn 提交于
      The IP stack records the largest fragment of a reassembled packet
      in IPCB(skb)->frag_max_size. When reading a datagram or raw packet
      that arrived fragmented, expose the value to allow applications to
      estimate receive path MTU.
      
      Tested:
        Sent data over a veth pair of which the source has a small mtu.
        Sent data using netcat, received using a dedicated process.
      
        Verified that the cmsg IP_RECVFRAGSIZE is returned only when
        data arrives fragmented, and in that cases matches the veth mtu.
      
          ip link add veth0 type veth peer name veth1
      
          ip netns add from
          ip netns add to
      
          ip link set dev veth1 netns to
          ip netns exec to ip addr add dev veth1 192.168.10.1/24
          ip netns exec to ip link set dev veth1 up
      
          ip link set dev veth0 netns from
          ip netns exec from ip addr add dev veth0 192.168.10.2/24
          ip netns exec from ip link set dev veth0 up
          ip netns exec from ip link set dev veth0 mtu 1300
          ip netns exec from ethtool -K veth0 ufo off
      
          dd if=/dev/zero bs=1 count=1400 2>/dev/null > payload
      
          ip netns exec to ./recv_cmsg_recvfragsize -4 -u -p 6000 &
          ip netns exec from nc -q 1 -u 192.168.10.1 6000 < payload
      
        using github.com/wdebruij/kerneltools/blob/master/tests/recvfragsize.c
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      70ecc248
  10. 03 11月, 2016 4 次提交
  11. 02 11月, 2016 3 次提交
    • P
      netfilter: move socket lookup infrastructure to nf_socket_ipv{4,6}.c · 8db4c5be
      Pablo Neira Ayuso 提交于
      We need this split to reuse existing codebase for the upcoming nf_tables
      socket expression.
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      8db4c5be
    • P
      netfilter: nf_log: add packet logging for netdev family · 1fddf4ba
      Pablo Neira Ayuso 提交于
      Move layer 2 packet logging into nf_log_l2packet() that resides in
      nf_log_common.c, so this can be shared by both bridge and netdev
      families.
      
      This patch adds the boiler plate code to register the netdev logging
      family.
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      1fddf4ba
    • F
      netfilter: nf_tables: add fib expression · f6d0cbcf
      Florian Westphal 提交于
      Add FIB expression, supported for ipv4, ipv6 and inet family (the latter
      just dispatches to ipv4 or ipv6 one based on nfproto).
      
      Currently supports fetching output interface index/name and the
      rtm_type associated with an address.
      
      This can be used for adding path filtering. rtm_type is useful
      to e.g. enforce a strong-end host model where packets
      are only accepted if daddr is configured on the interface the
      packet arrived on.
      
      The fib expression is a native nftables alternative to the
      xtables addrtype and rp_filter matches.
      
      FIB result order for oif/oifname retrieval is as follows:
       - if packet is local (skb has rtable, RTF_LOCAL set, this
         will also catch looped-back multicast packets), set oif to
         the loopback interface.
       - if fib lookup returns an error, or result points to local,
         store zero result.  This means '--local' option of -m rpfilter
         is not supported. It is possible to use 'fib type local' or add
         explicit saddr/daddr matching rules to create exceptions if this
         is really needed.
       - store result in the destination register.
         In case of multiple routes, search set for desired oif in case
         strict matching is requested.
      
      ipv4 and ipv6 behave fib expressions are supposed to behave the same.
      
      [ I have collapsed Arnd Bergmann's ("netfilter: nf_tables: fib warnings")
      
      	http://patchwork.ozlabs.org/patch/688615/
      
        to address fallout from this patch after rebasing nf-next, that was
        posted to address compilation warnings. --pablo ]
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      f6d0cbcf
  12. 01 11月, 2016 1 次提交
    • E
      net: set SK_MEM_QUANTUM to 4096 · bd68a2a8
      Eric Dumazet 提交于
      Systems with large pages (64KB pages for example) do not always have
      huge quantity of memory.
      
      A big SK_MEM_QUANTUM value leads to fewer interactions with the
      global counters (like tcp_memory_allocated) but might trigger
      memory pressure much faster, giving suboptimal TCP performance
      since windows are lowered to ridiculous values.
      
      Note that sysctl_mem units being in pages and in ABI, we also need
      to change sk_prot_mem_limits() accordingly.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      bd68a2a8