1. 22 Nov 2016: 3 commits
  2. 20 Nov 2016: 1 commit
    • net: fix bogus cast in skb_pagelen() and use unsigned variables · c72d8cda
      Committed by Alexey Dobriyan
      1) cast to "int" is unnecessary:
         u8 will be promoted to int before decrementing,
         small positive numbers fit into "int", so their values won't be changed
         during promotion.
      
         Once everything is int including loop counters, signedness doesn't
         matter: 32-bit operations will stay 32-bit operations.
      
      But! Someone tried to make this loop smart by making everything of
         the same type, apparently in an attempt to optimise it.
         Do the optimization, just differently.
         Do the cast where it matters. :^)
      
      2) frag size is unsigned entity and sum of fragments sizes is also
         unsigned.
      
      Make everything unsigned, leave no MOVSX instruction behind.
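
      A sketch of the resulting helper, reconstructed here for illustration
      (the merged version may differ in detail): the accumulator and the frag
      sizes stay unsigned, and the only cast is in the loop-termination test.

      	static inline unsigned int skb_pagelen(const struct sk_buff *skb)
      	{
      		unsigned int i, len = 0;

      		/* walk the frags backwards; cast only where signedness matters */
      		for (i = skb_shinfo(skb)->nr_frags - 1; (int)i >= 0; i--)
      			len += skb_frag_size(&skb_shinfo(skb)->frags[i]);
      		return len + skb_headlen(skb);
      	}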
      
      	add/remove: 0/0 grow/shrink: 0/3 up/down: 0/-4 (-4)
      	function                                     old     new   delta
      	skb_cow_data                                 835     834      -1
      	ip_do_fragment                              2549    2548      -1
      	ip6_fragment                                3130    3128      -2
      	Total: Before=154865032, After=154865028, chg -0.00%
      Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c72d8cda
  3. 18 Nov 2016: 2 commits
    • netns: make struct pernet_operations::id unsigned int · c7d03a00
      Committed by Alexey Dobriyan
      Make struct pernet_operations::id unsigned.
      
      There are 2 reasons to do so:
      
      1)
      This field is really an index into a zero-based array and is
      thus an unsigned entity. Using a negative value is an out-of-bounds
      access by definition.
      
      2)
      On x86_64, unsigned 32-bit data that is mixed with pointers
      via array indexing, or via offsets added to or subtracted from
      pointers, is preferred to signed 32-bit data.
      
      "int" being used as an array index needs to be sign-extended
      to 64-bit before being used.
      
      	void f(long *p, int i)
      	{
      		g(p[i]);
      	}
      
        roughly translates to
      
      	movsx	rsi, esi
      	mov	rdi, [rsi+...]
      	call 	g
      
      MOVSX is a 3-byte instruction which isn't necessary if the variable is
      unsigned, because x86_64 zero-extends 32-bit operations by default.
      
      Now, there is net_generic() function which, you guessed it right, uses
      "int" as an array index:
      
      	static inline void *net_generic(const struct net *net, int id)
      	{
      		...
      		ptr = ng->ptr[id - 1];
      		...
      	}
      
      And this function is used a lot, so those sign extensions add up.
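
      A standalone illustration (not kernel code; hypothetical file) that can
      be compiled with "gcc -O2 -S" to compare the generated code for the two
      index types:

      	/* movsx-demo.c: signed vs unsigned 32-bit index on x86_64 */
      	long load_signed(const long *p, int i)
      	{
      		return p[i];	/* index must be widened with movslq (MOVSX) */
      	}

      	long load_unsigned(const long *p, unsigned int i)
      	{
      		return p[i];	/* a plain 32-bit move zero-extends for free */
      	}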
      
      Patch snipes ~1730 bytes on allyesconfig kernel (without all junk
      messing with code generation):
      
      	add/remove: 0/0 grow/shrink: 70/598 up/down: 396/-2126 (-1730)
      
      Unfortunately, some functions actually grow bigger.
      This is a seemingly random artefact of code generation, with the
      register allocator being used differently: gcc decides that some
      variable needs to live in the new r8+ registers and every access now
      requires a REX prefix. Or it is shifted into r12, so the [r12+0]
      addressing mode has to be used, which is longer than [r8].
      
      However, the overall balance is in the negative direction:
      
      	add/remove: 0/0 grow/shrink: 70/598 up/down: 396/-2126 (-1730)
      	function                                     old     new   delta
      	nfsd4_lock                                  3886    3959     +73
      	tipc_link_build_proto_msg                   1096    1140     +44
      	mac80211_hwsim_new_radio                    2776    2808     +32
      	tipc_mon_rcv                                1032    1058     +26
      	svcauth_gss_legacy_init                     1413    1429     +16
      	tipc_bcbase_select_primary                   379     392     +13
      	nfsd4_exchange_id                           1247    1260     +13
      	nfsd4_setclientid_confirm                    782     793     +11
      		...
      	put_client_renew_locked                      494     480     -14
      	ip_set_sockfn_get                            730     716     -14
      	geneve_sock_add                              829     813     -16
      	nfsd4_sequence_done                          721     703     -18
      	nlmclnt_lookup_host                          708     686     -22
      	nfsd4_lockt                                 1085    1063     -22
      	nfs_get_client                              1077    1050     -27
      	tcf_bpf_init                                1106    1076     -30
      	nfsd4_encode_fattr                          5997    5930     -67
      	Total: Before=154856051, After=154854321, chg -0.00%
      Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c7d03a00
    • udp: enable busy polling for all sockets · e68b6e50
      Committed by Eric Dumazet
      UDP busy polling is restricted to connected UDP sockets.
      
      This is because sk_busy_loop() only takes care of one NAPI context.
      
      There are cases where it could be extended.
      
      1) Some hosts receive traffic on a single NIC, with one RX queue.
      
      2) Some applications use SO_REUSEPORT and associated BPF filter
         to split the incoming traffic on one UDP socket per RX
      queue/thread/cpu
      
      3) Some UDP sockets are used to send/receive traffic for one flow, but
      they do not bother with connect()
      
      This patch records the napi_id of the first received skb, giving
      busy polling more reach.
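
      A sketch of the kind of helper this implies, reconstructed for
      illustration (name and exact placement are assumptions): remember the
      NAPI id of the first skb only, so later packets do not keep
      rewriting it.

      	static inline void sk_mark_napi_id_once(struct sock *sk,
      						const struct sk_buff *skb)
      	{
      	#ifdef CONFIG_NET_RX_BUSY_POLL
      		if (!sk->sk_napi_id)
      			sk->sk_napi_id = skb->napi_id;
      	#endif
      	}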
      
      Tested:
      
      lpaa23:~# echo 70 >/proc/sys/net/core/busy_read
      lpaa24:~# echo 70 >/proc/sys/net/core/busy_read
      
      lpaa23:~# for f in `seq 1 10`; do ./super_netperf 1 -H lpaa24 -t UDP_RR -l 5; done
      
      Before patch :
         27867   28870   37324   41060   41215
         36764   36838   44455   41282   43843
      After patch :
         73920   73213   70147   74845   71697
         68315   68028   75219   70082   73707
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e68b6e50
  4. 16 Nov 2016: 3 commits
  5. 14 Nov 2016: 3 commits
    • netfilter: x_tables: simplify IS_ERR_OR_NULL to NULL test · eb1a6bdc
      Committed by Julia Lawall
      Since commit 7926dbfa ("netfilter: don't use
      mutex_lock_interruptible()"), the function xt_find_table_lock can only
      return NULL on an error.  Simplify the call sites and update the
      comment before the function.
      
      The semantic patch that changes the code is as follows:
      (http://coccinelle.lip6.fr/)
      
      // <smpl>
      @@
      expression t,e;
      @@
      
      t = \(xt_find_table_lock(...)\|
            try_then_request_module(xt_find_table_lock(...),...)\)
      ... when != t=e
      - ! IS_ERR_OR_NULL(t)
      + t
      
      @@
      expression t,e;
      @@
      
      t = \(xt_find_table_lock(...)\|
            try_then_request_module(xt_find_table_lock(...),...)\)
      ... when != t=e
      - IS_ERR_OR_NULL(t)
      + !t
      
      @@
      expression t,e,e1;
      @@
      
      t = \(xt_find_table_lock(...)\|
            try_then_request_module(xt_find_table_lock(...),...)\)
      ... when != t=e
      ?- t ? PTR_ERR(t) : e1
      + e1
      ... when any
      
      // </smpl>
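
      A hypothetical call site showing the shape of the simplification (the
      function and variable names below are illustrative, not taken from a
      specific file):

      	static int example_get_table(struct net *net, const char *name)
      	{
      		struct xt_table *t = xt_find_table_lock(net, AF_INET, name);

      		if (!t)			/* was: if (IS_ERR_OR_NULL(t)) */
      			return -ENOENT;	/* was: return t ? PTR_ERR(t) : -ENOENT; */

      		/* ... operate on the table ... */
      		xt_table_unlock(t);
      		return 0;
      	}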
      Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      eb1a6bdc
    • tcp: take care of truncations done by sk_filter() · ac6e7800
      Committed by Eric Dumazet
      With syzkaller's help, Marco Grassi found a bug in the TCP stack,
      crashing in tcp_collapse().

      The root cause is that sk_filter() can truncate the incoming skb,
      but the TCP stack was not really expecting this to happen.
      It was probably expecting a simple DROP or ACCEPT behavior.

      We first need to make sure no part of the TCP header can be removed.
      Then we need to adjust TCP_SKB_CB(skb)->end_seq.
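
      A sketch of the first part, reconstructed for illustration (details may
      differ from the actual patch): cap the amount sk_filter() may trim at
      the TCP header length, so the header itself can never be removed.

      	int tcp_filter(struct sock *sk, struct sk_buff *skb)
      	{
      		struct tcphdr *th = (struct tcphdr *)skb->data;

      		/* never let the filter trim below the TCP header */
      		return sk_filter_trim_cap(sk, skb, th->doff * 4);
      	}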
      
      Many thanks to syzkaller team and Marco for giving us a reproducer.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Marco Grassi <marco.gra@gmail.com>
      Reported-by: Vladis Dronov <vdronov@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ac6e7800
    • ipv4: use new_gw for redirect neigh lookup · 969447f2
      Committed by Stephen Suryaputra Lin
      In v2.6, ip_rt_redirect() calls arp_bind_neighbour() which returns 0
      and then the state of the neigh for the new_gw is checked. If the state
      isn't valid then the redirected route is deleted. This behavior is
      maintained up to v3.5.7 by check_peer_redirect() because rt->rt_gateway
      is assigned to peer->redirect_learned.a4 before calling
      ipv4_neigh_lookup().
      
      After commit 5943634f ("ipv4: Maintain redirect and PMTU info in
      struct rtable again."), ipv4_neigh_lookup() is performed without the
      rt_gateway assigned to the new_gw. In the case when rt_gateway (old_gw)
      isn't zero, the function uses it as the key. The neigh is most likely
      valid since the old_gw is the one that sends the ICMP redirect message.
      Then the new_gw is assigned to fib_nh_exception. The problem is that
      the new_gw ARP may never get resolved and the traffic is blackholed.
      
      So, use the new_gw for neigh lookup.
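
      A sketch of the lookup in __ip_do_redirect() after the change,
      reconstructed for illustration (surrounding code omitted):

      	n = __ipv4_neigh_lookup(rt->dst.dev, new_gw);
      	if (!n)
      		n = neigh_create(&arp_tbl, &new_gw, rt->dst.dev);
      	if (!IS_ERR(n)) {
      		/* ... record new_gw in the nexthop exception ... */
      	}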
      
      Changes from v1:
       - use __ipv4_neigh_lookup instead (per Eric Dumazet).
      
      Fixes: 5943634f ("ipv4: Maintain redirect and PMTU info in struct rtable again.")
      Signed-off-by: Stephen Suryaputra Lin <ssurya@ieee.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      969447f2
  6. 11 Nov 2016: 1 commit
  7. 10 Nov 2016: 7 commits
  8. 08 Nov 2016: 4 commits
  9. 05 Nov 2016: 2 commits
    • net: inet: Support UID-based routing in IP protocols. · e2d118a1
      Committed by Lorenzo Colitti
      - Use the UID in routing lookups made by protocol connect() and
        sendmsg() functions.
      - Make sure that routing lookups triggered by incoming packets
        (e.g., Path MTU discovery) take the UID of the socket into
        account.
      - For packets not associated with a userspace socket (e.g., ping
        replies), use UID 0 inside the user namespace corresponding to
        the network namespace the socket belongs to. This allows
        all namespaces to apply routing and iptables rules to
        kernel-originated traffic in that namespace by matching UID 0.
        This is better than using the UID of the kernel socket that is
        sending the traffic, because the UID of kernel sockets created
        at namespace creation time (e.g., the per-processor ICMP and
        TCP sockets) is the UID of the user that created the socket,
        which might not be mapped in the namespace (see the sketch
        after this list).
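
      A sketch of the helper that the last bullet describes, reconstructed
      for illustration (the name is an assumption): fall back to UID 0 in the
      namespace's user namespace when no socket is associated with the packet.

      	static inline kuid_t sock_net_uid(const struct net *net,
      					  const struct sock *sk)
      	{
      		return sk ? sk->sk_uid : make_kuid(net->user_ns, 0);
      	}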
      
      Tested: compiles allnoconfig, allyesconfig, allmodconfig
      Tested: https://android-review.googlesource.com/253302
      Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e2d118a1
    • net: core: add UID to flows, rules, and routes · 622ec2c9
      Committed by Lorenzo Colitti
      - Define a new FIB rule attribute, FRA_UID_RANGE, to describe a
        range of UIDs (see the sketch after this list).
      - Define a RTA_UID attribute for per-UID route lookups and dumps.
      - Support passing these attributes to and from userspace via
        rtnetlink. The value INVALID_UID indicates no UID was
        specified.
      - Add a UID field to the flow structures.
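
      A sketch of the netlink payload carried by FRA_UID_RANGE, reconstructed
      for illustration (field names are assumptions):

      	struct fib_rule_uid_range {
      		__u32	start;
      		__u32	end;
      	};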
      Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      622ec2c9
  10. 04 Nov 2016: 7 commits
    • tcp: fix return value for partial writes · 79d8665b
      Committed by Eric Dumazet
      After my commit, tcp_sendmsg() might restart its loop after
      processing socket backlog.
      
      If sk_err is set, we blindly return an error, even though we
      copied data to user space before.
      
      We should instead return the number of bytes that could be copied,
      otherwise user space might resend data and corrupt the stream.
      
      This might happen if another thread is using recvmsg(MSG_ERRQUEUE)
      to process timestamps.
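
      A sketch of the principle, reconstructed from tcp_sendmsg()'s error
      labels for illustration (not the literal diff): a late error is routed
      through the path that reports partial progress.

      	if (sk->sk_err || (sk->sk_shutdown & SEND_SHUTDOWN))
      		goto do_error;		/* instead of jumping to out_err */
      	...

      	do_error:
      		if (copied + copied_syn)
      			goto out;	/* report the bytes already copied */
      	out_err:
      		err = sk_stream_error(sk, flags, err);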
      
      Issue was diagnosed by Soheil and Willem, big kudos to them !
      
      Fixes: d41a69f1 ("tcp: make tcp_sendmsg() aware of socket backlog")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Cc: Soheil Hassas Yeganeh <soheil@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Tested-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      79d8665b
    • ipv4: allow local fragmentation in ip_finish_output_gso() · 9ee6c5dc
      Committed by Lance Richardson
      Some configurations (e.g. geneve interface with default
      MTU of 1500 over an ethernet interface with 1500 MTU) result
      in the transmission of packets that exceed the configured MTU.
      While this should be considered to be a "bad" configuration,
      it is still allowed and should not result in the sending
      of packets that exceed the configured MTU.
      
      Fix by dropping the assumption in ip_finish_output_gso() that
      locally originated gso packets will never need fragmentation.
      Basic testing using iperf (observing CPU usage and bandwidth)
      has shown no measurable performance impact for traffic not
      requiring fragmentation.
      
      Fixes: c7ba65d7 ("net: ip: push gso skb forwarding handling down the stack")
      Reported-by: Jan Tluka <jtluka@redhat.com>
      Signed-off-by: Lance Richardson <lrichard@redhat.com>
      Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9ee6c5dc
    • ipv4: add IP_RECVFRAGSIZE cmsg · 70ecc248
      Committed by Willem de Bruijn
      The IP stack records the largest fragment of a reassembled packet
      in IPCB(skb)->frag_max_size. When reading a datagram or raw packet
      that arrived fragmented, expose the value to allow applications to
      estimate receive path MTU.
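
      A minimal userspace sketch of consuming the new ancillary data, written
      for illustration (error handling trimmed; the fallback #define uses the
      uapi value in case the toolchain headers predate the option):

      	#include <stdio.h>
      	#include <sys/socket.h>
      	#include <sys/uio.h>
      	#include <netinet/in.h>

      	#ifndef IP_RECVFRAGSIZE
      	#define IP_RECVFRAGSIZE 25
      	#endif

      	static void print_fragsize(int fd, char *buf, size_t len)
      	{
      		char cbuf[CMSG_SPACE(sizeof(int))];
      		struct iovec iov = { .iov_base = buf, .iov_len = len };
      		struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1,
      				      .msg_control = cbuf,
      				      .msg_controllen = sizeof(cbuf) };
      		struct cmsghdr *cm;
      		int on = 1;

      		/* ask for the cmsg, then read one datagram */
      		setsockopt(fd, IPPROTO_IP, IP_RECVFRAGSIZE, &on, sizeof(on));
      		if (recvmsg(fd, &msg, 0) < 0)
      			return;
      		for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm))
      			if (cm->cmsg_level == IPPROTO_IP &&
      			    cm->cmsg_type == IP_RECVFRAGSIZE)
      				printf("frag_max_size: %d\n",
      				       *(int *)CMSG_DATA(cm));
      	}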
      
      Tested:
        Sent data over a veth pair of which the source has a small mtu.
        Sent data using netcat, received using a dedicated process.
      
        Verified that the cmsg IP_RECVFRAGSIZE is returned only when
        data arrives fragmented, and in that case matches the veth mtu.
      
          ip link add veth0 type veth peer name veth1
      
          ip netns add from
          ip netns add to
      
          ip link set dev veth1 netns to
          ip netns exec to ip addr add dev veth1 192.168.10.1/24
          ip netns exec to ip link set dev veth1 up
      
          ip link set dev veth0 netns from
          ip netns exec from ip addr add dev veth0 192.168.10.2/24
          ip netns exec from ip link set dev veth0 up
          ip netns exec from ip link set dev veth0 mtu 1300
          ip netns exec from ethtool -K veth0 ufo off
      
          dd if=/dev/zero bs=1 count=1400 2>/dev/null > payload
      
          ip netns exec to ./recv_cmsg_recvfragsize -4 -u -p 6000 &
          ip netns exec from nc -q 1 -u 192.168.10.1 6000 < payload
      
        using github.com/wdebruij/kerneltools/blob/master/tests/recvfragsize.c
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      70ecc248
    • tcp: fix potential memory corruption · ac9e70b1
      Committed by Eric Dumazet
      Imagine the initial value of max_skb_frags is 17, and the last
      skb in the write queue has 15 frags.

      Then max_skb_frags is lowered to 14 or a smaller value.

      tcp_sendmsg() will then be allowed to add additional page frags
      and eventually go past MAX_SKB_FRAGS, overflowing struct
      skb_shared_info.
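
      A sketch of the idea, reconstructed for illustration (not the literal
      diff): treat the limit as an upper bound rather than an exact-match
      trigger when deciding whether the current skb may take another frag.

      	if (i >= sysctl_max_skb_frags || !sk_can_gso(sk)) {
      		/* was "i == sysctl_max_skb_frags": an skb that already has
      		 * more frags than a freshly lowered limit could keep growing
      		 */
      		tcp_mark_push(tp, skb);
      		goto new_segment;
      	}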
      
      Fixes: 5f74f82e ("net:Add sysctl_max_skb_frags")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Hans Westgaard Ry <hans.westgaard.ry@oracle.com>
      Cc: Håkon Bugge <haakon.bugge@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ac9e70b1
    • net: ip, raw_diag -- Use jump for exiting from nested loop · 9999370f
      Committed by Cyrill Gorcunov
      I managed to miss that sk_for_each() is called under a "for"
      loop, so we need to use goto here to return the matching socket.
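
      A sketch of the resulting shape, reconstructed for illustration (helper
      names may differ):

      	for (slot = 0; slot < RAW_HTABLE_SIZE; slot++) {
      		sk_for_each(sk, &hashinfo->ht[slot]) {
      			if (raw_lookup(net, sk, r)) {
      				sock_hold(sk);
      				/* break would only leave sk_for_each() */
      				goto out_unlock;
      			}
      		}
      	}
      	sk = ERR_PTR(-ENOENT);
      out_unlock:
      	read_unlock(&hashinfo->lock);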
      
      CC: David S. Miller <davem@davemloft.net>
      CC: Eric Dumazet <eric.dumazet@gmail.com>
      CC: David Ahern <dsa@cumulusnetworks.com>
      CC: Andrey Vagin <avagin@openvz.org>
      CC: Stephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Acked-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9999370f
    • net: ip, raw_diag -- Fix socket leaking for destroy request · cd05a0ec
      Committed by Cyrill Gorcunov
      In raw_diag_destroy() the helper raw_sock_get() returns with a
      reference taken via sock_hold(), so we have to put the socket
      when we are done with it.
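
      A sketch of the fixed destroy path, reconstructed for illustration:

      	sk = raw_sock_get(net, r);
      	if (IS_ERR(sk))
      		return PTR_ERR(sk);
      	err = sock_diag_destroy(sk, ECONNABORTED);
      	sock_put(sk);	/* drop the reference taken by raw_sock_get() */
      	return err;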
      
      CC: David S. Miller <davem@davemloft.net>
      CC: Eric Dumazet <eric.dumazet@gmail.com>
      CC: David Ahern <dsa@cumulusnetworks.com>
      CC: Andrey Vagin <avagin@openvz.org>
      CC: Stephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Acked-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      cd05a0ec
    • inet: fix sleeping inside inet_wait_for_connect() · 14135f30
      Committed by WANG Cong
      Andrey reported this kernel warning:
      
        WARNING: CPU: 0 PID: 4608 at kernel/sched/core.c:7724
        __might_sleep+0x14c/0x1a0 kernel/sched/core.c:7719
        do not call blocking ops when !TASK_RUNNING; state=1 set at
        [<ffffffff811f5a5c>] prepare_to_wait+0xbc/0x210
        kernel/sched/wait.c:178
        Modules linked in:
        CPU: 0 PID: 4608 Comm: syz-executor Not tainted 4.9.0-rc2+ #320
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
         ffff88006625f7a0 ffffffff81b46914 ffff88006625f818 0000000000000000
         ffffffff84052960 0000000000000000 ffff88006625f7e8 ffffffff81111237
         ffff88006aceac00 ffffffff00001e2c ffffed000cc4beff ffffffff84052960
        Call Trace:
         [<     inline     >] __dump_stack lib/dump_stack.c:15
         [<ffffffff81b46914>] dump_stack+0xb3/0x10f lib/dump_stack.c:51
         [<ffffffff81111237>] __warn+0x1a7/0x1f0 kernel/panic.c:550
         [<ffffffff8111132c>] warn_slowpath_fmt+0xac/0xd0 kernel/panic.c:565
         [<ffffffff811922fc>] __might_sleep+0x14c/0x1a0 kernel/sched/core.c:7719
         [<     inline     >] slab_pre_alloc_hook mm/slab.h:393
         [<     inline     >] slab_alloc_node mm/slub.c:2634
         [<     inline     >] slab_alloc mm/slub.c:2716
         [<ffffffff81508da0>] __kmalloc_track_caller+0x150/0x2a0 mm/slub.c:4240
         [<ffffffff8146be14>] kmemdup+0x24/0x50 mm/util.c:113
         [<ffffffff8388b2cf>] dccp_feat_clone_sp_val.part.5+0x4f/0xe0 net/dccp/feat.c:374
         [<     inline     >] dccp_feat_clone_sp_val net/dccp/feat.c:1141
         [<     inline     >] dccp_feat_change_recv net/dccp/feat.c:1141
         [<ffffffff8388d491>] dccp_feat_parse_options+0xaa1/0x13d0 net/dccp/feat.c:1411
         [<ffffffff83894f01>] dccp_parse_options+0x721/0x1010 net/dccp/options.c:128
         [<ffffffff83891280>] dccp_rcv_state_process+0x200/0x15b0 net/dccp/input.c:644
         [<ffffffff838b8a94>] dccp_v4_do_rcv+0xf4/0x1a0 net/dccp/ipv4.c:681
         [<     inline     >] sk_backlog_rcv ./include/net/sock.h:872
         [<ffffffff82b7ceb6>] __release_sock+0x126/0x3a0 net/core/sock.c:2044
         [<ffffffff82b7d189>] release_sock+0x59/0x1c0 net/core/sock.c:2502
         [<     inline     >] inet_wait_for_connect net/ipv4/af_inet.c:547
         [<ffffffff8316b2a2>] __inet_stream_connect+0x5d2/0xbb0 net/ipv4/af_inet.c:617
         [<ffffffff8316b8d5>] inet_stream_connect+0x55/0xa0 net/ipv4/af_inet.c:656
         [<ffffffff82b705e4>] SYSC_connect+0x244/0x2f0 net/socket.c:1533
         [<ffffffff82b72dd4>] SyS_connect+0x24/0x30 net/socket.c:1514
         [<ffffffff83fbf701>] entry_SYSCALL_64_fastpath+0x1f/0xc2
        arch/x86/entry/entry_64.S:209
      
      Unlike commit 26cabd31
      ("sched, net: Clean up sk_wait_event() vs. might_sleep()"), the
      sleeping function is called before schedule_timeout(), so this is
      indeed a bug. Fix it by moving the wait logic to the new API; this is
      similar to commit ff960a73
      ("netdev, sched/wait: Fix sleeping inside wait event").
      Reported-by: Andrey Konovalov <andreyknvl@google.com>
      Cc: Andrey Konovalov <andreyknvl@google.com>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      14135f30
  11. 03 Nov 2016: 4 commits
    • netfilter: nf_tables: use hook state from xt_action_param structure · 0e5a1c7e
      Committed by Pablo Neira Ayuso
      Don't copy the relevant fields from the hook state structure; instead,
      use the state that is already available in struct xt_action_param.
      
      This patch also adds a set of new wrapper functions to fetch relevant
      hook state structure fields.
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      0e5a1c7e
    • netfilter: x_tables: move hook state into xt_action_param structure · 613dbd95
      Committed by Pablo Neira Ayuso
      Place a pointer to the hook state in the xt_action_param structure
      instead of copying the fields that we need. After this change,
      xt_action_param fits into one cacheline.
      
      This patch also adds a set of new wrapper functions to fetch relevant
      hook state structure fields.
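
      A sketch of the kind of wrappers this adds, reconstructed for
      illustration (names may differ):

      	static inline struct net *xt_net(const struct xt_action_param *par)
      	{
      		return par->state->net;
      	}

      	static inline struct net_device *xt_in(const struct xt_action_param *par)
      	{
      		return par->state->in;
      	}

      	static inline unsigned int xt_hooknum(const struct xt_action_param *par)
      	{
      		return par->state->hook;
      	}

      	static inline u_int8_t xt_family(const struct xt_action_param *par)
      	{
      		return par->state->pf;
      	}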
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      613dbd95
    • net: ip, diag -- Adjust raw_abort to use unlocked __udp_disconnect · 3de864f8
      Committed by Cyrill Gorcunov
      While preparing patches for killing raw sockets via the
      diag netlink interface, I noticed that my runs were stuck:
      
       | [root@pcs7 ~]# cat /proc/`pidof ss`/stack
       | [<ffffffff816d1a76>] __lock_sock+0x80/0xc4
       | [<ffffffff816d206a>] lock_sock_nested+0x47/0x95
       | [<ffffffff8179ded6>] udp_disconnect+0x19/0x33
       | [<ffffffff8179b517>] raw_abort+0x33/0x42
       | [<ffffffff81702322>] sock_diag_destroy+0x4d/0x52
      
      which has not been the case before. I narrowed it down to the commit
      
       | commit 286c72de
       | Author: Eric Dumazet <edumazet@google.com>
       | Date:   Thu Oct 20 09:39:40 2016 -0700
       |
       |     udp: must lock the socket in udp_disconnect()
      
      where we started locking the socket for a different reason.

      So raw_abort() escaped the renaming and we have to fix it by
      using __udp_disconnect() instead.
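
      A sketch of the fixed raw_abort(), reconstructed for illustration:

      	int raw_abort(struct sock *sk, int err)
      	{
      		lock_sock(sk);

      		sk->sk_err = err;
      		sk->sk_error_report(sk);
      		__udp_disconnect(sk, 0);	/* the lock is already held here */

      		release_sock(sk);

      		return 0;
      	}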
      
      Fixes: 286c72de ("udp: must lock the socket in udp_disconnect()")
      CC: David S. Miller <davem@davemloft.net>
      CC: Eric Dumazet <eric.dumazet@gmail.com>
      CC: David Ahern <dsa@cumulusnetworks.com>
      CC: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
      CC: James Morris <jmorris@namei.org>
      CC: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
      CC: Patrick McHardy <kaber@trash.net>
      CC: Andrey Vagin <avagin@openvz.org>
      CC: Stephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3de864f8
    • tcp: enhance tcp collapsing · 2331ccc5
      Committed by Eric Dumazet
      As Ilya Lesokhin suggested, we can collapse two skbs at retransmit
      time even if the skb at the right has fragments.
      
      We simply have to use the more generic skb_copy_bits() instead of
      skb_copy_from_linear_data() in tcp_collapse_retrans().

      We also need to guard this skb_copy_bits() in case there is nothing to
      copy, otherwise skb_put() could panic if the left skb has frags.
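
      A sketch of the copy step in tcp_collapse_retrans() after the change,
      reconstructed for illustration (details may differ):

      	/* skb_copy_bits() also handles payload living in next_skb's frags;
      	 * skip the copy entirely when there is nothing to pull, since
      	 * skb_put() must not be called on an skb that has frags
      	 */
      	if (next_skb_size)
      		skb_copy_bits(next_skb, 0, skb_put(skb, next_skb_size),
      			      next_skb_size);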
      
      Tested:
      
      Used following packetdrill test
      
      // Establish a connection.
      0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
         +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
         +0 bind(3, ..., ...) = 0
         +0 listen(3, 1) = 0
      
         +0 < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 8>
         +0 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 8>
      +.100 < . 1:1(0) ack 1 win 257
         +0 accept(3, ..., ...) = 4
      
         +0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0
         +0 write(4, ..., 200) = 200
         +0 > P. 1:201(200) ack 1
      +.001 write(4, ..., 200) = 200
         +0 > P. 201:401(200) ack 1
      +.001 write(4, ..., 200) = 200
         +0 > P. 401:601(200) ack 1
      +.001 write(4, ..., 200) = 200
         +0 > P. 601:801(200) ack 1
      +.001 write(4, ..., 200) = 200
         +0 > P. 801:1001(200) ack 1
      +.001 write(4, ..., 100) = 100
         +0 > P. 1001:1101(100) ack 1
      +.001 write(4, ..., 100) = 100
         +0 > P. 1101:1201(100) ack 1
      +.001 write(4, ..., 100) = 100
         +0 > P. 1201:1301(100) ack 1
      +.001 write(4, ..., 100) = 100
         +0 > P. 1301:1401(100) ack 1
      
      +.100 < . 1:1(0) ack 1 win 257 <nop,nop,sack 1001:1401>
      // Check that TCP collapse works :
         +0 > P. 1:1001(1000) ack 1
      Reported-by: Ilya Lesokhin <ilyal@mellanox.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2331ccc5
  12. 02 Nov 2016: 2 commits
    • netfilter: move socket lookup infrastructure to nf_socket_ipv{4,6}.c · 8db4c5be
      Committed by Pablo Neira Ayuso
      We need this split to reuse existing codebase for the upcoming nf_tables
      socket expression.
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      8db4c5be
    • netfilter: nf_tables: add fib expression · f6d0cbcf
      Committed by Florian Westphal
      Add a FIB expression, supported for the ipv4, ipv6 and inet families
      (the latter just dispatches to the ipv4 or ipv6 one based on nfproto).
      
      Currently supports fetching output interface index/name and the
      rtm_type associated with an address.
      
      This can be used for adding path filtering. rtm_type is useful
      to e.g. enforce a strong-end host model where packets
      are only accepted if daddr is configured on the interface the
      packet arrived on.
      
      The fib expression is a native nftables alternative to the
      xtables addrtype and rp_filter matches.
      
      FIB result order for oif/oifname retrieval is as follows:
       - if packet is local (skb has rtable, RTF_LOCAL set, this
         will also catch looped-back multicast packets), set oif to
         the loopback interface.
       - if fib lookup returns an error, or result points to local,
         store zero result.  This means '--local' option of -m rpfilter
         is not supported. It is possible to use 'fib type local' or add
         explicit saddr/daddr matching rules to create exceptions if this
         is really needed.
       - store result in the destination register.
         In case of multiple routes, search set for desired oif in case
         strict matching is requested.
      
      The ipv4 and ipv6 fib expressions are supposed to behave the same.
      
      [ I have collapsed Arnd Bergmann's ("netfilter: nf_tables: fib warnings")
      
      	http://patchwork.ozlabs.org/patch/688615/
      
        to address fallout from this patch after rebasing nf-next, that was
        posted to address compilation warnings. --pablo ]
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      f6d0cbcf
  13. 01 Nov 2016: 1 commit
    • net: Enable support for VRF with ipv4 multicast · e58e4159
      Committed by David Ahern
      Enable support for IPv4 multicast:
      - similar to unicast, the flow struct is updated to the L3 master
        device, if relevant, prior to calling fib_rules_lookup. The table id
        is saved to the lookup arg so the rule action for ipmr can return the
        table associated with the device.

      - ip_mr_forward needs to check for a master device mismatch as well,
        since skb->dev is set to it

      - allow a multicast address on the VRF device for Rx by checking for
        the daddr on the VRF device as well as the original ingress device

      - on Tx, we need to drop to __mkroute_output when the FIB lookup fails
        for a multicast destination address.

      - if CONFIG_IP_MROUTE_MULTIPLE_TABLES is enabled, the VRF driver
        creates IPMR FIB rules on first device creation, similar to FIB
        rules. In addition, the VRF driver does not divert IPv4 multicast
        packets: diverting breaks Tx since the fib lookup fails on the mcast
        address.
      
      With this patch, ipmr forwarding and local rx/tx work.
      Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e58e4159