1. 06 12月, 2016 1 次提交
  2. 04 12月, 2016 1 次提交
  3. 03 12月, 2016 5 次提交
    • D
      bpf: Add new cgroup attach type to enable sock modifications · 61023658
      David Ahern 提交于
      Add new cgroup based program type, BPF_PROG_TYPE_CGROUP_SOCK. Similar to
      BPF_PROG_TYPE_CGROUP_SKB programs can be attached to a cgroup and run
      any time a process in the cgroup opens an AF_INET or AF_INET6 socket.
      Currently only sk_bound_dev_if is exported to userspace for modification
      by a bpf program.
      
      This allows a cgroup to be configured such that AF_INET{6} sockets opened
      by processes are automatically bound to a specific device. In turn, this
      enables the running of programs that do not support SO_BINDTODEVICE in a
      specific VRF context / L3 domain.
      Signed-off-by: NDavid Ahern <dsa@cumulusnetworks.com>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      61023658
    • A
      ip6_offload: check segs for NULL in ipv6_gso_segment. · 6b6ebb6b
      Artem Savkov 提交于
      segs needs to be checked for being NULL in ipv6_gso_segment() before calling
      skb_shinfo(segs), otherwise kernel can run into a NULL-pointer dereference:
      
      [   97.811262] BUG: unable to handle kernel NULL pointer dereference at 00000000000000cc
      [   97.819112] IP: [<ffffffff816e52f9>] ipv6_gso_segment+0x119/0x2f0
      [   97.825214] PGD 0 [   97.827047]
      [   97.828540] Oops: 0000 [#1] SMP
      [   97.831678] Modules linked in: vhost_net vhost macvtap macvlan nfsv3 rpcsec_gss_krb5
      nfsv4 dns_resolver nfs fscache xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4
      iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack
      ipt_REJECT nf_reject_ipv4 tun ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter
      bridge stp llc snd_hda_codec_realtek snd_hda_codec_hdmi snd_hda_codec_generic snd_hda_intel
      snd_hda_codec edac_mce_amd snd_hda_core edac_core snd_hwdep kvm_amd snd_seq kvm snd_seq_device
      snd_pcm irqbypass snd_timer ppdev parport_serial snd parport_pc k10temp pcspkr soundcore parport
      sp5100_tco shpchp sg wmi i2c_piix4 acpi_cpufreq nfsd auth_rpcgss nfs_acl lockd grace sunrpc
      ip_tables xfs libcrc32c sr_mod cdrom sd_mod ata_generic pata_acpi amdkfd amd_iommu_v2 radeon
      broadcom bcm_phy_lib i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops
      ttm ahci serio_raw tg3 firewire_ohci libahci pata_atiixp drm ptp libata firewire_core pps_core
      i2c_core crc_itu_t fjes dm_mirror dm_region_hash dm_log dm_mod
      [   97.927721] CPU: 1 PID: 3504 Comm: vhost-3495 Not tainted 4.9.0-7.el7.test.x86_64 #1
      [   97.935457] Hardware name: AMD Snook/Snook, BIOS ESK0726A 07/26/2010
      [   97.941806] task: ffff880129a1c080 task.stack: ffffc90001bcc000
      [   97.947720] RIP: 0010:[<ffffffff816e52f9>]  [<ffffffff816e52f9>] ipv6_gso_segment+0x119/0x2f0
      [   97.956251] RSP: 0018:ffff88012fc43a10  EFLAGS: 00010207
      [   97.961557] RAX: 0000000000000000 RBX: ffff8801292c8700 RCX: 0000000000000594
      [   97.968687] RDX: 0000000000000593 RSI: ffff880129a846c0 RDI: 0000000000240000
      [   97.975814] RBP: ffff88012fc43a68 R08: ffff880129a8404e R09: 0000000000000000
      [   97.982942] R10: 0000000000000000 R11: ffff880129a84076 R12: 00000020002949b3
      [   97.990070] R13: ffff88012a580000 R14: 0000000000000000 R15: ffff88012a580000
      [   97.997198] FS:  0000000000000000(0000) GS:ffff88012fc40000(0000) knlGS:0000000000000000
      [   98.005280] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [   98.011021] CR2: 00000000000000cc CR3: 0000000126c5d000 CR4: 00000000000006e0
      [   98.018149] Stack:
      [   98.020157]  00000000ffffffff ffff88012fc43ac8 ffffffffa017ad0a 000000000000000e
      [   98.027584]  0000001300000000 0000000077d59998 ffff8801292c8700 00000020002949b3
      [   98.035010]  ffff88012a580000 0000000000000000 ffff88012a580000 ffff88012fc43a98
      [   98.042437] Call Trace:
      [   98.044879]  <IRQ> [   98.046803]  [<ffffffffa017ad0a>] ? tg3_start_xmit+0x84a/0xd60 [tg3]
      [   98.053156]  [<ffffffff815eeee0>] skb_mac_gso_segment+0xb0/0x130
      [   98.059158]  [<ffffffff815eefd3>] __skb_gso_segment+0x73/0x110
      [   98.064985]  [<ffffffff815ef40d>] validate_xmit_skb+0x12d/0x2b0
      [   98.070899]  [<ffffffff815ef5d2>] validate_xmit_skb_list+0x42/0x70
      [   98.077073]  [<ffffffff81618560>] sch_direct_xmit+0xd0/0x1b0
      [   98.082726]  [<ffffffff815efd86>] __dev_queue_xmit+0x486/0x690
      [   98.088554]  [<ffffffff8135c135>] ? cpumask_next_and+0x35/0x50
      [   98.094380]  [<ffffffff815effa0>] dev_queue_xmit+0x10/0x20
      [   98.099863]  [<ffffffffa09ce057>] br_dev_queue_push_xmit+0xa7/0x170 [bridge]
      [   98.106907]  [<ffffffffa09ce161>] br_forward_finish+0x41/0xc0 [bridge]
      [   98.113430]  [<ffffffff81627cf2>] ? nf_iterate+0x52/0x60
      [   98.118735]  [<ffffffff81627d6b>] ? nf_hook_slow+0x6b/0xc0
      [   98.124216]  [<ffffffffa09ce32c>] __br_forward+0x14c/0x1e0 [bridge]
      [   98.130480]  [<ffffffffa09ce120>] ? br_dev_queue_push_xmit+0x170/0x170 [bridge]
      [   98.137785]  [<ffffffffa09ce4bd>] br_forward+0x9d/0xb0 [bridge]
      [   98.143701]  [<ffffffffa09cfbb7>] br_handle_frame_finish+0x267/0x560 [bridge]
      [   98.150834]  [<ffffffffa09d0064>] br_handle_frame+0x174/0x2f0 [bridge]
      [   98.157355]  [<ffffffff8102fb89>] ? sched_clock+0x9/0x10
      [   98.162662]  [<ffffffff810b63b2>] ? sched_clock_cpu+0x72/0xa0
      [   98.168403]  [<ffffffff815eccf5>] __netif_receive_skb_core+0x1e5/0xa20
      [   98.174926]  [<ffffffff813659f9>] ? timerqueue_add+0x59/0xb0
      [   98.180580]  [<ffffffff815ed548>] __netif_receive_skb+0x18/0x60
      [   98.186494]  [<ffffffff815ee625>] process_backlog+0x95/0x140
      [   98.192145]  [<ffffffff815edccd>] net_rx_action+0x16d/0x380
      [   98.197713]  [<ffffffff8170cff1>] __do_softirq+0xd1/0x283
      [   98.203106]  [<ffffffff8170b2bc>] do_softirq_own_stack+0x1c/0x30
      [   98.209107]  <EOI> [   98.211029]  [<ffffffff8108a5c0>] do_softirq+0x50/0x60
      [   98.216166]  [<ffffffff815ec853>] netif_rx_ni+0x33/0x80
      [   98.221386]  [<ffffffffa09eeff7>] tun_get_user+0x487/0x7f0 [tun]
      [   98.227388]  [<ffffffffa09ef3ab>] tun_sendmsg+0x4b/0x60 [tun]
      [   98.233129]  [<ffffffffa0b68932>] handle_tx+0x282/0x540 [vhost_net]
      [   98.239392]  [<ffffffffa0b68c25>] handle_tx_kick+0x15/0x20 [vhost_net]
      [   98.245916]  [<ffffffffa0abacfe>] vhost_worker+0x9e/0xf0 [vhost]
      [   98.251919]  [<ffffffffa0abac60>] ? vhost_umem_alloc+0x40/0x40 [vhost]
      [   98.258440]  [<ffffffff81003a47>] ? do_syscall_64+0x67/0x180
      [   98.264094]  [<ffffffff810a44d9>] kthread+0xd9/0xf0
      [   98.268965]  [<ffffffff810a4400>] ? kthread_park+0x60/0x60
      [   98.274444]  [<ffffffff8170a4d5>] ret_from_fork+0x25/0x30
      [   98.279836] Code: 8b 93 d8 00 00 00 48 2b 93 d0 00 00 00 4c 89 e6 48 89 df 66 89 93 c2 00 00 00 ff 10 48 3d 00 f0 ff ff 49 89 c2 0f 87 52 01 00 00 <41> 8b 92 cc 00 00 00 48 8b 80 d0 00 00 00 44 0f b7 74 10 06 66
      [   98.299425] RIP  [<ffffffff816e52f9>] ipv6_gso_segment+0x119/0x2f0
      [   98.305612]  RSP <ffff88012fc43a10>
      [   98.309094] CR2: 00000000000000cc
      [   98.312406] ---[ end trace 726a2c7a2d2d78d0 ]---
      Signed-off-by: NArtem Savkov <asavkov@redhat.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6b6ebb6b
    • F
      tcp: randomize tcp timestamp offsets for each connection · 95a22cae
      Florian Westphal 提交于
      jiffies based timestamps allow for easy inference of number of devices
      behind NAT translators and also makes tracking of hosts simpler.
      
      commit ceaa1fef ("tcp: adding a per-socket timestamp offset")
      added the main infrastructure that is needed for per-connection ts
      randomization, in particular writing/reading the on-wire tcp header
      format takes the offset into account so rest of stack can use normal
      tcp_time_stamp (jiffies).
      
      So only two items are left:
       - add a tsoffset for request sockets
       - extend the tcp isn generator to also return another 32bit number
         in addition to the ISN.
      
      Re-use of ISN generator also means timestamps are still monotonically
      increasing for same connection quadruple, i.e. PAWS will still work.
      
      Includes fixes from Eric Dumazet.
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      95a22cae
    • E
      Revert: "ip6_tunnel: Update skb->protocol to ETH_P_IPV6 in ip6_tnl_xmit()" · 80d1106a
      Eli Cooper 提交于
      This reverts commit ae148b08
      ("ip6_tunnel: Update skb->protocol to ETH_P_IPV6 in ip6_tnl_xmit()").
      
      skb->protocol is now set in __ip_local_out() and __ip6_local_out() before
      dst_output() is called. It is no longer necessary to do it for each tunnel.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NEli Cooper <elicooper@gmx.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      80d1106a
    • E
      ipv6: Set skb->protocol properly for local output · b4e479a9
      Eli Cooper 提交于
      When xfrm is applied to TSO/GSO packets, it follows this path:
      
          xfrm_output() -> xfrm_output_gso() -> skb_gso_segment()
      
      where skb_gso_segment() relies on skb->protocol to function properly.
      
      This patch sets skb->protocol to ETH_P_IPV6 before dst_output() is called,
      fixing a bug where GSO packets sent through an ipip6 tunnel are dropped
      when xfrm is involved.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NEli Cooper <elicooper@gmx.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b4e479a9
  4. 01 12月, 2016 1 次提交
  5. 30 11月, 2016 2 次提交
  6. 29 11月, 2016 1 次提交
    • D
      net: handle no dst on skb in icmp6_send · 79dc7e3f
      David Ahern 提交于
      Andrey reported the following while fuzzing the kernel with syzkaller:
      
      kasan: CONFIG_KASAN_INLINE enabled
      kasan: GPF could be caused by NULL-ptr deref or user memory access
      general protection fault: 0000 [#1] SMP KASAN
      Modules linked in:
      CPU: 0 PID: 3859 Comm: a.out Not tainted 4.9.0-rc6+ #429
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
      task: ffff8800666d4200 task.stack: ffff880067348000
      RIP: 0010:[<ffffffff833617ec>]  [<ffffffff833617ec>]
      icmp6_send+0x5fc/0x1e30 net/ipv6/icmp.c:451
      RSP: 0018:ffff88006734f2c0  EFLAGS: 00010206
      RAX: ffff8800666d4200 RBX: 0000000000000000 RCX: 0000000000000000
      RDX: 0000000000000000 RSI: dffffc0000000000 RDI: 0000000000000018
      RBP: ffff88006734f630 R08: ffff880064138418 R09: 0000000000000003
      R10: dffffc0000000000 R11: 0000000000000005 R12: 0000000000000000
      R13: ffffffff84e7e200 R14: ffff880064138484 R15: ffff8800641383c0
      FS:  00007fb3887a07c0(0000) GS:ffff88006cc00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000020000000 CR3: 000000006b040000 CR4: 00000000000006f0
      Stack:
       ffff8800666d4200 ffff8800666d49f8 ffff8800666d4200 ffffffff84c02460
       ffff8800666d4a1a 1ffff1000ccdaa2f ffff88006734f498 0000000000000046
       ffff88006734f440 ffffffff832f4269 ffff880064ba7456 0000000000000000
      Call Trace:
       [<ffffffff83364ddc>] icmpv6_param_prob+0x2c/0x40 net/ipv6/icmp.c:557
       [<     inline     >] ip6_tlvopt_unknown net/ipv6/exthdrs.c:88
       [<ffffffff83394405>] ip6_parse_tlv+0x555/0x670 net/ipv6/exthdrs.c:157
       [<ffffffff8339a759>] ipv6_parse_hopopts+0x199/0x460 net/ipv6/exthdrs.c:663
       [<ffffffff832ee773>] ipv6_rcv+0xfa3/0x1dc0 net/ipv6/ip6_input.c:191
       ...
      
      icmp6_send / icmpv6_send is invoked for both rx and tx paths. In both
      cases the dst->dev should be preferred for determining the L3 domain
      if the dst has been set on the skb. Fallback to the skb->dev if it has
      not. This covers the case reported here where icmp6_send is invoked on
      Rx before the route lookup.
      
      Fixes: 5d41ce29 ("net: icmp6_send should use dst dev to determine L3 domain")
      Reported-by: NAndrey Konovalov <andreyknvl@google.com>
      Signed-off-by: NDavid Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      79dc7e3f
  7. 26 11月, 2016 1 次提交
    • D
      net: ipv4, ipv6: run cgroup eBPF egress programs · 33b48679
      Daniel Mack 提交于
      If the cgroup associated with the receiving socket has an eBPF
      programs installed, run them from ip_output(), ip6_output() and
      ip_mc_output(). From mentioned functions we have two socket contexts
      as per 7026b1dd ("netfilter: Pass socket pointer down through
      okfn()."). We explicitly need to use sk instead of skb->sk here,
      since otherwise the same program would run multiple times on egress
      when encap devices are involved, which is not desired in our case.
      
      eBPF programs used in this context are expected to either return 1 to
      let the packet pass, or != 1 to drop them. The programs have access to
      the skb through bpf_skb_load_bytes(), and the payload starts at the
      network headers (L3).
      
      Note that cgroup_bpf_run_filter() is stubbed out as static inline nop
      for !CONFIG_CGROUP_BPF, and is otherwise guarded by a static key if
      the feature is unused.
      Signed-off-by: NDaniel Mack <daniel@zonque.org>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      33b48679
  8. 25 11月, 2016 2 次提交
    • E
      udplite: call proper backlog handlers · 30c7be26
      Eric Dumazet 提交于
      In commits 93821778 ("udp: Fix rcv socket locking") and
      f7ad74fe ("net/ipv6/udp: UDP encapsulation: break backlog_rcv into
      __udpv6_queue_rcv_skb") UDP backlog handlers were renamed, but UDPlite
      was forgotten.
      
      This leads to crashes if UDPlite header is pulled twice, which happens
      starting from commit e6afc8ac ("udp: remove headers from UDP packets
      before queueing")
      
      Bug found by syzkaller team, thanks a lot guys !
      
      Note that backlog use in UDP/UDPlite is scheduled to be removed starting
      from linux-4.10, so this patch is only needed up to linux-4.9
      
      Fixes: 93821778 ("udp: Fix rcv socket locking")
      Fixes: f7ad74fe ("net/ipv6/udp: UDP encapsulation: break backlog_rcv into __udpv6_queue_rcv_skb")
      Fixes: e6afc8ac ("udp: remove headers from UDP packets before queueing")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NAndrey Konovalov <andreyknvl@google.com>
      Cc: Benjamin LaHaise <bcrl@kvack.org>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      30c7be26
    • P
      ipv6: bump genid when the IFA_F_TENTATIVE flag is clear · 764d3be6
      Paolo Abeni 提交于
      When an ipv6 address has the tentative flag set, it can't be
      used as source for egress traffic, while the associated route,
      if any, can be looked up and even stored into some dst_cache.
      
      In the latter scenario, the source ipv6 address selected and
      stored in the cache is most probably wrong (e.g. with
      link-local scope) and the entity using the dst_cache will
      experience lack of ipv6 connectivity until said cache is
      cleared or invalidated.
      
      Overall this may cause lack of connectivity over most IPv6 tunnels
      (comprising geneve and vxlan), if the first egress packet reaches
      the tunnel before the DaD is completed for the used ipv6
      address.
      
      This patch bumps a new genid after that the IFA_F_TENTATIVE flag
      is cleared, so that dst_cache will be invalidated on
      next lookup and ipv6 connectivity restored.
      
      Fixes: 0c1d70af ("net: use dst_cache for vxlan device")
      Fixes: 468dfffc ("geneve: add dst caching support")
      Acked-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      764d3be6
  9. 24 11月, 2016 1 次提交
    • D
      netfilter: Update nf_send_reset6 to consider L3 domain · 00b4422f
      David Ahern 提交于
      nf_send_reset6 is not considering the L3 domain and lookups are sent
      to the wrong table. For example consider the following output rule:
      
      ip6tables -A OUTPUT -p tcp --dport 12345 -j REJECT --reject-with tcp-reset
      
      using perf to analyze lookups via the fib6_table_lookup tracepoint shows:
      
      swapper     0 [001]   248.787816: fib6:fib6_table_lookup: table 255 oif 0 iif 1 src 2100:1::3 dst 2100:1:
              ffffffff81439cdc perf_trace_fib6_table_lookup ([kernel.kallsyms])
              ffffffff814c1ce3 trace_fib6_table_lookup ([kernel.kallsyms])
              ffffffff814c3e89 ip6_pol_route ([kernel.kallsyms])
              ffffffff814c40d5 ip6_pol_route_output ([kernel.kallsyms])
              ffffffff814e7b6f fib6_rule_action ([kernel.kallsyms])
              ffffffff81437f60 fib_rules_lookup ([kernel.kallsyms])
              ffffffff814e7c79 fib6_rule_lookup ([kernel.kallsyms])
              ffffffff814c2541 ip6_route_output_flags ([kernel.kallsyms])
                           528 nf_send_reset6 ([nf_reject_ipv6])
      
      The lookup is directed to table 255 rather than the table associated with
      the device via the L3 domain. Update nf_send_reset6 to pull the L3 domain
      from the dst currently attached to the skb.
      Signed-off-by: NDavid Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      00b4422f
  10. 22 11月, 2016 1 次提交
  11. 20 11月, 2016 1 次提交
    • A
      net: fix bogus cast in skb_pagelen() and use unsigned variables · c72d8cda
      Alexey Dobriyan 提交于
      1) cast to "int" is unnecessary:
         u8 will be promoted to int before decrementing,
         small positive numbers fit into "int", so their values won't be changed
         during promotion.
      
         Once everything is int including loop counters, signedness doesn't
         matter: 32-bit operations will stay 32-bit operations.
      
         But! Someone tried to make this loop smart by making everything of
         the same type apparently in an attempt to optimise it.
         Do the optimization, just differently.
         Do the cast where it matters. :^)
      
      2) frag size is unsigned entity and sum of fragments sizes is also
         unsigned.
      
      Make everything unsigned, leave no MOVSX instruction behind.
      
      	add/remove: 0/0 grow/shrink: 0/3 up/down: 0/-4 (-4)
      	function                                     old     new   delta
      	skb_cow_data                                 835     834      -1
      	ip_do_fragment                              2549    2548      -1
      	ip6_fragment                                3130    3128      -2
      	Total: Before=154865032, After=154865028, chg -0.00%
      Signed-off-by: NAlexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c72d8cda
  12. 18 11月, 2016 3 次提交
    • A
      netns: make struct pernet_operations::id unsigned int · c7d03a00
      Alexey Dobriyan 提交于
      Make struct pernet_operations::id unsigned.
      
      There are 2 reasons to do so:
      
      1)
      This field is really an index into an zero based array and
      thus is unsigned entity. Using negative value is out-of-bound
      access by definition.
      
      2)
      On x86_64 unsigned 32-bit data which are mixed with pointers
      via array indexing or offsets added or subtracted to pointers
      are preffered to signed 32-bit data.
      
      "int" being used as an array index needs to be sign-extended
      to 64-bit before being used.
      
      	void f(long *p, int i)
      	{
      		g(p[i]);
      	}
      
        roughly translates to
      
      	movsx	rsi, esi
      	mov	rdi, [rsi+...]
      	call 	g
      
      MOVSX is 3 byte instruction which isn't necessary if the variable is
      unsigned because x86_64 is zero extending by default.
      
      Now, there is net_generic() function which, you guessed it right, uses
      "int" as an array index:
      
      	static inline void *net_generic(const struct net *net, int id)
      	{
      		...
      		ptr = ng->ptr[id - 1];
      		...
      	}
      
      And this function is used a lot, so those sign extensions add up.
      
      Patch snipes ~1730 bytes on allyesconfig kernel (without all junk
      messing with code generation):
      
      	add/remove: 0/0 grow/shrink: 70/598 up/down: 396/-2126 (-1730)
      
      Unfortunately some functions actually grow bigger.
      This is a semmingly random artefact of code generation with register
      allocator being used differently. gcc decides that some variable
      needs to live in new r8+ registers and every access now requires REX
      prefix. Or it is shifted into r12, so [r12+0] addressing mode has to be
      used which is longer than [r8]
      
      However, overall balance is in negative direction:
      
      	add/remove: 0/0 grow/shrink: 70/598 up/down: 396/-2126 (-1730)
      	function                                     old     new   delta
      	nfsd4_lock                                  3886    3959     +73
      	tipc_link_build_proto_msg                   1096    1140     +44
      	mac80211_hwsim_new_radio                    2776    2808     +32
      	tipc_mon_rcv                                1032    1058     +26
      	svcauth_gss_legacy_init                     1413    1429     +16
      	tipc_bcbase_select_primary                   379     392     +13
      	nfsd4_exchange_id                           1247    1260     +13
      	nfsd4_setclientid_confirm                    782     793     +11
      		...
      	put_client_renew_locked                      494     480     -14
      	ip_set_sockfn_get                            730     716     -14
      	geneve_sock_add                              829     813     -16
      	nfsd4_sequence_done                          721     703     -18
      	nlmclnt_lookup_host                          708     686     -22
      	nfsd4_lockt                                 1085    1063     -22
      	nfs_get_client                              1077    1050     -27
      	tcf_bpf_init                                1106    1076     -30
      	nfsd4_encode_fattr                          5997    5930     -67
      	Total: Before=154856051, After=154854321, chg -0.00%
      Signed-off-by: NAlexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c7d03a00
    • E
      udp: enable busy polling for all sockets · e68b6e50
      Eric Dumazet 提交于
      UDP busy polling is restricted to connected UDP sockets.
      
      This is because sk_busy_loop() only takes care of one NAPI context.
      
      There are cases where it could be extended.
      
      1) Some hosts receive traffic on a single NIC, with one RX queue.
      
      2) Some applications use SO_REUSEPORT and associated BPF filter
         to split the incoming traffic on one UDP socket per RX
      queue/thread/cpu
      
      3) Some UDP sockets are used to send/receive traffic for one flow, but
      they do not bother with connect()
      
      This patch records the napi_id of first received skb, giving more
      reach to busy polling.
      
      Tested:
      
      lpaa23:~# echo 70 >/proc/sys/net/core/busy_read
      lpaa24:~# echo 70 >/proc/sys/net/core/busy_read
      
      lpaa23:~# for f in `seq 1 10`; do ./super_netperf 1 -H lpaa24 -t UDP_RR -l 5; done
      
      Before patch :
         27867   28870   37324   41060   41215
         36764   36838   44455   41282   43843
      After patch :
         73920   73213   70147   74845   71697
         68315   68028   75219   70082   73707
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e68b6e50
    • P
      ip6_tunnel: disable caching when the traffic class is inherited · b5c2d495
      Paolo Abeni 提交于
      If an ip6 tunnel is configured to inherit the traffic class from
      the inner header, the dst_cache must be disabled or it will foul
      the policy routing.
      
      The issue is apprently there since at leat Linux-2.6.12-rc2.
      Reported-by: NLiam McBirnie <liam.mcbirnie@boeing.com>
      Cc: Liam McBirnie <liam.mcbirnie@boeing.com>
      Acked-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b5c2d495
  13. 17 11月, 2016 1 次提交
    • D
      ipv6: sr: add option to control lwtunnel support · 46738b13
      David Lebrun 提交于
      This patch adds a new option CONFIG_IPV6_SEG6_LWTUNNEL to enable/disable
      support of encapsulation with the lightweight tunnels. When this option
      is enabled, CONFIG_LWTUNNEL is automatically selected.
      
      Fix commit 6c8702c6 ("ipv6: sr: add support for SRH encapsulation and injection with lwtunnels")
      
      Without a proper option to control lwtunnel support for SR-IPv6, if
      CONFIG_LWTUNNEL=n then the IPv6 initialization fails as a consequence
      of seg6_iptunnel_init() failure with EOPNOTSUPP:
      
      NET: Registered protocol family 10
      IPv6: Attempt to unregister permanent protocol 6
      IPv6: Attempt to unregister permanent protocol 136
      IPv6: Attempt to unregister permanent protocol 17
      NET: Unregistered protocol family 10
      
      Tested (compiling, booting, and loading ipv6 module when relevant)
      with possible combinations of CONFIG_IPV6={y,m,n},
      CONFIG_IPV6_SEG6_LWTUNNEL={y,n} and CONFIG_LWTUNNEL={y,n}.
      Reported-by: NLorenzo Colitti <lorenzo@google.com>
      Suggested-by: NRoopa Prabhu <roopa@cumulusnetworks.com>
      Signed-off-by: NDavid Lebrun <david.lebrun@uclouvain.be>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      46738b13
  14. 16 11月, 2016 2 次提交
  15. 14 11月, 2016 2 次提交
  16. 10 11月, 2016 12 次提交
    • D
      net: tcp response should set oif only if it is L3 master · 9b6c14d5
      David Ahern 提交于
      Lorenzo noted an Android unit test failed due to e0d56fdd:
      "The expectation in the test was that the RST replying to a SYN sent to a
      closed port should be generated with oif=0. In other words it should not
      prefer the interface where the SYN came in on, but instead should follow
      whatever the routing table says it should do."
      
      Revert the change to ip_send_unicast_reply and tcp_v6_send_response such
      that the oif in the flow is set to the skb_iif only if skb_iif is an L3
      master.
      
      Fixes: e0d56fdd ("net: l3mdev: remove redundant calls")
      Reported-by: NLorenzo Colitti <lorenzo@google.com>
      Signed-off-by: NDavid Ahern <dsa@cumulusnetworks.com>
      Tested-by: NLorenzo Colitti <lorenzo@google.com>
      Acked-by: NLorenzo Colitti <lorenzo@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9b6c14d5
    • D
      ipv6: sr: add support for SRH injection through setsockopt · a149e7c7
      David Lebrun 提交于
      This patch adds support for per-socket SRH injection with the setsockopt
      system call through the IPPROTO_IPV6, IPV6_RTHDR options.
      The SRH is pushed through the ipv6_push_nfrag_opts function.
      Signed-off-by: NDavid Lebrun <david.lebrun@uclouvain.be>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a149e7c7
    • D
      ipv6: add source address argument for ipv6_push_nfrag_opts · 613fa3ca
      David Lebrun 提交于
      This patch prepares for insertion of SRH through setsockopt().
      The new source address argument is used when an HMAC field is
      present in the SRH, which must be filled. The HMAC signature
      process requires the source address as input text.
      Signed-off-by: NDavid Lebrun <david.lebrun@uclouvain.be>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      613fa3ca
    • D
      ipv6: sr: add calls to verify and insert HMAC signatures · 9baee834
      David Lebrun 提交于
      This patch enables the verification of the HMAC signature for transiting
      SR-enabled packets, and its insertion on encapsulated/injected SRH.
      Signed-off-by: NDavid Lebrun <david.lebrun@uclouvain.be>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9baee834
    • D
      ipv6: sr: implement API to control SR HMAC structure · 4f4853dc
      David Lebrun 提交于
      This patch provides an implementation of the genetlink commands
      to associate a given HMAC key identifier with an hashing algorithm
      and a secret.
      Signed-off-by: NDavid Lebrun <david.lebrun@uclouvain.be>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4f4853dc
    • D
      ipv6: sr: add core files for SR HMAC support · bf355b8d
      David Lebrun 提交于
      This patch adds the necessary functions to compute and check the HMAC signature
      of an SR-enabled packet. Two HMAC algorithms are supported: hmac(sha1) and
      hmac(sha256).
      
      In order to avoid dynamic memory allocation for each HMAC computation,
      a per-cpu ring buffer is allocated for this purpose.
      
      A new per-interface sysctl called seg6_require_hmac is added, allowing a
      user-defined policy for processing HMAC-signed SR-enabled packets.
      A value of -1 means that the HMAC field will always be ignored.
      A value of 0 means that if an HMAC field is present, its validity will
      be enforced (the packet is dropped is the signature is incorrect).
      Finally, a value of 1 means that any SR-enabled packet that does not
      contain an HMAC signature or whose signature is incorrect will be dropped.
      Signed-off-by: NDavid Lebrun <david.lebrun@uclouvain.be>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      bf355b8d
    • D
      ipv6: sr: add support for SRH encapsulation and injection with lwtunnels · 6c8702c6
      David Lebrun 提交于
      This patch creates a new type of interfaceless lightweight tunnel (SEG6),
      enabling the encapsulation and injection of SRH within locally emitted
      packets and forwarded packets.
      
      >From a configuration viewpoint, a seg6 tunnel would be configured as follows:
      
        ip -6 ro ad fc00::1/128 encap seg6 mode encap segs fc42::1,fc42::2,fc42::3 dev eth0
      
      Any packet whose destination address is fc00::1 would thus be encapsulated
      within an outer IPv6 header containing the SRH with three segments, and would
      actually be routed to the first segment of the list. If `mode inline' was
      specified instead of `mode encap', then the SRH would be directly inserted
      after the IPv6 header without outer encapsulation.
      
      The inline mode is only available if CONFIG_IPV6_SEG6_INLINE is enabled. This
      feature was made configurable because direct header insertion may break
      several mechanisms such as PMTUD or IPSec AH.
      Signed-off-by: NDavid Lebrun <david.lebrun@uclouvain.be>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6c8702c6
    • D
      ipv6: sr: add code base for control plane support of SR-IPv6 · 915d7e5e
      David Lebrun 提交于
      This patch adds the necessary hooks and structures to provide support
      for SR-IPv6 control plane, essentially the Generic Netlink commands
      that will be used for userspace control over the Segment Routing
      kernel structures.
      
      The genetlink commands provide control over two different structures:
      tunnel source and HMAC data. The tunnel source is the source address
      that will be used by default when encapsulating packets into an
      outer IPv6 header + SRH. If the tunnel source is set to :: then an
      address of the outgoing interface will be selected as the source.
      
      The HMAC commands currently just return ENOTSUPP and will be implemented
      in a future patch.
      Signed-off-by: NDavid Lebrun <david.lebrun@uclouvain.be>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      915d7e5e
    • D
      ipv6: implement dataplane support for rthdr type 4 (Segment Routing Header) · 1ababeba
      David Lebrun 提交于
      Implement minimal support for processing of SR-enabled packets
      as described in
      https://tools.ietf.org/html/draft-ietf-6man-segment-routing-header-02.
      
      This patch implements the following operations:
      - Intermediate segment endpoint: incrementation of active segment and rerouting.
      - Egress for SR-encapsulated packets: decapsulation of outer IPv6 header + SRH
        and routing of inner packet.
      - Cleanup flag support for SR-inlined packets: removal of SRH if we are the
        penultimate segment endpoint.
      
      A per-interface sysctl seg6_enabled is provided, to accept/deny SR-enabled
      packets. Default is deny.
      
      This patch does not provide support for HMAC-signed packets.
      Signed-off-by: NDavid Lebrun <david.lebrun@uclouvain.be>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1ababeba
    • A
      udp: provide udp{4,6}_lib_lookup for nf_socket_ipv{4,6} · 30f58158
      Arnd Bergmann 提交于
      Since commit ca065d0c ("udp: no longer use SLAB_DESTROY_BY_RCU")
      the udp6_lib_lookup and udp4_lib_lookup functions are only
      provided when it is actually possible to call them.
      
      However, moving the callers now caused a link error:
      
      net/built-in.o: In function `nf_sk_lookup_slow_v6':
      (.text+0x131a39): undefined reference to `udp6_lib_lookup'
      net/ipv4/netfilter/nf_socket_ipv4.o: In function `nf_sk_lookup_slow_v4':
      nf_socket_ipv4.c:(.text.nf_sk_lookup_slow_v4+0x114): undefined reference to `udp4_lib_lookup'
      
      This extends the #ifdef so we also provide the functions when
      CONFIG_NF_SOCKET_IPV4 or CONFIG_NF_SOCKET_IPV6, respectively
      are set.
      
      Fixes: 8db4c5be ("netfilter: move socket lookup infrastructure to nf_socket_ipv{4,6}.c")
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      30f58158
    • D
      netfilter: conntrack: simplify init/uninit of L4 protocol trackers · 0e54d217
      Davide Caratti 提交于
      modify registration and deregistration of layer-4 protocol trackers to
      facilitate inclusion of new elements into the current list of builtin
      protocols. Both builtin (TCP, UDP, ICMP) and non-builtin (DCCP, GRE, SCTP,
      UDPlite) layer-4 protocol trackers usually register/deregister themselves
      using consecutive calls to nf_ct_l4proto_{,pernet}_{,un}register(...).
      This sequence is interrupted and rolled back in case of error; in order to
      simplify addition of builtin protocols, the input of the above functions
      has been modified to allow registering/unregistering multiple protocols.
      Signed-off-by: NDavide Caratti <dcaratti@redhat.com>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      0e54d217
    • M
      net-ipv6: on device mtu change do not add mtu to mtu-less routes · fb56be83
      Maciej Żenczykowski 提交于
      Routes can specify an mtu explicitly or inherit the mtu from
      the underlying device - this inheritance is implemented in
      dst->ops->mtu handlers ip6_mtu() and ip6_blackhole_mtu().
      
      Currently changing the mtu of a device adds mtu explicitly
      to routes using that device.
      
      ie.
        # ip link set dev lo mtu 65536
        # ip -6 route add local 2000::1 dev lo
        # ip -6 route get 2000::1
        local 2000::1 dev lo  table local  src ...  metric 1024  pref medium
      
        # ip link set dev lo mtu 65535
        # ip -6 route get 2000::1
        local 2000::1 dev lo  table local  src ...  metric 1024  mtu 65535 pref medium
      
        # ip link set dev lo mtu 65536
        # ip -6 route get 2000::1
        local 2000::1 dev lo  table local  src ...  metric 1024  mtu 65536 pref medium
      
        # ip -6 route del local 2000::1
      
      After this patch the route entry no longer changes unless it already has an mtu.
      There is no need: this inheritance is already done in ip6_mtu()
      
        # ip link set dev lo mtu 65536
        # ip -6 route add local 2000::1 dev lo
        # ip -6 route add local 2000::2 dev lo mtu 2000
        # ip -6 route get 2000::1; ip -6 route get 2000::2
        local 2000::1 dev lo  table local  src ...  metric 1024  pref medium
        local 2000::2 dev lo  table local  src ...  metric 1024  mtu 2000 pref medium
      
        # ip link set dev lo mtu 65535
        # ip -6 route get 2000::1; ip -6 route get 2000::2
        local 2000::1 dev lo  table local  src ...  metric 1024  pref medium
        local 2000::2 dev lo  table local  src ...  metric 1024  mtu 2000 pref medium
      
        # ip link set dev lo mtu 1501
        # ip -6 route get 2000::1; ip -6 route get 2000::2
        local 2000::1 dev lo  table local  src ...  metric 1024  pref medium
        local 2000::2 dev lo  table local  src ...  metric 1024  mtu 1501 pref medium
      
        # ip link set dev lo mtu 65536
        # ip -6 route get 2000::1; ip -6 route get 2000::2
        local 2000::1 dev lo  table local  src ...  metric 1024  pref medium
        local 2000::2 dev lo  table local  src ...  metric 1024  mtu 65536 pref medium
      
        # ip -6 route del local 2000::1
        # ip -6 route del local 2000::2
      
      This is desirable because changing device mtu and then resetting it
      to the previous value shouldn't change the user visible routing table.
      Signed-off-by: NMaciej Żenczykowski <maze@google.com>
      CC: Eric Dumazet <edumazet@google.com>
      Acked-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      fb56be83
  17. 08 11月, 2016 3 次提交