  1. 24 Dec, 2016 2 commits
    • ipv6: handle -EFAULT from skb_copy_bits · a98f9175
      Dave Jones committed
      By setting certain socket options on ipv6 raw sockets, we can confuse the
      length calculation in rawv6_push_pending_frames triggering a BUG_ON.
      
      RIP: 0010:[<ffffffff817c6390>] [<ffffffff817c6390>] rawv6_sendmsg+0xc30/0xc40
      RSP: 0018:ffff881f6c4a7c18  EFLAGS: 00010282
      RAX: 00000000fffffff2 RBX: ffff881f6c681680 RCX: 0000000000000002
      RDX: ffff881f6c4a7cf8 RSI: 0000000000000030 RDI: ffff881fed0f6a00
      RBP: ffff881f6c4a7da8 R08: 0000000000000000 R09: 0000000000000009
      R10: ffff881fed0f6a00 R11: 0000000000000009 R12: 0000000000000030
      R13: ffff881fed0f6a00 R14: ffff881fee39ba00 R15: ffff881fefa93a80
      
      Call Trace:
       [<ffffffff8118ba23>] ? unmap_page_range+0x693/0x830
       [<ffffffff81772697>] inet_sendmsg+0x67/0xa0
       [<ffffffff816d93f8>] sock_sendmsg+0x38/0x50
       [<ffffffff816d982f>] SYSC_sendto+0xef/0x170
       [<ffffffff816da27e>] SyS_sendto+0xe/0x10
       [<ffffffff81002910>] do_syscall_64+0x50/0xa0
       [<ffffffff817f7cbc>] entry_SYSCALL64_slow_path+0x25/0x25
      
      Handle by jumping to the failure path if skb_copy_bits gets an EFAULT.
      
      Reproducer:
      
      #include <stdio.h>
      #include <stdlib.h>
      #include <string.h>
      #include <unistd.h>
      #include <sys/types.h>
      #include <sys/socket.h>
      #include <netinet/in.h>
      
      #define LEN 504
      
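      int main(int argc, char* argv[])
      {
      	int fd;
      	int zero = 0;
      	char buf[LEN];

      	memset(buf, 0, LEN);

      	/* Raw IPv6 socket; creating one needs CAP_NET_RAW. */
      	fd = socket(AF_INET6, SOCK_RAW, 7);

      	/* The socket options that confuse the length calculation. */
      	setsockopt(fd, SOL_IPV6, IPV6_CHECKSUM, &zero, sizeof(zero));
      	setsockopt(fd, SOL_IPV6, IPV6_DSTOPTS, &buf, LEN);

      	/* Return values are left unchecked, as in the original reproducer. */
      	sendto(fd, buf, 1, 0, (struct sockaddr *) buf, 110);

      	return 0;
      }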
      Signed-off-by: Dave Jones <davej@codemonkey.org.uk>
      Signed-off-by: David S. Miller <davem@davemloft.net>
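
      The idea of handling the error rather than asserting on it can be
      sketched in userspace as follows. This is only an illustration:
      copy_bits() and read_csum() are hypothetical stand-ins and are not the
      kernel's skb_copy_bits() or rawv6_push_pending_frames().

```c
#include <errno.h>
#include <string.h>

/* Hypothetical stand-in for skb_copy_bits(): copies len bytes starting at
 * offset, or returns -EFAULT when the range falls outside the buffer. */
static int copy_bits(const unsigned char *src, int src_len, int offset,
                     void *to, int len)
{
    if (offset < 0 || len < 0 || offset > src_len - len)
        return -EFAULT;
    memcpy(to, src + offset, (size_t)len);
    return 0;
}

/* Reads a 16-bit checksum field at csum_offset. A failed copy is reported
 * to the caller through the failure path instead of being treated as
 * something that can never happen. */
static int read_csum(const unsigned char *pkt, int pkt_len, int csum_offset,
                     unsigned short *csum)
{
    int err;

    err = copy_bits(pkt, pkt_len, csum_offset, csum, (int)sizeof(*csum));
    if (err)
        goto out;   /* failure path: the caller sees -EFAULT */

    /* Checksum processing would continue here. */
out:
    return err;
}
```

      With a 4-byte packet, an offset of 2 succeeds while an offset of 48
      returns -EFAULT to the caller.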
    • inet: fix IP(V6)_RECVORIGDSTADDR for udp sockets · 39b2dd76
      Willem de Bruijn committed
      Socket cmsg IP(V6)_RECVORIGDSTADDR checks that port range lies within
      the packet. For sockets that have transport headers pulled, transport
      offset can be negative. Use signed comparison to avoid overflow.
      
      Fixes: e6afc8ac ("udp: remove headers from UDP packets before queueing")
      Reported-by: Nisar Jagabar <njagabar@cloudmark.com>
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
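
      Why the signedness of the comparison matters can be shown with a small
      userspace sketch. The function names are hypothetical and the check is
      a simplified model of "the 4 bytes of ports lie within the packet", not
      the kernel code itself.

```c
#include <stdbool.h>

/* Unsigned comparison: a negative transport offset wraps around to a huge
 * value, so the ports always look as if they were outside the packet. */
static bool ports_outside_unsigned(int transport_offset, unsigned int len)
{
    return (unsigned int)(transport_offset + 4) > len;
}

/* Signed comparison: a negative offset stays negative and the check only
 * fires when the ports really extend past the end of the packet. */
static bool ports_outside_signed(int transport_offset, unsigned int len)
{
    return transport_offset + 4 > (int)len;
}
```

      For a transport offset of -8 and a length of 100, the unsigned variant
      reports the ports as out of range, while the signed variant does not.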
  2. 18 Dec, 2016 2 commits
    • net: ipv6: check route protocol when deleting routes · c2ed1880
      Mantas M committed
      The protocol field is checked when deleting IPv4 routes, but ignored for
      IPv6, which causes problems with routing daemons accidentally deleting
      externally set routes (observed by multiple bird6 users).
      
      This can be verified using `ip -6 route del <prefix> proto something`.
      Signed-off-by: Mantas Mikulėnas <grawity@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • inet: Fix get port to handle zero port number with soreuseport set · 0643ee4f
      Tom Herbert committed
      A user may call listen without binding an explicit port, with the intent
      that the kernel will assign an available port to the socket. In this
      case inet_csk_get_port does a port scan. For such sockets, the user may
      also set soreuseport with the intent of creating more sockets for the
      port that is selected. The problem is that the initial socket being
      opened could inadvertently choose an existing and unrelated port
      number that was already created with soreuseport.

      This patch adds a boolean parameter to inet_bind_conflict that indicates
      whether soreuseport is allowed for the check (in addition to
      sk->sk_reuseport). In calls to inet_bind_conflict from inet_csk_get_port
      the argument is set to true if an explicit port is being looked up (snum
      argument is nonzero), and is false if a port scan is done.
      Signed-off-by: Tom Herbert <tom@herbertland.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
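
      The rule described in this commit can be modelled in a few lines. This
      is a heavily simplified sketch: struct bound_sock, bind_conflict() and
      port_usable() are hypothetical, and the real conflict check takes more
      conditions into account than shown here.

```c
#include <stdbool.h>

/* Hypothetical model of one socket already present in the bind table. */
struct bound_sock {
    unsigned short port;
    bool reuseport;
};

/* Sharing a port through soreuseport is honoured only when the caller
 * passes reuseport_ok; otherwise the existing socket is a conflict. */
static bool bind_conflict(const struct bound_sock *existing,
                          bool sk_reuseport, bool reuseport_ok)
{
    bool can_share = reuseport_ok && sk_reuseport && existing->reuseport;

    return !can_share;
}

/* snum is the port the user asked for; 0 means "scan for a free port". */
static bool port_usable(const struct bound_sock *existing,
                        unsigned short snum, unsigned short candidate,
                        bool sk_reuseport)
{
    bool reuseport_ok = snum != 0;   /* explicit port: sharing allowed */

    if (existing->port != candidate)
        return true;
    return !bind_conflict(existing, sk_reuseport, reuseport_ok);
}
```

      In this model a socket that explicitly asks for port 8080 may share it
      with an existing soreuseport socket, but a port scan (snum of 0) skips
      8080 even when both sockets have soreuseport set.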
  3. 07 Dec, 2016 5 commits
  4. 06 Dec, 2016 2 commits
  5. 05 Dec, 2016 7 commits
  6. 04 Dec, 2016 1 commit
  7. 03 Dec, 2016 5 commits
    • bpf: Add new cgroup attach type to enable sock modifications · 61023658
      David Ahern committed
      Add new cgroup based program type, BPF_PROG_TYPE_CGROUP_SOCK. Similar to
      BPF_PROG_TYPE_CGROUP_SKB, programs can be attached to a cgroup and run
      any time a process in the cgroup opens an AF_INET or AF_INET6 socket.
      Currently only sk_bound_dev_if is exported to userspace for modification
      by a bpf program.
      
      This allows a cgroup to be configured such that AF_INET{6} sockets opened
      by processes are automatically bound to a specific device. In turn, this
      enables the running of programs that do not support SO_BINDTODEVICE in a
      specific VRF context / L3 domain.
      Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ip6_offload: check segs for NULL in ipv6_gso_segment. · 6b6ebb6b
      Artem Savkov committed
      segs needs to be checked for being NULL in ipv6_gso_segment() before calling
      skb_shinfo(segs), otherwise kernel can run into a NULL-pointer dereference:
      
      [   97.811262] BUG: unable to handle kernel NULL pointer dereference at 00000000000000cc
      [   97.819112] IP: [<ffffffff816e52f9>] ipv6_gso_segment+0x119/0x2f0
      [   97.825214] PGD 0 [   97.827047]
      [   97.828540] Oops: 0000 [#1] SMP
      [   97.831678] Modules linked in: vhost_net vhost macvtap macvlan nfsv3 rpcsec_gss_krb5
      nfsv4 dns_resolver nfs fscache xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4
      iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack
      ipt_REJECT nf_reject_ipv4 tun ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter
      bridge stp llc snd_hda_codec_realtek snd_hda_codec_hdmi snd_hda_codec_generic snd_hda_intel
      snd_hda_codec edac_mce_amd snd_hda_core edac_core snd_hwdep kvm_amd snd_seq kvm snd_seq_device
      snd_pcm irqbypass snd_timer ppdev parport_serial snd parport_pc k10temp pcspkr soundcore parport
      sp5100_tco shpchp sg wmi i2c_piix4 acpi_cpufreq nfsd auth_rpcgss nfs_acl lockd grace sunrpc
      ip_tables xfs libcrc32c sr_mod cdrom sd_mod ata_generic pata_acpi amdkfd amd_iommu_v2 radeon
      broadcom bcm_phy_lib i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops
      ttm ahci serio_raw tg3 firewire_ohci libahci pata_atiixp drm ptp libata firewire_core pps_core
      i2c_core crc_itu_t fjes dm_mirror dm_region_hash dm_log dm_mod
      [   97.927721] CPU: 1 PID: 3504 Comm: vhost-3495 Not tainted 4.9.0-7.el7.test.x86_64 #1
      [   97.935457] Hardware name: AMD Snook/Snook, BIOS ESK0726A 07/26/2010
      [   97.941806] task: ffff880129a1c080 task.stack: ffffc90001bcc000
      [   97.947720] RIP: 0010:[<ffffffff816e52f9>]  [<ffffffff816e52f9>] ipv6_gso_segment+0x119/0x2f0
      [   97.956251] RSP: 0018:ffff88012fc43a10  EFLAGS: 00010207
      [   97.961557] RAX: 0000000000000000 RBX: ffff8801292c8700 RCX: 0000000000000594
      [   97.968687] RDX: 0000000000000593 RSI: ffff880129a846c0 RDI: 0000000000240000
      [   97.975814] RBP: ffff88012fc43a68 R08: ffff880129a8404e R09: 0000000000000000
      [   97.982942] R10: 0000000000000000 R11: ffff880129a84076 R12: 00000020002949b3
      [   97.990070] R13: ffff88012a580000 R14: 0000000000000000 R15: ffff88012a580000
      [   97.997198] FS:  0000000000000000(0000) GS:ffff88012fc40000(0000) knlGS:0000000000000000
      [   98.005280] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [   98.011021] CR2: 00000000000000cc CR3: 0000000126c5d000 CR4: 00000000000006e0
      [   98.018149] Stack:
      [   98.020157]  00000000ffffffff ffff88012fc43ac8 ffffffffa017ad0a 000000000000000e
      [   98.027584]  0000001300000000 0000000077d59998 ffff8801292c8700 00000020002949b3
      [   98.035010]  ffff88012a580000 0000000000000000 ffff88012a580000 ffff88012fc43a98
      [   98.042437] Call Trace:
      [   98.044879]  <IRQ> [   98.046803]  [<ffffffffa017ad0a>] ? tg3_start_xmit+0x84a/0xd60 [tg3]
      [   98.053156]  [<ffffffff815eeee0>] skb_mac_gso_segment+0xb0/0x130
      [   98.059158]  [<ffffffff815eefd3>] __skb_gso_segment+0x73/0x110
      [   98.064985]  [<ffffffff815ef40d>] validate_xmit_skb+0x12d/0x2b0
      [   98.070899]  [<ffffffff815ef5d2>] validate_xmit_skb_list+0x42/0x70
      [   98.077073]  [<ffffffff81618560>] sch_direct_xmit+0xd0/0x1b0
      [   98.082726]  [<ffffffff815efd86>] __dev_queue_xmit+0x486/0x690
      [   98.088554]  [<ffffffff8135c135>] ? cpumask_next_and+0x35/0x50
      [   98.094380]  [<ffffffff815effa0>] dev_queue_xmit+0x10/0x20
      [   98.099863]  [<ffffffffa09ce057>] br_dev_queue_push_xmit+0xa7/0x170 [bridge]
      [   98.106907]  [<ffffffffa09ce161>] br_forward_finish+0x41/0xc0 [bridge]
      [   98.113430]  [<ffffffff81627cf2>] ? nf_iterate+0x52/0x60
      [   98.118735]  [<ffffffff81627d6b>] ? nf_hook_slow+0x6b/0xc0
      [   98.124216]  [<ffffffffa09ce32c>] __br_forward+0x14c/0x1e0 [bridge]
      [   98.130480]  [<ffffffffa09ce120>] ? br_dev_queue_push_xmit+0x170/0x170 [bridge]
      [   98.137785]  [<ffffffffa09ce4bd>] br_forward+0x9d/0xb0 [bridge]
      [   98.143701]  [<ffffffffa09cfbb7>] br_handle_frame_finish+0x267/0x560 [bridge]
      [   98.150834]  [<ffffffffa09d0064>] br_handle_frame+0x174/0x2f0 [bridge]
      [   98.157355]  [<ffffffff8102fb89>] ? sched_clock+0x9/0x10
      [   98.162662]  [<ffffffff810b63b2>] ? sched_clock_cpu+0x72/0xa0
      [   98.168403]  [<ffffffff815eccf5>] __netif_receive_skb_core+0x1e5/0xa20
      [   98.174926]  [<ffffffff813659f9>] ? timerqueue_add+0x59/0xb0
      [   98.180580]  [<ffffffff815ed548>] __netif_receive_skb+0x18/0x60
      [   98.186494]  [<ffffffff815ee625>] process_backlog+0x95/0x140
      [   98.192145]  [<ffffffff815edccd>] net_rx_action+0x16d/0x380
      [   98.197713]  [<ffffffff8170cff1>] __do_softirq+0xd1/0x283
      [   98.203106]  [<ffffffff8170b2bc>] do_softirq_own_stack+0x1c/0x30
      [   98.209107]  <EOI> [   98.211029]  [<ffffffff8108a5c0>] do_softirq+0x50/0x60
      [   98.216166]  [<ffffffff815ec853>] netif_rx_ni+0x33/0x80
      [   98.221386]  [<ffffffffa09eeff7>] tun_get_user+0x487/0x7f0 [tun]
      [   98.227388]  [<ffffffffa09ef3ab>] tun_sendmsg+0x4b/0x60 [tun]
      [   98.233129]  [<ffffffffa0b68932>] handle_tx+0x282/0x540 [vhost_net]
      [   98.239392]  [<ffffffffa0b68c25>] handle_tx_kick+0x15/0x20 [vhost_net]
      [   98.245916]  [<ffffffffa0abacfe>] vhost_worker+0x9e/0xf0 [vhost]
      [   98.251919]  [<ffffffffa0abac60>] ? vhost_umem_alloc+0x40/0x40 [vhost]
      [   98.258440]  [<ffffffff81003a47>] ? do_syscall_64+0x67/0x180
      [   98.264094]  [<ffffffff810a44d9>] kthread+0xd9/0xf0
      [   98.268965]  [<ffffffff810a4400>] ? kthread_park+0x60/0x60
      [   98.274444]  [<ffffffff8170a4d5>] ret_from_fork+0x25/0x30
      [   98.279836] Code: 8b 93 d8 00 00 00 48 2b 93 d0 00 00 00 4c 89 e6 48 89 df 66 89 93 c2 00 00 00 ff 10 48 3d 00 f0 ff ff 49 89 c2 0f 87 52 01 00 00 <41> 8b 92 cc 00 00 00 48 8b 80 d0 00 00 00 44 0f b7 74 10 06 66
      [   98.299425] RIP  [<ffffffff816e52f9>] ipv6_gso_segment+0x119/0x2f0
      [   98.305612]  RSP <ffff88012fc43a10>
      [   98.309094] CR2: 00000000000000cc
      [   98.312406] ---[ end trace 726a2c7a2d2d78d0 ]---
      Signed-off-by: Artem Savkov <asavkov@redhat.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
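
      A pointer that can be either an encoded error or NULL needs both cases
      checked before it is dereferenced. The sketch below re-creates the
      error-pointer convention in userspace; the helper names and struct seg
      are hypothetical stand-ins, not the kernel's definitions.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Userspace re-creation of the error-pointer convention: small negative
 * errno values are stored in the top of the address range. */
static inline void *err_ptr(long error)
{
    return (void *)(intptr_t)error;
}

static inline bool is_err(const void *p)
{
    return (uintptr_t)p >= (uintptr_t)-MAX_ERRNO;
}

static inline bool is_err_or_null(const void *p)
{
    return p == NULL || is_err(p);
}

/* Hypothetical stand-in for the data reached through the segment list. */
struct seg {
    unsigned int gso_type;
};

/* Checking only is_err() would let a NULL pointer through to the
 * dereference below; checking both cases avoids that. */
static int segs_gso_type(const struct seg *segs, unsigned int *out)
{
    if (is_err_or_null(segs))
        return -1;
    *out = segs->gso_type;
    return 0;
}
```

      Both a NULL result and an encoded error such as err_ptr(-22) are
      rejected before the field is read.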
    • tcp: randomize tcp timestamp offsets for each connection · 95a22cae
      Florian Westphal committed
      jiffies-based timestamps allow easy inference of the number of devices
      behind NAT translators and also make tracking of hosts simpler.
      
      commit ceaa1fef ("tcp: adding a per-socket timestamp offset")
      added the main infrastructure that is needed for per-connection ts
      randomization, in particular writing/reading the on-wire tcp header
      format takes the offset into account so rest of stack can use normal
      tcp_time_stamp (jiffies).
      
      So only two items are left:
       - add a tsoffset for request sockets
       - extend the tcp isn generator to also return another 32bit number
         in addition to the ISN.
      
      Re-use of ISN generator also means timestamps are still monotonically
      increasing for same connection quadruple, i.e. PAWS will still work.
      
      Includes fixes from Eric Dumazet.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
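
      How a per-connection offset hides the raw clock while keeping
      timestamps ordered within a connection can be sketched as follows.
      struct conn and the three helpers are hypothetical; the sketch assumes
      a 32-bit clock value and ignores how the offset is generated.

```c
#include <stdint.h>

/* Hypothetical per-connection state: only the random offset matters here. */
struct conn {
    uint32_t tsoffset;
};

/* Value written into the TCP timestamp option: clock plus offset. */
static uint32_t ts_on_wire(const struct conn *c, uint32_t clock)
{
    return clock + c->tsoffset;
}

/* Recover the local clock value from an echoed timestamp. */
static uint32_t ts_from_wire(const struct conn *c, uint32_t echoed)
{
    return echoed - c->tsoffset;
}

/* Wrap-safe "a is earlier than b" on 32-bit timestamps. */
static int ts_before(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b) < 0;
}
```

      Because the same offset is added to every timestamp of a connection,
      ordering between two timestamps of that connection is preserved, even
      across 32-bit wraparound, while two connections with different offsets
      expose different values for the same clock reading.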
    • Revert: "ip6_tunnel: Update skb->protocol to ETH_P_IPV6 in ip6_tnl_xmit()" · 80d1106a
      Eli Cooper committed
      This reverts commit ae148b08
      ("ip6_tunnel: Update skb->protocol to ETH_P_IPV6 in ip6_tnl_xmit()").
      
      skb->protocol is now set in __ip_local_out() and __ip6_local_out() before
      dst_output() is called. It is no longer necessary to do it for each tunnel.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Eli Cooper <elicooper@gmx.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: Set skb->protocol properly for local output · b4e479a9
      Eli Cooper committed
      When xfrm is applied to TSO/GSO packets, it follows this path:
      
          xfrm_output() -> xfrm_output_gso() -> skb_gso_segment()
      
      where skb_gso_segment() relies on skb->protocol to function properly.
      
      This patch sets skb->protocol to ETH_P_IPV6 before dst_output() is called,
      fixing a bug where GSO packets sent through an ipip6 tunnel are dropped
      when xfrm is involved.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Eli Cooper <elicooper@gmx.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  8. 01 Dec, 2016 1 commit
  9. 30 Nov, 2016 2 commits
  10. 29 Nov, 2016 1 commit
    • net: handle no dst on skb in icmp6_send · 79dc7e3f
      David Ahern committed
      Andrey reported the following while fuzzing the kernel with syzkaller:
      
      kasan: CONFIG_KASAN_INLINE enabled
      kasan: GPF could be caused by NULL-ptr deref or user memory access
      general protection fault: 0000 [#1] SMP KASAN
      Modules linked in:
      CPU: 0 PID: 3859 Comm: a.out Not tainted 4.9.0-rc6+ #429
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
      task: ffff8800666d4200 task.stack: ffff880067348000
      RIP: 0010:[<ffffffff833617ec>]  [<ffffffff833617ec>]
      icmp6_send+0x5fc/0x1e30 net/ipv6/icmp.c:451
      RSP: 0018:ffff88006734f2c0  EFLAGS: 00010206
      RAX: ffff8800666d4200 RBX: 0000000000000000 RCX: 0000000000000000
      RDX: 0000000000000000 RSI: dffffc0000000000 RDI: 0000000000000018
      RBP: ffff88006734f630 R08: ffff880064138418 R09: 0000000000000003
      R10: dffffc0000000000 R11: 0000000000000005 R12: 0000000000000000
      R13: ffffffff84e7e200 R14: ffff880064138484 R15: ffff8800641383c0
      FS:  00007fb3887a07c0(0000) GS:ffff88006cc00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000020000000 CR3: 000000006b040000 CR4: 00000000000006f0
      Stack:
       ffff8800666d4200 ffff8800666d49f8 ffff8800666d4200 ffffffff84c02460
       ffff8800666d4a1a 1ffff1000ccdaa2f ffff88006734f498 0000000000000046
       ffff88006734f440 ffffffff832f4269 ffff880064ba7456 0000000000000000
      Call Trace:
       [<ffffffff83364ddc>] icmpv6_param_prob+0x2c/0x40 net/ipv6/icmp.c:557
       [<     inline     >] ip6_tlvopt_unknown net/ipv6/exthdrs.c:88
       [<ffffffff83394405>] ip6_parse_tlv+0x555/0x670 net/ipv6/exthdrs.c:157
       [<ffffffff8339a759>] ipv6_parse_hopopts+0x199/0x460 net/ipv6/exthdrs.c:663
       [<ffffffff832ee773>] ipv6_rcv+0xfa3/0x1dc0 net/ipv6/ip6_input.c:191
       ...
      
      icmp6_send / icmpv6_send is invoked for both rx and tx paths. In both
      cases the dst->dev should be preferred for determining the L3 domain
      if the dst has been set on the skb. Fallback to the skb->dev if it has
      not. This covers the case reported here where icmp6_send is invoked on
      Rx before the route lookup.
      
      Fixes: 5d41ce29 ("net: icmp6_send should use dst dev to determine L3 domain")
      Reported-by: Andrey Konovalov <andreyknvl@google.com>
      Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
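
      The "prefer dst->dev, fall back to skb->dev" rule can be sketched with
      userspace mock structures. The struct layouts below are hypothetical
      and reduced to the two pointers that matter; l3_lookup_dev() is an
      illustrative name, not a kernel function.

```c
#include <stddef.h>

/* Minimal mock types; the real kernel structures are far larger. */
struct net_device {
    int ifindex;
};

struct dst_entry {
    struct net_device *dev;
};

struct sk_buff {
    struct dst_entry *dst;    /* NULL before the route lookup */
    struct net_device *dev;   /* device the packet is associated with */
};

/* Device used to determine the L3 domain: the dst device when a dst is
 * attached, otherwise the skb's own device. */
static const struct net_device *l3_lookup_dev(const struct sk_buff *skb)
{
    if (skb->dst && skb->dst->dev)
        return skb->dst->dev;
    return skb->dev;
}
```

      A packet with no dst attached yields its own device, while a packet
      with a dst yields the dst device.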
  11. 26 Nov, 2016 1 commit
    • net: ipv4, ipv6: run cgroup eBPF egress programs · 33b48679
      Daniel Mack committed
      If the cgroup associated with the receiving socket has eBPF
      programs installed, run them from ip_output(), ip6_output() and
      ip_mc_output(). From mentioned functions we have two socket contexts
      as per 7026b1dd ("netfilter: Pass socket pointer down through
      okfn()."). We explicitly need to use sk instead of skb->sk here,
      since otherwise the same program would run multiple times on egress
      when encap devices are involved, which is not desired in our case.
      
      eBPF programs used in this context are expected to return 1 to
      let the packet pass, or any other value to drop it. The programs have access to
      the skb through bpf_skb_load_bytes(), and the payload starts at the
      network headers (L3).
      
      Note that cgroup_bpf_run_filter() is stubbed out as static inline nop
      for !CONFIG_CGROUP_BPF, and is otherwise guarded by a static key if
      the feature is unused.
      Signed-off-by: Daniel Mack <daniel@zonque.org>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
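
      The return-value convention for these programs can be modelled with an
      ordinary function pointer. This is a userspace sketch only:
      egress_prog_t, run_egress_filter() and allow_ipv6_only() are
      hypothetical, and mapping a drop to -EPERM is an assumption of this
      sketch rather than something stated in the commit message.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical program type: sees the packet starting at the L3 header. */
typedef int (*egress_prog_t)(const unsigned char *l3, unsigned int len);

/* Returns 0 when the packet may pass and a negative errno when it is
 * dropped. Only a program result of exactly 1 lets the packet through. */
static int run_egress_filter(egress_prog_t prog, const unsigned char *l3,
                             unsigned int len)
{
    if (prog == NULL)
        return 0;   /* nothing attached: pass */
    return prog(l3, len) == 1 ? 0 : -EPERM;
}

/* Example program: pass only packets whose IP version nibble is 6. */
static int allow_ipv6_only(const unsigned char *l3, unsigned int len)
{
    return len >= 1 && (l3[0] >> 4) == 6;
}
```

      With this example program a packet whose first byte is 0x60 passes and
      one whose first byte is 0x45 is dropped.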
  12. 25 Nov, 2016 2 commits
    • udplite: call proper backlog handlers · 30c7be26
      Eric Dumazet committed
      In commits 93821778 ("udp: Fix rcv socket locking") and
      f7ad74fe ("net/ipv6/udp: UDP encapsulation: break backlog_rcv into
      __udpv6_queue_rcv_skb") UDP backlog handlers were renamed, but UDPlite
      was forgotten.
      
      This leads to crashes if UDPlite header is pulled twice, which happens
      starting from commit e6afc8ac ("udp: remove headers from UDP packets
      before queueing")
      
      Bug found by syzkaller team, thanks a lot guys !
      
      Note that backlog use in UDP/UDPlite is scheduled to be removed starting
      from linux-4.10, so this patch is only needed up to linux-4.9
      
      Fixes: 93821778 ("udp: Fix rcv socket locking")
      Fixes: f7ad74fe ("net/ipv6/udp: UDP encapsulation: break backlog_rcv into __udpv6_queue_rcv_skb")
      Fixes: e6afc8ac ("udp: remove headers from UDP packets before queueing")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Andrey Konovalov <andreyknvl@google.com>
      Cc: Benjamin LaHaise <bcrl@kvack.org>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: bump genid when the IFA_F_TENTATIVE flag is clear · 764d3be6
      Paolo Abeni committed
      When an ipv6 address has the tentative flag set, it can't be
      used as source for egress traffic, while the associated route,
      if any, can be looked up and even stored into some dst_cache.
      
      In the latter scenario, the source ipv6 address selected and
      stored in the cache is most probably wrong (e.g. with
      link-local scope) and the entity using the dst_cache will
      experience lack of ipv6 connectivity until said cache is
      cleared or invalidated.
      
      Overall this may cause lack of connectivity over most IPv6 tunnels
      (including geneve and vxlan), if the first egress packet reaches
      the tunnel before DAD is completed for the used ipv6
      address.

      This patch bumps the genid after the IFA_F_TENTATIVE flag
      is cleared, so that the dst_cache will be invalidated on the
      next lookup and ipv6 connectivity restored.
      
      Fixes: 0c1d70af ("net: use dst_cache for vxlan device")
      Fixes: 468dfffc ("geneve: add dst caching support")
      Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
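
      Invalidating cached entries through a generation counter can be
      sketched like this. All names are hypothetical and the cache holds a
      single entry; the sketch only illustrates the technique of comparing a
      stored generation id with the current one.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical global generation counter for address state. */
static unsigned int addr_genid = 1;

/* One cached route together with the generation it was stored under. */
struct cached_route {
    const char *saddr;
    unsigned int genid;
    bool valid;
};

static void cache_store(struct cached_route *c, const char *saddr)
{
    c->saddr = saddr;
    c->genid = addr_genid;
    c->valid = true;
}

/* Returns the cached source address, or NULL when the entry is missing
 * or was stored under an older generation. */
static const char *cache_lookup(struct cached_route *c)
{
    if (!c->valid || c->genid != addr_genid) {
        c->valid = false;
        return NULL;
    }
    return c->saddr;
}

/* Called when an address stops being tentative: bumping the counter
 * makes every previously cached entry stale. */
static void dad_completed(void)
{
    addr_genid++;
}
```

      An entry stored while the address was still tentative is returned
      until dad_completed() runs; after that the lookup misses and the
      caller has to redo the route and source address selection.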
  13. 24 Nov, 2016 1 commit
    • netfilter: Update nf_send_reset6 to consider L3 domain · 00b4422f
      David Ahern committed
      nf_send_reset6 is not considering the L3 domain and lookups are sent
      to the wrong table. For example consider the following output rule:
      
      ip6tables -A OUTPUT -p tcp --dport 12345 -j REJECT --reject-with tcp-reset
      
      using perf to analyze lookups via the fib6_table_lookup tracepoint shows:
      
      swapper     0 [001]   248.787816: fib6:fib6_table_lookup: table 255 oif 0 iif 1 src 2100:1::3 dst 2100:1:
              ffffffff81439cdc perf_trace_fib6_table_lookup ([kernel.kallsyms])
              ffffffff814c1ce3 trace_fib6_table_lookup ([kernel.kallsyms])
              ffffffff814c3e89 ip6_pol_route ([kernel.kallsyms])
              ffffffff814c40d5 ip6_pol_route_output ([kernel.kallsyms])
              ffffffff814e7b6f fib6_rule_action ([kernel.kallsyms])
              ffffffff81437f60 fib_rules_lookup ([kernel.kallsyms])
              ffffffff814e7c79 fib6_rule_lookup ([kernel.kallsyms])
              ffffffff814c2541 ip6_route_output_flags ([kernel.kallsyms])
                           528 nf_send_reset6 ([nf_reject_ipv6])
      
      The lookup is directed to table 255 rather than the table associated with
      the device via the L3 domain. Update nf_send_reset6 to pull the L3 domain
      from the dst currently attached to the skb.
      Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
  14. 22 Nov, 2016 1 commit
  15. 20 Nov, 2016 1 commit
    • net: fix bogus cast in skb_pagelen() and use unsigned variables · c72d8cda
      Alexey Dobriyan committed
      1) cast to "int" is unnecessary:
         u8 will be promoted to int before decrementing,
         small positive numbers fit into "int", so their values won't be changed
         during promotion.
      
         Once everything is int including loop counters, signedness doesn't
         matter: 32-bit operations will stay 32-bit operations.
      
         But! Someone tried to make this loop smart by making everything of
         the same type apparently in an attempt to optimise it.
         Do the optimization, just differently.
         Do the cast where it matters. :^)
      
      2) frag size is unsigned entity and sum of fragments sizes is also
         unsigned.
      
      Make everything unsigned, leave no MOVSX instruction behind.
      
      	add/remove: 0/0 grow/shrink: 0/3 up/down: 0/-4 (-4)
      	function                                     old     new   delta
      	skb_cow_data                                 835     834      -1
      	ip_do_fragment                              2549    2548      -1
      	ip6_fragment                                3130    3128      -2
      	Total: Before=154865032, After=154865028, chg -0.00%
      Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
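
      Summing fragment sizes with unsigned variables throughout can be
      sketched in userspace as below. struct frag, struct pkt and
      pkt_pagelen() are hypothetical mocks; this is not the kernel's
      skb_pagelen() and the loop shape is chosen for readability.

```c
/* Hypothetical mock of a paged packet: a linear head plus fragments. */
struct frag {
    unsigned int size;
};

struct pkt {
    unsigned int headlen;
    unsigned char nr_frags;     /* small count, stored as u8 */
    struct frag frags[17];
};

/* Fragment sizes and their sum are unsigned quantities, so the counter
 * and accumulator are unsigned too; no sign extension is needed when
 * the index is used for addressing. */
static unsigned int pkt_pagelen(const struct pkt *p)
{
    unsigned int i, len = 0;

    for (i = 0; i < p->nr_frags; i++)
        len += p->frags[i].size;
    return len + p->headlen;
}
```

      For a packet with a 100-byte head and fragments of 4096 and 200 bytes
      the function returns 4396.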
  16. 18 Nov, 2016 3 commits
    • netns: make struct pernet_operations::id unsigned int · c7d03a00
      Alexey Dobriyan committed
      Make struct pernet_operations::id unsigned.
      
      There are 2 reasons to do so:
      
      1)
      This field is really an index into a zero-based array and
      thus is an unsigned entity. Using a negative value is an out-of-bounds
      access by definition.

      2)
      On x86_64, unsigned 32-bit data mixed with pointers
      via array indexing, or offsets added to or subtracted from pointers,
      are preferred to signed 32-bit data.
      
      "int" being used as an array index needs to be sign-extended
      to 64-bit before being used.
      
      	void f(long *p, int i)
      	{
      		g(p[i]);
      	}
      
        roughly translates to
      
      	movsx	rsi, esi
      	mov	rdi, [rsi+...]
      	call 	g
      
      MOVSX is a 3-byte instruction which isn't necessary if the variable is
      unsigned because x86_64 is zero extending by default.
      
      Now, there is net_generic() function which, you guessed it right, uses
      "int" as an array index:
      
      	static inline void *net_generic(const struct net *net, int id)
      	{
      		...
      		ptr = ng->ptr[id - 1];
      		...
      	}
      
      And this function is used a lot, so those sign extensions add up.
      
      Patch snipes ~1730 bytes on allyesconfig kernel (without all junk
      messing with code generation):
      
      	add/remove: 0/0 grow/shrink: 70/598 up/down: 396/-2126 (-1730)
      
      Unfortunately some functions actually grow bigger.
      This is a seemingly random artefact of code generation with the register
      allocator being used differently. gcc decides that some variable
      needs to live in the new r8+ registers and every access now requires a REX
      prefix. Or it is shifted into r12, so the [r12+0] addressing mode has to be
      used, which is longer than [r8].
      
      However, overall balance is in negative direction:
      
      	add/remove: 0/0 grow/shrink: 70/598 up/down: 396/-2126 (-1730)
      	function                                     old     new   delta
      	nfsd4_lock                                  3886    3959     +73
      	tipc_link_build_proto_msg                   1096    1140     +44
      	mac80211_hwsim_new_radio                    2776    2808     +32
      	tipc_mon_rcv                                1032    1058     +26
      	svcauth_gss_legacy_init                     1413    1429     +16
      	tipc_bcbase_select_primary                   379     392     +13
      	nfsd4_exchange_id                           1247    1260     +13
      	nfsd4_setclientid_confirm                    782     793     +11
      		...
      	put_client_renew_locked                      494     480     -14
      	ip_set_sockfn_get                            730     716     -14
      	geneve_sock_add                              829     813     -16
      	nfsd4_sequence_done                          721     703     -18
      	nlmclnt_lookup_host                          708     686     -22
      	nfsd4_lockt                                 1085    1063     -22
      	nfs_get_client                              1077    1050     -27
      	tcf_bpf_init                                1106    1076     -30
      	nfsd4_encode_fattr                          5997    5930     -67
      	Total: Before=154856051, After=154854321, chg -0.00%
      Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • udp: enable busy polling for all sockets · e68b6e50
      Eric Dumazet committed
      UDP busy polling is restricted to connected UDP sockets.
      
      This is because sk_busy_loop() only takes care of one NAPI context.
      
      There are cases where it could be extended.
      
      1) Some hosts receive traffic on a single NIC, with one RX queue.
      
      2) Some applications use SO_REUSEPORT and associated BPF filter
         to split the incoming traffic on one UDP socket per RX
         queue/thread/cpu

      3) Some UDP sockets are used to send/receive traffic for one flow, but
         they do not bother with connect()
      
      This patch records the napi_id of first received skb, giving more
      reach to busy polling.
      
      Tested:
      
      lpaa23:~# echo 70 >/proc/sys/net/core/busy_read
      lpaa24:~# echo 70 >/proc/sys/net/core/busy_read
      
      lpaa23:~# for f in `seq 1 10`; do ./super_netperf 1 -H lpaa24 -t UDP_RR -l 5; done
      
      Before patch :
         27867   28870   37324   41060   41215
         36764   36838   44455   41282   43843
      After patch :
         73920   73213   70147   74845   71697
         68315   68028   75219   70082   73707
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
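
      Recording the NAPI id of only the first received packet can be
      sketched as a "set once" helper. struct sock_model and
      mark_napi_id_once() are hypothetical, and treating 0 as "not yet
      recorded" is an assumption of this sketch.

```c
/* Hypothetical socket model: 0 means no NAPI id has been recorded. */
struct sock_model {
    unsigned int napi_id;
};

/* Remember the NAPI id of the first packet only; later packets, possibly
 * arriving through another RX queue, leave the recorded id untouched. */
static void mark_napi_id_once(struct sock_model *sk, unsigned int skb_napi_id)
{
    if (sk->napi_id == 0)
        sk->napi_id = skb_napi_id;
}
```

      After packets with ids 7 and then 9 arrive, the socket still polls the
      context with id 7.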
    • ip6_tunnel: disable caching when the traffic class is inherited · b5c2d495
      Paolo Abeni committed
      If an ip6 tunnel is configured to inherit the traffic class from
      the inner header, the dst_cache must be disabled or it will foul
      the policy routing.
      
      The issue has apparently been there since at least Linux-2.6.12-rc2.
      Reported-by: Liam McBirnie <liam.mcbirnie@boeing.com>
      Cc: Liam McBirnie <liam.mcbirnie@boeing.com>
      Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
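
      The decision of when a cached route may be reused can be sketched as a
      simple flag test. The flag name and value, struct tunnel_model and
      tunnel_may_use_dst_cache() are hypothetical and do not reproduce the
      kernel's definitions.

```c
#include <stdbool.h>

/* Hypothetical flag bit for this sketch: the tunnel copies the traffic
 * class from the inner header. */
#define TNL_F_INHERIT_TCLASS 0x2u

struct tunnel_model {
    unsigned int flags;
};

/* When the traffic class is inherited it can differ from packet to packet,
 * and policy routing may pick a different route for each value, so one
 * cached route cannot be reused for all packets. */
static bool tunnel_may_use_dst_cache(const struct tunnel_model *t)
{
    return !(t->flags & TNL_F_INHERIT_TCLASS);
}
```

      A tunnel with a fixed traffic class keeps using the cache, while one
      that inherits the traffic class performs a fresh lookup per packet.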
  17. 17 Nov, 2016 1 commit
    • ipv6: sr: add option to control lwtunnel support · 46738b13
      David Lebrun committed
      This patch adds a new option CONFIG_IPV6_SEG6_LWTUNNEL to enable/disable
      support of encapsulation with the lightweight tunnels. When this option
      is enabled, CONFIG_LWTUNNEL is automatically selected.
      
      Fix commit 6c8702c6 ("ipv6: sr: add support for SRH encapsulation and injection with lwtunnels")
      
      Without a proper option to control lwtunnel support for SR-IPv6, if
      CONFIG_LWTUNNEL=n then the IPv6 initialization fails as a consequence
      of seg6_iptunnel_init() failure with EOPNOTSUPP:
      
      NET: Registered protocol family 10
      IPv6: Attempt to unregister permanent protocol 6
      IPv6: Attempt to unregister permanent protocol 136
      IPv6: Attempt to unregister permanent protocol 17
      NET: Unregistered protocol family 10
      
      Tested (compiling, booting, and loading ipv6 module when relevant)
      with possible combinations of CONFIG_IPV6={y,m,n},
      CONFIG_IPV6_SEG6_LWTUNNEL={y,n} and CONFIG_LWTUNNEL={y,n}.
      Reported-by: Lorenzo Colitti <lorenzo@google.com>
      Suggested-by: Roopa Prabhu <roopa@cumulusnetworks.com>
      Signed-off-by: David Lebrun <david.lebrun@uclouvain.be>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  18. 16 Nov, 2016 2 commits