1. 14 7月, 2018 3 次提交
    • Y
      tcp: remove DELAYED ACK events in DCTCP · a69258f7
      Yuchung Cheng 提交于
      After fixing the way DCTCP tracking delayed ACKs, the delayed-ACK
      related callbacks are no longer needed
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Acked-by: NLawrence Brakmo <brakmo@fb.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a69258f7
    • Y
      tcp: fix dctcp delayed ACK schedule · b0c05d0e
      Yuchung Cheng 提交于
      Previously, when a data segment was sent an ACK was piggybacked
      on the data segment without generating a CA_EVENT_NON_DELAYED_ACK
      event to notify congestion control modules. So the DCTCP
      ca->delayed_ack_reserved flag could incorrectly stay set when
      in fact there were no delayed ACKs being reserved. This could result
      in sending a special ECN notification ACK that carries an older
      ACK sequence, when in fact there was no need for such an ACK.
      DCTCP keeps track of the delayed ACK status with its own separate
      state ca->delayed_ack_reserved. Previously it may accidentally cancel
      the delayed ACK without updating this field upon sending a special
      ACK that carries a older ACK sequence. This inconsistency would
      lead to DCTCP receiver never acknowledging the latest data until the
      sender times out and retry in some cases.
      
      Packetdrill script (provided by Larry Brakmo)
      
      0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
      0.000 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
      0.000 setsockopt(3, SOL_TCP, TCP_CONGESTION, "dctcp", 5) = 0
      0.000 bind(3, ..., ...) = 0
      0.000 listen(3, 1) = 0
      
      0.100 < [ect0] SEW 0:0(0) win 32792 <mss 1000,sackOK,nop,nop,nop,wscale 7>
      0.100 > SE. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 8>
      0.110 < [ect0] . 1:1(0) ack 1 win 257
      0.200 accept(3, ..., ...) = 4
      
      0.200 < [ect0] . 1:1001(1000) ack 1 win 257
      0.200 > [ect01] . 1:1(0) ack 1001
      
      0.200 write(4, ..., 1) = 1
      0.200 > [ect01] P. 1:2(1) ack 1001
      
      0.200 < [ect0] . 1001:2001(1000) ack 2 win 257
      0.200 write(4, ..., 1) = 1
      0.200 > [ect01] P. 2:3(1) ack 2001
      
      0.200 < [ect0] . 2001:3001(1000) ack 3 win 257
      0.200 < [ect0] . 3001:4001(1000) ack 3 win 257
      0.200 > [ect01] . 3:3(0) ack 4001
      
      0.210 < [ce] P. 4001:4501(500) ack 3 win 257
      
      +0.001 read(4, ..., 4500) = 4500
      +0 write(4, ..., 1) = 1
      +0 > [ect01] PE. 3:4(1) ack 4501
      
      +0.010 < [ect0] W. 4501:5501(1000) ack 4 win 257
      // Previously the ACK sequence below would be 4501, causing a long RTO
      +0.040~+0.045 > [ect01] . 4:4(0) ack 5501   // delayed ack
      
      +0.311 < [ect0] . 5501:6501(1000) ack 4 win 257  // More data
      +0 > [ect01] . 4:4(0) ack 6501     // now acks everything
      
      +0.500 < F. 9501:9501(0) ack 4 win 257
      Reported-by: NLarry Brakmo <brakmo@fb.com>
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Acked-by: NLawrence Brakmo <brakmo@fb.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b0c05d0e
    • S
      skbuff: Unconditionally copy pfmemalloc in __skb_clone() · e78bfb07
      Stefano Brivio 提交于
      Commit 8b700862 ("net: Don't copy pfmemalloc flag in
      __copy_skb_header()") introduced a different handling for the
      pfmemalloc flag in copy and clone paths.
      
      In __skb_clone(), now, the flag is set only if it was set in the
      original skb, but not cleared if it wasn't. This is wrong and
      might lead to socket buffers being flagged with pfmemalloc even
      if the skb data wasn't allocated from pfmemalloc reserves. Copy
      the flag instead of ORing it.
      Reported-by: NSabrina Dubroca <sd@queasysnail.net>
      Fixes: 8b700862 ("net: Don't copy pfmemalloc flag in __copy_skb_header()")
      Signed-off-by: NStefano Brivio <sbrivio@redhat.com>
      Tested-by: NSabrina Dubroca <sd@queasysnail.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e78bfb07
  2. 13 7月, 2018 10 次提交
    • M
      xsk: do not return EMSGSIZE in copy mode for packets larger than MTU · 09210c4b
      Magnus Karlsson 提交于
      This patch stops returning EMSGSIZE from sendmsg in copy mode when the
      size of the packet is larger than the MTU. Just send it to the device
      so that it will drop it as in zero-copy mode. This makes the error
      reporting consistent between copy mode and zero-copy mode.
      
      Fixes: 35fcde7f ("xsk: support for Tx")
      Signed-off-by: NMagnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      09210c4b
    • M
      xsk: always return ENOBUFS from sendmsg if there is no TX queue · 6efb4436
      Magnus Karlsson 提交于
      This patch makes sure ENOBUFS is always returned from sendmsg if there
      is no TX queue configured. This was not the case for zero-copy
      mode. With this patch this error reporting is consistent between copy
      mode and zero-copy mode.
      
      Fixes: ac98d8aa ("xsk: wire upp Tx zero-copy functions")
      Signed-off-by: NMagnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      6efb4436
    • M
      xsk: do not return EAGAIN from sendmsg when completion queue is full · 9684f5e7
      Magnus Karlsson 提交于
      This patch stops returning EAGAIN in TX copy mode when the completion
      queue is full as zero-copy does not do this. Instead this situation
      can be detected by comparing the head and tail pointers of the
      completion queue in both modes. In any case, EAGAIN was not the
      correct error code here since no amount of calling sendmsg will solve
      the problem. Only consuming one or more messages on the completion
      queue will fix this.
      
      With this patch, the error reporting becomes consistent between copy
      mode and zero-copy mode.
      
      Fixes: 35fcde7f ("xsk: support for Tx")
      Signed-off-by: NMagnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      9684f5e7
    • M
      xsk: do not return ENXIO from TX copy mode · 509d7648
      Magnus Karlsson 提交于
      This patch removes the ENXIO return code from TX copy-mode when
      someone has forcefully changed the number of queues on the device so
      that the queue bound to the socket is no longer available. Just
      silently stop sending anything as in zero-copy mode so the error
      reporting gets consistent between the two modes.
      
      Fixes: 35fcde7f ("xsk: support for Tx")
      Signed-off-by: NMagnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      509d7648
    • W
      packet: reset network header if packet shorter than ll reserved space · 993675a3
      Willem de Bruijn 提交于
      If variable length link layer headers result in a packet shorter
      than dev->hard_header_len, reset the network header offset. Else
      skb->mac_len may exceed skb->len after skb_mac_reset_len.
      
      packet_sendmsg_spkt already has similar logic.
      
      Fixes: b84bbaf7 ("packet: in packet_snd start writing at link layer allocation")
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      993675a3
    • W
      nsh: set mac len based on inner packet · bab2c80e
      Willem de Bruijn 提交于
      When pulling the NSH header in nsh_gso_segment, set the mac length
      based on the encapsulated packet type.
      
      skb_reset_mac_len computes an offset to the network header, which
      here still points to the outer packet:
      
        >     skb_reset_network_header(skb);
        >     [...]
        >     __skb_pull(skb, nsh_len);
        >     skb_reset_mac_header(skb);    // now mac hdr starts nsh_len == 8B after net hdr
        >     skb_reset_mac_len(skb);       // mac len = net hdr - mac hdr == (u16) -8 == 65528
        >     [..]
        >     skb_mac_gso_segment(skb, ..)
      
      Link: http://lkml.kernel.org/r/CAF=yD-KeAcTSOn4AxirAxL8m7QAS8GBBe1w09eziYwvPbbUeYA@mail.gmail.com
      Reported-by: syzbot+7b9ed9872dab8c32305d@syzkaller.appspotmail.com
      Fixes: c411ed85 ("nsh: add GSO support")
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      Acked-by: NJiri Benc <jbenc@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      bab2c80e
    • S
      net: Don't copy pfmemalloc flag in __copy_skb_header() · 8b700862
      Stefano Brivio 提交于
      The pfmemalloc flag indicates that the skb was allocated from
      the PFMEMALLOC reserves, and the flag is currently copied on skb
      copy and clone.
      
      However, an skb copied from an skb flagged with pfmemalloc
      wasn't necessarily allocated from PFMEMALLOC reserves, and on
      the other hand an skb allocated that way might be copied from an
      skb that wasn't.
      
      So we should not copy the flag on skb copy, and rather decide
      whether to allow an skb to be associated with sockets unrelated
      to page reclaim depending only on how it was allocated.
      
      Move the pfmemalloc flag before headers_start[0] using an
      existing 1-bit hole, so that __copy_skb_header() doesn't copy
      it.
      
      When cloning, we'll now take care of this flag explicitly,
      contravening to the warning comment of __skb_clone().
      
      While at it, restore the newline usage introduced by commit
      b1937227 ("net: reorganize sk_buff for faster
      __copy_skb_header()") to visually separate bytes used in
      bitfields after headers_start[0], that was gone after commit
      a9e419dc ("netfilter: merge ctinfo into nfct pointer storage
      area"), and describe the pfmemalloc flag in the kernel-doc
      structure comment.
      
      This doesn't change the size of sk_buff or cacheline boundaries,
      but consolidates the 15 bits hole before tc_index into a 2 bytes
      hole before csum, that could now be filled more easily.
      Reported-by: NPatrick Talbert <ptalbert@redhat.com>
      Fixes: c93bdd0e ("netvm: allow skb allocation to use PFMEMALLOC reserves")
      Signed-off-by: NStefano Brivio <sbrivio@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8b700862
    • S
      tcp: allow user to create repair socket without window probes · 70b7ff13
      Stefan Baranoff 提交于
      Under rare conditions where repair code may be used it is possible that
      window probes are either unnecessary or undesired. If the user knows that
      window probes are not wanted or needed this change allows them to skip
      sending them when a socket comes out of repair.
      Signed-off-by: NStefan Baranoff <sbaranoff@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      70b7ff13
    • S
      tcp: fix sequence numbers for repaired sockets re-using TIME-WAIT sockets · 21684dc4
      Stefan Baranoff 提交于
      This patch fixes a bug where the sequence numbers of a socket created using
      TCP repair functionality are lower than set after connect is called.
      This occurs when the repair socket overlaps with a TIME-WAIT socket and
      triggers the re-use code. The amount lower is equal to the number of times
      that a particular IP/port set is re-used and then put back into TIME-WAIT.
      Re-using the first time the sequence number is 1 lower, closing that socket
      and then re-opening (with repair) a new socket with the same addresses/ports
      puts the sequence number 2 lower than set via setsockopt. The third time is
      3 lower, etc. I have not tested what the limit of this acrewal is, if any.
      
      The fix is, if a socket is in repair mode, to respect the already set
      sequence number and timestamp when it would have already re-used the
      TIME-WAIT socket.
      Signed-off-by: NStefan Baranoff <sbaranoff@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      21684dc4
    • J
      sch_fq_codel: zero q->flows_cnt when fq_codel_init fails · 83fe6b87
      Jacob Keller 提交于
      When fq_codel_init fails, qdisc_create_dflt will cleanup by using
      qdisc_destroy. This function calls the ->reset() op prior to calling the
      ->destroy() op.
      
      Unfortunately, during the failure flow for sch_fq_codel, the ->flows
      parameter is not initialized, so the fq_codel_reset function will null
      pointer dereference.
      
         kernel: BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
         kernel: IP: fq_codel_reset+0x58/0xd0 [sch_fq_codel]
         kernel: PGD 0 P4D 0
         kernel: Oops: 0000 [#1] SMP PTI
         kernel: Modules linked in: i40iw i40e(OE) xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack tun bridge stp llc devlink ebtable_filter ebtables ip6table_filter ip6_tables rpcrdma ib_isert iscsi_target_mod sunrpc ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel intel_cstate iTCO_wdt iTCO_vendor_support intel_uncore ib_core intel_rapl_perf mei_me mei joydev i2c_i801 lpc_ich ioatdma shpchp wmi sch_fq_codel xfs libcrc32c mgag200 ixgbe drm_kms_helper isci ttm firewire_ohci
         kernel:  mdio drm igb libsas crc32c_intel firewire_core ptp pps_core scsi_transport_sas crc_itu_t dca i2c_algo_bit ipmi_si ipmi_devintf ipmi_msghandler [last unloaded: i40e]
         kernel: CPU: 10 PID: 4219 Comm: ip Tainted: G           OE    4.16.13custom-fq-codel-test+ #3
         kernel: Hardware name: Intel Corporation S2600CO/S2600CO, BIOS SE5C600.86B.02.05.0004.051120151007 05/11/2015
         kernel: RIP: 0010:fq_codel_reset+0x58/0xd0 [sch_fq_codel]
         kernel: RSP: 0018:ffffbfbf4c1fb620 EFLAGS: 00010246
         kernel: RAX: 0000000000000400 RBX: 0000000000000000 RCX: 00000000000005b9
         kernel: RDX: 0000000000000000 RSI: ffff9d03264a60c0 RDI: ffff9cfd17b31c00
         kernel: RBP: 0000000000000001 R08: 00000000000260c0 R09: ffffffffb679c3e9
         kernel: R10: fffff1dab06a0e80 R11: ffff9cfd163af800 R12: ffff9cfd17b31c00
         kernel: R13: 0000000000000001 R14: ffff9cfd153de600 R15: 0000000000000001
         kernel: FS:  00007fdec2f92800(0000) GS:ffff9d0326480000(0000) knlGS:0000000000000000
         kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
         kernel: CR2: 0000000000000008 CR3: 0000000c1956a006 CR4: 00000000000606e0
         kernel: Call Trace:
         kernel:  qdisc_destroy+0x56/0x140
         kernel:  qdisc_create_dflt+0x8b/0xb0
         kernel:  mq_init+0xc1/0xf0
         kernel:  qdisc_create_dflt+0x5a/0xb0
         kernel:  dev_activate+0x205/0x230
         kernel:  __dev_open+0xf5/0x160
         kernel:  __dev_change_flags+0x1a3/0x210
         kernel:  dev_change_flags+0x21/0x60
         kernel:  do_setlink+0x660/0xdf0
         kernel:  ? down_trylock+0x25/0x30
         kernel:  ? xfs_buf_trylock+0x1a/0xd0 [xfs]
         kernel:  ? rtnl_newlink+0x816/0x990
         kernel:  ? _xfs_buf_find+0x327/0x580 [xfs]
         kernel:  ? _cond_resched+0x15/0x30
         kernel:  ? kmem_cache_alloc+0x20/0x1b0
         kernel:  ? rtnetlink_rcv_msg+0x200/0x2f0
         kernel:  ? rtnl_calcit.isra.30+0x100/0x100
         kernel:  ? netlink_rcv_skb+0x4c/0x120
         kernel:  ? netlink_unicast+0x19e/0x260
         kernel:  ? netlink_sendmsg+0x1ff/0x3c0
         kernel:  ? sock_sendmsg+0x36/0x40
         kernel:  ? ___sys_sendmsg+0x295/0x2f0
         kernel:  ? ebitmap_cmp+0x6d/0x90
         kernel:  ? dev_get_by_name_rcu+0x73/0x90
         kernel:  ? skb_dequeue+0x52/0x60
         kernel:  ? __inode_wait_for_writeback+0x7f/0xf0
         kernel:  ? bit_waitqueue+0x30/0x30
         kernel:  ? fsnotify_grab_connector+0x3c/0x60
         kernel:  ? __sys_sendmsg+0x51/0x90
         kernel:  ? do_syscall_64+0x74/0x180
         kernel:  ? entry_SYSCALL_64_after_hwframe+0x3d/0xa2
         kernel: Code: 00 00 48 89 87 00 02 00 00 8b 87 a0 01 00 00 85 c0 0f 84 84 00 00 00 31 ed 48 63 dd 83 c5 01 48 c1 e3 06 49 03 9c 24 90 01 00 00 <48> 8b 73 08 48 8b 3b e8 6c 9a 4f f6 48 8d 43 10 48 c7 03 00 00
         kernel: RIP: fq_codel_reset+0x58/0xd0 [sch_fq_codel] RSP: ffffbfbf4c1fb620
         kernel: CR2: 0000000000000008
         kernel: ---[ end trace e81a62bede66274e ]---
      
      This is caused because flows_cnt is non-zero, but flows hasn't been
      initialized. fq_codel_init has left the private data in a partially
      initialized state.
      
      To fix this, reset flows_cnt to 0 when we fail to initialize.
      Additionally, to make the state more consistent, also cleanup the flows
      pointer when the allocation of backlogs fails.
      
      This fixes the NULL pointer dereference, since both the for-loop and
      memset in fq_codel_reset will be no-ops when flow_cnt is zero.
      Signed-off-by: NJacob Keller <jacob.e.keller@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      83fe6b87
  3. 12 7月, 2018 2 次提交
    • D
      bpf: fix panic due to oob in bpf_prog_test_run_skb · 6e6fddc7
      Daniel Borkmann 提交于
      sykzaller triggered several panics similar to the below:
      
        [...]
        [  248.851531] BUG: KASAN: use-after-free in _copy_to_user+0x5c/0x90
        [  248.857656] Read of size 985 at addr ffff8808017ffff2 by task a.out/1425
        [...]
        [  248.865902] CPU: 1 PID: 1425 Comm: a.out Not tainted 4.18.0-rc4+ #13
        [  248.865903] Hardware name: Supermicro SYS-5039MS-H12TRF/X11SSE-F, BIOS 2.1a 03/08/2018
        [  248.865905] Call Trace:
        [  248.865910]  dump_stack+0xd6/0x185
        [  248.865911]  ? show_regs_print_info+0xb/0xb
        [  248.865913]  ? printk+0x9c/0xc3
        [  248.865915]  ? kmsg_dump_rewind_nolock+0xe4/0xe4
        [  248.865919]  print_address_description+0x6f/0x270
        [  248.865920]  kasan_report+0x25b/0x380
        [  248.865922]  ? _copy_to_user+0x5c/0x90
        [  248.865924]  check_memory_region+0x137/0x190
        [  248.865925]  kasan_check_read+0x11/0x20
        [  248.865927]  _copy_to_user+0x5c/0x90
        [  248.865930]  bpf_test_finish.isra.8+0x4f/0xc0
        [  248.865932]  bpf_prog_test_run_skb+0x6a0/0xba0
        [...]
      
      After scrubbing the BPF prog a bit from the noise, turns out it called
      bpf_skb_change_head() for the lwt_xmit prog with headroom of 2. Nothing
      wrong in that, however, this was run with repeat >> 0 in bpf_prog_test_run_skb()
      and the same skb thus keeps changing until the pskb_expand_head() called
      from skb_cow() keeps bailing out in atomic alloc context with -ENOMEM.
      So upon return we'll basically have 0 headroom left yet blindly do the
      __skb_push() of 14 bytes and keep copying data from there in bpf_test_finish()
      out of bounds. Fix to check if we have enough headroom and if pskb_expand_head()
      fails, bail out with error.
      
      Another bug independent of this fix (but related in triggering above) is
      that BPF_PROG_TEST_RUN should be reworked to reset the skb/xdp buffer to
      it's original state from input as otherwise repeating the same test in a
      loop won't work for benchmarking when underlying input buffer is getting
      changed by the prog each time and reused for the next run leading to
      unexpected results.
      
      Fixes: 1cf1cae9 ("bpf: introduce BPF_PROG_TEST_RUN command")
      Reported-by: syzbot+709412e651e55ed96498@syzkaller.appspotmail.com
      Reported-by: syzbot+54f39d6ab58f39720a55@syzkaller.appspotmail.com
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      6e6fddc7
    • M
      bpf: fix availability probing for seg6 helpers · 61d76980
      Mathieu Xhonneux 提交于
      bpf_lwt_seg6_* helpers require CONFIG_IPV6_SEG6_BPF, and currently
      return -EOPNOTSUPP to indicate unavailability. This patch forces the
      BPF verifier to reject programs using these helpers when
      !CONFIG_IPV6_SEG6_BPF, allowing users to more easily probe if they are
      available or not.
      Signed-off-by: NMathieu Xhonneux <m.xhonneux@gmail.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      61d76980
  4. 10 7月, 2018 3 次提交
    • D
      bpf: fix ldx in ld_abs rewrite for large offsets · 59ee4129
      Daniel Borkmann 提交于
      Mark reported that syzkaller triggered a KASAN detected slab-out-of-bounds
      bug in ___bpf_prog_run() with a BPF_LD | BPF_ABS word load at offset 0x8001.
      After further investigation it became clear that the issue was the
      BPF_LDX_MEM() which takes offset as an argument whereas it cannot encode
      larger than S16_MAX offsets into it. For this synthetical case we need to
      move the full address into tmp register instead and do the LDX without
      immediate value.
      
      Fixes: e0cea7ce ("bpf: implement ld_abs/ld_ind in native bpf")
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Reported-by: NMark Rutland <mark.rutland@arm.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      59ee4129
    • F
      netfilter: ipv6: nf_defrag: drop skb dst before queueing · 84379c9a
      Florian Westphal 提交于
      Eric Dumazet reports:
       Here is a reproducer of an annoying bug detected by syzkaller on our production kernel
       [..]
       ./b78305423 enable_conntrack
       Then :
       sleep 60
       dmesg | tail -10
       [  171.599093] unregister_netdevice: waiting for lo to become free. Usage count = 2
       [  181.631024] unregister_netdevice: waiting for lo to become free. Usage count = 2
       [  191.687076] unregister_netdevice: waiting for lo to become free. Usage count = 2
       [  201.703037] unregister_netdevice: waiting for lo to become free. Usage count = 2
       [  211.711072] unregister_netdevice: waiting for lo to become free. Usage count = 2
       [  221.959070] unregister_netdevice: waiting for lo to become free. Usage count = 2
      
      Reproducer sends ipv6 fragment that hits nfct defrag via LOCAL_OUT hook.
      skb gets queued until frag timer expiry -- 1 minute.
      
      Normally nf_conntrack_reasm gets called during prerouting, so skb has
      no dst yet which might explain why this wasn't spotted earlier.
      Reported-by: NEric Dumazet <eric.dumazet@gmail.com>
      Reported-by: NJohn Sperbeck <jsperbeck@google.com>
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Tested-by: NEric Dumazet <edumazet@google.com>
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      84379c9a
    • A
      netfilter: nf_conntrack: Fix possible possible crash on module loading. · 2045cdfa
      Andrey Ryabinin 提交于
      Loading the nf_conntrack module with doubled hashsize parameter, i.e.
      	  modprobe nf_conntrack hashsize=12345 hashsize=12345
      causes NULL-ptr deref.
      
      If 'hashsize' specified twice, the nf_conntrack_set_hashsize() function
      will be called also twice.
      The first nf_conntrack_set_hashsize() call will set the
      'nf_conntrack_htable_size' variable:
      
      	nf_conntrack_set_hashsize()
      		...
      		/* On boot, we can set this without any fancy locking. */
      		if (!nf_conntrack_htable_size)
      			return param_set_uint(val, kp);
      
      But on the second invocation, the nf_conntrack_htable_size is already set,
      so the nf_conntrack_set_hashsize() will take a different path and call
      the nf_conntrack_hash_resize() function. Which will crash on the attempt
      to dereference 'nf_conntrack_hash' pointer:
      
      	BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
      	RIP: 0010:nf_conntrack_hash_resize+0x255/0x490 [nf_conntrack]
      	Call Trace:
      	 nf_conntrack_set_hashsize+0xcd/0x100 [nf_conntrack]
      	 parse_args+0x1f9/0x5a0
      	 load_module+0x1281/0x1a50
      	 __se_sys_finit_module+0xbe/0xf0
      	 do_syscall_64+0x7c/0x390
      	 entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Fix this, by checking !nf_conntrack_hash instead of
      !nf_conntrack_htable_size. nf_conntrack_hash will be initialized only
      after the module loaded, so the second invocation of the
      nf_conntrack_set_hashsize() won't crash, it will just reinitialize
      nf_conntrack_htable_size again.
      Signed-off-by: NAndrey Ryabinin <aryabinin@virtuozzo.com>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      2045cdfa
  5. 09 7月, 2018 1 次提交
  6. 08 7月, 2018 7 次提交
    • E
      tcp: cleanup copied_seq and urg_data in tcp_disconnect · 6508b678
      Eric Dumazet 提交于
      tcp_zerocopy_receive() relies on tcp_inq() to limit number of bytes
      requested by user.
      
      syzbot found that after tcp_disconnect(), tcp_inq() was returning
      a stale value (number of bytes in queue before the disconnect).
      
      Note that after this patch, ioctl(fd, SIOCINQ, &val) is also fixed
      and returns 0, so this might be a candidate for all known linux kernels.
      
      While we are at this, we probably also should clear urg_data to
      avoid other syzkaller reports after it discovers how to deal with
      urgent data.
      
      syzkaller repro :
      
      socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 3
      bind(3, {sa_family=AF_INET, sin_port=htons(20000), sin_addr=inet_addr("224.0.0.1")}, 16) = 0
      connect(3, {sa_family=AF_INET, sin_port=htons(20000), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
      send(3, ..., 4096, 0) = 4096
      connect(3, {sa_family=AF_UNSPEC, sa_data="\0\0\0\0\0\0\0\0\0\0\0\0\0\0"}, 128) = 0
      getsockopt(3, SOL_TCP, TCP_ZEROCOPY_RECEIVE, ..., [16]) = 0 // CRASH
      
      Fixes: 05255b82 ("tcp: add TCP_ZEROCOPY_RECEIVE support for zerocopy receive")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6508b678
    • P
      ipfrag: really prevent allocation on netns exit · f6f2a4a2
      Paolo Abeni 提交于
      Setting the low threshold to 0 has no effect on frags allocation,
      we need to clear high_thresh instead.
      
      The code was pre-existent to commit 648700f7 ("inet: frags:
      use rhashtables for reassembly units"), but before the above,
      such assignment had a different role: prevent concurrent eviction
      from the worker and the netns cleanup helper.
      
      Fixes: 648700f7 ("inet: frags: use rhashtables for reassembly units")
      Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f6f2a4a2
    • L
      net: diag: Don't double-free TCP_NEW_SYN_RECV sockets in tcp_abort · acc2cf4e
      Lorenzo Colitti 提交于
      When tcp_diag_destroy closes a TCP_NEW_SYN_RECV socket, it first
      frees it by calling inet_csk_reqsk_queue_drop_and_and_put in
      tcp_abort, and then frees it again by calling sock_gen_put.
      
      Since tcp_abort only has one caller, and all the other codepaths
      in tcp_abort don't free the socket, just remove the free in that
      function.
      
      Cc: David Ahern <dsa@cumulusnetworks.com>
      Tested: passes Android sock_diag_test.py, which exercises this codepath
      Fixes: d7226c7a ("net: diag: Fix refcnt leak in error path destroying socket")
      Signed-off-by: NLorenzo Colitti <lorenzo@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reviewed-by: NDavid Ahern <dsa@cumulusnetworks.com>
      Tested-by: NDavid Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      acc2cf4e
    • D
      net/ipv4: Set oif in fib_compute_spec_dst · e7372197
      David Ahern 提交于
      Xin reported that icmp replies may not use the address on the device the
      echo request is received if the destination address is broadcast. Instead
      a route lookup is done without considering VRF context. Fix by setting
      oif in flow struct to the master device if it is enslaved. That directs
      the lookup to the VRF table. If the device is not enslaved, oif is still
      0 so no affect.
      
      Fixes: cd2fbe1b ("net: Use VRF device index for lookups on RX")
      Reported-by: NXin Long <lucien.xin@gmail.com>
      Signed-off-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e7372197
    • T
      xdp: XDP_REDIRECT should check IFF_UP and MTU · d8d7218a
      Toshiaki Makita 提交于
      Otherwise we end up with attempting to send packets from down devices
      or to send oversized packets, which may cause unexpected driver/device
      behaviour. Generic XDP has already done this check, so reuse the logic
      in native XDP.
      
      Fixes: 814abfab ("xdp: add bpf_redirect helper function")
      Signed-off-by: NToshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      d8d7218a
    • J
      bpf: sockmap, convert bpf_compute_data_pointers to bpf_*_sk_skb · 0ea488ff
      John Fastabend 提交于
      In commit
      
        'bpf: bpf_compute_data uses incorrect cb structure' (8108a775)
      
      we added the routine bpf_compute_data_end_sk_skb() to compute the
      correct data_end values, but this has since been lost. In kernel
      v4.14 this was correct and the above patch was applied in it
      entirety. Then when v4.14 was merged into v4.15-rc1 net-next tree
      we lost the piece that renamed bpf_compute_data_pointers to the
      new function bpf_compute_data_end_sk_skb. This was done here,
      
      e1ea2f98 ("Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net")
      
      When it conflicted with the following rename patch,
      
      6aaae2b6 ("bpf: rename bpf_compute_data_end into bpf_compute_data_pointers")
      
      Finally, after a refactor I thought even the function
      bpf_compute_data_end_sk_skb() was no longer needed and it was
      erroneously removed.
      
      However, we never reverted the sk_skb_convert_ctx_access() usage of
      tcp_skb_cb which had been committed and survived the merge conflict.
      Here we fix this by adding back the helper and *_data_end_sk_skb()
      usage. Using the bpf_skc_data_end mapping is not correct because it
      expects a qdisc_skb_cb object but at the sock layer this is not the
      case. Even though it happens to work here because we don't overwrite
      any data in-use at the socket layer and the cb structure is cleared
      later this has potential to create some subtle issues. But, even
      more concretely the filter.c access check uses tcp_skb_cb.
      
      And by some act of chance though,
      
      struct bpf_skb_data_end {
              struct qdisc_skb_cb        qdisc_cb;             /*     0    28 */
      
              /* XXX 4 bytes hole, try to pack */
      
              void *                     data_meta;            /*    32     8 */
              void *                     data_end;             /*    40     8 */
      
              /* size: 48, cachelines: 1, members: 3 */
              /* sum members: 44, holes: 1, sum holes: 4 */
              /* last cacheline: 48 bytes */
      };
      
      and then tcp_skb_cb,
      
      struct tcp_skb_cb {
      	[...]
                      struct {
                              __u32      flags;                /*    24     4 */
                              struct sock * sk_redir;          /*    32     8 */
                              void *     data_end;             /*    40     8 */
                      } bpf;                                   /*          24 */
              };
      
      So when we use offset_of() to track down the byte offset we get 40 in
      either case and everything continues to work. Fix this mess and use
      correct structures its unclear how long this might actually work for
      until someone moves the structs around.
      Reported-by: NMartin KaFai Lau <kafai@fb.com>
      Fixes: e1ea2f98 ("Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net")
      Fixes: 6aaae2b6 ("bpf: rename bpf_compute_data_end into bpf_compute_data_pointers")
      Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      0ea488ff
    • J
      bpf: fix sk_skb programs without skb->dev assigned · 0c6bc6e5
      John Fastabend 提交于
      Multiple BPF helpers in use by sk_skb programs calculate the max
      skb length using the __bpf_skb_max_len function. However, this
      calculates the max length using the skb->dev pointer which can be
      NULL when an sk_skb program is paired with an sk_msg program.
      
      To force this a sk_msg program needs to redirect into the ingress
      path of a sock with an attach sk_skb program. Then the the sk_skb
      program would need to call one of the helpers that adjust the skb
      size.
      
      To fix the null ptr dereference use SKB_MAX_ALLOC size if no dev
      is available.
      
      Fixes: 8934ce2f ("bpf: sockmap redirect ingress support")
      Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      0c6bc6e5
  7. 07 7月, 2018 8 次提交
    • D
      net/sched: act_tunnel_key: fix NULL dereference when 'goto chain' is used · 38230a3e
      Davide Caratti 提交于
      the control action in the common member of struct tcf_tunnel_key must be a
      valid value, as it can contain the chain index when 'goto chain' is used.
      Ensure that the control action can be read as x->tcfa_action, when x is a
      pointer to struct tc_action and x->ops->type is TCA_ACT_TUNNEL_KEY, to
      prevent the following command:
      
       # tc filter add dev $h2 ingress protocol ip pref 1 handle 101 flower \
       > $tcflags dst_mac $h2mac action tunnel_key unset goto chain 1
      
      from causing a NULL dereference when a matching packet is received:
      
       BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
       PGD 80000001097ac067 P4D 80000001097ac067 PUD 103b0a067 PMD 0
       Oops: 0000 [#1] SMP PTI
       CPU: 0 PID: 3491 Comm: mausezahn Tainted: G            E     4.18.0-rc2.auguri+ #421
       Hardware name: Hewlett-Packard HP Z220 CMT Workstation/1790, BIOS K51 v01.58 02/07/2013
       RIP: 0010:tcf_action_exec+0xb8/0x100
       Code: 00 00 00 20 74 1d 83 f8 03 75 09 49 83 c4 08 4d 39 ec 75 bc 48 83 c4 10 5b 5d 41 5c 41 5d 41 5e 41 5f c3 49 8b 97 a8 00 00 00 <48> 8b 12 48 89 55 00 48 83 c4 10 5b 5d 41 5c 41 5d 41 5e 41 5f c3
       RSP: 0018:ffff95145ea03c40 EFLAGS: 00010246
       RAX: 0000000020000001 RBX: ffff9514499e5800 RCX: 0000000000000001
       RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000000000
       RBP: ffff95145ea03e60 R08: 0000000000000000 R09: ffff95145ea03c9c
       R10: ffff95145ea03c78 R11: 0000000000000008 R12: ffff951456a69800
       R13: ffff951456a69808 R14: 0000000000000001 R15: ffff95144965ee40
       FS:  00007fd67ee11740(0000) GS:ffff95145ea00000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 0000000000000000 CR3: 00000001038a2006 CR4: 00000000001606f0
       Call Trace:
        <IRQ>
        fl_classify+0x1ad/0x1c0 [cls_flower]
        ? __update_load_avg_se.isra.47+0x1ca/0x1d0
        ? __update_load_avg_se.isra.47+0x1ca/0x1d0
        ? update_load_avg+0x665/0x690
        ? update_load_avg+0x665/0x690
        ? kmem_cache_alloc+0x38/0x1c0
        tcf_classify+0x89/0x140
        __netif_receive_skb_core+0x5ea/0xb70
        ? enqueue_entity+0xd0/0x270
        ? process_backlog+0x97/0x150
        process_backlog+0x97/0x150
        net_rx_action+0x14b/0x3e0
        __do_softirq+0xde/0x2b4
        do_softirq_own_stack+0x2a/0x40
        </IRQ>
        do_softirq.part.18+0x49/0x50
        __local_bh_enable_ip+0x49/0x50
        __dev_queue_xmit+0x4ab/0x8a0
        ? wait_woken+0x80/0x80
        ? packet_sendmsg+0x38f/0x810
        ? __dev_queue_xmit+0x8a0/0x8a0
        packet_sendmsg+0x38f/0x810
        sock_sendmsg+0x36/0x40
        __sys_sendto+0x10e/0x140
        ? do_vfs_ioctl+0xa4/0x630
        ? syscall_trace_enter+0x1df/0x2e0
        ? __audit_syscall_exit+0x22a/0x290
        __x64_sys_sendto+0x24/0x30
        do_syscall_64+0x5b/0x180
        entry_SYSCALL_64_after_hwframe+0x44/0xa9
       RIP: 0033:0x7fd67e18dc93
       Code: 48 8b 0d 18 83 20 00 f7 d8 64 89 01 48 83 c8 ff c3 66 0f 1f 44 00 00 83 3d 59 c7 20 00 00 75 13 49 89 ca b8 2c 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 34 c3 48 83 ec 08 e8 2b f7 ff ff 48 89 04 24
       RSP: 002b:00007ffe0189b748 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
       RAX: ffffffffffffffda RBX: 00000000020ca010 RCX: 00007fd67e18dc93
       RDX: 0000000000000062 RSI: 00000000020ca322 RDI: 0000000000000003
       RBP: 00007ffe0189b780 R08: 00007ffe0189b760 R09: 0000000000000014
       R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000062
       R13: 00000000020ca322 R14: 00007ffe0189b760 R15: 0000000000000003
       Modules linked in: act_tunnel_key act_gact cls_flower sch_ingress vrf veth act_csum(E) xt_CHECKSUM iptable_mangle ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 tun bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter intel_rapl snd_hda_codec_hdmi x86_pkg_temp_thermal intel_powerclamp snd_hda_codec_realtek coretemp snd_hda_codec_generic kvm_intel kvm irqbypass snd_hda_intel crct10dif_pclmul crc32_pclmul hp_wmi ghash_clmulni_intel pcbc snd_hda_codec aesni_intel sparse_keymap rfkill snd_hda_core snd_hwdep snd_seq crypto_simd iTCO_wdt gpio_ich iTCO_vendor_support wmi_bmof cryptd mei_wdt glue_helper snd_seq_device snd_pcm pcspkr snd_timer snd i2c_i801 lpc_ich sg soundcore wmi mei_me
        mei ie31200_edac nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sd_mod sr_mod cdrom i915 video i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ahci crc32c_intel libahci serio_raw sfc libata mtd drm ixgbe mdio i2c_core e1000e dca
       CR2: 0000000000000000
       ---[ end trace 1ab8b5b5d4639dfc ]---
       RIP: 0010:tcf_action_exec+0xb8/0x100
       Code: 00 00 00 20 74 1d 83 f8 03 75 09 49 83 c4 08 4d 39 ec 75 bc 48 83 c4 10 5b 5d 41 5c 41 5d 41 5e 41 5f c3 49 8b 97 a8 00 00 00 <48> 8b 12 48 89 55 00 48 83 c4 10 5b 5d 41 5c 41 5d 41 5e 41 5f c3
       RSP: 0018:ffff95145ea03c40 EFLAGS: 00010246
       RAX: 0000000020000001 RBX: ffff9514499e5800 RCX: 0000000000000001
       RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000000000
       RBP: ffff95145ea03e60 R08: 0000000000000000 R09: ffff95145ea03c9c
       R10: ffff95145ea03c78 R11: 0000000000000008 R12: ffff951456a69800
       R13: ffff951456a69808 R14: 0000000000000001 R15: ffff95144965ee40
       FS:  00007fd67ee11740(0000) GS:ffff95145ea00000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 0000000000000000 CR3: 00000001038a2006 CR4: 00000000001606f0
       Kernel panic - not syncing: Fatal exception in interrupt
       Kernel Offset: 0x11400000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
       ---[ end Kernel panic - not syncing: Fatal exception in interrupt ]---
      
      Fixes: d0f6dd8a ("net/sched: Introduce act_tunnel_key")
      Signed-off-by: NDavide Caratti <dcaratti@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      38230a3e
    • D
      net/sched: act_csum: fix NULL dereference when 'goto chain' is used · 11a245e2
      Davide Caratti 提交于
      the control action in the common member of struct tcf_csum must be a valid
      value, as it can contain the chain index when 'goto chain' is used. Ensure
      that the control action can be read as x->tcfa_action, when x is a pointer
      to struct tc_action and x->ops->type is TCA_ACT_CSUM, to prevent the
      following command:
      
        # tc filter add dev $h2 ingress protocol ip pref 1 handle 101 flower \
        > $tcflags dst_mac $h2mac action csum ip or tcp or udp or sctp goto chain 1
      
      from triggering a NULL pointer dereference when a matching packet is
      received.
      
       BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
       PGD 800000010416b067 P4D 800000010416b067 PUD 1041be067 PMD 0
       Oops: 0000 [#1] SMP PTI
       CPU: 0 PID: 3072 Comm: mausezahn Tainted: G            E     4.18.0-rc2.auguri+ #421
       Hardware name: Hewlett-Packard HP Z220 CMT Workstation/1790, BIOS K51 v01.58 02/07/2013
       RIP: 0010:tcf_action_exec+0xb8/0x100
       Code: 00 00 00 20 74 1d 83 f8 03 75 09 49 83 c4 08 4d 39 ec 75 bc 48 83 c4 10 5b 5d 41 5c 41 5d 41 5e 41 5f c3 49 8b 97 a8 00 00 00 <48> 8b 12 48 89 55 00 48 83 c4 10 5b 5d 41 5c 41 5d 41 5e 41 5f c3
       RSP: 0018:ffffa020dea03c40 EFLAGS: 00010246
       RAX: 0000000020000001 RBX: ffffa020d7ccef00 RCX: 0000000000000054
       RDX: 0000000000000000 RSI: ffffa020ca5ae000 RDI: ffffa020d7ccef00
       RBP: ffffa020dea03e60 R08: 0000000000000000 R09: ffffa020dea03c9c
       R10: ffffa020dea03c78 R11: 0000000000000008 R12: ffffa020d3fe4f00
       R13: ffffa020d3fe4f08 R14: 0000000000000001 R15: ffffa020d53ca300
       FS:  00007f5a46942740(0000) GS:ffffa020dea00000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 0000000000000000 CR3: 0000000104218002 CR4: 00000000001606f0
       Call Trace:
        <IRQ>
        fl_classify+0x1ad/0x1c0 [cls_flower]
        ? arp_rcv+0x121/0x1b0
        ? __x2apic_send_IPI_dest+0x40/0x40
        ? smp_reschedule_interrupt+0x1c/0xd0
        ? reschedule_interrupt+0xf/0x20
        ? reschedule_interrupt+0xa/0x20
        ? device_is_rmrr_locked+0xe/0x50
        ? iommu_should_identity_map+0x49/0xd0
        ? __intel_map_single+0x30/0x140
        ? e1000e_update_rdt_wa.isra.52+0x22/0xb0 [e1000e]
        ? e1000_alloc_rx_buffers+0x233/0x250 [e1000e]
        ? kmem_cache_alloc+0x38/0x1c0
        tcf_classify+0x89/0x140
        __netif_receive_skb_core+0x5ea/0xb70
        ? enqueue_task_fair+0xb6/0x7d0
        ? process_backlog+0x97/0x150
        process_backlog+0x97/0x150
        net_rx_action+0x14b/0x3e0
        __do_softirq+0xde/0x2b4
        do_softirq_own_stack+0x2a/0x40
        </IRQ>
        do_softirq.part.18+0x49/0x50
        __local_bh_enable_ip+0x49/0x50
        __dev_queue_xmit+0x4ab/0x8a0
        ? wait_woken+0x80/0x80
        ? packet_sendmsg+0x38f/0x810
        ? __dev_queue_xmit+0x8a0/0x8a0
        packet_sendmsg+0x38f/0x810
        sock_sendmsg+0x36/0x40
        __sys_sendto+0x10e/0x140
        ? do_vfs_ioctl+0xa4/0x630
        ? syscall_trace_enter+0x1df/0x2e0
        ? __audit_syscall_exit+0x22a/0x290
        __x64_sys_sendto+0x24/0x30
        do_syscall_64+0x5b/0x180
        entry_SYSCALL_64_after_hwframe+0x44/0xa9
       RIP: 0033:0x7f5a45cbec93
       Code: 48 8b 0d 18 83 20 00 f7 d8 64 89 01 48 83 c8 ff c3 66 0f 1f 44 00 00 83 3d 59 c7 20 00 00 75 13 49 89 ca b8 2c 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 34 c3 48 83 ec 08 e8 2b f7 ff ff 48 89 04 24
       RSP: 002b:00007ffd0ee6d748 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
       RAX: ffffffffffffffda RBX: 0000000001161010 RCX: 00007f5a45cbec93
       RDX: 0000000000000062 RSI: 0000000001161322 RDI: 0000000000000003
       RBP: 00007ffd0ee6d780 R08: 00007ffd0ee6d760 R09: 0000000000000014
       R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000062
       R13: 0000000001161322 R14: 00007ffd0ee6d760 R15: 0000000000000003
       Modules linked in: act_csum act_gact cls_flower sch_ingress vrf veth act_tunnel_key(E) xt_CHECKSUM iptable_mangle ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 tun bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel snd_hda_codec_hdmi snd_hda_codec_realtek kvm snd_hda_codec_generic hp_wmi iTCO_wdt sparse_keymap rfkill mei_wdt iTCO_vendor_support wmi_bmof gpio_ich irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel snd_hda_intel crypto_simd cryptd snd_hda_codec glue_helper snd_hda_core snd_hwdep snd_seq snd_seq_device snd_pcm pcspkr i2c_i801 snd_timer snd sg lpc_ich soundcore wmi mei_me
        mei ie31200_edac nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sr_mod cdrom sd_mod ahci libahci crc32c_intel i915 ixgbe serio_raw libata video dca i2c_algo_bit sfc drm_kms_helper syscopyarea mtd sysfillrect mdio sysimgblt fb_sys_fops drm e1000e i2c_core
       CR2: 0000000000000000
       ---[ end trace 3c9e9d1a77df4026 ]---
       RIP: 0010:tcf_action_exec+0xb8/0x100
       Code: 00 00 00 20 74 1d 83 f8 03 75 09 49 83 c4 08 4d 39 ec 75 bc 48 83 c4 10 5b 5d 41 5c 41 5d 41 5e 41 5f c3 49 8b 97 a8 00 00 00 <48> 8b 12 48 89 55 00 48 83 c4 10 5b 5d 41 5c 41 5d 41 5e 41 5f c3
       RSP: 0018:ffffa020dea03c40 EFLAGS: 00010246
       RAX: 0000000020000001 RBX: ffffa020d7ccef00 RCX: 0000000000000054
       RDX: 0000000000000000 RSI: ffffa020ca5ae000 RDI: ffffa020d7ccef00
       RBP: ffffa020dea03e60 R08: 0000000000000000 R09: ffffa020dea03c9c
       R10: ffffa020dea03c78 R11: 0000000000000008 R12: ffffa020d3fe4f00
       R13: ffffa020d3fe4f08 R14: 0000000000000001 R15: ffffa020d53ca300
       FS:  00007f5a46942740(0000) GS:ffffa020dea00000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 0000000000000000 CR3: 0000000104218002 CR4: 00000000001606f0
       Kernel panic - not syncing: Fatal exception in interrupt
       Kernel Offset: 0x26400000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
       ---[ end Kernel panic - not syncing: Fatal exception in interrupt ]---
      
      Fixes: 9c5f69bb ("net/sched: act_csum: don't use spinlock in the fast path")
      Signed-off-by: NDavide Caratti <dcaratti@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      11a245e2
    • U
      net/smc: reduce sock_put() for fallback sockets · e1bbdd57
      Ursula Braun 提交于
      smc_release() calls a sock_put() for smc fallback sockets to cover
      the passive closing sock_hold() in __smc_connect() and
      smc_tcp_listen_work(). This does not make sense for sockets in state
      SMC_LISTEN and SMC_INIT.
      An SMC socket stays in state SMC_INIT if connect fails. The sock_put
      in smc_connect_abort() does not cover all failures. Move it into
      smc_connect_decline_fallback().
      
      Fixes: ee9dfbef ("net/smc: handle sockopts forcing fallback")
      Reported-by: syzbot+3a0748c8f2f210c0ef9b@syzkaller.appspotmail.com
      Reported-by: syzbot+9e60d2428a42049a592a@syzkaller.appspotmail.com
      Signed-off-by: NUrsula Braun <ubraun@linux.ibm.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e1bbdd57
    • J
      tipc: make function tipc_net_finalize() thread safe · 9faa89d4
      Jon Maloy 提交于
      The setting of the node address is not thread safe, meaning that
      two discoverers may decide to set it simultanously, with a duplicate
      entry in the name table as result. We fix that with this commit.
      
      Fixes: 25b0b9c4 ("tipc: handle collisions of 32-bit node address hash values")
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9faa89d4
    • J
      tipc: fix correct setting of message type in second discoverer · 92018c7c
      Jon Maloy 提交于
      The duplicate address discovery protocol is not safe against two
      discoverers running in parallel. The one executing first after the
      trial period is over will set the node address and change its own
      message type to DSC_REQ_MSG. The one executing last may find that the
      node address is already set, and never change message type, with the
      result that its links may never be established.
      
      In this commmit we ensure that the message type always is set correctly
      after the trial period is over.
      
      Fixes: 25b0b9c4 ("tipc: handle collisions of 32-bit node address hash values")
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      92018c7c
    • J
      tipc: correct discovery message handling during address trial period · e415577f
      Jon Maloy 提交于
      With the duplicate address discovery protocol for tipc nodes addresses
      we introduced a one second trial period before a node is allocated a
      hash number to use as address.
      
      Unfortunately, we miss to handle the case when a regular LINK REQUEST/
      RESPONSE arrives from a cluster node during the trial period. Such
      messages are not ignored as they should be, leading to links setup
      attempts while the node still has no address.
      
      Fixes: 25b0b9c4 ("tipc: handle collisions of 32-bit node address hash values")
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e415577f
    • J
      tipc: fix wrong return value from function tipc_node_try_addr() · 2a57f182
      Jon Maloy 提交于
      The function for checking if there is an node address conflict is
      supposed to return a suggestion for a new address if it finds a
      conflict, and zero otherwise. But in case the peer being checked
      is previously unknown it does instead return a "suggestion" for
      the checked address itself. This results in a DSC_TRIAL_FAIL_MSG
      being sent unecessarily to the peer, and sometimes makes the trial
      period starting over again.
      
      Fixes: 25b0b9c4 ("tipc: handle collisions of 32-bit node address hash values")
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2a57f182
    • P
      netfilter: nf_tables: place all set backends in one single module · e240cd0d
      Pablo Neira Ayuso 提交于
      This patch disallows rbtree with single elements, which is causing
      problems with the recent timeout support. Before this patch, you
      could opt out individual set representations per module, which is
      just adding extra complexity.
      
      Fixes: 8d8540c4("netfilter: nft_set_rbtree: add timeout support")
      Reported-by: NTaehee Yoo <ap420073@gmail.com>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      e240cd0d
  8. 06 7月, 2018 2 次提交
    • M
      netfilter: nf_tproxy: fix possible non-linear access to transport header · 5711b4e8
      Máté Eckl 提交于
      This patch fixes a silent out-of-bound read possibility that was present
      because of the misuse of this function.
      
      Mostly it was called with a struct udphdr *hp which had only the udphdr
      part linearized by the skb_header_pointer, however
      nf_tproxy_get_sock_v{4,6} uses it as a tcphdr pointer, so some reads for
      tcp specific attributes may be invalid.
      
      Fixes: a583636a ("inet: refactor inet[6]_lookup functions to take skb")
      Signed-off-by: NMáté Eckl <ecklm94@gmail.com>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      5711b4e8
    • T
      ipv4: Return EINVAL when ping_group_range sysctl doesn't map to user ns · 70ba5b6d
      Tyler Hicks 提交于
      The low and high values of the net.ipv4.ping_group_range sysctl were
      being silently forced to the default disabled state when a write to the
      sysctl contained GIDs that didn't map to the associated user namespace.
      Confusingly, the sysctl's write operation would return success and then
      a subsequent read of the sysctl would indicate that the low and high
      values are the overflowgid.
      
      This patch changes the behavior by clearly returning an error when the
      sysctl write operation receives a GID range that doesn't map to the
      associated user namespace. In such a situation, the previous value of
      the sysctl is preserved and that range will be returned in a subsequent
      read of the sysctl.
      Signed-off-by: NTyler Hicks <tyhicks@canonical.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      70ba5b6d
  9. 05 7月, 2018 4 次提交
    • A
      net: qrtr: Reset the node and port ID of broadcast messages · d27e77a3
      Arun Kumar Neelakantam 提交于
      All the control messages broadcast to remote routers are using
      QRTR_NODE_BCAST instead of using local router NODE ID which cause
      the packets to be dropped on remote router due to invalid NODE ID.
      Signed-off-by: NArun Kumar Neelakantam <aneela@codeaurora.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d27e77a3
    • A
      net: qrtr: Broadcast messages only from control port · fdf5fd39
      Arun Kumar Neelakantam 提交于
      The broadcast node id should only be sent with the control port id.
      Signed-off-by: NArun Kumar Neelakantam <aneela@codeaurora.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      fdf5fd39
    • P
      ipv6: make ipv6_renew_options() interrupt/kernel safe · a9ba23d4
      Paul Moore 提交于
      At present the ipv6_renew_options_kern() function ends up calling into
      access_ok() which is problematic if done from inside an interrupt as
      access_ok() calls WARN_ON_IN_IRQ() on some (all?) architectures
      (x86-64 is affected).  Example warning/backtrace is shown below:
      
       WARNING: CPU: 1 PID: 3144 at lib/usercopy.c:11 _copy_from_user+0x85/0x90
       ...
       Call Trace:
        <IRQ>
        ipv6_renew_option+0xb2/0xf0
        ipv6_renew_options+0x26a/0x340
        ipv6_renew_options_kern+0x2c/0x40
        calipso_req_setattr+0x72/0xe0
        netlbl_req_setattr+0x126/0x1b0
        selinux_netlbl_inet_conn_request+0x80/0x100
        selinux_inet_conn_request+0x6d/0xb0
        security_inet_conn_request+0x32/0x50
        tcp_conn_request+0x35f/0xe00
        ? __lock_acquire+0x250/0x16c0
        ? selinux_socket_sock_rcv_skb+0x1ae/0x210
        ? tcp_rcv_state_process+0x289/0x106b
        tcp_rcv_state_process+0x289/0x106b
        ? tcp_v6_do_rcv+0x1a7/0x3c0
        tcp_v6_do_rcv+0x1a7/0x3c0
        tcp_v6_rcv+0xc82/0xcf0
        ip6_input_finish+0x10d/0x690
        ip6_input+0x45/0x1e0
        ? ip6_rcv_finish+0x1d0/0x1d0
        ipv6_rcv+0x32b/0x880
        ? ip6_make_skb+0x1e0/0x1e0
        __netif_receive_skb_core+0x6f2/0xdf0
        ? process_backlog+0x85/0x250
        ? process_backlog+0x85/0x250
        ? process_backlog+0xec/0x250
        process_backlog+0xec/0x250
        net_rx_action+0x153/0x480
        __do_softirq+0xd9/0x4f7
        do_softirq_own_stack+0x2a/0x40
        </IRQ>
        ...
      
      While not present in the backtrace, ipv6_renew_option() ends up calling
      access_ok() via the following chain:
      
        access_ok()
        _copy_from_user()
        copy_from_user()
        ipv6_renew_option()
      
      The fix presented in this patch is to perform the userspace copy
      earlier in the call chain such that it is only called when the option
      data is actually coming from userspace; that place is
      do_ipv6_setsockopt().  Not only does this solve the problem seen in
      the backtrace above, it also allows us to simplify the code quite a
      bit by removing ipv6_renew_options_kern() completely.  We also take
      this opportunity to cleanup ipv6_renew_options()/ipv6_renew_option()
      a small amount as well.
      
      This patch is heavily based on a rough patch by Al Viro.  I've taken
      his original patch, converted a kmemdup() call in do_ipv6_setsockopt()
      to a memdup_user() call, made better use of the e_inval jump target in
      the same function, and cleaned up the use ipv6_renew_option() by
      ipv6_renew_options().
      
      CC: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NPaul Moore <paul@paul-moore.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a9ba23d4
    • F
      netfilter: x_tables: set module owner for icmp(6) matches · d376bef9
      Florian Westphal 提交于
      nft_compat relies on xt_request_find_match to increment
      refcount of the module that provides the match/target.
      
      The (builtin) icmp matches did't set the module owner so it
      was possible to rmmod ip(6)tables while icmp extensions were still in use.
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      d376bef9