1. 14 Jul, 2018 6 commits
    • tcp: remove DELAYED ACK events in DCTCP · a69258f7
      Yuchung Cheng committed
      After fixing the way DCTCP tracks delayed ACKs, the delayed-ACK
      related callbacks are no longer needed.

      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Acked-by: Lawrence Brakmo <brakmo@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: fix dctcp delayed ACK schedule · b0c05d0e
      Yuchung Cheng committed
      Previously, when a data segment was sent an ACK was piggybacked
      on the data segment without generating a CA_EVENT_NON_DELAYED_ACK
      event to notify congestion control modules. So the DCTCP
      ca->delayed_ack_reserved flag could incorrectly stay set when
      in fact there were no delayed ACKs being reserved. This could result
      in sending a special ECN notification ACK that carries an older
      ACK sequence, when in fact there was no need for such an ACK.

      DCTCP keeps track of the delayed ACK status with its own separate
      state ca->delayed_ack_reserved. Previously it could accidentally cancel
      the delayed ACK without updating this field upon sending a special
      ACK that carries an older ACK sequence. This inconsistency would
      lead to the DCTCP receiver never acknowledging the latest data until the
      sender times out and retries in some cases.
      
      Packetdrill script (provided by Larry Brakmo)
      
      0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
      0.000 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
      0.000 setsockopt(3, SOL_TCP, TCP_CONGESTION, "dctcp", 5) = 0
      0.000 bind(3, ..., ...) = 0
      0.000 listen(3, 1) = 0
      
      0.100 < [ect0] SEW 0:0(0) win 32792 <mss 1000,sackOK,nop,nop,nop,wscale 7>
      0.100 > SE. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 8>
      0.110 < [ect0] . 1:1(0) ack 1 win 257
      0.200 accept(3, ..., ...) = 4
      
      0.200 < [ect0] . 1:1001(1000) ack 1 win 257
      0.200 > [ect01] . 1:1(0) ack 1001
      
      0.200 write(4, ..., 1) = 1
      0.200 > [ect01] P. 1:2(1) ack 1001
      
      0.200 < [ect0] . 1001:2001(1000) ack 2 win 257
      0.200 write(4, ..., 1) = 1
      0.200 > [ect01] P. 2:3(1) ack 2001
      
      0.200 < [ect0] . 2001:3001(1000) ack 3 win 257
      0.200 < [ect0] . 3001:4001(1000) ack 3 win 257
      0.200 > [ect01] . 3:3(0) ack 4001
      
      0.210 < [ce] P. 4001:4501(500) ack 3 win 257
      
      +0.001 read(4, ..., 4500) = 4500
      +0 write(4, ..., 1) = 1
      +0 > [ect01] PE. 3:4(1) ack 4501
      
      +0.010 < [ect0] W. 4501:5501(1000) ack 4 win 257
      // Previously the ACK sequence below would be 4501, causing a long RTO
      +0.040~+0.045 > [ect01] . 4:4(0) ack 5501   // delayed ack
      
      +0.311 < [ect0] . 5501:6501(1000) ack 4 win 257  // More data
      +0 > [ect01] . 4:4(0) ack 6501     // now acks everything
      
      +0.500 < F. 9501:9501(0) ack 4 win 257

      Reported-by: Larry Brakmo <brakmo@fb.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Acked-by: Lawrence Brakmo <brakmo@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • qlogic: check kstrtoul() for errors · 5fc853cc
      Dan Carpenter committed
      We accidentally left out the error handling for kstrtoul().
      
      Fixes: a520030e ("qlcnic: Implement flash sysfs callback for 83xx adapter")
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
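The kind of check this commit restores can be illustrated with a minimal userspace sketch. It is an analogy only: `parse_ulong` is a hypothetical helper built on the C library's `strtoul()`, not the kernel's `kstrtoul()`, and it merely mimics the convention of returning 0 on success and a negative errno value on failure.

```c
#include <errno.h>
#include <stdlib.h>

/*
 * Hypothetical userspace helper, loosely modelled on the kstrtoul()
 * convention: 0 on success, negative errno value on failure.
 * *res is written only when parsing succeeds.
 */
static int parse_ulong(const char *s, int base, unsigned long *res)
{
    char *end;
    unsigned long v;

    errno = 0;
    v = strtoul(s, &end, base);
    if (end == s || *end != '\0')
        return -EINVAL;     /* no digits, or trailing garbage */
    if (errno == ERANGE)
        return -ERANGE;     /* value does not fit in unsigned long */

    *res = v;
    return 0;
}
```

A caller that ignores the return value would go on to use whatever happened to be in `*res`, which is the class of bug the commit message describes.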
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf · c849eb0d
      David S. Miller committed
      Daniel Borkmann says:
      
      ====================
      pull-request: bpf 2018-07-13
      
      The following pull-request contains BPF updates for your *net* tree.
      
      The main changes are:
      
      1) Fix AF_XDP TX error reporting before final kernel release such that it
         becomes consistent between copy mode and zero-copy, from Magnus.
      
      2) Fix three different syzkaller reported issues: oob due to ld_abs
         rewrite with too large offset, another oob in l3 based skb test run
         and a bug leaving mangled prog in subprog JITing error path, from Daniel.
      
      3) Fix BTF handling for bitfield extraction on big endian, from Okash.
      
      4) Fix a missing linux/errno.h include in cgroup/BPF found by kbuild bot,
         from Roman.
      
      5) Fix xdp2skb_meta.sh sample by using just command names instead of
         absolute paths for tc and ip and allow them to be redefined, from Taeung.
      
      6) Fix availability probing for BPF seg6 helpers before final kernel ships
         so they can be detected at prog load time, from Mathieu.
      ====================

      Signed-off-by: David S. Miller <davem@davemloft.net>
    • skbuff: Unconditionally copy pfmemalloc in __skb_clone() · e78bfb07
      Stefano Brivio committed
      Commit 8b700862 ("net: Don't copy pfmemalloc flag in
      __copy_skb_header()") introduced a different handling for the
      pfmemalloc flag in copy and clone paths.
      
      In __skb_clone(), now, the flag is set only if it was set in the
      original skb, but not cleared if it wasn't. This is wrong and
      might lead to socket buffers being flagged with pfmemalloc even
      if the skb data wasn't allocated from pfmemalloc reserves. Copy
      the flag instead of ORing it.

      Reported-by: Sabrina Dubroca <sd@queasysnail.net>
      Fixes: 8b700862 ("net: Don't copy pfmemalloc flag in __copy_skb_header()")
      Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
      Tested-by: Sabrina Dubroca <sd@queasysnail.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
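The difference between ORing and copying a flag, as described in this commit message, can be shown with a small sketch. `struct toy_skb` and both helpers are hypothetical stand-ins for illustration; they are not the kernel's `sk_buff` or `__skb_clone()`.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the single sk_buff bit discussed above. */
struct toy_skb {
    bool pfmemalloc;
};

/* ORing can set the flag on the clone but never clears a stale value. */
static void clone_flag_or(struct toy_skb *n, const struct toy_skb *orig)
{
    n->pfmemalloc |= orig->pfmemalloc;
}

/* Copying always leaves the clone with exactly the original's value. */
static void clone_flag_copy(struct toy_skb *n, const struct toy_skb *orig)
{
    n->pfmemalloc = orig->pfmemalloc;
}
```

If the destination already has the bit set and the original does not, `clone_flag_or()` leaves it set while `clone_flag_copy()` clears it, which matches the behaviour the fix asks for.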
  2. 13 Jul, 2018 18 commits
    • Merge branch 'bpf-af-xdp-consistent-err-reporting' · 5e3e6e83
      Daniel Borkmann committed
      Magnus Karlsson says:
      
      ====================
      This patch set adjusts the AF_XDP TX error reporting so that it becomes
      consistent between copy mode and zero-copy. First some background:
      
      Copy-mode for TX uses the SKB path in which the action of sending the
      packet is performed from process context using the sendmsg
      syscall. Completions are usually done asynchronously from NAPI mode by
      using a TX interrupt. In this mode, send errors can be returned back
      through the syscall.
      
      In zero-copy mode both the sending of the packet and the completions
      are done asynchronously from NAPI mode for performance reasons. In
      this mode, the sendmsg syscall only makes sure that the TX NAPI loop
      will be run that performs both the actions of sending and
      completing. In this mode it is therefore not possible to return errors
      through the sendmsg syscall as the sending is done from the NAPI
      loop. Note that it is possible to implement a synchronous send with
      our API, but in our benchmarks that made the TX performance drop by
      nearly half due to synchronization requirements and cache line
      bouncing. But for some netdevs this might be preferable so let us
      leave it up to the implementation to decide.
      
      The problem is that the current code base returns some errors in
      copy-mode that are not possible to return in zero-copy mode. This
      patch set aligns them so that the two modes always return the same
      error code. We achieve this by removing some of the errors returned by
      sendmsg in copy-mode (and in one case adding an error message for
      zero-copy mode) and offering alternative error detection methods that
      are consistent between the two modes.
      
      The structure of the patch set is as follows:
      
      Patch 1: removes the ENXIO return code from copy-mode when someone has
      forcefully changed the number of queues on the device so that the
      queue bound to the socket is no longer available. Just silently stop
      sending anything as in zero-copy mode.
      
      Patch 2: stop returning EAGAIN in copy mode when the completion queue
      is full as zero-copy does not do this. Instead this situation can be
      detected by comparing the head and tail pointers of the completion
      queue in both modes. In any case, EAGAIN was not the correct error code
      here since no amount of calling sendmsg will solve the problem. Only
      consuming one or more messages on the completion queue will fix this.
      
      Patch 3: Always return ENOBUFS from sendmsg if there is no TX queue
      configured. This was not the case for zero-copy mode.
      
      Patch 4: stop returning EMSGSIZE when the size of the packet is larger
      than the MTU. Just send it to the device so that it will drop it as in
      zero-copy mode.
      
      Note that copy-mode can still return EAGAIN in certain circumstances,
      but as these conditions cannot occur in zero-copy mode it is fine for
      copy-mode to return them.
      ====================

      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • xsk: do not return EMSGSIZE in copy mode for packets larger than MTU · 09210c4b
      Magnus Karlsson committed
      This patch stops returning EMSGSIZE from sendmsg in copy mode when the
      size of the packet is larger than the MTU. Just send it to the device
      so that it will drop it as in zero-copy mode. This makes the error
      reporting consistent between copy mode and zero-copy mode.
      
      Fixes: 35fcde7f ("xsk: support for Tx")
      Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • xsk: always return ENOBUFS from sendmsg if there is no TX queue · 6efb4436
      Magnus Karlsson committed
      This patch makes sure ENOBUFS is always returned from sendmsg if there
      is no TX queue configured. This was not the case for zero-copy
      mode. With this patch this error reporting is consistent between copy
      mode and zero-copy mode.
      
      Fixes: ac98d8aa ("xsk: wire upp Tx zero-copy functions")
      Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • xsk: do not return EAGAIN from sendmsg when completion queue is full · 9684f5e7
      Magnus Karlsson committed
      This patch stops returning EAGAIN in TX copy mode when the completion
      queue is full as zero-copy does not do this. Instead this situation
      can be detected by comparing the head and tail pointers of the
      completion queue in both modes. In any case, EAGAIN was not the
      correct error code here since no amount of calling sendmsg will solve
      the problem. Only consuming one or more messages on the completion
      queue will fix this.
      
      With this patch, the error reporting becomes consistent between copy
      mode and zero-copy mode.
      
      Fixes: 35fcde7f ("xsk: support for Tx")
      Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
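Detecting a full queue by comparing its two indices, as this commit message suggests, might look roughly like the sketch below. The `toy_ring` layout, the producer/consumer naming, and the assumption of free-running 32-bit counters are all illustrative; this is not the actual AF_XDP ring structure.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical ring descriptor; not the real AF_XDP layout. */
struct toy_ring {
    uint32_t producer;  /* free-running count of entries added */
    uint32_t consumer;  /* free-running count of entries removed */
    uint32_t size;      /* number of slots in the ring */
};

/* Unsigned subtraction stays correct when the counters wrap around. */
static uint32_t ring_used(const struct toy_ring *r)
{
    return r->producer - r->consumer;
}

/* The ring is full when every slot holds an unconsumed entry. */
static bool ring_full(const struct toy_ring *r)
{
    return ring_used(r) == r->size;
}
```

In this model, calling sendmsg again cannot change the result of `ring_full()`; only advancing `consumer` does, which is why the message argues EAGAIN was the wrong error code.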
    • xsk: do not return ENXIO from TX copy mode · 509d7648
      Magnus Karlsson committed
      This patch removes the ENXIO return code from TX copy-mode when
      someone has forcefully changed the number of queues on the device so
      that the queue bound to the socket is no longer available. Just
      silently stop sending anything as in zero-copy mode so the error
      reporting gets consistent between the two modes.
      
      Fixes: 35fcde7f ("xsk: support for Tx")
      Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • selftests: in udpgso_bench do not test udp zerocopy · 8f19f12b
      Willem de Bruijn committed
      The udpgso benchmark compares various configurations of UDP and TCP,
      including one that is not upstream: udp zerocopy. This is a leftover
      from the earlier RFC patchset.

      The test is part of kselftests and is run in continuous spinners. Remove
      the failing case to make the test start passing.
      
      Fixes: 3a687bef ("selftests: udp gso benchmark")
      Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • packet: reset network header if packet shorter than ll reserved space · 993675a3
      Willem de Bruijn committed
      If variable length link layer headers result in a packet shorter
      than dev->hard_header_len, reset the network header offset. Else
      skb->mac_len may exceed skb->len after skb_mac_reset_len.
      
      packet_sendmsg_spkt already has similar logic.
      
      Fixes: b84bbaf7 ("packet: in packet_snd start writing at link layer allocation")
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • nsh: set mac len based on inner packet · bab2c80e
      Willem de Bruijn committed
      When pulling the NSH header in nsh_gso_segment, set the mac length
      based on the encapsulated packet type.
      
      skb_reset_mac_len computes an offset to the network header, which
      here still points to the outer packet:
      
        >     skb_reset_network_header(skb);
        >     [...]
        >     __skb_pull(skb, nsh_len);
        >     skb_reset_mac_header(skb);    // now mac hdr starts nsh_len == 8B after net hdr
        >     skb_reset_mac_len(skb);       // mac len = net hdr - mac hdr == (u16) -8 == 65528
        >     [..]
        >     skb_mac_gso_segment(skb, ..)
      
      Link: http://lkml.kernel.org/r/CAF=yD-KeAcTSOn4AxirAxL8m7QAS8GBBe1w09eziYwvPbbUeYA@mail.gmail.com
      Reported-by: syzbot+7b9ed9872dab8c32305d@syzkaller.appspotmail.com
      Fixes: c411ed85 ("nsh: add GSO support")
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Acked-by: Jiri Benc <jbenc@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
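The arithmetic in the quoted snippet, where a difference of -8 stored in a 16-bit field becomes 65528, can be reproduced with a small sketch. `toy_mac_len` is a hypothetical helper working on plain integer offsets, not the kernel's `skb_reset_mac_len()`.

```c
#include <stdint.h>

/*
 * Hypothetical model of "mac len = net hdr - mac hdr" stored in a
 * 16-bit unsigned field. A negative difference wraps modulo 65536.
 */
static uint16_t toy_mac_len(int network_header, int mac_header)
{
    return (uint16_t)(network_header - mac_header);
}
```

With the MAC header 8 bytes after the network header, `toy_mac_len(0, 8)` yields 65528, while the usual layout with a 14-byte Ethernet header in front, `toy_mac_len(14, 0)`, yields 14.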
    • net: Don't copy pfmemalloc flag in __copy_skb_header() · 8b700862
      Stefano Brivio committed
      The pfmemalloc flag indicates that the skb was allocated from
      the PFMEMALLOC reserves, and the flag is currently copied on skb
      copy and clone.
      
      However, an skb copied from an skb flagged with pfmemalloc
      wasn't necessarily allocated from PFMEMALLOC reserves, and on
      the other hand an skb allocated that way might be copied from an
      skb that wasn't.
      
      So we should not copy the flag on skb copy, and rather decide
      whether to allow an skb to be associated with sockets unrelated
      to page reclaim depending only on how it was allocated.
      
      Move the pfmemalloc flag before headers_start[0] using an
      existing 1-bit hole, so that __copy_skb_header() doesn't copy
      it.
      
      When cloning, we'll now take care of this flag explicitly,
      contravening the warning comment of __skb_clone().
      
      While at it, restore the newline usage introduced by commit
      b1937227 ("net: reorganize sk_buff for faster
      __copy_skb_header()") to visually separate bytes used in
      bitfields after headers_start[0], that was gone after commit
      a9e419dc ("netfilter: merge ctinfo into nfct pointer storage
      area"), and describe the pfmemalloc flag in the kernel-doc
      structure comment.
      
      This doesn't change the size of sk_buff or cacheline boundaries,
      but consolidates the 15-bit hole before tc_index into a 2-byte
      hole before csum, which could now be filled more easily.

      Reported-by: Patrick Talbert <ptalbert@redhat.com>
      Fixes: c93bdd0e ("netvm: allow skb allocation to use PFMEMALLOC reserves")
      Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'sfc-filter-locking-fixes' · 1ff9c66b
      David S. Miller committed
      Bert Kenward says:
      
      ====================
      sfc: filter locking fixes
      
      Two fixes for sfc ef10 filter table locking. Initially spotted
      by lockdep, but one issue has also been seen in normal use.
      ====================

      Signed-off-by: David S. Miller <davem@davemloft.net>
    • sfc: hold filter_sem consistently during reset · 193f2003
      Bert Kenward committed
      We should take and release the filter_sem consistently during the
      reset process, in the same manner as the mac_lock and reset_lock.
      
      For lockdep consistency we also take the filter_sem for write around
      other calls to efx->type->init().
      
      Fixes: c2bebe37 ("sfc: give ef10 its own rwsem in the filter table instead of filter_lock")
      Signed-off-by: Bert Kenward <bkenward@solarflare.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • sfc: avoid hang from nested use of the filter_sem · 1c56c099
      Bert Kenward committed
      In some situations we may end up calling down_read while already
      holding the semaphore for write, thus hanging. This has been seen
      when setting the MAC address for the interface. The hung task log
      in this situation includes this stack:
        down_read
        efx_ef10_filter_insert
        efx_ef10_filter_insert_addr_list
        efx_ef10_filter_vlan_sync_rx_mode
        efx_ef10_filter_add_vlan
        efx_ef10_filter_table_probe
        efx_ef10_set_mac_address
        efx_set_mac_address
        dev_set_mac_address
      
      In addition, lockdep rightly points out that nested calling of
      down_read is incorrect.
      
      Fixes: c2bebe37 ("sfc: give ef10 its own rwsem in the filter table instead of filter_lock")
      Tested-by: Jarod Wilson <jarod@redhat.com>
      Signed-off-by: Bert Kenward <bkenward@solarflare.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: systemport: Fix CRC forwarding check for SYSTEMPORT Lite · 9e3bff92
      Florian Fainelli committed
      SYSTEMPORT Lite reversed the logic compared to SYSTEMPORT, the
      GIB_FCS_STRIP bit is set when the Ethernet FCS is stripped, and that bit
      is not set by default. Fix the logic such that we properly check whether
      that bit is set or not and we don't forward an extra 4 bytes to the
      network stack.
      
      Fixes: 44a4524c ("net: systemport: Add support for SYSTEMPORT Lite")
      Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: allow user to create repair socket without window probes · 70b7ff13
      Stefan Baranoff committed
      Under rare conditions where repair code may be used it is possible that
      window probes are either unnecessary or undesired. If the user knows that
      window probes are not wanted or needed this change allows them to skip
      sending them when a socket comes out of repair.

      Signed-off-by: Stefan Baranoff <sbaranoff@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: fix sequence numbers for repaired sockets re-using TIME-WAIT sockets · 21684dc4
      Stefan Baranoff committed
      This patch fixes a bug where the sequence numbers of a socket created using
      TCP repair functionality are lower than set after connect is called.
      This occurs when the repair socket overlaps with a TIME-WAIT socket and
      triggers the re-use code. The amount lower is equal to the number of times
      that a particular IP/port set is re-used and then put back into TIME-WAIT.
      Re-using the first time the sequence number is 1 lower, closing that socket
      and then re-opening (with repair) a new socket with the same addresses/ports
      puts the sequence number 2 lower than set via setsockopt. The third time is
      3 lower, etc. I have not tested what the limit of this accrual is, if any.
      
      The fix is, if a socket is in repair mode, to respect the already set
      sequence number and timestamp when it would have already re-used the
      TIME-WAIT socket.

      Signed-off-by: Stefan Baranoff <sbaranoff@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: don't leave partial mangled prog in jit_subprogs error path · c7a89784
      Daniel Borkmann committed
      syzkaller managed to trigger the following bug through fault injection:
      
        [...]
        [  141.043668] verifier bug. No program starts at insn 3
        [  141.044648] WARNING: CPU: 3 PID: 4072 at kernel/bpf/verifier.c:1613
                       get_callee_stack_depth kernel/bpf/verifier.c:1612 [inline]
        [  141.044648] WARNING: CPU: 3 PID: 4072 at kernel/bpf/verifier.c:1613
                       fixup_call_args kernel/bpf/verifier.c:5587 [inline]
        [  141.044648] WARNING: CPU: 3 PID: 4072 at kernel/bpf/verifier.c:1613
                       bpf_check+0x525e/0x5e60 kernel/bpf/verifier.c:5952
        [  141.047355] CPU: 3 PID: 4072 Comm: a.out Not tainted 4.18.0-rc4+ #51
        [  141.048446] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),BIOS 1.10.2-1 04/01/2014
        [  141.049877] Call Trace:
        [  141.050324]  __dump_stack lib/dump_stack.c:77 [inline]
        [  141.050324]  dump_stack+0x1c9/0x2b4 lib/dump_stack.c:113
        [  141.050950]  ? dump_stack_print_info.cold.2+0x52/0x52 lib/dump_stack.c:60
        [  141.051837]  panic+0x238/0x4e7 kernel/panic.c:184
        [  141.052386]  ? add_taint.cold.5+0x16/0x16 kernel/panic.c:385
        [  141.053101]  ? __warn.cold.8+0x148/0x1ba kernel/panic.c:537
        [  141.053814]  ? __warn.cold.8+0x117/0x1ba kernel/panic.c:530
        [  141.054506]  ? get_callee_stack_depth kernel/bpf/verifier.c:1612 [inline]
        [  141.054506]  ? fixup_call_args kernel/bpf/verifier.c:5587 [inline]
        [  141.054506]  ? bpf_check+0x525e/0x5e60 kernel/bpf/verifier.c:5952
        [  141.055163]  __warn.cold.8+0x163/0x1ba kernel/panic.c:538
        [  141.055820]  ? get_callee_stack_depth kernel/bpf/verifier.c:1612 [inline]
        [  141.055820]  ? fixup_call_args kernel/bpf/verifier.c:5587 [inline]
        [  141.055820]  ? bpf_check+0x525e/0x5e60 kernel/bpf/verifier.c:5952
        [...]
      
      What happens in jit_subprogs() is that kcalloc() for the subprog func
      buffer fails and returns NULL, at which point we bail out. The bail-out
      is a plain return -ENOMEM, and this is definitely not okay since earlier
      in the loop we walk all subprogs and temporarily rewrite insn->off to
      remember the subprog id as well as insn->imm to temporarily point the
      call to __bpf_call_base + 1 for the initial JIT pass. Thus, bailing
      out in such a state and handing this over to the interpreter is troublesome
      since subsequent lookups, e.g. find_subprog(), are based on the wrong
      insn->imm.
      
      Therefore, once we hit this point, we need to jump to out_free path
      where we undo all changes from earlier loop, so that interpreter can
      work on unmodified insn->{off,imm}.
      
      Another point is that should find_subprog() fail in jit_subprogs() due
      to a verifier bug, then we also should not simply defer the program to
      the interpreter since also here we did partial modifications. Instead
      we should just bail out entirely and return an error to the user who is
      trying to load the program.
      
      Fixes: 1c2a088a ("bpf: x64: add JIT support for multi-function programs")
      Reported-by: syzbot+7d427828b2ea6e592804@syzkaller.appspotmail.com
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • sch_fq_codel: zero q->flows_cnt when fq_codel_init fails · 83fe6b87
      Jacob Keller committed
      When fq_codel_init fails, qdisc_create_dflt will cleanup by using
      qdisc_destroy. This function calls the ->reset() op prior to calling the
      ->destroy() op.
      
      Unfortunately, during the failure flow for sch_fq_codel, the ->flows
      parameter is not initialized, so the fq_codel_reset function will null
      pointer dereference.
      
         kernel: BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
         kernel: IP: fq_codel_reset+0x58/0xd0 [sch_fq_codel]
         kernel: PGD 0 P4D 0
         kernel: Oops: 0000 [#1] SMP PTI
         kernel: Modules linked in: i40iw i40e(OE) xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack tun bridge stp llc devlink ebtable_filter ebtables ip6table_filter ip6_tables rpcrdma ib_isert iscsi_target_mod sunrpc ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel intel_cstate iTCO_wdt iTCO_vendor_support intel_uncore ib_core intel_rapl_perf mei_me mei joydev i2c_i801 lpc_ich ioatdma shpchp wmi sch_fq_codel xfs libcrc32c mgag200 ixgbe drm_kms_helper isci ttm firewire_ohci
         kernel:  mdio drm igb libsas crc32c_intel firewire_core ptp pps_core scsi_transport_sas crc_itu_t dca i2c_algo_bit ipmi_si ipmi_devintf ipmi_msghandler [last unloaded: i40e]
         kernel: CPU: 10 PID: 4219 Comm: ip Tainted: G           OE    4.16.13custom-fq-codel-test+ #3
         kernel: Hardware name: Intel Corporation S2600CO/S2600CO, BIOS SE5C600.86B.02.05.0004.051120151007 05/11/2015
         kernel: RIP: 0010:fq_codel_reset+0x58/0xd0 [sch_fq_codel]
         kernel: RSP: 0018:ffffbfbf4c1fb620 EFLAGS: 00010246
         kernel: RAX: 0000000000000400 RBX: 0000000000000000 RCX: 00000000000005b9
         kernel: RDX: 0000000000000000 RSI: ffff9d03264a60c0 RDI: ffff9cfd17b31c00
         kernel: RBP: 0000000000000001 R08: 00000000000260c0 R09: ffffffffb679c3e9
         kernel: R10: fffff1dab06a0e80 R11: ffff9cfd163af800 R12: ffff9cfd17b31c00
         kernel: R13: 0000000000000001 R14: ffff9cfd153de600 R15: 0000000000000001
         kernel: FS:  00007fdec2f92800(0000) GS:ffff9d0326480000(0000) knlGS:0000000000000000
         kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
         kernel: CR2: 0000000000000008 CR3: 0000000c1956a006 CR4: 00000000000606e0
         kernel: Call Trace:
         kernel:  qdisc_destroy+0x56/0x140
         kernel:  qdisc_create_dflt+0x8b/0xb0
         kernel:  mq_init+0xc1/0xf0
         kernel:  qdisc_create_dflt+0x5a/0xb0
         kernel:  dev_activate+0x205/0x230
         kernel:  __dev_open+0xf5/0x160
         kernel:  __dev_change_flags+0x1a3/0x210
         kernel:  dev_change_flags+0x21/0x60
         kernel:  do_setlink+0x660/0xdf0
         kernel:  ? down_trylock+0x25/0x30
         kernel:  ? xfs_buf_trylock+0x1a/0xd0 [xfs]
         kernel:  ? rtnl_newlink+0x816/0x990
         kernel:  ? _xfs_buf_find+0x327/0x580 [xfs]
         kernel:  ? _cond_resched+0x15/0x30
         kernel:  ? kmem_cache_alloc+0x20/0x1b0
         kernel:  ? rtnetlink_rcv_msg+0x200/0x2f0
         kernel:  ? rtnl_calcit.isra.30+0x100/0x100
         kernel:  ? netlink_rcv_skb+0x4c/0x120
         kernel:  ? netlink_unicast+0x19e/0x260
         kernel:  ? netlink_sendmsg+0x1ff/0x3c0
         kernel:  ? sock_sendmsg+0x36/0x40
         kernel:  ? ___sys_sendmsg+0x295/0x2f0
         kernel:  ? ebitmap_cmp+0x6d/0x90
         kernel:  ? dev_get_by_name_rcu+0x73/0x90
         kernel:  ? skb_dequeue+0x52/0x60
         kernel:  ? __inode_wait_for_writeback+0x7f/0xf0
         kernel:  ? bit_waitqueue+0x30/0x30
         kernel:  ? fsnotify_grab_connector+0x3c/0x60
         kernel:  ? __sys_sendmsg+0x51/0x90
         kernel:  ? do_syscall_64+0x74/0x180
         kernel:  ? entry_SYSCALL_64_after_hwframe+0x3d/0xa2
         kernel: Code: 00 00 48 89 87 00 02 00 00 8b 87 a0 01 00 00 85 c0 0f 84 84 00 00 00 31 ed 48 63 dd 83 c5 01 48 c1 e3 06 49 03 9c 24 90 01 00 00 <48> 8b 73 08 48 8b 3b e8 6c 9a 4f f6 48 8d 43 10 48 c7 03 00 00
         kernel: RIP: fq_codel_reset+0x58/0xd0 [sch_fq_codel] RSP: ffffbfbf4c1fb620
         kernel: CR2: 0000000000000008
         kernel: ---[ end trace e81a62bede66274e ]---
      
      This is caused because flows_cnt is non-zero, but flows hasn't been
      initialized. fq_codel_init has left the private data in a partially
      initialized state.
      
      To fix this, reset flows_cnt to 0 when we fail to initialize.
      Additionally, to make the state more consistent, also cleanup the flows
      pointer when the allocation of backlogs fails.
      
      This fixes the NULL pointer dereference, since both the for-loop and
      memset in fq_codel_reset will be no-ops when flows_cnt is zero.

      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
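Why zeroing the count makes the reset path safe can be seen in a reduced model. `toy_sched`, `toy_reset`, and `toy_init_failed` are hypothetical names for illustration; the real `fq_codel_reset()` does more work than this loop.

```c
#include <stddef.h>

struct toy_flow {
    int backlog;
};

/* Hypothetical private data: an array of flows plus its element count. */
struct toy_sched {
    struct toy_flow *flows;
    unsigned int flows_cnt;
};

/* Reset walks flows_cnt entries and returns how many it touched. */
static unsigned int toy_reset(struct toy_sched *q)
{
    unsigned int i;

    for (i = 0; i < q->flows_cnt; i++)
        q->flows[i].backlog = 0;   /* would fault if flows were NULL */
    return i;
}

/* Failure path of init: leave the state consistent so reset is a no-op. */
static void toy_init_failed(struct toy_sched *q)
{
    q->flows = NULL;
    q->flows_cnt = 0;
}
```

With `flows_cnt` left non-zero and `flows` NULL, the loop body would dereference a NULL pointer; after `toy_init_failed()` the loop never runs.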
    • Merge branch '10GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-queue · 35288486
      David S. Miller committed
      Jeff Kirsher says:
      
      ====================
      Intel Wired LAN Driver Updates 2018-07-12
      
      This series contains updates to ixgbe and e100/e1000 kernel documentation.
      
      Alex fixes ixgbe to ensure that we are more explicit about the ordering
      of updates to the receive address register (RAR) table.
      
      Dan Carpenter fixes an issue where we were reading one element beyond
      the end of the array.
      
      Mauro Carvalho Chehab fixes formatting issues in the e100.rst and
      e1000.rst that were causing errors during 'make htmldocs'.
      ====================

      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 12 Jul, 2018 11 commits
  4. 10 Jul, 2018 5 commits
    • bpf: fix ldx in ld_abs rewrite for large offsets · 59ee4129
      Daniel Borkmann committed
      Mark reported that syzkaller triggered a KASAN detected slab-out-of-bounds
      bug in ___bpf_prog_run() with a BPF_LD | BPF_ABS word load at offset 0x8001.
      After further investigation it became clear that the issue was the
      BPF_LDX_MEM() which takes offset as an argument whereas it cannot encode
      larger than S16_MAX offsets into it. For this synthetic case we need to
      move the full address into tmp register instead and do the LDX without
      immediate value.
      
      Fixes: e0cea7ce ("bpf: implement ld_abs/ld_ind in native bpf")
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Reported-by: Mark Rutland <mark.rutland@arm.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
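The encoding limit mentioned in this commit message comes down to whether an offset fits in a signed 16-bit field. A minimal sketch of that range check follows; `fits_s16` is a hypothetical helper, and `INT16_MAX` is the userspace counterpart of the kernel's `S16_MAX`.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if the offset can be stored in a signed 16-bit field unchanged. */
static bool fits_s16(int32_t off)
{
    return off >= INT16_MIN && off <= INT16_MAX;
}
```

The offset 0x8001 from the report is 32769, which exceeds 32767, so `fits_s16(0x8001)` is false. That is the case in which, per the message, the full address has to be moved into a temporary register first.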
    • samples/bpf: Fix tc and ip paths in xdp2skb_meta.sh · b9626f45
      Taeung Song committed
      The below path error can occur:
      
        # ./xdp2skb_meta.sh --dev eth0 --list
        ./xdp2skb_meta.sh: line 61: /usr/sbin/tc: No such file or directory
      
      So just use command names instead of absolute paths for tc and ip.
      In addition, allow callers to redefine the $TC and $IP paths.
      
      Fixes: 36e04a2d ("samples/bpf: xdp2skb_meta shows transferring info from XDP to SKB")
      Reviewed-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Taeung Song <treeze.taeung@gmail.com>
      Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      b9626f45
    • T
      rhashtable: add restart routine in rhashtable_free_and_destroy() · 0026129c
      Taehee Yoo committed
      rhashtable_free_and_destroy() cancels the deferred re-hash work,
      then walks and destroys elements. At that point, some elements may
      still be in future_tbl. Those elements are not destroyed.
      
      Test case:
      nft_rhash_destroy() calls rhashtable_free_and_destroy() to destroy
      all elements of sets before destroying sets and chains.
      But rhashtable_free_and_destroy() doesn't destroy the elements of
      future_tbl, so a splat occurs.
      
      test script:
         %cat test.nft
         table ip aa {
      	   map map1 {
      		   type ipv4_addr : verdict;
      		   elements = {
      			   0 : jump a0,
      			   1 : jump a0,
      			   2 : jump a0,
      			   3 : jump a0,
      			   4 : jump a0,
      			   5 : jump a0,
      			   6 : jump a0,
      			   7 : jump a0,
      			   8 : jump a0,
      			   9 : jump a0,
      		}
      	   }
      	   chain a0 {
      	   }
         }
         flush ruleset
         table ip aa {
      	   map map1 {
      		   type ipv4_addr : verdict;
      		   elements = {
      			   0 : jump a0,
      			   1 : jump a0,
      			   2 : jump a0,
      			   3 : jump a0,
      			   4 : jump a0,
      			   5 : jump a0,
      			   6 : jump a0,
      			   7 : jump a0,
      			   8 : jump a0,
      			   9 : jump a0,
      		   }
      	   }
      	   chain a0 {
      	   }
         }
         flush ruleset
      
         %while :; do nft -f test.nft; done
      
      Splat looks like:
      [  200.795603] kernel BUG at net/netfilter/nf_tables_api.c:1363!
      [  200.806944] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN PTI
      [  200.812253] CPU: 1 PID: 1582 Comm: nft Not tainted 4.17.0+ #24
      [  200.820297] Hardware name: To be filled by O.E.M. To be filled by O.E.M./Aptio CRB, BIOS 5.6.5 07/08/2015
      [  200.830309] RIP: 0010:nf_tables_chain_destroy.isra.34+0x62/0x240 [nf_tables]
      [  200.838317] Code: 43 50 85 c0 74 26 48 8b 45 00 48 8b 4d 08 ba 54 05 00 00 48 c7 c6 60 6d 29 c0 48 c7 c7 c0 65 29 c0 4c 8b 40 08 e8 58 e5 fd f8 <0f> 0b 48 89 da 48 b8 00 00 00 00 00 fc ff
      [  200.860366] RSP: 0000:ffff880118dbf4d0 EFLAGS: 00010282
      [  200.866354] RAX: 0000000000000061 RBX: ffff88010cdeaf08 RCX: 0000000000000000
      [  200.874355] RDX: 0000000000000061 RSI: 0000000000000008 RDI: ffffed00231b7e90
      [  200.882361] RBP: ffff880118dbf4e8 R08: ffffed002373bcfb R09: ffffed002373bcfa
      [  200.890354] R10: 0000000000000000 R11: ffffed002373bcfb R12: dead000000000200
      [  200.898356] R13: dead000000000100 R14: ffffffffbb62af38 R15: dffffc0000000000
      [  200.906354] FS:  00007fefc31fd700(0000) GS:ffff88011b800000(0000) knlGS:0000000000000000
      [  200.915533] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  200.922355] CR2: 0000557f1c8e9128 CR3: 0000000106880000 CR4: 00000000001006e0
      [  200.930353] Call Trace:
      [  200.932351]  ? nf_tables_commit+0x26f6/0x2c60 [nf_tables]
      [  200.939525]  ? nf_tables_setelem_notify.constprop.49+0x1a0/0x1a0 [nf_tables]
      [  200.947525]  ? nf_tables_delchain+0x6e0/0x6e0 [nf_tables]
      [  200.952383]  ? nft_add_set_elem+0x1700/0x1700 [nf_tables]
      [  200.959532]  ? nla_parse+0xab/0x230
      [  200.963529]  ? nfnetlink_rcv_batch+0xd06/0x10d0 [nfnetlink]
      [  200.968384]  ? nfnetlink_net_init+0x130/0x130 [nfnetlink]
      [  200.975525]  ? debug_show_all_locks+0x290/0x290
      [  200.980363]  ? debug_show_all_locks+0x290/0x290
      [  200.986356]  ? sched_clock_cpu+0x132/0x170
      [  200.990352]  ? find_held_lock+0x39/0x1b0
      [  200.994355]  ? sched_clock_local+0x10d/0x130
      [  200.999531]  ? memset+0x1f/0x40
      
      V2:
       - free all tables, as requested by Herbert Xu
      Signed-off-by: Taehee Yoo <ap420073@gmail.com>
      Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0026129c
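The idea behind the rhashtable restart routine above can be sketched with a simplified model. The types and helpers here (`struct elem`, `struct table_model`, `make_table()`, `free_and_destroy_model()`) are hypothetical and much simpler than the real rhashtable structures; the only point illustrated is that the teardown keeps following `future_tbl` until no table is left, so elements already moved to, or inserted into, a pending table are freed too.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-ins for the real rhashtable types. */
struct elem {
    struct elem *next;
};

struct table_model {
    struct elem *head;               /* elements stored in this table */
    struct table_model *future_tbl;  /* table a pending rehash targets */
};

/* Build a table holding n elements (helper for experiments only). */
static struct table_model *make_table(size_t n)
{
    struct table_model *t = calloc(1, sizeof(*t));
    size_t i;

    if (!t)
        return NULL;
    for (i = 0; i < n; i++) {
        struct elem *e = calloc(1, sizeof(*e));

        if (!e)
            break;
        e->next = t->head;
        t->head = e;
    }
    return t;
}

/*
 * Free every element of tbl and of every table reachable through
 * future_tbl. Returns the number of elements freed.
 */
static size_t free_and_destroy_model(struct table_model *tbl)
{
    size_t freed = 0;

    while (tbl) {
        struct table_model *next_tbl = tbl->future_tbl;
        struct elem *e = tbl->head;

        while (e) {
            struct elem *n = e->next;

            free(e);
            freed++;
            e = n;
        }
        free(tbl);
        tbl = next_tbl;  /* restart on the pending table, if any */
    }
    return freed;
}
```

A teardown that stopped after the first table would leave the elements of the second table allocated, which is the situation the nft test script provokes.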
    • D
      Merge branch 'bnxt_en-Bug-fixes' · 252dd176
      David S. Miller committed
      Michael Chan says:
      
      ====================
      bnxt_en: Bug fixes.
      
      This series fixes bugs in error code paths, fixes TC Flower VLAN TCI
      flow checking, properly filters broadcast packets when IFF_BROADCAST
      is not set, and fixes bnxt_get_max_rings() to return 0 ring
      parameters when the return value is -ENOMEM.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      252dd176
    • V
      bnxt_en: Fix for system hang if request_irq fails · c58387ab
      Vikas Gupta committed
      Fix a bug in the error code path taken when bnxt_request_irq() fails.
      bnxt_disable_napi() should not be called in this error path because
      NAPI has not been enabled yet.
      
      Fixes: c0c050c5 ("bnxt_en: New Broadcom ethernet driver.")
      Signed-off-by: Vikas Gupta <vikas.gupta@broadcom.com>
      Signed-off-by: Michael Chan <michael.chan@broadcom.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c58387ab
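The error-path rule in the request_irq fix above (only undo what has already been done) is commonly written with ordered goto labels. The following is a minimal sketch, not the bnxt_en code: `struct dev_model` and all `*_model()` functions are hypothetical, and failures are injected through boolean parameters. A failing IRQ request jumps past the NAPI teardown because NAPI was never enabled.

```c
#include <stdbool.h>

/* Hypothetical device state used only for this illustration. */
struct dev_model {
    bool irq_requested;
    bool napi_enabled;
    int napi_disable_calls;  /* counts teardown calls for inspection */
};

static int request_irq_model(struct dev_model *d, bool fail)
{
    if (fail)
        return -1;
    d->irq_requested = true;
    return 0;
}

static void free_irq_model(struct dev_model *d)
{
    d->irq_requested = false;
}

static void enable_napi_model(struct dev_model *d)
{
    d->napi_enabled = true;
}

static void disable_napi_model(struct dev_model *d)
{
    d->napi_disable_calls++;
    d->napi_enabled = false;
}

/* Open the device; each error label unwinds only the completed steps. */
static int open_model(struct dev_model *d, bool irq_fails, bool later_fails)
{
    int rc = request_irq_model(d, irq_fails);

    if (rc)
        goto err_out;        /* NAPI not enabled yet: skip its teardown */

    enable_napi_model(d);

    if (later_fails) {
        rc = -1;
        goto err_napi;       /* NAPI is enabled, so it must be disabled */
    }
    return 0;

err_napi:
    disable_napi_model(d);
    free_irq_model(d);
err_out:
    return rc;
}
```

Calling `open_model()` with `irq_fails` set leaves `napi_disable_calls` at zero, while a failure after NAPI has been enabled disables it exactly once.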