1. 08 Aug, 2017 5 commits
    • D
      Merge branch 'asix-Improve-robustness' · c0e0fb83
      David S. Miller committed
      Dean Jenkins says:
      
      ====================
      asix: Improve robustness
      
      Please consider taking these patches to improve the robustness of the ASIX USB
      to Ethernet driver.
      
      Failures prompting an ASIX driver code review
      =============================================
      
      On an ARM i.MX6 embedded platform some strange one-off and two-off failures were
      observed in and around the ASIX USB to Ethernet driver. This was observed on a
      highly modified kernel 3.14 with the ASIX driver containing back-ported changes
      from kernel.org up to kernel 4.8 approximately.
      
      a) A one-off failure in asix_rx_fixup_internal():
      
      There was an occurrence of an attempt to write off the end of the netdev buffer
      which was trapped by skb_over_panic() in skb_put().
      
      [20030.846440] skbuff: skb_over_panic: text:7f2271c0 len:120 put:60 head:8366ecc0 data:8366ed02 tail:0x8366ed7a end:0x8366ed40 dev:eth0
      [20030.863007] Kernel BUG at 8044ce38 [verbose debug info unavailable]
      
      [20031.215345] Backtrace:
      [20031.217884] [<8044cde0>] (skb_panic) from [<8044d50c>] (skb_put+0x50/0x5c)
      [20031.227408] [<8044d4bc>] (skb_put) from [<7f2271c0>] (asix_rx_fixup_internal+0x1c4/0x23c [asix])
      [20031.242024] [<7f226ffc>] (asix_rx_fixup_internal [asix]) from [<7f22724c>] (asix_rx_fixup_common+0x14/0x18 [asix])
      [20031.260309] [<7f227238>] (asix_rx_fixup_common [asix]) from [<7f21f7d4>] (usbnet_bh+0x74/0x224 [usbnet])
      [20031.269879] [<7f21f760>] (usbnet_bh [usbnet]) from [<8002f834>] (call_timer_fn+0xa4/0x1f0)
      [20031.283961] [<8002f790>] (call_timer_fn) from [<80030834>] (run_timer_softirq+0x230/0x2a8)
      [20031.302782] [<80030604>] (run_timer_softirq) from [<80028780>] (__do_softirq+0x15c/0x37c)
      [20031.321511] [<80028624>] (__do_softirq) from [<80028c38>] (irq_exit+0x8c/0xe8)
      [20031.339298] [<80028bac>] (irq_exit) from [<8000e9c8>] (handle_IRQ+0x8c/0xc8)
      [20031.350038] [<8000e93c>] (handle_IRQ) from [<800085c8>] (gic_handle_irq+0xb8/0xf8)
      [20031.365528] [<80008510>] (gic_handle_irq) from [<8050de80>] (__irq_svc+0x40/0x70)
      
      Analysis of the logic of the ASIX driver (containing backported changes from
      kernel.org up to kernel 4.8 approximately) suggested that the software could not
      trigger skb_over_panic(). The analysis of the kernel BUG() crash information
      suggested that the netdev buffer was written with 2 minimal 60 octet length
      Ethernet frames (ASIX hardware drops the 4 octet FCS field) and the 2nd Ethernet
      frame attempted to write off the end of the netdev buffer.
      
      Note that the netdev buffer should only contain 1 Ethernet frame so if an
      attempt to write 2 Ethernet frames into the buffer is made then that is wrong.
      However, the logic of the asix_rx_fixup_internal() only allows 1 Ethernet frame
      to be written into the netdev buffer.
      
      Potentially this failure was due to memory corruption because it was only seen
      once.
      
      b) Two-off failures in the NAPI layer's backlog queue:
      
      There were 2 crashes in the NAPI layer's backlog queue presumably after
      asix_rx_fixup_internal() called usbnet_skb_return().
      
      [24097.273945] Unable to handle kernel NULL pointer dereference at virtual address 00000004
      
      [24097.398944] PC is at process_backlog+0x80/0x16c
      
      [24097.569466] Backtrace:
      [24097.572007] [<8045ad98>] (process_backlog) from [<8045b64c>] (net_rx_action+0xcc/0x248)
      [24097.591631] [<8045b580>] (net_rx_action) from [<80028780>] (__do_softirq+0x15c/0x37c)
      [24097.610022] [<80028624>] (__do_softirq) from [<800289cc>] (run_ksoftirqd+0x2c/0x84)
      
      and
      
      [ 1059.828452] Unable to handle kernel NULL pointer dereference at virtual address 00000000
      
      [ 1059.953715] PC is at process_backlog+0x84/0x16c
      
      [ 1060.140896] Backtrace:
      [ 1060.143434] [<8045ad98>] (process_backlog) from [<8045b64c>] (net_rx_action+0xcc/0x248)
      [ 1060.163075] [<8045b580>] (net_rx_action) from [<80028780>] (__do_softirq+0x15c/0x37c)
      [ 1060.181474] [<80028624>] (__do_softirq) from [<80028c38>] (irq_exit+0x8c/0xe8)
      [ 1060.199256] [<80028bac>] (irq_exit) from [<8000e9c8>] (handle_IRQ+0x8c/0xc8)
      [ 1060.210006] [<8000e93c>] (handle_IRQ) from [<800085c8>] (gic_handle_irq+0xb8/0xf8)
      [ 1060.225492] [<80008510>] (gic_handle_irq) from [<8050de80>] (__irq_svc+0x40/0x70)
      
      The embedded board was only using an ASIX USB to Ethernet adaptor eth0.
      
      Analysis suggested that the doubly-linked list pointers of the backlog queue had
      been corrupted because one of the link pointers was NULL.
      
      Potentially this failure was due to memory corruption because it was only seen
      twice.
      
      Results of the ASIX driver code review
      ======================================
      
      During the code review some weaknesses were observed in the ASIX driver and the
      following patches have been created to improve the robustness.
      
      Brief overview of the patches
      -----------------------------
      
      1. asix: Add rx->ax_skb = NULL after usbnet_skb_return()
      
      The current ASIX driver sends the received Ethernet frame to the NAPI layer of
      the network stack via the call to usbnet_skb_return() in
      asix_rx_fixup_internal() but retains the rx->ax_skb pointer to the netdev
      buffer. The driver no longer needs the rx->ax_skb pointer at this point because
      the NAPI layer now has the Ethernet frame.
      
      This means that asix_rx_fixup_internal() must not use rx->ax_skb after the call
      to usbnet_skb_return() because it could corrupt the handling of the Ethernet
      frame within the network layer.
      
      Therefore, to remove the risk of erroneous usage of rx->ax_skb, set rx->ax_skb
      to NULL after the call to usbnet_skb_return(). This avoids potential erroneous
      freeing of rx->ax_skb and erroneous writing to the netdev buffer.  If the
      software now somehow inappropriately reused rx->ax_skb, then a NULL pointer
      dereference of rx->ax_skb would occur which makes investigation easier.
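
      The hand-off described in patch 1 can be pictured with a small userspace
      sketch. This is not the driver code: struct buf, struct rx_state and
      hand_off() are hypothetical stand-ins for the netdev buffer, the
      asix_rx_fixup_info structure and usbnet_skb_return().

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the netdev buffer and the rx fixup state. */
struct buf { size_t len; };
struct rx_state { struct buf *ax_skb; };

/* Stand-in for usbnet_skb_return(): the consumer now owns the buffer and
 * may free it at any time (here it frees it immediately). */
static void hand_off(struct buf *b)
{
    free(b);
}

/* Deliver the completed frame, then drop the now-stale reference. */
static void deliver_frame(struct rx_state *rx)
{
    if (!rx->ax_skb)
        return;             /* nothing pending, so nothing to hand off */
    hand_off(rx->ax_skb);
    rx->ax_skb = NULL;      /* the defensive assignment that patch 1 adds */
}
```

      With the pointer cleared, a second call to deliver_frame() returns early
      instead of handing the same buffer off twice.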
      
      2. asix: Ensure asix_rx_fixup_info members are all reset
      
      This patch creates reset_asix_rx_fixup_info() to allow all the
      asix_rx_fixup_info structure members to be consistently reset to initial
      conditions.
      
      Call reset_asix_rx_fixup_info() upon each detectable error condition so that the
      next URB is processed from a known state.
      
      Otherwise, there is a risk that some members of the asix_rx_fixup_info structure
      may be incorrect after an error occurred so potentially leading to a
      malfunction.
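
      A minimal userspace sketch of the reset idea in patch 2 follows. Only
      ax_skb and split_head are member names taken from this series; the header
      and remaining members, and both function names, are assumptions made for
      illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical model of the receive state kept between URBs. */
struct rx_fixup_info {
    void *ax_skb;        /* partially filled buffer, or NULL */
    uint32_t header;     /* assumed: pending 32-bit data header word */
    uint16_t remaining;  /* assumed: octets still expected for the frame */
    bool split_head;     /* header word was split across two URBs */
};

/* Put every member back to its initial condition in one place. */
static void reset_rx_fixup_info(struct rx_fixup_info *rx)
{
    free(rx->ax_skb);    /* free(NULL) is a no-op */
    rx->ax_skb = NULL;
    rx->header = 0;
    rx->remaining = 0;
    rx->split_head = false;
}

/* Every error path resets first, then returns 0 as the driver does. */
static int rx_fixup_on_error(struct rx_fixup_info *rx)
{
    reset_rx_fixup_info(rx);
    return 0;
}
```

      Funnelling every error path through one reset function is what keeps the
      members from drifting out of step with each other.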
      
      3. asix: Fix small memory leak in ax88772_unbind()
      
      This patch creates asix_rx_fixup_common_free() to allow the rx->ax_skb to be
      freed when necessary.
      
      asix_rx_fixup_common_free() is called from ax88772_unbind() before the parent
      private data structure is freed.
      
      Without this patch, there is a risk of a small netdev buffer memory leak each
      time ax88772_unbind() is called during the reception of an Ethernet frame that
      spans across 2 URBs.
      
      Testing
      =======
      
      The patches have been sanity tested on a 64-bit Linux laptop running kernel
      4.13-rc2 with the 3 patches applied on top.
      
      The ASIX USB to Ethernet adaptor used for testing was (output of lsusb):
      ID 0b95:772b ASIX Electronics Corp. AX88772B
      
      Test #1
      -------
      
      The test ran a flood ping test script which slowly incremented the ICMP Echo
      Request's payload from 0 to 5000 octets. This eventually causes IPv4
      fragmentation to occur which causes Ethernet frames to be sent very close to
      each other so increases the probability that an Ethernet frame will span 2 URBs.
      The test showed that all pings were successful. The test took about 15 minutes
      to complete.
      
      Test #2
      -------
      
      A script was run on the laptop to run ifdown and ifup every second
      so that the ASIX USB to Ethernet adaptor was up for 1 second and down for 1 second.
      
      From a Linux PC connected to the laptop, the following ping command was used:
      ping -f -s 5000 <ip address of laptop>
      
      The large ICMP payload causes IPv4 fragmentation resulting in multiple
      Ethernet frames per original IP packet.
      
      Kernel debug within the ASIX driver was enabled to see whether any ASIX errors
      were generated. The test was run for about 24 hours and no ASIX errors were
      seen.
      
      Patches
      =======
      
      The 3 patches have been rebased off the net-next repo master branch with HEAD
      fbbeefdd net: fec: Allow reception of frames bigger than 1522 bytes
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • D
      asix: Fix small memory leak in ax88772_unbind() · d0c8f338
      Dean Jenkins committed
      When Ethernet frames span multiple URBs, the netdev buffer memory
      pointed to by the asix_rx_fixup_info structure remains allocated
      during the time gap between the 2 executions of asix_rx_fixup_internal().
      
      This means that if ax88772_unbind() is called within this time
      gap to free the memory of the parent private data structure then
      a memory leak of the part filled netdev buffer memory will occur.
      
      Therefore, create a new function asix_rx_fixup_common_free() to
      free the memory of the netdev buffer and add a call to
      asix_rx_fixup_common_free() from inside ax88772_unbind().
      
      Consequently when an unbind occurs part way through receiving
      an Ethernet frame, the netdev buffer memory that is holding part
      of the received Ethernet frame will now be freed.
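
      The ordering this commit relies on (free the partially filled buffer
      first, then the parent private data) can be sketched in userspace as
      below. The struct layout and the names rx_fixup_free() and drv_unbind()
      are hypothetical; they only mirror the roles of
      asix_rx_fixup_common_free() and ax88772_unbind().

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical model: the private data embeds the receive state. */
struct rx_state { void *ax_skb; };
struct drv_priv { struct rx_state rx_fixup_info; };

/* Release a buffer that was left half filled between two URBs. */
static void rx_fixup_free(struct rx_state *rx)
{
    free(rx->ax_skb);       /* free(NULL) is a no-op */
    rx->ax_skb = NULL;
}

/* Unbind: free the child buffer before the parent goes away; once the
 * parent is freed there is no pointer left through which to reach it. */
static void drv_unbind(struct drv_priv **privp)
{
    if (!*privp)
        return;
    rx_fixup_free(&(*privp)->rx_fixup_info);
    free(*privp);
    *privp = NULL;
}
```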
      Signed-off-by: Dean Jenkins <Dean_Jenkins@mentor.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • D
      asix: Ensure asix_rx_fixup_info members are all reset · 960eb4ee
      Dean Jenkins committed
      There is a risk that the members of the structure asix_rx_fixup_info
      become unsynchronised leading to the possibility of a malfunction.
      
      For example, rx->split_head was not being set to false after an
      error was detected so potentially could cause a malformed 32-bit
      Data header word to be formed.
      
      Therefore add function reset_asix_rx_fixup_info() to reset all the
      members of asix_rx_fixup_info so that future processing will start
      with known initial conditions.
      
      Also, if (skb->len != offset) becomes true then call
      reset_asix_rx_fixup_info() so that the processing of the next URB
      starts with known initial conditions. Without the call, the check
      does nothing which potentially could lead to a malfunction
      when the next URB is processed.
      
      In addition, for robustness, call reset_asix_rx_fixup_info() before
      every error path's "return 0". This ensures that the next URB is
      processed from known initial conditions.
      Signed-off-by: Dean Jenkins <Dean_Jenkins@mentor.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • D
      asix: Add rx->ax_skb = NULL after usbnet_skb_return() · 22889dbb
      Dean Jenkins committed
      In asix_rx_fixup_internal() there is a risk that rx->ax_skb gets
      reused after passing the Ethernet frame into the network stack via
      usbnet_skb_return().
      
      The risks include:
      
      a) asynchronously freeing rx->ax_skb after passing the netdev buffer
         to the NAPI layer which might corrupt the backlog queue.
      
      b) erroneously reusing rx->ax_skb such as calling skb_put_data() multiple
         times which causes writing off the end of the netdev buffer.
      
      Therefore add a defensive rx->ax_skb = NULL after usbnet_skb_return()
      so that it is not possible to free rx->ax_skb or to apply
      skb_put_data() too many times.
      Signed-off-by: Dean Jenkins <Dean_Jenkins@mentor.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • T
      bpf: fix selftest/bpf/test_pkt_md_access on s390x · f9ea3225
      Thomas Richter committed
      Commit 18f3d6be ("selftests/bpf: Add test cases to test narrower ctx field loads")
      introduced new eBPF test cases. One of them (test_pkt_md_access.c)
      fails on s390x. The BPF verifier error message is:
      
      [root@s8360046 bpf]# ./test_progs
      test_pkt_access:PASS:ipv4 349 nsec
      test_pkt_access:PASS:ipv6 212 nsec
      [....]
      libbpf: load bpf program failed: Permission denied
      libbpf: -- BEGIN DUMP LOG ---
      libbpf:
      0: (71) r2 = *(u8 *)(r1 +0)
      invalid bpf_context access off=0 size=1
      
      libbpf: -- END LOG --
      libbpf: failed to load program 'test1'
      libbpf: failed to load object './test_pkt_md_access.o'
      Summary: 29 PASSED, 1 FAILED
      [root@s8360046 bpf]#
      
      This is caused by a byte endianness issue. S390x is a big endian
      architecture.  Pointer access to the lowest byte or halfword of a
      four byte value needs to add an offset.
      On little endian architectures this offset is not needed.
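
      The offset in question can be illustrated with a short userspace sketch.
      The helper names are hypothetical, and the sketch probes the byte order
      at run time purely to stay portable; it does not reproduce how the
      selftest itself selects the offset.

```c
#include <stdint.h>
#include <string.h>

/* Byte offset of the least significant `size` bytes inside a field that is
 * `width` bytes wide: 0 on little endian, width - size on big endian. */
static size_t narrow_off(size_t width, size_t size)
{
    const uint16_t probe = 1;
    uint8_t first;

    memcpy(&first, &probe, 1);      /* first byte in memory is 1 on LE */
    return first ? 0 : width - size;
}

/* Read the lowest byte of a 32-bit value through a byte pointer. */
static uint8_t low_byte(uint32_t v)
{
    uint8_t b;

    memcpy(&b, (const uint8_t *)&v + narrow_off(sizeof(v), 1), 1);
    return b;
}

/* Read the lowest halfword of a 32-bit value the same way. */
static uint16_t low_half(uint32_t v)
{
    uint16_t h;

    memcpy(&h, (const uint8_t *)&v + narrow_off(sizeof(v), 2), 2);
    return h;
}
```

      On a big endian machine the lowest byte of a four byte value sits at
      offset 3 and the lowest halfword at offset 2, which is the offset the
      failing test was missing.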
      
      Fix this and use the same approach as the originator used for other files
      (for example test_verifier.c) in his original commit.
      
      With this fix the test program test_progs succeeds on s390x:
      [root@s8360046 bpf]# ./test_progs
      test_pkt_access:PASS:ipv4 236 nsec
      test_pkt_access:PASS:ipv6 217 nsec
      test_xdp:PASS:ipv4 3624 nsec
      test_xdp:PASS:ipv6 1722 nsec
      test_l4lb:PASS:ipv4 926 nsec
      test_l4lb:PASS:ipv6 1322 nsec
      test_tcp_estats:PASS: 0 nsec
      test_bpf_obj_id:PASS:get-fd-by-notexist-prog-id 0 nsec
      test_bpf_obj_id:PASS:get-fd-by-notexist-map-id 0 nsec
      test_bpf_obj_id:PASS:get-prog-info(fd) 0 nsec
      test_bpf_obj_id:PASS:get-map-info(fd) 0 nsec
      test_bpf_obj_id:PASS:get-prog-info(fd) 0 nsec
      test_bpf_obj_id:PASS:get-map-info(fd) 0 nsec
      test_bpf_obj_id:PASS:get-prog-fd(next_id) 0 nsec
      test_bpf_obj_id:PASS:get-prog-info(next_id->fd) 0 nsec
      test_bpf_obj_id:PASS:get-prog-fd(next_id) 0 nsec
      test_bpf_obj_id:PASS:get-prog-info(next_id->fd) 0 nsec
      test_bpf_obj_id:PASS:check total prog id found by get_next_id 0 nsec
      test_bpf_obj_id:PASS:get-map-fd(next_id) 0 nsec
      test_bpf_obj_id:PASS:get-map-fd(next_id) 0 nsec
      test_bpf_obj_id:PASS:get-map-fd(next_id) 0 nsec
      test_bpf_obj_id:PASS:get-map-fd(next_id) 0 nsec
      test_bpf_obj_id:PASS:get-map-fd(next_id) 0 nsec
      test_bpf_obj_id:PASS:get-map-fd(next_id) 0 nsec
      test_bpf_obj_id:PASS:get-map-fd(next_id) 0 nsec
      test_bpf_obj_id:PASS:check get-map-info(next_id->fd) 0 nsec
      test_bpf_obj_id:PASS:get-map-fd(next_id) 0 nsec
      test_bpf_obj_id:PASS:check get-map-info(next_id->fd) 0 nsec
      test_bpf_obj_id:PASS:check total map id found by get_next_id 0 nsec
      test_pkt_md_access:PASS: 277 nsec
      Summary: 30 PASSED, 0 FAILED
      [root@s8360046 bpf]#
      
      Fixes: 18f3d6be ("selftests/bpf: Add test cases to test narrower ctx field loads")
      Signed-off-by: Thomas Richter <tmricht@linux.vnet.ibm.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 07 Aug, 2017 1 commit
    • S
      netvsc: fix race on sub channel creation · 732e4985
      stephen hemminger committed
      The existing sub channel code did not wait for all the sub-channels
      to completely initialize. This could lead to a race causing a crash
      in netif_napi_del() from a bad list. The existing code would send
      an init message, then wait only for the initial response that
      the init message was received. It thought it was waiting for
      sub channels but really the init response did the wakeup.
      
      The new code keeps track of the number of open channels and
      waits until that many are open.
      
      Other issues here were:
        * the host might return fewer sub-channels than were requested.
        * the new init status is not valid until after init has completed.
      
      Fixes: b3e6b82a ("hv_netvsc: Wait for sub-channels to be processed during probe")
      Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 05 Aug, 2017 9 commits
    • D
      bpf: fix byte order test in test_verifier · 2c460621
      Daniel Borkmann committed
      We really must check with #if __BYTE_ORDER == XYZ instead of
      just presence of #ifdef __LITTLE_ENDIAN. I noticed that when
      actually running this on big endian machine, the latter test
      resolves to true for user space, same for #ifdef __BIG_ENDIAN.
      
      E.g., looking at endian.h from libc, both are also defined
      there, so we really must test this against __BYTE_ORDER instead
      for proper insns selection. For the kernel, such checks are
      fine though e.g. see 13da9e20 ("Revert "endian: #define
      __BYTE_ORDER"") and 415586c9 ("UAPI: fix endianness conditionals
      in M32R's asm/stat.h") for some more context, but not for
      user space. Let's also make sure to properly include endian.h.
      After that, suite passes for me:
      
      ./test_verifier: ELF 64-bit MSB executable, [...]
      
      Linux foo 4.13.0-rc3+ #4 SMP Fri Aug 4 06:59:30 EDT 2017 s390x s390x s390x GNU/Linux
      
      Before fix: Summary: 505 PASSED, 11 FAILED
      After  fix: Summary: 516 PASSED,  0 FAILED
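
      A minimal user space sketch of the value comparison described above,
      assuming a libc whose endian.h defines __BYTE_ORDER, __LITTLE_ENDIAN and
      __BIG_ENDIAN (glibc does). The HOST_IS_LITTLE_ENDIAN macro and the helper
      are hypothetical names used only for this illustration.

```c
#include <endian.h>   /* libc header providing __BYTE_ORDER and friends */
#include <stdint.h>
#include <string.h>

/* Compare values; do not test for mere presence with #ifdef, because user
 * space headers may define both __LITTLE_ENDIAN and __BIG_ENDIAN. */
#if __BYTE_ORDER == __LITTLE_ENDIAN
# define HOST_IS_LITTLE_ENDIAN 1
#elif __BYTE_ORDER == __BIG_ENDIAN
# define HOST_IS_LITTLE_ENDIAN 0
#else
# error "unexpected __BYTE_ORDER"
#endif

/* Run-time cross-check of the compile-time answer. */
static int runtime_is_little_endian(void)
{
    const uint32_t probe = 1;
    uint8_t first;

    memcpy(&first, &probe, 1);
    return first == 1;
}
```

      The run-time probe is only a sanity check; the compile-time comparison
      against __BYTE_ORDER is what selects the right instructions.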
      
      Fixes: 18f3d6be ("selftests/bpf: Add test cases to test narrower ctx field loads")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Yonghong <yhs@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • T
      xgene: Always get clk source, but ignore if it's missing for SGMII ports · aaf83aec
      Thomas Bogendoerfer committed
      Even though the driver doesn't do anything with the clk source for SGMII
      ports, it needs to be enabled by doing a devm_clk_get() if there is
      a clk source in DT.
      
      Fixes: 0db01097 ('xgene: Don't fail probe, if there is no clk resource for SGMII interfaces')
      Signed-off-by: Thomas Bogendoerfer <tbogendoerfer@suse.de>
      Tested-by: Laura Abbott <labbott@redhat.com>
      Acked-by: Iyappan Subramanian <isubramanian@apm.com>
      Tested-by: Will Deacon <will.deacon@arm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • D
      MIPS: Add missing file for eBPF JIT. · b6bd53f9
      David Daney committed
      Inexplicably, commit f381bf6d ("MIPS: Add support for eBPF JIT.")
      lost a file somewhere on its path to Linus' tree.  Add back the
      missing ebpf_jit.c so that we can build with CONFIG_BPF_JIT selected.
      
      This version of ebpf_jit.c is identical to the original except for two
      minor changes needed to resolve conflicts with changes merged from the
      BPF branch:
      
      A) Set prog->jited_len = image_size;
      B) Use BPF_TAIL_CALL instead of BPF_CALL | BPF_X
      
      Fixes: f381bf6d ("MIPS: Add support for eBPF JIT.")
      Signed-off-by: David Daney <david.daney@cavium.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • D
      Merge branch 's390-bpf-jit-fixes' · 7a973251
      David S. Miller committed
      Daniel Borkmann says:
      
      ====================
      Two BPF fixes for s390
      
      Found while testing some other work touching JITs.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • D
      bpf, s390: fix build for libbpf and selftest suite · bad1926d
      Daniel Borkmann committed
      The BPF feature test as well as libbpf is missing the __NR_bpf
      define for s390 and currently refuses to compile (selftest suite
      depends on libbpf as well). Similar issue was fixed some time
      ago via b0c47807 ("bpf: Add sparc support to tools and
      samples."), just do the same and add definitions.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • D
      bpf, s390: fix jit branch offset related to ldimm64 · b0a0c256
      Daniel Borkmann committed
      While testing some other work that required JIT modifications, I
      ran into test_bpf causing a hang when the JIT is enabled on s390. The
      problematic test case was the one from ddc665a4 (bpf, arm64:
      fix jit branch offset related to ldimm64), and turns out that we
      do have a similar issue on s390 as well. In bpf_jit_prog() we
      update next instruction address after returning from bpf_jit_insn()
      with an insn_count. bpf_jit_insn() returns either -1 in case of
      error (e.g. unsupported insn), 1 or 2. The latter is only the
      case for ldimm64 because it spans 2 insns. However, the next address
      was only set to i + 1, not taking the actual insn_count into account,
      so the fix is to use insn_count instead of 1. bpf_jit_enable in
      mode 2 also provides disasm on s390:
      
      Before fix:
      
        000003ff800349b6: a7f40003   brc     15,3ff800349bc                 ; target
        000003ff800349ba: 0000               unknown
        000003ff800349bc: e3b0f0700024       stg     %r11,112(%r15)
        000003ff800349c2: e3e0f0880024       stg     %r14,136(%r15)
        000003ff800349c8: 0db0               basr    %r11,%r0
        000003ff800349ca: c0ef00000000       llilf   %r14,0
        000003ff800349d0: e320b0360004       lg      %r2,54(%r11)
        000003ff800349d6: e330b03e0004       lg      %r3,62(%r11)
        000003ff800349dc: ec23ffeda065       clgrj   %r2,%r3,10,3ff800349b6 ; jmp
        000003ff800349e2: e3e0b0460004       lg      %r14,70(%r11)
        000003ff800349e8: e3e0b04e0004       lg      %r14,78(%r11)
        000003ff800349ee: b904002e   lgr     %r2,%r14
        000003ff800349f2: e3b0f0700004       lg      %r11,112(%r15)
        000003ff800349f8: e3e0f0880004       lg      %r14,136(%r15)
        000003ff800349fe: 07fe               bcr     15,%r14
      
      After fix:
      
        000003ff80ef3db4: a7f40003   brc     15,3ff80ef3dba
        000003ff80ef3db8: 0000               unknown
        000003ff80ef3dba: e3b0f0700024       stg     %r11,112(%r15)
        000003ff80ef3dc0: e3e0f0880024       stg     %r14,136(%r15)
        000003ff80ef3dc6: 0db0               basr    %r11,%r0
        000003ff80ef3dc8: c0ef00000000       llilf   %r14,0
        000003ff80ef3dce: e320b0360004       lg      %r2,54(%r11)
        000003ff80ef3dd4: e330b03e0004       lg      %r3,62(%r11)
        000003ff80ef3dda: ec230006a065       clgrj   %r2,%r3,10,3ff80ef3de6 ; jmp
        000003ff80ef3de0: e3e0b0460004       lg      %r14,70(%r11)
        000003ff80ef3de6: e3e0b04e0004       lg      %r14,78(%r11)          ; target
        000003ff80ef3dec: b904002e   lgr     %r2,%r14
        000003ff80ef3df0: e3b0f0700004       lg      %r11,112(%r15)
        000003ff80ef3df6: e3e0f0880004       lg      %r14,136(%r15)
        000003ff80ef3dfc: 07fe               bcr     15,%r14
      
      test_bpf.ko suite runs fine after the fix.
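
      The address bookkeeping behind this fix can be sketched in userspace as
      follows. emit_insn() is a hypothetical stand-in for bpf_jit_insn(), and
      the opcode tags and code sizes are arbitrary values chosen for the
      illustration.

```c
#include <stddef.h>

/* Hypothetical opcode tags: OP_WIDE occupies two instruction slots, the way
 * ldimm64 does; everything else occupies one. */
enum { OP_NORMAL = 0, OP_WIDE = 1 };

/* Stand-in for bpf_jit_insn(): advances the code offset and returns the
 * number of instruction slots consumed (1 or 2), or -1 on error. */
static int emit_insn(const int *ops, size_t i, size_t n, unsigned *pc)
{
    if (ops[i] == OP_WIDE) {
        if (i + 1 >= n)
            return -1;      /* truncated two-slot instruction */
        *pc += 6;           /* arbitrary code size */
        return 2;
    }
    *pc += 4;               /* arbitrary code size */
    return 1;
}

/* Fill addrs[] (n + 1 entries) with the code offset of each instruction. */
static int build_addrs(const int *ops, size_t n, unsigned *addrs)
{
    unsigned pc = 0;
    size_t i = 0;

    addrs[0] = 0;
    while (i < n) {
        int insn_count = emit_insn(ops, i, n, &pc);

        if (insn_count < 0)
            return -1;
        /* The fix: index by i + insn_count rather than i + 1, so the
         * instruction after a two-slot insn gets the right address. */
        addrs[i + (size_t)insn_count] = pc;
        i += (size_t)insn_count;
    }
    return 0;
}
```

      With i + 1 in place of i + insn_count, the slot following a two-slot
      instruction would keep a stale address, and a branch targeting it would
      land in the wrong place, as in the "Before fix" disassembly.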
      
      Fixes: 05462310 ("s390/bpf: Add s390x eBPF JIT compiler backend")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Tested-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • D
      Merge branch 'mlxsw-Couple-of-fixes' · 1aff0c34
      David S. Miller committed
      Jiri Pirko says:
      
      ====================
      mlxsw: Couple of fixes
      
      Ido says:
      
      The first patch prevents us from warning about valid situations that can
      happen due to the fact that some operations in switchdev are deferred.
      
      Second patch fixes a long standing problem in which we didn't correctly
      free resources upon module removal, resulting in a memory leak.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • I
      mlxsw: spectrum_switchdev: Release multicast groups during fini · 852cfeed
      Ido Schimmel committed
      Each multicast group (MID) stores a bitmap of ports to which a packet
      should be forwarded in case an MDB entry associated with the MID is
      hit.
      
      Since the initial introduction of IGMP snooping in commit 3a49b4fd
      ("mlxsw: Adding layer 2 multicast support") the driver didn't correctly
      free these multicast groups upon ungraceful situations such as the
      removal of the upper bridge device or module removal.
      
      The correct way to fix this is to associate each MID with the bridge
      ports that are members of it and then drop the reference when a bridge
      port is destroyed, but this will result in a lot more code and will be
      fixed in net-next.
      
      For now, upon module removal, traverse the MID list and release each
      one.
      
      Fixes: 3a49b4fd ("mlxsw: Adding layer 2 multicast support")
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • I
      mlxsw: spectrum_switchdev: Don't warn about valid situations · 17b334a8
      Ido Schimmel committed
      Some operations in the bridge driver such as MDB deletion are performed
      in an atomic context and thus deferred to a process context by the
      switchdev infrastructure.
      
      Therefore, by the time the operation is performed by the underlying
      device driver it's possible the bridge port context is already gone.
      This is especially true for removal flows, but theoretically can also be
      invoked during addition.
      
      Remove the warnings in such situations and return normally.
      
      Fixes: c57529e1 ("mlxsw: spectrum: Replace vPorts with Port-VLAN")
      Fixes: 3922285d ("net: bridge: Add support for offloading port attributes")
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 04 Aug, 2017 7 commits
    • D
      Merge branch 'tcp-xmit-timer-rearming' · 337f1b07
      David S. Miller committed
      Neal Cardwell says:
      
      ====================
      tcp: fix xmit timer rearming to avoid stalls
      
      This patch series is a bug fix for a TCP loss recovery performance bug
      reported independently in recent netdev threads:
      
       (i)  July 26, 2017: netdev thread "TCP fast retransmit issues"
       (ii) July 26, 2017: netdev thread:
             "[PATCH V2 net-next] TLP: Don't reschedule PTO when there's one
             outstanding TLP retransmission"
      
      Many thanks to Klavs Klavsen and Mao Wenan for the detailed reports,
      traces, and packetdrill test cases, which enabled us to root-cause
      this issue and verify the fix.
      
      - v1 -> v2:
       - In patch 2/3, changed an unclear comment in the pre-existing code
         in tcp_schedule_loss_probe() to be more clear (thanks to Eric Dumazet
         for suggesting we improve this).
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • N
      tcp: fix xmit timer to only be reset if data ACKed/SACKed · df92c839
      Neal Cardwell committed
      Fix a TCP loss recovery performance bug raised recently on the netdev
      list, in two threads:
      
      (i)  July 26, 2017: netdev thread "TCP fast retransmit issues"
      (ii) July 26, 2017: netdev thread:
           "[PATCH V2 net-next] TLP: Don't reschedule PTO when there's one
           outstanding TLP retransmission"
      
      The basic problem is that incoming TCP packets that did not indicate
      forward progress could cause the xmit timer (TLP or RTO) to be rearmed
      and pushed back in time. In certain corner cases this could result in
      the following problems noted in these threads:
      
       - Repeated ACKs coming in with bogus SACKs corrupted by middleboxes
         could cause TCP to repeatedly schedule TLPs forever. We kept
         sending TLPs after every ~200ms, which elicited bogus SACKs, which
         caused more TLPs, ad infinitum; we never fired an RTO to fill in
         the holes.
      
       - Incoming data segments could, in some cases, cause us to reschedule
         our RTO or TLP timer further out in time, for no good reason. This
         could cause repeated inbound data to result in stalls in outbound
         data, in the presence of packet loss.
      
      This commit fixes these bugs by changing the TLP and RTO ACK
      processing to:
      
       (a) Only reschedule the xmit timer once per ACK.
      
       (b) Only reschedule the xmit timer if tcp_clean_rtx_queue() deems the
           ACK indicates sufficient forward progress (a packet was
           cumulatively ACKed, or we got a SACK for a packet that was sent
           before the most recent retransmit of the write queue head).
      
      This brings us back into closer compliance with the RFCs, since, as
      the comment for tcp_rearm_rto() notes, we should only restart the RTO
      timer after forward progress on the connection. Previously we were
      restarting the xmit timer even in these cases where there was no
      forward progress.
      
      As a side benefit, this commit simplifies and speeds up the TCP timer
      arming logic. We had been calling inet_csk_reset_xmit_timer() three
      times on normal ACKs that cumulatively acknowledged some data:
      
      1) Once near the top of tcp_ack() to switch from TLP timer to RTO:
              if (icsk->icsk_pending == ICSK_TIME_LOSS_PROBE)
                     tcp_rearm_rto(sk);
      
      2) Once in tcp_clean_rtx_queue(), to update the RTO:
              if (flag & FLAG_ACKED) {
                     tcp_rearm_rto(sk);
      
      3) Once in tcp_ack() after tcp_fastretrans_alert() to switch from RTO
         to TLP:
              if (icsk->icsk_pending == ICSK_TIME_RETRANS)
                     tcp_schedule_loss_probe(sk);
      
      This commit, by only rescheduling the xmit timer once per ACK,
      simplifies the code and reduces CPU overhead.
      
      This commit was tested in an A/B test with Google web server
      traffic. SNMP stats and request latency metrics were within noise
      levels, substantiating that for normal web traffic patterns this is a
      rare issue. This commit was also tested with packetdrill tests to
      verify that it fixes the timer behavior in the corner cases discussed
      in the netdev threads mentioned above.
      
      This patch is a bug fix patch intended to be queued for -stable
      releases.
      
      Fixes: 6ba8a3b1 ("tcp: Tail loss probe (TLP)")
      Reported-by: NKlavs Klavsen <kl@vsen.dk>
      Reported-by: NMao Wenan <maowenan@huawei.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NNandita Dukkipati <nanditad@google.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      df92c839
    • N
      tcp: enable xmit timer fix by having TLP use time when RTO should fire · a2815817
      Neal Cardwell committed
      Have tcp_schedule_loss_probe() base the TLP scheduling decision
      on when the RTO *should* fire. This is to enable the upcoming xmit
      timer fix in this series, where tcp_schedule_loss_probe() cannot
      assume that the last timer installed was an RTO timer (because we are
      no longer doing the "rearm RTO, rearm RTO, rearm TLP" dance on every
      ACK). So tcp_schedule_loss_probe() must independently figure out when
      an RTO would want to fire.
      
      In the new TLP implementation following in this series, we cannot
      assume that icsk_timeout was set based on an RTO; after processing a
      cumulative ACK the icsk_timeout we see can be from a previous TLP or
      RTO. So we need to independently recalculate the RTO time (instead of
      reading it out of icsk_timeout). Removing this dependency on the
      nature of icsk_timeout makes things a little easier to reason about
      anyway.
      
      Note that the old and new code should be equivalent, since they are
      both saying: "if the RTO is in the future, but at an earlier time than
      the normal TLP time, then set the TLP timer to fire when the RTO would
      have fired".
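
      The quoted rule can be written down as a small stand-alone function.
      This is a minimal sketch, not the kernel implementation: the function
      name and the use of plain microsecond values are assumptions made for
      illustration, and "rto_delta_us" stands for the time remaining until
      the RTO should fire (zero or negative if that time has passed).

      ```c
      #include <stdint.h>

      /*
       * Hypothetical helper: pick the TLP timeout.
       * normal_tlp_us - the timeout the TLP logic would normally choose
       * rto_delta_us  - time left until the RTO should fire
       */
      static int64_t toy_tlp_timeout_us(int64_t normal_tlp_us, int64_t rto_delta_us)
      {
          /* RTO is in the future but earlier than the normal TLP time:
           * fire the TLP timer when the RTO would have fired.
           */
          if (rto_delta_us > 0 && rto_delta_us < normal_tlp_us)
              return rto_delta_us;

          return normal_tlp_us;
      }
      ```

      When the RTO time is not in the future, the sketch leaves the normal
      TLP timeout unchanged, matching the "if the RTO is in the future"
      condition in the text.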
      
      Fixes: 6ba8a3b1 ("tcp: Tail loss probe (TLP)")
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Nandita Dukkipati <nanditad@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a2815817
    • N
      tcp: introduce tcp_rto_delta_us() helper for xmit timer fix · e1a10ef7
      Neal Cardwell committed
      Pure refactor. This helper will be required in the xmit timer fix
      later in the patch series. (Because the TLP logic will want to make
      this calculation.)
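
      As a rough user-space illustration of the kind of calculation such a
      helper performs, the sketch below computes how far in the future the
      RTO should fire. The function name, its parameters and the use of
      microsecond timestamps passed in by the caller are assumptions for
      this example; the real helper reads its inputs from the socket.

      ```c
      #include <stdint.h>

      /*
       * Hypothetical model of an "RTO delta" calculation.
       * head_sent_us - time the oldest unacknowledged packet was sent
       * rto_us       - current retransmission timeout
       * now_us       - current time
       * Returns the time until the RTO should fire; zero or negative
       * means it is already due.
       */
      static int64_t toy_rto_delta_us(uint64_t head_sent_us, uint64_t rto_us,
                                      uint64_t now_us)
      {
          uint64_t rto_time_stamp_us = head_sent_us + rto_us;

          return (int64_t)rto_time_stamp_us - (int64_t)now_us;
      }
      ```

      A positive result means the RTO is still in the future, which is the
      case the TLP scheduling logic needs to distinguish.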
      
      Fixes: 6ba8a3b1 ("tcp: Tail loss probe (TLP)")
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Nandita Dukkipati <nanditad@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e1a10ef7
    • X
      ipv6: set rt6i_protocol properly in the route when it is installed · b91d5329
      Xin Long committed
      After commit c2ed1880 ("net: ipv6: check route protocol when
      deleting routes"), ipv6 route checks rt protocol when trying to
      remove a rt entry.
      
      It introduced a side effect causing 'ip -6 route flush cache' not
      to work well. When flushing caches with iproute, all route caches
      get dumped from kernel then removed one by one by sending DELROUTE
      requests to kernel for each cache.
      
      The thing is iproute sends the request with the cache whose proto
      is set with RTPROT_REDIRECT by rt6_fill_node() when kernel dumps
      it. But in kernel the rt_cache protocol is still 0, which causes
      the cache not to be matched and removed.
      
      So the real reason is that rt6i_protocol in the route is not set when
      it is allocated. Following David Ahern's suggestion, this patch sets
      rt6i_protocol properly in the route when it is installed and
      removes the code setting rtm_protocol according to rt6i_flags in
      rt6_fill_node().
      
      This is also an improvement to keep rt6i_protocol consistent with
      rtm_protocol.
      
      Fixes: c2ed1880 ("net: ipv6: check route protocol when deleting routes")
      Reported-by: Jianlin Shi <jishi@redhat.com>
      Suggested-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b91d5329
    • E
      net: fix keepalive code vs TCP_FASTOPEN_CONNECT · 2dda6400
      Eric Dumazet committed
      syzkaller was able to trigger a divide by 0 in the TCP stack [1].
      
      The issue here is that the keepalive timer needs to be updated to not
      attempt to send a probe if the connection setup was deferred using the
      TCP_FASTOPEN_CONNECT socket option added in linux-4.11.
      
      [1]
       divide error: 0000 [#1] SMP
       CPU: 18 PID: 0 Comm: swapper/18 Not tainted
       task: ffff986f62f4b040 ti: ffff986f62fa2000 task.ti: ffff986f62fa2000
       RIP: 0010:[<ffffffff8409cc0d>]  [<ffffffff8409cc0d>] __tcp_select_window+0x8d/0x160
       Call Trace:
        <IRQ>
        [<ffffffff8409d951>] tcp_transmit_skb+0x11/0x20
        [<ffffffff8409da21>] tcp_xmit_probe_skb+0xc1/0xe0
        [<ffffffff840a0ee8>] tcp_write_wakeup+0x68/0x160
        [<ffffffff840a151b>] tcp_keepalive_timer+0x17b/0x230
        [<ffffffff83b3f799>] call_timer_fn+0x39/0xf0
        [<ffffffff83b40797>] run_timer_softirq+0x1d7/0x280
        [<ffffffff83a04ddb>] __do_softirq+0xcb/0x257
        [<ffffffff83ae03ac>] irq_exit+0x9c/0xb0
        [<ffffffff83a04c1a>] smp_apic_timer_interrupt+0x6a/0x80
        [<ffffffff83a03eaf>] apic_timer_interrupt+0x7f/0x90
        <EOI>
        [<ffffffff83fed2ea>] ? cpuidle_enter_state+0x13a/0x3b0
        [<ffffffff83fed2cd>] ? cpuidle_enter_state+0x11d/0x3b0
      
      Tested:
      
      The following packetdrill script no longer crashes the kernel:
      
      `echo 0 >/proc/sys/net/ipv4/tcp_timestamps`
      
      // Cache warmup: send a Fast Open cookie request
          0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
         +0 fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK) = 0
         +0 setsockopt(3, SOL_TCP, TCP_FASTOPEN_CONNECT, [1], 4) = 0
         +0 connect(3, ..., ...) = -1 EINPROGRESS (Operation is now in progress)
         +0 > S 0:0(0) <mss 1460,nop,nop,sackOK,nop,wscale 8,FO,nop,nop>
       +.01 < S. 123:123(0) ack 1 win 14600 <mss 1460,nop,nop,sackOK,nop,wscale 6,FO abcd1234,nop,nop>
         +0 > . 1:1(0) ack 1
         +0 close(3) = 0
         +0 > F. 1:1(0) ack 1
         +0 < F. 1:1(0) ack 2 win 92
         +0 > .  2:2(0) ack 2
      
         +0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 4
         +0 fcntl(4, F_SETFL, O_RDWR|O_NONBLOCK) = 0
         +0 setsockopt(4, SOL_TCP, TCP_FASTOPEN_CONNECT, [1], 4) = 0
         +0 setsockopt(4, SOL_SOCKET, SO_KEEPALIVE, [1], 4) = 0
       +.01 connect(4, ..., ...) = 0
         +0 setsockopt(4, SOL_TCP, TCP_KEEPIDLE, [5], 4) = 0
         +10 close(4) = 0
      
      `echo 1 >/proc/sys/net/ipv4/tcp_timestamps`
      
      Fixes: 19f6d3f3 ("net/tcp-fastopen: Add new API support")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Cc: Wei Wang <weiwan@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2dda6400
    • D
      Merge tag 'batadv-net-for-davem-20170802' of git://git.open-mesh.org/linux-merge · 4d2bbb0e
      David S. Miller committed
      Simon Wunderlich says:
      
      ====================
      Here is a batman-adv bugfix:
      
       - fix TT sync flag inconsistency problems, which can lead to excess packets,
         by Linus Luessing
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4d2bbb0e
  5. 03 Aug, 2017 11 commits
  6. 02 Aug, 2017 7 commits