1. 24 Jul, 2018 2 commits
    • ipv6: use fib6_info_hold_safe() when necessary · e873e4b9
      Wei Wang committed
      In code paths where only the RCU read lock is held, e.g. the route
      lookup code path, it is not safe to call fib6_info_hold() directly
      because the fib6_info may already have been deleted while still
      existing within the RCU grace period. Taking a reference to it could
      cause a double free and crash the kernel.
      
      This patch adds a new function, fib6_info_hold_safe(), and uses it to
      replace fib6_info_hold() in all necessary places.
      
      Syzbot reported 3 crash traces because of this. One of them is:
      8021q: adding VLAN 0 to HW filter on device team0
      IPv6: ADDRCONF(NETDEV_CHANGE): team0: link becomes ready
      dst_release: dst:(____ptrval____) refcnt:-1
      dst_release: dst:(____ptrval____) refcnt:-2
      WARNING: CPU: 1 PID: 4845 at include/net/dst.h:239 dst_hold include/net/dst.h:239 [inline]
      WARNING: CPU: 1 PID: 4845 at include/net/dst.h:239 ip6_setup_cork+0xd66/0x1830 net/ipv6/ip6_output.c:1204
      dst_release: dst:(____ptrval____) refcnt:-1
      Kernel panic - not syncing: panic_on_warn set ...
      
      CPU: 1 PID: 4845 Comm: syz-executor493 Not tainted 4.18.0-rc3+ #10
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x1c9/0x2b4 lib/dump_stack.c:113
       panic+0x238/0x4e7 kernel/panic.c:184
      dst_release: dst:(____ptrval____) refcnt:-2
      dst_release: dst:(____ptrval____) refcnt:-3
       __warn.cold.8+0x163/0x1ba kernel/panic.c:536
      dst_release: dst:(____ptrval____) refcnt:-4
       report_bug+0x252/0x2d0 lib/bug.c:186
       fixup_bug arch/x86/kernel/traps.c:178 [inline]
       do_error_trap+0x1fc/0x4d0 arch/x86/kernel/traps.c:296
      dst_release: dst:(____ptrval____) refcnt:-5
       do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:316
       invalid_op+0x14/0x20 arch/x86/entry/entry_64.S:992
      RIP: 0010:dst_hold include/net/dst.h:239 [inline]
      RIP: 0010:ip6_setup_cork+0xd66/0x1830 net/ipv6/ip6_output.c:1204
      Code: c1 ed 03 89 9d 18 ff ff ff 48 b8 00 00 00 00 00 fc ff df 41 c6 44 05 00 f8 e9 2d 01 00 00 4c 8b a5 c8 fe ff ff e8 1a f6 e6 fa <0f> 0b e9 6a fc ff ff e8 0e f6 e6 fa 48 8b 85 d0 fe ff ff 48 8d 78
      RSP: 0018:ffff8801a8fcf178 EFLAGS: 00010293
      RAX: ffff8801a8eba5c0 RBX: 0000000000000000 RCX: ffffffff869511e6
      RDX: 0000000000000000 RSI: ffffffff869515b6 RDI: 0000000000000005
      RBP: ffff8801a8fcf2c8 R08: ffff8801a8eba5c0 R09: ffffed0035ac8338
      R10: ffffed0035ac8338 R11: ffff8801ad6419c3 R12: ffff8801a8fcf720
      R13: ffff8801a8fcf6a0 R14: ffff8801ad6419c0 R15: ffff8801ad641980
       ip6_make_skb+0x2c8/0x600 net/ipv6/ip6_output.c:1768
       udpv6_sendmsg+0x2c90/0x35f0 net/ipv6/udp.c:1376
       inet_sendmsg+0x1a1/0x690 net/ipv4/af_inet.c:798
       sock_sendmsg_nosec net/socket.c:641 [inline]
       sock_sendmsg+0xd5/0x120 net/socket.c:651
       ___sys_sendmsg+0x51d/0x930 net/socket.c:2125
       __sys_sendmmsg+0x240/0x6f0 net/socket.c:2220
       __do_sys_sendmmsg net/socket.c:2249 [inline]
       __se_sys_sendmmsg net/socket.c:2246 [inline]
       __x64_sys_sendmmsg+0x9d/0x100 net/socket.c:2246
       do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      RIP: 0033:0x446ba9
      Code: e8 cc bb 02 00 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 eb 08 fc ff c3 66 2e 0f 1f 84 00 00 00 00
      RSP: 002b:00007fb39a469da8 EFLAGS: 00000246 ORIG_RAX: 0000000000000133
      RAX: ffffffffffffffda RBX: 00000000006dcc54 RCX: 0000000000446ba9
      RDX: 00000000000000b8 RSI: 0000000020001b00 RDI: 0000000000000003
      RBP: 00000000006dcc50 R08: 00007fb39a46a700 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000246 R12: 45c828efc7a64843
      R13: e6eeb815b9d8a477 R14: 5068caf6f713c6fc R15: 0000000000000001
      Dumping ftrace buffer:
         (ftrace buffer empty)
      Kernel Offset: disabled
      Rebooting in 86400 seconds..
      
      Fixes: 93531c67 ("net/ipv6: separate handling of FIB entries from dst based routes")
      Reported-by: syzbot+902e2a1bcd4f7808cef5@syzkaller.appspotmail.com
      Reported-by: syzbot+8ae62d67f647abeeceb9@syzkaller.appspotmail.com
      Reported-by: syzbot+3f08feb14086930677d0@syzkaller.appspotmail.com
      Signed-off-by: Wei Wang <weiwan@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
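A minimal userspace sketch of the "take a reference only if the object is still referenced" idea described in the ipv6 commit above. The names `struct obj`, `obj_hold()` and `obj_hold_safe()` are hypothetical stand-ins, and C11 atomics replace the kernel's own atomic helpers; the real fib6_info_hold_safe() may be implemented differently.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical reference-counted object standing in for a fib6_info. */
struct obj {
    atomic_int refcnt;
};

/* Unconditional hold: only safe when the caller already owns a reference. */
static inline void obj_hold(struct obj *o)
{
    assert(o != NULL);
    atomic_fetch_add(&o->refcnt, 1);
}

/*
 * Safe hold: increments the count only if it is still non-zero.
 * Returns false when the object has already dropped to zero, i.e. it is
 * on its way to being freed and must not be revived.
 */
static inline bool obj_hold_safe(struct obj *o)
{
    assert(o != NULL);
    int old = atomic_load(&o->refcnt);

    while (old != 0) {
        /* On failure, 'old' is reloaded with the current value. */
        if (atomic_compare_exchange_weak(&o->refcnt, &old, old + 1))
            return true;
    }
    return false;
}
```

A lookup path that only holds a read-side lock would call `obj_hold_safe()` and treat a `false` result as "entry not found", rather than reviving an object whose count has already reached zero.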
    • Merge tag 'linux-can-fixes-for-4.18-20180723' of... · 5302a84e
      David S. Miller committed
      Merge tag 'linux-can-fixes-for-4.18-20180723' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can
      
      Marc Kleine-Budde says:
      
      ====================
      pull-request: can 2018-07-23
      
      this is a pull request of 12 patches for net/master.
      
      The patch by Stephane Grosjean for the peak_canfd CAN driver fixes a problem
      with older firmware. The next patch is by Roman Fietze and fixes the setup of
      the CCCR register in the m_can driver. Nicholas Mc Guire's patch for the
      mpc5xxx_can driver adds missing error checking. The two patches by Faiz Abbas
      fix the runtime resume and clean up the probe function in the m_can driver. The
      last 7 patches by Anssi Hannula fix several problems in the xilinx_can driver.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 23 Jul, 2018 20 commits
    • can: xilinx_can: fix power management handling · 8ebd83bd
      Anssi Hannula committed
      There are several issues with the suspend/resume handling code of the
      driver:
      
      - The device is attached and detached in the runtime_suspend() and
        runtime_resume() callbacks if the interface is running. However,
        during xcan_chip_start() the interface is considered running,
        causing the resume handler to incorrectly call netif_start_queue()
        at the beginning of xcan_chip_start(). On an xcan_chip_start() error
        return, the suspend handler detaches the device, leaving the user
        unable to bring up the device anymore.
      
      - The device is not brought up properly on system resume. A reset is
        done and the code tries to determine the bus state after that.
        However, after reset the device is always in Configuration mode
        (down), so the state checking code does not make sense and
        communication will also not work.
      
      - The suspend callback tries to set the device to sleep mode (low-power
        mode which monitors the bus and brings the device back to normal mode
        on activity), but then immediately disables the clocks (possibly
        before the device reaches the sleep mode), which does not make sense
        to me. If a clean shutdown is wanted before disabling clocks, we can
        just bring it down completely instead of only sleep mode.
      
      Reorganize the PM code so that only the clock logic remains in the
      runtime PM callbacks and the system PM callbacks contain the device
      bring-up/down logic. This makes calling the runtime PM callbacks during
      e.g. xcan_chip_start() safe.
      
      The system PM callbacks now simply call common code to start/stop the
      HW if the interface was running, replacing the broken code from before.
      
      xcan_chip_stop() is updated to use the common reset code so that it will
      wait for the reset to complete. Reset also disables all interrupts so do
      not do that separately.
      
      Also, the device_may_wakeup() checks are removed as the driver does not
      have wakeup support.
      
      Tested on Zynq-7000 integrated CAN.
      Signed-off-by: Anssi Hannula <anssi.hannula@bitwise.fi>
      Cc: Michal Simek <michal.simek@xilinx.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
    • can: xilinx_can: fix incorrect clear of non-processed interrupts · 2f4f0f33
      Anssi Hannula committed
      xcan_interrupt() clears ERROR|RXOFLW|BSOFF|ARBLST interrupts if any of
      them is asserted. This does not take into account that some of them
      could have been asserted between interrupt status read and interrupt
      clear, therefore clearing them without handling them.
      
      Fix the code to only clear those interrupts that it knows are asserted
      and therefore going to be processed in xcan_err_interrupt().
      
      Fixes: b1201e44 ("can: xilinx CAN controller support")
      Signed-off-by: Anssi Hannula <anssi.hannula@bitwise.fi>
      Cc: Michal Simek <michal.simek@xilinx.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
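The difference between the old and the fixed clearing behaviour described in the commit above can be sketched with plain bit masks. The bit positions and the names `IXR_*`, `clear_mask_before()` and `clear_mask_after()` are hypothetical and chosen only for illustration; they are not the driver's real register layout.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical interrupt bit positions, for illustration only. */
#define IXR_ARBLST (1u << 0)
#define IXR_ERROR  (1u << 1)
#define IXR_RXOFLW (1u << 2)
#define IXR_BSOFF  (1u << 3)

#define IXR_ERR_GROUP (IXR_ARBLST | IXR_ERROR | IXR_RXOFLW | IXR_BSOFF)

static_assert((IXR_ERR_GROUP & ~0xFu) == 0, "error group must stay within its four bits");

/* Old behaviour: any asserted error bit causes the whole group to be cleared. */
static inline uint32_t clear_mask_before(uint32_t isr)
{
    return (isr & IXR_ERR_GROUP) ? IXR_ERR_GROUP : 0;
}

/* Fixed behaviour: clear only the error bits that were actually read as asserted. */
static inline uint32_t clear_mask_after(uint32_t isr)
{
    return isr & IXR_ERR_GROUP;
}
```

With the second form, an error interrupt that becomes asserted after the status read is left pending and is handled on the next pass instead of being silently cleared.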
    • can: xilinx_can: fix RX overflow interrupt not being enabled · 83997997
      Anssi Hannula committed
      RX overflow interrupt (RXOFLW) is disabled even though xcan_interrupt()
      processes it. This means that an RX overflow interrupt will only be
      processed when another interrupt gets asserted (e.g. for RX/TX).
      
      Fix that by enabling the RXOFLW interrupt.
      
      Fixes: b1201e44 ("can: xilinx CAN controller support")
      Signed-off-by: Anssi Hannula <anssi.hannula@bitwise.fi>
      Cc: Michal Simek <michal.simek@xilinx.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
    • can: xilinx_can: keep only 1-2 frames in TX FIFO to fix TX accounting · 620050d9
      Anssi Hannula committed
      The xilinx_can driver assumes that the TXOK interrupt only clears after
      it has been acknowledged as many times as there have been successfully
      sent frames.
      
      However, the documentation does not mention such behavior, instead
      saying just that the interrupt is cleared when the clear bit is set.
      
      Similarly, testing seems to also suggest that it is immediately cleared
      regardless of the amount of frames having been sent. Performing some
      heavy TX load and then going back to idle has the tx_head drifting
      further away from tx_tail over time, steadily reducing the amount of
      frames the driver keeps in the TX FIFO (but not to zero, as the TXOK
      interrupt always frees up space for 1 frame from the driver's
      perspective, so frames continue to be sent) and delaying the local echo
      frames.
      
      The TX FIFO tracking is also otherwise buggy as it does not account for
      TX FIFO being cleared after software resets, causing
        BUG!, TX FIFO full when queue awake!
      messages to be output.
      
      There does not seem to be any way to accurately track the state of the
      TX FIFO for local echo support while using the full TX FIFO.
      
      The Zynq version of the HW (but not the soft-AXI version) has watermark
      programming support and with it an additional TX-FIFO-empty interrupt
      bit.
      
      Modify the driver to only put 1 frame into TX FIFO at a time on soft-AXI
      and 2 frames at a time on Zynq. On Zynq the TXFEMP interrupt bit is used
      to detect whether 1 or 2 frames have been sent at interrupt processing
      time.
      
      Tested with the integrated CAN on Zynq-7000 SoC. The 1-frame-FIFO mode
      was also tested.
      
      An alternative way to solve this would be to drop local echo support but
      keep using the full TX FIFO.
      
      v2: Add FIFO space check before TX queue wake with locking to
      synchronize with queue stop. This avoids waking the queue when xmit()
      had just filled it.
      
      v3: Keep local echo support and reduce the amount of frames in FIFO
      instead as suggested by Marc Kleine-Budde.
      
      Fixes: b1201e44 ("can: xilinx CAN controller support")
      Signed-off-by: Anssi Hannula <anssi.hannula@bitwise.fi>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
    • can: xilinx_can: fix recovery from error states not being propagated · 877e0b75
      Anssi Hannula committed
      The xilinx_can driver contains no mechanism for propagating recovery
      from CAN_STATE_ERROR_WARNING and CAN_STATE_ERROR_PASSIVE.
      
      Add such a mechanism by factoring the handling of
      XCAN_STATE_ERROR_PASSIVE and XCAN_STATE_ERROR_WARNING out of
      xcan_err_interrupt and checking for recovery after RX and TX if the
      interface is in one of those states.
      
      Tested with the integrated CAN on Zynq-7000 SoC.
      
      Fixes: b1201e44 ("can: xilinx CAN controller support")
      Signed-off-by: Anssi Hannula <anssi.hannula@bitwise.fi>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
    • can: xilinx_can: fix RX loop if RXNEMP is asserted without RXOK · 32852c56
      Anssi Hannula committed
      If the device gets into a state where RXNEMP (RX FIFO not empty)
      interrupt is asserted without RXOK (new frame received successfully)
      interrupt being asserted, xcan_rx_poll() will continue to try to clear
      RXNEMP without actually reading frames from RX FIFO. If the RX FIFO is
      not empty, the interrupt will not be cleared and napi_schedule() will
      just be called again.
      
      This situation can occur when:
      
      (a) xcan_rx() returns without reading RX FIFO due to an error condition.
      The code tries to clear both RXOK and RXNEMP but RXNEMP will not clear
      due to a frame still being in the FIFO. The frame will never be read
      from the FIFO as RXOK is no longer set.
      
      (b) A frame is received between xcan_rx_poll() reading interrupt status
      and clearing RXOK. RXOK will be cleared, but RXNEMP will again remain
      set as the new message is still in the FIFO.
      
      I'm able to trigger case (b) by flooding the bus with frames under load.
      
      There does not seem to be any benefit in using both RXNEMP and RXOK in
      the way the driver does, and the polling example in the reference manual
      (UG585 v1.10 18.3.7 Read Messages from RxFIFO) also says that either
      RXOK or RXNEMP can be used for detecting incoming messages.
      
      Fix the issue and simplify the RX processing by only using RXNEMP
      without RXOK.
      
      Tested with the integrated CAN on Zynq-7000 SoC.
      
      Fixes: b1201e44 ("can: xilinx CAN controller support")
      Signed-off-by: Anssi Hannula <anssi.hannula@bitwise.fi>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
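A rough model of the simplified RX processing described in the commit above, where the poll loop is driven only by "RX FIFO not empty". The `struct fake_can` device model and the functions `rxnemp_asserted()` and `rx_poll()` are hypothetical; the real driver reads hardware registers and works through NAPI.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical device model: only tracks how many frames sit in the RX FIFO. */
struct fake_can {
    int rx_fifo_level;
};

/* Models the RXNEMP condition: asserted while the FIFO holds at least one frame. */
static inline bool rxnemp_asserted(const struct fake_can *dev)
{
    return dev->rx_fifo_level > 0;
}

/*
 * Reads frames while the FIFO is not empty and the quota is not used up.
 * Returns the number of frames consumed.
 */
static inline int rx_poll(struct fake_can *dev, int quota)
{
    assert(quota >= 0);
    int work_done = 0;

    while (work_done < quota && rxnemp_asserted(dev)) {
        dev->rx_fifo_level--;   /* stands in for reading one frame */
        work_done++;
    }
    return work_done;
}
```

Because the loop condition is the FIFO state itself, a frame that arrives between the status read and the interrupt clear is still picked up, which is the situation described as case (b).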
    • can: xilinx_can: fix device dropping off bus on RX overrun · 2574fe54
      Anssi Hannula committed
      The xilinx_can driver performs a software reset when an RX overrun is
      detected. This causes the device to enter Configuration mode where no
      messages are received or transmitted.
      
      The documentation does not mention any need to perform a reset on an RX
      overrun, and testing by inducing an RX overflow also indicated that the
      device continues to work just fine without a reset.
      
      Remove the software reset.
      
      Tested with the integrated CAN on Zynq-7000 SoC.
      
      Fixes: b1201e44 ("can: xilinx CAN controller support")
      Signed-off-by: Anssi Hannula <anssi.hannula@bitwise.fi>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
    • can: m_can: Move accessing of message ram to after clocks are enabled · 54e4a0c4
      Faiz Abbas committed
      MCAN message ram should only be accessed once clocks are enabled.
      Therefore, move the call to parse/init the message ram to after
      clocks are enabled.
      Signed-off-by: Faiz Abbas <faiz_abbas@ti.com>
      Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
    • can: m_can: Fix runtime resume call · 1675bee3
      Faiz Abbas committed
      pm_runtime_get_sync() returns 1 if the state of the device is already
      'active'. This is not a failure case and should be treated as success.
      
      Therefore fix the error handling for the pm_runtime_get_sync() call so
      that it returns success when the value is 1.
      
      Also clean up the TODO for using runtime PM for sleep mode, as that is
      implemented.
      Signed-off-by: Faiz Abbas <faiz_abbas@ti.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
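The return-value handling described in the commit above can be sketched as two predicates over the value returned by pm_runtime_get_sync(). The function names `resume_failed_buggy()` and `resume_failed_fixed()` are hypothetical; the sketch only models how the return value is interpreted and does not call any kernel API.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Interpretation of the return value, as stated in the commit message:
 *   < 0  error
 *     0  device was resumed
 *     1  device was already active (not a failure)
 */

/* Buggy check: treats every non-zero value, including 1, as a failure. */
static inline bool resume_failed_buggy(int ret)
{
    return ret != 0;
}

/* Fixed check: only negative values are failures. */
static inline bool resume_failed_fixed(int ret)
{
    return ret < 0;
}
```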
    • can: mpc5xxx_can: check of_iomap return before use · b5c1a23b
      Nicholas Mc Guire committed
      of_iomap() can return NULL, so its return value needs to be checked and
      NULL treated as failure. While at it, also take care of the missing
      of_node_put() in the error path.
      Signed-off-by: Nicholas Mc Guire <hofrat@osadl.org>
      Fixes: commit afa17a50 ("net/can: add driver for mscan family & mpc52xx_mscan")
      Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
    • can: m_can.c: fix setup of CCCR register: clear CCCR NISO bit before checking can.ctrlmode · 393753b2
      Roman Fietze committed
      Inside m_can_chip_config(), when setting up the new value of the CCCR,
      the CCCR_NISO bit is not cleared like the others, CCCR_TEST, CCCR_MON,
      CCCR_BRSE and CCCR_FDOE, before checking the can.ctrlmode bits for
      CAN_CTRLMODE_FD_NON_ISO.
      
      This way once the controller was configured for CAN_CTRLMODE_FD_NON_ISO,
      this mode could never be cleared again.
      
      This fix is only relevant for controllers with version 3.1.x or 3.2.x.
      Older versions do not support NISO.
      Signed-off-by: Roman Fietze <roman.fietze@telemotive.de>
      Cc: linux-stable <stable@vger.kernel.org>
      Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
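A small sketch of the "clear all mode bits, then set the ones requested" pattern that the commit above restores for the NISO bit. The bit positions, the `CTRLMODE_FD_NON_ISO` stand-in for CAN_CTRLMODE_FD_NON_ISO, and the function `cccr_next()` are hypothetical and only serve the illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bit positions, chosen for illustration only. */
#define CCCR_TEST (1u << 0)
#define CCCR_MON  (1u << 1)
#define CCCR_BRSE (1u << 2)
#define CCCR_FDOE (1u << 3)
#define CCCR_NISO (1u << 4)

/* Hypothetical stand-in for the CAN_CTRLMODE_FD_NON_ISO control-mode flag. */
#define CTRLMODE_FD_NON_ISO (1u << 0)

/* All mode bits that are recomputed on every configuration. */
#define CCCR_MODE_BITS (CCCR_TEST | CCCR_MON | CCCR_BRSE | CCCR_FDOE | CCCR_NISO)

static_assert((CCCR_MODE_BITS & CCCR_NISO) != 0, "NISO must be among the cleared mode bits");

/* Computes the new register value from the old one and the requested mode. */
static inline uint32_t cccr_next(uint32_t cccr, uint32_t ctrlmode)
{
    cccr &= ~CCCR_MODE_BITS;            /* start from a clean slate, NISO included */
    if (ctrlmode & CTRLMODE_FD_NON_ISO)
        cccr |= CCCR_NISO;              /* set it again only when requested */
    return cccr;
}
```

If NISO were left out of the cleared mask, a controller once configured for non-ISO mode would keep the bit set on every later configuration, which matches the symptom described in the commit message.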
    • can: peak_canfd: fix firmware < v3.3.0: limit allocation to 32-bit DMA addr only · 5d4c94ed
      Stephane Grosjean committed
      The DMA logic in firmware versions < v3.3.0 embedded in the PCAN-PCIe FD
      card family is not capable of handling a mix of 32-bit and 64-bit logical
      addresses. If the board is equipped with 2 or 4 CAN ports, then such a
      situation might lead to a PCIe Bus Error "Malformed TLP" packet
      as well as an "irq xx: nobody cared" issue.
      
      This patch adds a workaround that requests only 32-bit DMA addresses
      when these might be allocated outside of the 4 GB area.
      
      This issue has been fixed in firmware v3.3.0 and later.
      Signed-off-by: Stephane Grosjean <s.grosjean@peak-system.com>
      Cc: linux-stable <stable@vger.kernel.org>
      Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
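One way to picture the workaround in the commit above is a version comparison that selects the DMA address width. This is a simplified sketch under assumptions: the packing in `fw_version()` and the function `dma_addr_bits()` are hypothetical, and the real driver may apply additional conditions before restricting the mask.

```c
#include <assert.h>
#include <stdint.h>

/* Packs major.minor.sub into one comparable integer (hypothetical encoding). */
static inline uint32_t fw_version(unsigned maj, unsigned min, unsigned sub)
{
    assert(min < 256 && sub < 256);
    return ((uint32_t)maj << 16) | ((uint32_t)min << 8) | (uint32_t)sub;
}

/*
 * Returns the DMA address width to request: 32 bits for firmware older
 * than v3.3.0, 64 bits otherwise.
 */
static inline unsigned dma_addr_bits(uint32_t fw)
{
    return fw < fw_version(3, 3, 0) ? 32 : 64;
}
```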
    • net: prevent ISA drivers from building on PPC32 · c9ce1fa1
      Randy Dunlap committed
      Prevent drivers from building on PPC32 if they use isa_bus_to_virt(),
      isa_virt_to_bus(), or isa_page_to_bus(), which are not available and
      thus cause build errors.
      
      ../drivers/net/ethernet/3com/3c515.c: In function 'corkscrew_open':
      ../drivers/net/ethernet/3com/3c515.c:824:9: error: implicit declaration of function 'isa_virt_to_bus'; did you mean 'virt_to_bus'? [-Werror=implicit-function-declaration]
      
      ../drivers/net/ethernet/amd/lance.c: In function 'lance_rx':
      ../drivers/net/ethernet/amd/lance.c:1203:23: error: implicit declaration of function 'isa_bus_to_virt'; did you mean 'bus_to_virt'? [-Werror=implicit-function-declaration]
      
      ../drivers/net/ethernet/amd/ni65.c: In function 'ni65_init_lance':
      ../drivers/net/ethernet/amd/ni65.c:585:20: error: implicit declaration of function 'isa_virt_to_bus'; did you mean 'virt_to_bus'? [-Werror=implicit-function-declaration]
      
      ../drivers/net/ethernet/cirrus/cs89x0.c: In function 'net_open':
      ../drivers/net/ethernet/cirrus/cs89x0.c:897:20: error: implicit declaration of function 'isa_virt_to_bus'; did you mean 'virt_to_bus'? [-Werror=implicit-function-declaration]
      Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
      Suggested-by: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • nfp: flower: ensure dead neighbour entries are not offloaded · b809ec86
      John Hurley committed
      Previously only the neighbour state was checked to decide if an offloaded
      entry should be removed. However, there can be situations when the entry
      is dead but still marked as valid. This can lead to dead entries not
      being removed from fw tables or even incorrect data being added.
      
      Check the entry dead bit before deciding if it should be added to or
      removed from fw neighbour tables.
      
      Fixes: 8e6a9046 ("nfp: flower vxlan neighbour offload")
      Signed-off-by: John Hurley <john.hurley@netronome.com>
      Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
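The decision described in the commit above can be reduced to a predicate over two properties of a neighbour entry. The `struct neigh_view` type and the function `keep_offloaded()` are hypothetical simplifications; the real code inspects the kernel's neighbour structure.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced view of a neighbour entry. */
struct neigh_view {
    bool state_valid;   /* the neighbour state is one of the valid states */
    bool dead;          /* the entry has been marked dead */
};

/*
 * An entry belongs in the offload table only if its state is valid
 * and it has not been marked dead.
 */
static inline bool keep_offloaded(const struct neigh_view *n)
{
    assert(n != NULL);
    return n->state_valid && !n->dead;
}
```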
    • Merge branch 'vxlan-fix-default-fdb-entry-user-space-notify-ordering-race' · 0fb8d5a0
      David S. Miller committed
      Roopa Prabhu says:
      
      ====================
      vxlan: fix default fdb entry user-space notify ordering/race
      
      Problem:
      In vxlan_newlink, a default fdb entry is added before register_netdev.
      The default fdb creation function notifies user-space of the
      fdb entry on the vxlan device which user-space does not know about yet.
      (RTM_NEWNEIGH goes before RTM_NEWLINK for the same ifindex).
      
      This series fixes the user-space netlink notification ordering issue
      with the following changes:
      - decouple fdb notify from fdb create.
      - Move fdb notify after register_netdev.
      - modify rtnl_configure_link to allow configuring a link early.
      - Call rtnl_configure_link in vxlan newlink handler to notify
      userspace about the newlink before fdb notify and
      hence avoiding the user-space race.
      ====================
      
      Fixes: afbd8bae ("vxlan: add implicit fdb entry for default destination")
      Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
    • vxlan: fix default fdb entry netlink notify ordering during netdev create · e99465b9
      Roopa Prabhu committed
      Problem:
      In vxlan_newlink, a default fdb entry is added before register_netdev.
      The default fdb creation function also notifies user-space of the
      fdb entry on the vxlan device which user-space does not know about yet.
      (RTM_NEWNEIGH goes before RTM_NEWLINK for the same ifindex).
      
      This patch fixes the user-space netlink notification ordering issue
      with the following changes:
      - decouple fdb notify from fdb create.
      - Move fdb notify after register_netdev.
      - Call rtnl_configure_link in vxlan newlink handler to notify
      userspace about the newlink before fdb notify and
      hence avoiding the user-space race.
      
      Fixes: afbd8bae ("vxlan: add implicit fdb entry for default destination")
      Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • vxlan: make netlink notify in vxlan_fdb_destroy optional · f6e05385
      Roopa Prabhu committed
      Add a new option do_notify to vxlan_fdb_destroy to make
      sending netlink notify optional. Used by a later patch.
      Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • vxlan: add new fdb alloc and create helpers · 7431016b
      Roopa Prabhu committed
      - Add new vxlan_fdb_alloc helper
      - rename existing vxlan_fdb_create into vxlan_fdb_update:
              because it really creates or updates an existing
              fdb entry
      - move new fdb creation into a separate vxlan_fdb_create
      
      Main motivation for this change is to introduce the ability
      to decouple vxlan fdb creation and notify, used in a later patch.
      Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • rtnetlink: add rtnl_link_state check in rtnl_configure_link · 5025f7f7
      Roopa Prabhu committed
      rtnl_configure_link sets dev->rtnl_link_state to
      RTNL_LINK_INITIALIZED and unconditionally calls
      __dev_notify_flags to notify user-space of dev flags.
      
      Current call sequence for rtnl_configure_link:
      rtnetlink_newlink
          rtnl_link_ops->newlink
          rtnl_configure_link (unconditionally notifies userspace of
                               default and new dev flags)
      
      If a newlink handler wants to call rtnl_configure_link
      early, we will end up with duplicate notifications to
      user-space.
      
      This patch fixes rtnl_configure_link to check rtnl_link_state
      and call __dev_notify_flags with gchanges = 0 if already
      RTNL_LINK_INITIALIZED.
      
      Later in the series, this patch lets a driver implementing newlink call
      rtnl_configure_link to initialize the link early, making the following
      call sequence work:
      rtnetlink_newlink
          rtnl_link_ops->newlink (vxlan) -> rtnl_configure_link (initializes
                                                      link and notifies
                                                      user-space of default
                                                      dev flags)
          rtnl_configure_link (updates dev flags if requested by user ifm
                               and notifies user-space of new dev flags)
      Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
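The state check described in the rtnetlink commit above can be modelled in a few lines. The types `enum link_state` and `struct fake_netdev` and the function `configure_link_gchanges()` are hypothetical; using an all-ones mask for the first call is an assumption of this sketch, while the zero mask for an already initialized link follows the commit message.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of the link state tracked on the device. */
enum link_state { LINK_INITIALIZING, LINK_INITIALIZED };

struct fake_netdev {
    enum link_state state;
};

/*
 * Returns the "global changes" mask that would be passed to the flag
 * notifier: all ones on the first configuration (assumed here), and 0
 * when the link was already initialized by an earlier call.
 */
static inline unsigned int configure_link_gchanges(struct fake_netdev *dev)
{
    assert(dev != NULL);
    if (dev->state == LINK_INITIALIZED)
        return 0u;

    dev->state = LINK_INITIALIZED;
    return ~0u;
}
```

A newlink handler that calls the function early gets the full notification, and the later call from the generic path then reports no global changes, which is what avoids the duplicate notification.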
    • atl1c: reserve min skb headroom · 6e568307
      Florian Westphal committed
      Got crash report with following backtrace:
      BUG: unable to handle kernel paging request at ffff8801869daffe
      RIP: 0010:[<ffffffff816429c4>]  [<ffffffff816429c4>] ip6_finish_output2+0x394/0x4c0
      RSP: 0018:ffff880186c83a98  EFLAGS: 00010283
      RAX: ffff8801869db00e ...
        [<ffffffff81644cdc>] ip6_finish_output+0x8c/0xf0
        [<ffffffff81644d97>] ip6_output+0x57/0x100
        [<ffffffff81643dc9>] ip6_forward+0x4b9/0x840
        [<ffffffff81645566>] ip6_rcv_finish+0x66/0xc0
        [<ffffffff81645db9>] ipv6_rcv+0x319/0x530
        [<ffffffff815892ac>] netif_receive_skb+0x1c/0x70
        [<ffffffffc0060bec>] atl1c_clean+0x1ec/0x310 [atl1c]
        ...
      
      The bad access is in neigh_hh_output(), at skb->data - 16 (HH_DATA_MOD).
      atl1c driver provided skb with no headroom, so 14 bytes (ethernet
      header) got pulled, but then 16 are copied.
      
      Reserve NET_SKB_PAD bytes headroom, like netdev_alloc_skb().
      
      Compile tested only; I lack hardware.
      
      Fixes: 7b701764 ("atl1c: Fix misuse of netdev_alloc_skb in refilling rx ring")
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
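The arithmetic behind the bad access in the atl1c commit above can be worked through with buffer offsets. The constant names `ETH_HEADER_BYTES` and `HH_COPY_BYTES` and the functions `hh_copy_start()` and `hh_copy_in_bounds()` are hypothetical; the values 14 and 16 come from the commit message, and 32 in the usage note is only an assumed example headroom.

```c
#include <assert.h>
#include <stdbool.h>

#define ETH_HEADER_BYTES 14   /* bytes pulled for the ethernet header */
#define HH_COPY_BYTES    16   /* bytes copied in front of the data pointer */

/*
 * Offset, relative to the start of the buffer, at which the cached
 * header copy begins: the data pointer sits at headroom + 14 after the
 * pull, and the copy starts 16 bytes before it.
 */
static inline int hh_copy_start(int headroom)
{
    assert(headroom >= 0);
    return headroom + ETH_HEADER_BYTES - HH_COPY_BYTES;
}

/* The copy stays inside the buffer only if it does not start before offset 0. */
static inline bool hh_copy_in_bounds(int headroom)
{
    return hh_copy_start(headroom) >= 0;
}
```

With no headroom the copy starts at offset -2, i.e. two bytes before the buffer, while any reserved headroom of at least 2 bytes (for example an assumed 32) keeps it in bounds.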
  3. 22 Jul, 2018 13 commits
    • multicast: do not restore deleted record source filter mode to new one · 08d3ffcc
      Hangbin Liu committed
      There are two scenarios in which we restore deleted records. The first is
      when a device goes down and up (or unmap/remap). In this scenario the new
      filter mode is the same as the previous one, because we get it from
      in_dev->mc_list and we do not touch it during device down and up.
      
      The other scenario is when a new socket joins a group which was just deleted
      and has not finished sending status reports. In this scenario, we should use
      the current filter mode instead of restoring the old one. There are 4 cases
      in total.
      
      old_socket        new_socket       before_fix       after_fix
        IN(A)             IN(A)           ALLOW(A)         ALLOW(A)
        IN(A)             EX( )           TO_IN( )         TO_EX( )
        EX( )             IN(A)           TO_EX( )         ALLOW(A)
        EX( )             EX( )           TO_EX( )         TO_EX( )
      
      Fixes: 24803f38 (igmp: do not remove igmp souce list info when set link down)
      Fixes: 1666d49e (mld: do not remove mld souce list info when set link down)
      Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: mv88e6xxx: fix races between lock and irq freeing · 3d82475a
      Uwe Kleine-König committed
      free_irq() waits until all handlers for this IRQ have completed. As the
      relevant handler (mv88e6xxx_g1_irq_thread_fn()) takes the chip's reg_lock
      it might never return if the thread calling free_irq() holds this lock.
      
      For the same reason kthread_cancel_delayed_work_sync() in the polling case
      must not hold this lock.
      
      Also free the irq first (or stop the worker, respectively) so that
      mv88e6xxx_g1_irq_thread_work() isn't called any more before the irq
      mappings are dropped in mv88e6xxx_g1_irq_free_common(). This prevents the
      worker thread from calling handle_nested_irq(0), which results in a
      NULL-pointer exception.
      Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: skb_segment() should not return NULL · ff907a11
      Eric Dumazet committed
      syzbot caught a NULL deref [1], caused by skb_segment()
      
      skb_segment() has many "goto err;" that assume the @err variable
      contains -ENOMEM.
      
      A successful call to __skb_linearize() should not clear @err,
      otherwise a subsequent memory allocation error could return NULL.
      
      While we are at it, we might use -EINVAL instead of -ENOMEM when
      MAX_SKB_FRAGS limit is reached.
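The pitfall with the shared @err variable described above can be sketched in userspace. The helper `linearize_stub()` and the functions `err_path_code_buggy()` and `err_path_code_fixed()` are hypothetical; they only model which error code reaches the common error path, not the real skb_segment() logic.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for a helper that returns 0 on success. */
static inline int linearize_stub(void)
{
    return 0;
}

/*
 * Buggy pattern: the helper's successful result overwrites err, so a
 * later failure that relies on err reports 0 instead of -ENOMEM.
 */
static inline int err_path_code_buggy(void)
{
    int err = -ENOMEM;

    err = linearize_stub();   /* success stores 0 in err */
    if (err)
        return err;
    /* ... a later allocation fails and takes the common error path ... */
    return err;               /* 0: indistinguishable from "no error" */
}

/* Fixed pattern: the helper's result is tested without touching err. */
static inline int err_path_code_fixed(void)
{
    int err = -ENOMEM;

    if (linearize_stub())
        return err;
    /* ... a later allocation fails and takes the common error path ... */
    return err;               /* still -ENOMEM */
}
```

An error code of 0 encoded into the returned pointer is what makes the function appear to return NULL, which the caller then dereferences.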
      
      [1]
      kasan: CONFIG_KASAN_INLINE enabled
      kasan: GPF could be caused by NULL-ptr deref or user memory access
      general protection fault: 0000 [#1] SMP KASAN
      CPU: 0 PID: 13285 Comm: syz-executor3 Not tainted 4.18.0-rc4+ #146
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      RIP: 0010:tcp_gso_segment+0x3dc/0x1780 net/ipv4/tcp_offload.c:106
      Code: f0 ff ff 0f 87 1c fd ff ff e8 00 88 0b fb 48 8b 75 d0 48 b9 00 00 00 00 00 fc ff df 48 8d be 90 00 00 00 48 89 f8 48 c1 e8 03 <0f> b6 14 08 48 8d 86 94 00 00 00 48 89 c6 83 e0 07 48 c1 ee 03 0f
      RSP: 0018:ffff88019b7fd060 EFLAGS: 00010206
      RAX: 0000000000000012 RBX: 0000000000000020 RCX: dffffc0000000000
      RDX: 0000000000040000 RSI: 0000000000000000 RDI: 0000000000000090
      RBP: ffff88019b7fd0f0 R08: ffff88019510e0c0 R09: ffffed003b5c46d6
      R10: ffffed003b5c46d6 R11: ffff8801dae236b3 R12: 0000000000000001
      R13: ffff8801d6c581f4 R14: 0000000000000000 R15: ffff8801d6c58128
      FS:  00007fcae64d6700(0000) GS:ffff8801dae00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00000000004e8664 CR3: 00000001b669b000 CR4: 00000000001406f0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       tcp4_gso_segment+0x1c3/0x440 net/ipv4/tcp_offload.c:54
       inet_gso_segment+0x64e/0x12d0 net/ipv4/af_inet.c:1342
       inet_gso_segment+0x64e/0x12d0 net/ipv4/af_inet.c:1342
       skb_mac_gso_segment+0x3b5/0x740 net/core/dev.c:2792
       __skb_gso_segment+0x3c3/0x880 net/core/dev.c:2865
       skb_gso_segment include/linux/netdevice.h:4099 [inline]
       validate_xmit_skb+0x640/0xf30 net/core/dev.c:3104
       __dev_queue_xmit+0xc14/0x3910 net/core/dev.c:3561
       dev_queue_xmit+0x17/0x20 net/core/dev.c:3602
       neigh_hh_output include/net/neighbour.h:473 [inline]
       neigh_output include/net/neighbour.h:481 [inline]
       ip_finish_output2+0x1063/0x1860 net/ipv4/ip_output.c:229
       ip_finish_output+0x841/0xfa0 net/ipv4/ip_output.c:317
       NF_HOOK_COND include/linux/netfilter.h:276 [inline]
       ip_output+0x223/0x880 net/ipv4/ip_output.c:405
       dst_output include/net/dst.h:444 [inline]
       ip_local_out+0xc5/0x1b0 net/ipv4/ip_output.c:124
       iptunnel_xmit+0x567/0x850 net/ipv4/ip_tunnel_core.c:91
       ip_tunnel_xmit+0x1598/0x3af1 net/ipv4/ip_tunnel.c:778
       ipip_tunnel_xmit+0x264/0x2c0 net/ipv4/ipip.c:308
       __netdev_start_xmit include/linux/netdevice.h:4148 [inline]
       netdev_start_xmit include/linux/netdevice.h:4157 [inline]
       xmit_one net/core/dev.c:3034 [inline]
       dev_hard_start_xmit+0x26c/0xc30 net/core/dev.c:3050
       __dev_queue_xmit+0x29ef/0x3910 net/core/dev.c:3569
       dev_queue_xmit+0x17/0x20 net/core/dev.c:3602
       neigh_direct_output+0x15/0x20 net/core/neighbour.c:1403
       neigh_output include/net/neighbour.h:483 [inline]
       ip_finish_output2+0xa67/0x1860 net/ipv4/ip_output.c:229
       ip_finish_output+0x841/0xfa0 net/ipv4/ip_output.c:317
       NF_HOOK_COND include/linux/netfilter.h:276 [inline]
       ip_output+0x223/0x880 net/ipv4/ip_output.c:405
       dst_output include/net/dst.h:444 [inline]
       ip_local_out+0xc5/0x1b0 net/ipv4/ip_output.c:124
       ip_queue_xmit+0x9df/0x1f80 net/ipv4/ip_output.c:504
       tcp_transmit_skb+0x1bf9/0x3f10 net/ipv4/tcp_output.c:1168
       tcp_write_xmit+0x1641/0x5c20 net/ipv4/tcp_output.c:2363
       __tcp_push_pending_frames+0xb2/0x290 net/ipv4/tcp_output.c:2536
       tcp_push+0x638/0x8c0 net/ipv4/tcp.c:735
       tcp_sendmsg_locked+0x2ec5/0x3f00 net/ipv4/tcp.c:1410
       tcp_sendmsg+0x2f/0x50 net/ipv4/tcp.c:1447
       inet_sendmsg+0x1a1/0x690 net/ipv4/af_inet.c:798
       sock_sendmsg_nosec net/socket.c:641 [inline]
       sock_sendmsg+0xd5/0x120 net/socket.c:651
       __sys_sendto+0x3d7/0x670 net/socket.c:1797
       __do_sys_sendto net/socket.c:1809 [inline]
       __se_sys_sendto net/socket.c:1805 [inline]
       __x64_sys_sendto+0xe1/0x1a0 net/socket.c:1805
       do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      RIP: 0033:0x455ab9
      Code: 1d ba fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 eb b9 fb ff c3 66 2e 0f 1f 84 00 00 00 00
      RSP: 002b:00007fcae64d5c68 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
      RAX: ffffffffffffffda RBX: 00007fcae64d66d4 RCX: 0000000000455ab9
      RDX: 0000000000000001 RSI: 0000000020000200 RDI: 0000000000000013
      RBP: 000000000072bea0 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000014
      R13: 00000000004c1145 R14: 00000000004d1818 R15: 0000000000000006
      Modules linked in:
      Dumping ftrace buffer:
         (ftrace buffer empty)
      
      Fixes: ddff00d4 ("net: Move skb_has_shared_frag check out of GRE code and into segmentation")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Alexander Duyck <alexander.h.duyck@intel.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Acked-by: Alexander Duyck <alexander.h.duyck@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • D
      net/ipv6: Fix linklocal to global address with VRF · 24b711ed
      David Ahern committed
      Example setup:
          host: ip -6 addr add dev eth1 2001:db8:104::4
                 where eth1 is enslaved to a VRF
      
          switch: ip -6 ro add 2001:db8:104::4/128 dev br1
                  where br1 only has an LLA
      
                 ping6 2001:db8:104::4
                 ssh   2001:db8:104::4
      
      (NOTE: UDP works fine if the PKTINFO has the address set to the global
      address and ifindex set to the index of eth1, with an LLA as the
      destination.)
      
      For ICMP, icmp6_iif needs to be updated to check if skb->dev is an
      L3 master. If it is then return the ifindex from rt6i_idev similar
      to what is done for loopback.
      
      For TCP, restore the original tcp_v6_iif definition which is needed in
      most places and add a new tcp_v6_iif_l3_slave that considers the
      l3_slave variability. This latter check is only needed for socket
      lookups.
      
      Fixes: 9ff74384 ("net: vrf: Handle ipv6 multicast and link-local addresses")
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Y
      bpfilter: Fix mismatch in function argument types · f95de8aa
      YueHaibing committed
      Fix the following warnings:
      net/ipv4/bpfilter/sockopt.c:28:5: error: symbol 'bpfilter_ip_set_sockopt' redeclared with different type
      net/ipv4/bpfilter/sockopt.c:34:5: error: symbol 'bpfilter_ip_get_sockopt' redeclared with different type
      Signed-off-by: YueHaibing <yuehaibing@huawei.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • H
      net: phy: consider PHY_IGNORE_INTERRUPT in phy_start_aneg_priv · 215d08a8
      Heiner Kallweit committed
      The situation described in the comment can also occur with
      PHY_IGNORE_INTERRUPT, so change the condition to include it.
      
      Fixes: f555f34f ("net: phy: fix auto-negotiation stall due to unavailable interrupt")
      Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • D
      Merge branch 'qed-Fix-series-II' · 1a03f867
      David S. Miller committed
      Sudarsana Reddy Kalluru says:
      
      ====================
      qed: Fix series II.
      
      The patch series fixes a few issues in the qed driver.
      
      Please consider applying it to the 'net' branch.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • S
      qed: Correct Multicast API to reflect existence of 256 approximate buckets. · 25c020a9
      Sudarsana Reddy Kalluru committed
      FW hsi contains 256 approximation buckets which are split in ramrod into
      eight u32 values, but driver is using eight 'unsigned long' variables.
      
      This patch fixes the mcast logic by making the API utilize u32.
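
The bucket layout can be illustrated with a small Python model (not driver code). The names `set_bin` and `bins_from_indices` are hypothetical. The point is only that 256 bins map onto eight 32-bit words, so indexing by a wider word size (for example 64 bits, as `unsigned long` is on many 64-bit platforms) puts bits in different words.

```python
NUM_BINS = 256
WORD_BITS = 32                      # u32 words, as the firmware interface expects
NUM_WORDS = NUM_BINS // WORD_BITS   # eight words in total

def set_bin(words, bin_index):
    """Set the bit for one approximation bucket in a list of u32 words."""
    if not 0 <= bin_index < NUM_BINS:
        raise ValueError("bin index out of range")
    words[bin_index // WORD_BITS] |= 1 << (bin_index % WORD_BITS)

def bins_from_indices(indices):
    """Build the eight-word bitmap for a collection of bucket indices."""
    words = [0] * NUM_WORDS
    for index in indices:
        set_bin(words, index)
    return words
```

For example, bucket 33 belongs to word 1 with 32-bit words, whereas 64-bit indexing would place it in word 0.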
      
      Fixes: 83aeb933 ("qed*: Trivial modifications")
      Signed-off-by: Sudarsana Reddy Kalluru <Sudarsana.Kalluru@cavium.com>
      Signed-off-by: Ariel Elior <ariel.elior@cavium.com>
      Signed-off-by: Michal Kalderon <Michal.Kalderon@cavium.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • S
      qed: Fix possible race for the link state value. · 58874c7b
      Sudarsana Reddy Kalluru committed
      There's a possible race where driver can read link status in mid-transition
      and see that virtual-link is up yet speed is 0. Since in this
      mid-transition we're guaranteed to see a mailbox from MFW soon, we can
      afford to treat this as link down.
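
As a minimal sketch of that rule in Python (a model only; `effective_link_up` and its parameters are hypothetical names, not the driver's):

```python
def effective_link_up(reported_up: bool, speed_mbps: int) -> bool:
    """Treat "up with speed 0" as a mid-transition reading, i.e. link down."""
    return reported_up and speed_mbps != 0
```

A later mailbox notification is expected to deliver the settled state, so reporting down in the meantime loses nothing.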
      
      Fixes: cc875c2e ("qed: Add link support")
      Signed-off-by: Sudarsana Reddy Kalluru <Sudarsana.Kalluru@cavium.com>
      Signed-off-by: Ariel Elior <ariel.elior@cavium.com>
      Signed-off-by: Michal Kalderon <Michal.Kalderon@cavium.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • S
      qed: Fix link flap issue due to mismatching EEE capabilities. · 4ad95a93
      Sudarsana Reddy Kalluru committed
      Apparently, MFW publishes EEE capabilities even for Fiber-boards that don't
      support them, and later since qed internally sets adv_caps it would cause
      link-flap avoidance (LFA) to fail when driver would initiate the link.
      This in turn delays the link, causing traffic to fail.
      
      The driver has been modified not to ask MFW for any EEE config if EEE
      isn't to be enabled.
      
      Fixes: 645874e5 ("qed: Add support for Energy efficient ethernet.")
      Signed-off-by: Sudarsana Reddy Kalluru <Sudarsana.Kalluru@cavium.com>
      Signed-off-by: Ariel Elior <ariel.elior@cavium.com>
      Signed-off-by: Michal Kalderon <Michal.Kalderon@cavium.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Y
      net: caif: Add a missing rcu_read_unlock() in caif_flow_cb · 64119e05
      YueHaibing committed
      Add a missing rcu_read_unlock() in the error path.
      
      Fixes: c95567c8 ("caif: added check for potential null return")
      Signed-off-by: YueHaibing <yuehaibing@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • J
      bonding: set default miimon value for non-arp modes if not set · c1f897ce
      Jarod Wilson committed
      For some time now, if you load the bonding driver and configure bond
      parameters via sysfs using minimal config options, such as specifying
      nothing but the mode, relying on defaults for everything else, modes
      that cannot use arp monitoring (802.3ad, balance-tlb, balance-alb) all
      wind up with both arp_interval=0 (as it should be) and miimon=0, which
      means the miimon monitor thread never actually runs. This is particularly
      problematic for 802.3ad.
      
      For example, from an LNST recipe I've set up:
      
      $ modprobe bonding max_bonds=0
      $ echo "+t_bond0" > /sys/class/net/bonding_masters
      $ ip link set t_bond0 down
      $ echo "802.3ad" > /sys/class/net/t_bond0/bonding/mode
      $ ip link set ens1f1 down
      $ echo "+ens1f1" > /sys/class/net/t_bond0/bonding/slaves
      $ ip link set ens1f0 down
      $ echo "+ens1f0" > /sys/class/net/t_bond0/bonding/slaves
      $ ethtool -i t_bond0
      $ ip link set ens1f1 up
      $ ip link set ens1f0 up
      $ ip link set t_bond0 up
      $ ip addr add 192.168.9.1/24 dev t_bond0
      $ ip addr add 2002::1/64 dev t_bond0
      
      This bond comes up okay, but things look slightly suspect in
      /proc/net/bonding/t_bond0 output:
      
      $ grep -i mii /proc/net/bonding/t_bond0
      MII Status: up
      MII Polling Interval (ms): 0
      MII Status: up
      MII Status: up
      
      Now, pull a cable on one of the ports in the bond, then reconnect it, and
      you'll see:
      
      Slave Interface: ens1f0
      MII Status: down
      Speed: 1000 Mbps
      Duplex: full
      
      I believe this became a major issue as of commit 4d2c0cda, which for
      802.3ad bonds, sets slave->link = BOND_LINK_DOWN, with a comment about
      relying on link monitoring via miimon to set it correctly, but since the
      miimon work queue never runs, the link just stays marked down.
      
      If we simply tweak bond_option_mode_set() slightly, we can check for the
      non-arp modes having no miimon value set, and insert BOND_DEFAULT_MIIMON,
      which gets things back in full working order. This problem exists as far
      back as 4.14, and might be worth fixing in all stable trees since, though
      the work-around is to simply specify an miimon value yourself.
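
The mode-set tweak can be modeled roughly as follows in Python. This is a sketch, not the bonding driver: `set_mode` and the dictionary layout are hypothetical, and the value 100 for the default miimon interval is an assumption standing in for BOND_DEFAULT_MIIMON.

```python
NON_ARP_MODES = {"802.3ad", "balance-tlb", "balance-alb"}
DEFAULT_MIIMON_MS = 100  # assumed stand-in for BOND_DEFAULT_MIIMON

def set_mode(params, mode):
    """Return updated bond parameters after selecting `mode`."""
    updated = dict(params, mode=mode)
    if mode in NON_ARP_MODES:
        updated["arp_interval"] = 0          # ARP monitoring is unusable here
        if updated.get("miimon", 0) == 0:
            updated["miimon"] = DEFAULT_MIIMON_MS  # keep link monitoring alive
    return updated
```

An explicitly configured miimon value is left alone, which mirrors the manual work-around mentioned above.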
      
      Reported-by: Bob Ball <ball@umich.edu>
      Signed-off-by: Jarod Wilson <jarod@redhat.com>
      Acked-by: Mahesh Bandewar <maheshb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • D
      Merge tag 'mlx5-fixes-2018-07-18' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux · a6fc8594
      David S. Miller committed
      Saeed Mahameed says:
      
      ====================
      Mellanox, mlx5 fixes 2018-07-18
      
      The following series provides fixes to mlx5 core and net device driver.
      
      Please pull and let me know if there's any problem.
      
      For -stable v4.7
          net/mlx5e: Don't allow aRFS for encapsulated packets
          net/mlx5e: Fix quota counting in aRFS expire flow
      
      For -stable v4.15
          net/mlx5e: Only allow offloading decap egress (egdev) flows
          net/mlx5e: Refine ets validation function
          net/mlx5: Adjust clock overflow work period
      
      For -stable v4.17
          net/mlx5: E-Switch, UBSAN fix undefined behavior in mlx5_eswitch_mode
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 21 Jul, 2018 5 commits
    • D
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf · f1d66bf9
      David S. Miller committed
      Daniel Borkmann says:
      
      ====================
      pull-request: bpf 2018-07-20
      
      The following pull-request contains BPF updates for your *net* tree.
      
      The main changes are:
      
      1) Fix in BPF Makefile to detect llvm-objcopy in a more robust way which is
         needed for pahole's BTF converter and minor UAPI tweaks in BTF_INT_BITS()
         to shrink the mask before eventual UAPI freeze, from Martin.
      
      2) Fix a segfault in bpftool when prog pin id has no further arguments such
         as id value or file specified, from Taeung.
      
      3) Fix powerpc JIT handling of XADD which has jumps to exit path that would
         potentially bypass verifier expectations e.g. with subprog calls. Also add
         a test case to make sure XADD is not mangling src/dst register, from Daniel.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • D
      tls: check RCV_SHUTDOWN in tls_wait_data · fcf4793e
      Doron Roberts-Kedes committed
      The current code does not check sk->sk_shutdown & RCV_SHUTDOWN.
      tls_sw_recvmsg may return a positive value in the case where bytes have
      already been copied when the socket is shut down. By then sk->sk_err has
      been cleared, causing tls_wait_data to hang forever on a subsequent
      invocation. Checking sk->sk_shutdown & RCV_SHUTDOWN, as in tcp_recvmsg,
      fixes this problem.
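
The hang can be reproduced in a toy Python model of the wait loop. Everything here is illustrative: `wait_data`, the socket dictionary, and the `check_shutdown` switch are hypothetical, and the model raises instead of sleeping forever.

```python
RCV_SHUTDOWN = 1  # receive-side shutdown flag bit (value assumed for this model)

def wait_data(sk, check_shutdown=True, max_polls=3):
    """Poll a toy socket; return ("data", x), ("error", e) or ("eof", None)."""
    for _ in range(max_polls):
        if sk["rx_queue"]:
            return ("data", sk["rx_queue"].pop(0))
        if sk["sk_err"]:
            return ("error", sk["sk_err"])
        if check_shutdown and sk["sk_shutdown"] & RCV_SHUTDOWN:
            return ("eof", None)        # peer is done: stop waiting
    # Real code would go back to sleep here; the model gives up instead.
    raise TimeoutError("would wait forever")
```

With an empty queue, a cleared error, and RCV_SHUTDOWN set, only the variant that checks the shutdown flag returns.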
      
      Fixes: c46234eb ("tls: RX path for ktls")
      Acked-by: Dave Watson <davejwatson@fb.com>
      Signed-off-by: Doron Roberts-Kedes <doronrk@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • D
      Merge branch 'tcp-fix-DCTCP-ECE-Ack-series' · f7a6eb1e
      David S. Miller committed
      Yuchung Cheng says:
      
      ====================
      fix DCTCP ECE Ack series
      
      This patch set addresses the fact that the existing DCTCP implementation
      does not fully implement the ACK policy specified in the RFC. It improves
      the responsiveness to CE status changes, particularly on flows with
      small inflight.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Y
      tcp: do not delay ACK in DCTCP upon CE status change · a0496ef2
      Yuchung Cheng committed
      Per DCTCP RFC8257 (Section 3.2) the ACK reflecting the CE status change
      has to be sent immediately so the sender can respond quickly:
      
      """ When receiving packets, the CE codepoint MUST be processed as follows:
      
         1.  If the CE codepoint is set and DCTCP.CE is false, set DCTCP.CE to
             true and send an immediate ACK.
      
         2.  If the CE codepoint is not set and DCTCP.CE is true, set DCTCP.CE
             to false and send an immediate ACK.
      """
      
      Previously, the DCTCP implementation could continue to delay the ACK.
      This patch fixes that and follows the RFC by forcing an immediate ACK.
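
The two quoted rules amount to a one-bit state machine. A minimal Python sketch follows. It models the receiver logic only: `DctcpReceiver` and `on_packet` are hypothetical names, not kernel symbols.

```python
class DctcpReceiver:
    """Tracks DCTCP.CE and reports when an immediate ACK is required."""

    def __init__(self):
        self.ce = False  # DCTCP.CE starts out false

    def on_packet(self, ce_marked: bool) -> bool:
        """Process one data packet; return True if the ACK must not be delayed."""
        if ce_marked != self.ce:
            self.ce = ce_marked   # rules 1 and 2: flip state on any change
            return True           # and acknowledge immediately
        return False              # no change: normal delayed-ACK policy applies
```

Feeding it the CE marks unmarked, marked, marked, unmarked yields an immediate ACK on the second and fourth packets only.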
      
      Tested with this packetdrill script provided by Larry Brakmo:
      
      0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
      0.000 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
      0.000 setsockopt(3, SOL_TCP, TCP_CONGESTION, "dctcp", 5) = 0
      0.000 bind(3, ..., ...) = 0
      0.000 listen(3, 1) = 0
      
      0.100 < [ect0] SEW 0:0(0) win 32792 <mss 1000,sackOK,nop,nop,nop,wscale 7>
      0.100 > SE. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 8>
      0.110 < [ect0] . 1:1(0) ack 1 win 257
      0.200 accept(3, ..., ...) = 4
         +0 setsockopt(4, SOL_SOCKET, SO_DEBUG, [1], 4) = 0
      
      0.200 < [ect0] . 1:1001(1000) ack 1 win 257
      0.200 > [ect01] . 1:1(0) ack 1001
      
      0.200 write(4, ..., 1) = 1
      0.200 > [ect01] P. 1:2(1) ack 1001
      
      0.200 < [ect0] . 1001:2001(1000) ack 2 win 257
      +0.005 < [ce] . 2001:3001(1000) ack 2 win 257
      
      +0.000 > [ect01] . 2:2(0) ack 2001
      // Previously the ACK below would be delayed by 40ms
      +0.000 > [ect01] E. 2:2(0) ack 3001
      
      +0.500 < F. 9501:9501(0) ack 4 win 257
      
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Y
      tcp: do not cancel delay-ACK on DCTCP special ACK · 27cde44a
      Yuchung Cheng committed
      Currently, when a DCTCP receiver delays an ACK and receives a
      data packet with a different CE mark from the previous one's, it
      sends two immediate ACKs acknowledging the previous and latest
      sequences respectively (for ECN accounting).
      
      Previously, sending the first ACK could cancel the delayed ACK timer
      (tcp_event_ack_sent). This could subsequently prevent sending the
      second ACK to acknowledge the latest sequence (tcp_ack_snd_check).
      The culprit is that tcp_send_ack() assumes it always acknowledges
      the latest sequence, which is not true for the first special ACK.
      
      The fix is to not make the assumption in tcp_send_ack and check the
      actual ack sequence before cancelling the delayed ACK. Further it's
      safer to pass the ack sequence number as a local variable into
      tcp_send_ack routine, instead of intercepting tp->rcv_nxt to avoid
      future bugs like this.
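
A simplified Python model of that check is shown below. It is a sketch under assumptions: the `Receiver` class and its fields are hypothetical, and only the rule "cancel the delayed ACK only when the ACK covers the latest sequence" is taken from the description above.

```python
class Receiver:
    """Toy receiver that passes the ACK sequence explicitly to send_ack()."""

    def __init__(self, rcv_nxt):
        self.rcv_nxt = rcv_nxt            # next sequence expected
        self.delayed_ack_pending = False
        self.sent = []                    # ACK numbers put on the wire

    def send_ack(self, ack_seq):
        self.sent.append(ack_seq)
        if ack_seq == self.rcv_nxt:       # only an up-to-date ACK cancels the timer
            self.delayed_ack_pending = False
```

Usage: with a delayed ACK pending and `rcv_nxt` already advanced to 3001, `send_ack(2001)` leaves the delayed ACK pending, and the following `send_ack(3001)` clears it.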
      
      Reported-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>