1. 17 7月, 2018 24 次提交
    • U
      net/smc: take sock lock in smc_ioctl() · 1992d998
      Ursula Braun 提交于
      SMC ioctl processing requires the sock lock to work properly in
      all thinkable scenarios.
      Problem has been found with RaceFuzzer and fixes:
         KASAN: null-ptr-deref Read in smc_ioctl
      Reported-by: NByoungyoung Lee <lifeasageek@gmail.com>
      Reported-by: syzbot+35b2c5aa76fd398b9fd4@syzkaller.appspotmail.com
      Signed-off-by: NUrsula Braun <ubraun@linux.ibm.com>
      Reviewed-by: NStefano Brivio <sbrivio@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1992d998
    • D
      Merge branch 'tg3-fixes' · bd598d20
      David S. Miller 提交于
      Siva Reddy Kallam says:
      
      ====================
      tg3: Update copyright and fix for tx timeout with 5762
      
      First patch:
              Update copyright
      
      Second patch:
              Add higher cpu clock for 5762
      ====================
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      bd598d20
    • S
      tg3: Add higher cpu clock for 5762. · 3a498606
      Sanjeev Bansal 提交于
      This patch has fix for TX timeout while running bi-directional
      traffic with 100 Mbps using 5762.
      Signed-off-by: NSanjeev Bansal <sanjeevb.bansal@broadcom.com>
      Signed-off-by: NSiva Reddy Kallam <siva.kallam@broadcom.com>
      Reviewed-by: NMichael Chan <michael.chan@broadcom.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3a498606
    • S
      0f2605fb
    • J
      ibmvnic: Fix error recovery on login failure · 3578a7ec
      John Allen 提交于
      Testing has uncovered a failure case that is not handled properly. In the
      event that a login fails and we are not able to recover on the spot, we
      return 0 from do_reset, preventing any error recovery code from being
      triggered.  Additionally, the state is set to "probed" meaning that when we
      are able to trigger the error recovery, the driver always comes up in the
      probed state. To handle the case properly, we need to return a failure code
      here and set the adapter state to the state that we entered the reset in
      indicating the state that we would like to come out of the recovery reset
      in.
      Signed-off-by: NJohn Allen <jallen@linux.vnet.ibm.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3578a7ec
    • S
      net: lan78xx: Fix race in tx pending skb size calculation · dea39aca
      Stefan Wahren 提交于
      The skb size calculation in lan78xx_tx_bh is in race with the start_xmit,
      which could lead to rare kernel oopses. So protect the whole skb walk with
      a spin lock. As a benefit we can unlink the skb directly.
      
      This patch was tested on Raspberry Pi 3B+
      
      Link: https://github.com/raspberrypi/linux/issues/2608
      Fixes: 55d7de9d ("Microchip's LAN7800 family USB 2/3 to 10/100/1000 Ethernet")
      Cc: stable <stable@vger.kernel.org>
      Signed-off-by: NFloris Bos <bos@je-eigen-domein.nl>
      Signed-off-by: NStefan Wahren <stefan.wahren@i2se.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      dea39aca
    • D
      net/ipv6: Do not allow device only routes via the multipath API · b5d2d75e
      David Ahern 提交于
      Eric reported that reverting the patch that fixed and simplified IPv6
      multipath routes means reverting back to invalid userspace notifications.
      eg.,
      $ ip -6 route add 2001:db8:1::/64 nexthop dev eth0 nexthop dev eth1
      
      only generates a single notification:
      2001:db8:1::/64 dev eth0 metric 1024 pref medium
      
      While working on a fix for this problem I found another case that is just
      broken completely - a multipath route with a gateway followed by device
      followed by gateway:
          $ ip -6 ro add 2001:db8:103::/64
                nexthop via 2001:db8:1::64
                nexthop dev dummy2
                nexthop via 2001:db8:3::64
      
      In this case the device only route is dropped completely - no notification
      to userpsace but no addition to the FIB either:
      
      $ ip -6 ro ls
      2001:db8:1::/64 dev dummy1 proto kernel metric 256 pref medium
      2001:db8:2::/64 dev dummy2 proto kernel metric 256 pref medium
      2001:db8:3::/64 dev dummy3 proto kernel metric 256 pref medium
      2001:db8:103::/64 metric 1024
      	nexthop via 2001:db8:1::64 dev dummy1 weight 1
      	nexthop via 2001:db8:3::64 dev dummy3 weight 1 pref medium
      fe80::/64 dev dummy1 proto kernel metric 256 pref medium
      fe80::/64 dev dummy2 proto kernel metric 256 pref medium
      fe80::/64 dev dummy3 proto kernel metric 256 pref medium
      
      Really, IPv6 multipath is just FUBAR'ed beyond repair when it comes to
      device only routes, so do not allow it all.
      
      This change will break any scripts relying on the mpath api for insert,
      but I don't see any other way to handle the permutations. Besides, since
      the routes are added to the FIB as standalone (non-multipath) routes the
      kernel is not doing what the user requested, so it might as well tell the
      user that.
      Reported-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NDavid Ahern <dsahern@gmail.com>
      Reviewed-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b5d2d75e
    • S
      tcp: Fix broken repair socket window probe patch · 31048d7a
      Stefan Baranoff 提交于
      Correct previous bad attempt at allowing sockets to come out of TCP
      repair without sending window probes. To avoid changing size of
      the repair variable in struct tcp_sock, this lets the decision for
      sending probes or not to be made when coming out of repair by
      introducing two ways to turn it off.
      
      v2:
      * Remove erroneous comment; defines now make behavior clear
      
      Fixes: 70b7ff13 ("tcp: allow user to create repair socket without window probes")
      Signed-off-by: NStefan Baranoff <sbaranoff@gmail.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NAndrei Vagin <avagin@virtuozzo.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      31048d7a
    • S
      net/mlx4_en: Don't reuse RX page when XDP is set · 432e629e
      Saeed Mahameed 提交于
      When a new rx packet arrives, the rx path will decide whether to reuse
      the remainder of the page or not according to one of the below conditions:
      1. frag_info->frag_stride == PAGE_SIZE / 2
      2. frags->page_offset + frag_info->frag_size > PAGE_SIZE;
      
      The first condition is no met for when XDP is set.
      For XDP, page_offset is always set to priv->rx_headroom which is
      XDP_PACKET_HEADROOM and frag_info->frag_size is around mtu size + some
      padding, still the 2nd release condition will hold since
      XDP_PACKET_HEADROOM + 1536 < PAGE_SIZE, as a result the page will not
      be released and will be _wrongly_ reused for next free rx descriptor.
      
      In XDP there is an assumption to have a page per packet and reuse can
      break such assumption and might cause packet data corruptions.
      
      Fix this by adding an extra condition (!priv->rx_headroom) to the 2nd
      case to avoid page reuse when XDP is set, since rx_headroom is set to 0
      for non XDP setup and set to XDP_PACKET_HEADROOM for XDP setup.
      
      No additional cache line is required for the new condition.
      
      Fixes: 34db548b ("mlx4: add page recycling in receive path")
      Signed-off-by: NSaeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: NTariq Toukan <tariqt@mellanox.com>
      Suggested-by: NMartin KaFai Lau <kafai@fb.com>
      CC: Eric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      432e629e
    • R
      net/ethernet/freescale/fman: fix cross-build error · c1334597
      Randy Dunlap 提交于
        CC [M]  drivers/net/ethernet/freescale/fman/fman.o
      In file included from ../drivers/net/ethernet/freescale/fman/fman.c:35:
      ../include/linux/fsl/guts.h: In function 'guts_set_dmacr':
      ../include/linux/fsl/guts.h:165:2: error: implicit declaration of function 'clrsetbits_be32' [-Werror=implicit-function-declaration]
        clrsetbits_be32(&guts->dmacr, 3 << shift, device << shift);
        ^~~~~~~~~~~~~~~
      Signed-off-by: NRandy Dunlap <rdunlap@infradead.org>
      Cc: Madalin Bucur <madalin.bucur@nxp.com>
      Cc: netdev@vger.kernel.org
      Cc: linuxppc-dev@lists.ozlabs.org
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c1334597
    • S
      hv/netvsc: fix handling of fallback to single queue mode · 916c5e14
      Stephen Hemminger 提交于
      The netvsc device may need to fallback to running in single queue
      mode if host side only wants to support single queue.
      
      Recent change for handling mtu broke this in setup logic.
      Reported-by: NDan Carpenter <dan.carpenter@oracle.com>
      Fixes: 3ffe64f1 ("hv_netvsc: split sub-channel setup into async and sync")
      Signed-off-by: NStephen Hemminger <sthemmin@microsoft.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      916c5e14
    • T
      ibmvnic: Revise RX/TX queue error messages · 2d14d379
      Thomas Falcon 提交于
      During a device failover, there may be latency between the loss
      of the current backing device and a notification from firmware that
      a failover has occurred. This latency can result in a large amount of
      error printouts as firmware returns outgoing traffic with a generic
      error code. These are not necessarily errors in this case as the
      firmware is busy swapping in a new backing adapter and is not ready
      to send packets yet. This patch reclassifies those error codes as
      warnings with an explanation that a failover may be pending. All
      other return codes will be considered errors.
      Signed-off-by: NThomas Falcon <tlfalcon@linux.vnet.ibm.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2d14d379
    • S
      ipv6: make DAD fail with enhanced DAD when nonce length differs · e6651599
      Sabrina Dubroca 提交于
      Commit adc176c5 ("ipv6 addrconf: Implemented enhanced DAD (RFC7527)")
      added enhanced DAD with a nonce length of 6 bytes. However, RFC7527
      doesn't specify the length of the nonce, other than being 6 + 8*k bytes,
      with integer k >= 0 (RFC3971 5.3.2). The current implementation simply
      assumes that the nonce will always be 6 bytes, but others systems are
      free to choose different sizes.
      
      If another system sends a nonce of different length but with the same 6
      bytes prefix, it shouldn't be considered as the same nonce. Thus, check
      that the length of the received nonce is the same as the length we sent.
      
      Ugly scapy test script running on veth0:
      
      def loop():
          pkt=sniff(iface="veth0", filter="icmp6", count=1)
          pkt = pkt[0]
          b = bytearray(pkt[Raw].load)
          b[1] += 1
          b += b'\xde\xad\xbe\xef\xde\xad\xbe\xef'
          pkt[Raw].load = bytes(b)
          pkt[IPv6].plen += 8
          # fixup checksum after modifying the payload
          pkt[IPv6].payload.cksum -= 0x3b44
          if pkt[IPv6].payload.cksum < 0:
              pkt[IPv6].payload.cksum += 0xffff
          sendp(pkt, iface="veth0")
      
      This should result in DAD failure for any address added to veth0's peer,
      but is currently ignored.
      
      Fixes: adc176c5 ("ipv6 addrconf: Implemented enhanced DAD (RFC7527)")
      Signed-off-by: NSabrina Dubroca <sd@queasysnail.net>
      Reviewed-by: NStefano Brivio <sbrivio@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e6651599
    • C
      net: ethernet: stmmac: fix documentation warning · 014dd768
      Corentin Labbe 提交于
      This patch remove the following documentation warning
      drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c:103: warning: Excess function parameter 'priv' description in 'stmmac_axi_setup'
      It was introduced in commit afea0365 ("stmmac: rework DMA bus setting and introduce new platform AXI structure")
      Signed-off-by: NCorentin Labbe <clabbe@baylibre.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      014dd768
    • C
      net: stmmac: dwmac-sun8i: fix typo descrive => describe · 56c266dc
      Corentin Labbe 提交于
      This patch fix a typo in the word Describe
      Signed-off-by: NCorentin Labbe <clabbe@baylibre.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      56c266dc
    • P
      net: ip6_gre: get ipv6hdr after skb_cow_head() · b7ed8794
      Prashant Bhole 提交于
      A KASAN:use-after-free bug was found related to ip6-erspan
      while running selftests/net/ip6_gre_headroom.sh
      
      It happens because of following sequence:
      - ipv6hdr pointer is obtained from skb
      - skb_cow_head() is called, skb->head memory is reallocated
      - old data is accessed using ipv6hdr pointer
      
      skb_cow_head() call was added in e41c7c68 ("ip6erspan: make sure
      enough headroom at xmit."), but looking at the history there was a
      chance of similar bug because gre_handle_offloads() and pskb_trim()
      can also reallocate skb->head memory. Fixes tag points to commit
      which introduced possibility of this bug.
      
      This patch moves ipv6hdr pointer assignment after skb_cow_head() call.
      
      Fixes: 5a963eb6 ("ip6_gre: Add ERSPAN native tunnel support")
      Signed-off-by: NPrashant Bhole <bhole_prashant_q7@lab.ntt.co.jp>
      Reviewed-by: NGreg Rose <gvrose8192@gmail.com>
      Acked-by: NWilliam Tu <u9012063@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b7ed8794
    • T
      tun: Fix use-after-free on XDP_TX · 6e8cfd6d
      Toshiaki Makita 提交于
      On XDP_TX we need to free up the frame only when tun_xdp_tx() returns a
      negative value. A positive value indicates that the packet is
      successfully enqueued to the ptr_ring, so freeing the page causes
      use-after-free.
      
      Fixes: 735fc405 ("xdp: change ndo_xdp_xmit API to support bulking")
      Signed-off-by: NToshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Acked-by: NJason Wang <jasowang@redhat.com>
      Acked-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6e8cfd6d
    • M
      bonding: Fix a typo in bonding.txt · 9f80a072
      Masanari Iida 提交于
      This patch fixes a spelling typo in bonding.txt
      Signed-off-by: NMasanari Iida <standby24x7@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9f80a072
    • D
      tls: Stricter error checking in zerocopy sendmsg path · 32da1221
      Dave Watson 提交于
      In the zerocopy sendmsg() path, there are error checks to revert
      the zerocopy if we get any error code.  syzkaller has discovered
      that tls_push_record can return -ECONNRESET, which is fatal, and
      happens after the point at which it is safe to revert the iter,
      as we've already passed the memory to do_tcp_sendpages.
      
      Previously this code could return -ENOMEM and we would want to
      revert the iter, but AFAIK this no longer returns ENOMEM after
      a447da7d ("tls: fix waitall behavior in tls_sw_recvmsg"),
      so we fail for all error codes.
      
      Reported-by: syzbot+c226690f7b3126c5ee04@syzkaller.appspotmail.com
      Reported-by: syzbot+709f2810a6a05f11d4d3@syzkaller.appspotmail.com
      Signed-off-by: NDave Watson <davejwatson@fb.com>
      Fixes: 3c4d7559 ("tls: kernel TLS support")
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      32da1221
    • C
    • E
      KEYS: DNS: fix parsing multiple options · c604cb76
      Eric Biggers 提交于
      My recent fix for dns_resolver_preparse() printing very long strings was
      incomplete, as shown by syzbot which still managed to hit the
      WARN_ONCE() in set_precision() by adding a crafted "dns_resolver" key:
      
          precision 50001 too large
          WARNING: CPU: 7 PID: 864 at lib/vsprintf.c:2164 vsnprintf+0x48a/0x5a0
      
      The bug this time isn't just a printing bug, but also a logical error
      when multiple options ("#"-separated strings) are given in the key
      payload.  Specifically, when separating an option string into name and
      value, if there is no value then the name is incorrectly considered to
      end at the end of the key payload, rather than the end of the current
      option.  This bypasses validation of the option length, and also means
      that specifying multiple options is broken -- which presumably has gone
      unnoticed as there is currently only one valid option anyway.
      
      A similar problem also applied to option values, as the kstrtoul() when
      parsing the "dnserror" option will read past the end of the current
      option and into the next option.
      
      Fix these bugs by correctly computing the length of the option name and
      by copying the option value, null-terminated, into a temporary buffer.
      
      Reproducer for the WARN_ONCE() that syzbot hit:
      
          perl -e 'print "#A#", "\0" x 50000' | keyctl padd dns_resolver desc @s
      
      Reproducer for "dnserror" option being parsed incorrectly (expected
      behavior is to fail when seeing the unknown option "foo", actual
      behavior was to read the dnserror value as "1#foo" and fail there):
      
          perl -e 'print "#dnserror=1#foo\0"' | keyctl padd dns_resolver desc @s
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Fixes: 4a2d7892 ("DNS: If the DNS server returns an error, allow that to be cached [ver #2]")
      Signed-off-by: NEric Biggers <ebiggers@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c604cb76
    • D
      Merge branch 'multicast-init-as-INCLUDE-when-join-SSM-INCLUDE-group' · 8e05fd83
      David S. Miller 提交于
      Hangbin Liu says:
      
      ====================
      multicast: init as INCLUDE when join SSM INCLUDE group
      
      Based on RFC3376 5.1 and RFC3810 6.1, we should init as INCLUDE when join SSM
      INCLUDE group. In my first version I only clear the group change record. But
      this is not enough as when a new group join, it will init as EXCLUDE and
      trigger an filter mode change in ip/ip6_mc_add_src(), which will clear all
      source addresses' sf_crcount. This will prevent early joined address sending
      state change records if multi source addresses joined at the same time.
      
      In this v2 patchset, I fixed it by directly initializing the mode to INCLUDE
      for SSM JOIN_SOURCE_GROUP. I also split the original patch into two separated
      patches for IPv4 and IPv6.
      
      Test: test by myself and customer.
      ====================
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8e05fd83
    • H
      ipv6/mcast: init as INCLUDE when join SSM INCLUDE group · c7ea20c9
      Hangbin Liu 提交于
      This an IPv6 version patch of "ipv4/igmp: init group mode as INCLUDE when
      join source group". From RFC3810, part 6.1:
      
         If no per-interface state existed for that
         multicast address before the change (i.e., the change consisted of
         creating a new per-interface record), or if no state exists after the
         change (i.e., the change consisted of deleting a per-interface
         record), then the "non-existent" state is considered to have an
         INCLUDE filter mode and an empty source list.
      
      Which means a new multicast group should start with state IN(). Currently,
      for MLDv2 SSM JOIN_SOURCE_GROUP mode, we first call ipv6_sock_mc_join(),
      then ip6_mc_source(), which will trigger a TO_IN() message instead of
      ALLOW().
      
      The issue was exposed by commit a052517a ("net/multicast: should not
      send source list records when have filter mode change"). Before this change,
      we sent both ALLOW(A) and TO_IN(A). Now, we only send TO_IN(A).
      
      Fix it by adding a new parameter to init group mode. Also add some wrapper
      functions to avoid changing too much code.
      
      v1 -> v2:
      In the first version I only cleared the group change record. But this is not
      enough. Because when a new group join, it will init as EXCLUDE and trigger
      a filter mode change in ip/ip6_mc_add_src(), which will clear all source
      addresses sf_crcount. This will prevent early joined address sending state
      change records if multi source addressed joined at the same time.
      
      In v2 patch, I fixed it by directly initializing the mode to INCLUDE for SSM
      JOIN_SOURCE_GROUP. I also split the original patch into two separated patches
      for IPv4 and IPv6.
      
      There is also a difference between v4 and v6 version. For IPv6, when the
      interface goes down and up, we will send correct state change record with
      unspecified IPv6 address (::) with function ipv6_mc_up(). But after DAD is
      completed, we resend the change record TO_IN() in mld_send_initial_cr().
      Fix it by sending ALLOW() for INCLUDE mode in mld_send_initial_cr().
      
      Fixes: a052517a ("net/multicast: should not send source list records when have filter mode change")
      Reviewed-by: NStefano Brivio <sbrivio@redhat.com>
      Signed-off-by: NHangbin Liu <liuhangbin@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c7ea20c9
    • H
      ipv4/igmp: init group mode as INCLUDE when join source group · 6e2059b5
      Hangbin Liu 提交于
      Based on RFC3376 5.1
         If no interface
         state existed for that multicast address before the change (i.e., the
         change consisted of creating a new per-interface record), or if no
         state exists after the change (i.e., the change consisted of deleting
         a per-interface record), then the "non-existent" state is considered
         to have a filter mode of INCLUDE and an empty source list.
      
      Which means a new multicast group should start with state IN().
      
      Function ip_mc_join_group() works correctly for IGMP ASM(Any-Source Multicast)
      mode. It adds a group with state EX() and inits crcount to mc_qrv,
      so the kernel will send a TO_EX() report message after adding group.
      
      But for IGMPv3 SSM(Source-specific multicast) JOIN_SOURCE_GROUP mode, we
      split the group joining into two steps. First we join the group like ASM,
      i.e. via ip_mc_join_group(). So the state changes from IN() to EX().
      
      Then we add the source-specific address with INCLUDE mode. So the state
      changes from EX() to IN(A).
      
      Before the first step sends a group change record, we finished the second
      step. So we will only send the second change record. i.e. TO_IN(A).
      
      Regarding the RFC stands, we should actually send an ALLOW(A) message for
      SSM JOIN_SOURCE_GROUP as the state should mimic the 'IN() to IN(A)'
      transition.
      
      The issue was exposed by commit a052517a ("net/multicast: should not
      send source list records when have filter mode change"). Before this change,
      we used to send both ALLOW(A) and TO_IN(A). After this change we only send
      TO_IN(A).
      
      Fix it by adding a new parameter to init group mode. Also add new wrapper
      functions so we don't need to change too much code.
      
      v1 -> v2:
      In my first version I only cleared the group change record. But this is not
      enough. Because when a new group join, it will init as EXCLUDE and trigger
      an filter mode change in ip/ip6_mc_add_src(), which will clear all source
      addresses' sf_crcount. This will prevent early joined address sending state
      change records if multi source addressed joined at the same time.
      
      In v2 patch, I fixed it by directly initializing the mode to INCLUDE for SSM
      JOIN_SOURCE_GROUP. I also split the original patch into two separated patches
      for IPv4 and IPv6.
      
      Fixes: a052517a ("net/multicast: should not send source list records when have filter mode change")
      Reviewed-by: NStefano Brivio <sbrivio@redhat.com>
      Signed-off-by: NHangbin Liu <liuhangbin@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6e2059b5
  2. 14 7月, 2018 7 次提交
    • D
      Merge branch 'fix-DCTCP-delayed-ACK' · 6bed5e26
      David S. Miller 提交于
      Yuchung Cheng says:
      
      ====================
      fix DCTCP delayed ACK
      
      This patch series addresses the issue that sometimes DCTCP
      fail to acknowledge the latest sequence and result in sender timeout
      if inflight is small.
      ====================
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6bed5e26
    • Y
      tcp: remove DELAYED ACK events in DCTCP · a69258f7
      Yuchung Cheng 提交于
      After fixing the way DCTCP tracking delayed ACKs, the delayed-ACK
      related callbacks are no longer needed
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Acked-by: NLawrence Brakmo <brakmo@fb.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a69258f7
    • Y
      tcp: fix dctcp delayed ACK schedule · b0c05d0e
      Yuchung Cheng 提交于
      Previously, when a data segment was sent an ACK was piggybacked
      on the data segment without generating a CA_EVENT_NON_DELAYED_ACK
      event to notify congestion control modules. So the DCTCP
      ca->delayed_ack_reserved flag could incorrectly stay set when
      in fact there were no delayed ACKs being reserved. This could result
      in sending a special ECN notification ACK that carries an older
      ACK sequence, when in fact there was no need for such an ACK.
      DCTCP keeps track of the delayed ACK status with its own separate
      state ca->delayed_ack_reserved. Previously it may accidentally cancel
      the delayed ACK without updating this field upon sending a special
      ACK that carries a older ACK sequence. This inconsistency would
      lead to DCTCP receiver never acknowledging the latest data until the
      sender times out and retry in some cases.
      
      Packetdrill script (provided by Larry Brakmo)
      
      0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
      0.000 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
      0.000 setsockopt(3, SOL_TCP, TCP_CONGESTION, "dctcp", 5) = 0
      0.000 bind(3, ..., ...) = 0
      0.000 listen(3, 1) = 0
      
      0.100 < [ect0] SEW 0:0(0) win 32792 <mss 1000,sackOK,nop,nop,nop,wscale 7>
      0.100 > SE. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 8>
      0.110 < [ect0] . 1:1(0) ack 1 win 257
      0.200 accept(3, ..., ...) = 4
      
      0.200 < [ect0] . 1:1001(1000) ack 1 win 257
      0.200 > [ect01] . 1:1(0) ack 1001
      
      0.200 write(4, ..., 1) = 1
      0.200 > [ect01] P. 1:2(1) ack 1001
      
      0.200 < [ect0] . 1001:2001(1000) ack 2 win 257
      0.200 write(4, ..., 1) = 1
      0.200 > [ect01] P. 2:3(1) ack 2001
      
      0.200 < [ect0] . 2001:3001(1000) ack 3 win 257
      0.200 < [ect0] . 3001:4001(1000) ack 3 win 257
      0.200 > [ect01] . 3:3(0) ack 4001
      
      0.210 < [ce] P. 4001:4501(500) ack 3 win 257
      
      +0.001 read(4, ..., 4500) = 4500
      +0 write(4, ..., 1) = 1
      +0 > [ect01] PE. 3:4(1) ack 4501
      
      +0.010 < [ect0] W. 4501:5501(1000) ack 4 win 257
      // Previously the ACK sequence below would be 4501, causing a long RTO
      +0.040~+0.045 > [ect01] . 4:4(0) ack 5501   // delayed ack
      
      +0.311 < [ect0] . 5501:6501(1000) ack 4 win 257  // More data
      +0 > [ect01] . 4:4(0) ack 6501     // now acks everything
      
      +0.500 < F. 9501:9501(0) ack 4 win 257
      Reported-by: NLarry Brakmo <brakmo@fb.com>
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Acked-by: NLawrence Brakmo <brakmo@fb.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b0c05d0e
    • D
      qlogic: check kstrtoul() for errors · 5fc853cc
      Dan Carpenter 提交于
      We accidentally left out the error handling for kstrtoul().
      
      Fixes: a520030e ("qlcnic: Implement flash sysfs callback for 83xx adapter")
      Signed-off-by: NDan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5fc853cc
    • M
    • D
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf · c849eb0d
      David S. Miller 提交于
      Daniel Borkmann says:
      
      ====================
      pull-request: bpf 2018-07-13
      
      The following pull-request contains BPF updates for your *net* tree.
      
      The main changes are:
      
      1) Fix AF_XDP TX error reporting before final kernel release such that it
         becomes consistent between copy mode and zero-copy, from Magnus.
      
      2) Fix three different syzkaller reported issues: oob due to ld_abs
         rewrite with too large offset, another oob in l3 based skb test run
         and a bug leaving mangled prog in subprog JITing error path, from Daniel.
      
      3) Fix BTF handling for bitfield extraction on big endian, from Okash.
      
      4) Fix a missing linux/errno.h include in cgroup/BPF found by kbuild bot,
         from Roman.
      
      5) Fix xdp2skb_meta.sh sample by using just command names instead of
         absolute paths for tc and ip and allow them to be redefined, from Taeung.
      
      6) Fix availability probing for BPF seg6 helpers before final kernel ships
         so they can be detected at prog load time, from Mathieu.
      ====================
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c849eb0d
    • S
      skbuff: Unconditionally copy pfmemalloc in __skb_clone() · e78bfb07
      Stefano Brivio 提交于
      Commit 8b700862 ("net: Don't copy pfmemalloc flag in
      __copy_skb_header()") introduced a different handling for the
      pfmemalloc flag in copy and clone paths.
      
      In __skb_clone(), now, the flag is set only if it was set in the
      original skb, but not cleared if it wasn't. This is wrong and
      might lead to socket buffers being flagged with pfmemalloc even
      if the skb data wasn't allocated from pfmemalloc reserves. Copy
      the flag instead of ORing it.
      Reported-by: NSabrina Dubroca <sd@queasysnail.net>
      Fixes: 8b700862 ("net: Don't copy pfmemalloc flag in __copy_skb_header()")
      Signed-off-by: NStefano Brivio <sbrivio@redhat.com>
      Tested-by: NSabrina Dubroca <sd@queasysnail.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e78bfb07
  3. 13 7月, 2018 9 次提交
    • D
      Merge branch 'bpf-af-xdp-consistent-err-reporting' · 5e3e6e83
      Daniel Borkmann 提交于
      Magnus Karlsson says:
      
      ====================
      This patch set adjusts the AF_XDP TX error reporting so that it becomes
      consistent between copy mode and zero-copy. First some background:
      
      Copy-mode for TX uses the SKB path in which the action of sending the
      packet is performed from process context using the sendmsg
      syscall. Completions are usually done asynchronously from NAPI mode by
      using a TX interrupt. In this mode, send errors can be returned back
      through the syscall.
      
      In zero-copy mode both the sending of the packet and the completions
      are done asynchronously from NAPI mode for performance reasons. In
      this mode, the sendmsg syscall only makes sure that the TX NAPI loop
      will be run that performs both the actions of sending and
      completing. In this mode it is therefore not possible to return errors
      through the sendmsg syscall as the sending is done from the NAPI
      loop. Note that it is possible to implement a synchronous send with
      our API, but in our benchmarks that made the TX performance drop by
      nearly half due to synchronization requirements and cache line
      bouncing. But for some netdevs this might be preferable so let us
      leave it up to the implementation to decide.
      
      The problem is that the current code base returns some errors in
      copy-mode that are not possible to return in zero-copy mode. This
      patch set aligns them so that the two modes always return the same
      error code. We achieve this by removing some of the errors returned by
      sendmsg in copy-mode (and in one case adding an error message for
      zero-copy mode) and offering alternative error detection methods that
      are consistent between the two modes.
      
      The structure of the patch set is as follows:
      
      Patch 1: removes the ENXIO return code from copy-mode when someone has
      forcefully changed the number of queues on the device so that the
      queue bound to the socket is no longer available. Just silently stop
      sending anything as in zero-copy mode.
      
      Patch 2: stop returning EAGAIN in copy mode when the completion queue
      is full as zero-copy does not do this. Instead this situation can be
      detected by comparing the head and tail pointers of the completion
      queue in both modes. In any case, EAGAIN was not the correct error code
      here since no amount of calling sendmsg will solve the problem. Only
      consuming one or more messages on the completion queue will fix this.
      
      Patch 3: Always return ENOBUFS from sendmsg if there is no TX queue
      configured. This was not the case for zero-copy mode.
      
      Patch 4: stop returning EMSGSIZE when the size of the packet is larger
      than the MTU. Just send it to the device so that it will drop it as in
      zero-copy mode.
      
      Note that copy-mode can still return EAGAIN in certain circumstances,
      but as these conditions cannot occur in zero-copy mode it is fine for
      copy-mode to return them.
      ====================
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      5e3e6e83
    • M
      xsk: do not return EMSGSIZE in copy mode for packets larger than MTU · 09210c4b
      Magnus Karlsson 提交于
      This patch stops returning EMSGSIZE from sendmsg in copy mode when the
      size of the packet is larger than the MTU. Just send it to the device
      so that it will drop it as in zero-copy mode. This makes the error
      reporting consistent between copy mode and zero-copy mode.
      
      Fixes: 35fcde7f ("xsk: support for Tx")
      Signed-off-by: NMagnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      09210c4b
    • M
      xsk: always return ENOBUFS from sendmsg if there is no TX queue · 6efb4436
      Magnus Karlsson 提交于
      This patch makes sure ENOBUFS is always returned from sendmsg if there
      is no TX queue configured. This was not the case for zero-copy
      mode. With this patch this error reporting is consistent between copy
      mode and zero-copy mode.
      
      Fixes: ac98d8aa ("xsk: wire upp Tx zero-copy functions")
      Signed-off-by: NMagnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      6efb4436
    • M
      xsk: do not return EAGAIN from sendmsg when completion queue is full · 9684f5e7
      Magnus Karlsson 提交于
      This patch stops returning EAGAIN in TX copy mode when the completion
      queue is full as zero-copy does not do this. Instead this situation
      can be detected by comparing the head and tail pointers of the
      completion queue in both modes. In any case, EAGAIN was not the
      correct error code here since no amount of calling sendmsg will solve
      the problem. Only consuming one or more messages on the completion
      queue will fix this.
      
      With this patch, the error reporting becomes consistent between copy
      mode and zero-copy mode.
      
      Fixes: 35fcde7f ("xsk: support for Tx")
      Signed-off-by: NMagnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      9684f5e7
    • M
      xsk: do not return ENXIO from TX copy mode · 509d7648
      Magnus Karlsson 提交于
      This patch removes the ENXIO return code from TX copy-mode when
      someone has forcefully changed the number of queues on the device so
      that the queue bound to the socket is no longer available. Just
      silently stop sending anything as in zero-copy mode so the error
      reporting gets consistent between the two modes.
      
      Fixes: 35fcde7f ("xsk: support for Tx")
      Signed-off-by: NMagnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      509d7648
    • W
      selftests: in udpgso_bench do not test udp zerocopy · 8f19f12b
      Willem de Bruijn 提交于
      The udpgso benchmark compares various configurations of UDP and TCP.
      Including one that is not upstream, udp zerocopy. This is a leftover
      from the earlier RFC patchset.
      
      The test is part of kselftests and run in continuous spinners. Remove
      the failing case to make the test start passing.
      
      Fixes: 3a687bef ("selftests: udp gso benchmark")
      Reported-by: NNaresh Kamboju <naresh.kamboju@linaro.org>
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8f19f12b
    • W
      packet: reset network header if packet shorter than ll reserved space · 993675a3
      Willem de Bruijn 提交于
      If variable length link layer headers result in a packet shorter
      than dev->hard_header_len, reset the network header offset. Else
      skb->mac_len may exceed skb->len after skb_mac_reset_len.
      
      packet_sendmsg_spkt already has similar logic.
      
      Fixes: b84bbaf7 ("packet: in packet_snd start writing at link layer allocation")
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      993675a3
    • W
      nsh: set mac len based on inner packet · bab2c80e
      Willem de Bruijn 提交于
      When pulling the NSH header in nsh_gso_segment, set the mac length
      based on the encapsulated packet type.
      
      skb_reset_mac_len computes an offset to the network header, which
      here still points to the outer packet:
      
        >     skb_reset_network_header(skb);
        >     [...]
        >     __skb_pull(skb, nsh_len);
        >     skb_reset_mac_header(skb);    // now mac hdr starts nsh_len == 8B after net hdr
        >     skb_reset_mac_len(skb);       // mac len = net hdr - mac hdr == (u16) -8 == 65528
        >     [..]
        >     skb_mac_gso_segment(skb, ..)
      
      Link: http://lkml.kernel.org/r/CAF=yD-KeAcTSOn4AxirAxL8m7QAS8GBBe1w09eziYwvPbbUeYA@mail.gmail.com
      Reported-by: syzbot+7b9ed9872dab8c32305d@syzkaller.appspotmail.com
      Fixes: c411ed85 ("nsh: add GSO support")
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      Acked-by: NJiri Benc <jbenc@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      bab2c80e
    • S
      net: Don't copy pfmemalloc flag in __copy_skb_header() · 8b700862
      Stefano Brivio 提交于
      The pfmemalloc flag indicates that the skb was allocated from
      the PFMEMALLOC reserves, and the flag is currently copied on skb
      copy and clone.
      
      However, an skb copied from an skb flagged with pfmemalloc
      wasn't necessarily allocated from PFMEMALLOC reserves, and on
      the other hand an skb allocated that way might be copied from an
      skb that wasn't.
      
      So we should not copy the flag on skb copy, and rather decide
      whether to allow an skb to be associated with sockets unrelated
      to page reclaim depending only on how it was allocated.
      
      Move the pfmemalloc flag before headers_start[0] using an
      existing 1-bit hole, so that __copy_skb_header() doesn't copy
      it.
      
      When cloning, we'll now take care of this flag explicitly,
      contravening to the warning comment of __skb_clone().
      
      While at it, restore the newline usage introduced by commit
      b1937227 ("net: reorganize sk_buff for faster
      __copy_skb_header()") to visually separate bytes used in
      bitfields after headers_start[0], that was gone after commit
      a9e419dc ("netfilter: merge ctinfo into nfct pointer storage
      area"), and describe the pfmemalloc flag in the kernel-doc
      structure comment.
      
      This doesn't change the size of sk_buff or cacheline boundaries,
      but consolidates the 15 bits hole before tc_index into a 2 bytes
      hole before csum, that could now be filled more easily.
      Reported-by: NPatrick Talbert <ptalbert@redhat.com>
      Fixes: c93bdd0e ("netvm: allow skb allocation to use PFMEMALLOC reserves")
      Signed-off-by: NStefano Brivio <sbrivio@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8b700862