1. 13 7月, 2018 5 次提交
    • F
      net: systemport: Fix CRC forwarding check for SYSTEMPORT Lite · 9e3bff92
      Florian Fainelli 提交于
      SYSTEMPORT Lite reversed the logic compared to SYSTEMPORT, the
      GIB_FCS_STRIP bit is set when the Ethernet FCS is stripped, and that bit
      is not set by default. Fix the logic such that we properly check whether
      that bit is set or not and we don't forward an extra 4 bytes to the
      network stack.
      
      Fixes: 44a4524c ("net: systemport: Add support for SYSTEMPORT Lite")
      Signed-off-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9e3bff92
    • S
      tcp: allow user to create repair socket without window probes · 70b7ff13
      Stefan Baranoff 提交于
      Under rare conditions where repair code may be used it is possible that
      window probes are either unnecessary or undesired. If the user knows that
      window probes are not wanted or needed this change allows them to skip
      sending them when a socket comes out of repair.
      Signed-off-by: NStefan Baranoff <sbaranoff@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      70b7ff13
    • S
      tcp: fix sequence numbers for repaired sockets re-using TIME-WAIT sockets · 21684dc4
      Stefan Baranoff 提交于
      This patch fixes a bug where the sequence numbers of a socket created using
      TCP repair functionality are lower than set after connect is called.
      This occurs when the repair socket overlaps with a TIME-WAIT socket and
      triggers the re-use code. The amount lower is equal to the number of times
      that a particular IP/port set is re-used and then put back into TIME-WAIT.
      Re-using the first time the sequence number is 1 lower, closing that socket
      and then re-opening (with repair) a new socket with the same addresses/ports
      puts the sequence number 2 lower than set via setsockopt. The third time is
      3 lower, etc. I have not tested what the limit of this acrewal is, if any.
      
      The fix is, if a socket is in repair mode, to respect the already set
      sequence number and timestamp when it would have already re-used the
      TIME-WAIT socket.
      Signed-off-by: NStefan Baranoff <sbaranoff@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      21684dc4
    • J
      sch_fq_codel: zero q->flows_cnt when fq_codel_init fails · 83fe6b87
      Jacob Keller 提交于
      When fq_codel_init fails, qdisc_create_dflt will cleanup by using
      qdisc_destroy. This function calls the ->reset() op prior to calling the
      ->destroy() op.
      
      Unfortunately, during the failure flow for sch_fq_codel, the ->flows
      parameter is not initialized, so the fq_codel_reset function will null
      pointer dereference.
      
         kernel: BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
         kernel: IP: fq_codel_reset+0x58/0xd0 [sch_fq_codel]
         kernel: PGD 0 P4D 0
         kernel: Oops: 0000 [#1] SMP PTI
         kernel: Modules linked in: i40iw i40e(OE) xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack tun bridge stp llc devlink ebtable_filter ebtables ip6table_filter ip6_tables rpcrdma ib_isert iscsi_target_mod sunrpc ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel intel_cstate iTCO_wdt iTCO_vendor_support intel_uncore ib_core intel_rapl_perf mei_me mei joydev i2c_i801 lpc_ich ioatdma shpchp wmi sch_fq_codel xfs libcrc32c mgag200 ixgbe drm_kms_helper isci ttm firewire_ohci
         kernel:  mdio drm igb libsas crc32c_intel firewire_core ptp pps_core scsi_transport_sas crc_itu_t dca i2c_algo_bit ipmi_si ipmi_devintf ipmi_msghandler [last unloaded: i40e]
         kernel: CPU: 10 PID: 4219 Comm: ip Tainted: G           OE    4.16.13custom-fq-codel-test+ #3
         kernel: Hardware name: Intel Corporation S2600CO/S2600CO, BIOS SE5C600.86B.02.05.0004.051120151007 05/11/2015
         kernel: RIP: 0010:fq_codel_reset+0x58/0xd0 [sch_fq_codel]
         kernel: RSP: 0018:ffffbfbf4c1fb620 EFLAGS: 00010246
         kernel: RAX: 0000000000000400 RBX: 0000000000000000 RCX: 00000000000005b9
         kernel: RDX: 0000000000000000 RSI: ffff9d03264a60c0 RDI: ffff9cfd17b31c00
         kernel: RBP: 0000000000000001 R08: 00000000000260c0 R09: ffffffffb679c3e9
         kernel: R10: fffff1dab06a0e80 R11: ffff9cfd163af800 R12: ffff9cfd17b31c00
         kernel: R13: 0000000000000001 R14: ffff9cfd153de600 R15: 0000000000000001
         kernel: FS:  00007fdec2f92800(0000) GS:ffff9d0326480000(0000) knlGS:0000000000000000
         kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
         kernel: CR2: 0000000000000008 CR3: 0000000c1956a006 CR4: 00000000000606e0
         kernel: Call Trace:
         kernel:  qdisc_destroy+0x56/0x140
         kernel:  qdisc_create_dflt+0x8b/0xb0
         kernel:  mq_init+0xc1/0xf0
         kernel:  qdisc_create_dflt+0x5a/0xb0
         kernel:  dev_activate+0x205/0x230
         kernel:  __dev_open+0xf5/0x160
         kernel:  __dev_change_flags+0x1a3/0x210
         kernel:  dev_change_flags+0x21/0x60
         kernel:  do_setlink+0x660/0xdf0
         kernel:  ? down_trylock+0x25/0x30
         kernel:  ? xfs_buf_trylock+0x1a/0xd0 [xfs]
         kernel:  ? rtnl_newlink+0x816/0x990
         kernel:  ? _xfs_buf_find+0x327/0x580 [xfs]
         kernel:  ? _cond_resched+0x15/0x30
         kernel:  ? kmem_cache_alloc+0x20/0x1b0
         kernel:  ? rtnetlink_rcv_msg+0x200/0x2f0
         kernel:  ? rtnl_calcit.isra.30+0x100/0x100
         kernel:  ? netlink_rcv_skb+0x4c/0x120
         kernel:  ? netlink_unicast+0x19e/0x260
         kernel:  ? netlink_sendmsg+0x1ff/0x3c0
         kernel:  ? sock_sendmsg+0x36/0x40
         kernel:  ? ___sys_sendmsg+0x295/0x2f0
         kernel:  ? ebitmap_cmp+0x6d/0x90
         kernel:  ? dev_get_by_name_rcu+0x73/0x90
         kernel:  ? skb_dequeue+0x52/0x60
         kernel:  ? __inode_wait_for_writeback+0x7f/0xf0
         kernel:  ? bit_waitqueue+0x30/0x30
         kernel:  ? fsnotify_grab_connector+0x3c/0x60
         kernel:  ? __sys_sendmsg+0x51/0x90
         kernel:  ? do_syscall_64+0x74/0x180
         kernel:  ? entry_SYSCALL_64_after_hwframe+0x3d/0xa2
         kernel: Code: 00 00 48 89 87 00 02 00 00 8b 87 a0 01 00 00 85 c0 0f 84 84 00 00 00 31 ed 48 63 dd 83 c5 01 48 c1 e3 06 49 03 9c 24 90 01 00 00 <48> 8b 73 08 48 8b 3b e8 6c 9a 4f f6 48 8d 43 10 48 c7 03 00 00
         kernel: RIP: fq_codel_reset+0x58/0xd0 [sch_fq_codel] RSP: ffffbfbf4c1fb620
         kernel: CR2: 0000000000000008
         kernel: ---[ end trace e81a62bede66274e ]---
      
      This is caused because flows_cnt is non-zero, but flows hasn't been
      initialized. fq_codel_init has left the private data in a partially
      initialized state.
      
      To fix this, reset flows_cnt to 0 when we fail to initialize.
      Additionally, to make the state more consistent, also cleanup the flows
      pointer when the allocation of backlogs fails.
      
      This fixes the NULL pointer dereference, since both the for-loop and
      memset in fq_codel_reset will be no-ops when flow_cnt is zero.
      Signed-off-by: NJacob Keller <jacob.e.keller@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      83fe6b87
    • D
      Merge branch '10GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-queue · 35288486
      David S. Miller 提交于
      Jeff Kirsher says:
      
      ====================
      Intel Wired LAN Driver Updates 2018-07-12
      
      This series contains updates to ixgbe and e100/e1000 kernel documentation.
      
      Alex fixes ixgbe to ensure that we are more explicit about the ordering
      of updates to the receive address register (RAR) table.
      
      Dan Carpenter fixes an issue where we were reading one element beyond
      the end of the array.
      
      Mauro Carvalho Chehab fixes formatting issues in the e100.rst and
      e1000.rst that were causing errors during 'make htmldocs'.
      ====================
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      35288486
  2. 12 7月, 2018 8 次提交
  3. 10 7月, 2018 11 次提交
    • T
      rhashtable: add restart routine in rhashtable_free_and_destroy() · 0026129c
      Taehee Yoo 提交于
      rhashtable_free_and_destroy() cancels re-hash deferred work
      then walks and destroys elements. at this moment, some elements can be
      still in future_tbl. that elements are not destroyed.
      
      test case:
      nft_rhash_destroy() calls rhashtable_free_and_destroy() to destroy
      all elements of sets before destroying sets and chains.
      But rhashtable_free_and_destroy() doesn't destroy elements of future_tbl.
      so that splat occurred.
      
      test script:
         %cat test.nft
         table ip aa {
      	   map map1 {
      		   type ipv4_addr : verdict;
      		   elements = {
      			   0 : jump a0,
      			   1 : jump a0,
      			   2 : jump a0,
      			   3 : jump a0,
      			   4 : jump a0,
      			   5 : jump a0,
      			   6 : jump a0,
      			   7 : jump a0,
      			   8 : jump a0,
      			   9 : jump a0,
      		}
      	   }
      	   chain a0 {
      	   }
         }
         flush ruleset
         table ip aa {
      	   map map1 {
      		   type ipv4_addr : verdict;
      		   elements = {
      			   0 : jump a0,
      			   1 : jump a0,
      			   2 : jump a0,
      			   3 : jump a0,
      			   4 : jump a0,
      			   5 : jump a0,
      			   6 : jump a0,
      			   7 : jump a0,
      			   8 : jump a0,
      			   9 : jump a0,
      		   }
      	   }
      	   chain a0 {
      	   }
         }
         flush ruleset
      
         %while :; do nft -f test.nft; done
      
      Splat looks like:
      [  200.795603] kernel BUG at net/netfilter/nf_tables_api.c:1363!
      [  200.806944] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN PTI
      [  200.812253] CPU: 1 PID: 1582 Comm: nft Not tainted 4.17.0+ #24
      [  200.820297] Hardware name: To be filled by O.E.M. To be filled by O.E.M./Aptio CRB, BIOS 5.6.5 07/08/2015
      [  200.830309] RIP: 0010:nf_tables_chain_destroy.isra.34+0x62/0x240 [nf_tables]
      [  200.838317] Code: 43 50 85 c0 74 26 48 8b 45 00 48 8b 4d 08 ba 54 05 00 00 48 c7 c6 60 6d 29 c0 48 c7 c7 c0 65 29 c0 4c 8b 40 08 e8 58 e5 fd f8 <0f> 0b 48 89 da 48 b8 00 00 00 00 00 fc ff
      [  200.860366] RSP: 0000:ffff880118dbf4d0 EFLAGS: 00010282
      [  200.866354] RAX: 0000000000000061 RBX: ffff88010cdeaf08 RCX: 0000000000000000
      [  200.874355] RDX: 0000000000000061 RSI: 0000000000000008 RDI: ffffed00231b7e90
      [  200.882361] RBP: ffff880118dbf4e8 R08: ffffed002373bcfb R09: ffffed002373bcfa
      [  200.890354] R10: 0000000000000000 R11: ffffed002373bcfb R12: dead000000000200
      [  200.898356] R13: dead000000000100 R14: ffffffffbb62af38 R15: dffffc0000000000
      [  200.906354] FS:  00007fefc31fd700(0000) GS:ffff88011b800000(0000) knlGS:0000000000000000
      [  200.915533] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  200.922355] CR2: 0000557f1c8e9128 CR3: 0000000106880000 CR4: 00000000001006e0
      [  200.930353] Call Trace:
      [  200.932351]  ? nf_tables_commit+0x26f6/0x2c60 [nf_tables]
      [  200.939525]  ? nf_tables_setelem_notify.constprop.49+0x1a0/0x1a0 [nf_tables]
      [  200.947525]  ? nf_tables_delchain+0x6e0/0x6e0 [nf_tables]
      [  200.952383]  ? nft_add_set_elem+0x1700/0x1700 [nf_tables]
      [  200.959532]  ? nla_parse+0xab/0x230
      [  200.963529]  ? nfnetlink_rcv_batch+0xd06/0x10d0 [nfnetlink]
      [  200.968384]  ? nfnetlink_net_init+0x130/0x130 [nfnetlink]
      [  200.975525]  ? debug_show_all_locks+0x290/0x290
      [  200.980363]  ? debug_show_all_locks+0x290/0x290
      [  200.986356]  ? sched_clock_cpu+0x132/0x170
      [  200.990352]  ? find_held_lock+0x39/0x1b0
      [  200.994355]  ? sched_clock_local+0x10d/0x130
      [  200.999531]  ? memset+0x1f/0x40
      
      V2:
       - free all tables requested by Herbert Xu
      Signed-off-by: NTaehee Yoo <ap420073@gmail.com>
      Acked-by: NHerbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0026129c
    • D
      Merge branch 'bnxt_en-Bug-fixes' · 252dd176
      David S. Miller 提交于
      Michael Chan says:
      
      ====================
      bnxt_en: Bug fixes.
      
      These are bug fixes in error code paths, TC Flower VLAN TCI flow
      checking bug fix, proper filtering of Broadcast packets if IFF_BROADCAST
      is not set, and a bug fix in bnxt_get_max_rings() to return 0 ring
      parameters when the return value is -ENOMEM.
      ====================
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      252dd176
    • V
      bnxt_en: Fix for system hang if request_irq fails · c58387ab
      Vikas Gupta 提交于
      Fix bug in the error code path when bnxt_request_irq() returns failure.
      bnxt_disable_napi() should not be called in this error path because
      NAPI has not been enabled yet.
      
      Fixes: c0c050c5 ("bnxt_en: New Broadcom ethernet driver.")
      Signed-off-by: NVikas Gupta <vikas.gupta@broadcom.com>
      Signed-off-by: NMichael Chan <michael.chan@broadcom.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c58387ab
    • M
      bnxt_en: Do not modify max IRQ count after RDMA driver requests/frees IRQs. · 30f52947
      Michael Chan 提交于
      Calling bnxt_set_max_func_irqs() to modify the max IRQ count requested or
      freed by the RDMA driver is flawed.  The max IRQ count is checked when
      re-initializing the IRQ vectors and this can happen multiple times
      during ifup or ethtool -L.  If the max IRQ is reduced and the RDMA
      driver is operational, we may not initailize IRQs correctly.  This
      problem shows up on VFs with very small number of MSIX.
      
      There is no other logic that relies on the IRQ count excluding the ones
      used by RDMA.  So we fix it by just removing the call to subtract or
      add the IRQs used by RDMA.
      
      Fixes: a588e458 ("bnxt_en: Add interface to support RDMA driver.")
      Signed-off-by: NMichael Chan <michael.chan@broadcom.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      30f52947
    • M
      bnxt_en: Support clearing of the IFF_BROADCAST flag. · 30e33848
      Michael Chan 提交于
      Currently, the driver assumes IFF_BROADCAST is always set and always sets
      the broadcast filter.  Modify the code to set or clear the broadcast
      filter according to the IFF_BROADCAST flag.
      Signed-off-by: NMichael Chan <michael.chan@broadcom.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      30e33848
    • M
      bnxt_en: Always set output parameters in bnxt_get_max_rings(). · 78f058a4
      Michael Chan 提交于
      The current code returns -ENOMEM and does not bother to set the output
      parameters to 0 when no rings are available.  Some callers, such as
      bnxt_get_channels() will display garbage ring numbers when that happens.
      Fix it by always setting the output parameters.
      
      Fixes: 6e6c5a57 ("bnxt_en: Modify bnxt_get_max_rings() to support shared or non shared rings.")
      Signed-off-by: NMichael Chan <michael.chan@broadcom.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      78f058a4
    • M
      bnxt_en: Fix inconsistent BNXT_FLAG_AGG_RINGS logic. · 07f4fde5
      Michael Chan 提交于
      If there aren't enough RX rings available, the driver will attempt to
      use a single RX ring without the aggregation ring.  If that also
      fails, the BNXT_FLAG_AGG_RINGS flag is cleared but the other ring
      parameters are not set consistently to reflect that.  If more RX
      rings become available at the next open, the RX rings will be in
      an inconsistent state and may crash when freeing the RX rings.
      
      Fix it by restoring the BNXT_FLAG_AGG_RINGS if not enough RX rings are
      available to run without aggregation rings.
      
      Fixes: bdbd1eb5 ("bnxt_en: Handle no aggregation ring gracefully.")
      Signed-off-by: NMichael Chan <michael.chan@broadcom.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      07f4fde5
    • V
      bnxt_en: Fix the vlan_tci exact match check. · e32d4e60
      Venkat Duvvuru 提交于
      It is possible that OVS may set don’t care for DEI/CFI bit in
      vlan_tci mask. Hence, checking for vlan_tci exact match will endup
      in a vlan flow rejection.
      
      This patch fixes the problem by checking for vlan_pcp and vid
      separately, instead of checking for the entire vlan_tci.
      
      Fixes: e85a9be9 (bnxt_en: do not allow wildcard matches for L2 flows)
      Signed-off-by: NVenkat Duvvuru <venkatkumar.duvvuru@broadcom.com>
      Signed-off-by: NMichael Chan <michael.chan@broadcom.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e32d4e60
    • D
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf · 26420d9c
      David S. Miller 提交于
      Pablo Neira Ayuso says:
      
      ====================
      Netfilter fixes for net
      
      The following patchset contains Netfilter fixes for your net tree:
      
      1) Missing module autoloadfor icmp and icmpv6 x_tables matches,
         from Florian Westphal.
      
      2) Possible non-linear access to TCP header from tproxy, from
         Mate Eckl.
      
      3) Do not allow rbtree to be used for single elements, this patch
         moves all set backend into one single module since such thing
         can only happen if hashtable module is explicitly blacklisted,
         which should not ever be done.
      
      4) Reject error and standard targets from nft_compat for sanity
         reasons, they are never used from there.
      
      5) Don't crash on double hashsize module parameter, from Andrey
         Ryabinin.
      
      6) Drop dst on skb before placing it in the fragmentation
         reassembly queue, from Florian Westphal.
      ====================
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      26420d9c
    • F
      netfilter: ipv6: nf_defrag: drop skb dst before queueing · 84379c9a
      Florian Westphal 提交于
      Eric Dumazet reports:
       Here is a reproducer of an annoying bug detected by syzkaller on our production kernel
       [..]
       ./b78305423 enable_conntrack
       Then :
       sleep 60
       dmesg | tail -10
       [  171.599093] unregister_netdevice: waiting for lo to become free. Usage count = 2
       [  181.631024] unregister_netdevice: waiting for lo to become free. Usage count = 2
       [  191.687076] unregister_netdevice: waiting for lo to become free. Usage count = 2
       [  201.703037] unregister_netdevice: waiting for lo to become free. Usage count = 2
       [  211.711072] unregister_netdevice: waiting for lo to become free. Usage count = 2
       [  221.959070] unregister_netdevice: waiting for lo to become free. Usage count = 2
      
      Reproducer sends ipv6 fragment that hits nfct defrag via LOCAL_OUT hook.
      skb gets queued until frag timer expiry -- 1 minute.
      
      Normally nf_conntrack_reasm gets called during prerouting, so skb has
      no dst yet which might explain why this wasn't spotted earlier.
      Reported-by: NEric Dumazet <eric.dumazet@gmail.com>
      Reported-by: NJohn Sperbeck <jsperbeck@google.com>
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Tested-by: NEric Dumazet <edumazet@google.com>
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      84379c9a
    • A
      netfilter: nf_conntrack: Fix possible possible crash on module loading. · 2045cdfa
      Andrey Ryabinin 提交于
      Loading the nf_conntrack module with doubled hashsize parameter, i.e.
      	  modprobe nf_conntrack hashsize=12345 hashsize=12345
      causes NULL-ptr deref.
      
      If 'hashsize' specified twice, the nf_conntrack_set_hashsize() function
      will be called also twice.
      The first nf_conntrack_set_hashsize() call will set the
      'nf_conntrack_htable_size' variable:
      
      	nf_conntrack_set_hashsize()
      		...
      		/* On boot, we can set this without any fancy locking. */
      		if (!nf_conntrack_htable_size)
      			return param_set_uint(val, kp);
      
      But on the second invocation, the nf_conntrack_htable_size is already set,
      so the nf_conntrack_set_hashsize() will take a different path and call
      the nf_conntrack_hash_resize() function. Which will crash on the attempt
      to dereference 'nf_conntrack_hash' pointer:
      
      	BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
      	RIP: 0010:nf_conntrack_hash_resize+0x255/0x490 [nf_conntrack]
      	Call Trace:
      	 nf_conntrack_set_hashsize+0xcd/0x100 [nf_conntrack]
      	 parse_args+0x1f9/0x5a0
      	 load_module+0x1281/0x1a50
      	 __se_sys_finit_module+0xbe/0xf0
      	 do_syscall_64+0x7c/0x390
      	 entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Fix this, by checking !nf_conntrack_hash instead of
      !nf_conntrack_htable_size. nf_conntrack_hash will be initialized only
      after the module loaded, so the second invocation of the
      nf_conntrack_set_hashsize() won't crash, it will just reinitialize
      nf_conntrack_htable_size again.
      Signed-off-by: NAndrey Ryabinin <aryabinin@virtuozzo.com>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      2045cdfa
  4. 09 7月, 2018 6 次提交
  5. 08 7月, 2018 10 次提交
    • E
      tcp: cleanup copied_seq and urg_data in tcp_disconnect · 6508b678
      Eric Dumazet 提交于
      tcp_zerocopy_receive() relies on tcp_inq() to limit number of bytes
      requested by user.
      
      syzbot found that after tcp_disconnect(), tcp_inq() was returning
      a stale value (number of bytes in queue before the disconnect).
      
      Note that after this patch, ioctl(fd, SIOCINQ, &val) is also fixed
      and returns 0, so this might be a candidate for all known linux kernels.
      
      While we are at this, we probably also should clear urg_data to
      avoid other syzkaller reports after it discovers how to deal with
      urgent data.
      
      syzkaller repro :
      
      socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 3
      bind(3, {sa_family=AF_INET, sin_port=htons(20000), sin_addr=inet_addr("224.0.0.1")}, 16) = 0
      connect(3, {sa_family=AF_INET, sin_port=htons(20000), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
      send(3, ..., 4096, 0) = 4096
      connect(3, {sa_family=AF_UNSPEC, sa_data="\0\0\0\0\0\0\0\0\0\0\0\0\0\0"}, 128) = 0
      getsockopt(3, SOL_TCP, TCP_ZEROCOPY_RECEIVE, ..., [16]) = 0 // CRASH
      
      Fixes: 05255b82 ("tcp: add TCP_ZEROCOPY_RECEIVE support for zerocopy receive")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6508b678
    • D
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf · 7f93d129
      David S. Miller 提交于
      Alexei Starovoitov says:
      
      ====================
      pull-request: bpf 2018-07-07
      
      The following pull-request contains BPF updates for your *net* tree.
      
      Plenty of fixes for different components:
      
      1) A set of critical fixes for sockmap and sockhash, from John Fastabend.
      
      2) fixes for several race conditions in af_xdp, from Magnus Karlsson.
      
      3) hash map refcnt fix, from Mauricio Vasquez.
      
      4) samples/bpf fixes, from Taeung Song.
      
      5) ifup+mtu check for xdp_redirect, from Toshiaki Makita.
      ====================
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7f93d129
    • P
      ipfrag: really prevent allocation on netns exit · f6f2a4a2
      Paolo Abeni 提交于
      Setting the low threshold to 0 has no effect on frags allocation,
      we need to clear high_thresh instead.
      
      The code was pre-existent to commit 648700f7 ("inet: frags:
      use rhashtables for reassembly units"), but before the above,
      such assignment had a different role: prevent concurrent eviction
      from the worker and the netns cleanup helper.
      
      Fixes: 648700f7 ("inet: frags: use rhashtables for reassembly units")
      Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f6f2a4a2
    • L
      net: diag: Don't double-free TCP_NEW_SYN_RECV sockets in tcp_abort · acc2cf4e
      Lorenzo Colitti 提交于
      When tcp_diag_destroy closes a TCP_NEW_SYN_RECV socket, it first
      frees it by calling inet_csk_reqsk_queue_drop_and_and_put in
      tcp_abort, and then frees it again by calling sock_gen_put.
      
      Since tcp_abort only has one caller, and all the other codepaths
      in tcp_abort don't free the socket, just remove the free in that
      function.
      
      Cc: David Ahern <dsa@cumulusnetworks.com>
      Tested: passes Android sock_diag_test.py, which exercises this codepath
      Fixes: d7226c7a ("net: diag: Fix refcnt leak in error path destroying socket")
      Signed-off-by: NLorenzo Colitti <lorenzo@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reviewed-by: NDavid Ahern <dsa@cumulusnetworks.com>
      Tested-by: NDavid Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      acc2cf4e
    • D
      net/ipv4: Set oif in fib_compute_spec_dst · e7372197
      David Ahern 提交于
      Xin reported that icmp replies may not use the address on the device the
      echo request is received if the destination address is broadcast. Instead
      a route lookup is done without considering VRF context. Fix by setting
      oif in flow struct to the master device if it is enslaved. That directs
      the lookup to the VRF table. If the device is not enslaved, oif is still
      0 so no affect.
      
      Fixes: cd2fbe1b ("net: Use VRF device index for lookups on RX")
      Reported-by: NXin Long <lucien.xin@gmail.com>
      Signed-off-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e7372197
    • T
      xdp: XDP_REDIRECT should check IFF_UP and MTU · d8d7218a
      Toshiaki Makita 提交于
      Otherwise we end up with attempting to send packets from down devices
      or to send oversized packets, which may cause unexpected driver/device
      behaviour. Generic XDP has already done this check, so reuse the logic
      in native XDP.
      
      Fixes: 814abfab ("xdp: add bpf_redirect helper function")
      Signed-off-by: NToshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      d8d7218a
    • A
      Merge branch 'sockhash-fixes' · 4fb126cb
      Alexei Starovoitov 提交于
      John Fastabend says:
      
      ====================
      First three patches resolve issues found while testing sockhash and
      reviewing code. Syzbot also found them about the same time as I was
      working on fixes. The main issue is in the sockhash path we reduced
      the scope of sk_callback lock but this meant we could get update and
      close running in parallel so fix that here.
      
      Then testing sk_msg and sk_skb programs together found that skb->dev
      is not always assigned and some of the helpers were depending on this
      to lookup max mtu. Fix this by using SKB_MAX_ALLOC when no MTU is
      available.
      
      Finally, Martin spotted that the sockmap code was still using the
      qdisc skb cb structure. But I was sure we had fixed this long ago.
      Looks like we missed it in a merge conflict resolution and then by
      chance data_end offset was the same in both structures so everything
      sort of continued to work even though it could break at any moment
      if the structs ever change. So redo the conversion and this time
      also convert the helpers.
      
      v2: fix '0 files changed' issue in patches
      ====================
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      4fb126cb
    • J
      bpf: sockmap, convert bpf_compute_data_pointers to bpf_*_sk_skb · 0ea488ff
      John Fastabend 提交于
      In commit
      
        'bpf: bpf_compute_data uses incorrect cb structure' (8108a775)
      
      we added the routine bpf_compute_data_end_sk_skb() to compute the
      correct data_end values, but this has since been lost. In kernel
      v4.14 this was correct and the above patch was applied in it
      entirety. Then when v4.14 was merged into v4.15-rc1 net-next tree
      we lost the piece that renamed bpf_compute_data_pointers to the
      new function bpf_compute_data_end_sk_skb. This was done here,
      
      e1ea2f98 ("Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net")
      
      When it conflicted with the following rename patch,
      
      6aaae2b6 ("bpf: rename bpf_compute_data_end into bpf_compute_data_pointers")
      
      Finally, after a refactor I thought even the function
      bpf_compute_data_end_sk_skb() was no longer needed and it was
      erroneously removed.
      
      However, we never reverted the sk_skb_convert_ctx_access() usage of
      tcp_skb_cb which had been committed and survived the merge conflict.
      Here we fix this by adding back the helper and *_data_end_sk_skb()
      usage. Using the bpf_skc_data_end mapping is not correct because it
      expects a qdisc_skb_cb object but at the sock layer this is not the
      case. Even though it happens to work here because we don't overwrite
      any data in-use at the socket layer and the cb structure is cleared
      later this has potential to create some subtle issues. But, even
      more concretely the filter.c access check uses tcp_skb_cb.
      
      And by some act of chance though,
      
      struct bpf_skb_data_end {
              struct qdisc_skb_cb        qdisc_cb;             /*     0    28 */
      
              /* XXX 4 bytes hole, try to pack */
      
              void *                     data_meta;            /*    32     8 */
              void *                     data_end;             /*    40     8 */
      
              /* size: 48, cachelines: 1, members: 3 */
              /* sum members: 44, holes: 1, sum holes: 4 */
              /* last cacheline: 48 bytes */
      };
      
      and then tcp_skb_cb,
      
      struct tcp_skb_cb {
      	[...]
                      struct {
                              __u32      flags;                /*    24     4 */
                              struct sock * sk_redir;          /*    32     8 */
                              void *     data_end;             /*    40     8 */
                      } bpf;                                   /*          24 */
              };
      
      So when we use offset_of() to track down the byte offset we get 40 in
      either case and everything continues to work. Fix this mess and use
      correct structures its unclear how long this might actually work for
      until someone moves the structs around.
      Reported-by: NMartin KaFai Lau <kafai@fb.com>
      Fixes: e1ea2f98 ("Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net")
      Fixes: 6aaae2b6 ("bpf: rename bpf_compute_data_end into bpf_compute_data_pointers")
      Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      0ea488ff
    • J
      bpf: sockmap, consume_skb in close path · 7ebc14d5
      John Fastabend 提交于
      Currently, when a sock is closed and the bpf_tcp_close() callback is
      used we remove memory but do not free the skb. Call consume_skb() if
      the skb is attached to the buffer.
      
      Reported-by: syzbot+d464d2c20c717ef5a6a8@syzkaller.appspotmail.com
      Fixes: 1aa12bdf ("bpf: sockmap, add sock close() hook to remove socks")
      Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      7ebc14d5
    • J
      bpf: sockhash, disallow bpf_tcp_close and update in parallel · 99ba2b5a
      John Fastabend 提交于
      After latest lock updates there is no longer anything preventing a
      close and recvmsg call running in parallel. Additionally, we can
      race update with close if we close a socket and simultaneously update
      if via the BPF userspace API (note the cgroup ops are already run
      with sock_lock held).
      
      To resolve this take sock_lock in close and update paths.
      
      Reported-by: syzbot+b680e42077a0d7c9a0c4@syzkaller.appspotmail.com
      Fixes: e9db4ef6 ("bpf: sockhash fix omitted bucket lock in sock_close")
      Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      99ba2b5a