1. 13 Jul, 2018 9 commits
    • S
      net: Don't copy pfmemalloc flag in __copy_skb_header() · 8b700862
      Stefano Brivio committed
      The pfmemalloc flag indicates that the skb was allocated from
      the PFMEMALLOC reserves, and the flag is currently copied on skb
      copy and clone.
      
      However, an skb copied from an skb flagged with pfmemalloc
      wasn't necessarily allocated from PFMEMALLOC reserves, and on
      the other hand an skb allocated that way might be copied from an
      skb that wasn't.
      
      So we should not copy the flag on skb copy, and rather decide
      whether to allow an skb to be associated with sockets unrelated
      to page reclaim depending only on how it was allocated.
      
      Move the pfmemalloc flag before headers_start[0] using an
      existing 1-bit hole, so that __copy_skb_header() doesn't copy
      it.
      
      When cloning, we'll now take care of this flag explicitly,
      contrary to the warning comment of __skb_clone().
      
      While at it, restore the newline usage introduced by commit
      b1937227 ("net: reorganize sk_buff for faster
      __copy_skb_header()") to visually separate bytes used in
      bitfields after headers_start[0], that was gone after commit
      a9e419dc ("netfilter: merge ctinfo into nfct pointer storage
      area"), and describe the pfmemalloc flag in the kernel-doc
      structure comment.
      
      This doesn't change the size of sk_buff or cacheline boundaries,
      but consolidates the 15-bit hole before tc_index into a 2-byte
      hole before csum, which could now be filled more easily.
      Reported-by: Patrick Talbert <ptalbert@redhat.com>
      Fixes: c93bdd0e ("netvm: allow skb allocation to use PFMEMALLOC reserves")
      Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8b700862
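The layout trick described in this commit can be illustrated with a small user-space sketch. It is not the kernel code: `struct mini_skb`, `mini_copy_header()` and `mini_clone()` are hypothetical stand-ins, and the bulk-copied region is delimited with `offsetof()` instead of the kernel's `headers_start[0]` marker. The idea shown is only that a field placed before the copied region is left alone by the bulk copy and has to be carried over by hand when cloning.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, heavily simplified stand-in for struct sk_buff. */
struct mini_skb {
	unsigned int pfmemalloc;  /* sits before the bulk-copied region */
	unsigned int protocol;    /* first field of the bulk-copied region */
	unsigned int priority;
	unsigned int mark;
	unsigned int users;       /* first field after the bulk-copied region */
};

#define MINI_HDR_START offsetof(struct mini_skb, protocol)
#define MINI_HDR_END   offsetof(struct mini_skb, users)

/* Copy only the header region; pfmemalloc lies outside it and is not touched. */
static void mini_copy_header(struct mini_skb *dst, const struct mini_skb *src)
{
	assert(MINI_HDR_END > MINI_HDR_START);
	memcpy((char *)dst + MINI_HDR_START,
	       (const char *)src + MINI_HDR_START,
	       MINI_HDR_END - MINI_HDR_START);
}

/* A clone shares the original's data, so the flag is carried over explicitly. */
static void mini_clone(struct mini_skb *dst, const struct mini_skb *src)
{
	mini_copy_header(dst, src);
	dst->pfmemalloc = src->pfmemalloc;
	dst->users = 1;
}
```

With this layout a plain header copy leaves the destination's `pfmemalloc` at whatever its own allocation set it to, while a clone inherits the flag on purpose.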
    • D
      Merge branch 'sfc-filter-locking-fixes' · 1ff9c66b
      David S. Miller committed
      Bert Kenward says:
      
      ====================
      sfc: filter locking fixes
      
      Two fixes for sfc ef10 filter table locking. Initially spotted
      by lockdep, but one issue has also been seen in normal use.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1ff9c66b
    • B
      sfc: hold filter_sem consistently during reset · 193f2003
      Bert Kenward committed
      We should take and release the filter_sem consistently during the
      reset process, in the same manner as the mac_lock and reset_lock.
      
      For lockdep consistency we also take the filter_sem for write around
      other calls to efx->type->init().
      
      Fixes: c2bebe37 ("sfc: give ef10 its own rwsem in the filter table instead of filter_lock")
      Signed-off-by: Bert Kenward <bkenward@solarflare.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      193f2003
    • B
      sfc: avoid hang from nested use of the filter_sem · 1c56c099
      Bert Kenward committed
      In some situations we may end up calling down_read while already
      holding the semaphore for write, thus hanging. This has been seen
      when setting the MAC address for the interface. The hung task log
      in this situation includes this stack:
        down_read
        efx_ef10_filter_insert
        efx_ef10_filter_insert_addr_list
        efx_ef10_filter_vlan_sync_rx_mode
        efx_ef10_filter_add_vlan
        efx_ef10_filter_table_probe
        efx_ef10_set_mac_address
        efx_set_mac_address
        dev_set_mac_address
      
      In addition, lockdep rightly points out that nested calling of
      down_read is incorrect.
      
      Fixes: c2bebe37 ("sfc: give ef10 its own rwsem in the filter table instead of filter_lock")
      Tested-by: Jarod Wilson <jarod@redhat.com>
      Signed-off-by: Bert Kenward <bkenward@solarflare.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1c56c099
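The hang described above is the classic result of re-acquiring a lock that the caller already holds. A common way to avoid it, sketched below under stated assumptions, is to split a function into a `_locked` variant that expects the caller to hold the semaphore and a thin wrapper that takes it. Everything here is hypothetical: `toy_rwsem` is a single-threaded model whose "trylock" helpers return `false` where a real rwsem would block, and none of the names come from the sfc driver.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy, single-threaded model of a reader/writer semaphore. */
struct toy_rwsem {
	int readers;
	bool writer;
};

/* Returns false in the situations where a real down_read() would block. */
static bool toy_down_read_trylock(struct toy_rwsem *s)
{
	if (s->writer)
		return false;
	s->readers++;
	return true;
}

static void toy_up_read(struct toy_rwsem *s)
{
	assert(s->readers > 0);
	s->readers--;
}

static bool toy_down_write_trylock(struct toy_rwsem *s)
{
	if (s->writer || s->readers)
		return false;
	s->writer = true;
	return true;
}

static void toy_up_write(struct toy_rwsem *s)
{
	assert(s->writer);
	s->writer = false;
}

struct toy_table {
	struct toy_rwsem sem;
	int entries;
};

/* Caller must already hold sem, for read or for write. */
static void toy_insert_locked(struct toy_table *t)
{
	assert(t->sem.writer || t->sem.readers > 0);
	t->entries++;
}

/* Entry point for callers that hold nothing: takes the lock itself. */
static bool toy_insert(struct toy_table *t)
{
	if (!toy_down_read_trylock(&t->sem))
		return false;
	toy_insert_locked(t);
	toy_up_read(&t->sem);
	return true;
}

/* Rebuild path: holds sem for write, so it must use the _locked variant. */
static bool toy_reprobe(struct toy_table *t)
{
	if (!toy_down_write_trylock(&t->sem))
		return false;
	t->entries = 0;
	toy_insert_locked(t);
	toy_up_write(&t->sem);
	return true;
}
```

Calling `toy_insert()` while the write side is held fails in this model, which corresponds to the hung `down_read` in the stack above; `toy_reprobe()` avoids that by never taking the lock a second time.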
    • F
      net: systemport: Fix CRC forwarding check for SYSTEMPORT Lite · 9e3bff92
      Florian Fainelli committed
      SYSTEMPORT Lite reversed the logic compared to SYSTEMPORT: the
      GIB_FCS_STRIP bit is set when the Ethernet FCS is stripped, and that bit
      is not set by default. Fix the logic such that we properly check whether
      that bit is set or not and we don't forward an extra 4 bytes to the
      network stack.
      
      Fixes: 44a4524c ("net: systemport: Add support for SYSTEMPORT Lite")
      Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9e3bff92
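The inverted bit semantics can be made concrete with a short sketch. The bit positions and all `toy_`/`TOY_` names below are assumptions for illustration only, not the driver's register layout: on the non-Lite variant a set bit means "FCS is forwarded", while on the Lite variant a set bit means "FCS is stripped".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_CMD_CRC_FWD   (1u << 0)  /* hypothetical: set = FCS forwarded */
#define TOY_GIB_FCS_STRIP (1u << 1)  /* hypothetical: set = FCS stripped */
#define TOY_FCS_LEN       4u         /* Ethernet FCS is 4 bytes */

/* True when the hardware hands the 4-byte FCS up along with the frame. */
static bool toy_crc_forwarded(bool is_lite, uint32_t reg)
{
	if (is_lite)
		return !(reg & TOY_GIB_FCS_STRIP);  /* inverted meaning on Lite */
	return (reg & TOY_CMD_CRC_FWD) != 0;
}

/* Length to pass to the stack: trim the FCS only if it was forwarded. */
static unsigned int toy_payload_len(unsigned int frame_len, bool is_lite,
				    uint32_t reg)
{
	assert(frame_len >= TOY_FCS_LEN);
	return toy_crc_forwarded(is_lite, reg) ? frame_len - TOY_FCS_LEN
					       : frame_len;
}
```

With the strip bit clear, which the commit message says is the default on Lite, the sketch trims 4 bytes so that the FCS does not reach the network stack.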
    • S
      tcp: allow user to create repair socket without window probes · 70b7ff13
      Stefan Baranoff committed
      Under rare conditions where repair code may be used it is possible that
      window probes are either unnecessary or undesired. If the user knows that
      window probes are not wanted or needed this change allows them to skip
      sending them when a socket comes out of repair.
      Signed-off-by: Stefan Baranoff <sbaranoff@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      70b7ff13
    • S
      tcp: fix sequence numbers for repaired sockets re-using TIME-WAIT sockets · 21684dc4
      Stefan Baranoff committed
      This patch fixes a bug where the sequence numbers of a socket created using
      TCP repair functionality are lower than set after connect is called.
      This occurs when the repair socket overlaps with a TIME-WAIT socket and
      triggers the re-use code. The amount lower is equal to the number of times
      that a particular IP/port set is re-used and then put back into TIME-WAIT.
      On the first re-use the sequence number is 1 lower; closing that socket
      and then re-opening (with repair) a new socket with the same addresses/ports
      puts the sequence number 2 lower than set via setsockopt. The third time is
      3 lower, etc. I have not tested what the limit of this accrual is, if any.
      
      The fix is, if a socket is in repair mode, to respect the already set
      sequence number and timestamp when it would have already re-used the
      TIME-WAIT socket.
      Signed-off-by: Stefan Baranoff <sbaranoff@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      21684dc4
    • J
      sch_fq_codel: zero q->flows_cnt when fq_codel_init fails · 83fe6b87
      Jacob Keller committed
      When fq_codel_init fails, qdisc_create_dflt will clean up by using
      qdisc_destroy. This function calls the ->reset() op prior to calling the
      ->destroy() op.
      
      Unfortunately, during the failure flow for sch_fq_codel, the ->flows
      parameter is not initialized, so the fq_codel_reset function will
      dereference a NULL pointer.
      
         kernel: BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
         kernel: IP: fq_codel_reset+0x58/0xd0 [sch_fq_codel]
         kernel: PGD 0 P4D 0
         kernel: Oops: 0000 [#1] SMP PTI
         kernel: Modules linked in: i40iw i40e(OE) xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack tun bridge stp llc devlink ebtable_filter ebtables ip6table_filter ip6_tables rpcrdma ib_isert iscsi_target_mod sunrpc ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel intel_cstate iTCO_wdt iTCO_vendor_support intel_uncore ib_core intel_rapl_perf mei_me mei joydev i2c_i801 lpc_ich ioatdma shpchp wmi sch_fq_codel xfs libcrc32c mgag200 ixgbe drm_kms_helper isci ttm firewire_ohci
         kernel:  mdio drm igb libsas crc32c_intel firewire_core ptp pps_core scsi_transport_sas crc_itu_t dca i2c_algo_bit ipmi_si ipmi_devintf ipmi_msghandler [last unloaded: i40e]
         kernel: CPU: 10 PID: 4219 Comm: ip Tainted: G           OE    4.16.13custom-fq-codel-test+ #3
         kernel: Hardware name: Intel Corporation S2600CO/S2600CO, BIOS SE5C600.86B.02.05.0004.051120151007 05/11/2015
         kernel: RIP: 0010:fq_codel_reset+0x58/0xd0 [sch_fq_codel]
         kernel: RSP: 0018:ffffbfbf4c1fb620 EFLAGS: 00010246
         kernel: RAX: 0000000000000400 RBX: 0000000000000000 RCX: 00000000000005b9
         kernel: RDX: 0000000000000000 RSI: ffff9d03264a60c0 RDI: ffff9cfd17b31c00
         kernel: RBP: 0000000000000001 R08: 00000000000260c0 R09: ffffffffb679c3e9
         kernel: R10: fffff1dab06a0e80 R11: ffff9cfd163af800 R12: ffff9cfd17b31c00
         kernel: R13: 0000000000000001 R14: ffff9cfd153de600 R15: 0000000000000001
         kernel: FS:  00007fdec2f92800(0000) GS:ffff9d0326480000(0000) knlGS:0000000000000000
         kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
         kernel: CR2: 0000000000000008 CR3: 0000000c1956a006 CR4: 00000000000606e0
         kernel: Call Trace:
         kernel:  qdisc_destroy+0x56/0x140
         kernel:  qdisc_create_dflt+0x8b/0xb0
         kernel:  mq_init+0xc1/0xf0
         kernel:  qdisc_create_dflt+0x5a/0xb0
         kernel:  dev_activate+0x205/0x230
         kernel:  __dev_open+0xf5/0x160
         kernel:  __dev_change_flags+0x1a3/0x210
         kernel:  dev_change_flags+0x21/0x60
         kernel:  do_setlink+0x660/0xdf0
         kernel:  ? down_trylock+0x25/0x30
         kernel:  ? xfs_buf_trylock+0x1a/0xd0 [xfs]
         kernel:  ? rtnl_newlink+0x816/0x990
         kernel:  ? _xfs_buf_find+0x327/0x580 [xfs]
         kernel:  ? _cond_resched+0x15/0x30
         kernel:  ? kmem_cache_alloc+0x20/0x1b0
         kernel:  ? rtnetlink_rcv_msg+0x200/0x2f0
         kernel:  ? rtnl_calcit.isra.30+0x100/0x100
         kernel:  ? netlink_rcv_skb+0x4c/0x120
         kernel:  ? netlink_unicast+0x19e/0x260
         kernel:  ? netlink_sendmsg+0x1ff/0x3c0
         kernel:  ? sock_sendmsg+0x36/0x40
         kernel:  ? ___sys_sendmsg+0x295/0x2f0
         kernel:  ? ebitmap_cmp+0x6d/0x90
         kernel:  ? dev_get_by_name_rcu+0x73/0x90
         kernel:  ? skb_dequeue+0x52/0x60
         kernel:  ? __inode_wait_for_writeback+0x7f/0xf0
         kernel:  ? bit_waitqueue+0x30/0x30
         kernel:  ? fsnotify_grab_connector+0x3c/0x60
         kernel:  ? __sys_sendmsg+0x51/0x90
         kernel:  ? do_syscall_64+0x74/0x180
         kernel:  ? entry_SYSCALL_64_after_hwframe+0x3d/0xa2
         kernel: Code: 00 00 48 89 87 00 02 00 00 8b 87 a0 01 00 00 85 c0 0f 84 84 00 00 00 31 ed 48 63 dd 83 c5 01 48 c1 e3 06 49 03 9c 24 90 01 00 00 <48> 8b 73 08 48 8b 3b e8 6c 9a 4f f6 48 8d 43 10 48 c7 03 00 00
         kernel: RIP: fq_codel_reset+0x58/0xd0 [sch_fq_codel] RSP: ffffbfbf4c1fb620
         kernel: CR2: 0000000000000008
         kernel: ---[ end trace e81a62bede66274e ]---
      
      This is caused because flows_cnt is non-zero, but flows hasn't been
      initialized. fq_codel_init has left the private data in a partially
      initialized state.
      
      To fix this, reset flows_cnt to 0 when we fail to initialize.
      Additionally, to make the state more consistent, also clean up the flows
      pointer when the allocation of backlogs fails.
      
      This fixes the NULL pointer dereference, since both the for-loop and
      memset in fq_codel_reset will be no-ops when flows_cnt is zero.
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      83fe6b87
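The failure mode and the fix can be reproduced in miniature. The sketch below is a user-space model, not the qdisc code: `toy_sched`, `toy_init()` and `toy_reset()` are hypothetical, and the `fail_at` argument is a test hook that simulates an allocation failure. What it shows is the invariant the commit restores: after a failed init, the element count is zero and the pointers are NULL, so a reset that iterates over the count touches nothing.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct toy_flow {
	int backlog;
};

struct toy_sched {
	unsigned int flows_cnt;
	struct toy_flow *flows;
	unsigned int *backlogs;
};

/*
 * fail_at is a test hook: 0 = no failure, 1 = simulate failure of the
 * flows allocation, 2 = simulate failure of the backlogs allocation.
 */
static int toy_init(struct toy_sched *q, unsigned int cnt, int fail_at)
{
	q->flows_cnt = cnt;
	q->backlogs = NULL;
	q->flows = fail_at == 1 ? NULL : calloc(cnt, sizeof(*q->flows));
	if (!q->flows)
		goto err;
	q->backlogs = fail_at == 2 ? NULL : calloc(cnt, sizeof(*q->backlogs));
	if (!q->backlogs)
		goto err_free_flows;
	return 0;

err_free_flows:
	free(q->flows);
	q->flows = NULL;
err:
	q->flows_cnt = 0;  /* leave a state in which reset has nothing to do */
	return -1;
}

/* Shaped like a reset op: walks flows_cnt entries, then clears backlogs. */
static void toy_reset(struct toy_sched *q)
{
	for (unsigned int i = 0; i < q->flows_cnt; i++)
		q->flows[i].backlog = 0;
	if (q->flows_cnt)  /* guard keeps memset away from a NULL pointer */
		memset(q->backlogs, 0, q->flows_cnt * sizeof(*q->backlogs));
}
```

Without the `q->flows_cnt = 0` line in the error path, `toy_reset()` after a failed `toy_init()` would index into a NULL `flows` array, which mirrors the oops shown in the commit message.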
    • D
      Merge branch '10GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-queue · 35288486
      David S. Miller committed
      Jeff Kirsher says:
      
      ====================
      Intel Wired LAN Driver Updates 2018-07-12
      
      This series contains updates to ixgbe and e100/e1000 kernel documentation.
      
      Alex fixes ixgbe to ensure that we are more explicit about the ordering
      of updates to the receive address register (RAR) table.
      
      Dan Carpenter fixes an issue where we were reading one element beyond
      the end of the array.
      
      Mauro Carvalho Chehab fixes formatting issues in the e100.rst and
      e1000.rst that were causing errors during 'make htmldocs'.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      35288486
  2. 12 Jul, 2018 8 commits
  3. 10 Jul, 2018 11 commits
    • T
      rhashtable: add restart routine in rhashtable_free_and_destroy() · 0026129c
      Taehee Yoo committed
      rhashtable_free_and_destroy() cancels the deferred re-hash work,
      then walks and destroys elements. At this moment, some elements can
      still be in future_tbl. Those elements are not destroyed.
      
      Test case:
      nft_rhash_destroy() calls rhashtable_free_and_destroy() to destroy
      all elements of sets before destroying sets and chains.
      But rhashtable_free_and_destroy() doesn't destroy elements of future_tbl,
      so a splat occurs.
      
      test script:
         %cat test.nft
         table ip aa {
      	   map map1 {
      		   type ipv4_addr : verdict;
      		   elements = {
      			   0 : jump a0,
      			   1 : jump a0,
      			   2 : jump a0,
      			   3 : jump a0,
      			   4 : jump a0,
      			   5 : jump a0,
      			   6 : jump a0,
      			   7 : jump a0,
      			   8 : jump a0,
      			   9 : jump a0,
      		   }
      	   }
      	   chain a0 {
      	   }
         }
         flush ruleset
         table ip aa {
      	   map map1 {
      		   type ipv4_addr : verdict;
      		   elements = {
      			   0 : jump a0,
      			   1 : jump a0,
      			   2 : jump a0,
      			   3 : jump a0,
      			   4 : jump a0,
      			   5 : jump a0,
      			   6 : jump a0,
      			   7 : jump a0,
      			   8 : jump a0,
      			   9 : jump a0,
      		   }
      	   }
      	   chain a0 {
      	   }
         }
         flush ruleset
      
         %while :; do nft -f test.nft; done
      
      Splat looks like:
      [  200.795603] kernel BUG at net/netfilter/nf_tables_api.c:1363!
      [  200.806944] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN PTI
      [  200.812253] CPU: 1 PID: 1582 Comm: nft Not tainted 4.17.0+ #24
      [  200.820297] Hardware name: To be filled by O.E.M. To be filled by O.E.M./Aptio CRB, BIOS 5.6.5 07/08/2015
      [  200.830309] RIP: 0010:nf_tables_chain_destroy.isra.34+0x62/0x240 [nf_tables]
      [  200.838317] Code: 43 50 85 c0 74 26 48 8b 45 00 48 8b 4d 08 ba 54 05 00 00 48 c7 c6 60 6d 29 c0 48 c7 c7 c0 65 29 c0 4c 8b 40 08 e8 58 e5 fd f8 <0f> 0b 48 89 da 48 b8 00 00 00 00 00 fc ff
      [  200.860366] RSP: 0000:ffff880118dbf4d0 EFLAGS: 00010282
      [  200.866354] RAX: 0000000000000061 RBX: ffff88010cdeaf08 RCX: 0000000000000000
      [  200.874355] RDX: 0000000000000061 RSI: 0000000000000008 RDI: ffffed00231b7e90
      [  200.882361] RBP: ffff880118dbf4e8 R08: ffffed002373bcfb R09: ffffed002373bcfa
      [  200.890354] R10: 0000000000000000 R11: ffffed002373bcfb R12: dead000000000200
      [  200.898356] R13: dead000000000100 R14: ffffffffbb62af38 R15: dffffc0000000000
      [  200.906354] FS:  00007fefc31fd700(0000) GS:ffff88011b800000(0000) knlGS:0000000000000000
      [  200.915533] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  200.922355] CR2: 0000557f1c8e9128 CR3: 0000000106880000 CR4: 00000000001006e0
      [  200.930353] Call Trace:
      [  200.932351]  ? nf_tables_commit+0x26f6/0x2c60 [nf_tables]
      [  200.939525]  ? nf_tables_setelem_notify.constprop.49+0x1a0/0x1a0 [nf_tables]
      [  200.947525]  ? nf_tables_delchain+0x6e0/0x6e0 [nf_tables]
      [  200.952383]  ? nft_add_set_elem+0x1700/0x1700 [nf_tables]
      [  200.959532]  ? nla_parse+0xab/0x230
      [  200.963529]  ? nfnetlink_rcv_batch+0xd06/0x10d0 [nfnetlink]
      [  200.968384]  ? nfnetlink_net_init+0x130/0x130 [nfnetlink]
      [  200.975525]  ? debug_show_all_locks+0x290/0x290
      [  200.980363]  ? debug_show_all_locks+0x290/0x290
      [  200.986356]  ? sched_clock_cpu+0x132/0x170
      [  200.990352]  ? find_held_lock+0x39/0x1b0
      [  200.994355]  ? sched_clock_local+0x10d/0x130
      [  200.999531]  ? memset+0x1f/0x40
      
      V2:
       - free all tables, as requested by Herbert Xu
      Signed-off-by: Taehee Yoo <ap420073@gmail.com>
      Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0026129c
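The "restart" idea can be sketched as a loop that, after freeing the elements of one table, moves on to the table it points to as its successor. The model below is a deliberately tiny user-space illustration: `toy_tbl` has a single bucket, `toy_freed` is a counter added only so the behaviour can be observed, and none of this is the rhashtable implementation.

```c
#include <assert.h>
#include <stdlib.h>

struct toy_elem {
	struct toy_elem *next;
};

struct toy_tbl {
	struct toy_elem *bucket;     /* a single bucket, for brevity */
	struct toy_tbl *future_tbl;  /* set while a rehash is pending */
};

static int toy_freed;  /* number of elements released so far */

/* Build a table holding nelems freshly allocated elements. */
static struct toy_tbl *toy_tbl_new(unsigned int nelems)
{
	struct toy_tbl *tbl = calloc(1, sizeof(*tbl));

	assert(tbl);
	while (nelems--) {
		struct toy_elem *e = calloc(1, sizeof(*e));

		assert(e);
		e->next = tbl->bucket;
		tbl->bucket = e;
	}
	return tbl;
}

/* Free every element of tbl and of each table chained via future_tbl. */
static void toy_free_and_destroy(struct toy_tbl *tbl)
{
	while (tbl) {  /* restart the walk for each chained table */
		struct toy_tbl *next_tbl = tbl->future_tbl;
		struct toy_elem *e = tbl->bucket;

		while (e) {
			struct toy_elem *next = e->next;

			free(e);
			toy_freed++;
			e = next;
		}
		free(tbl);
		tbl = next_tbl;
	}
}
```

A version that stopped after the first table would leave the elements in `future_tbl` alive, which is the situation the test script above triggers.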
    • D
      Merge branch 'bnxt_en-Bug-fixes' · 252dd176
      David S. Miller committed
      Michael Chan says:
      
      ====================
      bnxt_en: Bug fixes.
      
      These are bug fixes in error code paths, TC Flower VLAN TCI flow
      checking bug fix, proper filtering of Broadcast packets if IFF_BROADCAST
      is not set, and a bug fix in bnxt_get_max_rings() to return 0 ring
      parameters when the return value is -ENOMEM.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      252dd176
    • V
      bnxt_en: Fix for system hang if request_irq fails · c58387ab
      Vikas Gupta committed
      Fix bug in the error code path when bnxt_request_irq() returns failure.
      bnxt_disable_napi() should not be called in this error path because
      NAPI has not been enabled yet.
      
      Fixes: c0c050c5 ("bnxt_en: New Broadcom ethernet driver.")
      Signed-off-by: Vikas Gupta <vikas.gupta@broadcom.com>
      Signed-off-by: Michael Chan <michael.chan@broadcom.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c58387ab
    • M
      bnxt_en: Do not modify max IRQ count after RDMA driver requests/frees IRQs. · 30f52947
      Michael Chan committed
      Calling bnxt_set_max_func_irqs() to modify the max IRQ count requested or
      freed by the RDMA driver is flawed.  The max IRQ count is checked when
      re-initializing the IRQ vectors and this can happen multiple times
      during ifup or ethtool -L.  If the max IRQ is reduced and the RDMA
      driver is operational, we may not initialize IRQs correctly.  This
      problem shows up on VFs with a very small number of MSIX vectors.
      
      There is no other logic that relies on the IRQ count excluding the ones
      used by RDMA.  So we fix it by just removing the call to subtract or
      add the IRQs used by RDMA.
      
      Fixes: a588e458 ("bnxt_en: Add interface to support RDMA driver.")
      Signed-off-by: Michael Chan <michael.chan@broadcom.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      30f52947
    • M
      bnxt_en: Support clearing of the IFF_BROADCAST flag. · 30e33848
      Michael Chan committed
      Currently, the driver assumes IFF_BROADCAST is always set and always sets
      the broadcast filter.  Modify the code to set or clear the broadcast
      filter according to the IFF_BROADCAST flag.
      Signed-off-by: Michael Chan <michael.chan@broadcom.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      30e33848
    • M
      bnxt_en: Always set output parameters in bnxt_get_max_rings(). · 78f058a4
      Michael Chan committed
      The current code returns -ENOMEM and does not bother to set the output
      parameters to 0 when no rings are available.  Some callers, such as
      bnxt_get_channels() will display garbage ring numbers when that happens.
      Fix it by always setting the output parameters.
      
      Fixes: 6e6c5a57 ("bnxt_en: Modify bnxt_get_max_rings() to support shared or non shared rings.")
      Signed-off-by: Michael Chan <michael.chan@broadcom.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      78f058a4
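The pattern behind this fix, setting output parameters on every return path, is easy to show in isolation. The function below is a hypothetical sketch (`toy_get_max_rings()` and its `hw_rx`/`hw_tx` inputs are invented for illustration, not the driver's interface): the outputs are zeroed before any early return, so a caller that ignores the error code still sees defined values.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * Report the usable ring counts. hw_rx/hw_tx stand in for whatever the
 * hardware query returned; both outputs are always written.
 */
static int toy_get_max_rings(int hw_rx, int hw_tx, int *max_rx, int *max_tx)
{
	assert(max_rx != NULL && max_tx != NULL);

	/* Set outputs first so that every return path leaves them defined. */
	*max_rx = 0;
	*max_tx = 0;

	if (hw_rx <= 0 || hw_tx <= 0)
		return -ENOMEM;

	*max_rx = hw_rx;
	*max_tx = hw_tx;
	return 0;
}
```

A caller that prints the ring counts without checking the return value then shows zeros rather than whatever happened to be in its local variables.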
    • M
      bnxt_en: Fix inconsistent BNXT_FLAG_AGG_RINGS logic. · 07f4fde5
      Michael Chan committed
      If there aren't enough RX rings available, the driver will attempt to
      use a single RX ring without the aggregation ring.  If that also
      fails, the BNXT_FLAG_AGG_RINGS flag is cleared but the other ring
      parameters are not set consistently to reflect that.  If more RX
      rings become available at the next open, the RX rings will be in
      an inconsistent state and may crash when freeing the RX rings.
      
      Fix it by restoring the BNXT_FLAG_AGG_RINGS flag if not enough RX rings
      are available to run without aggregation rings.
      
      Fixes: bdbd1eb5 ("bnxt_en: Handle no aggregation ring gracefully.")
      Signed-off-by: Michael Chan <michael.chan@broadcom.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      07f4fde5
    • V
      bnxt_en: Fix the vlan_tci exact match check. · e32d4e60
      Venkat Duvvuru committed
      It is possible that OVS may set don’t care for the DEI/CFI bit in the
      vlan_tci mask. Hence, checking for a vlan_tci exact match will end up
      rejecting the vlan flow.
      
      This patch fixes the problem by checking for vlan_pcp and vid
      separately, instead of checking for the entire vlan_tci.
      
      Fixes: e85a9be9 ("bnxt_en: do not allow wildcard matches for L2 flows")
      Signed-off-by: Venkat Duvvuru <venkatkumar.duvvuru@broadcom.com>
      Signed-off-by: Michael Chan <michael.chan@broadcom.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e32d4e60
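The difference between the two checks comes down to which bits of the 16-bit TCI mask must be set. In the 802.1Q tag control information, the PCP occupies bits 15-13, the DEI (formerly CFI) bit 12, and the VID bits 11-0. The sketch below uses hypothetical helper names and assumes that "exact match" means all bits of a field are set in the mask; it is not the driver's code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_VLAN_PCP_MASK 0xe000u  /* priority code point, bits 15-13 */
#define TOY_VLAN_DEI_MASK 0x1000u  /* drop eligible indicator, bit 12 */
#define TOY_VLAN_VID_MASK 0x0fffu  /* VLAN ID, bits 11-0 */

static_assert((TOY_VLAN_PCP_MASK | TOY_VLAN_DEI_MASK | TOY_VLAN_VID_MASK) ==
	      0xffffu, "the three fields must cover the whole TCI");

/* Strict check: every TCI bit, including DEI, must be matched exactly. */
static bool toy_tci_exact(uint16_t mask)
{
	return mask == 0xffffu;
}

/* Relaxed check: PCP and VID are tested separately, DEI is ignored. */
static bool toy_pcp_vid_exact(uint16_t mask)
{
	bool pcp_exact = (mask & TOY_VLAN_PCP_MASK) == TOY_VLAN_PCP_MASK;
	bool vid_exact = (mask & TOY_VLAN_VID_MASK) == TOY_VLAN_VID_MASK;

	return pcp_exact && vid_exact;
}
```

A mask of `0xefff`, that is, everything except the DEI bit, fails the strict check but passes the relaxed one, which is the case the commit message attributes to OVS.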
    • D
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf · 26420d9c
      David S. Miller committed
      Pablo Neira Ayuso says:
      
      ====================
      Netfilter fixes for net
      
      The following patchset contains Netfilter fixes for your net tree:
      
      1) Missing module autoload for icmp and icmpv6 x_tables matches,
         from Florian Westphal.
      
      2) Possible non-linear access to TCP header from tproxy, from
         Mate Eckl.
      
      3) Do not allow rbtree to be used for single elements. This patch
         moves all set backends into a single module, since this can
         only happen if the hashtable module is explicitly blacklisted,
         which should never be done.
      
      4) Reject error and standard targets from nft_compat for sanity
         reasons, they are never used from there.
      
      5) Don't crash on double hashsize module parameter, from Andrey
         Ryabinin.
      
      6) Drop dst on skb before placing it in the fragmentation
         reassembly queue, from Florian Westphal.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      26420d9c
    • F
      netfilter: ipv6: nf_defrag: drop skb dst before queueing · 84379c9a
      Florian Westphal committed
      Eric Dumazet reports:
       Here is a reproducer of an annoying bug detected by syzkaller on our production kernel
       [..]
       ./b78305423 enable_conntrack
       Then :
       sleep 60
       dmesg | tail -10
       [  171.599093] unregister_netdevice: waiting for lo to become free. Usage count = 2
       [  181.631024] unregister_netdevice: waiting for lo to become free. Usage count = 2
       [  191.687076] unregister_netdevice: waiting for lo to become free. Usage count = 2
       [  201.703037] unregister_netdevice: waiting for lo to become free. Usage count = 2
       [  211.711072] unregister_netdevice: waiting for lo to become free. Usage count = 2
       [  221.959070] unregister_netdevice: waiting for lo to become free. Usage count = 2
      
      Reproducer sends ipv6 fragment that hits nfct defrag via LOCAL_OUT hook.
      skb gets queued until frag timer expiry -- 1 minute.
      
      Normally nf_conntrack_reasm gets called during prerouting, so skb has
      no dst yet which might explain why this wasn't spotted earlier.
      Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
      Reported-by: John Sperbeck <jsperbeck@google.com>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Tested-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      84379c9a
    • A
      netfilter: nf_conntrack: Fix possible crash on module loading. · 2045cdfa
      Andrey Ryabinin committed
      Loading the nf_conntrack module with a doubled hashsize parameter, i.e.
      	  modprobe nf_conntrack hashsize=12345 hashsize=12345
      causes a NULL-ptr deref.
      
      If 'hashsize' is specified twice, the nf_conntrack_set_hashsize() function
      is also called twice.
      The first nf_conntrack_set_hashsize() call will set the
      'nf_conntrack_htable_size' variable:
      
      	nf_conntrack_set_hashsize()
      		...
      		/* On boot, we can set this without any fancy locking. */
      		if (!nf_conntrack_htable_size)
      			return param_set_uint(val, kp);
      
      But on the second invocation, nf_conntrack_htable_size is already set,
      so nf_conntrack_set_hashsize() takes a different path and calls
      the nf_conntrack_hash_resize() function, which crashes on the attempt
      to dereference the 'nf_conntrack_hash' pointer:
      
      	BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
      	RIP: 0010:nf_conntrack_hash_resize+0x255/0x490 [nf_conntrack]
      	Call Trace:
      	 nf_conntrack_set_hashsize+0xcd/0x100 [nf_conntrack]
      	 parse_args+0x1f9/0x5a0
      	 load_module+0x1281/0x1a50
      	 __se_sys_finit_module+0xbe/0xf0
      	 do_syscall_64+0x7c/0x390
      	 entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Fix this by checking !nf_conntrack_hash instead of
      !nf_conntrack_htable_size. nf_conntrack_hash is initialized only
      after the module has loaded, so the second invocation of
      nf_conntrack_set_hashsize() won't crash; it will just set
      nf_conntrack_htable_size again.
      Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      2045cdfa
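The change of guard condition can be modelled in a few lines. The sketch below is a user-space illustration with hypothetical `toy_` names: the parameter setter decides between "just record the value" and "resize the live table" by looking at whether the table pointer exists, not at whether a size has already been stored.

```c
#include <assert.h>
#include <stdlib.h>

static unsigned int toy_htable_size;  /* value of the module parameter */
static void **toy_hash;               /* allocated later, at module init */
static int toy_resize_calls;          /* observation aid for the sketch */

/* Stand-in for a resize that has to walk the existing table. */
static int toy_hash_resize(unsigned int size)
{
	assert(toy_hash != NULL);  /* would be the NULL dereference otherwise */
	toy_resize_calls++;
	toy_htable_size = size;
	return 0;
}

/* Parameter setter: may run several times before the table exists. */
static int toy_set_hashsize(unsigned int size)
{
	/* No table yet: record the value, however often this is called. */
	if (!toy_hash) {
		toy_htable_size = size;
		return 0;
	}
	return toy_hash_resize(size);
}
```

Had the guard tested `toy_htable_size` instead, the second call before initialisation would have gone down the resize path with `toy_hash` still NULL, which is the crash shown in the trace above.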
  4. 09 Jul, 2018 6 commits
  5. 08 Jul, 2018 6 commits