1. 29 11月, 2022 2 次提交
    • D
      xfs: drop write error injection is unfixable, remove it · 6e8af15c
      Dave Chinner 提交于
      With the changes to scan the page cache for dirty data to avoid data
      corruptions from partial write cleanup racing with other page cache
      operations, the drop writes error injection no longer works the same
      way it used to and causes xfs/196 to fail. This is because xfs/196
      writes to the file and populates the page cache before it turns on
      the error injection and starts failing -overwrites-.
      
      The result is that the original drop-writes code failed writes only
      -after- overwriting the data in the cache, followed by invalidates
      the cached data, then punching out the delalloc extent from under
      that data.
      
      On the surface, this looks fine. The problem is that page cache
      invalidation *doesn't guarantee that it removes anything from the
      page cache* and it doesn't change the dirty state of the folio. When
      block size == page size and we do page aligned IO (as xfs/196 does)
      everything happens to align perfectly and page cache invalidation
      removes the single page folios that span the written data. Hence the
      followup delalloc punch pass does not find cached data over that
      range and it can punch the extent out.
      
      IOWs, xfs/196 "works" for block size == page size with the new
      code. I say "works", because it actually only works for the case
      where IO is page aligned, and no data was read from disk before
      writes occur. Because the moment we actually read data first, the
      readahead code allocates multipage folios and suddenly the
      invalidate code goes back to zeroing subfolio ranges without
      changing dirty state.
      
      Hence, with multipage folios in play, block size == page size is
      functionally identical to block size < page size behaviour, and
      drop-writes is manifestly broken w.r.t to this case. Invalidation of
      a subfolio range doesn't result in the folio being removed from the
      cache, just the range gets zeroed. Hence after we've sequentially
      walked over a folio that we've dirtied (via write data) and then
      invalidated, we end up with a dirty folio full of zeroed data.
      
      And because the new code skips punching ranges that have dirty
      folios covering them, we end up leaving the delalloc range intact
      after failing all the writes. Hence failed writes now end up
      writing zeroes to disk in the cases where invalidation zeroes folios
      rather than removing them from cache.
      
      This is a fundamental change of behaviour that is needed to avoid
      the data corruption vectors that exist in the old write fail path,
      and it renders the drop-writes injection non-functional and
      unworkable as it stands.
      
      As it is, I think the error injection is also now unnecessary, as
      partial writes that need delalloc extent are going to be a lot more
      common with stale iomap detection in place. Hence this patch removes
      the drop-writes error injection completely. xfs/196 can remain for
      testing kernels that don't have this data corruption fix, but those
      that do will report:
      
      xfs/196 3s ... [not run] XFS error injection drop_writes unknown on this kernel.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NDarrick J. Wong <djwong@kernel.org>
      6e8af15c
    • D
      xfs: use iomap_valid method to detect stale cached iomaps · 304a68b9
      Dave Chinner 提交于
      Now that iomap supports a mechanism to validate cached iomaps for
      buffered write operations, hook it up to the XFS buffered write ops
      so that we can avoid data corruptions that result from stale cached
      iomaps. See:
      
      https://lore.kernel.org/linux-xfs/20220817093627.GZ3600936@dread.disaster.area/
      
      or the ->iomap_valid() introduction commit for exact details of the
      corruption vector.
      
      The validity cookie we store in the iomap is based on the type of
      iomap we return. It is expected that the iomap->flags we set in
      xfs_bmbt_to_iomap() is not perturbed by the iomap core and are
      returned to us in the iomap passed via the .iomap_valid() callback.
      This ensures that the validity cookie is always checking the correct
      inode fork sequence numbers to detect potential changes that affect
      the extent cached by the iomap.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NDarrick J. Wong <djwong@kernel.org>
      304a68b9
  2. 17 11月, 2022 1 次提交
    • L
      xfs: fix sb write verify for lazysbcount · 59f6ab40
      Long Li 提交于
      When lazysbcount is enabled, fsstress and loop mount/unmount test report
      the following problems:
      
      XFS (loop0): SB summary counter sanity check failed
      XFS (loop0): Metadata corruption detected at xfs_sb_write_verify+0x13b/0x460,
      	xfs_sb block 0x0
      XFS (loop0): Unmount and run xfs_repair
      XFS (loop0): First 128 bytes of corrupted metadata buffer:
      00000000: 58 46 53 42 00 00 10 00 00 00 00 00 00 28 00 00  XFSB.........(..
      00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      00000020: 69 fb 7c cd 5f dc 44 af 85 74 e0 cc d4 e3 34 5a  i.|._.D..t....4Z
      00000030: 00 00 00 00 00 20 00 06 00 00 00 00 00 00 00 80  ..... ..........
      00000040: 00 00 00 00 00 00 00 81 00 00 00 00 00 00 00 82  ................
      00000050: 00 00 00 01 00 0a 00 00 00 00 00 04 00 00 00 00  ................
      00000060: 00 00 0a 00 b4 b5 02 00 02 00 00 08 00 00 00 00  ................
      00000070: 00 00 00 00 00 00 00 00 0c 09 09 03 14 00 00 19  ................
      XFS (loop0): Corruption of in-memory data (0x8) detected at _xfs_buf_ioapply
      	+0xe1e/0x10e0 (fs/xfs/xfs_buf.c:1580).  Shutting down filesystem.
      XFS (loop0): Please unmount the filesystem and rectify the problem(s)
      XFS (loop0): log mount/recovery failed: error -117
      XFS (loop0): log mount failed
      
      This corruption will shutdown the file system and the file system will
      no longer be mountable. The following script can reproduce the problem,
      but it may take a long time.
      
       #!/bin/bash
      
       device=/dev/sda
       testdir=/mnt/test
       round=0
      
       function fail()
       {
      	 echo "$*"
      	 exit 1
       }
      
       mkdir -p $testdir
       while [ $round -lt 10000 ]
       do
      	 echo "******* round $round ********"
      	 mkfs.xfs -f $device
      	 mount $device $testdir || fail "mount failed!"
      	 fsstress -d $testdir -l 0 -n 10000 -p 4 >/dev/null &
      	 sleep 4
      	 killall -w fsstress
      	 umount $testdir
      	 xfs_repair -e $device > /dev/null
      	 if [ $? -eq 2 ];then
      		 echo "ERR CODE 2: Dirty log exception during repair."
      		 exit 1
      	 fi
      	 round=$(($round+1))
       done
      
      With lazysbcount is enabled, There is no additional lock protection for
      reading m_ifree and m_icount in xfs_log_sb(), if other cpu modifies the
      m_ifree, this will make the m_ifree greater than m_icount. For example,
      consider the following sequence and ifreedelta is postive:
      
       CPU0				 CPU1
       xfs_log_sb			 xfs_trans_unreserve_and_mod_sb
       ----------			 ------------------------------
       percpu_counter_sum(&mp->m_icount)
      				 percpu_counter_add_batch(&mp->m_icount,
      						idelta, XFS_ICOUNT_BATCH)
      				 percpu_counter_add(&mp->m_ifree, ifreedelta);
       percpu_counter_sum(&mp->m_ifree)
      
      After this, incorrect inode count (sb_ifree > sb_icount) will be writen to
      the log. In the subsequent writing of sb, incorrect inode count (sb_ifree >
      sb_icount) will fail to pass the boundary check in xfs_validate_sb_write()
      that cause the file system shutdown.
      
      When lazysbcount is enabled, we don't need to guarantee that Lazy sb
      counters are completely correct, but we do need to guarantee that sb_ifree
      <= sb_icount. On the other hand, the constraint that m_ifree <= m_icount
      must be satisfied any time that there /cannot/ be other threads allocating
      or freeing inode chunks. If the constraint is violated under these
      circumstances, sb_i{count,free} (the ondisk superblock inode counters)
      maybe incorrect and need to be marked sick at unmount, the count will
      be rebuilt on the next mount.
      
      Fixes: 8756a5af ("libxfs: add more bounds checking to sb sanity checks")
      Signed-off-by: NLong Li <leo.lilong@huawei.com>
      Reviewed-by: NDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
      59f6ab40
  3. 31 10月, 2022 14 次提交
  4. 27 10月, 2022 1 次提交
  5. 21 10月, 2022 1 次提交
    • G
      xfs: fix exception caused by unexpected illegal bestcount in leaf dir · 13cf24e0
      Guo Xuenan 提交于
      For leaf dir, In most cases, there should be as many bestfree slots
      as the dir data blocks that can fit under i_size (except for [1]).
      
      Root cause is we don't examin the number bestfree slots, when the slots
      number less than dir data blocks, if we need to allocate new dir data
      block and update the bestfree array, we will use the dir block number as
      index to assign bestfree array, while we did not check the leaf buf
      boundary which may cause UAF or other memory access problem. This issue
      can also triggered with test cases xfs/473 from fstests.
      
      According to Dave Chinner & Darrick's suggestion, adding buffer verifier
      to detect this abnormal situation in time.
      Simplify the testcase for fstest xfs/554 [1]
      
      The error log is shown as follows:
      ==================================================================
      BUG: KASAN: use-after-free in xfs_dir2_leaf_addname+0x1995/0x1ac0
      Write of size 2 at addr ffff88810168b000 by task touch/1552
      CPU: 5 PID: 1552 Comm: touch Not tainted 6.0.0-rc3+ #101
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
      1.13.0-1ubuntu1.1 04/01/2014
      Call Trace:
       <TASK>
       dump_stack_lvl+0x4d/0x66
       print_report.cold+0xf6/0x691
       kasan_report+0xa8/0x120
       xfs_dir2_leaf_addname+0x1995/0x1ac0
       xfs_dir_createname+0x58c/0x7f0
       xfs_create+0x7af/0x1010
       xfs_generic_create+0x270/0x5e0
       path_openat+0x270b/0x3450
       do_filp_open+0x1cf/0x2b0
       do_sys_openat2+0x46b/0x7a0
       do_sys_open+0xb7/0x130
       do_syscall_64+0x35/0x80
       entry_SYSCALL_64_after_hwframe+0x63/0xcd
      RIP: 0033:0x7fe4d9e9312b
      Code: 25 00 00 41 00 3d 00 00 41 00 74 4b 64 8b 04 25 18 00 00 00 85 c0
      75 67 44 89 e2 48 89 ee bf 9c ff ff ff b8 01 01 00 00 0f 05 <48> 3d 00
      f0 ff ff 0f 87 91 00 00 00 48 8b 4c 24 28 64 48 33 0c 25
      RSP: 002b:00007ffda4c16c20 EFLAGS: 00000246 ORIG_RAX: 0000000000000101
      RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007fe4d9e9312b
      RDX: 0000000000000941 RSI: 00007ffda4c17f33 RDI: 00000000ffffff9c
      RBP: 00007ffda4c17f33 R08: 0000000000000000 R09: 0000000000000000
      R10: 00000000000001b6 R11: 0000000000000246 R12: 0000000000000941
      R13: 00007fe4d9f631a4 R14: 00007ffda4c17f33 R15: 0000000000000000
       </TASK>
      
      The buggy address belongs to the physical page:
      page:ffffea000405a2c0 refcount:0 mapcount:0 mapping:0000000000000000
      index:0x0 pfn:0x10168b
      flags: 0x2fffff80000000(node=0|zone=2|lastcpupid=0x1fffff)
      raw: 002fffff80000000 ffffea0004057788 ffffea000402dbc8 0000000000000000
      raw: 0000000000000000 0000000000170000 00000000ffffffff 0000000000000000
      page dumped because: kasan: bad access detected
      
      Memory state around the buggy address:
       ffff88810168af00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
       ffff88810168af80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
      >ffff88810168b000: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
                         ^
       ffff88810168b080: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
       ffff88810168b100: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
      ==================================================================
      Disabling lock debugging due to kernel taint
      00000000: 58 44 44 33 5b 53 35 c2 00 00 00 00 00 00 00 78
      XDD3[S5........x
      XFS (sdb): Internal error xfs_dir2_data_use_free at line 1200 of file
      fs/xfs/libxfs/xfs_dir2_data.c.  Caller
      xfs_dir2_data_use_free+0x28a/0xeb0
      CPU: 5 PID: 1552 Comm: touch Tainted: G    B              6.0.0-rc3+
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
      1.13.0-1ubuntu1.1 04/01/2014
      Call Trace:
       <TASK>
       dump_stack_lvl+0x4d/0x66
       xfs_corruption_error+0x132/0x150
       xfs_dir2_data_use_free+0x198/0xeb0
       xfs_dir2_leaf_addname+0xa59/0x1ac0
       xfs_dir_createname+0x58c/0x7f0
       xfs_create+0x7af/0x1010
       xfs_generic_create+0x270/0x5e0
       path_openat+0x270b/0x3450
       do_filp_open+0x1cf/0x2b0
       do_sys_openat2+0x46b/0x7a0
       do_sys_open+0xb7/0x130
       do_syscall_64+0x35/0x80
       entry_SYSCALL_64_after_hwframe+0x63/0xcd
      RIP: 0033:0x7fe4d9e9312b
      Code: 25 00 00 41 00 3d 00 00 41 00 74 4b 64 8b 04 25 18 00 00 00 85 c0
      75 67 44 89 e2 48 89 ee bf 9c ff ff ff b8 01 01 00 00 0f 05 <48> 3d 00
      f0 ff ff 0f 87 91 00 00 00 48 8b 4c 24 28 64 48 33 0c 25
      RSP: 002b:00007ffda4c16c20 EFLAGS: 00000246 ORIG_RAX: 0000000000000101
      RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007fe4d9e9312b
      RDX: 0000000000000941 RSI: 00007ffda4c17f46 RDI: 00000000ffffff9c
      RBP: 00007ffda4c17f46 R08: 0000000000000000 R09: 0000000000000001
      R10: 00000000000001b6 R11: 0000000000000246 R12: 0000000000000941
      R13: 00007fe4d9f631a4 R14: 00007ffda4c17f46 R15: 0000000000000000
       </TASK>
      XFS (sdb): Corruption detected. Unmount and run xfs_repair
      
      [1] https://lore.kernel.org/all/20220928095355.2074025-1-guoxuenan@huawei.com/Reviewed-by: NHou Tao <houtao1@huawei.com>
      Signed-off-by: NGuo Xuenan <guoxuenan@huawei.com>
      Reviewed-by: NDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
      13cf24e0
  6. 12 10月, 2022 2 次提交
    • J
      treewide: use get_random_u32() when possible · a251c17a
      Jason A. Donenfeld 提交于
      The prandom_u32() function has been a deprecated inline wrapper around
      get_random_u32() for several releases now, and compiles down to the
      exact same code. Replace the deprecated wrapper with a direct call to
      the real function. The same also applies to get_random_int(), which is
      just a wrapper around get_random_u32(). This was done as a basic find
      and replace.
      Reviewed-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Reviewed-by: NKees Cook <keescook@chromium.org>
      Reviewed-by: NYury Norov <yury.norov@gmail.com>
      Reviewed-by: Jan Kara <jack@suse.cz> # for ext4
      Acked-by: Toke Høiland-Jørgensen <toke@toke.dk> # for sch_cake
      Acked-by: Chuck Lever <chuck.lever@oracle.com> # for nfsd
      Acked-by: NJakub Kicinski <kuba@kernel.org>
      Acked-by: Mika Westerberg <mika.westerberg@linux.intel.com> # for thunderbolt
      Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs
      Acked-by: Helge Deller <deller@gmx.de> # for parisc
      Acked-by: Heiko Carstens <hca@linux.ibm.com> # for s390
      Signed-off-by: NJason A. Donenfeld <Jason@zx2c4.com>
      a251c17a
    • J
      treewide: use prandom_u32_max() when possible, part 1 · 81895a65
      Jason A. Donenfeld 提交于
      Rather than incurring a division or requesting too many random bytes for
      the given range, use the prandom_u32_max() function, which only takes
      the minimum required bytes from the RNG and avoids divisions. This was
      done mechanically with this coccinelle script:
      
      @basic@
      expression E;
      type T;
      identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32";
      typedef u64;
      @@
      (
      - ((T)get_random_u32() % (E))
      + prandom_u32_max(E)
      |
      - ((T)get_random_u32() & ((E) - 1))
      + prandom_u32_max(E * XXX_MAKE_SURE_E_IS_POW2)
      |
      - ((u64)(E) * get_random_u32() >> 32)
      + prandom_u32_max(E)
      |
      - ((T)get_random_u32() & ~PAGE_MASK)
      + prandom_u32_max(PAGE_SIZE)
      )
      
      @multi_line@
      identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32";
      identifier RAND;
      expression E;
      @@
      
      -       RAND = get_random_u32();
              ... when != RAND
      -       RAND %= (E);
      +       RAND = prandom_u32_max(E);
      
      // Find a potential literal
      @literal_mask@
      expression LITERAL;
      type T;
      identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32";
      position p;
      @@
      
              ((T)get_random_u32()@p & (LITERAL))
      
      // Add one to the literal.
      @script:python add_one@
      literal << literal_mask.LITERAL;
      RESULT;
      @@
      
      value = None
      if literal.startswith('0x'):
              value = int(literal, 16)
      elif literal[0] in '123456789':
              value = int(literal, 10)
      if value is None:
              print("I don't know how to handle %s" % (literal))
              cocci.include_match(False)
      elif value == 2**32 - 1 or value == 2**31 - 1 or value == 2**24 - 1 or value == 2**16 - 1 or value == 2**8 - 1:
              print("Skipping 0x%x for cleanup elsewhere" % (value))
              cocci.include_match(False)
      elif value & (value + 1) != 0:
              print("Skipping 0x%x because it's not a power of two minus one" % (value))
              cocci.include_match(False)
      elif literal.startswith('0x'):
              coccinelle.RESULT = cocci.make_expr("0x%x" % (value + 1))
      else:
              coccinelle.RESULT = cocci.make_expr("%d" % (value + 1))
      
      // Replace the literal mask with the calculated result.
      @plus_one@
      expression literal_mask.LITERAL;
      position literal_mask.p;
      expression add_one.RESULT;
      identifier FUNC;
      @@
      
      -       (FUNC()@p & (LITERAL))
      +       prandom_u32_max(RESULT)
      
      @collapse_ret@
      type T;
      identifier VAR;
      expression E;
      @@
      
       {
      -       T VAR;
      -       VAR = (E);
      -       return VAR;
      +       return E;
       }
      
      @drop_var@
      type T;
      identifier VAR;
      @@
      
       {
      -       T VAR;
              ... when != VAR
       }
      Reviewed-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Reviewed-by: NKees Cook <keescook@chromium.org>
      Reviewed-by: NYury Norov <yury.norov@gmail.com>
      Reviewed-by: NKP Singh <kpsingh@kernel.org>
      Reviewed-by: Jan Kara <jack@suse.cz> # for ext4 and sbitmap
      Reviewed-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> # for drbd
      Acked-by: NJakub Kicinski <kuba@kernel.org>
      Acked-by: Heiko Carstens <hca@linux.ibm.com> # for s390
      Acked-by: Ulf Hansson <ulf.hansson@linaro.org> # for mmc
      Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs
      Signed-off-by: NJason A. Donenfeld <Jason@zx2c4.com>
      81895a65
  7. 04 10月, 2022 2 次提交
  8. 19 9月, 2022 2 次提交
  9. 11 8月, 2022 1 次提交
    • H
      xfs: fix inode reservation space for removing transaction · 031d166f
      hexiaole 提交于
      In 'fs/xfs/libxfs/xfs_trans_resv.c', the comment for transaction of removing a
      directory entry writes:
      
      /* fs/xfs/libxfs/xfs_trans_resv.c begin */
      /*
       * For removing a directory entry we can modify:
       *    the parent directory inode: inode size
       *    the removed inode: inode size
      ...
      xfs_calc_remove_reservation(
              struct xfs_mount        *mp)
      {
              return XFS_DQUOT_LOGRES(mp) +
                      xfs_calc_iunlink_add_reservation(mp) +
                      max((xfs_calc_inode_res(mp, 1) +
      ...
      /* fs/xfs/libxfs/xfs_trans_resv.c end */
      
      There has 2 inode size of space to be reserverd, but the actual code
      for inode reservation space writes.
      
      There only count for 1 inode size to be reserved in
      'xfs_calc_inode_res(mp, 1)', rather than 2.
      Signed-off-by: Nhexiaole <hexiaole@kylinos.cn>
      Reviewed-by: NDarrick J. Wong <djwong@kernel.org>
      [djwong: remove redundant code citations]
      Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
      031d166f
  10. 23 7月, 2022 1 次提交
  11. 21 7月, 2022 3 次提交
    • D
      xfs: don't leak memory when attr fork loading fails · c78c2d09
      Darrick J. Wong 提交于
      I observed the following evidence of a memory leak while running xfs/399
      from the xfs fsck test suite (edited for brevity):
      
      XFS (sde): Metadata corruption detected at xfs_attr_shortform_verify_struct.part.0+0x7b/0xb0 [xfs], inode 0x1172 attr fork
      XFS: Assertion failed: ip->i_af.if_u1.if_data == NULL, file: fs/xfs/libxfs/xfs_inode_fork.c, line: 315
      ------------[ cut here ]------------
      WARNING: CPU: 2 PID: 91635 at fs/xfs/xfs_message.c:104 assfail+0x46/0x4a [xfs]
      CPU: 2 PID: 91635 Comm: xfs_scrub Tainted: G        W         5.19.0-rc7-xfsx #rc7 6e6475eb29fd9dda3181f81b7ca7ff961d277a40
      Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1 04/01/2014
      RIP: 0010:assfail+0x46/0x4a [xfs]
      Call Trace:
       <TASK>
       xfs_ifork_zap_attr+0x7c/0xb0
       xfs_iformat_attr_fork+0x86/0x110
       xfs_inode_from_disk+0x41d/0x480
       xfs_iget+0x389/0xd70
       xfs_bulkstat_one_int+0x5b/0x540
       xfs_bulkstat_iwalk+0x1e/0x30
       xfs_iwalk_ag_recs+0xd1/0x160
       xfs_iwalk_run_callbacks+0xb9/0x180
       xfs_iwalk_ag+0x1d8/0x2e0
       xfs_iwalk+0x141/0x220
       xfs_bulkstat+0x105/0x180
       xfs_ioc_bulkstat.constprop.0.isra.0+0xc5/0x130
       xfs_file_ioctl+0xa5f/0xef0
       __x64_sys_ioctl+0x82/0xa0
       do_syscall_64+0x2b/0x80
       entry_SYSCALL_64_after_hwframe+0x46/0xb0
      
      This newly-added assertion checks that there aren't any incore data
      structures hanging off the incore fork when we're trying to reset its
      contents.  From the call trace, it is evident that iget was trying to
      construct an incore inode from the ondisk inode, but the attr fork
      verifier failed and we were trying to undo all the memory allocations
      that we had done earlier.
      
      The three assertions in xfs_ifork_zap_attr check that the caller has
      already called xfs_idestroy_fork, which clearly has not been done here.
      As the zap function then zeroes the pointers, we've effectively leaked
      the memory.
      
      The shortest change would have been to insert an extra call to
      xfs_idestroy_fork, but it makes more sense to bundle the _idestroy_fork
      call into _zap_attr, since all other callsites call _idestroy_fork
      immediately prior to calling _zap_attr.  IOWs, it eliminates one way to
      fail.
      
      Note: This change only applies cleanly to 2ed5b09b, since we just
      reworked the attr fork lifetime.  However, I think this memory leak has
      existed since 0f45a1b2, since the chain xfs_iformat_attr_fork ->
      xfs_iformat_local -> xfs_init_local_fork will allocate
      ifp->if_u1.if_data, but if xfs_ifork_verify_local_attr fails,
      xfs_iformat_attr_fork will free i_afp without freeing any of the stuff
      hanging off i_afp.  The solution for older kernels I think is to add the
      missing call to xfs_idestroy_fork just prior to calling kmem_cache_free.
      
      Found by fuzzing a.sfattr.hdr.totsize = lastbit in xfs/399.
      
      Fixes: 2ed5b09b ("xfs: make inode attribute forks a permanent part of struct xfs_inode")
      Probably-Fixes: 0f45a1b2 ("xfs: improve local fork verification")
      Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      c78c2d09
    • D
      xfs: delete unnecessary NULL checks · 3f52e016
      Dan Carpenter 提交于
      These NULL check are no long needed after commit 2ed5b09b ("xfs:
      make inode attribute forks a permanent part of struct xfs_inode").
      Signed-off-by: NDan Carpenter <dan.carpenter@oracle.com>
      Reviewed-by: NDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
      3f52e016
    • X
      xfs: fix comment for start time value of inode with bigtime enabled · fdbae121
      Xiaole He 提交于
      The 'ctime', 'mtime', and 'atime' for inode is the type of
      'xfs_timestamp_t', which is a 64-bit type:
      
      /* fs/xfs/libxfs/xfs_format.h begin */
      typedef __be64 xfs_timestamp_t;
      /* fs/xfs/libxfs/xfs_format.h end */
      
      When the 'bigtime' feature is disabled, this 64-bit type is splitted
      into two parts of 32-bit, one part is encoded for seconds since
      1970-01-01 00:00:00 UTC, the other part is encoded for nanoseconds
      above the seconds, this two parts are the type of
      'xfs_legacy_timestamp' and the min and max time value of this type are
      defined as macros 'XFS_LEGACY_TIME_MIN' and 'XFS_LEGACY_TIME_MAX':
      
      /* fs/xfs/libxfs/xfs_format.h begin */
      struct xfs_legacy_timestamp {
              __be32          t_sec;          /* timestamp seconds */
              __be32          t_nsec;         /* timestamp nanoseconds */
      };
       #define XFS_LEGACY_TIME_MIN     ((int64_t)S32_MIN)
       #define XFS_LEGACY_TIME_MAX     ((int64_t)S32_MAX)
      /* fs/xfs/libxfs/xfs_format.h end */
      /* include/linux/limits.h begin */
       #define U32_MAX         ((u32)~0U)
       #define S32_MAX         ((s32)(U32_MAX >> 1))
       #define S32_MIN         ((s32)(-S32_MAX - 1))
      /* include/linux/limits.h end */
      
      'XFS_LEGACY_TIME_MIN' is the min time value of the
      'xfs_legacy_timestamp', that is -(2^31) seconds relative to the
      1970-01-01 00:00:00 UTC, it can be converted to human-friendly time
      value by 'date' command:
      
      /* command begin */
      [root@~]# date --utc -d '@0' +'%Y-%m-%d %H:%M:%S'
      1970-01-01 00:00:00
      [root@~]# date --utc -d "@`echo '-(2^31)'|bc`" +'%Y-%m-%d %H:%M:%S'
      1901-12-13 20:45:52
      [root@~]#
      /* command end */
      
      When 'bigtime' feature is enabled, this 64-bit type becomes a 64-bit
      nanoseconds counter, with the start time value is the min time value of
      'xfs_legacy_timestamp'(start time means the value of 64-bit nanoseconds
      counter is 0). We have already caculated the min time value of
      'xfs_legacy_timestamp', that is 1901-12-13 20:45:52 UTC, but the comment
      for the start time value of inode with 'bigtime' feature enabled writes
      the value is 1901-12-31 20:45:52 UTC:
      
      /* fs/xfs/libxfs/xfs_format.h begin */
      /*
       * XFS Timestamps
       * ==============
       * When the bigtime feature is enabled, ondisk inode timestamps become an
       * unsigned 64-bit nanoseconds counter.  This means that the bigtime inode
       * timestamp epoch is the start of the classic timestamp range, which is
       * Dec 31 20:45:52 UTC 1901. ...
       ...
       */
      /* fs/xfs/libxfs/xfs_format.h end */
      
      That is a typo, and this patch corrects the typo, from 'Dec 31' to
      'Dec 13'.
      Suggested-by: NDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: NXiaole He <hexiaole@kylinos.cn>
      Reviewed-by: NDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
      fdbae121
  12. 14 7月, 2022 2 次提交
    • D
      xfs: double link the unlinked inode list · 2fd26cc0
      Dave Chinner 提交于
      Now we have forwards traversal via the incore inode in place, we now
      need to add back pointers to the incore inode to entirely replace
      the back reference cache. We use the same lookup semantics and
      constraints as for the forwards pointer lookups during unlinks, and
      so we can look up any inode in the unlinked list directly and update
      the list pointers, forwards or backwards, at any time.
      
      The only wrinkle in converting the unlinked list manipulations to
      use in-core previous pointers is that log recovery doesn't have the
      incore inode state built up so it can't just read in an inode and
      release it to finish off the unlink. Hence we need to modify the
      traversal in recovery to read one inode ahead before we
      release the inode at the head of the list. This populates the
      next->prev relationship sufficient to be able to replay the unlinked
      list and hence greatly simplify the runtime code.
      
      This recovery algorithm also requires that we actually remove inodes
      from the unlinked list one at a time as background inode
      inactivation will result in unlinked list removal racing with the
      building of the in-memory unlinked list state. We could serialise
      this by holding the AGI buffer lock when constructing the in memory
      state, but all that does is lockstep background processing with list
      building. It is much simpler to flush the inodegc immediately after
      releasing the inode so that it is unlinked immediately and there is
      no races present at all.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      2fd26cc0
    • D
      xfs: track the iunlink list pointer in the xfs_inode · 4fcc94d6
      Dave Chinner 提交于
      Having direct access to the i_next_unlinked pointer in unlinked
      inodes greatly simplifies the processing of inodes on the unlinked
      list. We no longer need to look up the inode buffer just to find
      next inode in the list if the xfs_inode is in memory. These
      improvements will be realised over upcoming patches as other
      dependencies on the inode buffer for unlinked list processing are
      removed.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      4fcc94d6
  13. 13 7月, 2022 2 次提交
  14. 10 7月, 2022 4 次提交
    • D
      xfs: use XFS_IFORK_Q to determine the presence of an xattr fork · e45d7cb2
      Darrick J. Wong 提交于
      Modify xfs_ifork_ptr to return a NULL pointer if the caller asks for the
      attribute fork but i_forkoff is zero.  This eliminates the ambiguity
      between i_forkoff and i_af.if_present, which should make it easier to
      understand the lifetime of attr forks.
      
      While we're at it, remove the if_present checks around calls to
      xfs_idestroy_fork and xfs_ifork_zap_attr since they can both handle attr
      forks that have already been torn down.
      Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      e45d7cb2
    • D
      xfs: make inode attribute forks a permanent part of struct xfs_inode · 2ed5b09b
      Darrick J. Wong 提交于
      Syzkaller reported a UAF bug a while back:
      
      ==================================================================
      BUG: KASAN: use-after-free in xfs_ilock_attr_map_shared+0xe3/0xf6 fs/xfs/xfs_inode.c:127
      Read of size 4 at addr ffff88802cec919c by task syz-executor262/2958
      
      CPU: 2 PID: 2958 Comm: syz-executor262 Not tainted
      5.15.0-0.30.3-20220406_1406 #3
      Hardware name: Red Hat KVM, BIOS 1.13.0-2.module+el8.3.0+7860+a7792d29
      04/01/2014
      Call Trace:
       <TASK>
       __dump_stack lib/dump_stack.c:88 [inline]
       dump_stack_lvl+0x82/0xa9 lib/dump_stack.c:106
       print_address_description.constprop.9+0x21/0x2d5 mm/kasan/report.c:256
       __kasan_report mm/kasan/report.c:442 [inline]
       kasan_report.cold.14+0x7f/0x11b mm/kasan/report.c:459
       xfs_ilock_attr_map_shared+0xe3/0xf6 fs/xfs/xfs_inode.c:127
       xfs_attr_get+0x378/0x4c2 fs/xfs/libxfs/xfs_attr.c:159
       xfs_xattr_get+0xe3/0x150 fs/xfs/xfs_xattr.c:36
       __vfs_getxattr+0xdf/0x13d fs/xattr.c:399
       cap_inode_need_killpriv+0x41/0x5d security/commoncap.c:300
       security_inode_need_killpriv+0x4c/0x97 security/security.c:1408
       dentry_needs_remove_privs.part.28+0x21/0x63 fs/inode.c:1912
       dentry_needs_remove_privs+0x80/0x9e fs/inode.c:1908
       do_truncate+0xc3/0x1e0 fs/open.c:56
       handle_truncate fs/namei.c:3084 [inline]
       do_open fs/namei.c:3432 [inline]
       path_openat+0x30ab/0x396d fs/namei.c:3561
       do_filp_open+0x1c4/0x290 fs/namei.c:3588
       do_sys_openat2+0x60d/0x98c fs/open.c:1212
       do_sys_open+0xcf/0x13c fs/open.c:1228
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x3a/0x7e arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0x0
      RIP: 0033:0x7f7ef4bb753d
      Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48
      89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73
      01 c3 48 8b 0d 1b 79 2c 00 f7 d8 64 89 01 48
      RSP: 002b:00007f7ef52c2ed8 EFLAGS: 00000246 ORIG_RAX: 0000000000000055
      RAX: ffffffffffffffda RBX: 0000000000404148 RCX: 00007f7ef4bb753d
      RDX: 00007f7ef4bb753d RSI: 0000000000000000 RDI: 0000000020004fc0
      RBP: 0000000000404140 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000246 R12: 0030656c69662f2e
      R13: 00007ffd794db37f R14: 00007ffd794db470 R15: 00007f7ef52c2fc0
       </TASK>
      
      Allocated by task 2953:
       kasan_save_stack+0x19/0x38 mm/kasan/common.c:38
       kasan_set_track mm/kasan/common.c:46 [inline]
       set_alloc_info mm/kasan/common.c:434 [inline]
       __kasan_slab_alloc+0x68/0x7c mm/kasan/common.c:467
       kasan_slab_alloc include/linux/kasan.h:254 [inline]
       slab_post_alloc_hook mm/slab.h:519 [inline]
       slab_alloc_node mm/slub.c:3213 [inline]
       slab_alloc mm/slub.c:3221 [inline]
       kmem_cache_alloc+0x11b/0x3eb mm/slub.c:3226
       kmem_cache_zalloc include/linux/slab.h:711 [inline]
       xfs_ifork_alloc+0x25/0xa2 fs/xfs/libxfs/xfs_inode_fork.c:287
       xfs_bmap_add_attrfork+0x3f2/0x9b1 fs/xfs/libxfs/xfs_bmap.c:1098
       xfs_attr_set+0xe38/0x12a7 fs/xfs/libxfs/xfs_attr.c:746
       xfs_xattr_set+0xeb/0x1a9 fs/xfs/xfs_xattr.c:59
       __vfs_setxattr+0x11b/0x177 fs/xattr.c:180
       __vfs_setxattr_noperm+0x128/0x5e0 fs/xattr.c:214
       __vfs_setxattr_locked+0x1d4/0x258 fs/xattr.c:275
       vfs_setxattr+0x154/0x33d fs/xattr.c:301
       setxattr+0x216/0x29f fs/xattr.c:575
       __do_sys_fsetxattr fs/xattr.c:632 [inline]
       __se_sys_fsetxattr fs/xattr.c:621 [inline]
       __x64_sys_fsetxattr+0x243/0x2fe fs/xattr.c:621
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x3a/0x7e arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0x0
      
      Freed by task 2949:
       kasan_save_stack+0x19/0x38 mm/kasan/common.c:38
       kasan_set_track+0x1c/0x21 mm/kasan/common.c:46
       kasan_set_free_info+0x20/0x30 mm/kasan/generic.c:360
       ____kasan_slab_free mm/kasan/common.c:366 [inline]
       ____kasan_slab_free mm/kasan/common.c:328 [inline]
       __kasan_slab_free+0xe2/0x10e mm/kasan/common.c:374
       kasan_slab_free include/linux/kasan.h:230 [inline]
       slab_free_hook mm/slub.c:1700 [inline]
       slab_free_freelist_hook mm/slub.c:1726 [inline]
       slab_free mm/slub.c:3492 [inline]
       kmem_cache_free+0xdc/0x3ce mm/slub.c:3508
       xfs_attr_fork_remove+0x8d/0x132 fs/xfs/libxfs/xfs_attr_leaf.c:773
       xfs_attr_sf_removename+0x5dd/0x6cb fs/xfs/libxfs/xfs_attr_leaf.c:822
       xfs_attr_remove_iter+0x68c/0x805 fs/xfs/libxfs/xfs_attr.c:1413
       xfs_attr_remove_args+0xb1/0x10d fs/xfs/libxfs/xfs_attr.c:684
       xfs_attr_set+0xf1e/0x12a7 fs/xfs/libxfs/xfs_attr.c:802
       xfs_xattr_set+0xeb/0x1a9 fs/xfs/xfs_xattr.c:59
       __vfs_removexattr+0x106/0x16a fs/xattr.c:468
       cap_inode_killpriv+0x24/0x47 security/commoncap.c:324
       security_inode_killpriv+0x54/0xa1 security/security.c:1414
       setattr_prepare+0x1a6/0x897 fs/attr.c:146
       xfs_vn_change_ok+0x111/0x15e fs/xfs/xfs_iops.c:682
       xfs_vn_setattr_size+0x5f/0x15a fs/xfs/xfs_iops.c:1065
       xfs_vn_setattr+0x125/0x2ad fs/xfs/xfs_iops.c:1093
       notify_change+0xae5/0x10a1 fs/attr.c:410
       do_truncate+0x134/0x1e0 fs/open.c:64
       handle_truncate fs/namei.c:3084 [inline]
       do_open fs/namei.c:3432 [inline]
       path_openat+0x30ab/0x396d fs/namei.c:3561
       do_filp_open+0x1c4/0x290 fs/namei.c:3588
       do_sys_openat2+0x60d/0x98c fs/open.c:1212
       do_sys_open+0xcf/0x13c fs/open.c:1228
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x3a/0x7e arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0x0
      
      The buggy address belongs to the object at ffff88802cec9188
       which belongs to the cache xfs_ifork of size 40
      The buggy address is located 20 bytes inside of
       40-byte region [ffff88802cec9188, ffff88802cec91b0)
      The buggy address belongs to the page:
      page:00000000c3af36a1 refcount:1 mapcount:0 mapping:0000000000000000
      index:0x0 pfn:0x2cec9
      flags: 0xfffffc0000200(slab|node=0|zone=1|lastcpupid=0x1fffff)
      raw: 000fffffc0000200 ffffea00009d2580 0000000600000006 ffff88801a9ffc80
      raw: 0000000000000000 0000000080490049 00000001ffffffff 0000000000000000
      page dumped because: kasan: bad access detected
      
      Memory state around the buggy address:
       ffff88802cec9080: fb fb fb fc fc fa fb fb fb fb fc fc fb fb fb fb
       ffff88802cec9100: fb fc fc fb fb fb fb fb fc fc fb fb fb fb fb fc
      >ffff88802cec9180: fc fa fb fb fb fb fc fc fa fb fb fb fb fc fc fb
                                  ^
       ffff88802cec9200: fb fb fb fb fc fc fb fb fb fb fb fc fc fb fb fb
       ffff88802cec9280: fb fb fc fc fa fb fb fb fb fc fc fa fb fb fb fb
      ==================================================================
      
      The root cause of this bug is the unlocked access to xfs_inode.i_afp
      from the getxattr code paths while trying to determine which ILOCK mode
      to use to stabilize the xattr data.  Unfortunately, the VFS does not
      acquire i_rwsem when vfs_getxattr (or listxattr) call into the
      filesystem, which means that getxattr can race with a removexattr that's
      tearing down the attr fork and crash:
      
      xfs_attr_set:                          xfs_attr_get:
      xfs_attr_fork_remove:                  xfs_ilock_attr_map_shared:
      
      xfs_idestroy_fork(ip->i_afp);
      kmem_cache_free(xfs_ifork_cache, ip->i_afp);
      
                                             if (ip->i_afp &&
      
      ip->i_afp = NULL;
      
                                                 xfs_need_iread_extents(ip->i_afp))
                                             <KABOOM>
      
      ip->i_forkoff = 0;
      
      Regrettably, the VFS is much more lax about i_rwsem and getxattr than
      is immediately obvious -- not only does it not guarantee that we hold
      i_rwsem, it actually doesn't guarantee that we *don't* hold it either.
      The getxattr system call won't acquire the lock before calling XFS, but
      the file capabilities code calls getxattr with and without i_rwsem held
      to determine if the "security.capabilities" xattr is set on the file.
      
      Fixing the VFS locking requires a treewide investigation into every code
      path that could touch an xattr and what i_rwsem state it expects or sets
      up.  That could take years or even prove impossible; fortunately, we
      can fix this UAF problem inside XFS.
      
      An earlier version of this patch used smp_wmb in xfs_attr_fork_remove to
      ensure that i_forkoff is always zeroed before i_afp is set to null and
      changed the read paths to use smp_rmb before accessing i_forkoff and
      i_afp, which avoided these UAF problems.  However, the patch author was
      too busy dealing with other problems in the meantime, and by the time he
      came back to this issue, the situation had changed a bit.
      
      On a modern system with selinux, each inode will always have at least
      one xattr for the selinux label, so it doesn't make much sense to keep
      incurring the extra pointer dereference.  Furthermore, Allison's
      upcoming parent pointer patchset will also cause nearly every inode in
      the filesystem to have extended attributes.  Therefore, make the inode
      attribute fork structure part of struct xfs_inode, at a cost of 40 more
      bytes.
      
      This patch adds a clunky if_present field where necessary to maintain
      the existing logic of xattr fork null pointer testing in the existing
      codebase.  The next patch switches the logic over to XFS_IFORK_Q and it
      all goes away.
      Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      2ed5b09b
    • D
      xfs: convert XFS_IFORK_PTR to a static inline helper · 732436ef
      Darrick J. Wong 提交于
      We're about to make this logic do a bit more, so convert the macro to a
      static inline function for better typechecking and fewer shouty macros.
      No functional changes here.
      Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      732436ef
    • A
      xfs: removed useless condition in function xfs_attr_node_get · 0f38063d
      Andrey Strachuk 提交于
      At line 1561, variable "state" is being compared
      with NULL every loop iteration.
      
      -------------------------------------------------------------------
      1561	for (i = 0; state != NULL && i < state->path.active; i++) {
      1562		xfs_trans_brelse(args->trans, state->path.blk[i].bp);
      1563		state->path.blk[i].bp = NULL;
      1564	}
      -------------------------------------------------------------------
      
      However, it cannot be NULL.
      
      ----------------------------------------
      1546	state = xfs_da_state_alloc(args);
      ----------------------------------------
      
      xfs_da_state_alloc calls kmem_cache_zalloc. kmem_cache_zalloc is
      called with __GFP_NOFAIL flag and, therefore, it cannot return NULL.
      
      --------------------------------------------------------------------------
      	struct xfs_da_state *
      	xfs_da_state_alloc(
      	struct xfs_da_args	*args)
      	{
      		struct xfs_da_state	*state;
      
      		state = kmem_cache_zalloc(xfs_da_state_cache, GFP_NOFS | __GFP_NOFAIL);
      		state->args = args;
      		state->mp = args->dp->i_mount;
      		return state;
      	}
      --------------------------------------------------------------------------
      
      Found by Linux Verification Center (linuxtesting.org) with SVACE.
      Signed-off-by: NAndrey Strachuk <strochuk@ispras.ru>
      
      Fixes: 4d0cdd2b ("xfs: clean up xfs_attr_node_hasname")
      Reviewed-by: NDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
      0f38063d
  15. 07 7月, 2022 2 次提交