1. 19 Feb, 2023 2 commits
  2. 09 Dec, 2022 11 commits
  3. 01 Dec, 2022 1 commit
    • B
      ext4: add inode table check in __ext4_get_inode_loc to aovid possible infinite loop · eee22187
      Committed by Baokun Li
      In do_writepages, if the value returned by ext4_writepages is "-ENOMEM"
      and "wbc->sync_mode == WB_SYNC_ALL", retry until the condition is not met.
      
      In __ext4_get_inode_loc, if the bh returned by sb_getblk is NULL,
      the function returns -ENOMEM.
      
      In __getblk_slow, if the return value of grow_buffers is less than 0,
      the function returns NULL.
      
      When the three processes are connected in series like the following stack,
      an infinite loop may occur:
      
      do_writepages					<--- keep retrying
       ext4_writepages
        mpage_map_and_submit_extent
         mpage_map_one_extent
          ext4_map_blocks
           ext4_ext_map_blocks
            ext4_ext_handle_unwritten_extents
             ext4_ext_convert_to_initialized
              ext4_split_extent
               ext4_split_extent_at
                __ext4_ext_dirty
                 __ext4_mark_inode_dirty
                  ext4_reserve_inode_write
                   ext4_get_inode_loc
                    __ext4_get_inode_loc		<--- return -ENOMEM
                     sb_getblk
                      __getblk_gfp
                       __getblk_slow			<--- return NULL
                        grow_buffers
                         grow_dev_page		<--- return -ENXIO
                          ret = (block < end_block) ? 1 : -ENXIO;
      
      In this issue, bg_inode_table_hi is overwritten as an incorrect value.
      As a result, `block < end_block` cannot be met in grow_dev_page.
      Therefore, __ext4_get_inode_loc always returns '-ENOMEM' and do_writepages
      keeps retrying. As a result, the writeback process is in the D state due
      to an infinite loop.
      
      Add a check on inode table block in the __ext4_get_inode_loc function by
      referring to ext4_read_inode_bitmap to avoid this infinite loop.
      
      Cc: stable@kernel.org
      Signed-off-by: Baokun Li <libaokun1@huawei.com>
      Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
      Link: https://lore.kernel.org/r/20220817132701.3015912-3-libaokun1@huawei.com
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      eee22187
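The retry behaviour described in this commit can be modelled in user space. The following is a minimal sketch, not kernel code: `toy_sb`, `toy_get_inode_loc`, `toy_writepages_retry` and `TOY_EFSCORRUPTED` are hypothetical names invented for illustration, and the range check only approximates the idea of validating the inode table block before the buffer lookup.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel-internal "filesystem corrupted" error. */
#define TOY_EFSCORRUPTED 117

/* Hypothetical subset of superblock fields needed for the range check. */
struct toy_sb {
    uint64_t first_data_block; /* first block usable for data/metadata */
    uint64_t blocks_count;     /* total number of blocks on the device */
};

/*
 * Model of the inode location lookup. A block past the end of the device
 * can never be mapped, so the buffer lookup fails and the caller sees
 * -ENOMEM. With check enabled, the bogus block is rejected up front with
 * a permanent error instead.
 */
static int toy_get_inode_loc(const struct toy_sb *sb, uint64_t itable_block,
                             bool check)
{
    if (check && (itable_block <= sb->first_data_block ||
                  itable_block >= sb->blocks_count))
        return -TOY_EFSCORRUPTED;
    if (itable_block >= sb->blocks_count)
        return -ENOMEM; /* models sb_getblk() returning NULL */
    return 0;
}

/*
 * Model of the writeback retry loop: retry for as long as the error is
 * -ENOMEM. max_tries bounds the loop so the model terminates; the
 * scenario in the commit message has no such bound. Returns the number
 * of attempts made and stores the last error in *err.
 */
static int toy_writepages_retry(const struct toy_sb *sb, uint64_t itable_block,
                                bool check, int max_tries, int *err)
{
    int tries = 0;

    do {
        *err = toy_get_inode_loc(sb, itable_block, check);
        tries++;
    } while (*err == -ENOMEM && tries < max_tries);
    return tries;
}
```

With an out-of-range inode table block and the check disabled, the model uses up every allowed attempt. With the check enabled it stops after one attempt with a non-retryable error.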
  4. 30 Nov, 2022 1 commit
  5. 29 Nov, 2022 1 commit
    • Z
      ext4: silence the warning when evicting inode with dioread_nolock · bc12ac98
      Committed by Zhang Yi
      When evicting an inode with the default dioread_nolock, eviction can race
      with the kworker that converts unwritten extents after newly allocated
      dirty blocks have been written back. The kworker converts unwritten
      extents to written; the extents may then be merged into the upper level
      and extent blocks freed, so the inode can be marked dirty again even
      though it has already been marked I_FREEING. The inode->i_io_list check
      and warning in ext4_evict_inode() miss this corner case. Fortunately,
      ext4_evict_inode() waits for all extent conversions to finish before this
      check, so this does not lead to an inode use-after-free; everything is
      fine apart from the warning. The WARN_ON_ONCE was originally meant to
      catch inode use-after-free issues early, but once the dioread_nolock case
      is accounted for it is no longer very useful, so fix the warning by
      simply removing the check.
      
       ======
       WARNING: CPU: 7 PID: 1092 at fs/ext4/inode.c:227
       ext4_evict_inode+0x875/0xc60
       ...
       RIP: 0010:ext4_evict_inode+0x875/0xc60
       ...
       Call Trace:
        <TASK>
        evict+0x11c/0x2b0
        iput+0x236/0x3a0
        do_unlinkat+0x1b4/0x490
        __x64_sys_unlinkat+0x4c/0xb0
        do_syscall_64+0x3b/0x90
        entry_SYSCALL_64_after_hwframe+0x46/0xb0
       RIP: 0033:0x7fa933c1115b
       ======
      
      rm                          kworker
                                  ext4_end_io_end()
      vfs_unlink()
       ext4_unlink()
                                   ext4_convert_unwritten_io_end_vec()
                                    ext4_convert_unwritten_extents()
                                     ext4_map_blocks()
                                      ext4_ext_map_blocks()
                                       ext4_ext_try_to_merge_up()
                                        __mark_inode_dirty()
                                         check !I_FREEING
                                         locked_inode_to_wb_and_lock_list()
       iput()
        iput_final()
         evict()
          ext4_evict_inode()
           truncate_inode_pages_final() //wait release io_end
                                          inode_io_list_move_locked()
                                   ext4_release_io_end()
           trigger WARN_ON_ONCE()
      
      Cc: stable@kernel.org
      Fixes: ceff86fd ("ext4: Avoid freeing inodes on dirty list")
      Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Link: https://lore.kernel.org/r/20220629112647.4141034-1-yi.zhang@huawei.com
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      bc12ac98
  6. 19 Oct, 2022 1 commit
    • C
      fs: pass dentry to set acl method · 138060ba
      Committed by Christian Brauner
      The current way of setting and getting posix acls through the generic
      xattr interface is error prone and type unsafe. The vfs needs to
      interpret and fixup posix acls before storing or reporting it to
      userspace. Various hacks exist to make this work. The code is hard to
      understand and difficult to maintain in its current form. Instead of
      making this work by hacking posix acls through xattr handlers we are
      building a dedicated posix acl api around the get and set inode
      operations. This removes a lot of hackiness and makes the codepaths
      easier to maintain. A lot of background can be found in [1].
      
      Since some filesystems rely on the dentry being available to them when
      setting posix acls (e.g., 9p and cifs) they cannot rely on set acl inode
      operation. But since ->set_acl() is required in order to use the generic
      posix acl xattr handlers filesystems that do not implement this inode
      operation cannot use the handler and need to implement their own
      dedicated posix acl handlers.
      
      Update the ->set_acl() inode method to take a dentry argument. This
      allows all filesystems to rely on ->set_acl().
      
      As far as I can tell all codepaths can be switched to rely on the dentry
      instead of just the inode. Note that the original motivation for passing
      the dentry separate from the inode instead of just the dentry in the
      xattr handlers was because of security modules that call
      security_d_instantiate(). This hook is called during
      d_instantiate_new(), d_add(), __d_instantiate_anon(), and
      d_splice_alias() to initialize the inode's security context and possibly
      to set security.* xattrs. Since this only affects security.* xattrs this
      is completely irrelevant for posix acls.
      
      Link: https://lore.kernel.org/all/20220801145520.1532837-1-brauner@kernel.org [1]
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
      138060ba
  7. 01 Oct, 2022 3 commits
    • J
      ext4: fix i_version handling in ext4 · a642c2c0
      Committed by Jeff Layton
      ext4 currently updates the i_version counter when the atime is updated
      during a read. This is less than ideal as it can cause unnecessary cache
      invalidations with NFSv4 and unnecessary remeasurements for IMA.
      
      The increment in ext4_mark_iloc_dirty is also problematic since it can
      corrupt the i_version counter for ea_inodes. We aren't bumping the file
      times in ext4_mark_iloc_dirty, so changing the i_version there seems
      wrong, and is the cause of both problems.
      
      Remove that callsite and add increments to the setattr, setxattr and
      ioctl codepaths, at the same times that we update the ctime. The
      i_version bump that already happens during timestamp updates should take
      care of the rest.
      
      In ext4_move_extents, increment the i_version on both inodes, and also
      add in missing ctime updates.
      
      [ Some minor updates since we've already enabled the i_version counter
        unconditionally already via another patch series. -- TYT ]
      
      Cc: stable@kernel.org
      Cc: Lukas Czerner <lczerner@redhat.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
      Signed-off-by: Jeff Layton <jlayton@kernel.org>
      Link: https://lore.kernel.org/r/20220908172448.208585-3-jlayton@kernel.org
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      a642c2c0
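The rule this commit describes (bump i_version together with ctime, leave it alone on atime-only updates) can be sketched as follows. This is a simplified user-space model under stated assumptions: `toy_inode`, `toy_touch_atime` and `toy_change_attr` are hypothetical names, not the ext4 implementation.

```c
#include <stdint.h>

/* Hypothetical, stripped-down inode with only the fields of interest. */
struct toy_inode {
    uint64_t i_version; /* change counter observed by NFSv4 / IMA */
    int64_t  atime;     /* last access time */
    int64_t  ctime;     /* last inode change time */
};

/* Read path: only atime moves, so the change counter stays untouched. */
static void toy_touch_atime(struct toy_inode *inode, int64_t now)
{
    inode->atime = now;
}

/*
 * setattr/setxattr/ioctl-style change: ctime and the change counter are
 * updated together, so observers see exactly one bump per real change.
 */
static void toy_change_attr(struct toy_inode *inode, int64_t now)
{
    inode->ctime = now;
    inode->i_version++;
}
```

In this model a reader can touch atime any number of times without invalidating a cache keyed on `i_version`, while every attribute change still produces a new counter value.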
    • J
      ext4: place buffer head allocation before handle start · d1052d23
      Committed by Jinke Han
      In our production environment, we see jbd hang waiting for handles to
      stop while several writers are doing memory reclaim for buffer head
      allocation in the delayed-allocation write path. Ext4 allocates buffer
      heads while holding a transaction handle, which may be blocked for too
      long if reclaim does not go smoothly. According to our bcc trace, the
      reclaim time in buffer head allocation can reach 258s, and the jbd
      transaction commit takes almost as long. Apart from these extreme cases,
      we often see delays of several seconds for cgroup memory reclaim on our
      servers. This is more likely to happen in Docker environments.
      
      Note that buffer heads are allocated as often as pages, or more often
      when the block size is smaller than the page size. Just as with page
      cache allocation, we should place the buffer head allocation before
      starting the handle.
      
      Cc: stable@kernel.org
      Signed-off-by: Jinke Han <hanjinke.666@bytedance.com>
      Link: https://lore.kernel.org/r/20220903012429.22555-1-hanjinke.666@bytedance.com
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      d1052d23
    • J
      ext4: unconditionally enable the i_version counter · 1ff20307
      Committed by Jeff Layton
      The original i_version implementation was pretty expensive, requiring a
      log flush on every change. Because of this, it was gated behind a mount
      option (implemented via the MS_I_VERSION mount option flag).
      
      Commit ae5e165d (fs: new API for handling inode->i_version) made the
      i_version flag much less expensive, so there is no longer a performance
      penalty from enabling it. xfs and btrfs already enable it
      unconditionally when the on-disk format can support it.
      
      Have ext4 ignore the SB_I_VERSION flag, and just enable it
      unconditionally.  While we're in here, mark the i_version mount
      option Opt_removed.
      
      [ Removed leftover bits of i_version from ext4_apply_options() since it
        now can't ever be set in ctx->mask_s_flags -- lczerner ]
      
      Cc: stable@kernel.org
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Benjamin Coddington <bcodding@redhat.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Jeff Layton <jlayton@kernel.org>
      Signed-off-by: Lukas Czerner <lczerner@redhat.com>
      Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Link: https://lore.kernel.org/r/20220824160349.39664-3-lczerner@redhat.com
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      1ff20307
  8. 30 Sep, 2022 1 commit
  9. 12 Sep, 2022 1 commit
  10. 03 Aug, 2022 6 commits
  11. 29 Jun, 2022 2 commits
  12. 27 Jun, 2022 3 commits
    • C
      attr: port attribute changes to new types · b27c82e1
      Committed by Christian Brauner
      Now that we have introduced new infrastructure to increase type safety
      for filesystems supporting idmapped mounts, port the first part of the
      vfs over to it.
      
      This ports the attribute changes codepaths to rely on the new better
      helpers using a dedicated type.
      
      Before this change we used to take a shortcut and place the actual
      values that would be written to inode->i_{g,u}id into struct iattr. This
      had the advantage that we moved idmappings mostly out of the picture
      early on but it made reasoning about changes more difficult than it
      should be.
      
      The filesystem was never explicitly told that it dealt with an idmapped
      mount. The transition to the value that needed to be stored in
      inode->i_{g,u}id appeared way too early and increased the probability of
      bugs in various codepaths.
      
      We now place the same value in struct iattr no matter if this is an
      idmapped mount or not. The vfs will only deal with type safe
      vfs{g,u}id_t. This makes it massively safer to perform permission checks
      as the type will tell us what checks we need to perform and what helpers
      we need to use.
      
      Filesystems raising FS_ALLOW_IDMAP can't simply write ia_vfs{g,u}id to
      inode->i_{g,u}id since they are different types. Instead they need to
      use the dedicated vfs{g,u}id_to_k{g,u}id() helpers that map the
      vfs{g,u}id into the filesystem.
      
      The other nice effect is that filesystems like overlayfs don't need to
      care about idmappings explicitly anymore and can simply set up struct
      iattr accordingly directly.
      
      Link: https://lore.kernel.org/lkml/CAHk-=win6+ahs1EwLkcq8apqLi_1wXFWbrPf340zYEhObpz4jA@mail.gmail.com [1]
      Link: https://lore.kernel.org/r/20220621141454.2914719-9-brauner@kernel.org
      Cc: Seth Forshee <sforshee@digitalocean.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Aleksa Sarai <cyphar@cyphar.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      CC: linux-fsdevel@vger.kernel.org
      Reviewed-by: Seth Forshee <sforshee@digitalocean.com>
      Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
      b27c82e1
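The type-safety argument in this commit can be illustrated with distinct wrapper structs, so the compiler rejects a direct assignment between the two id kinds. This is a sketch under stated assumptions: `toy_kuid_t`, `toy_vfsuid_t`, `toy_idmap` and `toy_vfsuid_to_kuid` are hypothetical simplifications and do not reproduce the kernel's types or helper signatures.

```c
#include <stdint.h>

/* Two distinct wrapper types: mixing them up is a compile-time error. */
typedef struct { uint32_t val; } toy_kuid_t;   /* id as stored in the inode */
typedef struct { uint32_t val; } toy_vfsuid_t; /* id as seen through the mount */

#define TOY_INVALID_ID UINT32_MAX

/*
 * Hypothetical single-range idmapping: ids [mnt_first, mnt_first + count)
 * seen through the mount correspond to [fs_first, fs_first + count) in
 * the filesystem.
 */
struct toy_idmap {
    uint32_t fs_first;
    uint32_t mnt_first;
    uint32_t count;
};

/*
 * The only way to turn a mount-level id into a filesystem-level id is an
 * explicit helper call, which makes the point of translation visible.
 * Ids outside the mapped range yield TOY_INVALID_ID.
 */
static toy_kuid_t toy_vfsuid_to_kuid(const struct toy_idmap *map,
                                     toy_vfsuid_t vfsuid)
{
    toy_kuid_t kuid = { TOY_INVALID_ID };

    if (vfsuid.val >= map->mnt_first &&
        vfsuid.val - map->mnt_first < map->count)
        kuid.val = map->fs_first + (vfsuid.val - map->mnt_first);
    return kuid;
}
```

Because `toy_kuid_t` and `toy_vfsuid_t` are different struct types, a statement such as `toy_kuid_t k = some_vfsuid;` fails to compile, which is the property the commit message relies on.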
    • C
      quota: port quota helpers mount ids · 71e7b535
      Committed by Christian Brauner
      Port the is_quota_modification() and dquot_transfer() helpers to type
      safe vfs{g,u}id_t. Since these helpers are only called by a few
      filesystems, don't introduce a new helper but simply extend the existing
      helpers to pass down the mount's idmapping.
      
      Note that this is a non-functional change, i.e. nothing changes here or
      at the end of this series in how quotas are handled! This change is
      necessary because at the end of this series we will make ownership
      changes easier to reason about by keeping the original value
      in struct iattr for both non-idmapped and idmapped mounts.
      
      For now we always pass the initial idmapping which makes the idmapping
      functions these helpers call nops.
      
      This is done because we currently always pass the actual value to be
      written to i_{g,u}id via struct iattr. While this allowed us to treat
      the {g,u}id values in struct iattr as values that can be directly
      written to inode->i_{g,u}id it also increases the potential for
      confusion for filesystems.
      
      Now that we have dedicated types to prevent this confusion we will
      ultimately only map the value from the idmapped mount into a filesystem
      value that can be written to inode->i_{g,u}id when the filesystem
      actually updates the inode. So pass down the initial idmapping until we
      have finished that conversion, at which point we pass down the mount's
      idmapping.
      
      Since struct iattr uses an anonymous union with overlapping types as
      supported by the C standard, filesystems that haven't converted to
      ia_vfs{g,u}id won't see any difference and things will continue to work
      as before. In other words, no functional changes intended with this
      change.
      
      Link: https://lore.kernel.org/r/20220621141454.2914719-7-brauner@kernel.org
      Cc: Seth Forshee <sforshee@digitalocean.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Aleksa Sarai <cyphar@cyphar.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      CC: linux-fsdevel@vger.kernel.org
      Reviewed-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Seth Forshee <sforshee@digitalocean.com>
      Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
      71e7b535
    • C
      fs: port to iattr ownership update helpers · 35faf310
      Committed by Christian Brauner
      Earlier we introduced new helpers to abstract ownership update and
      remove code duplication. This converts all filesystems supporting
      idmapped mounts to make use of these new helpers.
      
      For now we always pass the initial idmapping which makes the idmapping
      functions these helpers call nops.
      
      This is done because we currently always pass the actual value to be
      written to i_{g,u}id via struct iattr. While this allowed us to treat
      the {g,u}id values in struct iattr as values that can be directly
      written to inode->i_{g,u}id it also increases the potential for
      confusion for filesystems.
      
      Now that we have dedicated types to prevent this confusion we will
      ultimately only map the value from the idmapped mount into a filesystem
      value that can be written to inode->i_{g,u}id when the filesystem
      actually updates the inode. So pass down the initial idmapping until we
      have finished that conversion, at which point we pass down the mount's
      idmapping.
      
      No functional changes intended.
      
      Link: https://lore.kernel.org/r/20220621141454.2914719-6-brauner@kernel.org
      Cc: Seth Forshee <sforshee@digitalocean.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Aleksa Sarai <cyphar@cyphar.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      CC: linux-fsdevel@vger.kernel.org
      Reviewed-by: Seth Forshee <sforshee@digitalocean.com>
      Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
      35faf310
  13. 17 Jun, 2022 1 commit
    • J
      ext4: improve write performance with disabled delalloc · 8d5459c1
      Committed by Jan Kara
      When delayed allocation is disabled (either through mount option or
      because we are running low on free space), ext4_write_begin() allocates
      blocks with EXT4_GET_BLOCKS_IO_CREATE_EXT flag. With this flag extent
      merging is disabled and since ext4_write_begin() is called for each page
      separately, we end up with a *lot* of 1 block extents in the extent tree
      and following writeback is writing 1 block at a time which results in
      very poor write throughput (4 MB/s instead of 200 MB/s). These days when
      ext4_get_block_unwritten() is used only by ext4_write_begin(),
      ext4_page_mkwrite() and inline data conversion, we can safely allow
      extent merging to happen from these paths since following writeback will
      happen on different boundaries anyway. So use
      EXT4_GET_BLOCKS_CREATE_UNWRIT_EXT instead which restores the performance.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Link: https://lore.kernel.org/r/20220520111402.4252-1-jack@suse.cz
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      8d5459c1
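The effect of disabling extent merging when blocks are allocated one at a time can be shown with a toy extent list. This is an illustrative sketch only: `toy_extent`, `toy_tree` and `toy_add_block` are hypothetical names, and the flat array stands in for the real extent tree.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_EXTENTS 64

/* One extent: len blocks starting at logical lblk, physical pblk. */
struct toy_extent {
    uint32_t lblk;
    uint32_t len;
    uint64_t pblk;
};

/* Flat, append-only stand-in for an extent tree. */
struct toy_tree {
    struct toy_extent ext[TOY_MAX_EXTENTS];
    size_t n;
};

/*
 * Record one newly allocated block. When merging is allowed and the block
 * is both logically and physically contiguous with the last extent, the
 * extent is grown; otherwise a new one-block extent is appended.
 * Returns false if the toy tree is full.
 */
static bool toy_add_block(struct toy_tree *t, uint32_t lblk, uint64_t pblk,
                          bool allow_merge)
{
    if (allow_merge && t->n > 0) {
        struct toy_extent *last = &t->ext[t->n - 1];

        if (last->lblk + last->len == lblk &&
            last->pblk + last->len == pblk) {
            last->len++;
            return true;
        }
    }
    if (t->n == TOY_MAX_EXTENTS)
        return false;
    t->ext[t->n++] = (struct toy_extent){ lblk, 1, pblk };
    return true;
}
```

Adding 16 contiguous blocks one call at a time yields 16 one-block extents without merging and a single 16-block extent with it, which mirrors why per-page allocation without merging leads to one-block writeback.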
  14. 18 May, 2022 4 commits
    • Y
      ext4: remove duplicated #include of dax.h in inode.c · b10b6278
      Committed by Yang Li
      Fix the following includecheck warning:
      ./fs/ext4/inode.c: linux/dax.h is included more than once.
      Reported-by: Abaci Robot <abaci@linux.alibaba.com>
      Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
      Link: https://lore.kernel.org/r/20220504225025.44753-1-yang.lee@linux.alibaba.com
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      b10b6278
    • B
      ext4: fix race condition between ext4_write and ext4_convert_inline_data · f87c7a4b
      Committed by Baokun Li
      Hulk Robot reported a BUG_ON:
       ==================================================================
       EXT4-fs error (device loop3): ext4_mb_generate_buddy:805: group 0,
       block bitmap and bg descriptor inconsistent: 25 vs 31513 free clusters
       kernel BUG at fs/ext4/ext4_jbd2.c:53!
       invalid opcode: 0000 [#1] SMP KASAN PTI
       CPU: 0 PID: 25371 Comm: syz-executor.3 Not tainted 5.10.0+ #1
       RIP: 0010:ext4_put_nojournal fs/ext4/ext4_jbd2.c:53 [inline]
       RIP: 0010:__ext4_journal_stop+0x10e/0x110 fs/ext4/ext4_jbd2.c:116
       [...]
       Call Trace:
        ext4_write_inline_data_end+0x59a/0x730 fs/ext4/inline.c:795
        generic_perform_write+0x279/0x3c0 mm/filemap.c:3344
        ext4_buffered_write_iter+0x2e3/0x3d0 fs/ext4/file.c:270
        ext4_file_write_iter+0x30a/0x11c0 fs/ext4/file.c:520
        do_iter_readv_writev+0x339/0x3c0 fs/read_write.c:732
        do_iter_write+0x107/0x430 fs/read_write.c:861
        vfs_writev fs/read_write.c:934 [inline]
        do_pwritev+0x1e5/0x380 fs/read_write.c:1031
       [...]
       ==================================================================
      
      The above issue may happen as follows:
                 cpu1                     cpu2
      __________________________|__________________________
      do_pwritev
       vfs_writev
        do_iter_write
         ext4_file_write_iter
          ext4_buffered_write_iter
           generic_perform_write
            ext4_da_write_begin
                                 vfs_fallocate
                                  ext4_fallocate
                                   ext4_convert_inline_data
                                    ext4_convert_inline_data_nolock
                                     ext4_destroy_inline_data_nolock
                                      clear EXT4_STATE_MAY_INLINE_DATA
                                     ext4_map_blocks
                                      ext4_ext_map_blocks
                                       ext4_mb_new_blocks
                                        ext4_mb_regular_allocator
                                         ext4_mb_good_group_nolock
                                          ext4_mb_init_group
                                           ext4_mb_init_cache
                                            ext4_mb_generate_buddy  --> error
             ext4_test_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA)
                                      ext4_restore_inline_data
                                       set EXT4_STATE_MAY_INLINE_DATA
             ext4_block_write_begin
            ext4_da_write_end
             ext4_test_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA)
             ext4_write_inline_data_end
              handle=NULL
              ext4_journal_stop(handle)
               __ext4_journal_stop
                ext4_put_nojournal(handle)
                 ref_cnt = (unsigned long)handle
                 BUG_ON(ref_cnt == 0)  ---> BUG_ON
      
      The lock held by ext4_convert_inline_data is xattr_sem, but the lock
      held by generic_perform_write is i_rwsem. Therefore, the two paths can
      run concurrently.
      
      To solve the above issue, we add inode_lock() for ext4_convert_inline_data().
      At the same time, move ext4_convert_inline_data() in front of
      ext4_punch_hole() and remove the similar handling from ext4_punch_hole().
      
      Fixes: 0c8d414f ("ext4: let fallocate handle inline data correctly")
      Cc: stable@vger.kernel.org
      Reported-by: Hulk Robot <hulkci@huawei.com>
      Signed-off-by: Baokun Li <libaokun1@huawei.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Link: https://lore.kernel.org/r/20220428134031.4153381-1-libaokun1@huawei.com
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      f87c7a4b
    • Z
      ext4: convert symlink external data block mapping to bdev · 6493792d
      Committed by Zhang Yi
      A symlink's external data block is a kind of metadata block, and the
      page cache of almost all other ext4 metadata blocks (e.g. directory
      blocks, quota blocks...) already belongs to the bdev backing inode; the
      symlink is the exception. It essentially works in data=journal mode like
      a regular file's data block, probably to keep things simple for the
      generic VFS code that handles symlinks or for other historical reasons.
      However, the logic for creating the external data block in
      ext4_symlink() is complicated, and it is also confusing if the user does
      not want the filesystem to work in data=journal mode. This patch converts
      the final exceptional case and cleans things up by moving the mapping of
      the symlink's external data block to bdev, like any other metadata block.
      Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
      Link: https://lore.kernel.org/r/20220424140936.1898920-3-yi.zhang@huawei.com
      6493792d
    • Z
      ext4: add nowait mode for ext4_getblk() · 9558cf14
      Committed by Zhang Yi
      The current ext4_getblk() might sleep if some resources are not valid, or
      it could race with a concurrent extent-modifying procedure. So we
      cannot call ext4_getblk() and ext4_map_blocks() to map blocks in
      atomic context in some fast paths (e.g. the upcoming procedure of
      getting the symlink external block in RCU context), even if the mapped
      extents have already been checked and cached.
      Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
      Link: https://lore.kernel.org/r/20220424140936.1898920-2-yi.zhang@huawei.com
      9558cf14
  15. 12 May, 2022 1 commit
    • Y
      ext4: fix warning in ext4_handle_inode_extension · f4534c9f
      Committed by Ye Bin
      We got the following issue:
      EXT4-fs error (device loop0) in ext4_reserve_inode_write:5741: Out of memory
      EXT4-fs error (device loop0): ext4_setattr:5462: inode #13: comm syz-executor.0: mark_inode_dirty error
      EXT4-fs error (device loop0) in ext4_setattr:5519: Out of memory
      EXT4-fs error (device loop0): ext4_ind_map_blocks:595: inode #13: comm syz-executor.0: Can't allocate blocks for non-extent mapped inodes with bigalloc
      ------------[ cut here ]------------
      WARNING: CPU: 1 PID: 4361 at fs/ext4/file.c:301 ext4_file_write_iter+0x11c9/0x1220
      Modules linked in:
      CPU: 1 PID: 4361 Comm: syz-executor.0 Not tainted 5.10.0+ #1
      RIP: 0010:ext4_file_write_iter+0x11c9/0x1220
      RSP: 0018:ffff924d80b27c00 EFLAGS: 00010282
      RAX: ffffffff815a3379 RBX: 0000000000000000 RCX: 000000003b000000
      RDX: ffff924d81601000 RSI: 00000000000009cc RDI: 00000000000009cd
      RBP: 000000000000000d R08: ffffffffbc5a2c6b R09: 0000902e0e52a96f
      R10: ffff902e2b7c1b40 R11: ffff902e2b7c1b40 R12: 000000000000000a
      R13: 0000000000000001 R14: ffff902e0e52aa10 R15: ffffffffffffff8b
      FS:  00007f81a7f65700(0000) GS:ffff902e3bc80000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: ffffffffff600400 CR3: 000000012db88001 CR4: 00000000003706e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       do_iter_readv_writev+0x2e5/0x360
       do_iter_write+0x112/0x4c0
       do_pwritev+0x1e5/0x390
       __x64_sys_pwritev2+0x7e/0xa0
       do_syscall_64+0x37/0x50
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      The above issue may happen as follows:
      Assume
      inode.i_size=4096
      EXT4_I(inode)->i_disksize=4096
      
      step 1: set inode->i_size = 8192
      ext4_setattr
        if (attr->ia_size != inode->i_size)
          EXT4_I(inode)->i_disksize = attr->ia_size;
          rc = ext4_mark_inode_dirty
             ext4_reserve_inode_write
                ext4_get_inode_loc
                  __ext4_get_inode_loc
                    sb_getblk --> return -ENOMEM
         ...
         if (!error)  ->will not update i_size
           i_size_write(inode, attr->ia_size);
      Now:
      inode.i_size=4096
      EXT4_I(inode)->i_disksize=8192
      
      step 2: Direct write 4096 bytes
      ext4_file_write_iter
       ext4_dio_write_iter
         iomap_dio_rw ->return error
       if (extend)
         ext4_handle_inode_extension
           WARN_ON_ONCE(i_size_read(inode) < EXT4_I(inode)->i_disksize);
      ->Then trigger warning.
      
      To solve the above issue, if marking the inode dirty fails in ext4_setattr,
      restore 'EXT4_I(inode)->i_disksize' to its old value.
      Signed-off-by: Ye Bin <yebin10@huawei.com>
      Link: https://lore.kernel.org/r/20220326065351.761952-1-yebin10@huawei.com
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Cc: stable@kernel.org
      f4534c9f
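The two-step scenario in this commit, and the rollback that avoids it, can be modelled in a few lines. This is a user-space sketch with hypothetical names (`toy_inode`, `toy_mark_inode_dirty`, `toy_setattr_size`, `toy_sizes_consistent`); the failure of the dirtying step is injected through a flag rather than produced by a real allocation failure.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical inode carrying only the in-memory and on-disk sizes. */
struct toy_inode {
    int64_t i_size;     /* size as seen by the VFS */
    int64_t i_disksize; /* size recorded for the on-disk inode */
};

/* Stand-in for marking the inode dirty; fail injects an -ENOMEM error. */
static int toy_mark_inode_dirty(bool fail)
{
    return fail ? -ENOMEM : 0;
}

/*
 * Model of a size-extending setattr. i_disksize is updated first; if
 * dirtying the inode then fails, i_size is never updated. With rollback
 * enabled, i_disksize is restored so the two sizes stay consistent.
 */
static int toy_setattr_size(struct toy_inode *inode, int64_t new_size,
                            bool fail_dirty, bool rollback)
{
    int64_t old_disksize = inode->i_disksize;
    int rc;

    inode->i_disksize = new_size;
    rc = toy_mark_inode_dirty(fail_dirty);
    if (rc) {
        if (rollback)
            inode->i_disksize = old_disksize;
        return rc;
    }
    inode->i_size = new_size;
    return 0;
}

/* The invariant the later warning checks: i_size must not be below i_disksize. */
static bool toy_sizes_consistent(const struct toy_inode *inode)
{
    return inode->i_size >= inode->i_disksize;
}
```

Without the rollback, a failed call leaves `i_disksize` at 8192 while `i_size` stays at 4096, which is the state that trips the warning on the next extending write. With the rollback, both sizes stay at 4096.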
  16. 10 May, 2022 1 commit