1. 26 4月, 2019 4 次提交
    • G
      ext4: Support case-insensitive file name lookups · b886ee3e
      Gabriel Krisman Bertazi 提交于
      This patch implements the actual support for case-insensitive file name
      lookups in ext4, based on the feature bit and the encoding stored in the
      superblock.
      
      A filesystem that has the casefold feature set is able to configure
      directories with the +F (EXT4_CASEFOLD_FL) attribute, enabling lookups
      to succeed in that directory in a case-insensitive fashion, i.e: match
      a directory entry even if the name used by userspace is not a byte per
      byte match with the disk name, but is an equivalent case-insensitive
      version of the Unicode string.  This operation is called a
      case-insensitive file name lookup.
      
      The feature is configured as an inode attribute applied to directories
      and inherited by its children.  This attribute can only be enabled on
      empty directories for filesystems that support the encoding feature,
      thus preventing collision of file names that only differ by case.
      
      * dcache handling:
      
      For a +F directory, Ext4 only stores the first equivalent name dentry
      used in the dcache. This is done to prevent unintentional duplication of
      dentries in the dcache, while also allowing the VFS code to quickly find
      the right entry in the cache despite which equivalent string was used in
      a previous lookup, without having to resort to ->lookup().
      
      d_hash() of casefolded directories is implemented as the hash of the
      casefolded string, such that we always have a well-known bucket for all
      the equivalencies of the same string. d_compare() uses the
      utf8_strncasecmp() infrastructure, which handles the comparison of
      equivalent, same case, names as well.
      
      For now, negative lookups are not inserted in the dcache, since they
      would need to be invalidated anyway, because we can't trust missing file
      dentries.  This is bad for performance but requires some leveraging of
      the vfs layer to fix.  We can live without that for now, and so does
      everyone else.
      
      * on-disk data:
      
      Despite using a specific version of the name as the internal
      representation within the dcache, the name stored and fetched from the
      disk is a byte-per-byte match with what the user requested, making this
      implementation 'name-preserving'. i.e. no actual information is lost
      when writing to storage.
      
      DX is supported by modifying the hashes used in +F directories to make
      them case/encoding-aware.  The new disk hashes are calculated as the
      hash of the full casefolded string, instead of the string directly.
      This allows us to efficiently search for file names in the htree without
      requiring the user to provide an exact name.
      
      * Dealing with invalid sequences:
      
      By default, when a invalid UTF-8 sequence is identified, ext4 will treat
      it as an opaque byte sequence, ignoring the encoding and reverting to
      the old behavior for that unique file.  This means that case-insensitive
      file name lookup will not work only for that file.  An optional bit can
      be set in the superblock telling the filesystem code and userspace tools
      to enforce the encoding.  When that optional bit is set, any attempt to
      create a file name using an invalid UTF-8 sequence will fail and return
      an error to userspace.
      
      * Normalization algorithm:
      
      The UTF-8 algorithms used to compare strings in ext4 is implemented
      lives in fs/unicode, and is based on a previous version developed by
      SGI.  It implements the Canonical decomposition (NFD) algorithm
      described by the Unicode specification 12.1, or higher, combined with
      the elimination of ignorable code points (NFDi) and full
      case-folding (CF) as documented in fs/unicode/utf8_norm.c.
      
      NFD seems to be the best normalization method for EXT4 because:
      
        - It has a lower cost than NFC/NFKC (which requires
          decomposing to NFD as an intermediary step)
        - It doesn't eliminate important semantic meaning like
          compatibility decompositions.
      
      Although:
      
        - This implementation is not completely linguistic accurate, because
        different languages have conflicting rules, which would require the
        specialization of the filesystem to a given locale, which brings all
        sorts of problems for removable media and for users who use more than
        one language.
      Signed-off-by: NGabriel Krisman Bertazi <krisman@collabora.co.uk>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      b886ee3e
    • G
      ext4: include charset encoding information in the superblock · c83ad55e
      Gabriel Krisman Bertazi 提交于
      Support for encoding is considered an incompatible feature, since it has
      potential to create collisions of file names in existing filesystems.
      If the feature flag is not enabled, the entire filesystem will operate
      on opaque byte sequences, respecting the original behavior.
      
      The s_encoding field stores a magic number indicating the encoding
      format and version used globally by file and directory names in the
      filesystem.  The s_encoding_flags defines policies for using the charset
      encoding, like how to handle invalid sequences.  The magic number is
      mapped to the exact charset table, but the mapping is specific to ext4.
      Since we don't have any commitment to support old encodings, the only
      encoding I am supporting right now is utf8-12.1.0.
      
      The current implementation prevents the user from enabling encoding and
      per-directory encryption on the same filesystem at the same time.  The
      incompatibility between these features lies in how we do efficient
      directory searches when we cannot be sure the encryption of the user
      provided fname will match the actual hash stored in the disk without
      decrypting every directory entry, because of normalization cases.  My
      quickest solution is to simply block the concurrent use of these
      features for now, and enable it later, once we have a better solution.
      Signed-off-by: NGabriel Krisman Bertazi <krisman@collabora.co.uk>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      c83ad55e
    • K
      ext4: actually request zeroing of inode table after grow · 310a997f
      Kirill Tkhai 提交于
      It is never possible, that number of block groups decreases,
      since only online grow is supported.
      
      But after a growing occured, we have to zero inode tables
      for just created new block groups.
      
      Fixes: 19c5246d ("ext4: add new online resize interface")
      Signed-off-by: NKirill Tkhai <ktkhai@virtuozzo.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Cc: stable@kernel.org
      310a997f
    • K
      4b99faa2
  2. 25 4月, 2019 2 次提交
  3. 10 4月, 2019 2 次提交
  4. 08 4月, 2019 1 次提交
    • A
      ext4: use BUG() instead of BUG_ON(1) · 1e83bc81
      Arnd Bergmann 提交于
      BUG_ON(1) leads to bogus warnings from clang when
      CONFIG_PROFILE_ANNOTATED_BRANCHES is set:
      
       fs/ext4/inode.c:544:4: error: variable 'retval' is used uninitialized whenever 'if' condition is false
            [-Werror,-Wsometimes-uninitialized]
                              BUG_ON(1);
                              ^~~~~~~~~
       include/asm-generic/bug.h:61:36: note: expanded from macro 'BUG_ON'
                                         ^~~~~~~~~~~~~~~~~~~
       include/linux/compiler.h:48:23: note: expanded from macro 'unlikely'
                              ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
       fs/ext4/inode.c:591:6: note: uninitialized use occurs here
              if (retval > 0 && map->m_flags & EXT4_MAP_MAPPED) {
                  ^~~~~~
       fs/ext4/inode.c:544:4: note: remove the 'if' if its condition is always true
                              BUG_ON(1);
                              ^
       include/asm-generic/bug.h:61:32: note: expanded from macro 'BUG_ON'
                                     ^
       fs/ext4/inode.c:502:12: note: initialize the variable 'retval' to silence this warning
      
      Change it to BUG() so clang can see that this code path can never
      continue.
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Reviewed-by: NNick Desaulniers <ndesaulniers@google.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      1e83bc81
  5. 07 4月, 2019 3 次提交
  6. 24 3月, 2019 1 次提交
  7. 23 3月, 2019 2 次提交
    • Z
      ext4: cleanup bh release code in ext4_ind_remove_space() · 5e86bdda
      zhangyi (F) 提交于
      Currently, we are releasing the indirect buffer where we are done with
      it in ext4_ind_remove_space(), so we can see the brelse() and
      BUFFER_TRACE() everywhere.  It seems fragile and hard to read, and we
      may probably forget to release the buffer some day.  This patch cleans
      up the code by putting of the code which releases the buffers to the
      end of the function.
      Signed-off-by: Nzhangyi (F) <yi.zhang@huawei.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Reviewed-by: NJan Kara <jack@suse.cz>
      5e86bdda
    • Z
      ext4: brelse all indirect buffer in ext4_ind_remove_space() · 674a2b27
      zhangyi (F) 提交于
      All indirect buffers get by ext4_find_shared() should be released no
      mater the branch should be freed or not. But now, we forget to release
      the lower depth indirect buffers when removing space from the same
      higher depth indirect block. It will lead to buffer leak and futher
      more, it may lead to quota information corruption when using old quota,
      consider the following case.
      
       - Create and mount an empty ext4 filesystem without extent and quota
         features,
       - quotacheck and enable the user & group quota,
       - Create some files and write some data to them, and then punch hole
         to some files of them, it may trigger the buffer leak problem
         mentioned above.
       - Disable quota and run quotacheck again, it will create two new
         aquota files and write the checked quota information to them, which
         probably may reuse the freed indirect block(the buffer and page
         cache was not freed) as data block.
       - Enable quota again, it will invoke
         vfs_load_quota_inode()->invalidate_bdev() to try to clean unused
         buffers and pagecache. Unfortunately, because of the buffer of quota
         data block is still referenced, quota code cannot read the up to date
         quota info from the device and lead to quota information corruption.
      
      This problem can be reproduced by xfstests generic/231 on ext3 file
      system or ext4 file system without extent and quota features.
      
      This patch fix this problem by releasing the missing indirect buffers,
      in ext4_ind_remove_space().
      Reported-by: NHulk Robot <hulkci@huawei.com>
      Signed-off-by: Nzhangyi (F) <yi.zhang@huawei.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Cc: stable@kernel.org
      674a2b27
  8. 15 3月, 2019 6 次提交
    • L
      ext4: report real fs size after failed resize · 6c732840
      Lukas Czerner 提交于
      Currently when the file system resize using ext4_resize_fs() fails it
      will report into log that "resized filesystem to <requested block
      count>".  However this may not be true in the case of failure.  Use the
      current block count as returned by ext4_blocks_count() to report the
      block count.
      
      Additionally, report a warning that "error occurred during file system
      resize"
      Signed-off-by: NLukas Czerner <lczerner@redhat.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      6c732840
    • L
      ext4: add missing brelse() in add_new_gdb_meta_bg() · d64264d6
      Lukas Czerner 提交于
      Currently in add_new_gdb_meta_bg() there is a missing brelse of gdb_bh
      in case ext4_journal_get_write_access() fails.
      Additionally kvfree() is missing in the same error path. Fix it by
      moving the ext4_journal_get_write_access() before the ext4 sb update as
      Ted suggested and release n_group_desc and gdb_bh in case it fails.
      
      Fixes: 61a9c11e ("ext4: add missing brelse() add_new_gdb_meta_bg()'s error path")
      Signed-off-by: NLukas Czerner <lczerner@redhat.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      d64264d6
    • J
      ext4: remove useless ext4_pin_inode() · 7cf77140
      Jason Yan 提交于
      This function is never used from the beginning (and is commented out);
      let's remove it.
      Signed-off-by: NJason Yan <yanaijie@huawei.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      7cf77140
    • J
      ext4: avoid panic during forced reboot · 1dc1097f
      Jan Kara 提交于
      When admin calls "reboot -f" - i.e., does a hard system reboot by
      directly calling reboot(2) - ext4 filesystem mounted with errors=panic
      can panic the system. This happens because the underlying device gets
      disabled without unmounting the filesystem and thus some syscall running
      in parallel to reboot(2) can result in the filesystem getting IO errors.
      
      This is somewhat surprising to the users so try improve the behavior by
      switching to errors=remount-ro behavior when the system is running
      reboot(2).
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      1dc1097f
    • L
      ext4: fix data corruption caused by unaligned direct AIO · 372a03e0
      Lukas Czerner 提交于
      Ext4 needs to serialize unaligned direct AIO because the zeroing of
      partial blocks of two competing unaligned AIOs can result in data
      corruption.
      
      However it decides not to serialize if the potentially unaligned aio is
      past i_size with the rationale that no pending writes are possible past
      i_size. Unfortunately if the i_size is not block aligned and the second
      unaligned write lands past i_size, but still into the same block, it has
      the potential of corrupting the previous unaligned write to the same
      block.
      
      This is (very simplified) reproducer from Frank
      
          // 41472 = (10 * 4096) + 512
          // 37376 = 41472 - 4096
      
          ftruncate(fd, 41472);
          io_prep_pwrite(iocbs[0], fd, buf[0], 4096, 37376);
          io_prep_pwrite(iocbs[1], fd, buf[1], 4096, 41472);
      
          io_submit(io_ctx, 1, &iocbs[1]);
          io_submit(io_ctx, 1, &iocbs[2]);
      
          io_getevents(io_ctx, 2, 2, events, NULL);
      
      Without this patch the 512B range from 40960 up to the start of the
      second unaligned write (41472) is going to be zeroed overwriting the data
      written by the first write. This is a data corruption.
      
      00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
      *
      00009200  30 30 30 30 30 30 30 30  30 30 30 30 30 30 30 30
      *
      0000a000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
      *
      0000a200  31 31 31 31 31 31 31 31  31 31 31 31 31 31 31 31
      
      With this patch the data corruption is avoided because we will recognize
      the unaligned_aio and wait for the unwritten extent conversion.
      
      00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
      *
      00009200  30 30 30 30 30 30 30 30  30 30 30 30 30 30 30 30
      *
      0000a200  31 31 31 31 31 31 31 31  31 31 31 31 31 31 31 31
      *
      0000b200
      Reported-by: NFrank Sorenson <fsorenso@redhat.com>
      Signed-off-by: NLukas Czerner <lczerner@redhat.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Fixes: e9e3bcec ("ext4: serialize unaligned asynchronous DIO")
      Cc: stable@vger.kernel.org
      372a03e0
    • J
      ext4: fix NULL pointer dereference while journal is aborted · fa30dde3
      Jiufei Xue 提交于
      We see the following NULL pointer dereference while running xfstests
      generic/475:
      BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
      PGD 8000000c84bad067 P4D 8000000c84bad067 PUD c84e62067 PMD 0
      Oops: 0000 [#1] SMP PTI
      CPU: 7 PID: 9886 Comm: fsstress Kdump: loaded Not tainted 5.0.0-rc8 #10
      RIP: 0010:ext4_do_update_inode+0x4ec/0x760
      ...
      Call Trace:
      ? jbd2_journal_get_write_access+0x42/0x50
      ? __ext4_journal_get_write_access+0x2c/0x70
      ? ext4_truncate+0x186/0x3f0
      ext4_mark_iloc_dirty+0x61/0x80
      ext4_mark_inode_dirty+0x62/0x1b0
      ext4_truncate+0x186/0x3f0
      ? unmap_mapping_pages+0x56/0x100
      ext4_setattr+0x817/0x8b0
      notify_change+0x1df/0x430
      do_truncate+0x5e/0x90
      ? generic_permission+0x12b/0x1a0
      
      This is triggered because the NULL pointer handle->h_transaction was
      dereferenced in function ext4_update_inode_fsync_trans().
      I found that the h_transaction was set to NULL in jbd2__journal_restart
      but failed to attached to a new transaction while the journal is aborted.
      
      Fix this by checking the handle before updating the inode.
      
      Fixes: b436b9be ("ext4: Wait for proper transaction commit on fsync")
      Signed-off-by: NJiufei Xue <jiufei.xue@linux.alibaba.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Cc: stable@kernel.org
      fa30dde3
  9. 01 3月, 2019 1 次提交
    • E
      ext4: fix bigalloc cluster freeing when hole punching under load · 7bd75230
      Eric Whitney 提交于
      Ext4 may not free clusters correctly when punching holes in bigalloc
      file systems under high load conditions.  If it's not possible to
      extend and restart the journal in ext4_ext_rm_leaf() when preparing to
      remove blocks from a punched region, a retry of the entire punch
      operation is triggered in ext4_ext_remove_space().  This causes a
      partial cluster to be set to the first cluster in the extent found to
      the right of the punched region.  However, if the punch operation
      prior to the retry had made enough progress to delete one or more
      extents and a partial cluster candidate for freeing had already been
      recorded, the retry would overwrite the partial cluster.  The loss of
      this information makes it impossible to correctly free the original
      partial cluster in all cases.
      
      This bug can cause generic/476 to fail when run as part of
      xfstests-bld's bigalloc and bigalloc_1k test cases.  The failure is
      reported when e2fsck detects bad iblocks counts greater than expected
      in units of whole clusters and also detects a number of negative block
      bitmap differences equal to the iblocks discrepancy in cluster units.
      Signed-off-by: NEric Whitney <enwlinux@gmail.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      7bd75230
  10. 22 2月, 2019 4 次提交
  11. 21 2月, 2019 2 次提交
  12. 15 2月, 2019 2 次提交
  13. 12 2月, 2019 1 次提交
    • J
      ext4: fix crash during online resizing · f96c3ac8
      Jan Kara 提交于
      When computing maximum size of filesystem possible with given number of
      group descriptor blocks, we forget to include s_first_data_block into
      the number of blocks. Thus for filesystems with non-zero
      s_first_data_block it can happen that computed maximum filesystem size
      is actually lower than current filesystem size which confuses the code
      and eventually leads to a BUG_ON in ext4_alloc_group_tables() hitting on
      flex_gd->count == 0. The problem can be reproduced like:
      
      truncate -s 100g /tmp/image
      mkfs.ext4 -b 1024 -E resize=262144 /tmp/image 32768
      mount -t ext4 -o loop /tmp/image /mnt
      resize2fs /dev/loop0 262145
      resize2fs /dev/loop0 300000
      
      Fix the problem by properly including s_first_data_block into the
      computed number of filesystem blocks.
      
      Fixes: 1c6bd717 "ext4: convert file system to meta_bg if needed..."
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Cc: stable@vger.kernel.org
      f96c3ac8
  14. 11 2月, 2019 8 次提交
  15. 01 2月, 2019 1 次提交
    • T
      Revert "ext4: use ext4_write_inode() when fsyncing w/o a journal" · 8fdd60f2
      Theodore Ts'o 提交于
      This reverts commit ad211f3e.
      
      As Jan Kara pointed out, this change was unsafe since it means we lose
      the call to sync_mapping_buffers() in the nojournal case.  The
      original point of the commit was avoid taking the inode mutex (since
      it causes a lockdep warning in generic/113); but we need the mutex in
      order to call sync_mapping_buffers().
      
      The real fix to this problem was discussed here:
      
      https://lore.kernel.org/lkml/20181025150540.259281-4-bvanassche@acm.org
      
      The proposed patch was to fix a syzbot complaint, but the problem can
      also demonstrated via "kvm-xfstests -c nojournal generic/113".
      Multiple solutions were discused in the e-mail thread, but none have
      landed in the kernel as of this writing.  Anyway, commit
      ad211f3e is absolutely the wrong way to suppress the lockdep, so
      revert it.
      
      Fixes: ad211f3e ("ext4: use ext4_write_inode() when fsyncing w/o a journal")
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Reported: Jan Kara <jack@suse.cz>
      8fdd60f2