1. 12 Apr, 2021 1 commit
  2. 21 Mar, 2021 1 commit
    • ext4: fix rename whiteout with fast commit · 8210bb29
      Committed by Harshad Shirwadkar
      This patch adds rename whiteout support in fast commits. Note that the
      whiteout object that gets created is actually a char device, which
      implies that the function ext4_inode_journal_mode(struct inode *inode)
      would return "JOURNAL_DATA" for this inode. This has a consequence in
      fast commit code that it will make creation of the whiteout object a
      fast-commit ineligible behavior and thus will fall back to full
      commits. With this patch, this can be observed by running fast commits
      with rename whiteout and seeing the stats generated by ext4_fc_stats
      tracepoint as follows:
      
      ext4_fc_stats: dev 254:32 fc ineligible reasons:
      XATTR:0, CROSS_RENAME:0, JOURNAL_FLAG_CHANGE:0, NO_MEM:0, SWAP_BOOT:0,
      RESIZE:0, RENAME_DIR:0, FALLOC_RANGE:0, INODE_JOURNAL_DATA:16;
      num_commits:6, ineligible: 6, numblks: 3
      
      So in short, this patch guarantees that in case of rename whiteout, we
      fall back to full commits.
      
      Amir mentioned that instead of creating a new whiteout object for
      every rename, we can create a static whiteout object with irrelevant
      nlink. That would keep fast commits from falling back to full
      commits. But until this happens, this patch will ensure correctness by
      falling back to full commits.
      
      Fixes: 8016e29f ("ext4: fast commit recovery path")
      Cc: stable@kernel.org
      Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
      Link: https://lore.kernel.org/r/20210316221921.1124955-1-harshadshirwadkar@gmail.com
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
  3. 07 Mar, 2021 1 commit
    • ext4: shrink race window in ext4_should_retry_alloc() · efc61345
      Committed by Eric Whitney
      When generic/371 is run on kvm-xfstests using 5.10 and 5.11 kernels, it
      fails at significant rates on the two test scenarios that disable
      delayed allocation (ext3conv and data_journal) and force actual block
      allocation for the fallocate and pwrite functions in the test.  The
      failure rate on 5.10 for both ext3conv and data_journal on one test
      system typically runs about 85%.  On 5.11, the failure rate on ext3conv
      sometimes drops to as low as 1% while the rate on data_journal
      increases to nearly 100%.
      
      The observed failures are largely due to ext4_should_retry_alloc()
      cutting off block allocation retries when s_mb_free_pending (used to
      indicate that a transaction in progress will free blocks) is 0.
      However, free space is usually available when this occurs during runs
      of generic/371.  It appears that a thread attempting to allocate
      blocks is just missing transaction commits in other threads that
      increase the free cluster count and reset s_mb_free_pending while
      the allocating thread isn't running.  Explicitly testing for free space
      availability avoids this race.
      
      The current code uses a post-increment operator in the conditional
      expression that determines whether the retry limit has been exceeded.
      This means that the conditional expression uses the value of the
      retry counter before it's increased, resulting in an extra retry cycle.
      The current code actually retries twice before hitting its retry limit
      rather than once.
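
The off-by-one described above can be reproduced with a small model. This is only a sketch: the function names and the loop that assumes every attempt fails are hypothetical and are not the kernel's ext4_should_retry_alloc() code. It counts how many retries a caller is granted when the limit check uses a post-increment versus a pre-increment.

```c
/* Simplified model, not kernel code: every allocation attempt is assumed
 * to fail, so the only thing that stops the loop is the retry limit. */

/* Post-increment: the comparison sees the counter before it is increased. */
int count_retries_post(int limit)
{
    int retries = 0;
    int granted = 0;

    while (!(retries++ > limit))
        granted++;
    return granted;
}

/* Pre-increment: the comparison sees the already increased counter. */
int count_retries_pre(int limit)
{
    int retries = 0;
    int granted = 0;

    while (!(++retries > limit))
        granted++;
    return granted;
}
```

With a limit of 1, the post-increment form grants two retries and the pre-increment form grants one, which matches the "twice rather than once" observation in the message.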
      
      Increasing the retry limit to 3 from the current actual maximum retry
      count of 2 in combination with the change described above reduces the
      observed failure rate to less than 0.1% on both ext3conv and
      data_journal with what should be limited impact on users sensitive to
      the overhead caused by retries.
      
      A per filesystem percpu counter exported via sysfs is added to allow
      users or developers to track the number of times the retry limit is
      exceeded without resorting to debugging methods.  This should provide
      some insight into worst case retry behavior.
      Signed-off-by: Eric Whitney <enwlinux@gmail.com>
      Link: https://lore.kernel.org/r/20210218151132.19678-1-enwlinux@gmail.com
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
  4. 24 Jan, 2021 2 commits
    • ext4: support idmapped mounts · 14f3db55
      Committed by Christian Brauner
      Enable idmapped mounts for ext4. All dedicated helpers we need for this
      exist. So this basically just means we're passing down the
      user_namespace argument from the VFS methods to the relevant helpers.
      
      Let's create a simple example where we idmap an ext4 filesystem:
      
       root@f2-vm:~# truncate -s 5G ext4.img
      
       root@f2-vm:~# mkfs.ext4 ./ext4.img
       mke2fs 1.45.5 (07-Jan-2020)
       Discarding device blocks: done
       Creating filesystem with 1310720 4k blocks and 327680 inodes
       Filesystem UUID: 3fd91794-c6ca-4b0f-9964-289a000919cf
       Superblock backups stored on blocks:
               32768, 98304, 163840, 229376, 294912, 819200, 884736
      
       Allocating group tables: done
       Writing inode tables: done
       Creating journal (16384 blocks): done
       Writing superblocks and filesystem accounting information: done
      
       root@f2-vm:~# losetup -f --show ./ext4.img
       /dev/loop0
      
       root@f2-vm:~# mount /dev/loop0 /mnt
      
       root@f2-vm:~# ls -al /mnt/
       total 24
       drwxr-xr-x  3 root root  4096 Oct 28 13:34 .
       drwxr-xr-x 30 root root  4096 Oct 28 13:22 ..
       drwx------  2 root root 16384 Oct 28 13:34 lost+found
      
       # Let's create an idmapped mount at /idmapped1 where we map uid and gid
       # 0 to uid and gid 1000
       root@f2-vm:/# ./mount-idmapped --map-mount b:0:1000:1 /mnt/ /idmapped1/
      
       root@f2-vm:/# ls -al /idmapped1/
       total 24
       drwxr-xr-x  3 ubuntu ubuntu  4096 Oct 28 13:34 .
       drwxr-xr-x 30 root   root    4096 Oct 28 13:22 ..
       drwx------  2 ubuntu ubuntu 16384 Oct 28 13:34 lost+found
      
       # Let's create an idmapped mount at /idmapped2 where we map uid and gid
       # 0 to uid and gid 2000
       root@f2-vm:/# ./mount-idmapped --map-mount b:0:2000:1 /mnt/ /idmapped2/
      
       root@f2-vm:/# ls -al /idmapped2/
       total 24
       drwxr-xr-x  3 2000 2000  4096 Oct 28 13:34 .
       drwxr-xr-x 31 root root  4096 Oct 28 13:39 ..
       drwx------  2 2000 2000 16384 Oct 28 13:34 lost+found
      
      Let's create another example where we idmap the rootfs filesystem
      without a mapping for uid 0 and gid 0:
      
       # Create an idmapped mount for a full POSIX range of rootfs under
       # /mnt but without a mapping for uid 0 to reduce attack surface
      
       root@f2-vm:/# ./mount-idmapped --map-mount b:1:1:65536 / /mnt/
      
       # Since we don't have a mapping for uid and gid 0 all files owned by
       # uid and gid 0 should show up as uid and gid 65534:
       root@f2-vm:/# ls -al /mnt/
       total 664
       drwxr-xr-x 31 nobody nogroup   4096 Oct 28 13:39 .
       drwxr-xr-x 31 root   root      4096 Oct 28 13:39 ..
       lrwxrwxrwx  1 nobody nogroup      7 Aug 25 07:44 bin -> usr/bin
       drwxr-xr-x  4 nobody nogroup   4096 Oct 28 13:17 boot
       drwxr-xr-x  2 nobody nogroup   4096 Aug 25 07:48 dev
       drwxr-xr-x 81 nobody nogroup   4096 Oct 28 04:00 etc
       drwxr-xr-x  4 nobody nogroup   4096 Oct 28 04:00 home
       lrwxrwxrwx  1 nobody nogroup      7 Aug 25 07:44 lib -> usr/lib
       lrwxrwxrwx  1 nobody nogroup      9 Aug 25 07:44 lib32 -> usr/lib32
       lrwxrwxrwx  1 nobody nogroup      9 Aug 25 07:44 lib64 -> usr/lib64
       lrwxrwxrwx  1 nobody nogroup     10 Aug 25 07:44 libx32 -> usr/libx32
       drwx------  2 nobody nogroup  16384 Aug 25 07:47 lost+found
       drwxr-xr-x  2 nobody nogroup   4096 Aug 25 07:44 media
       drwxr-xr-x 31 nobody nogroup   4096 Oct 28 13:39 mnt
       drwxr-xr-x  2 nobody nogroup   4096 Aug 25 07:44 opt
       drwxr-xr-x  2 nobody nogroup   4096 Apr 15  2020 proc
       drwx--x--x  6 nobody nogroup   4096 Oct 28 13:34 root
       drwxr-xr-x  2 nobody nogroup   4096 Aug 25 07:46 run
       lrwxrwxrwx  1 nobody nogroup      8 Aug 25 07:44 sbin -> usr/sbin
       drwxr-xr-x  2 nobody nogroup   4096 Aug 25 07:44 srv
       drwxr-xr-x  2 nobody nogroup   4096 Apr 15  2020 sys
       drwxrwxrwt 10 nobody nogroup   4096 Oct 28 13:19 tmp
       drwxr-xr-x 14 nobody nogroup   4096 Oct 20 13:00 usr
       drwxr-xr-x 12 nobody nogroup   4096 Aug 25 07:45 var
      
       # Since we do have a mapping for uid and gid 1000 all files owned by
       # uid and gid 1000 should simply show up as uid and gid 1000:
       root@f2-vm:/# ls -al /mnt/home/ubuntu/
       total 40
       drwxr-xr-x 3 ubuntu ubuntu  4096 Oct 28 00:43 .
       drwxr-xr-x 4 nobody nogroup 4096 Oct 28 04:00 ..
       -rw------- 1 ubuntu ubuntu  2936 Oct 28 12:26 .bash_history
       -rw-r--r-- 1 ubuntu ubuntu   220 Feb 25  2020 .bash_logout
       -rw-r--r-- 1 ubuntu ubuntu  3771 Feb 25  2020 .bashrc
       -rw-r--r-- 1 ubuntu ubuntu   807 Feb 25  2020 .profile
       -rw-r--r-- 1 ubuntu ubuntu     0 Oct 16 16:11 .sudo_as_admin_successful
       -rw------- 1 ubuntu ubuntu  1144 Oct 28 00:43 .viminfo
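
The ownership shown in the listings above follows from a simple range translation. The sketch below is a hypothetical model of one `b:<from>:<to>:<count>` argument as used in the examples, not the VFS implementation. The struct and function names are assumptions. The fallback value 65534 is the one the message says unmapped ids show up as.

```c
#include <stdint.h>

/* Value that unmapped ids show up as in the listings (nobody/nogroup). */
#define OVERFLOW_ID 65534u

/* Hypothetical model of one "b:<from>:<to>:<count>" mapping argument. */
struct id_range {
    uint32_t from;   /* first id as stored on the filesystem */
    uint32_t to;     /* first id as seen through the idmapped mount */
    uint32_t count;  /* number of consecutive ids covered */
};

/* Translate an on-disk id into the id visible through the mount. */
uint32_t map_id(struct id_range r, uint32_t id)
{
    if (id >= r.from && id - r.from < r.count)
        return r.to + (id - r.from);
    return OVERFLOW_ID;  /* no mapping for this id */
}
```

Under this model, `b:0:1000:1` turns id 0 into 1000 (root-owned files appear as ubuntu), while `b:1:1:65536` leaves id 1000 unchanged and sends id 0 to 65534.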
      
      Link: https://lore.kernel.org/r/20210121131959.646623-39-christian.brauner@ubuntu.com
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-ext4@vger.kernel.org
      Cc: linux-fsdevel@vger.kernel.org
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
    • fs: make helpers idmap mount aware · 549c7297
      Committed by Christian Brauner
      Extend some inode methods with an additional user namespace argument. A
      filesystem that is aware of idmapped mounts will receive the user
      namespace the mount has been marked with. This can be used for
      additional permission checking and also to enable filesystems to
      translate between uids and gids if they need to. We have implemented all
      relevant helpers in earlier patches.
      
      As requested, we simply extend the existing inode methods instead of
      introducing new ones. This is a little more code churn but it's mostly
      mechanical and doesn't leave us with additional inode methods.
      
      Link: https://lore.kernel.org/r/20210121131959.646623-25-christian.brauner@ubuntu.com
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
  5. 23 Dec, 2020 2 commits
  6. 18 Dec, 2020 2 commits
  7. 03 Dec, 2020 3 commits
  8. 20 Nov, 2020 1 commit
  9. 12 Nov, 2020 1 commit
  10. 07 Nov, 2020 4 commits
  11. 29 Oct, 2020 2 commits
  12. 22 Oct, 2020 5 commits
  13. 18 Oct, 2020 9 commits
  14. 22 Sep, 2020 1 commit
    • fscrypt: handle test_dummy_encryption in more logical way · ac4acb1f
      Committed by Eric Biggers
      The behavior of the test_dummy_encryption mount option is that when a
      new file (or directory or symlink) is created in an unencrypted
      directory, it's automatically encrypted using a dummy encryption policy.
      That's it; in particular, the encryption (or lack thereof) of existing
      files (or directories or symlinks) doesn't change.
      
      Unfortunately the implementation of test_dummy_encryption is a bit weird
      and confusing.  When test_dummy_encryption is enabled and a file is
      being created in an unencrypted directory, we set up an encryption key
      (->i_crypt_info) for the directory.  This isn't actually used to do any
      encryption, however, since the directory is still unencrypted!  Instead,
      ->i_crypt_info is only used for inheriting the encryption policy.
      
      One consequence of this is that the filesystem ends up providing a
      "dummy context" (policy + nonce) instead of a "dummy policy".  In
      commit ed318a6c ("fscrypt: support test_dummy_encryption=v2"), I
      mistakenly thought this was required.  However, actually the nonce only
      ends up being used to derive a key that is never used.
      
      Another consequence of this implementation is that it allows for
      'inode->i_crypt_info != NULL && !IS_ENCRYPTED(inode)', which is an edge
      case that can be forgotten about.  For example, currently
      FS_IOC_GET_ENCRYPTION_POLICY on an unencrypted directory may return the
      dummy encryption policy when the filesystem is mounted with
      test_dummy_encryption.  That seems like the wrong thing to do, since
      again, the directory itself is not actually encrypted.
      
      Therefore, switch to a more logical and maintainable implementation
      where the dummy encryption policy inheritance is done without setting up
      keys for unencrypted directories.  This involves:
      
      - Adding a function fscrypt_policy_to_inherit() which returns the
        encryption policy to inherit from a directory.  This can be a real
        policy, a dummy policy, or no policy.
      
      - Replacing struct fscrypt_dummy_context, ->get_dummy_context(), etc.
        with struct fscrypt_dummy_policy, ->get_dummy_policy(), etc.
      
      - Making fscrypt_fname_encrypted_size() take an fscrypt_policy instead
        of an inode.
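
The three possible outcomes of the inheritance lookup listed above (a real policy, a dummy policy, or no policy) can be modeled as follows. This is a simplified sketch with hypothetical types and names, not the fscrypt code: the `encrypted` flag stands in for IS_ENCRYPTED(), and a NULL `dummy` stands in for a filesystem mounted without test_dummy_encryption.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an encryption policy. */
struct policy {
    int version;
};

/* Hypothetical stand-in for a directory inode. */
struct dir {
    bool encrypted;                   /* models IS_ENCRYPTED(inode) */
    const struct policy *own_policy;  /* meaningful only when encrypted */
};

/* Return the policy a new file created in dir should inherit, or NULL. */
const struct policy *policy_to_inherit(const struct dir *dir,
                                       const struct policy *dummy)
{
    if (dir->encrypted)
        return dir->own_policy;  /* real policy */
    return dummy;                /* dummy policy, or NULL for no policy */
}
```

In this model no key is ever attached to an unencrypted directory. The dummy policy is simply handed to the new file.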
      Acked-by: Jaegeuk Kim <jaegeuk@kernel.org>
      Acked-by: Jeff Layton <jlayton@kernel.org>
      Link: https://lore.kernel.org/r/20200917041136.178600-13-ebiggers@kernel.org
      Signed-off-by: Eric Biggers <ebiggers@google.com>
  15. 20 Aug, 2020 1 commit
    • ext4: limit the length of per-inode prealloc list · 27bc446e
      Committed by brookxu
      In the scenario of writing sparse files, the per-inode prealloc list may
      be very long, resulting in high overhead for ext4_mb_use_preallocated().
      To circumvent this problem, we limit the maximum length of per-inode
      prealloc list to 512 and allow users to modify it.
      
      After patching, we observed that the share of CPU time spent in system
      mode dropped and system throughput increased significantly. We created
      a process that writes a sparse file, and its running time on the
      fixed kernel was significantly reduced, as follows:
      
      Running time on unfixed kernel:
      [root@TENCENT64 ~]# time taskset 0x01 ./sparse /data1/sparce.dat
      real    0m2.051s
      user    0m0.008s
      sys     0m2.026s
      
      Running time on fixed kernel:
      [root@TENCENT64 ~]# time taskset 0x01 ./sparse /data1/sparce.dat
      real    0m0.471s
      user    0m0.004s
      sys     0m0.395s
      Signed-off-by: Chunguang Xu <brookxu@tencent.com>
      Link: https://lore.kernel.org/r/d7a98178-056b-6db5-6bce-4ead23f4a257@gmail.com
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
  16. 19 Aug, 2020 1 commit
  17. 08 Aug, 2020 3 commits