1. 27 Aug, 2021 1 commit
  2. 08 Jul, 2021 2 commits
  3. 01 Jul, 2021 1 commit
    • ext4: fix WARN_ON_ONCE(!buffer_uptodate) after an error writing the superblock · 558d6450
      Committed by Ye Bin
      If a writeback of the superblock fails with an I/O error, the buffer
      is marked not uptodate.  However, this can cause a WARN_ON to trigger
      when we attempt to write the superblock a second time.  (That write
      might succeed this time, for certain types of block devices such as
      iSCSI devices over a flaky network.)
      
      Try to detect this case in flush_stashed_error_work(), and also change
      __ext4_handle_dirty_metadata() so we always set the uptodate flag, not
      just in the nojournal case.
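
The interaction between the failed write-back and the later warning can be sketched as a small user-space model. This is only an illustration under stated assumptions, not kernel code: `BufferHead`, `end_write` and `handle_dirty_metadata` are hypothetical names, and `mark_buffer_dirty` here merely mimics the uptodate check that the trace further down shows firing.

```python
class BufferHead:
    """Toy stand-in for a buffer head: only the two flags that matter here."""

    def __init__(self):
        self.uptodate = True
        self.dirty = False


def end_write(bh, io_error):
    """Model write-back completion: an I/O error clears the uptodate flag."""
    if io_error:
        bh.uptodate = False
    bh.dirty = False


def mark_buffer_dirty(bh):
    """Mark the buffer dirty; return True if the uptodate check would warn."""
    would_warn = not bh.uptodate
    bh.dirty = True
    return would_warn


def handle_dirty_metadata(bh, always_set_uptodate):
    """Model of the fix: set uptodate unconditionally before dirtying."""
    if always_set_uptodate:
        bh.uptodate = True
    return mark_buffer_dirty(bh)
```

In this model, a buffer that went through `end_write(bh, io_error=True)` triggers the warning only when `always_set_uptodate` is false, which corresponds to the behaviour before the fix on the journalled path.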
      
      Before this commit, this problem could be replicated via:
      
      1. dmsetup  create dust1 --table  '0 2097152 dust /dev/sdc 0 4096'
      2. mount  /dev/mapper/dust1  /home/test
      3. dmsetup message dust1 0 addbadblock 0 10
      4. cd /home/test
      5. echo "XXXXXXX" > t
      
      After a few seconds, we got the following warning:
      
      [   80.654487] end_buffer_async_write: bh=0xffff88842f18bdd0
      [   80.656134] Buffer I/O error on dev dm-0, logical block 0, lost async page write
      [   85.774450] EXT4-fs error (device dm-0): ext4_check_bdev_write_error:193: comm kworker/u16:8: Error while async write back metadata
      [   91.415513] mark_buffer_dirty: bh=0xffff88842f18bdd0
      [   91.417038] ------------[ cut here ]------------
      [   91.418450] WARNING: CPU: 1 PID: 1944 at fs/buffer.c:1092 mark_buffer_dirty.cold+0x1c/0x5e
      [   91.440322] Call Trace:
      [   91.440652]  __jbd2_journal_temp_unlink_buffer+0x135/0x220
      [   91.441354]  __jbd2_journal_unfile_buffer+0x24/0x90
      [   91.441981]  __jbd2_journal_refile_buffer+0x134/0x1d0
      [   91.442628]  jbd2_journal_commit_transaction+0x249a/0x3240
      [   91.443336]  ? put_prev_entity+0x2a/0x200
      [   91.443856]  ? kjournald2+0x12e/0x510
      [   91.444324]  kjournald2+0x12e/0x510
      [   91.444773]  ? woken_wake_function+0x30/0x30
      [   91.445326]  kthread+0x150/0x1b0
      [   91.445739]  ? commit_timeout+0x20/0x20
      [   91.446258]  ? kthread_flush_worker+0xb0/0xb0
      [   91.446818]  ret_from_fork+0x1f/0x30
      [   91.447293] ---[ end trace 66f0b6bf3d1abade ]---
      Signed-off-by: Ye Bin <yebin10@huawei.com>
      Link: https://lore.kernel.org/r/20210615090537.3423231-1-yebin10@huawei.com
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
  4. 30 Jun, 2021 1 commit
  5. 24 Jun, 2021 2 commits
  6. 23 Jun, 2021 1 commit
  7. 17 Jun, 2021 3 commits
  8. 06 Jun, 2021 1 commit
    • ext4: fix memory leak in ext4_fill_super · afd09b61
      Committed by Alexey Makhalov
      Buffer head references must be released before calling kill_bdev();
      otherwise the buffer head (and its page referenced by b_data) will not
      be freed by kill_bdev, and subsequently that bh will be leaked.
      
      If blocksizes differ, sb_set_blocksize() will kill current buffers and
      page cache by using kill_bdev(), and the superblock will then be reread
      using the correct blocksize. But sb_set_blocksize() could not fully
      free the superblock page and buffer head: because they were still
      busy, they were not freed and instead leaked.
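
The ordering problem can be illustrated with a toy reference-count model. This is a sketch under assumptions, not the kernel implementation: `Buffer` and `reread_superblock` are hypothetical, and `kill_bdev` below only borrows the kernel function's name to stand for "drop every cached buffer that nobody still holds".

```python
class Buffer:
    """Toy cached buffer with a reference count."""

    def __init__(self):
        self.refcount = 0


def kill_bdev(cache):
    """Drop unreferenced buffers; return the busy ones that could not be freed."""
    return [buf for buf in cache if buf.refcount > 0]


def reread_superblock(release_first):
    """Return how many buffers are leaked when the block size changes."""
    bh = Buffer()
    bh.refcount += 1          # reference taken when the superblock was first read
    cache = [bh]
    if release_first:
        bh.refcount -= 1      # the fix: release the reference before kill_bdev()
    leaked = kill_bdev(cache)
    return len(leaked)
```

In this model `reread_superblock(release_first=False)` leaves one busy buffer behind, while releasing the reference first leaves none.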
      
      This can easily be reproduced by running an infinite loop of:
      
        systemctl start <ext4_on_lvm>.mount, and
        systemctl stop <ext4_on_lvm>.mount
      
      ... since systemd creates a cgroup for each slice which it mounts, and
      the bh leak gets amplified by a dying memory cgroup that also never
      gets freed, and memory consumption is much more easily noticed.
      
      Fixes: ce40733c ("ext4: Check for return value from sb_set_blocksize")
      Fixes: ac27a0ec ("ext4: initial copy of files from ext3")
      Link: https://lore.kernel.org/r/20210521075533.95732-1-amakhalov@vmware.com
      Signed-off-by: Alexey Makhalov <amakhalov@vmware.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Cc: stable@kernel.org
  9. 19 Apr, 2021 1 commit
  10. 10 Apr, 2021 4 commits
  11. 09 Apr, 2021 3 commits
    • ext4: make prefetch_block_bitmaps default · 21175ca4
      Committed by Harshad Shirwadkar
      Block bitmap prefetching is needed for these allocator optimization
      data structures to get populated and provide better group scanning
      order. So, turn it on by default. The prefetch_block_bitmaps mount
      option is now marked as removed, and a new option,
      no_prefetch_block_bitmaps, is added to disable block bitmap prefetching.
      Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
      Link: https://lore.kernel.org/r/20210401172129.189766-8-harshadshirwadkar@gmail.com
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • ext4: improve cr 0 / cr 1 group scanning · 196e402a
      Committed by Harshad Shirwadkar
      Instead of traversing through groups linearly, scan groups in specific
      orders at cr 0 and cr 1. At cr 0, we want to find groups that have the
      largest free order >= the order of the request. So, with this patch,
      we maintain lists for each possible order and insert each group into a
      list based on the largest free order in its buddy bitmap. During cr 0
      allocation, we traverse these lists in the increasing order of largest
      free orders. This allows us to find a group with the best available cr
      0 match in constant time. If nothing can be found, we fall back to cr 1
      immediately.
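
The per-order lists described above can be sketched as follows. This is a minimal user-space model, not the kernel data structure: `build_order_lists` and `choose_group_cr0` are hypothetical names, and groups are represented simply as a mapping from group id to the largest free order in that group's buddy bitmap.

```python
def build_order_lists(groups, max_order):
    """Bucket group ids by their largest free order (one list per order)."""
    lists = [[] for _ in range(max_order + 1)]
    for group_id, largest_free_order in groups.items():
        if 0 <= largest_free_order <= max_order:
            lists[largest_free_order].append(group_id)
    return lists


def choose_group_cr0(lists, request_order):
    """Walk the lists in increasing order, starting at the request's order.

    The loop is bounded by the number of orders, not by the number of
    groups, which is why the lookup does not grow with the group count.
    Returns None when no list has a candidate (the caller would then
    fall back to cr 1).
    """
    for order in range(request_order, len(lists)):
        if lists[order]:
            return lists[order][0]
    return None
```

For example, with groups `{0: 2, 1: 5, 2: 3}`, a request of order 3 picks group 2 (the smallest sufficient order), and a request of order 4 picks group 1.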
      
      At CR1, the story is slightly different. We want to traverse in the
      order of increasing average fragment size. For CR1, we maintain an rb
      tree of groupinfos which is sorted by average fragment size. Instead
      of traversing linearly, at CR1, we traverse in the order of increasing
      average fragment size, starting at the most optimal group. This brings
      down cr 1 search complexity to O(log(num groups)).
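
A rough model of the CR1 traversal is given below. It swaps the rb tree for a sorted Python list searched with `bisect`, which gives the same logarithmic lookup of the starting point. It also assumes that "most optimal group" means the group with the smallest average fragment size that is still at least the request length; the commit message does not spell this out. `build_index` and `scan_order_cr1` are hypothetical names.

```python
import bisect


def build_index(groups):
    """Sort (average_fragment_size, group_id) pairs; stands in for the rb tree."""
    return sorted((avg, gid) for gid, avg in groups.items())


def scan_order_cr1(index, request_len):
    """Return group ids in increasing average fragment size.

    The scan starts at the first group whose average fragment size is at
    least request_len (an assumption about what "most optimal" means);
    that starting point is found by binary search.
    """
    start = bisect.bisect_left(index, (request_len,))
    return [gid for _, gid in index[start:]]
```

With groups `{0: 8, 1: 2, 2: 4, 3: 16}` and a request length of 4, the scan order is groups 2, 0, 3; group 1 is skipped because its fragments are too small on average.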
      
      For cr >= 2, we just perform the linear search as before. Also, in
      case of lock contention, we intermittently fall back to linear search
      even in the CR 0 and CR 1 cases. This allows us to proceed during the
      allocation path even in case of high contention.
      
      There is an opportunity to do optimization at CR2 too. That's because
      at CR2 we only consider groups where bb_free counter (number of free
      blocks) is greater than the request extent size. That's left as future
      work.
      
      All the changes introduced in this patch are protected under a new
      mount option "mb_optimize_scan".
      
      With this patchset, the following experiment was performed:
      
      Created a highly fragmented disk of size 65TB. The disk had no
      contiguous 2M regions. The following command was run three times
      consecutively:
      
      time dd if=/dev/urandom of=file bs=2M count=10
      
      Here are the results with and without cr 0/1 optimizations introduced
      in this patch:
      
      |---------+------------------------------+---------------------------|
      |         | Without CR 0/1 Optimizations | With CR 0/1 Optimizations |
      |---------+------------------------------+---------------------------|
      | 1st run | 5m1.871s                     | 2m47.642s                 |
      | 2nd run | 2m28.390s                    | 0m0.611s                  |
      | 3rd run | 2m26.530s                    | 0m1.255s                  |
      |---------+------------------------------+---------------------------|
      Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
      Reported-by: kernel test robot <lkp@intel.com>
      Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
      Reviewed-by: Andreas Dilger <adilger@dilger.ca>
      Link: https://lore.kernel.org/r/20210401172129.189766-6-harshadshirwadkar@gmail.com
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • ext4: add ability to return parsed options from parse_options · b237e304
      Committed by Harshad Shirwadkar
      Before this patch, the function parse_options() was returning
      journal_devnum and journal_ioprio variables to the caller. This patch
      generalizes that interface to allow parse_options() to return any
      parsed options to the caller. In this patch series, it gets used to
      capture the value of the "mb_optimize_scan=%u" mount option.
      Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
      Reviewed-by: Ritesh Harjani <ritesh.list@gmail.com>
      Link: https://lore.kernel.org/r/20210401172129.189766-3-harshadshirwadkar@gmail.com
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
  12. 06 Apr, 2021 1 commit
  13. 21 Mar, 2021 1 commit
  14. 07 Mar, 2021 1 commit
    • ext4: shrink race window in ext4_should_retry_alloc() · efc61345
      Committed by Eric Whitney
      When generic/371 is run on kvm-xfstests using 5.10 and 5.11 kernels, it
      fails at significant rates on the two test scenarios that disable
      delayed allocation (ext3conv and data_journal) and force actual block
      allocation for the fallocate and pwrite functions in the test.  The
      failure rate on 5.10 for both ext3conv and data_journal on one test
      system typically runs about 85%.  On 5.11, the failure rate on ext3conv
      sometimes drops to as low as 1% while the rate on data_journal
      increases to nearly 100%.
      
      The observed failures are largely due to ext4_should_retry_alloc()
      cutting off block allocation retries when s_mb_free_pending (used to
      indicate that a transaction in progress will free blocks) is 0.
      However, free space is usually available when this occurs during runs
      of generic/371.  It appears that a thread attempting to allocate
      blocks is just missing transaction commits in other threads that
      increase the free cluster count and reset s_mb_free_pending while
      the allocating thread isn't running.  Explicitly testing for free space
      availability avoids this race.
      
      The current code uses a post-increment operator in the conditional
      expression that determines whether the retry limit has been exceeded.
      This means that the conditional expression uses the value of the
      retry counter before it's increased, resulting in an extra retry cycle.
      The current code actually retries twice before hitting its retry limit
      rather than once.
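
The off-by-one effect of the post-increment can be reproduced with a small model. Python has no `++` operator, so the two variants are spelled out explicitly. The comparison form `counter > limit` and the limit value of 1 are assumptions chosen to match the "twice rather than once" description; the actual kernel expression is not quoted in this commit message, and `count_retries` is a hypothetical helper.

```python
def count_retries(limit, post_increment):
    """Count how many retries are granted before the limit check trips.

    post_increment=True compares the counter's value from before the
    increment (like `counter++ > limit` in C); False compares the value
    after the increment (like `++counter > limit`).
    """
    counter = 0
    granted = 0
    while True:
        if post_increment:
            seen = counter      # value before the increment is compared
            counter += 1
        else:
            counter += 1
            seen = counter      # value after the increment is compared
        if seen > limit:
            return granted
        granted += 1
```

With `limit=1`, the post-increment variant grants two retries and the pre-increment variant grants one.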
      
      Increasing the retry limit to 3 from the current actual maximum retry
      count of 2 in combination with the change described above reduces the
      observed failure rate to less than 0.1% on both ext3conv and
      data_journal with what should be limited impact on users sensitive to
      the overhead caused by retries.
      
      A per filesystem percpu counter exported via sysfs is added to allow
      users or developers to track the number of times the retry limit is
      exceeded without resorting to debugging methods.  This should provide
      some insight into worst case retry behavior.
      Signed-off-by: Eric Whitney <enwlinux@gmail.com>
      Link: https://lore.kernel.org/r/20210218151132.19678-1-enwlinux@gmail.com
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
  15. 03 Feb, 2021 2 commits
  16. 28 Jan, 2021 1 commit
  17. 24 Jan, 2021 1 commit
    • ext4: support idmapped mounts · 14f3db55
      Committed by Christian Brauner
      Enable idmapped mounts for ext4. All dedicated helpers we need for this
      exist. So this basically just means we're passing down the
      user_namespace argument from the VFS methods to the relevant helpers.
      
      Let's create a simple example where we idmap an ext4 filesystem:
      
       root@f2-vm:~# truncate -s 5G ext4.img
      
       root@f2-vm:~# mkfs.ext4 ./ext4.img
       mke2fs 1.45.5 (07-Jan-2020)
       Discarding device blocks: done
       Creating filesystem with 1310720 4k blocks and 327680 inodes
       Filesystem UUID: 3fd91794-c6ca-4b0f-9964-289a000919cf
       Superblock backups stored on blocks:
               32768, 98304, 163840, 229376, 294912, 819200, 884736
      
       Allocating group tables: done
       Writing inode tables: done
       Creating journal (16384 blocks): done
       Writing superblocks and filesystem accounting information: done
      
       root@f2-vm:~# losetup -f --show ./ext4.img
       /dev/loop0
      
       root@f2-vm:~# mount /dev/loop0 /mnt
      
       root@f2-vm:~# ls -al /mnt/
       total 24
       drwxr-xr-x  3 root root  4096 Oct 28 13:34 .
       drwxr-xr-x 30 root root  4096 Oct 28 13:22 ..
       drwx------  2 root root 16384 Oct 28 13:34 lost+found
      
       # Let's create an idmapped mount at /idmapped1 where we map uid and gid
       # 0 to uid and gid 1000
       root@f2-vm:/# ./mount-idmapped --map-mount b:0:1000:1 /mnt/ /idmapped1/
      
       root@f2-vm:/# ls -al /idmapped1/
       total 24
       drwxr-xr-x  3 ubuntu ubuntu  4096 Oct 28 13:34 .
       drwxr-xr-x 30 root   root    4096 Oct 28 13:22 ..
       drwx------  2 ubuntu ubuntu 16384 Oct 28 13:34 lost+found
      
       # Let's create an idmapped mount at /idmapped2 where we map uid and gid
       # 0 to uid and gid 2000
       root@f2-vm:/# ./mount-idmapped --map-mount b:0:2000:1 /mnt/ /idmapped2/
      
       root@f2-vm:/# ls -al /idmapped2/
       total 24
       drwxr-xr-x  3 2000 2000  4096 Oct 28 13:34 .
       drwxr-xr-x 31 root root  4096 Oct 28 13:39 ..
       drwx------  2 2000 2000 16384 Oct 28 13:34 lost+found
      
      Let's create another example where we idmap the rootfs filesystem
      without a mapping for uid 0 and gid 0:
      
       # Create an idmapped mount for a full POSIX range of rootfs under
       # /mnt but without a mapping for uid 0 to reduce attack surface
      
       root@f2-vm:/# ./mount-idmapped --map-mount b:1:1:65536 / /mnt/
      
       # Since we don't have a mapping for uid and gid 0 all files owned by
       # uid and gid 0 should show up as uid and gid 65534:
       root@f2-vm:/# ls -al /mnt/
       total 664
       drwxr-xr-x 31 nobody nogroup   4096 Oct 28 13:39 .
       drwxr-xr-x 31 root   root      4096 Oct 28 13:39 ..
       lrwxrwxrwx  1 nobody nogroup      7 Aug 25 07:44 bin -> usr/bin
       drwxr-xr-x  4 nobody nogroup   4096 Oct 28 13:17 boot
       drwxr-xr-x  2 nobody nogroup   4096 Aug 25 07:48 dev
       drwxr-xr-x 81 nobody nogroup   4096 Oct 28 04:00 etc
       drwxr-xr-x  4 nobody nogroup   4096 Oct 28 04:00 home
       lrwxrwxrwx  1 nobody nogroup      7 Aug 25 07:44 lib -> usr/lib
       lrwxrwxrwx  1 nobody nogroup      9 Aug 25 07:44 lib32 -> usr/lib32
       lrwxrwxrwx  1 nobody nogroup      9 Aug 25 07:44 lib64 -> usr/lib64
       lrwxrwxrwx  1 nobody nogroup     10 Aug 25 07:44 libx32 -> usr/libx32
       drwx------  2 nobody nogroup  16384 Aug 25 07:47 lost+found
       drwxr-xr-x  2 nobody nogroup   4096 Aug 25 07:44 media
       drwxr-xr-x 31 nobody nogroup   4096 Oct 28 13:39 mnt
       drwxr-xr-x  2 nobody nogroup   4096 Aug 25 07:44 opt
       drwxr-xr-x  2 nobody nogroup   4096 Apr 15  2020 proc
       drwx--x--x  6 nobody nogroup   4096 Oct 28 13:34 root
       drwxr-xr-x  2 nobody nogroup   4096 Aug 25 07:46 run
       lrwxrwxrwx  1 nobody nogroup      8 Aug 25 07:44 sbin -> usr/sbin
       drwxr-xr-x  2 nobody nogroup   4096 Aug 25 07:44 srv
       drwxr-xr-x  2 nobody nogroup   4096 Apr 15  2020 sys
       drwxrwxrwt 10 nobody nogroup   4096 Oct 28 13:19 tmp
       drwxr-xr-x 14 nobody nogroup   4096 Oct 20 13:00 usr
       drwxr-xr-x 12 nobody nogroup   4096 Aug 25 07:45 var
      
       # Since we do have a mapping for uid and gid 1000 all files owned by
       # uid and gid 1000 should simply show up as uid and gid 1000:
       root@f2-vm:/# ls -al /mnt/home/ubuntu/
       total 40
       drwxr-xr-x 3 ubuntu ubuntu  4096 Oct 28 00:43 .
       drwxr-xr-x 4 nobody nogroup 4096 Oct 28 04:00 ..
       -rw------- 1 ubuntu ubuntu  2936 Oct 28 12:26 .bash_history
       -rw-r--r-- 1 ubuntu ubuntu   220 Feb 25  2020 .bash_logout
       -rw-r--r-- 1 ubuntu ubuntu  3771 Feb 25  2020 .bashrc
       -rw-r--r-- 1 ubuntu ubuntu   807 Feb 25  2020 .profile
       -rw-r--r-- 1 ubuntu ubuntu     0 Oct 16 16:11 .sudo_as_admin_successful
       -rw------- 1 ubuntu ubuntu  1144 Oct 28 00:43 .viminfo
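
The ownership seen in the listings above can be reproduced with a small mapping function. This is a sketch that reads a `--map-mount` triple such as `0:1000:1` as (first id in the filesystem, first id in the mount, range length), an interpretation inferred from the example output rather than taken from the tool's documentation. `map_id` and `OVERFLOW_ID` are hypothetical names; 65534 is the value the commit message says unmapped ids show up as.

```python
OVERFLOW_ID = 65534  # what unmapped ids are displayed as in the example


def map_id(fs_id, mappings):
    """Translate an on-disk uid/gid to the id visible through the mount.

    mappings is a list of (first_fs_id, first_mount_id, count) triples.
    Ids that fall outside every range are reported as OVERFLOW_ID.
    """
    for first_fs_id, first_mount_id, count in mappings:
        if first_fs_id <= fs_id < first_fs_id + count:
            return first_mount_id + (fs_id - first_fs_id)
    return OVERFLOW_ID
```

Under this reading, `map_id(0, [(0, 1000, 1)])` gives 1000 as in /idmapped1, while with `[(1, 1, 65536)]` id 0 is unmapped and shows up as 65534 and id 1000 stays 1000.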
      
      Link: https://lore.kernel.org/r/20210121131959.646623-39-christian.brauner@ubuntu.com
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-ext4@vger.kernel.org
      Cc: linux-fsdevel@vger.kernel.org
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
  18. 23 Dec, 2020 6 commits
  19. 18 Dec, 2020 7 commits