1. 27 7月, 2018 15 次提交
    • A
      f2fs: use timespec64 for inode timestamps · 24b81dfc
      Arnd Bergmann 提交于
      The on-disk representation and the vfs both use 64-bit tv_sec values,
      so let's change the last missing piece in the middle.
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      Reviewed-by: NChao Yu <yuchao0@huawei.com>
      Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
      24b81dfc
    • C
      f2fs: fix to wait on page writeback before updating page · 6aead161
      Chao Yu 提交于
      In error path of f2fs_move_rehashed_dirents, inode page could be writeback
      state, so we should wait on inode page writeback before updating it.
      Signed-off-by: NChao Yu <yuchao0@huawei.com>
      Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
      6aead161
    • J
      f2fs: assign REQ_RAHEAD to bio for ->readpages · e2e59414
      Jaegeuk Kim 提交于
      As Jens reported, we'd better assign REQ_RAHEAD to bio by the fact that
      ->readpages is called only from read-ahead.
      
      In Documentation/filesystems/vfs.txt,
      
      readpages: called by the VM to read pages associated with the address_space
        	object. This is essentially just a vector version of
        	readpage.  Instead of just one page, several pages are
        	requested.
      	readpages is only used for read-ahead, so read errors are
        	ignored.  If anything goes wrong, feel free to give up.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Reviewed-by: NChao Yu <yuchao0@huawei.com>
      Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
      e2e59414
    • Y
      f2fs: fix a hungtask problem caused by congestion_wait · 2a63531a
      Yunlei He 提交于
      This patch fix hungtask problem which can be reproduced as follow:
      
      Thread 0~3:
      while true
      do
              touch /xxx/test/file_xxx
      done
      
      Thread 4 write a new checkpoint every three seconds.
      
      In the meantime, fio start 16 threads for randwrite.
      
      With my debug info, cycles num will exceed 1000 in function
      f2fs_sync_dirty_inodes, and most of cycle will be dropped
      into congestion_wait() and sleep more than 20ms. Cycles num
      reduced to 3 with this patch.
      Signed-off-by: NYunlei He <heyunlei@huawei.com>
      Reviewed-by: NChao Yu <yuchao0@huawei.com>
      Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
      2a63531a
    • D
      f2fs: Fix uninitialized return in f2fs_ioc_shutdown() · 2a96d8ad
      Dan Carpenter 提交于
      "ret" can be uninitialized on the success path when "in ==
      F2FS_GOING_DOWN_FULLSYNC".
      
      Fixes: 60b2b4ee ("f2fs: Fix deadlock in shutdown ioctl")
      Signed-off-by: NDan Carpenter <dan.carpenter@oracle.com>
      Reviewed-by: NChao Yu <yuchao0@huawei.com>
      Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
      2a96d8ad
    • J
      f2fs: don't issue discard commands in online discard is on · 5a615492
      Jaegeuk Kim 提交于
      Actually, we don't need to issue discard commands, if discard is on, as
      mentioned in the comment.
      Reviewed-by: NChao Yu <yuchao0@huawei.com>
      Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
      5a615492
    • C
      f2fs: fix to propagate return value of scan_nat_page() · e2374015
      Chao Yu 提交于
      As Anatoly Trosinenko reported in bugzilla:
      
      How to reproduce:
      1. Compile the 73fcb1a3 version of the kernel using the config attached
      2. Unpack and mount the attached filesystem image as F2FS
      3. The kernel will BUG() on mount (BUGs are explicitly enabled in config)
      
      [    2.233612] F2FS-fs (sda): Found nat_bits in checkpoint
      [    2.248422] ------------[ cut here ]------------
      [    2.248857] kernel BUG at fs/f2fs/node.c:1967!
      [    2.249760] invalid opcode: 0000 [#1] SMP NOPTI
      [    2.250219] Modules linked in:
      [    2.251848] CPU: 0 PID: 944 Comm: mount Not tainted 4.17.0-rc5+ #1
      [    2.252331] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
      [    2.253305] RIP: 0010:build_free_nids+0x337/0x3f0
      [    2.253672] RSP: 0018:ffffae7fc0857c50 EFLAGS: 00000246
      [    2.254080] RAX: 00000000ffffffff RBX: 0000000000000123 RCX: 0000000000000001
      [    2.254638] RDX: ffff9aa7063d5c00 RSI: 0000000000000122 RDI: ffff9aa705852e00
      [    2.255190] RBP: ffff9aa705852e00 R08: 0000000000000001 R09: ffff9aa7059090c0
      [    2.255719] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9aa705852e00
      [    2.256242] R13: ffff9aa7063ad000 R14: ffff9aa705919000 R15: 0000000000000123
      [    2.256809] FS:  00000000023078c0(0000) GS:ffff9aa707800000(0000) knlGS:0000000000000000
      [    2.258654] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [    2.259153] CR2: 00000000005511ae CR3: 0000000005872000 CR4: 00000000000006f0
      [    2.259801] Call Trace:
      [    2.260583]  build_node_manager+0x5cd/0x600
      [    2.260963]  f2fs_fill_super+0x66a/0x17c0
      [    2.261300]  ? f2fs_commit_super+0xe0/0xe0
      [    2.261622]  mount_bdev+0x16e/0x1a0
      [    2.261899]  mount_fs+0x30/0x150
      [    2.262398]  vfs_kern_mount.part.28+0x4f/0xf0
      [    2.262743]  do_mount+0x5d0/0xc60
      [    2.263010]  ? _copy_from_user+0x37/0x60
      [    2.263313]  ? memdup_user+0x39/0x60
      [    2.263692]  ksys_mount+0x7b/0xd0
      [    2.263960]  __x64_sys_mount+0x1c/0x20
      [    2.264268]  do_syscall_64+0x43/0xf0
      [    2.264560]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      [    2.265095] RIP: 0033:0x48d31a
      [    2.265502] RSP: 002b:00007ffc6fe60a08 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
      [    2.266089] RAX: ffffffffffffffda RBX: 0000000000008000 RCX: 000000000048d31a
      [    2.266607] RDX: 00007ffc6fe62fa5 RSI: 00007ffc6fe62f9d RDI: 00007ffc6fe62f94
      [    2.267130] RBP: 00000000023078a0 R08: 0000000000000000 R09: 0000000000000000
      [    2.267670] R10: 0000000000008000 R11: 0000000000000246 R12: 0000000000000000
      [    2.268192] R13: 0000000000000000 R14: 00007ffc6fe60c78 R15: 0000000000000000
      [    2.268767] Code: e8 5f c3 ff ff 83 c3 01 41 83 c7 01 81 fb c7 01 00 00 74 48 44 39 7d 04 76 42 48 63 c3 48 8d 04 c0 41 8b 44 06 05 83 f8 ff 75 c1 <0f> 0b 49 8b 45 50 48 8d b8 b0 00 00 00 e8 37 59 69 00 b9 01 00
      [    2.270434] RIP: build_free_nids+0x337/0x3f0 RSP: ffffae7fc0857c50
      [    2.271426] ---[ end trace ab20c06cd3c8fde4 ]---
      
      During loading NAT entries, we will do sanity check, once the entry info
      is corrupted, it will cause BUG_ON directly to protect user data from
      being overwrited.
      
      In this case, it will be better to just return failure on mount() instead
      of panic, so that user can get hint from kmsg and try fsck for recovery
      immediately rather than after an abnormal reboot.
      
      https://bugzilla.kernel.org/show_bug.cgi?id=199769Reported-by: NAnatoly Trosinenko <anatoly.trosinenko@gmail.com>
      Signed-off-by: NChao Yu <yuchao0@huawei.com>
      Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
      e2374015
    • W
      f2fs: support in-memory inode checksum when checking consistency · 54c55c4e
      Weichao Guo 提交于
      Enable in-memory inode checksum to protect metadata blocks from
      in-memory scribbles when checking consistency, which has no
      performance requirements.
      Signed-off-by: NWeichao Guo <guoweichao@huawei.com>
      Reviewed-by: NChao Yu <yuchao0@huawei.com>
      Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
      54c55c4e
    • C
      f2fs: fix error path of fill_super · 4e423832
      Chao Yu 提交于
      In fill_super, if root inode's attribute is incorrect, we need to
      call f2fs_destroy_stats to release stats memory.
      Signed-off-by: NChao Yu <yuchao0@huawei.com>
      Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
      4e423832
    • C
      f2fs: relocate readdir_ra configure initialization · 4cac90d5
      Chao Yu 提交于
      readdir_ra is sysfs configuration instead of mount option, so it should
      not be initialized in default_options(), otherwise after remount, it can
      be reset to be enabled which may not as user wish, so let's move it to
      f2fs_tuning_parameters().
      Signed-off-by: NChao Yu <yuchao0@huawei.com>
      Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
      4cac90d5
    • C
      f2fs: move s_res{u,g}id initialization to default_options() · 0aa7e0f8
      Chao Yu 提交于
      Let default_options() initialize s_res{u,g}id with default value like
      other options.
      Signed-off-by: NChao Yu <yuchao0@huawei.com>
      Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
      0aa7e0f8
    • C
      f2fs: don't acquire orphan ino during recovery · 76a45e3c
      Chao Yu 提交于
      During orphan inode recovery, checkpoint should never succeed due to
      SBI_POR_DOING flag, so we don't need acquire orphan ino which only be
      used by checkpoint.
      Signed-off-by: NChao Yu <yuchao0@huawei.com>
      Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
      76a45e3c
    • J
      f2fs: avoid potential deadlock in f2fs_sbi_store · a1933c09
      Jaegeuk Kim 提交于
      [  155.018460] ======================================================
      [  155.021431] WARNING: possible circular locking dependency detected
      [  155.024339] 4.18.0-rc3+ #5 Tainted: G           OE
      [  155.026879] ------------------------------------------------------
      [  155.029783] umount/2901 is trying to acquire lock:
      [  155.032187] 00000000c4282f1f (kn->count#130){++++}, at: kernfs_remove+0x1f/0x30
      [  155.035439]
      [  155.035439] but task is already holding lock:
      [  155.038892] 0000000056e4307b (&type->s_umount_key#41){++++}, at: deactivate_super+0x33/0x50
      [  155.042602]
      [  155.042602] which lock already depends on the new lock.
      [  155.042602]
      [  155.047465]
      [  155.047465] the existing dependency chain (in reverse order) is:
      [  155.051354]
      [  155.051354] -> #1 (&type->s_umount_key#41){++++}:
      [  155.054768]        f2fs_sbi_store+0x61/0x460 [f2fs]
      [  155.057083]        kernfs_fop_write+0x113/0x1a0
      [  155.059277]        __vfs_write+0x36/0x180
      [  155.061250]        vfs_write+0xbe/0x1b0
      [  155.063179]        ksys_write+0x55/0xc0
      [  155.065068]        do_syscall_64+0x60/0x1b0
      [  155.067071]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      [  155.069529]
      [  155.069529] -> #0 (kn->count#130){++++}:
      [  155.072421]        __kernfs_remove+0x26f/0x2e0
      [  155.074452]        kernfs_remove+0x1f/0x30
      [  155.076342]        kobject_del.part.5+0xe/0x40
      [  155.078354]        f2fs_put_super+0x12d/0x290 [f2fs]
      [  155.080500]        generic_shutdown_super+0x6c/0x110
      [  155.082655]        kill_block_super+0x21/0x50
      [  155.084634]        kill_f2fs_super+0x9c/0xc0 [f2fs]
      [  155.086726]        deactivate_locked_super+0x3f/0x70
      [  155.088826]        cleanup_mnt+0x3b/0x70
      [  155.090584]        task_work_run+0x93/0xc0
      [  155.092367]        exit_to_usermode_loop+0xf0/0x100
      [  155.094466]        do_syscall_64+0x162/0x1b0
      [  155.096312]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      [  155.098603]
      [  155.098603] other info that might help us debug this:
      [  155.098603]
      [  155.102418]  Possible unsafe locking scenario:
      [  155.102418]
      [  155.105134]        CPU0                    CPU1
      [  155.107037]        ----                    ----
      [  155.108910]   lock(&type->s_umount_key#41);
      [  155.110674]                                lock(kn->count#130);
      [  155.113010]                                lock(&type->s_umount_key#41);
      [  155.115608]   lock(kn->count#130);
      Reviewed-by: NChao Yu <yuchao0@huawei.com>
      Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
      a1933c09
    • J
      f2fs: indicate shutdown f2fs to allow unmount successfully · 83a3bfdb
      Jaegeuk Kim 提交于
      Once we shutdown f2fs, we have to flush stale pages in order to unmount
      the system. In order to make stable, we need to stop fault injection as well.
      Reviewed-by: NChao Yu <yuchao0@huawei.com>
      Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
      83a3bfdb
    • J
      f2fs: keep meta pages in cp_error state · af697c0f
      Jaegeuk Kim 提交于
      It turns out losing meta pages in shutdown period makes f2fs very unstable
      so that I could see many unexpected error conditions.
      
      Let's keep meta pages for fault injection and sudden power-off tests.
      Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
      af697c0f
  2. 15 7月, 2018 2 次提交
    • J
      f2fs: do checkpoint in kill_sb · 1cb50f87
      Jaegeuk Kim 提交于
      When unmounting f2fs in force mode, we can get it stuck by io_schedule()
      by some pending IOs in meta_inode.
      
      io_schedule+0xd/0x30
      wait_on_page_bit_common+0xc6/0x130
      __filemap_fdatawait_range+0xbd/0x100
      filemap_fdatawait_keep_errors+0x15/0x40
      sync_inodes_sb+0x1cf/0x240
      sync_filesystem+0x52/0x90
      generic_shutdown_super+0x1d/0x110
      kill_f2fs_super+0x28/0x80 [f2fs]
      deactivate_locked_super+0x35/0x60
      cleanup_mnt+0x36/0x70
      task_work_run+0x79/0xa0
      exit_to_usermode_loop+0x62/0x70
      do_syscall_64+0xdb/0xf0
      entry_SYSCALL_64_after_hwframe+0x44/0xa9
      0xffffffffffffffff
      Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
      1cb50f87
    • J
      f2fs: allow wrong configured dio to buffered write · 8a56dd96
      Jaegeuk Kim 提交于
      This fixes to support dio having unaligned buffers as buffered writes.
      
      xfs_io -f -d -c "pwrite 0 512" $testfile
       -> okay
      
      xfs_io -f -d -c "pwrite 1 512" $testfile
       -> EINVAL
      Reviewed-by: NChao Yu <yuchao0@huawei.com>
      Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
      8a56dd96
  3. 12 7月, 2018 1 次提交
  4. 06 7月, 2018 2 次提交
    • L
      Fix up non-directory creation in SGID directories · 0fa3ecd8
      Linus Torvalds 提交于
      sgid directories have special semantics, making newly created files in
      the directory belong to the group of the directory, and newly created
      subdirectories will also become sgid.  This is historically used for
      group-shared directories.
      
      But group directories writable by non-group members should not imply
      that such non-group members can magically join the group, so make sure
      to clear the sgid bit on non-directories for non-members (but remember
      that sgid without group execute means "mandatory locking", just to
      confuse things even more).
      Reported-by: NJann Horn <jannh@google.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0fa3ecd8
    • L
      autofs: rename 'autofs' module back to 'autofs4' · d02d21ea
      Linus Torvalds 提交于
      It turns out that systemd has a bug: it wants to load the autofs module
      early because of some initialization ordering with udev, and it doesn't
      do that correctly.  Everywhere else it does the proper "look up module
      name" that does the proper alias resolution, but in that early code, it
      just uses a hardcoded "autofs4" for the module name.
      
      The result of that is that as of commit a2225d93 ("autofs: remove
      left-over autofs4 stubs"), you get
      
          systemd[1]: Failed to insert module 'autofs4': No such file or directory
      
      in the system logs, and a lack of module loading.  All this despite the
      fact that we had very clearly marked 'autofs4' as an alias for this
      module.
      
      What's so ridiculous about this is that literally everything else does
      the module alias handling correctly, including really old versions of
      systemd (that just used 'modprobe' to do this), and even all the other
      systemd module loading code.
      
      Only that special systemd early module load code is broken, hardcoding
      the module names for not just 'autofs4', but also "ipv6", "unix",
      "ip_tables" and "virtio_rng".  Very annoying.
      
      Instead of creating an _additional_ separate compatibility 'autofs4'
      module, just rely on the fact that everybody else gets this right, and
      just call the module 'autofs4' for compatibility reasons, with 'autofs'
      as the alias name.
      
      That will allow the systemd people to fix their bugs, adding the proper
      alias handling, and maybe even fix the name of the module to be just
      "autofs" (so that they can _test_ the alias handling).  And eventually,
      we can revert this silly compatibility hack.
      
      See also
      
          https://github.com/systemd/systemd/issues/9501
          https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=902946
      
      for the systemd bug reports upstream and in the Debian bug tracker
      respectively.
      
      Fixes: a2225d93 ("autofs: remove left-over autofs4 stubs")
      Reported-by: NBen Hutchings <ben@decadent.org.uk>
      Reported-by: NMichael Biebl <biebl@debian.org>
      Cc: Ian Kent <raven@themaw.net>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d02d21ea
  5. 04 7月, 2018 1 次提交
  6. 29 6月, 2018 1 次提交
    • L
      Revert changes to convert to ->poll_mask() and aio IOCB_CMD_POLL · a11e1d43
      Linus Torvalds 提交于
      The poll() changes were not well thought out, and completely
      unexplained.  They also caused a huge performance regression, because
      "->poll()" was no longer a trivial file operation that just called down
      to the underlying file operations, but instead did at least two indirect
      calls.
      
      Indirect calls are sadly slow now with the Spectre mitigation, but the
      performance problem could at least be largely mitigated by changing the
      "->get_poll_head()" operation to just have a per-file-descriptor pointer
      to the poll head instead.  That gets rid of one of the new indirections.
      
      But that doesn't fix the new complexity that is completely unwarranted
      for the regular case.  The (undocumented) reason for the poll() changes
      was some alleged AIO poll race fixing, but we don't make the common case
      slower and more complex for some uncommon special case, so this all
      really needs way more explanations and most likely a fundamental
      redesign.
      
      [ This revert is a revert of about 30 different commits, not reverted
        individually because that would just be unnecessarily messy  - Linus ]
      
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a11e1d43
  7. 28 6月, 2018 4 次提交
    • F
      Btrfs: fix mount failure when qgroup rescan is in progress · e4e7ede7
      Filipe Manana 提交于
      If a power failure happens while the qgroup rescan kthread is running,
      the next mount operation will always fail. This is because of a recent
      regression that makes qgroup_rescan_init() incorrectly return -EINVAL
      when we are mounting the filesystem (through btrfs_read_qgroup_config()).
      This causes the -EINVAL error to be returned regardless of any qgroup
      flags being set instead of returning the error only when neither of
      the flags BTRFS_QGROUP_STATUS_FLAG_RESCAN nor BTRFS_QGROUP_STATUS_FLAG_ON
      are set.
      
      A test case for fstests follows up soon.
      
      Fixes: 9593bf49 ("btrfs: qgroup: show more meaningful qgroup_rescan_init error message")
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      e4e7ede7
    • C
      Btrfs: fix regression in btrfs_page_mkwrite() from vm_fault_t conversion · 717beb96
      Chris Mason 提交于
      The vm_fault_t conversion commit introduced a ret2 variable for tracking
      the integer return values from internal btrfs functions.  It was
      sometimes returning VM_FAULT_LOCKED for pages that were actually invalid
      and had been removed from the radix.  Something like this:
      
          ret2 = btrfs_delalloc_reserve_space() // returns zero on success
      
          lock_page(page)
          if (page->mapping != inode->i_mapping)
      	goto out_unlock;
      
      ...
      
      out_unlock:
          if (!ret2) {
      	    ...
      	    return VM_FAULT_LOCKED;
          }
      
      This ends up triggering this WARNING in btrfs_destroy_inode()
          WARN_ON(BTRFS_I(inode)->block_rsv.size);
      
      xfstests generic/095 was able to reliably reproduce the errors.
      
      Since out_unlock: is only used for errors, this fix moves it below the
      if (!ret2) check we use to return VM_FAULT_LOCKED for success.
      
      Fixes: a528a241 (btrfs: change return type of btrfs_page_mkwrite to vm_fault_t)
      Signed-off-by: NChris Mason <clm@fb.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      717beb96
    • Q
      btrfs: quota: Set rescan progress to (u64)-1 if we hit last leaf · 6f7de19e
      Qu Wenruo 提交于
      Commit ff3d27a0 ("btrfs: qgroup: Finish rescan when hit the last leaf
      of extent tree") added a new exit for rescan finish.
      
      However after finishing quota rescan, we set
      fs_info->qgroup_rescan_progress to (u64)-1 before we exit through the
      original exit path.
      While we missed that assignment of (u64)-1 in the new exit path.
      
      The end result is, the quota status item doesn't have the same value.
      (-1 vs the last bytenr + 1)
      Although it doesn't affect quota accounting, it's still better to keep
      the original behavior.
      Reported-by: NMisono Tomohiro <misono.tomohiro@jp.fujitsu.com>
      Fixes: ff3d27a0 ("btrfs: qgroup: Finish rescan when hit the last leaf of extent tree")
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NMisono Tomohiro <misono.tomohiro@jp.fujitsu.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      6f7de19e
    • C
      proc: add proc_seq_release · 877f919e
      Chunyu Hu 提交于
      kmemleak reported some memory leak on reading proc files. After adding
      some debug lines, find that proc_seq_fops is using seq_release as
      release handler, which won't handle the free of 'private' field of
      seq_file, while in fact the open handler proc_seq_open could create
      the private data with __seq_open_private when state_size is greater
      than zero. So after reading files created with proc_create_seq_private,
      such as /proc/timer_list and /proc/vmallocinfo, the private mem of a
      seq_file is not freed. Fix it by adding the paired proc_seq_release
      as the default release handler of proc_seq_ops instead of seq_release.
      
      Fixes: 44414d82 ("proc: introduce proc_create_seq_private")
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      CC: Christoph Hellwig <hch@lst.de>
      Signed-off-by: NChunyu Hu <chuhu@redhat.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      877f919e
  8. 27 6月, 2018 1 次提交
  9. 25 6月, 2018 8 次提交
    • D
      xfs: fix fdblocks accounting w/ RMAPBT per-AG reservation · d8cb5e42
      Darrick J. Wong 提交于
      In __xfs_ag_resv_init we incorrectly calculate the amount by which to
      decrease fdblocks when reserving blocks for the rmapbt.  Because rmapbt
      allocations do not decrease fdblocks, we must decrease fdblocks by the
      entire size of the requested reservation in order to achieve our goal of
      always having enough free blocks to satisfy an rmapbt expansion.
      
      This is in contrast to the refcountbt/finobt, which /do/ subtract from
      fdblocks whenever they allocate a block.  For this allocation type we
      preserve the existing behavior where we decrease fdblocks only by the
      requested reservation minus the size of the existing tree.
      
      This fixes the problem where the available block counts reported by
      statfs change across a remount if there had been an rmapbt size change
      since mount time.
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NAllison Henderson <allison.henderson@oracle.com>
      d8cb5e42
    • D
      xfs: ensure post-EOF zeroing happens after zeroing part of a file · e53c4b59
      Darrick J. Wong 提交于
      If a user asks us to zero_range part of a file, the end of the range is
      EOF, and not aligned to a page boundary, invoke writeback of the EOF
      page to ensure that the post-EOF part of the page is zeroed.  This
      ensures that we don't expose stale memory contents via mmap, if in a
      clumsy manner.
      
      Found by running generic/127 when it runs zero_range and mapread at EOF
      one after the other.
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NAllison Henderson <allison.henderson@oracle.com>
      e53c4b59
    • D
      xfs: fix off-by-one error in xfs_rtalloc_query_range · a3a374bf
      Darrick J. Wong 提交于
      In commit 8ad560d2 ("xfs: strengthen rtalloc query range checks")
      we strengthened the input parameter checks in the rtbitmap range query
      function, but introduced an off-by-one error in the process.  The call
      to xfs_rtfind_forw deals with the high key being rextents, but we clamp
      the high key to rextents - 1.  This causes the returned results to stop
      one block short of the end of the rtdev, which is incorrect.
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NAllison Henderson <allison.henderson@oracle.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      a3a374bf
    • D
      xfs: fix uninitialized field in rtbitmap fsmap backend · 232d0a24
      Darrick J. Wong 提交于
      Initialize the extent count field of the high key so that when we use
      the high key to synthesize an 'unknown owner' record (i.e. used space
      record) at the end of the queried range we have a field with which to
      compute rm_blockcount.  This is not strictly necessary because the
      synthesizer never uses the rm_blockcount field, but we can shut up the
      static code analysis anyway.
      
      Coverity-id: 1437358
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NAllison Henderson <allison.henderson@oracle.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      232d0a24
    • D
      xfs: recheck reflink state after grabbing ILOCK_SHARED for a write · 5bd88d15
      Darrick J. Wong 提交于
      The reflink iflag could have changed since the earlier unlocked check,
      so if we got ILOCK_SHARED for a write and but we're now a reflink inode
      we have to switch to ILOCK_EXCL and relock.
      
      This helps us avoid blowing lock assertions in things like generic/166:
      
      XFS: Assertion failed: xfs_isilocked(ip, XFS_ILOCK_EXCL), file: fs/xfs/xfs_reflink.c, line: 383
      WARNING: CPU: 1 PID: 24707 at fs/xfs/xfs_message.c:104 assfail+0x25/0x30 [xfs]
      Modules linked in: deadline_iosched dm_snapshot dm_bufio ext4 mbcache jbd2 dm_flakey xfs libcrc32c dax_pmem device_dax nd_pmem sch_fq_codel af_packet [last unloaded: scsi_debug]
      CPU: 1 PID: 24707 Comm: xfs_io Not tainted 4.18.0-rc1-djw #1
      Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-1ubuntu1 04/01/2014
      RIP: 0010:assfail+0x25/0x30 [xfs]
      Code: ff 0f 0b c3 90 66 66 66 66 90 48 89 f1 41 89 d0 48 c7 c6 e8 ef 1b a0 48 89 fa 31 ff e8 54 f9 ff ff 80 3d fd ba 0f 00 00 75 03 <0f> 0b c3 0f 0b 66 0f 1f 44 00 00 66 66 66 66 90 48 63 f6 49 89 f9
      RSP: 0018:ffffc90006423ad8 EFLAGS: 00010246
      RAX: 0000000000000000 RBX: ffff880030b65e80 RCX: 0000000000000000
      RDX: 00000000ffffffc0 RSI: 000000000000000a RDI: ffffffffa01b0447
      RBP: ffffc90006423c10 R08: 0000000000000000 R09: 0000000000000000
      R10: ffff88003d43fc30 R11: f000000000000000 R12: ffff880077cda000
      R13: 0000000000000000 R14: ffffc90006423c30 R15: ffffc90006423bf9
      FS:  00007feba8986800(0000) GS:ffff88003ec00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 000000000138ab58 CR3: 000000003d40a000 CR4: 00000000000006a0
      Call Trace:
       xfs_reflink_allocate_cow+0x24c/0x3d0 [xfs]
       xfs_file_iomap_begin+0x6d2/0xeb0 [xfs]
       ? iomap_to_fiemap+0x80/0x80
       iomap_apply+0x5e/0x130
       iomap_dio_rw+0x2e0/0x400
       ? iomap_to_fiemap+0x80/0x80
       ? xfs_file_dio_aio_write+0x133/0x4a0 [xfs]
       xfs_file_dio_aio_write+0x133/0x4a0 [xfs]
       xfs_file_write_iter+0x7b/0xb0 [xfs]
       __vfs_write+0x16f/0x1f0
       vfs_write+0xc8/0x1c0
       ksys_pwrite64+0x74/0x90
       do_syscall_64+0x56/0x180
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      5bd88d15
    • D
      xfs: don't allow insert-range to shift extents past the maximum offset · f62cb48e
      Darrick J. Wong 提交于
      Zorro Lang reports that generic/485 blows an assert on a filesystem with
      512 byte blocks.  The test tries to fallocate a post-eof extent at the
      maximum file size and calls insert range to shift the extents right by
      two blocks.  On a 512b block filesystem this causes startoff to overflow
      the 54-bit startoff field, leading to the assert.
      
      Therefore, always check the rightmost extent to see if it would overflow
      prior to invoking the insert range machinery.
      
      Reported-by: zlang@redhat.com
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=200137Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NAllison Henderson <allison.henderson@oracle.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      f62cb48e
    • D
      xfs: don't trip over negative free space in xfs_reserve_blocks · aafe12ce
      Darrick J. Wong 提交于
      If we somehow end up with a filesystem that has fewer free blocks than
      the blocks set aside to avoid ENOSPC deadlocks, it's possible that the
      free space calculation in xfs_reserve_blocks will spit out a negative
      number (because percpu_counter_sum returns s64).  We fail to notice
      this negative number and set fdblks_delta to it.  Now we increment
      fdblocks(!) and the unsigned type of m_resblks means that we end up
      setting a ridiculously huge m_resblks reservation.
      
      Avoid this comedy of errors by detecting the negative free space and
      returning -ENOSPC.
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NAllison Henderson <allison.henderson@oracle.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      aafe12ce
    • D
      xfs: allow empty transactions while frozen · 10ee2526
      Darrick J. Wong 提交于
      In commit e89c0413 ("xfs: implement the GETFSMAP ioctl") we
      created the ability to obtain empty transactions.  These transactions
      have no log or block reservations and therefore can't modify anything.
      Since they're also NO_WRITECOUNT they can run while the fs is frozen,
      so we don't need to WARN_ON about that usage.
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NAllison Henderson <allison.henderson@oracle.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      10ee2526
  10. 22 6月, 2018 5 次提交
    • F
      Btrfs: fix return value on rename exchange failure · c5b4a50b
      Filipe Manana 提交于
      If we failed during a rename exchange operation after starting/joining a
      transaction, we would end up replacing the return value, stored in the
      local 'ret' variable, with the return value from btrfs_end_transaction().
      So this could end up returning 0 (success) to user space despite the
      operation having failed and aborted the transaction, because if there are
      multiple tasks having a reference on the transaction at the time
      btrfs_end_transaction() is called by the rename exchange, that function
      returns 0 (otherwise it returns -EIO and not the original error value).
      So fix this by not overwriting the return value on error after getting
      a transaction handle.
      
      Fixes: cdd1fedf ("btrfs: add support for RENAME_EXCHANGE and RENAME_WHITEOUT")
      CC: stable@vger.kernel.org # 4.9+
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      c5b4a50b
    • D
      xfs: xfs_iflush_abort() can be called twice on cluster writeback failure · e53946db
      Dave Chinner 提交于
      When a corrupt inode is detected during xfs_iflush_cluster, we can
      get a shutdown ASSERT failure like this:
      
      XFS (pmem1): Metadata corruption detected at xfs_symlink_shortform_verify+0x5c/0xa0, inode 0x86627 data fork
      XFS (pmem1): Unmount and run xfs_repair
      XFS (pmem1): xfs_do_force_shutdown(0x8) called from line 3372 of file fs/xfs/xfs_inode.c.  Return address = ffffffff814f4116
      XFS (pmem1): Corruption of in-memory data detected.  Shutting down filesystem
      XFS (pmem1): xfs_do_force_shutdown(0x1) called from line 222 of file fs/xfs/libxfs/xfs_defer.c.  Return address = ffffffff814a8a88
      XFS (pmem1): xfs_do_force_shutdown(0x1) called from line 222 of file fs/xfs/libxfs/xfs_defer.c.  Return address = ffffffff814a8ef9
      XFS (pmem1): Please umount the filesystem and rectify the problem(s)
      XFS: Assertion failed: xfs_isiflocked(ip), file: fs/xfs/xfs_inode.h, line: 258
      .....
      Call Trace:
       xfs_iflush_abort+0x10a/0x110
       xfs_iflush+0xf3/0x390
       xfs_inode_item_push+0x126/0x1e0
       xfsaild+0x2c5/0x890
       kthread+0x11c/0x140
       ret_from_fork+0x24/0x30
      
      Essentially, xfs_iflush_abort() has been called twice on the
      original inode that that was flushed. This happens because the
      inode has been flushed to teh buffer successfully via
      xfs_iflush_int(), and so when another inode is detected as corrupt
      in xfs_iflush_cluster, the buffer is marked stale and EIO, and
      iodone callbacks are run on it.
      
      Running the iodone callbacks walks across the original inode and
      calls xfs_iflush_abort() on it. When xfs_iflush_cluster() returns
      to xfs_iflush(), it runs the error path for that function, and that
      calls xfs_iflush_abort() on the inode a second time, leading to the
      above assert failure as the inode is not flush locked anymore.
      
      This bug has been there a long time.
      
      The simple fix would be to just avoid calling xfs_iflush_abort() in
      xfs_iflush() if we've got a failure from xfs_iflush_cluster().
      However, xfs_iflush_cluster() has magic delwri buffer handling that
      means it may or may not have run IO completion on the buffer, and
      hence sometimes we have to call xfs_iflush_abort() from
      xfs_iflush(), and sometimes we shouldn't.
      
      After reading through all the error paths and the delwri buffer
      code, it's clear that the error handling in xfs_iflush_cluster() is
      unnecessary. If the buffer is delwri, it leaves it on the delwri
      list so that when the delwri list is submitted it sees a shutdown
      fliesystem in xfs_buf_submit() and that marks the buffer stale, EIO
      and runs IO completion. i.e. exactly what xfs+iflush_cluster() does
      when it's not a delwri buffer. Further, marking a buffer stale
      clears the _XBF_DELWRI_Q flag on the buffer, which means when
      submission of the buffer occurs, it just skips over it and releases
      it.
      
      IOWs, the error handling in xfs_iflush_cluster doesn't need to care
      if the buffer is already on a the delwri queue or not - it just
      needs to mark the buffer stale, EIO and run completions. That means
      we can just use the easy fix for xfs_iflush() to avoid the double
      abort.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      e53946db
    • D
      xfs: More robust inode extent count validation · 23fcb334
      Dave Chinner 提交于
      When the inode is in extent format, it can't have more extents that
      fit in the inode fork. We don't currenty check this, and so this
      corruption goes unnoticed by the inode verifiers. This can lead to
      crashes operating on invalid in-memory structures.
      
      Attempts to access such a inode will now error out in the verifier
      rather than allowing modification operations to proceed.
      Reported-by: NWen Xu <wen.xu@gatech.edu>
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      [darrick: fix a typedef, add some braces and breaks to shut up compiler warnings]
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      23fcb334
    • C
      xfs: simplify xfs_bmap_punch_delalloc_range · e2ac8363
      Christoph Hellwig 提交于
      Instead of using xfs_bmapi_read to find delalloc extents and then punch
      them out using xfs_bunmapi, opencode the loop to iterate over the extents
      and call xfs_bmap_del_extent_delay directly.  This both simplifies the
      code and reduces the number of extent tree lookups required.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      e2ac8363
    • L
      btrfs: fix invalid-free in btrfs_extent_same · 22883ddc
      Lu Fengqi 提交于
      If this condition ((BTRFS_I(src)->flags & BTRFS_INODE_NODATASUM) !=
      		   (BTRFS_I(dst)->flags & BTRFS_INODE_NODATASUM))
      is hit, we will go to free the uninitialized cmp.src_pages and
      cmp.dst_pages.
      
      Fixes: 67b07bd4 ("Btrfs: reuse cmp workspace in EXTENT_SAME ioctl")
      Signed-off-by: NLu Fengqi <lufq.fnst@cn.fujitsu.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      22883ddc