1. 27 Apr, 2016 (4 commits)
  2. 15 Apr, 2016 (15 commits)
    • f2fs: flush dirty pages before starting atomic writes · c27753d6
      Committed by Jaegeuk Kim
      If somebody wrote some data before the atomic writes, we should flush it so
      that the atomic data is handled in the right period.
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
    • f2fs: don't invalidate atomic page if successful · 63c52d78
      Committed by Jaegeuk Kim
      If the atomic write was committed successfully, we don't need to invalidate its pages.
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
    • f2fs: give -E2BIG for no space in xattr · 58457f1c
      Committed by Jaegeuk Kim
      This patch returns -E2BIG if there is no space to add an xattr entry.
      This should fix generic/026 in xfstests as well.
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
    • f2fs: remove redundant condition check · 4da7bf5a
      Committed by Jaegeuk Kim
      This patch removes the redundant condition check reported by David.
      Reported-by: David Binderman <dcb314@hotmail.com>
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
    • f2fs: unset atomic/volatile flag in f2fs_release_file · 26dc3d44
      Committed by Jaegeuk Kim
      An atomic/volatile operation should be done as a pair of start and commit
      ioctls.
      For example, if a killed process leaves an atomic operation open-ended, we
      should drop its flag as well as its atomic data. Otherwise, if sqlite initiates
      another operation which doesn't require atomic writes, it will lose all of its
      data, since f2fs still treats the writes as atomic and nobody will trigger the
      commit.
      Reported-by: Miao Xie <miaoxie@huawei.com>
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
    • f2fs: fix dropping inmemory pages in a wrong time · de5307e4
      Committed by Jaegeuk Kim
      When one reader closes its file while another writer is doing atomic writes,
      f2fs_release_file drops the atomic data, resulting in an empty commit.
      This patch fixes this wrong commit problem by checking the openness of the file.
      
       Process0                       Process1
                                      open file
       start atomic write
       write data
       read data
                                      close file
                                      f2fs_release_file()
                                      clear atomic data
       commit atomic write
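
The race in the diagram can be reproduced in a small user-space model. The sketch below is not f2fs code: `ToyInode`, `ToyFile`, `release` and the `check_openness` switch are hypothetical names, and "only the last writable close may drop atomic data" is an assumed reading of "checking the openness of the file".

```python
class ToyInode:
    """Hypothetical stand-in for an inode with pending atomic data."""

    def __init__(self):
        self.writers = 0        # number of writable opens
        self.atomic = False     # set between start and commit
        self.inmem_pages = []   # uncommitted atomic data
        self.committed = []     # data that reached the file


class ToyFile:
    def __init__(self, inode, writable):
        self.inode = inode
        self.writable = writable
        if writable:
            inode.writers += 1


def start_atomic_write(f):
    f.inode.atomic = True


def write(f, data):
    f.inode.inmem_pages.append(data)


def commit_atomic_write(f):
    inode = f.inode
    inode.committed.extend(inode.inmem_pages)
    inode.inmem_pages.clear()
    inode.atomic = False


def release(f, check_openness):
    """Model of a release hook that runs on every close."""
    inode = f.inode
    last_writer = f.writable and inode.writers == 1
    if f.writable:
        inode.writers -= 1
    if check_openness and not last_writer:
        return  # another opener closed the file: keep the atomic data
    inode.inmem_pages.clear()
    inode.atomic = False


def run(check_openness):
    inode = ToyInode()
    writer = ToyFile(inode, writable=True)    # Process0
    reader = ToyFile(inode, writable=False)   # Process1: open file
    start_atomic_write(writer)
    write(writer, "data")
    release(reader, check_openness)           # Process1: close file
    commit_atomic_write(writer)
    return inode.committed


print("without check:", run(False))
print("with check:   ", run(True))
```

In this model the same `release` path still cleans up after a writer that exits without committing, which is the case the previous commit (26dc3d44) describes.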
      Reported-by: Miao Xie <miaoxie@huawei.com>
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
    • f2fs: add BUG_ON to avoid unnecessary flow · ff373558
      Committed by Jaegeuk Kim
      This patch adds a BUG_ON instead of a retry loop.
      In the case of node pages, we already got this inode page, but unlocked it.
      Since we don't truncate any node pages in operations, the page's mapping
      should not change.
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
    • f2fs: use PGP_LOCK to check its truncation · 4a6de50d
      Committed by Jaegeuk Kim
      Previously, after trylock_page succeeded, the page's mapping was not checked.
      In order to fix that, we can just pass FGP_LOCK to pagecache_get_page.
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
    • f2fs: fix to convert inline directory correctly · 675f10bd
      Committed by Chao Yu
      With the steps below, we lose some of the dirents:
      
      1) mount f2fs with inline_dentry option
      2) echo 1 > /sys/fs/f2fs/sdX/dir_level
      3) mkdir dir
      4) touch 180 files named [1-180] in dir
      5) touch 181 in dir
      6) echo 3 > /proc/sys/vm/drop_caches
      7) ll dir
      
      ls: cannot access 2: No such file or directory
      ls: cannot access 4: No such file or directory
      ls: cannot access 5: No such file or directory
      ls: cannot access 6: No such file or directory
      ls: cannot access 8: No such file or directory
      ls: cannot access 9: No such file or directory
      ...
      total 360
      drwxr-xr-x 2 root root 4096 Feb 19 15:12 ./
      drwxr-xr-x 3 root root 4096 Feb 19 15:11 ../
      -rw-r--r-- 1 root root    0 Feb 19 15:12 1
      -rw-r--r-- 1 root root    0 Feb 19 15:12 10
      -rw-r--r-- 1 root root    0 Feb 19 15:12 100
      -????????? ? ?    ?       ?            ? 101
      -????????? ? ?    ?       ?            ? 102
      -????????? ? ?    ?       ?            ? 103
      ...
      
      The reason is that, when doing the inline dir conversion, we didn't consider
      that a directory has a hierarchical hash structure which can be configured
      through the sysfs interface 'dir_level'.

      By default, the dir_level of a directory inode is 0, which means there is one
      bucket in the first-level hash table and all dirents are hashed into this
      bucket, so simply copying entries from the inline dentry page to the
      converted normal dentry page causes no problem.

      However, if dir_level is configured with a value N (greater than 0), the
      first-level hash table grows by 2^N - 1 buckets and dirents are hashed into
      different buckets according to their hash values. If we still move all
      dirents into the first bucket, the inline dirents end up in the wrong
      location. As a result, although we can iterate over all dirents through
      ->readdir, we can't stat some of them in ->lookup, which is based on
      hash table searching.

      This patch fixes this issue by rehashing dirents into the correct position
      when converting an inline directory.
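
The mismatch between ->readdir and ->lookup can be illustrated with a toy model. This is only a sketch: `toy_hash` is a made-up stand-in for the real dentry hash, the bucket lists ignore the on-disk block layout, and every function name here is hypothetical. The only property taken from the text above is that the first level has one bucket for dir_level 0 and 2^N buckets for dir_level N.

```python
def toy_hash(name):
    """Stand-in hash (assumption): any deterministic function works here."""
    return sum(name.encode())


def bucket_count(dir_level):
    """First-level buckets: 1 for dir_level 0, 2**N for dir_level N."""
    return 1 << dir_level


def convert_naive(names, dir_level):
    """Buggy conversion: copy every inline dirent into the first bucket."""
    buckets = [[] for _ in range(bucket_count(dir_level))]
    buckets[0].extend(names)
    return buckets


def convert_rehash(names, dir_level):
    """Fixed conversion: place each dirent in the bucket its hash selects."""
    buckets = [[] for _ in range(bucket_count(dir_level))]
    for name in names:
        buckets[toy_hash(name) % len(buckets)].append(name)
    return buckets


def readdir(buckets):
    """Iteration walks every bucket, so it sees all entries either way."""
    return [name for bucket in buckets for name in bucket]


def lookup(buckets, name):
    """Lookup only searches the bucket selected by the hash."""
    return name in buckets[toy_hash(name) % len(buckets)]


names = [str(i) for i in range(1, 182)]
naive = convert_naive(names, dir_level=1)
fixed = convert_rehash(names, dir_level=1)

print("entries seen by readdir:", len(readdir(naive)))
print("lookups failing (naive):", sum(not lookup(naive, n) for n in names))
print("lookups failing (fixed):", sum(not lookup(fixed, n) for n in names))
```

Which names go missing depends on the hash, so the toy model does not reproduce the exact file names from the listing above, only the shape of the failure.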
      Signed-off-by: Chao Yu <chao2.yu@samsung.com>
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
    • f2fs: show current mount status · 8c11a53f
      Committed by Jaegeuk Kim
      This patch shows the current mount status in the f2fs status info.
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
    • f2fs: treat as a normal umount when remounting ro · faa0e55b
      Committed by Jaegeuk Kim
      When the user remounts f2fs as read-only, we can mark the checkpoint as umount.
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
    • f2fs: give -EINVAL for norecovery and rw mount · 6781eabb
      Committed by Jaegeuk Kim
      If f2fs detects something to recover while both the norecovery and rw mount
      options are given, it should stop mounting.
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
    • f2fs: recover superblock at RW remounts · df728b0f
      Committed by Jaegeuk Kim
      This patch adds an sbi flag, SBI_NEED_SB_WRITE, which indicates that the
      superblock needs to be recovered when (re)mounting as RW. It is set only when
      f2fs is mounted as RO.
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
    • f2fs: give RO message when recovering superblock · f2353d7b
      Committed by Jaegeuk Kim
      When one of the superblocks is missing, f2fs recovers it from the valid one.
      Even if f2fs is mounted as RO, we'd better notify the user of that too.
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
    • Make file credentials available to the seqfile interfaces · 34dbbcdb
      Committed by Linus Torvalds
      A lot of seqfile users seem to be using things like %pK that uses the
      credentials of the current process, but that is actually completely
      wrong for filesystem interfaces.
      
      The unix semantics for permission checking files is to check permissions
      at _open_ time, not at read or write time, and that is not just a small
      detail: passing off stdin/stdout/stderr to a suid application and making
      the actual IO happen in privileged context is a classic exploit
      technique.
      
      So if we want to be able to look at permissions at read time, we need to
      use the file open credentials, not the current ones.  Normal file
      accesses can just use "f_cred" (or any of the helper functions that do
      that, like file_ns_capable()), but the seqfile interfaces do not have
      any such options.
      
      It turns out that seq_file _does_ save away the user_ns information of
      the file, though.  Since user_ns is just part of the full credential
      information, replace that special case with saving off the cred pointer
      instead, and suddenly seq_file has all the permission information it
      needs.
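
The difference between checking the current credentials and the open-time credentials can be shown with a toy model. Nothing below is kernel API: `Cred`, `ToySeqFile`, `show_pointer` and the "print 0x0 for unprivileged readers" masking rule are all illustrative assumptions. The scenario is the one described above, where an unprivileged user opens a file and hands the descriptor to a privileged program that performs the read.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cred:
    name: str
    privileged: bool


class ToySeqFile:
    """Hypothetical seq_file-like object that remembers who opened it."""

    def __init__(self, opener):
        self.file_cred = opener  # captured once, at open time

    def show_pointer(self, address, current, use_file_cred):
        # Decide whose credentials gate the output: the opener's
        # or those of whichever process happens to be reading.
        cred = self.file_cred if use_file_cred else current
        return hex(address) if cred.privileged else "0x0"


user = Cred("user", privileged=False)
suid_helper = Cred("suid-helper", privileged=True)

seq = ToySeqFile(opener=user)       # opened by the unprivileged user
address = 0xffff8801df2a7710        # arbitrary example value

# The read happens in the privileged helper's context.
print("current creds:  ", seq.show_pointer(address, suid_helper, use_file_cred=False))
print("open-time creds:", seq.show_pointer(address, suid_helper, use_file_cred=True))
```

Checking the reader's credentials reveals the value to whoever opened the file, while checking the saved open-time credentials keeps it masked.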
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 13 Apr, 2016 (4 commits)
  4. 11 Apr, 2016 (1 commit)
    • Revert "ext4: allow readdir()'s of large empty directories to be interrupted" · 9f2394c9
      Committed by Linus Torvalds
      This reverts commit 1028b55b.
      
      It's broken: it makes ext4 return an error at an invalid point, causing
      the readdir wrappers to write the position of the last successful
      directory entry into the position field, which means that the next
      readdir will now return that last successful entry _again_.
      
      You can only return fatal errors (that terminate the readdir directory
      walk) from within the filesystem readdir functions, the "normal" errors
      (that happen when the readdir buffer fills up, for example) happen in
      the iterator where we know the position of the actual failing entry.
      
      I do have a very different patch that does the "signal_pending()"
      handling inside the iterator function where it is allowable, but while
      that one passes all the sanity checks, I screwed up something like four
      times while emailing it out, so I'm not going to commit it today.
      
      So my track record is not good enough, and the stars will have to align
      better before that one gets committed.  And it would be good to get some
      review too, of course, since celestial alignments are always an iffy
      debugging model.
      
      IOW, let's just revert the commit that caused the problem for now.
      Reported-by: Greg Thelen <gthelen@google.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  5. 09 Apr, 2016 (7 commits)
  6. 07 Apr, 2016 (1 commit)
    • Btrfs: fix file/data loss caused by fsync after rename and new inode · 56f23fdb
      Committed by Filipe Manana
      If we rename an inode A (be it a file or a directory), create a new
      inode B with the old name of inode A and under the same parent directory,
      fsync inode B and then power fail, at log tree replay time we end up
      removing inode A completely. If inode A is a directory then all its files
      are gone too.
      
      Example scenarios where this happens, reproducible with the following
      steps taken from a couple of test cases written for fstests:
      
         # Scenario 1
      
         mkfs.btrfs -f /dev/sdc
         mount /dev/sdc /mnt
         mkdir -p /mnt/a/x
         echo "hello" > /mnt/a/x/foo
         echo "world" > /mnt/a/x/bar
         sync
         mv /mnt/a/x /mnt/a/y
         mkdir /mnt/a/x
         xfs_io -c fsync /mnt/a/x
         <power failure happens>
      
         The next time the fs is mounted, log tree replay happens and
         the directory "y" does not exist nor do the files "foo" and
         "bar" exist anywhere (neither in "y" nor in "x", nor the root
         nor anywhere).
      
         # Scenario 2
      
         mkfs.btrfs -f /dev/sdc
         mount /dev/sdc /mnt
         mkdir /mnt/a
         echo "hello" > /mnt/a/foo
         sync
         mv /mnt/a/foo /mnt/a/bar
         echo "world" > /mnt/a/foo
         xfs_io -c fsync /mnt/a/foo
         <power failure happens>
      
         The next time the fs is mounted, log tree replay happens and the
         file "bar" does not exist anymore. A file with the name "foo"
         exists and it matches the second file we created.
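
For readers who want to see which system calls Scenario 2 boils down to, the sketch below replays the same sequence from Python in a temporary directory. It is only an illustration of the call order: it does not create a btrfs filesystem, cannot simulate the power failure, and the helper names `write_and_close`, `fsync_path` and `scenario_2` are made up for this example.

```python
import os
import tempfile


def write_and_close(path, text):
    """Equivalent of `echo text > path`."""
    with open(path, "w") as fh:
        fh.write(text)


def fsync_path(path):
    """Equivalent of `xfs_io -c fsync path`: fsync a single inode."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)


def scenario_2(parent):
    foo = os.path.join(parent, "foo")
    bar = os.path.join(parent, "bar")
    write_and_close(foo, "hello\n")   # inode A, named "foo"
    os.sync()                         # commit the current state
    os.rename(foo, bar)               # inode A is now "bar"
    write_and_close(foo, "world\n")   # new inode B takes the old name
    fsync_path(foo)                   # only inode B is fsynced
    return foo, bar


with tempfile.TemporaryDirectory() as parent:
    foo, bar = scenario_2(parent)
    print(sorted(os.listdir(parent)))
```

Without a crash, both names exist after the last step. The bug described above concerns what log replay reconstructs when a power failure follows the final fsync.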
      
      Another related problem that does not involve file/data loss is when a
      new inode is created with the name of a deleted snapshot and we fsync it:
      
         mkfs.btrfs -f /dev/sdc
         mount /dev/sdc /mnt
         mkdir /mnt/testdir
         btrfs subvolume snapshot /mnt /mnt/testdir/snap
         btrfs subvolume delete /mnt/testdir/snap
         rmdir /mnt/testdir
         mkdir /mnt/testdir
         xfs_io -c fsync /mnt/testdir # or fsync some file inside /mnt/testdir
         <power failure>
      
         The next time the fs is mounted the log replay procedure fails because
         it attempts to delete the snapshot entry (which has dir item key type
         of BTRFS_ROOT_ITEM_KEY) as if it were a regular (non-root) entry,
         resulting in the following error that causes mount to fail:
      
         [52174.510532] BTRFS info (device dm-0): failed to delete reference to snap, inode 257 parent 257
         [52174.512570] ------------[ cut here ]------------
         [52174.513278] WARNING: CPU: 12 PID: 28024 at fs/btrfs/inode.c:3986 __btrfs_unlink_inode+0x178/0x351 [btrfs]()
         [52174.514681] BTRFS: Transaction aborted (error -2)
         [52174.515630] Modules linked in: btrfs dm_flakey dm_mod overlay crc32c_generic ppdev xor raid6_pq acpi_cpufreq parport_pc tpm_tis sg parport tpm evdev i2c_piix4 proc
         [52174.521568] CPU: 12 PID: 28024 Comm: mount Tainted: G        W       4.5.0-rc6-btrfs-next-27+ #1
         [52174.522805] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
         [52174.524053]  0000000000000000 ffff8801df2a7710 ffffffff81264e93 ffff8801df2a7758
         [52174.524053]  0000000000000009 ffff8801df2a7748 ffffffff81051618 ffffffffa03591cd
         [52174.524053]  00000000fffffffe ffff88015e6e5000 ffff88016dbc3c88 ffff88016dbc3c88
         [52174.524053] Call Trace:
         [52174.524053]  [<ffffffff81264e93>] dump_stack+0x67/0x90
         [52174.524053]  [<ffffffff81051618>] warn_slowpath_common+0x99/0xb2
         [52174.524053]  [<ffffffffa03591cd>] ? __btrfs_unlink_inode+0x178/0x351 [btrfs]
         [52174.524053]  [<ffffffff81051679>] warn_slowpath_fmt+0x48/0x50
         [52174.524053]  [<ffffffffa03591cd>] __btrfs_unlink_inode+0x178/0x351 [btrfs]
         [52174.524053]  [<ffffffff8118f5e9>] ? iput+0xb0/0x284
         [52174.524053]  [<ffffffffa0359fe8>] btrfs_unlink_inode+0x1c/0x3d [btrfs]
         [52174.524053]  [<ffffffffa038631e>] check_item_in_log+0x1fe/0x29b [btrfs]
         [52174.524053]  [<ffffffffa0386522>] replay_dir_deletes+0x167/0x1cf [btrfs]
         [52174.524053]  [<ffffffffa038739e>] fixup_inode_link_count+0x289/0x2aa [btrfs]
         [52174.524053]  [<ffffffffa038748a>] fixup_inode_link_counts+0xcb/0x105 [btrfs]
         [52174.524053]  [<ffffffffa038a5ec>] btrfs_recover_log_trees+0x258/0x32c [btrfs]
         [52174.524053]  [<ffffffffa03885b2>] ? replay_one_extent+0x511/0x511 [btrfs]
         [52174.524053]  [<ffffffffa034f288>] open_ctree+0x1dd4/0x21b9 [btrfs]
         [52174.524053]  [<ffffffffa032b753>] btrfs_mount+0x97e/0xaed [btrfs]
         [52174.524053]  [<ffffffff8108e1b7>] ? trace_hardirqs_on+0xd/0xf
         [52174.524053]  [<ffffffff8117bafa>] mount_fs+0x67/0x131
         [52174.524053]  [<ffffffff81193003>] vfs_kern_mount+0x6c/0xde
         [52174.524053]  [<ffffffffa032af81>] btrfs_mount+0x1ac/0xaed [btrfs]
         [52174.524053]  [<ffffffff8108e1b7>] ? trace_hardirqs_on+0xd/0xf
         [52174.524053]  [<ffffffff8108c262>] ? lockdep_init_map+0xb9/0x1b3
         [52174.524053]  [<ffffffff8117bafa>] mount_fs+0x67/0x131
         [52174.524053]  [<ffffffff81193003>] vfs_kern_mount+0x6c/0xde
         [52174.524053]  [<ffffffff8119590f>] do_mount+0x8a6/0x9e8
         [52174.524053]  [<ffffffff811358dd>] ? strndup_user+0x3f/0x59
         [52174.524053]  [<ffffffff81195c65>] SyS_mount+0x77/0x9f
         [52174.524053]  [<ffffffff814935d7>] entry_SYSCALL_64_fastpath+0x12/0x6b
         [52174.561288] ---[ end trace 6b53049efb1a3ea6 ]---
      
      Fix this by forcing a transaction commit when such cases happen.
      This means we check in the commit root of the subvolume tree if there
      was any other inode with the same reference when the inode we are
      fsync'ing is a new inode (created in the current transaction).
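
The decision described in this paragraph can be written down as a small predicate. This is a hedged model, not the btrfs implementation: the function name, its parameters, and the dictionary standing in for the commit root of the subvolume tree are all hypothetical.

```python
def needs_transaction_commit(inode_id, created_in_current_transaction,
                             parent_id, name, commit_root_refs):
    """Return True when fsync should fall back to a full transaction commit.

    commit_root_refs maps (parent inode id, name) to the inode id that owned
    that reference in the last committed transaction (a stand-in for the
    commit root).
    """
    if not created_in_current_transaction:
        return False  # only new inodes are affected
    previous_owner = commit_root_refs.get((parent_id, name))
    return previous_owner is not None and previous_owner != inode_id


# Scenario 2: inode 257 was "foo" in directory 256 at the last commit,
# then it was renamed and a new inode 258 was created as "foo".
commit_root_refs = {(256, "foo"): 257}

print(needs_transaction_commit(258, True, 256, "foo", commit_root_refs))
print(needs_transaction_commit(259, True, 256, "baz", commit_root_refs))
```

The inode numbers are arbitrary; the point is that only a new inode reusing a name that another inode held in the committed state triggers the fallback.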
      
      Test cases for fstests, covering all the scenarios given above, were
      submitted upstream for fstests:
      
        * fstests: generic test for fsync after renaming directory
          https://patchwork.kernel.org/patch/8694281/
      
        * fstests: generic test for fsync after renaming file
          https://patchwork.kernel.org/patch/8694301/
      
        * fstests: add btrfs test for fsync after snapshot deletion
          https://patchwork.kernel.org/patch/8670671/
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
  7. 05 Apr, 2016 (2 commits)
    • mm, fs: remove remaining PAGE_CACHE_* and page_cache_{get,release} usage · ea1754a0
      Committed by Kirill A. Shutemov
      Mostly direct substitution with occasional adjustment or removing
      outdated comments.
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros · 09cbfeaf
      Committed by Kirill A. Shutemov
      PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long* time
      ago with the promise that one day it would be possible to implement the
      page cache with bigger chunks than PAGE_SIZE.

      This promise never materialized, and it is unlikely that it ever will.

      We have many places where PAGE_CACHE_SIZE is assumed to be equal to
      PAGE_SIZE, and it's a constant source of confusion whether a
      PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
      especially on the border between fs and mm.

      Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
      breakage to be doable.
      
      Let's stop pretending that pages in page cache are special.  They are
      not.
      
      The changes are pretty straight-forward:
      
       - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
      
       - page_cache_get() -> get_page();
      
       - page_cache_release() -> put_page();
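
The first two substitutions are safe because the shift amount is zero whenever the two constants are equal. A short worked check, assuming 4 KiB pages (`PAGE_SHIFT = 12`) purely for illustration and using made-up helper names:

```python
PAGE_SHIFT = 12                 # assumption: 4 KiB pages
PAGE_CACHE_SHIFT = PAGE_SHIFT   # the old macro was just an alias

PAGE_SIZE = 1 << PAGE_SHIFT
PAGE_CACHE_SIZE = 1 << PAGE_CACHE_SHIFT


def old_left(foo):
    """Old form: <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT)."""
    return foo << (PAGE_CACHE_SHIFT - PAGE_SHIFT)


def old_right(foo):
    """Old form: <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)."""
    return foo >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)


def new_form(foo):
    """New form after the substitution: just <foo>."""
    return foo


samples = [0, 1, 7, 4095, 4096, 123456789]
print(all(old_left(x) == new_form(x) == old_right(x) for x in samples))
print(PAGE_CACHE_SIZE == PAGE_SIZE)
```

The same reasoning covers the SIZE, MASK and ALIGN macros: each old name evaluates to exactly what its PAGE_* counterpart does.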
      
      This patch contains automated changes generated with coccinelle using
      the script below.  For some reason, coccinelle doesn't patch header files.
      I've called spatch for them manually.

      The only adjustment after coccinelle is a revert of the changes to the
      PAGE_CACHE_ALIGN definition: we are going to drop it later.

      There are a few places in the code that coccinelle didn't reach.  I'll
      fix them manually in a separate patch.  Comments and documentation will
      also be addressed in a separate patch.
      
      virtual patch
      
      @@
      expression E;
      @@
      - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      expression E;
      @@
      - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      @@
      - PAGE_CACHE_SHIFT
      + PAGE_SHIFT
      
      @@
      @@
      - PAGE_CACHE_SIZE
      + PAGE_SIZE
      
      @@
      @@
      - PAGE_CACHE_MASK
      + PAGE_MASK
      
      @@
      expression E;
      @@
      - PAGE_CACHE_ALIGN(E)
      + PAGE_ALIGN(E)
      
      @@
      expression E;
      @@
      - page_cache_get(E)
      + get_page(E)
      
      @@
      expression E;
      @@
      - page_cache_release(E)
      + put_page(E)
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  8. 04 Apr, 2016 (6 commits)