1. 03 5月, 2016 17 次提交
    • A
      parallel lookups: actual switch to rwsem · 9902af79
      Al Viro 提交于
      ta-da!
      
      The main issue is the lack of down_write_killable(), so the places
      like readdir.c switched to plain inode_lock(); once killable
      variants of rwsem primitives appear, that'll be dealt with.
      
      lockdep side also might need more work
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      9902af79
    • A
      parallel lookups machinery, part 4 (and last) · d9171b93
      Al Viro 提交于
      If we *do* run into an in-lookup match, we need to wait for it to
      cease being in-lookup.  Fortunately, we do have unused space in
      in-lookup dentries - d_lru is never looked at until it stops being
      in-lookup.
      
      So we can stash a pointer to wait_queue_head from stack frame of
      the caller of ->lookup().  Some precautions are needed while
      waiting, but it's not that hard - we do hold a reference to dentry
      we are waiting for, so it can't go away.  If it's found to be
      in-lookup the wait_queue_head is still alive and will remain so
      at least while ->d_lock is held.  Moreover, the condition we
      are waiting for becomes true at the same point where everything
      on that wq gets woken up, so we can just add ourselves to the
      queue once.
      
      d_alloc_parallel() gets a pointer to wait_queue_head_t from its
      caller; lookup_slow() adjusted, d_add_ci() taught to use
      d_alloc_parallel() if the dentry passed to it happens to be
      in-lookup one (i.e. if it's been called from the parallel lookup).
      
      That's pretty much it - all that remains is to switch ->i_mutex
      to rwsem and have lookup_slow() take it shared.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      d9171b93
    • A
      parallel lookups machinery, part 3 · 94bdd655
      Al Viro 提交于
      We will need to be able to check if there is an in-lookup
      dentry with matching parent/name.  Right now it's impossible,
      but as soon as start locking directories shared such beasts
      will appear.
      
      Add a secondary hash for locating those.  Hash chains go through
      the same space where d_alias will be once it's not in-lookup anymore.
      Search is done under the same bitlock we use for modifications -
      with the primary hash we can rely on d_rehash() into the wrong
      chain being the worst that could happen, but here the pointers are
      buggered once it's removed from the chain.  On the other hand,
      the chains are not going to be long and normally we'll end up
      adding to the chain anyway.  That allows us to avoid bothering with
      ->d_lock when doing the comparisons - everything is stable until
      removed from chain.
      
      New helper: d_alloc_parallel().  Right now it allocates, verifies
      that no hashed and in-lookup matches exist and adds to in-lookup
      hash.
      
      Returns ERR_PTR() for error, hashed match (in the unlikely case it's
      been found) or new dentry.  In-lookup matches trigger BUG() for
      now; that will change in the next commit when we introduce waiting
      for ongoing lookup to finish.  Note that in-lookup matches won't be
      possible until we actually go for shared locking.
      
      lookup_slow() switched to use of d_alloc_parallel().
      
      Again, these commits are separated only for making it easier to
      review.  All this machinery will start doing something useful only
      when we go for shared locking; it's just that the combination is
      too large for my taste.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      94bdd655
    • A
      parallel lookups machinery, part 2 · 84e710da
      Al Viro 提交于
      We'll need to verify that there's neither a hashed nor in-lookup
      dentry with desired parent/name before adding to in-lookup set.
      
      One possible solution would be to hold the parent's ->d_lock through
      both checks, but while the in-lookup set is relatively small at any
      time, dcache is not.  And holding the parent's ->d_lock through
      something like __d_lookup_rcu() would suck too badly.
      
      So we leave the parent's ->d_lock alone, which means that we watch
      out for the following scenario:
      	* we verify that there's no hashed match
      	* existing in-lookup match gets hashed by another process
      	* we verify that there's no in-lookup matches and decide
      that everything's fine.
      
      Solution: per-directory kinda-sorta seqlock, bumped around the times
      we hash something that used to be in-lookup or move (and hash)
      something in place of in-lookup.  Then the above would turn into
      	* read the counter
      	* do dcache lookup
      	* if no matches found, check for in-lookup matches
      	* if there had been none of those either, check if the
      counter has changed; repeat if it has.
      
      The "kinda-sorta" part is due to the fact that we don't have much spare
      space in inode.  There is a spare word (shared with i_bdev/i_cdev/i_pipe),
      so the counter part is not a problem, but spinlock is a different story.
      
      We could use the parent's ->d_lock, and it would be less painful in
      terms of contention, for __d_add() it would be rather inconvenient to
      grab; we could do that (using lock_parent()), but...
      
      Fortunately, we can get serialization on the counter itself, and it
      might be a good idea in general; we can use cmpxchg() in a loop to
      get from even to odd and smp_store_release() from odd to even.
      
      This commit adds the counter and updating logics; the readers will be
      added in the next commit.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      84e710da
    • A
      beginning of transition to parallel lookups - marking in-lookup dentries · 85c7f810
      Al Viro 提交于
      marked as such when (would be) parallel lookup is about to pass them
      to actual ->lookup(); unmarked when
      	* __d_add() is about to make it hashed, positive or not.
      	* __d_move() (from d_splice_alias(), directly or via
      __d_unalias()) puts a preexisting dentry in its place
      	* in caller of ->lookup() if it has escaped all of the
      above.  Bug (WARN_ON, actually) if it reaches the final dput()
      or d_instantiate() while still marked such.
      
      As the result, we are guaranteed that for as long as the flag is
      set, dentry will
      	* remain negative unhashed with positive refcount
      	* never have its ->d_alias looked at
      	* never have its ->d_lru looked at
      	* never have its ->d_parent and ->d_name changed
      
      Right now we have at most one such for any given parent directory.
      With parallel lookups that restriction will weaken to
      	* only exist when parent is locked shared
      	* at most one with given (parent,name) pair (comparison of
      names is according to ->d_compare())
      	* only exist when there's no hashed dentry with the same
      (parent,name)
      
      Transition will take the next several commits; unfortunately, we'll
      only be able to switch to rwsem at the end of this series.  The
      reason for not making it a single patch is to simplify review.
      
      New primitives: d_in_lookup() (a predicate checking if dentry is in
      the in-lookup state) and d_lookup_done() (tells the system that
      we are done with lookup and if it's still marked as in-lookup, it
      should cease to be such).
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      85c7f810
    • A
      __d_add(): don't drop/regain ->d_lock · 0568d705
      Al Viro 提交于
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      0568d705
    • A
      1936386e
    • A
      nfs: missing wakeup in nfs_unblock_sillyrename() · d2caaa0a
      Al Viro 提交于
      will be needed as soon as lookups are not serialized by ->i_mutex
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      d2caaa0a
    • A
      make ext2_get_page() and friends work without external serialization · be5b82db
      Al Viro 提交于
      Right now ext2_get_page() (and its analogues in a bunch of other filesystems)
      relies upon the directory being locked - the way it sets and tests Checked and
      Error bits would be racy without that.  Switch to a slightly different scheme,
      _not_ setting Checked in case of failure.  That way the logics becomes
      	if Checked => OK
      	else if Error => fail
      	else if !validate => fail
      	else => OK
      with validation setting Checked or Error on success and failure resp. and
      returning which one had happened.  Equivalent to the current logics, but unlike
      the current logics not sensitive to the order of set_bit, test_bit getting
      reordered by CPU, etc.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      be5b82db
    • A
      ovl_lookup_real(): use lookup_one_len_unlocked() · b9e1d435
      Al Viro 提交于
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      b9e1d435
    • A
      reconnect_one(): use lookup_one_len_unlocked() · 383d4e8a
      Al Viro 提交于
      ... and explain the non-obvious logics in case when lookup yields
      a different dentry.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      383d4e8a
    • A
      reiserfs: open-code reiserfs_mutex_lock_safe() in reiserfs_unpack() · 1ae1f3f6
      Al Viro 提交于
      ... and have it use inode_lock()
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      1ae1f3f6
    • A
      orangefs: don't open-code inode_lock/inode_unlock · 5ecfcb26
      Al Viro 提交于
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      5ecfcb26
    • A
      ocfs2: don't open-code inode_lock/inode_unlock · 7b9743eb
      Al Viro 提交于
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      7b9743eb
    • A
      configfs_detach_prep(): make sure that wait_mutex won't go away · 48f35b7b
      Al Viro 提交于
      grab a reference to dentry we'd got the sucker from, and return
      that dentry via *wait, rather than just returning the address of
      ->i_mutex.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      48f35b7b
    • A
      kernfs: use lookup_one_len_unlocked() · 779b8391
      Al Viro 提交于
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      779b8391
    • A
  2. 11 4月, 2016 6 次提交
  3. 09 4月, 2016 7 次提交
  4. 07 4月, 2016 1 次提交
    • F
      Btrfs: fix file/data loss caused by fsync after rename and new inode · 56f23fdb
      Filipe Manana 提交于
      If we rename an inode A (be it a file or a directory), create a new
      inode B with the old name of inode A and under the same parent directory,
      fsync inode B and then power fail, at log tree replay time we end up
      removing inode A completely. If inode A is a directory then all its files
      are gone too.
      
      Example scenarios where this happens:
      This is reproducible with the following steps, taken from a couple of
      test cases written for fstests which are going to be submitted upstream
      soon:
      
         # Scenario 1
      
         mkfs.btrfs -f /dev/sdc
         mount /dev/sdc /mnt
         mkdir -p /mnt/a/x
         echo "hello" > /mnt/a/x/foo
         echo "world" > /mnt/a/x/bar
         sync
         mv /mnt/a/x /mnt/a/y
         mkdir /mnt/a/x
         xfs_io -c fsync /mnt/a/x
         <power failure happens>
      
         The next time the fs is mounted, log tree replay happens and
         the directory "y" does not exist nor do the files "foo" and
         "bar" exist anywhere (neither in "y" nor in "x", nor the root
         nor anywhere).
      
         # Scenario 2
      
         mkfs.btrfs -f /dev/sdc
         mount /dev/sdc /mnt
         mkdir /mnt/a
         echo "hello" > /mnt/a/foo
         sync
         mv /mnt/a/foo /mnt/a/bar
         echo "world" > /mnt/a/foo
         xfs_io -c fsync /mnt/a/foo
         <power failure happens>
      
         The next time the fs is mounted, log tree replay happens and the
         file "bar" does not exists anymore. A file with the name "foo"
         exists and it matches the second file we created.
      
      Another related problem that does not involve file/data loss is when a
      new inode is created with the name of a deleted snapshot and we fsync it:
      
         mkfs.btrfs -f /dev/sdc
         mount /dev/sdc /mnt
         mkdir /mnt/testdir
         btrfs subvolume snapshot /mnt /mnt/testdir/snap
         btrfs subvolume delete /mnt/testdir/snap
         rmdir /mnt/testdir
         mkdir /mnt/testdir
         xfs_io -c fsync /mnt/testdir # or fsync some file inside /mnt/testdir
         <power failure>
      
         The next time the fs is mounted the log replay procedure fails because
         it attempts to delete the snapshot entry (which has dir item key type
         of BTRFS_ROOT_ITEM_KEY) as if it were a regular (non-root) entry,
         resulting in the following error that causes mount to fail:
      
         [52174.510532] BTRFS info (device dm-0): failed to delete reference to snap, inode 257 parent 257
         [52174.512570] ------------[ cut here ]------------
         [52174.513278] WARNING: CPU: 12 PID: 28024 at fs/btrfs/inode.c:3986 __btrfs_unlink_inode+0x178/0x351 [btrfs]()
         [52174.514681] BTRFS: Transaction aborted (error -2)
         [52174.515630] Modules linked in: btrfs dm_flakey dm_mod overlay crc32c_generic ppdev xor raid6_pq acpi_cpufreq parport_pc tpm_tis sg parport tpm evdev i2c_piix4 proc
         [52174.521568] CPU: 12 PID: 28024 Comm: mount Tainted: G        W       4.5.0-rc6-btrfs-next-27+ #1
         [52174.522805] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
         [52174.524053]  0000000000000000 ffff8801df2a7710 ffffffff81264e93 ffff8801df2a7758
         [52174.524053]  0000000000000009 ffff8801df2a7748 ffffffff81051618 ffffffffa03591cd
         [52174.524053]  00000000fffffffe ffff88015e6e5000 ffff88016dbc3c88 ffff88016dbc3c88
         [52174.524053] Call Trace:
         [52174.524053]  [<ffffffff81264e93>] dump_stack+0x67/0x90
         [52174.524053]  [<ffffffff81051618>] warn_slowpath_common+0x99/0xb2
         [52174.524053]  [<ffffffffa03591cd>] ? __btrfs_unlink_inode+0x178/0x351 [btrfs]
         [52174.524053]  [<ffffffff81051679>] warn_slowpath_fmt+0x48/0x50
         [52174.524053]  [<ffffffffa03591cd>] __btrfs_unlink_inode+0x178/0x351 [btrfs]
         [52174.524053]  [<ffffffff8118f5e9>] ? iput+0xb0/0x284
         [52174.524053]  [<ffffffffa0359fe8>] btrfs_unlink_inode+0x1c/0x3d [btrfs]
         [52174.524053]  [<ffffffffa038631e>] check_item_in_log+0x1fe/0x29b [btrfs]
         [52174.524053]  [<ffffffffa0386522>] replay_dir_deletes+0x167/0x1cf [btrfs]
         [52174.524053]  [<ffffffffa038739e>] fixup_inode_link_count+0x289/0x2aa [btrfs]
         [52174.524053]  [<ffffffffa038748a>] fixup_inode_link_counts+0xcb/0x105 [btrfs]
         [52174.524053]  [<ffffffffa038a5ec>] btrfs_recover_log_trees+0x258/0x32c [btrfs]
         [52174.524053]  [<ffffffffa03885b2>] ? replay_one_extent+0x511/0x511 [btrfs]
         [52174.524053]  [<ffffffffa034f288>] open_ctree+0x1dd4/0x21b9 [btrfs]
         [52174.524053]  [<ffffffffa032b753>] btrfs_mount+0x97e/0xaed [btrfs]
         [52174.524053]  [<ffffffff8108e1b7>] ? trace_hardirqs_on+0xd/0xf
         [52174.524053]  [<ffffffff8117bafa>] mount_fs+0x67/0x131
         [52174.524053]  [<ffffffff81193003>] vfs_kern_mount+0x6c/0xde
         [52174.524053]  [<ffffffffa032af81>] btrfs_mount+0x1ac/0xaed [btrfs]
         [52174.524053]  [<ffffffff8108e1b7>] ? trace_hardirqs_on+0xd/0xf
         [52174.524053]  [<ffffffff8108c262>] ? lockdep_init_map+0xb9/0x1b3
         [52174.524053]  [<ffffffff8117bafa>] mount_fs+0x67/0x131
         [52174.524053]  [<ffffffff81193003>] vfs_kern_mount+0x6c/0xde
         [52174.524053]  [<ffffffff8119590f>] do_mount+0x8a6/0x9e8
         [52174.524053]  [<ffffffff811358dd>] ? strndup_user+0x3f/0x59
         [52174.524053]  [<ffffffff81195c65>] SyS_mount+0x77/0x9f
         [52174.524053]  [<ffffffff814935d7>] entry_SYSCALL_64_fastpath+0x12/0x6b
         [52174.561288] ---[ end trace 6b53049efb1a3ea6 ]---
      
      Fix this by forcing a transaction commit when such cases happen.
      This means we check in the commit root of the subvolume tree if there
      was any other inode with the same reference when the inode we are
      fsync'ing is a new inode (created in the current transaction).
      
      Test cases for fstests, covering all the scenarios given above, were
      submitted upstream for fstests:
      
        * fstests: generic test for fsync after renaming directory
          https://patchwork.kernel.org/patch/8694281/
      
        * fstests: generic test for fsync after renaming file
          https://patchwork.kernel.org/patch/8694301/
      
        * fstests: add btrfs test for fsync after snapshot deletion
          https://patchwork.kernel.org/patch/8670671/
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      56f23fdb
  5. 05 4月, 2016 2 次提交
    • K
      mm, fs: remove remaining PAGE_CACHE_* and page_cache_{get,release} usage · ea1754a0
      Kirill A. Shutemov 提交于
      Mostly direct substitution with occasional adjustment or removing
      outdated comments.
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ea1754a0
    • K
      mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros · 09cbfeaf
      Kirill A. Shutemov 提交于
      PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
      ago with promise that one day it will be possible to implement page
      cache with bigger chunks than PAGE_SIZE.
      
      This promise never materialized.  And unlikely will.
      
      We have many places where PAGE_CACHE_SIZE assumed to be equal to
      PAGE_SIZE.  And it's constant source of confusion on whether
      PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
      especially on the border between fs and mm.
      
      Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
      breakage to be doable.
      
      Let's stop pretending that pages in page cache are special.  They are
      not.
      
      The changes are pretty straight-forward:
      
       - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
      
       - page_cache_get() -> get_page();
      
       - page_cache_release() -> put_page();
      
      This patch contains automated changes generated with coccinelle using
      script below.  For some reason, coccinelle doesn't patch header files.
      I've called spatch for them manually.
      
      The only adjustment after coccinelle is revert of changes to
      PAGE_CAHCE_ALIGN definition: we are going to drop it later.
      
      There are few places in the code where coccinelle didn't reach.  I'll
      fix them manually in a separate patch.  Comments and documentation also
      will be addressed with the separate patch.
      
      virtual patch
      
      @@
      expression E;
      @@
      - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      expression E;
      @@
      - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      @@
      - PAGE_CACHE_SHIFT
      + PAGE_SHIFT
      
      @@
      @@
      - PAGE_CACHE_SIZE
      + PAGE_SIZE
      
      @@
      @@
      - PAGE_CACHE_MASK
      + PAGE_MASK
      
      @@
      expression E;
      @@
      - PAGE_CACHE_ALIGN(E)
      + PAGE_ALIGN(E)
      
      @@
      expression E;
      @@
      - page_cache_get(E)
      + get_page(E)
      
      @@
      expression E;
      @@
      - page_cache_release(E)
      + put_page(E)
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      09cbfeaf
  6. 04 4月, 2016 7 次提交