1. 03 9月, 2013 1 次提交
    • W
      vfs: use lockref_get_not_zero() for optimistic lockless dget_parent() · df3d0bbc
      Waiman Long 提交于
      A valid parent pointer is always going to have a non-zero reference
      count, but if we look up the parent optimistically without locking, we
      have to protect against the (very unlikely) race against renaming
      changing the parent from under us.
      
      We do that by using lockref_get_not_zero(), and then re-checking the
      parent pointer after getting a valid reference.
      
      [ This is a re-implementation of a chunk from the original patch by
        Waiman Long: "dcache: Enable lockless update of dentry's refcount".
        I've completely rewritten the patch-series and split it up, but I'm
        attributing this part to Waiman as it's close enough to his earlier
        patch  - Linus ]
      Signed-off-by: NWaiman Long <Waiman.Long@hp.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      df3d0bbc
  2. 29 8月, 2013 1 次提交
    • W
      vfs: make the dentry cache use the lockref infrastructure · 98474236
      Waiman Long 提交于
      This just replaces the dentry count/lock combination with the lockref
      structure that contains both a count and a spinlock, and does the
      mechanical conversion to use the lockref infrastructure.
      
      There are no semantic changes here, it's purely syntactic.  The
      reference lockref implementation uses the spinlock exactly the same way
      that the old dcache code did, and the bulk of this patch is just
      expanding the internal "d_count" use in the dcache code to use
      "d_lockref.count" instead.
      
      This is purely preparation for the real change to make the reference
      count updates be lockless during the 3.12 merge window.
      
      [ As with the previous commit, this is a rewritten version of a concept
        originally from Waiman, so credit goes to him, blame for any errors
        goes to me.
      
        Waiman's patch had some semantic differences for taking advantage of
        the lockless update in dget_parent(), while this patch is
        intentionally a pure search-and-replace change with no semantic
        changes.     - Linus ]
      Signed-off-by: NWaiman Long <Waiman.Long@hp.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      98474236
  3. 25 8月, 2013 1 次提交
  4. 29 6月, 2013 3 次提交
  5. 14 6月, 2013 1 次提交
    • S
      GFS2: Add atomic_open support · 6d4ade98
      Steven Whitehouse 提交于
      I've restricted atomic_open to only operate on regular files, although
      I still don't understand why atomic_open should not be possible also for
      directories on GFS2. That can always be added in later though, if it
      makes sense.
      
      The ->atomic_open function can be passed negative dentries, which
      in most cases means either ENOENT (->lookup) or a call to d_instantiate
      (->create). In the GFS2 case though, we need to actually perform the
      look up, since we do not know whether there has been a new inode created
      on another node. The look up calls d_splice_alias which then tries to
      rehash the dentry - so the solution here is to simply check for that
      in d_splice_alias. The same issue is likely to affect any other cluster
      filesystem implementing ->atomic_open
      Signed-off-by: NSteven Whitehouse <swhiteho@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: "J. Bruce Fields" <bfields fieldses org>
      Cc: Jeff Layton <jlayton@redhat.com>
      6d4ade98
  6. 05 5月, 2013 2 次提交
  7. 01 5月, 2013 1 次提交
    • G
      fs/dcache.c: add cond_resched() to shrink_dcache_parent() · 421348f1
      Greg Thelen 提交于
      Call cond_resched() in shrink_dcache_parent() to maintain interactivity.
      
      Before this patch:
      
      	void shrink_dcache_parent(struct dentry * parent)
      	{
      		while ((found = select_parent(parent, &dispose)) != 0)
      			shrink_dentry_list(&dispose);
      	}
      
      select_parent() populates the dispose list with dentries which
      shrink_dentry_list() then deletes.  select_parent() carefully uses
      need_resched() to avoid doing too much work at once.  But neither
      shrink_dcache_parent() nor its called functions call cond_resched().  So
      once need_resched() is set select_parent() will return single dentry
      dispose list which is then deleted by shrink_dentry_list().  This is
      inefficient when there are a lot of dentry to process.  This can cause
      softlockup and hurts interactivity on non preemptable kernels.
      
      This change adds cond_resched() in shrink_dcache_parent().  The benefit
      of this is that need_resched() is quickly cleared so that future calls
      to select_parent() are able to efficiently return a big batch of dentry.
      
      These additional cond_resched() do not seem to impact performance, at
      least for the workload below.
      
      Here is a program which can cause soft lockup if other system activity
      sets need_resched().
      
      	int main()
      	{
      	        struct rlimit rlim;
      	        int i;
      	        int f[100000];
      	        char buf[20];
      	        struct timeval t1, t2;
      	        double diff;
      
      	        /* cleanup past run */
      	        system("rm -rf x");
      
      	        /* boost nfile rlimit */
      	        rlim.rlim_cur = 200000;
      	        rlim.rlim_max = 200000;
      	        if (setrlimit(RLIMIT_NOFILE, &rlim))
      	                err(1, "setrlimit");
      
      	        /* make directory for files */
      	        if (mkdir("x", 0700))
      	                err(1, "mkdir");
      
      	        if (gettimeofday(&t1, NULL))
      	                err(1, "gettimeofday");
      
      	        /* populate directory with open files */
      	        for (i = 0; i < 100000; i++) {
      	                snprintf(buf, sizeof(buf), "x/%d", i);
      	                f[i] = open(buf, O_CREAT);
      	                if (f[i] == -1)
      	                        err(1, "open");
      	        }
      
      	        /* close some of the files */
      	        for (i = 0; i < 85000; i++)
      	                close(f[i]);
      
      	        /* unlink all files, even open ones */
      	        system("rm -rf x");
      
      	        if (gettimeofday(&t2, NULL))
      	                err(1, "gettimeofday");
      
      	        diff = (((double)t2.tv_sec * 1000000 + t2.tv_usec) -
      	                ((double)t1.tv_sec * 1000000 + t1.tv_usec));
      
      	        printf("done: %g elapsed\n", diff/1e6);
      	        return 0;
      	}
      Signed-off-by: NGreg Thelen <gthelen@google.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      Cc: <stable@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      421348f1
  8. 27 3月, 2013 1 次提交
    • A
      Nest rename_lock inside vfsmount_lock · 7ea600b5
      Al Viro 提交于
      ... lest we get livelocks between path_is_under() and d_path() and friends.
      
      The thing is, wrt fairness lglocks are more similar to rwsems than to rwlocks;
      it is possible to have thread B spin on attempt to take lock shared while thread
      A is already holding it shared, if B is on lower-numbered CPU than A and there's
      a thread C spinning on attempt to take the same lock exclusive.
      
      As the result, we need consistent ordering between vfsmount_lock (lglock) and
      rename_lock (seq_lock), even though everything that takes both is going to take
      vfsmount_lock only shared.
      Spotted-by: NBrad Spengler <spender@grsecurity.net>
      Cc: stable@vger.kernel.org
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      7ea600b5
  9. 28 2月, 2013 1 次提交
    • S
      hlist: drop the node parameter from iterators · b67bfe0d
      Sasha Levin 提交于
      I'm not sure why, but the hlist for each entry iterators were conceived
      
              list_for_each_entry(pos, head, member)
      
      The hlist ones were greedy and wanted an extra parameter:
      
              hlist_for_each_entry(tpos, pos, head, member)
      
      Why did they need an extra pos parameter? I'm not quite sure. Not only
      they don't really need it, it also prevents the iterator from looking
      exactly like the list iterator, which is unfortunate.
      
      Besides the semantic patch, there was some manual work required:
      
       - Fix up the actual hlist iterators in linux/list.h
       - Fix up the declaration of other iterators based on the hlist ones.
       - A very small amount of places were using the 'node' parameter, this
       was modified to use 'obj->member' instead.
       - Coccinelle didn't handle the hlist_for_each_entry_safe iterator
       properly, so those had to be fixed up manually.
      
      The semantic patch which is mostly the work of Peter Senna Tschudin is here:
      
      @@
      iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;
      
      type T;
      expression a,c,d,e;
      identifier b;
      statement S;
      @@
      
      -T b;
          <+... when != b
      (
      hlist_for_each_entry(a,
      - b,
      c, d) S
      |
      hlist_for_each_entry_continue(a,
      - b,
      c) S
      |
      hlist_for_each_entry_from(a,
      - b,
      c) S
      |
      hlist_for_each_entry_rcu(a,
      - b,
      c, d) S
      |
      hlist_for_each_entry_rcu_bh(a,
      - b,
      c, d) S
      |
      hlist_for_each_entry_continue_rcu_bh(a,
      - b,
      c) S
      |
      for_each_busy_worker(a, c,
      - b,
      d) S
      |
      ax25_uid_for_each(a,
      - b,
      c) S
      |
      ax25_for_each(a,
      - b,
      c) S
      |
      inet_bind_bucket_for_each(a,
      - b,
      c) S
      |
      sctp_for_each_hentry(a,
      - b,
      c) S
      |
      sk_for_each(a,
      - b,
      c) S
      |
      sk_for_each_rcu(a,
      - b,
      c) S
      |
      sk_for_each_from
      -(a, b)
      +(a)
      S
      + sk_for_each_from(a) S
      |
      sk_for_each_safe(a,
      - b,
      c, d) S
      |
      sk_for_each_bound(a,
      - b,
      c) S
      |
      hlist_for_each_entry_safe(a,
      - b,
      c, d, e) S
      |
      hlist_for_each_entry_continue_rcu(a,
      - b,
      c) S
      |
      nr_neigh_for_each(a,
      - b,
      c) S
      |
      nr_neigh_for_each_safe(a,
      - b,
      c, d) S
      |
      nr_node_for_each(a,
      - b,
      c) S
      |
      nr_node_for_each_safe(a,
      - b,
      c, d) S
      |
      - for_each_gfn_sp(a, c, d, b) S
      + for_each_gfn_sp(a, c, d) S
      |
      - for_each_gfn_indirect_valid_sp(a, c, d, b) S
      + for_each_gfn_indirect_valid_sp(a, c, d) S
      |
      for_each_host(a,
      - b,
      c) S
      |
      for_each_host_safe(a,
      - b,
      c, d) S
      |
      for_each_mesh_entry(a,
      - b,
      c, d) S
      )
          ...+>
      
      [akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
      [akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
      [akpm@linux-foundation.org: checkpatch fixes]
      [akpm@linux-foundation.org: fix warnings]
      [akpm@linux-foudnation.org: redo intrusive kvm changes]
      Tested-by: NPeter Senna Tschudin <peter.senna@gmail.com>
      Acked-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: NSasha Levin <sasha.levin@oracle.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Gleb Natapov <gleb@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b67bfe0d
  10. 26 2月, 2013 2 次提交
    • J
      vfs: kill FS_REVAL_DOT by adding a d_weak_revalidate dentry op · ecf3d1f1
      Jeff Layton 提交于
      The following set of operations on a NFS client and server will cause
      
          server# mkdir a
          client# cd a
          server# mv a a.bak
          client# sleep 30  # (or whatever the dir attrcache timeout is)
          client# stat .
          stat: cannot stat `.': Stale NFS file handle
      
      Obviously, we should not be getting an ESTALE error back there since the
      inode still exists on the server. The problem is that the lookup code
      will call d_revalidate on the dentry that "." refers to, because NFS has
      FS_REVAL_DOT set.
      
      nfs_lookup_revalidate will see that the parent directory has changed and
      will try to reverify the dentry by redoing a LOOKUP. That of course
      fails, so the lookup code returns ESTALE.
      
      The problem here is that d_revalidate is really a bad fit for this case.
      What we really want to know at this point is whether the inode is still
      good or not, but we don't really care what name it goes by or whether
      the dcache is still valid.
      
      Add a new d_op->d_weak_revalidate operation and have complete_walk call
      that instead of d_revalidate. The intent there is to allow for a
      "weaker" d_revalidate that just checks to see whether the inode is still
      good. This is also gives us an opportunity to kill off the FS_REVAL_DOT
      special casing.
      
      [AV: changed method name, added note in porting, fixed confusion re
      having it possibly called from RCU mode (it won't be)]
      
      Cc: NeilBrown <neilb@suse.de>
      Signed-off-by: NJeff Layton <jlayton@redhat.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      ecf3d1f1
    • A
      d_hash_and_lookup(): export, switch open-coded instances · 4f522a24
      Al Viro 提交于
      * calling conventions change - ERR_PTR() is returned on ->d_hash() errors;
      NULL is just for dcache miss now.
      * exported, open-coded instances in ncpfs and cifs converted.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      4f522a24
  11. 23 2月, 2013 4 次提交
  12. 21 12月, 2012 2 次提交
    • N
      vfs: d_obtain_alias() needs to use "/" as default name. · b911a6bd
      NeilBrown 提交于
      NFS appears to use d_obtain_alias() to create the root dentry rather than
      d_make_root.  This can cause 'prepend_path()' to complain that the root
      has a weird name if an NFS filesystem is lazily unmounted.  e.g.  if
      "/mnt" is an NFS mount then
      
       { cd /mnt; umount -l /mnt ; ls -l /proc/self/cwd; }
      
      will cause a WARN message like
         WARNING: at /home/git/linux/fs/dcache.c:2624 prepend_path+0x1d7/0x1e0()
         ...
         Root dentry has weird name <>
      
      to appear in kernel logs.
      
      So change d_obtain_alias() to use "/" rather than "" as the anonymous
      name.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      b911a6bd
    • J
      vfs: remove DCACHE_NEED_LOOKUP · 39e3c955
      Jeff Layton 提交于
      The code that relied on that flag was ripped out of btrfs quite some
      time ago, and never added back. Josef indicated that he was going to
      take a different approach to the problem in btrfs, and that we
      could just eliminate this flag.
      
      Cc: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: NJeff Layton <jlayton@redhat.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      39e3c955
  13. 30 9月, 2012 1 次提交
    • M
      vfs: dcache: fix deadlock in tree traversal · 8110e16d
      Miklos Szeredi 提交于
      IBM reported a deadlock in select_parent().  This was found to be caused
      by taking rename_lock when already locked when restarting the tree
      traversal.
      
      There are two cases when the traversal needs to be restarted:
      
       1) concurrent d_move(); this can only happen when not already locked,
          since taking rename_lock protects against concurrent d_move().
      
       2) racing with final d_put() on child just at the moment of ascending
          to parent; rename_lock doesn't protect against this rare race, so it
          can happen when already locked.
      
      Because of case 2, we need to be able to handle restarting the traversal
      when rename_lock is already held.  This patch fixes all three callers of
      try_to_ascend().
      
      IBM reported that the deadlock is gone with this patch.
      
      [ I rewrote the patch to be smaller and just do the "goto again" if the
        lock was already held, but credit goes to Miklos for the real work.
         - Linus ]
      Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: stable@vger.kernel.org
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8110e16d
  14. 28 9月, 2012 1 次提交
  15. 27 9月, 2012 1 次提交
  16. 19 9月, 2012 1 次提交
    • M
      vfs: dcache: use DCACHE_DENTRY_KILLED instead of DCACHE_DISCONNECTED in d_kill() · b161dfa6
      Miklos Szeredi 提交于
      IBM reported a soft lockup after applying the fix for the rename_lock
      deadlock.  Commit c83ce989 ("VFS: Fix the nfs sillyrename regression
      in kernel 2.6.38") was found to be the culprit.
      
      The nfs sillyrename fix used DCACHE_DISCONNECTED to indicate that the
      dentry was killed.  This flag can be set on non-killed dentries too,
      which results in infinite retries when trying to traverse the dentry
      tree.
      
      This patch introduces a separate flag: DCACHE_DENTRY_KILLED, which is
      only set in d_kill() and makes try_to_ascend() test only this flag.
      
      IBM reported successful test results with this patch.
      Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b161dfa6
  17. 14 7月, 2012 3 次提交
  18. 09 6月, 2012 1 次提交
  19. 31 5月, 2012 2 次提交
    • J
      vfs: remove unused __d_splice_alias argument · 3f50fff4
      J. Bruce Fields 提交于
      Nobody sets want_disconn any more.
      Reported-by: NPeng Tao <bergwolf@gmail.com>
      Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      3f50fff4
    • J
      vfs: stop d_splice_alias creating directory aliases · 7732a557
      J. Bruce Fields 提交于
      A directory should never have more than one dentry pointing to it.
      
      But d_splice_alias() will add one if it finds a directory with an
      already-existing non-DISCONNECTED dentry.
      
      I can't find an obvious reproducer, but I also can't see what prevents
      d_splice_alias() from encountering such a case.
      
      It therefore seems safest to allow d_splice_alias to use any dentry it
      finds.
      
      (Prior to the removal of dentry_unhash() from vfs_rmdir(), around v3.0,
      this could cause an nfsd deadlock like this:
      
      	- Somebody attempts to remove a non-empty directory.
      	- The dentry_unhash() in vfs_rmdir() unhashes the dentry
      	  pointing to the non-empty directory.
      	- ->rmdir() then fails with -ENOTEMPTY
      	- Before the vfs_rmdir() caller reaches dput(), an nfsd process
      	  in rename looks up the directory by filehandle; at the end of
      	  that lookup, this dentry is found by d_alloc_anon(), and a
      	  reference is taken on it, preventing dput() from removing it.
      	- A regular lookup of the directory calls d_splice_alias(),
      	  finds only an unhashed (not a DISCONNECTED) dentry, and
      	  insteads adds a new one, so the directory now has two
      	  dentries.
      	- The nfsd process in rename, which was previously looking up
      	  the source directory of the rename, now looks up the target
      	  directory (which is the same), and gets the dentry newly
      	  created by the previous lookup.
      	- The rename, seeing two different dentries, assumes this is a
      	  cross-directory rename and attempts to take the i_mutex on the
      	  directory twice.
      
      That reproducer no longer exists, but I don't think there was anything
      fundamentally incorrect about the vfs_rmdir() behavior there, so I think
      the real fault was here in d_splice_alias().)
      Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      7732a557
  20. 30 5月, 2012 1 次提交
    • A
      brlocks/lglocks: API cleanups · 962830df
      Andi Kleen 提交于
      lglocks and brlocks are currently generated with some complicated macros
      in lglock.h.  But there's no reason to not just use common utility
      functions and put all the data into a common data structure.
      
      In preparation, this patch changes the API to look more like normal
      function calls with pointers, not magic macros.
      
      The patch is rather large because I move over all users in one go to keep
      it bisectable.  This impacts the VFS somewhat in terms of lines changed.
      But no actual behaviour change.
      
      [akpm@linux-foundation.org: checkpatch fixes]
      Signed-off-by: NAndi Kleen <ak@linux.intel.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NRusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      962830df
  21. 24 5月, 2012 1 次提交
  22. 22 5月, 2012 2 次提交
    • L
      Revert "vfs: remove unnecessary d_unhashed() check from __d_lookup_rcu" · 2e321806
      Linus Torvalds 提交于
      This reverts commit 8c01a529.
      
      It turns out the d_unhashed() check isn't unnecessary after all: while
      it's true that unhashing will increment the sequence numbers, that does
      not necessarily invalidate the RCU lookup, because it might have seen
      the dentry pointer (before it got unhashed), but by the time it loaded
      the sequence number, it could have seen the *new* sequence number (after
      it got unhashed).
      
      End result: we might look up an unhashed dentry that is about to be
      freed, with the sequence number never indicating anything bad about it.
      So checking that the dentry is still hashed (*after* reading the sequence
      number) is indeed the proper fix, and was never unnecessary.
      Reported-by: NDave Jones <davej@redhat.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2e321806
    • L
      vfs: be even more careful about dentry RCU name lookups · 6326c71f
      Linus Torvalds 提交于
      Miklos Szeredi points out that we need to also worry about memory
      odering when doing the dentry name comparison asynchronously with RCU.
      
      In particular, doing a rename can do a memcpy() of one dentry name over
      another, and we want to make sure that any unlocked reader will always
      see the proper terminating NUL character, so that it won't ever run off
      the allocation.
      
      Rather than having to be extra careful with the name copy or at lookup
      time for each character, this resolves the issue by making sure that all
      names that are inlined in the dentry always have a NUL character at the
      end of the name allocation.  If we do that at dentry allocation time, we
      know that no future name copy will ever change that final NUL to
      anything else, so there are no memory ordering issues.
      
      So even if a concurrent rename ends up overwriting the NUL character
      that terminates the original name, we always know that there is one
      final NUL at the end, and there is no worry about the lockless RCU
      lookup traversing the name too far.
      
      The out-of-line allocations are never copied over, so we can just make
      sure that we write the name (with terminating NULL) and do a write
      barrier before we expose the name to anything else by setting it in the
      dentry.
      Reported-by: NMiklos Szeredi <mszeredi@suse.cz>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Nick Piggin <npiggin@gmail.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6326c71f
  23. 11 5月, 2012 4 次提交
  24. 05 5月, 2012 1 次提交
    • L
      vfs: clean up __d_lookup_rcu() and dentry_cmp() interfaces · 12f8ad4b
      Linus Torvalds 提交于
      The calling conventions for __d_lookup_rcu() and dentry_cmp() are
      annoying in different ways, and there is actually one single underlying
      reason for both of the annoyances.
      
      The fundamental reason is that we do the returned dentry sequence number
      check inside __d_lookup_rcu() instead of doing it in the caller.  This
      results in two annoyances:
      
       - __d_lookup_rcu() now not only needs to return the dentry and the
         sequence number that goes along with the lookup, it also needs to
         return the inode pointer that was validated by that sequence number
         check.
      
       - and because we did the sequence number check early (to validate the
         name pointer and length) we also couldn't just pass the dentry itself
         to dentry_cmp(), we had to pass the counted string that contained the
         name.
      
      So that sequence number decision caused two separate ugly calling
      conventions.
      
      Both of these problems would be solved if we just did the sequence
      number check in the caller instead.  There's only one caller, and that
      caller already has to do the sequence number check for the parent
      anyway, so just do that.
      
      That allows us to stop returning the dentry->d_inode in that in-out
      argument (pointer-to-pointer-to-inode), so we can make the inode
      argument just a regular input inode pointer.  The caller can just load
      the inode from dentry->d_inode, and then do the sequence number check
      after that to make sure that it's synchronized with the name we looked
      up.
      
      And it allows us to just pass in the dentry to dentry_cmp(), which is
      what all the callers really wanted.  Sure, dentry_cmp() has to be a bit
      careful about the dentry (which is not stable during RCU lookup), but
      that's actually very simple.
      
      And now that dentry_cmp() can clearly see that the first string argument
      is a dentry, we can use the direct word access for that, instead of the
      careful unaligned zero-padding.  The dentry name is always properly
      aligned, since it is a single path component that is either embedded
      into the dentry itself, or was allocated with kmalloc() (see __d_alloc).
      
      Finally, this also uninlines the nasty slow-case for dentry comparisons:
      that one *does* need to do a sequence number check, since it will call
      in to the low-level filesystems, and we want to give those a stable
      inode pointer and path component length/start arguments.  Doing an extra
      sequence check for that slow case is not a problem, though.
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      12f8ad4b
  25. 04 5月, 2012 1 次提交
    • L
      vfs: make word-at-a-time accesses handle a non-existing page · e419b4cc
      Linus Torvalds 提交于
      It turns out that there are more cases than CONFIG_DEBUG_PAGEALLOC that
      can have holes in the kernel address space: it seems to happen easily
      with Xen, and it looks like the AMD gart64 code will also punch holes
      dynamically.
      
      Actually hitting that case is still very unlikely, so just do the
      access, and take an exception and fix it up for the very unlikely case
      of it being a page-crosser with no next page.
      
      And hey, this abstraction might even help other architectures that have
      other issues with unaligned word accesses than the possible missing next
      page.  IOW, this could do the byte order magic too.
      
      Peter Anvin fixed a thinko in the shifting for the exception case.
      Reported-and-tested-by: NJana Saout <jana@saout.de>
      Cc:  Peter Anvin <hpa@zytor.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e419b4cc