1. 24 5月, 2012 1 次提交
  2. 22 5月, 2012 2 次提交
    • L
      Revert "vfs: remove unnecessary d_unhashed() check from __d_lookup_rcu" · 2e321806
      Linus Torvalds 提交于
      This reverts commit 8c01a529.
      
      It turns out the d_unhashed() check isn't unnecessary after all: while
      it's true that unhashing will increment the sequence numbers, that does
      not necessarily invalidate the RCU lookup, because it might have seen
      the dentry pointer (before it got unhashed), but by the time it loaded
      the sequence number, it could have seen the *new* sequence number (after
      it got unhashed).
      
      End result: we might look up an unhashed dentry that is about to be
      freed, with the sequence number never indicating anything bad about it.
      So checking that the dentry is still hashed (*after* reading the sequence
      number) is indeed the proper fix, and was never unnecessary.
      Reported-by: NDave Jones <davej@redhat.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2e321806
    • L
      vfs: be even more careful about dentry RCU name lookups · 6326c71f
      Linus Torvalds 提交于
      Miklos Szeredi points out that we need to also worry about memory
      odering when doing the dentry name comparison asynchronously with RCU.
      
      In particular, doing a rename can do a memcpy() of one dentry name over
      another, and we want to make sure that any unlocked reader will always
      see the proper terminating NUL character, so that it won't ever run off
      the allocation.
      
      Rather than having to be extra careful with the name copy or at lookup
      time for each character, this resolves the issue by making sure that all
      names that are inlined in the dentry always have a NUL character at the
      end of the name allocation.  If we do that at dentry allocation time, we
      know that no future name copy will ever change that final NUL to
      anything else, so there are no memory ordering issues.
      
      So even if a concurrent rename ends up overwriting the NUL character
      that terminates the original name, we always know that there is one
      final NUL at the end, and there is no worry about the lockless RCU
      lookup traversing the name too far.
      
      The out-of-line allocations are never copied over, so we can just make
      sure that we write the name (with terminating NULL) and do a write
      barrier before we expose the name to anything else by setting it in the
      dentry.
      Reported-by: NMiklos Szeredi <mszeredi@suse.cz>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Nick Piggin <npiggin@gmail.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6326c71f
  3. 11 5月, 2012 4 次提交
  4. 05 5月, 2012 1 次提交
    • L
      vfs: clean up __d_lookup_rcu() and dentry_cmp() interfaces · 12f8ad4b
      Linus Torvalds 提交于
      The calling conventions for __d_lookup_rcu() and dentry_cmp() are
      annoying in different ways, and there is actually one single underlying
      reason for both of the annoyances.
      
      The fundamental reason is that we do the returned dentry sequence number
      check inside __d_lookup_rcu() instead of doing it in the caller.  This
      results in two annoyances:
      
       - __d_lookup_rcu() now not only needs to return the dentry and the
         sequence number that goes along with the lookup, it also needs to
         return the inode pointer that was validated by that sequence number
         check.
      
       - and because we did the sequence number check early (to validate the
         name pointer and length) we also couldn't just pass the dentry itself
         to dentry_cmp(), we had to pass the counted string that contained the
         name.
      
      So that sequence number decision caused two separate ugly calling
      conventions.
      
      Both of these problems would be solved if we just did the sequence
      number check in the caller instead.  There's only one caller, and that
      caller already has to do the sequence number check for the parent
      anyway, so just do that.
      
      That allows us to stop returning the dentry->d_inode in that in-out
      argument (pointer-to-pointer-to-inode), so we can make the inode
      argument just a regular input inode pointer.  The caller can just load
      the inode from dentry->d_inode, and then do the sequence number check
      after that to make sure that it's synchronized with the name we looked
      up.
      
      And it allows us to just pass in the dentry to dentry_cmp(), which is
      what all the callers really wanted.  Sure, dentry_cmp() has to be a bit
      careful about the dentry (which is not stable during RCU lookup), but
      that's actually very simple.
      
      And now that dentry_cmp() can clearly see that the first string argument
      is a dentry, we can use the direct word access for that, instead of the
      careful unaligned zero-padding.  The dentry name is always properly
      aligned, since it is a single path component that is either embedded
      into the dentry itself, or was allocated with kmalloc() (see __d_alloc).
      
      Finally, this also uninlines the nasty slow-case for dentry comparisons:
      that one *does* need to do a sequence number check, since it will call
      in to the low-level filesystems, and we want to give those a stable
      inode pointer and path component length/start arguments.  Doing an extra
      sequence check for that slow case is not a problem, though.
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      12f8ad4b
  5. 04 5月, 2012 1 次提交
    • L
      vfs: make word-at-a-time accesses handle a non-existing page · e419b4cc
      Linus Torvalds 提交于
      It turns out that there are more cases than CONFIG_DEBUG_PAGEALLOC that
      can have holes in the kernel address space: it seems to happen easily
      with Xen, and it looks like the AMD gart64 code will also punch holes
      dynamically.
      
      Actually hitting that case is still very unlikely, so just do the
      access, and take an exception and fix it up for the very unlikely case
      of it being a page-crosser with no next page.
      
      And hey, this abstraction might even help other architectures that have
      other issues with unaligned word accesses than the possible missing next
      page.  IOW, this could do the byte order magic too.
      
      Peter Anvin fixed a thinko in the shifting for the exception case.
      Reported-and-tested-by: NJana Saout <jana@saout.de>
      Cc:  Peter Anvin <hpa@zytor.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e419b4cc
  6. 29 3月, 2012 1 次提交
    • M
      vfs: fix d_ancestor() case in d_materialize_unique · b18dafc8
      Michel Lespinasse 提交于
      In d_materialise_unique() there are 3 subcases to the 'aliased dentry'
      case; in two subcases the inode i_lock is properly released but this
      does not occur in the -ELOOP subcase.
      
      This seems to have been introduced by commit 18367501 ("fix loop
      checks in d_materialise_unique()").
      Signed-off-by: NMichel Lespinasse <walken@google.com>
      Cc: stable@vger.kernel.org # v3.0+
      [ Added a comment, and moved the unlock to where we generate the -ELOOP,
        which seems to be more natural.
      
        You probably can't actually trigger this without a buggy network file
        server - d_materialize_unique() is for finding aliases on non-local
        filesystems, and the d_ancestor() case is for a hardlinked directory
        loop.
      
        But we should be robust in the case of such buggy servers anyway. ]
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b18dafc8
  7. 23 3月, 2012 1 次提交
  8. 21 3月, 2012 1 次提交
  9. 20 3月, 2012 1 次提交
    • L
      vfs: get rid of batshit-insane pointless dentry hash calculations · 6d7d1a0d
      Linus Torvalds 提交于
      For some odd historical reason, the final mixing round for the dentry
      cache hash table lookup had an insane "xor with big constant" logic.  In
      two places.
      
      The big constant that is being xor'ed is GOLDEN_RATIO_PRIME, which is a
      fairly random-looking number that is designed to be *multiplied* with so
      that the bits get spread out over a whole long-word.
      
      But xor'ing with it is insane.  It doesn't really even change the hash -
      it really only shifts the hash around in the hash table.  To make
      matters worse, the insane big constant is different on 32-bit and 64-bit
      builds, even though the name hash bits we use are always 32-bit (and the
      bits from the pointer we mix in effectively are too).
      
      It's all total voodoo programming, in other words.
      
      Now, some testing and analysis of the hash chains shows that the rest of
      the hash function seems to be fairly good.  It does pick the right bits
      of the parent dentry pointer, for example, and while it's generally a
      bad idea to use an xor to mix down the upper bits (because if there is a
      repeating pattern, the xor can cause "destructive interference"), it
      seems to not have been a disaster.
      
      For example, replacing the hash with the normal "hash_long()" code (that
      uses the GOLDEN_RATIO_PRIME constant correctly, btw) actually just makes
      the hash worse.  The hand-picked hash knew which bits of the pointer had
      the highest entropy, and hash_long() ends up mixing bits less optimally
      at least in some trivial tests.
      
      So the hash function overall seems fine, it just has that really odd
      "shift result around by a constant xor".
      
      So get rid of the silly xor, and replace the down-mixing of the bits
      with an add instead of an xor that tends to not have the same kind of
      destructive interference issues.  Some stats on the resulting hash
      chains shows that they look statistically identical before and after,
      but the code is simpler and no longer makes you go "WTF?".
      
      Also, the incoming hash really is just "unsigned int", not a long, and
      there's no real point to worry about the high 26 bits of the dentry
      pointer for the 64-bit case, because they are all going to be identical
      anyway.
      
      So also change the hashing to be done in the more natural 'unsigned int'
      that is the real size of the actual hashed data anyway.
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6d7d1a0d
  10. 09 3月, 2012 1 次提交
  11. 05 3月, 2012 1 次提交
    • L
      vfs: move dentry_cmp from <linux/dcache.h> to fs/dcache.c · 5483f18e
      Linus Torvalds 提交于
      It's only used inside fs/dcache.c, and we're going to play games with it
      for the word-at-a-time patches.  This time we really don't even want to
      export it, because it really is an internal function to fs/dcache.c, and
      has been since it was introduced.
      
      Having it in that extremely hot header file (it's included in pretty
      much everything, thanks to <linux/fs.h>) is a disaster for testing
      different versions, and is utterly pointless.
      
      We really should have some kind of header file diet thing, where we
      figure out which parts of header files are really better off private and
      only result in more expensive compiles.
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5483f18e
  12. 03 3月, 2012 1 次提交
  13. 29 2月, 2012 1 次提交
  14. 14 2月, 2012 1 次提交
  15. 13 1月, 2012 1 次提交
  16. 11 1月, 2012 1 次提交
    • M
      fix shrink_dcache_parent() livelock · eaf5f907
      Miklos Szeredi 提交于
      Two (or more) concurrent calls of shrink_dcache_parent() on the same dentry may
      cause shrink_dcache_parent() to loop forever.
      
      Here's what appears to happen:
      
      1 - CPU0: select_parent(P) finds C and puts it on dispose list, returns 1
      
      2 - CPU1: select_parent(P) locks P->d_lock
      
      3 - CPU0: shrink_dentry_list() locks C->d_lock
         dentry_kill(C) tries to lock P->d_lock but fails, unlocks C->d_lock
      
      4 - CPU1: select_parent(P) locks C->d_lock,
               moves C from dispose list being processed on CPU0 to the new
      dispose list, returns 1
      
      5 - CPU0: shrink_dentry_list() finds dispose list empty, returns
      
      6 - Goto 2 with CPU0 and CPU1 switched
      
      Basically select_parent() steals the dentry from shrink_dentry_list() and thinks
      it found a new one, causing shrink_dentry_list() to think it's making progress
      and loop over and over.
      
      One way to trigger this is to make udev calls stat() on the sysfs file while it
      is going away.
      
      Having a file in /lib/udev/rules.d/ with only this one rule seems to the trick:
      
      ATTR{vendor}=="0x8086", ATTR{device}=="0x10ca", ENV{PCI_SLOT_NAME}="%k", ENV{MATCHADDR}="$attr{address}", RUN+="/bin/true"
      
      Then execute the following loop:
      
      while true; do
              echo -bond0 > /sys/class/net/bonding_masters
              echo +bond0 > /sys/class/net/bonding_masters
              echo -bond1 > /sys/class/net/bonding_masters
              echo +bond1 > /sys/class/net/bonding_masters
      done
      
      One fix would be to check all callers and prevent concurrent calls to
      shrink_dcache_parent().  But I think a better solution is to stop the
      stealing behavior.
      
      This patch adds a new dentry flag that is set when the dentry is added to the
      dispose list.  The flag is cleared in dentry_lru_del() in case the dentry gets a
      new reference just before being pruned.
      
      If the dentry has this flag, select_parent() will skip it and let
      shrink_dentry_list() retry pruning it.  With select_parent() skipping those
      dentries there will not be the appearance of progress (new dentries found) when
      there is none, hence shrink_dcache_parent() will not loop forever.
      
      Set the flag is also set in prune_dcache_sb() for consistency as suggested by
      Linus.
      Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz>
      CC: stable@vger.kernel.org
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      eaf5f907
  17. 10 1月, 2012 2 次提交
    • A
      vfs: new helper - d_make_root() · adc0e91a
      Al Viro 提交于
      d_alloc_root() with iput() in case of allocation failure...
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      adc0e91a
    • D
      dcache: use a dispose list in select_parent · b48f03b3
      Dave Chinner 提交于
      select_parent currently abuses the dentry cache LRU to provide
      cleanup features for child dentries that need to be freed. It moves
      them to the tail of the LRU, then tells shrink_dcache_parent() to
      calls __shrink_dcache_sb to unconditionally move them to a dispose
      list (as DCACHE_REFERENCED is ignored). __shrink_dcache_sb() has to
      relock the dentries to move them off the LRU onto the dispose list,
      but otherwise does not touch the dentries that select_parent() moved
      to the tail of the LRU. It then passses the dispose list to
      shrink_dentry_list() which tries to free the dentries.
      
      IOWs, the use of __shrink_dcache_sb() is superfluous - we can build
      exactly the same list of dentries for disposal directly in
      select_parent() and call shrink_dentry_list() instead of calling
      __shrink_dcache_sb() to do that. This means that we avoid long holds
      on the lru lock walking the LRU moving dentries to the dispose list
      We also avoid the need to relock each dentry just to move it off the
      LRU, reducing the numebr of times we lock each dentry to dispose of
      them in shrink_dcache_parent() from 3 to 2 times.
      
      Further, we remove one of the two callers of __shrink_dcache_sb().
      This also means that __shrink_dcache_sb can be moved into back into
      prune_dcache_sb() and we no longer have to handle referenced
      dentries conditionally, simplifying the code.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      b48f03b3
  18. 04 1月, 2012 7 次提交
  19. 07 12月, 2011 1 次提交
    • A
      fix apparmor dereferencing potentially freed dentry, sanitize __d_path() API · 02125a82
      Al Viro 提交于
      __d_path() API is asking for trouble and in case of apparmor d_namespace_path()
      getting just that.  The root cause is that when __d_path() misses the root
      it had been told to look for, it stores the location of the most remote ancestor
      in *root.  Without grabbing references.  Sure, at the moment of call it had
      been pinned down by what we have in *path.  And if we raced with umount -l, we
      could have very well stopped at vfsmount/dentry that got freed as soon as
      prepend_path() dropped vfsmount_lock.
      
      It is safe to compare these pointers with pre-existing (and known to be still
      alive) vfsmount and dentry, as long as all we are asking is "is it the same
      address?".  Dereferencing is not safe and apparmor ended up stepping into
      that.  d_namespace_path() really wants to examine the place where we stopped,
      even if it's not connected to our namespace.  As the result, it looked
      at ->d_sb->s_magic of a dentry that might've been already freed by that point.
      All other callers had been careful enough to avoid that, but it's really
      a bad interface - it invites that kind of trouble.
      
      The fix is fairly straightforward, even though it's bigger than I'd like:
      	* prepend_path() root argument becomes const.
      	* __d_path() is never called with NULL/NULL root.  It was a kludge
      to start with.  Instead, we have an explicit function - d_absolute_root().
      Same as __d_path(), except that it doesn't get root passed and stops where
      it stops.  apparmor and tomoyo are using it.
      	* __d_path() returns NULL on path outside of root.  The main
      caller is show_mountinfo() and that's precisely what we pass root for - to
      skip those outside chroot jail.  Those who don't want that can (and do)
      use d_path().
      	* __d_path() root argument becomes const.  Everyone agrees, I hope.
      	* apparmor does *NOT* try to use __d_path() or any of its variants
      when it sees that path->mnt is an internal vfsmount.  In that case it's
      definitely not mounted anywhere and dentry_path() is exactly what we want
      there.  Handling of sysctl()-triggered weirdness is moved to that place.
      	* if apparmor is asked to do pathname relative to chroot jail
      and __d_path() tells it we it's not in that jail, the sucker just calls
      d_absolute_path() instead.  That's the other remaining caller of __d_path(),
      BTW.
              * seq_path_root() does _NOT_ return -ENAMETOOLONG (it's stupid anyway -
      the normal seq_file logics will take care of growing the buffer and redoing
      the call of ->show() just fine).  However, if it gets path not reachable
      from root, it returns SEQ_SKIP.  The only caller adjusted (i.e. stopped
      ignoring the return value as it used to do).
      Reviewed-by: NJohn Johansen <john.johansen@canonical.com>
      ACKed-by: NJohn Johansen <john.johansen@canonical.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      Cc: stable@vger.kernel.org
      02125a82
  20. 21 11月, 2011 1 次提交
    • D
      VFS: Log the fact that we've given ELOOP rather than creating a loop · dd179946
      David Howells 提交于
      To prevent an NFS server from being used to create a directory loop in an NFS
      superblock on the client, the following patch was committed:
      
      	commit 18367501
      	Author: Al Viro <viro@zeniv.linux.org.uk>
      	Date:   Tue Jul 12 21:42:24 2011 -0400
      	Subject: fix loop checks in d_materialise_unique()
      
      This causes ELOOP to be reported to anyone trying to access the dentry that
      would otherwise cause the kernel to complete the loop.
      
      However, no indication is given to the caller as to why an operation that ought
      to work doesn't.  The fault is with the kernel, which doesn't want to try and
      solve the problem as it gets horrendously messy if there's another mountpoint
      somewhere in the trees being spliced that can't be moved[*].
      
      [*] The real problem is that we don't handle the excision of a subtree that
      gets moved _out_ of what we can see.  This can happen on the server where a
      directory is merely moved between two other dirs on the same filesystem, but
      where destination dir is not accessible by the client.
      
      So, given the choice to return ELOOP rather than trying to reconfigure the
      dentry tree, we should give the caller some indication of why they aren't being
      allowed to make what should be a legitimate request and log a message.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Acked-by: NSachin Prabhu <sprabhu@redhat.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      dd179946
  21. 08 11月, 2011 1 次提交
  22. 02 11月, 2011 1 次提交
    • S
      vfs: add d_prune dentry operation · f0023bc6
      Sage Weil 提交于
      This adds a d_prune dentry operation that is called by the VFS prior to
      pruning (i.e. unhashing and killing) a hashed dentry from the dcache.
      Wrap dentry_lru_del() and use the new _prune() helper in the cases where we
      are about to unhash and kill the dentry.
      
      This will be used by Ceph to maintain a flag indicating whether the
      complete contents of a directory are contained in the dcache, allowing it
      to satisfy lookups and readdir without addition server communication.
      
      Renumber a few DCACHE_* #defines to group DCACHE_OP_PRUNE with the other
      DCACHE_OP_ bits.
      Signed-off-by: NSage Weil <sage@newdream.net>
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      f0023bc6
  23. 07 8月, 2011 1 次提交
    • L
      vfs: renumber DCACHE_xyz flags, remove some stale ones · 830c0f0e
      Linus Torvalds 提交于
      Gcc tends to generate better code with small integers, including the
      DCACHE_xyz flag tests - so move the common ones to be first in the list.
      Also just remove the unused DCACHE_INOTIFY_PARENT_WATCHED and
      DCACHE_AUTOFS_PENDING values, their users no longer exists in the source
      tree.
      
      And add a "unlikely()" to the DCACHE_OP_COMPARE test, since we want the
      common case to be a nice straight-line fall-through.
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      830c0f0e
  24. 04 8月, 2011 1 次提交
  25. 01 8月, 2011 3 次提交
    • D
      VFS: Reorganise shrink_dcache_for_umount_subtree() after demise of dcache_lock · 43c1c9cd
      David Howells 提交于
      Reorganise shrink_dcache_for_umount_subtree() in light of the demise of
      dcache_lock.  Without that dcache_lock, there is no need for the batching of
      removal of dentries from the system under it (we wanted to make intensive use
      of the locked data whilst we held it, but didn't want to hold it for long at a
      time).
      
      This works, provided the preceding patch is correct in its removal of locking
      on dentry->d_lock on the basis that no one should be locking these dentries any
      more as the whole superblock is defunct.
      
      With this patch, the calls to dentry_lru_del() and __d_shrink() are placed at
      the point where each dentry is detached handled.
      
      It is possible that, as an alternative, the batching should still be done -
      but only for dentry_lru_del() of all a dentry's children in one go.  In such a
      case, the batching would be done under dcache_lru_lock.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      43c1c9cd
    • D
      VFS: Remove dentry->d_lock locking from shrink_dcache_for_umount_subtree() · c6627c60
      David Howells 提交于
      Locks of the dcache_lock were replaced by locks of dentry->d_lock in commits
      such as:
      
      	23044507
      	2fd6b7f5
      
      as part of the RCU-based pathwalk changes, despite the fact that the caller
      (shrink_dcache_for_umount()) notes in the banner comment the reasons that
      d_lock is not necessary in these functions:
      
      /*
       * destroy the dentries attached to a superblock on unmounting
       * - we don't need to use dentry->d_lock because:
       *   - the superblock is detached from all mountings and open files, so the
       *     dentry trees will not be rearranged by the VFS
       *   - s_umount is write-locked, so the memory pressure shrinker will ignore
       *     any dentries belonging to this superblock that it comes across
       *   - the filesystem itself is no longer permitted to rearrange the dentries
       *     in this superblock
       */
      
      So remove these locks.  If the locks are actually necessary, then this banner
      comment should be altered instead.
      
      The hash table chains are protected by 1-bit locks in the hash table heads, so
      those shouldn't be a problem.
      
      Note that to make this work, __d_drop() has to be split so that the RCUwalk
      barrier can be avoided.  This causes problems otherwise as it has an assertion
      that dentry->d_lock is locked - but there is no need for that as no one else
      can be trying to access this dentry, except to step over it (and that should
      be handled by d_free(), I think).
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      c6627c60
    • D
      VFS: Remove detached-dentry counter from shrink_dcache_for_umount_subtree() · 35f40ef0
      David Howells 提交于
      Remove the detached-dentry counter from shrink_dcache_for_umount_subtree() as
      the value it computes is no longer used as of commit
      312d3ca8 which made the nr_dentry counters
      summed per-CPU rather than global atomic.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      35f40ef0
  26. 27 7月, 2011 1 次提交
  27. 22 7月, 2011 1 次提交