1. 16 Sep, 2014 1 commit
    • vfs: simplify and shrink stack frame of link_path_walk() · d6bb3e90
      Linus Torvalds committed
      Commit 9226b5b4 ("vfs: avoid non-forwarding large load after small
      store in path lookup") made link_path_walk() always access the
      "hash_len" field as a single 64-bit entity, in order to avoid mixed size
      accesses to the members.
      
      However, what I didn't notice was that that effectively means that the
      whole "struct qstr this" is now basically redundant.  We already
      explicitly track the "const char *name", and if we just use "u64
      hash_len" instead of "long len", there is nothing else left of the
      "struct qstr".
      
      We do end up wanting the "struct qstr" if we have a filesystem with a
      "d_hash()" function, but that's a rare case, and we might as well then
      just squirrel away the name and hash_len at that point.
      
      End result: fewer live variables in the loop, a smaller stack frame, and
      better code generation.  And we don't need to pass in pointer variables
      to helper functions any more, because the return value contains all the
      relevant information.  So this removes more lines than it adds, and the
      source code is clearer too.
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d6bb3e90
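The idea of returning the hash and the length together in one 64-bit value, instead of filling them in through pointer arguments, can be sketched in userspace C as follows. This is only an illustration under stated assumptions: `pack_hash_len()`, `unpack_hash()`, `unpack_len()` and `hash_component()` are hypothetical names, and the multiply-by-31 hash is a toy stand-in rather than the kernel's hashing code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers: length in the upper 32 bits, hash in the lower 32. */
static uint64_t pack_hash_len(uint32_t hash, uint32_t len)
{
    return ((uint64_t)len << 32) | hash;
}

static uint32_t unpack_hash(uint64_t hash_len) { return (uint32_t)hash_len; }
static uint32_t unpack_len(uint64_t hash_len)  { return (uint32_t)(hash_len >> 32); }

/*
 * Hash one path component (everything up to '/' or the terminating NUL)
 * and return both results in a single value.  The hash itself is a toy
 * function chosen for illustration only.
 */
static uint64_t hash_component(const char *name)
{
    uint32_t hash = 0, len = 0;

    assert(name != NULL);
    while (name[len] != '\0' && name[len] != '/') {
        hash = hash * 31 + (unsigned char)name[len];
        len++;
    }
    return pack_hash_len(hash, len);
}
```

A caller then needs only the name pointer and one `uint64_t` local, which is the "fewer live variables" effect the commit message describes.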
  2. 15 Sep, 2014 3 commits
    • vfs: avoid non-forwarding large load after small store in path lookup · 9226b5b4
      Linus Torvalds committed
      The performance regression that Josef Bacik reported in the pathname
      lookup (see commit 99d263d4 "vfs: fix bad hashing of dentries") made
      me look at performance stability of the dcache code, just to verify that
      the problem was actually fixed.  That turned up a few other problems in
      this area.
      
      There are a few cases where we exit RCU lookup mode and go to the slow
      serializing case when we shouldn't; Al has fixed those and they'll come
      in with the next VFS pull.
      
      But my performance verification also shows that link_path_walk() turns
      out to have a very unfortunate 32-bit store of the length and hash of
      the name we look up, followed by a 64-bit read of the combined hash_len
      field.  That screws up the processor store to load forwarding, causing
      an unnecessary hiccup in this critical routine.
      
      It's caused by the ugly calling convention for the "hash_name()"
      function, and easily fixed by just making hash_name() fill in the whole
      'struct qstr' rather than passing it a pointer to just the hash value.
      
      With that, the profile for this function looks much smoother.
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9226b5b4
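The access pattern described above can be modeled with a small userspace sketch. It assumes a descriptor whose 32-bit hash and length alias a single 64-bit word; `struct name_desc`, `fill_split()` and `fill_whole()` are hypothetical names, and the field order inside the union assumes a little-endian machine.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace model: two 32-bit fields that alias one 64-bit word. */
struct name_desc {
    union {
        struct {
            uint32_t hash;
            uint32_t len;
        };
        uint64_t hash_len;
    };
    const char *name;
};

static_assert(sizeof(uint64_t) == 2 * sizeof(uint32_t),
              "the two halves must exactly cover the 64-bit word");

/*
 * Pattern to avoid: two narrow stores followed by one wide load of the
 * same memory.  The load cannot be satisfied from a single pending
 * store, so store-to-load forwarding may fail on many CPUs.
 */
static uint64_t fill_split(struct name_desc *d, uint32_t hash, uint32_t len)
{
    d->hash = hash;
    d->len = len;
    return d->hash_len;
}

/* Preferred pattern: write the whole word once and read it at the same size. */
static uint64_t fill_whole(struct name_desc *d, uint32_t hash, uint32_t len)
{
    d->hash_len = ((uint64_t)len << 32) | hash;
    return d->hash_len;
}
```

Both functions leave the same bytes in memory on a little-endian machine; the difference is only in how the CPU can forward the stored data to the following load.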
    • be careful with nd->inode in path_init() and follow_dotdot_rcu() · 4023bfc9
      Al Viro committed
      in the former we simply check if dentry is still valid after picking
      its ->d_inode; in the latter we fetch ->d_inode in the same places
      where we fetch dentry and its ->d_seq, under the same checks.
      
      Cc: stable@vger.kernel.org # 2.6.38+
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      4023bfc9
    • don't bugger nd->seq on set_root_rcu() from follow_dotdot_rcu() · 7bd88377
      Al Viro committed
      return the value instead, and have path_init() do the assignment.  Broken by
      "vfs: Fix absolute RCU path walk failures due to uninitialized seq number",
      which was Cc-stable with 2.6.38+ as destination.  This one should go where
      it went.
      
      To avoid dummy value returned in case when root is already set (it would do
      no harm, actually, since the only caller that doesn't ignore the return value
      is guaranteed to have nd->root *not* set, but it's more obvious that way),
      lift the check into callers.  And do the same to set_root(), to keep them
      in sync.
      
      Cc: stable@vger.kernel.org # 2.6.38+
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      7bd88377
  3. 14 Sep, 2014 2 commits
    • fix bogus read_seqretry() checks introduced in b37199e6 · f5be3e29
      Al Viro committed
      read_seqretry() returns true on mismatch, not on match...
      
      Cc: stable@vger.kernel.org # 3.15+
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      f5be3e29
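The sense of the return value is easy to get backwards, so here is a deliberately simplified, single-threaded model of a sequence counter. It is a sketch under stated assumptions, not the kernel's seqlock code: all names ending in `_model` are hypothetical, and the memory barriers and writer-in-progress handling of the real primitives are omitted.

```c
#include <assert.h>
#include <stdbool.h>

/* Single-threaded model of a sequence counter (no barriers, no spinning). */
struct seq_model {
    unsigned sequence;
};

/* Reader side: remember the counter value before reading protected data. */
static unsigned read_seqbegin_model(const struct seq_model *s)
{
    assert((s->sequence & 1u) == 0); /* this model never has a writer mid-update */
    return s->sequence;
}

/* Returns true when the counter moved, i.e. the data may be stale: retry. */
static bool read_seqretry_model(const struct seq_model *s, unsigned start)
{
    return s->sequence != start;
}

/* Writer side: one complete update bumps the counter by two. */
static void write_update_model(struct seq_model *s)
{
    s->sequence += 2;
}

/* Correct use of the result: the snapshot is good only if no retry is needed. */
static bool snapshot_is_valid(const struct seq_model *s, unsigned start)
{
    return !read_seqretry_model(s, start);
}
```

A check written as "if retry returns true, the data is fine" inverts this logic, which is the kind of mistake the commit above corrects.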
    • vfs: fix bad hashing of dentries · 99d263d4
      Linus Torvalds committed
      Josef Bacik found a performance regression between 3.2 and 3.10 and
      narrowed it down to commit bfcfaa77 ("vfs: use 'unsigned long'
      accesses for dcache name comparison and hashing"). He reports:
      
       "The test case is essentially
      
            for (i = 0; i < 1000000; i++)
                    mkdir("a$i");
      
        On xfs on a fio card this goes at about 20k dir/sec with 3.2, and 12k
        dir/sec with 3.10.  This is because we spend waaaaay more time in
        __d_lookup on 3.10 than in 3.2.
      
        The new hashing function for strings is suboptimal for <
        sizeof(unsigned long) string names (and hell even > sizeof(unsigned
        long) string names that I've tested).  I broke out the old hashing
        function and the new one into a userspace helper to get real numbers
        and this is what I'm getting:
      
            Old hash table had 1000000 entries, 0 dupes, 0 max dupes
            New hash table had 12628 entries, 987372 dupes, 900 max dupes
            We had 11400 buckets with a p50 of 30 dupes, p90 of 240 dupes, p99 of 567 dupes for the new hash
      
        My test does the hash, and then does the d_hash into a integer pointer
        array the same size as the dentry hash table on my system, and then
        just increments the value at the address we got to see how many
        entries we overlap with.
      
        As you can see the old hash function ended up with all 1 million
        entries in their own bucket, whereas the new one they are only
        distributed among ~12.5k buckets, which is why we're using so much
        more CPU in __d_lookup".
      
      The reason for this hash regression is two-fold:
      
       - On 64-bit architectures the down-mixing of the original 64-bit
         word-at-a-time hash into the final 32-bit hash value is very
         simplistic and suboptimal, and just adds the two 32-bit parts
         together.
      
         In particular, because there is no bit shuffling and the mixing
         boundary is also a byte boundary, similar character patterns in the
         low and high word easily end up just canceling each other out.
      
       - the old byte-at-a-time hash mixed each byte into the final hash as it
         hashed the path component name, resulting in the low bits of the hash
         generally being a good source of hash data.  That is not true for the
         word-at-a-time case, and the hash data is distributed among all the
         bits.
      
      The fix is the same in both cases: do a better job of mixing the bits up
      and using as much of the hash data as possible.  We already have the
      "hash_32|64()" functions to do that.
      Reported-by: Josef Bacik <jbacik@fb.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Chris Mason <clm@fb.com>
      Cc: linux-fsdevel@vger.kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      99d263d4
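The weakness of folding a 64-bit hash by adding its halves can be seen in a small sketch. This is an illustration only: `fold_add()` models the simplistic down-mix described above, `fold_mul()` is a generic multiplicative mix, and `MIX_CONST` is an arbitrary odd constant picked for the example, not the constant the kernel's hash_64() uses.

```c
#include <assert.h>
#include <stdint.h>

/* Simplistic down-mix: add the low and high 32-bit halves together. */
static uint32_t fold_add(uint64_t x)
{
    return (uint32_t)x + (uint32_t)(x >> 32);
}

/* Illustrative odd multiplier; an assumption made for this sketch. */
#define MIX_CONST 0x9E3779B97F4A7C15ULL

/* Multiplicative mix: every input bit can influence the kept top 32 bits. */
static uint32_t fold_mul(uint64_t x)
{
    return (uint32_t)((x * MIX_CONST) >> 32);
}

/* Two different words whose 32-bit halves are simply swapped. */
static void fold_example(void)
{
    uint64_t a = 0x0000000100000002ULL;
    uint64_t b = 0x0000000200000001ULL;

    assert(fold_add(a) == fold_add(b)); /* the additive fold collides */
    assert(fold_mul(a) != fold_mul(b)); /* the multiplicative mix does not */
}
```

With the additive fold, any pattern that moves between the low and high word without changing the sum lands in the same bucket, which is consistent with the heavy duplication in the numbers reported above.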
  4. 08 Aug, 2014 3 commits
  5. 24 Jul, 2014 1 commit
  6. 11 Jun, 2014 1 commit
    • fs,userns: Change inode_capable to capable_wrt_inode_uidgid · 23adbe12
      Andy Lutomirski committed
      The kernel has no concept of capabilities with respect to inodes; inodes
      exist independently of namespaces.  For example, inode_capable(inode,
      CAP_LINUX_IMMUTABLE) would be nonsense.
      
      This patch changes inode_capable to check for uid and gid mappings and
      renames it to capable_wrt_inode_uidgid, which should make it more
      obvious what it does.
      
      Fixes CVE-2014-4014.
      
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Serge Hallyn <serge.hallyn@ubuntu.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Andy Lutomirski <luto@amacapital.net>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      23adbe12
  7. 20 Apr, 2014 1 commit
    • fix races between __d_instantiate() and checks of dentry flags · 22213318
      Al Viro committed
      in non-lazy walk we need to be careful about dentry switching from
      negative to positive - both ->d_flags and ->d_inode are updated,
      and in some places we might see only one store.  The cases where
      dentry has been obtained by dcache lookup with ->i_mutex held on
      parent are safe - ->d_lock and ->i_mutex provide all the barriers
      we need.  However, there are several places where we run into
      trouble:
      	* do_last() fetches ->d_inode, then checks ->d_flags and
      assumes that inode won't be NULL unless d_is_negative() is true.
      Race with e.g. creat() - we might have fetched the old value of
      ->d_inode (still NULL) and new value of ->d_flags (already not
      DCACHE_MISS_TYPE).  Lin Ming has observed and reported the resulting
      oops.
      	* a bunch of places checks ->d_inode for being non-NULL,
      then checks ->d_flags for "is it a symlink".  Race with symlink(2)
      in case if our CPU sees ->d_inode update first - we see non-NULL
      there, but ->d_flags still contains DCACHE_MISS_TYPE instead of
      DCACHE_SYMLINK_TYPE.  Result: false negative on "should we follow
      link here?", with subsequent unpleasantness.
      
      Cc: stable@vger.kernel.org # 3.13 and 3.14 need that one
      Reported-and-tested-by: Lin Ming <minggr@gmail.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      22213318
  8. 02 Apr, 2014 3 commits
  9. 01 Apr, 2014 7 commits
  10. 31 Mar, 2014 1 commit
  11. 23 Mar, 2014 1 commit
    • rcuwalk: recheck mount_lock after mountpoint crossing attempts · b37199e6
      Al Viro committed
      We can get false negative from __lookup_mnt() if an unrelated vfsmount
      gets moved.  In that case legitimize_mnt() is guaranteed to fail,
      and we will fall back to non-RCU walk... unless we end up running
      into a hard error on a filesystem object we wouldn't have reached
      if not for that false negative.  IOW, delaying that check until
      the end of pathname resolution is wrong - we should recheck right
      after we attempt to cross the mountpoint.  We don't need to recheck
      unless we see d_mountpoint() being true - in that case even if
      we have just raced with mount/umount, we can simply go on as if
      we'd come at the moment when the sucker wasn't a mountpoint; if we
      run into a hard error as the result, it was a legitimate outcome.
      __lookup_mnt() returning NULL is different in that respect, since
      it might've happened due to operation on completely unrelated
      mountpoint.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      b37199e6
  12. 10 Mar, 2014 1 commit
    • vfs: atomic f_pos accesses as per POSIX · 9c225f26
      Linus Torvalds committed
      Our write() system call has always been atomic in the sense that you get
      the expected thread-safe contiguous write, but we haven't actually
      guaranteed that concurrent writes are serialized wrt f_pos accesses, so
      threads (or processes) that share a file descriptor and use "write()"
      concurrently would quite likely overwrite each other's data.
      
      This violates POSIX.1-2008/SUSv4 Section XSI 2.9.7 that says:
      
       "2.9.7 Thread Interactions with Regular File Operations
      
        All of the following functions shall be atomic with respect to each
        other in the effects specified in POSIX.1-2008 when they operate on
        regular files or symbolic links: [...]"
      
      and one of the effects is the file position update.
      
      This unprotected file position behavior is not new behavior, and nobody
      has ever cared.  Until now.  Yongzhi Pan reported unexpected behavior to
      Michael Kerrisk that was due to this.
      
      This resolves the issue with a f_pos-specific lock that is taken by
      read/write/lseek on file descriptors that may be shared across threads
      or processes.
      Reported-by: Yongzhi Pan <panyongzhi@gmail.com>
      Reported-by: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      9c225f26
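A userspace analogy of a position-specific lock might look like the sketch below. It assumes a pthread mutex and a hypothetical `struct file_model`; unlike the commit described above, which takes the lock only for descriptors that may be shared, this model takes it unconditionally to stay short.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical model of an open file with a lock guarding only its position. */
struct file_model {
    pthread_mutex_t pos_lock;
    long pos;
};

static void file_model_init(struct file_model *f)
{
    int err = pthread_mutex_init(&f->pos_lock, NULL);

    assert(err == 0);
    (void)err; /* keep the variable used when assertions are compiled out */
    f->pos = 0;
}

/*
 * Model of write(): reading the position and advancing it happen under
 * the lock, so two concurrent writers can never observe the same start
 * offset.  Returns the offset at which this write landed.
 */
static long write_model(struct file_model *f, size_t count)
{
    long start;

    pthread_mutex_lock(&f->pos_lock);
    start = f->pos;
    f->pos = start + (long)count;
    pthread_mutex_unlock(&f->pos_lock);
    return start;
}
```

Without the lock, two threads could both read the same `pos` before either stores the new value, and their data would land at the same offset.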
  13. 06 Feb, 2014 1 commit
    • execve: use 'struct filename *' for executable name passing · c4ad8f98
      Linus Torvalds committed
      This changes 'do_execve()' to get the executable name as a 'struct
      filename', and to free it when it is done.  This is what the normal
      users want, and it simplifies and streamlines their error handling.
      
      The controlled lifetime of the executable name also fixes a
      use-after-free problem with the trace_sched_process_exec tracepoint: the
      lifetime of the passed-in string for kernel users was not at all
      obvious, and the user-mode helper code used UMH_WAIT_EXEC to serialize
      the pathname allocation lifetime with the execve() having finished,
      which in turn meant that the trace point that happened after
      mm_release() of the old process VM ended up using already free'd memory.
      
      To solve the kernel string lifetime issue, this simply introduces
      "getname_kernel()" that works like the normal user-space getname()
      function, except with the source coming from kernel memory.
      
      As Oleg points out, this also means that we could drop the tcomm[] array
      from 'struct linux_binprm', since the pathname lifetime now covers
      setup_new_exec().  That would be a separate cleanup.
      Reported-by: Igor Zhbanov <i.zhbanov@samsung.com>
      Tested-by: Steven Rostedt <rostedt@goodmis.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c4ad8f98
  14. 01 Feb, 2014 2 commits
  15. 26 Jan, 2014 1 commit
  16. 13 Dec, 2013 1 commit
  17. 29 Nov, 2013 1 commit
  18. 09 Nov, 2013 9 commits