1. 14 Mar 2020: 18 commits
    • pick_link(): check for WALK_TRAILING, not LOOKUP_PARENT · b1a81972
      Committed by Al Viro
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • namei: invert the meaning of WALK_FOLLOW · 8c4efe22
      Committed by Al Viro
      old flags & WALK_FOLLOW <=> new !(flags & WALK_TRAILING)
      That's what that flag had really been used for.
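      As a self-contained illustration of the inversion (the flag values here
      are made up for the example, not the real ones):

          #define WALK_FOLLOW   1   /* old flag - hypothetical value */
          #define WALK_TRAILING 1   /* new flag - hypothetical value */

          /* old call sites asked "may we follow this link?" */
          static int old_may_follow(int flags) { return flags & WALK_FOLLOW; }
          /* new call sites ask the inverted question; the two are equivalent */
          static int new_may_follow(int flags) { return !(flags & WALK_TRAILING); }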
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • sanitize handling of nd->last_type, kill LAST_BIND · b4c03536
      Committed by Al Viro
      ->last_type values are set in 3 places: path_init() (sets to LAST_ROOT),
      link_path_walk (LAST_NORM/DOT/DOTDOT) and pick_link (LAST_BIND).
      
      They are checked in walk_component(), lookup_last() and do_last().
      They also get copied to the caller by filename_parentat().  In the last
      3 cases the value is what we had at the return from link_path_walk().
      In case of walk_component() it's either directly downstream from
      assignment in link_path_walk() or, when called by lookup_last(), the
      value we have at the return from link_path_walk().
      
      The value at the entry into link_path_walk() can survive to return only
      if the pathname contains nothing but slashes.  Note that pick_link()
      never returns such - pure jumps are handled directly.  So for the calls
      of link_path_walk() for trailing symlinks it does not matter what value
      had been there at the entry; the value at the return won't depend upon it.
      
      There are 3 call chains that might have pick_link() storing LAST_BIND:
      
      1) pick_link() from step_into() from walk_component() from
      link_path_walk().  In that case we will either be parsing the next
      component immediately after return into link_path_walk(), which will
      overwrite the ->last_type before anyone has a chance to look at it,
      or we'll fail, in which case nobody will be looking at ->last_type at all.
      
      2) pick_link() from step_into() from walk_component() from lookup_last().
      The value is never looked at due to the above; it won't affect the value
      seen at return from any link_path_walk().
      
      3) pick_link() from step_into() from do_last().  Ditto.
      
      In other words, assignment in pick_link() is pointless, and so is
      LAST_BIND itself; nothing ever looks at that value.  Kill it off.
      And make link_path_walk() _always_ assign ->last_type - in the only
      case when the value at the entry might survive to the return that value
      is always LAST_ROOT, inherited from path_init().  Move that assignment
      from path_init() into the beginning of link_path_walk() to consolidate
      things.
      
      Historical note: LAST_BIND used to be used for the kludge with trailing
      pure jump symlinks (extra iteration through the top-level loop).
      No point keeping it anymore...
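      As a sketch of the visible effect, the set of ->last_type values in
      include/linux/namei.h shrinks from

          enum { LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT, LAST_BIND };

      to

          enum { LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT };

      with link_path_walk() now starting out by setting nd->last_type to
      LAST_ROOT.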
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • finally fold get_link() into pick_link() · ad6cc4c3
      Committed by Al Viro
      kill nd->link_inode, while we are at it
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • merging pick_link() with get_link(), part 6 · 06708adb
      Committed by Al Viro
      move the only remaining call of get_link() into pick_link()
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • merging pick_link() with get_link(), part 5 · b0417d2c
      Committed by Al Viro
      move get_link() call into step_into().
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • merging pick_link() with get_link(), part 4 · 92d27016
      Committed by Al Viro
      Move the call of get_link() into walk_component().  Change the
      calling conventions for walk_component() to returning the link
      body to follow (if any).
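      A sketch of the resulting convention (signature simplified):

          static const char *walk_component(struct nameidata *nd, int flags);
          /* returns ERR_PTR(-E...) on error,
           * NULL when the component has been fully handled,
           * or the body of a symlink the caller must follow */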
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • merging pick_link() with get_link(), part 3 · 40fcf5a9
      Committed by Al Viro
      After a pure jump ("/" or procfs-style symlink) we don't need to
      hold the link anymore.  link_path_walk() dropped it if such case
      had been detected, lookup_last/do_last() (i.e. old trailing_symlink())
      left it on the stack - it ended up calling terminate_walk() shortly
      anyway, which would've purged the entire stack.
      
      Do it in get_link() itself instead.  Simpler logic that way...
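      Roughly, as a hedged sketch of the new tail of get_link() (the exact
      details in fs/namei.c differ):

          if (unlikely(!*res)) {          /* pure jump - body fully consumed */
                  put_link(nd);           /* dismiss the link right here */
                  res = NULL;             /* caller sees "nothing to follow" */
          }
          return res;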
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • merging pick_link() with get_link(), part 2 · 1ccac622
      Committed by Al Viro
      Fold trailing_symlink() into lookup_last() and do_last(), change
      the calling conventions of those two.  Rules change:
      	success, we are done => NULL instead of 0
      	error	=> ERR_PTR(-E...) instead of -E...
      	got a symlink to follow => return the path to be followed instead of 1
      
      The loops calling those (in path_lookupat() and path_openat()) are adjusted accordingly.
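      For illustration, the adjusted loop in path_lookupat() ends up shaped
      roughly like this (a sketch; error propagation details elided):

          while (!(err = link_path_walk(s, nd)) &&
                 (s = lookup_last(nd)) != NULL)
                  ;       /* non-NULL s is the next trailing symlink body */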
      
      A subtle change of control flow here: originally a pure-jump trailing
      symlink ("/" or procfs one) would've passed through the upper level
      loop once more, with "" for path to traverse.  That would've brought
      us back to the lookup_last/do_last entry and we would've hit LAST_BIND
      case (LAST_BIND left from get_link() called by trailing_symlink())
      and pretty much skip to the point right after where we'd left the
      sucker back when we picked that trailing symlink.
      
      Now we don't bother with that extra pass through the upper level
      loop - if get_link() says "I've just done a pure jump, nothing
      else to do", we just treat that as non-symlink case.
      
      Boilerplate added on that step will go away shortly - it'll migrate
      into walk_component() and then to step_into(), collapsing into the
      change of calling conventions for those.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • merging pick_link() with get_link(), part 1 · 43679723
      Committed by Al Viro
      Move restoring LOOKUP_PARENT and zeroing nd->stack.name[0] past
      the call of get_link() (nothing _currently_ uses them in there).
      That allows moving the call of may_follow_link() into get_link()
      as well, since now the presence of LOOKUP_PARENT distinguishes
      the callers from each other (link_path_walk() has it, trailing_symlink()
      doesn't).
      
      Preparations for folding trailing_symlink() into callers (lookup_last()
      and do_last()) and changing the calling conventions of those.  Next
      stage after that will have get_link() call migrate into walk_component(),
      then - into step_into().  It's tricky enough to warrant doing that
      in stages, unfortunately...
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • a9dc1494
    • LOOKUP_MOUNTPOINT: fold path_mountpointat() into path_lookupat() · 161aff1d
      Committed by Al Viro
      New LOOKUP flag, telling path_lookupat() to act as path_mountpointat().
      IOW, traverse mounts at the final point and skip revalidation of the
      location where it ends up.
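      In path_lookupat() that amounts to roughly (a sketch):

          if (!err && unlikely(nd->flags & LOOKUP_MOUNTPOINT)) {
                  err = handle_lookup_down(nd);   /* cross mounts at the end */
                  nd->flags &= ~LOOKUP_JUMPED;    /* skip d_weak_revalidate() */
          }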
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • fold handle_mounts() into step_into() · cbae4d12
      Committed by Al Viro
      The following is true:
      	* calls of handle_mounts() and step_into() are always
      paired in sequences like
      	err = handle_mounts(nd, dentry, &path, &inode, &seq);
      	if (unlikely(err < 0))
      		return err;
      	err = step_into(nd, &path, flags, inode, seq);
      	* in all such sequences path is uninitialized before and
      unused after this pair of calls
      	* in all such sequences inode and seq are unused afterwards.
      
      So the call of handle_mounts() can be shifted inside step_into(),
      turning 'path' into a local variable in the combined function.
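      The combined function then looks roughly like this (a sketch; argument
      order approximate):

          static int step_into(struct nameidata *nd, int flags,
                               struct dentry *dentry, struct inode *inode,
                               unsigned seq)
          {
                  struct path path;       /* was the callers' local */
                  int err = handle_mounts(nd, dentry, &path, &inode, &seq);
                  if (unlikely(err < 0))
                          return err;
                  /* ... former step_into() body, using path/inode/seq ... */
          }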
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • new step_into() flag: WALK_NOFOLLOW · aca2903e
      Committed by Al Viro
      Tells step_into() not to follow symlinks, regardless of LOOKUP_FOLLOW.
      Allows switching handle_lookup_down() to the use of step_into(), getting
      all follow_managed() and step_into() calls paired.
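      A sketch of the switched-over call site (argument order as in the
      handle_mounts()/step_into() pairs elsewhere in this series):

          /* never follow a symlink here, whatever LOOKUP_FOLLOW says */
          err = step_into(nd, &path, WALK_NOFOLLOW, inode, seq);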
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • step_into() callers: dismiss the symlink earlier · 56676ec3
      Committed by Al Viro
      We need to dismiss a symlink when we are done traversing it;
      currently that's done when we call step_into() for its last
      component.  For the cases when we do not call step_into()
      for that component (i.e. when it's . or ..) we do the same
      symlink dismissal after the call of handle_dots().
      
      What we need to guarantee is that the symlink won't be dismissed
      while we are still using nd->last.name - it's pointing into the
      body of said symlink.  step_into() is sufficiently late - by
      the time it's called we'd already obtained the dentry, so the
      name we'd been looking up is no longer needed.  However, it
      turns out to be cleaner to have that ("we are done with that
      component now, can dismiss the link") done explicitly - in the
      callers of step_into().
      
      In handle_dots() case we won't be using the component string
      at all, so for . and .. the corresponding point is actually
      _before_ the call of handle_dots(), not after it.
      
      Fix a minor irregularity in do_last(), while we are at it -
      if trailing symlink ended with . or .. we forgot to dismiss
      it.  Not a problem, since nameidata is about to be done with
      (neither . nor .. can be a trailing symlink, so this is the
      last iteration through the loop) and terminate_walk() will
      clean the stack anyway, but let's keep it more regular.
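      A walk_component()-style caller thus ends up shaped roughly like this
      (a simplified sketch; the real code also checks that we are really
      done with the link):

          dentry = lookup_fast(nd, &inode, &seq);   /* uses nd->last.name */
          /* ... dentry obtained - the name is no longer needed ... */
          if (nd->depth)
                  put_link(nd);   /* dismiss the link we were parsing */
          /* ... handle_mounts() + step_into() as before ... */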
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • lookup_fast(): take mount traversal into callers · 20e34357
      Committed by Al Viro
      Current calling conventions: -E... on error, 0 on cache miss,
      result of handle_mounts(nd, dentry, path, inode, seqp) on
      success.  Turn that into returning ERR_PTR(-E...), NULL and dentry
      resp.; deal with handle_mounts() in the callers.  The thing
      is, they already do that in cache miss handling case, so we
      just need to supply dentry to them and unify the mount traversal
      in those cases.  Fewer arguments that way, and we get closer
      to merging handle_mounts() and step_into().
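      Callers then follow the pattern (a sketch):

          dentry = lookup_fast(nd, &inode, &seq);
          if (IS_ERR(dentry))
                  return PTR_ERR(dentry);         /* hard error */
          if (unlikely(!dentry))                  /* cache miss */
                  dentry = lookup_slow(&nd->last, nd->path.dentry, nd->flags);
          /* mount traversal via handle_mounts(), shared with the miss path */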
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • teach handle_mounts() to handle RCU mode · c153007b
      Committed by Al Viro
      ... and make the callers of __follow_mount_rcu() use handle_mounts().
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • lookup_fast(): consolidate the RCU success case · b023e172
      Committed by Al Viro
      1) in case of __follow_mount_rcu() failure, lookup_fast() proceeds
      to call unlazy_child() and, should it succeed, handle_mounts().
      Note that we have status > 0 (or we wouldn't be calling
      __follow_mount_rcu() at all), so all stuff conditional upon
      non-positive status won't be even touched.
      
      Consolidate just that sequence after the call of __follow_mount_rcu().
      
      2) calling d_is_negative() and keeping its result is pointless -
      we either don't get past checking ->d_seq (and don't use the results of
      d_is_negative() at all), or we are guaranteed that ->d_inode and
      type bits of ->d_flags had been consistent at the time of d_is_negative()
      call.  IOW, we could only get to the use of its result if it's
      equal to !inode.  The same ->d_seq check guarantees that after that point
      this CPU won't observe ->d_flags values older than ->d_inode update.
      So 'negative' variable is completely pointless these days.
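      In code terms, the RCU success path can simply do (a sketch):

          /* ->d_seq has been validated, so inode and the type bits of
           * ->d_flags are consistent; d_is_negative(dentry) == !inode here */
          if (unlikely(!inode))
                  return -ENOENT;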
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  2. 13 Mar 2020: 3 commits
  3. 28 Feb 2020: 6 commits
    • atomic_open(): saner calling conventions (return dentry on success) · 239eb983
      Committed by Al Viro
      Currently it either returns -E... or puts (nd->path.mnt,dentry)
      into *path and returns 0.  Make it return ERR_PTR(-E...) or
      dentry; adjust the caller.  Fewer arguments and it's easier
      to keep track of *path contents that way.
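      The caller side becomes roughly (a sketch; argument list approximate):

          dentry = atomic_open(nd, dentry, file, op, open_flag, mode);
          if (IS_ERR(dentry))
                  return PTR_ERR(dentry);
          path->mnt = nd->path.mnt;       /* *path assembled in one place */
          path->dentry = dentry;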
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • handle_mounts(): start building a sane wrapper for follow_managed() · bd7c4b50
      Committed by Al Viro
      All callers of follow_managed() follow it on success with the same steps -
      d_backing_inode(path->dentry) is calculated and stored into some struct inode *
      variable and, in all but one case, an unsigned variable (nd->seq to be) is
      zeroed.  The single exception is lookup_fast() and there zeroing is correct
      thing to do - not doing it is a pointless microoptimization.
      
      	Add a wrapper for follow_managed() that would do that combination.
      It's mostly a vehicle for code massage - it will be changing quite a bit,
      and the current calling conventions are by no means final.  Right now it
      takes path, nameidata and (as out params) inode and seq, similar to
      __follow_mount_rcu().  Which will soon get folded into it...
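      As described, the wrapper is roughly (a sketch):

          static int handle_mounts(struct nameidata *nd, struct path *path,
                                   struct inode **inode, unsigned *seqp)
          {
                  int ret = follow_managed(path, nd);
                  if (likely(ret >= 0)) {
                          *inode = d_backing_inode(path->dentry);
                          *seqp = 0;      /* out of RCU mode - value unused */
                  }
                  return ret;
          }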
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • make build_open_flags() treat O_CREAT | O_EXCL as implying O_NOFOLLOW · 31d1726d
      Committed by Al Viro
      O_CREAT | O_EXCL means "-EEXIST if we run into a trailing symlink".
      As it is, we might or might not have LOOKUP_FOLLOW in op->intent
      in that case - that depends upon having O_NOFOLLOW in open flags.
      It doesn't matter, since we won't be checking it in that case -
      do_last() bails out earlier.
      
      However, making sure it's not set (i.e. acting as if we had an explicit
      O_NOFOLLOW) makes the behaviour more explicit and allows reordering the
      check for O_CREAT | O_EXCL in do_last() with the call of step_into()
      immediately following it.
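      I.e., roughly (a sketch of the build_open_flags() change):

          /* O_CREAT | O_EXCL: a trailing symlink gives -EEXIST anyway,
           * so act as if the caller had passed O_NOFOLLOW explicitly */
          if ((flags & (O_CREAT | O_EXCL)) == (O_CREAT | O_EXCL))
                  flags |= O_NOFOLLOW;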
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • follow_automount() doesn't need the entire nameidata · 1c9f5e06
      Committed by Al Viro
      Only the address of ->total_link_count and the flags.
      And fix an off-by-one in ELOOP detection - make it
      consistent with symlink following, where we check if
      the pre-increment value has reached 40, rather than
      checking the post-increment one.
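      A sketch of the corrected check (names approximate):

          if ((*count)++ >= MAXSYMLINKS)  /* pre-increment value hit 40 */
                  return -ELOOP;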
      
      [kudos to Christian Brauner for spotting the braino]
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • follow_automount(): get rid of dead^Wstillborn code · 25e195aa
      Committed by Al Viro
      1) no instances of ->d_automount() have ever made use of the "return
      ERR_PTR(-EISDIR) if you don't feel like mounting anything" - that's
      a rudiment of plans that got superseded before the thing went into
      the tree.  Despite the comment in follow_automount(), autofs has
      never done that.
      
      2) if there's no ->d_automount() in dentry_operations, filesystems
      should not set DCACHE_NEED_AUTOMOUNT in the first place.  None have
      ever done so...
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • fix automount/automount race properly · 26df6034
      Committed by Al Viro
      Protection against automount/automount races (two threads hitting the same
      referral point at the same time) is based upon do_add_mount() prevention of
      identical overmounts - trying to overmount the root of mounted tree with
      the same tree fails with -EBUSY.  It's unreliable (the other thread might've
      mounted something on top of the automount it has triggered) *and* causes
      no end of headache for follow_automount() and its caller, since
      finish_automount() behaves like do_new_mount() - if the mountpoint to be is
      overmounted, it mounts on top of what's overmounting it.  It's not only wrong
      (we want to go into what's overmounting the automount point and quietly
      discard what we planned to mount there), it introduces the possibility of
      original parent mount getting dropped.  That's what 8aef1884 (VFS: Fix
      vfsmount overput on simultaneous automount) deals with, but it can't do
      anything about the reliability of conflict detection - if something had
      been mounted on top of the other thread's automount (e.g. that other thread
      having stepped into the automount in mount(2)), we don't get that -EBUSY and
      the result is
      	 referral point under automounted NFS under explicit overmount
      under another copy of automounted NFS
      
      What we need is finish_automount() *NOT* digging into overmounts - if it
      finds one, it should just quietly discard the thing it was asked to mount.
      And don't bother with actually crossing into the results of finish_automount() -
      the same loop that calls follow_automount() will do that just fine on the
      next iteration.
      
      IOW, instead of calling lock_mount() have finish_automount() do it manually,
      _without_ the "move into overmount and retry" part.  And leave crossing into
      the results to the caller of follow_automount(), which simplifies it a lot.
      
      Moral: if you end up with a lot of glue working around the calling conventions
      of something, perhaps these calling conventions are simply wrong...
      
      Fixes: 8aef1884 (VFS: Fix vfsmount overput on simultaneous automount)
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  4. 11 Feb 2020: 1 commit
  5. 09 Feb 2020: 3 commits
    • fs: Add VirtualBox guest shared folder (vboxsf) support · 0fd16957
      Committed by Hans de Goede
      VirtualBox hosts can share folders with guests; this commit adds a
      VFS driver implementing the Linux-guest side of this, allowing folders
      exported by the host to be mounted under Linux.
      
      This driver depends on the guest <-> host IPC functions exported by
      the vboxguest driver.
      Acked-by: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: Hans de Goede <hdegoede@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • pipe: use exclusive waits when reading or writing · 0ddad21d
      Committed by Linus Torvalds
      This makes the pipe code use separate wait-queues and exclusive waiting
      for readers and writers, avoiding a nasty thundering herd problem when
      there are lots of readers waiting for data on a pipe (or, less commonly,
      lots of writers waiting for a pipe to have space).
      
      While this isn't a common occurrence in the traditional "use a pipe as a
      data transport" case, where you typically only have a single reader and
      a single writer process, there is one common special case: using a pipe
      as a source of "locking tokens" rather than for data communication.
      
      In particular, the GNU make jobserver code ends up using a pipe as a way
      to limit parallelism, where each job consumes a token by reading a byte
      from the jobserver pipe, and releases the token by writing a byte back
      to the pipe.
      
      This pattern is fairly traditional on Unix, and works very well, but
      will waste a lot of time waking up a lot of processes when only a single
      reader needs to be woken up when a writer releases a new token.
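      In sketch form, the mechanics look like this (the real code lives in
      pipe_read()/pipe_write(); the split rd_wait/wr_wait queues are what
      this commit introduces):

          /* readers sleep exclusively on their own queue... */
          wait_event_interruptible_exclusive(pipe->rd_wait, pipe_readable(pipe));
          /* ...and a writer wakes exactly one of them */
          wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM);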
      
      A simplified test-case of just this pipe interaction is to create 64
      processes, and then pass a single token around between them (this
      test-case also intentionally passes another token that gets ignored to
      test the "wake up next" logic too, in case anybody wonders about it):
      
          #include <unistd.h>
      
          int main(int argc, char **argv)
          {
              int fd[2], counters[2];
      
              pipe(fd);
              counters[0] = 0;
              counters[1] = -1;
              write(fd[1], counters, sizeof(counters));
      
              /* 64 processes */
              fork(); fork(); fork(); fork(); fork(); fork();
      
              do {
                      int i;
                      read(fd[0], &i, sizeof(i));
                      if (i < 0)
                              continue;
                      counters[0] = i+1;
                      write(fd[1], counters, (1+(i & 1)) *sizeof(int));
              } while (counters[0] < 1000000);
              return 0;
          }
      
      and in a perfect world, passing that token around should only cause one
      context switch per transfer, when the writer of a token causes a
      directed wakeup of just a single reader.
      
      But with the "writer wakes all readers" model we traditionally had, on
      my test box the above case causes more than an order of magnitude more
      scheduling: instead of the expected ~1M context switches, "perf stat"
      shows
      
              231,852.37 msec task-clock                #   15.857 CPUs utilized
              11,250,961      context-switches          #    0.049 M/sec
                 616,304      cpu-migrations            #    0.003 M/sec
                   1,648      page-faults               #    0.007 K/sec
       1,097,903,998,514      cycles                    #    4.735 GHz
         120,781,778,352      instructions              #    0.11  insn per cycle
          27,997,056,043      branches                  #  120.754 M/sec
             283,581,233      branch-misses             #    1.01% of all branches
      
            14.621273891 seconds time elapsed
      
             0.018243000 seconds user
             3.611468000 seconds sys
      
      before this commit.
      
      After this commit, I get
      
                5,229.55 msec task-clock                #    3.072 CPUs utilized
               1,212,233      context-switches          #    0.232 M/sec
                 103,951      cpu-migrations            #    0.020 M/sec
                   1,328      page-faults               #    0.254 K/sec
          21,307,456,166      cycles                    #    4.074 GHz
          12,947,819,999      instructions              #    0.61  insn per cycle
           2,881,985,678      branches                  #  551.096 M/sec
              64,267,015      branch-misses             #    2.23% of all branches
      
             1.702148350 seconds time elapsed
      
             0.004868000 seconds user
             0.110786000 seconds sys
      
      instead. Much better.
      
      [ Note! This kernel improvement seems to be very good at triggering a
        race condition in the make jobserver (in GNU make 4.2.1) for me. It's
        a long known bug that was fixed back in June 2017 by GNU make commit
        b552b0525198 ("[SV 51159] Use a non-blocking read with pselect to
        avoid hangs.").
      
        But there wasn't a new release of GNU make until 4.3 on Jan 19 2020,
        so a number of distributions may still have the buggy version. Some
        have backported the fix to their 4.2.1 release, though, and even
        without the fix it's quite timing-dependent whether the bug actually
        is hit. ]
      
      Josh Triplett says:
       "I've been hammering on your pipe fix patch (switching to exclusive
        wait queues) for a month or so, on several different systems, and I've
        run into no issues with it. The patch *substantially* improves
        parallel build times on large (~100 CPU) systems, both with parallel
        make and with other things that use make's pipe-based jobserver.
      
        All current distributions (including stable and long-term stable
        distributions) have versions of GNU make that no longer have the
        jobserver bug"
      Tested-by: Josh Triplett <josh@joshtriplett.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • compat_ioctl: fix FIONREAD on devices · 0a061743
      Committed by Arnd Bergmann
      My final cleanup patch for sys_compat_ioctl() introduced a regression on
      the FIONREAD ioctl command, which is used for both regular and special
      files, but only works on regular files after my patch, as I had missed
      the warning that Al Viro put into a comment right above it.
      
      Change it back so it can work on any file again by moving the implementation
      to do_vfs_ioctl() instead.
      
      Fixes: 77b90401 ("compat_ioctl: simplify the implementation")
      Reported-and-tested-by: Christian Zigotzky <chzigotzky@xenosoft.de>
      Reported-and-tested-by: youling257 <youling257@gmail.com>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
  6. 08 Feb 2020: 9 commits