1. 28 May 2020, 12 commits
  2. 29 April 2020, 4 commits
    • fix autofs regression caused by follow_managed() changes · 23b92780
      Al Viro authored
      fix #27211210
      
      commit 508c8772760d4ef9c1a044519b564710c3684fc5 upstream.
      
      we need to reload ->d_flags after the call of ->d_manage() - the thing
      might've been called with dentry still negative and have the damn thing
      turned positive while we'd waited.
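      
      As a rough sketch of that reload (illustrative fragment only, not the
      literal patch), the transit branch of follow_managed() ends up doing
      something like:
      
      	if (flags & DCACHE_MANAGE_TRANSIT) {
      		ret = path->dentry->d_op->d_manage(path, false);
      		/* the wait above may have turned a negative dentry positive */
      		flags = smp_load_acquire(&path->dentry->d_flags);
      		if (ret < 0)
      			break;
      	}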
      
      Fixes: d41efb522e90 ("fs/namei.c: pull positivity check into follow_managed()")
      Reported-by: Ian Kent <raven@themaw.net>
      Tested-by: Ian Kent <raven@themaw.net>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
    • fs/namei.c: fix missing barriers when checking positivity · e2bdaed8
      Al Viro authored
      fix #27211210
      
      commit 2fa6b1e01a9b1a54769c394f06cd72c3d12a2d48 upstream.
      
      Pinned negative dentries can, generally, be made positive
      by another thread.  Conditions that prevent that are
      	* ->d_lock on dentry in question
      	* parent directory held at least shared
      	* nobody else could have observed the address of dentry
      Most of the places working with those fall into one of those
      categories; however, d_lookup() and friends need to be used
      with some care.  Fortunately, there's not a lot of call sites,
      and with few exceptions all of those fall under one of the
      cases above.
      
      Exceptions are all in fs/namei.c - in lookup_fast(), lookup_dcache()
      and mountpoint_last().  Another one is lookup_slow() - there
      dcache lookup is done with parent held shared, but the result
      is used after we'd drop the lock.  The same happens in do_last() -
      the lookup (in lookup_one()) is done with parent locked, but
      result is used after unlocking.
      
      lookup_fast(), do_last() and mountpoint_last() flat-out reject
      negatives.
      
      Most of lookup_dcache() calls are made with parent locked at least
      shared; the only exception is lookup_one_len_unlocked().  It might
      return pinned negative, needs serious care from callers.  Fortunately,
      almost nobody calls it directly anymore; all but two callers have
      converted to lookup_positive_unlocked(), which rejects negatives.
      
      lookup_slow() is called by the same lookup_one_len_unlocked() (see
      above), mountpoint_last() and walk_component().  In those two negatives
      are rejected.
      
      In other words, there is a small set of places where we need to
      check carefully if a pinned potentially negative dentry is, in
      fact, positive.  After that check we want to be sure that both
      ->d_inode and type bits in ->d_flags are stable and observed.
      The set consists of follow_managed() (where the rejection happens
      for lookup_fast(), walk_component() and do_last()), mountpoint_last()
      and lookup_positive_unlocked().
      
      Solution:
      	1) transition from negative to positive (in __d_set_inode_and_type())
      stores ->d_inode, then uses smp_store_release() to set ->d_flags type bits.
      	2) aforementioned 3 places in fs/namei.c fetch ->d_flags with
      smp_load_acquire() and bugger off if its type bits say "negative".
      That way anyone downstream of those checks has the dentry known to be
      positive and pinned, with ->d_inode and type bits of ->d_flags stable
      and observed.
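      
      Illustrative fragments of that scheme (close to, but not necessarily
      identical to, the upstream diff):
      
      	/* fs/dcache.c: store ->d_inode first, then release-publish the type
      	 * bits, so a reader that sees "positive" also sees the inode */
      	static inline void __d_set_inode_and_type(struct dentry *dentry,
      						  struct inode *inode,
      						  unsigned type_flags)
      	{
      		unsigned flags;
      
      		dentry->d_inode = inode;
      		flags = READ_ONCE(dentry->d_flags);
      		flags &= ~(DCACHE_ENTRY_TYPE | DCACHE_FALLTHRU);
      		flags |= type_flags;
      		smp_store_release(&dentry->d_flags, flags);
      	}
      
      	/* fs/namei.c (the three places above): acquire-load ->d_flags and
      	 * reject negatives; pairs with the release above */
      	unsigned flags = smp_load_acquire(&dentry->d_flags);
      	if (d_flags_negative(flags))
      		return -ENOENT;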
      
      I considered splitting off d_lookup_positive(), so that the checks could
      be done right there, under ->d_lock.  However, that leads to massive
      duplication of rather subtle code in fs/namei.c and fs/dcache.c.  It's
      worse than it might seem, thanks to autofs ->d_manage() getting involved ;-/
      No matter what, autofs_d_manage()/autofs_d_automount() must live with
      the possibility of pinned negative dentry passed their way, becoming
      positive under them - that's the intended behaviour when lookup comes
      in the middle of automount in progress, so we can't keep them out of
      the area that has to deal with those, more's the pity...
      Reported-by: Ritesh Harjani <riteshh@linux.ibm.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
    • new helper: lookup_positive_unlocked() · ec6880e8
      Al Viro authored
      fix #27211210
      
      commit 6c2d4798a8d16cf4f3a28c3cd4af4f1dcbbb4d04 upstream.
      
      Most of the callers of lookup_one_len_unlocked() treat negatives as
      ERR_PTR(-ENOENT).  Provide a helper that would do just that.  Note
      that a pinned positive dentry remains positive - its ->d_inode is
      stable, etc.; a pinned _negative_ dentry can become positive at any
      point as long as you are not holding its parent at least shared.
      So using lookup_one_len_unlocked() needs to be careful;
      lookup_positive_unlocked() is safer and that's what the callers
      end up open-coding anyway.
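      
      The helper could look roughly like this (a sketch following the
      description above, not guaranteed to match the upstream code line
      for line):
      
      	struct dentry *lookup_positive_unlocked(const char *name,
      						struct dentry *base, int len)
      	{
      		struct dentry *ret = lookup_one_len_unlocked(name, base, len);
      
      		/* a pinned negative is useless here - report it as ENOENT */
      		if (!IS_ERR(ret) && d_is_negative(ret)) {
      			dput(ret);
      			ret = ERR_PTR(-ENOENT);
      		}
      		return ret;
      	}
      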
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
    • fs/namei.c: pull positivity check into follow_managed() · da2c0773
      Al Viro authored
      fix #27211210
      
      commit d41efb522e902364ab09c782d511c1bedc388ddd upstream.
      
      There are 4 callers; two proceed to check if result is positive and
      fail with ENOENT if it isn't; one (in handle_lookup_down()) is
      guaranteed to yield positive and one (in lookup_fast()) is _preceded_
      by positivity check.
      
      However, follow_managed() on a negative dentry is a (fairly cheap)
      no-op on anything other than autofs.  And negative autofs dentries
      are never hashed, so lookup_fast() is not going to run into one
      of those.  Moreover, successful follow_managed() on a _positive_
      dentry never yields a negative one (and we significantly rely upon
      that in callers of lookup_fast()).
      
      In other words, we can easily transpose the positivity check and
      the call of follow_managed() in lookup_fast().  And that allows us
      to fold the positivity check *into* follow_managed(), simplifying
      life for the code downstream of its calls.
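      
      The end of follow_managed() then looks roughly like this (illustrative,
      not the literal diff):
      
      	if (ret == -EISDIR || !ret)
      		ret = 1;
      	/* callers no longer need their own positivity check */
      	if (ret > 0 && unlikely(d_is_negative(path->dentry)))
      		ret = -ENOENT;
      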
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
  3. 20 March 2020, 1 commit
    • vfs: fix do_last() regression · 6073719d
      Al Viro authored
      commit 6404674acd596de41fd3ad5f267b4525494a891a upstream
      
      Brown paperbag time: fetching ->i_uid/->i_mode really should've been
      done from nd->inode.  I even suggested that, but the reason for that has
      slipped through the cracks and I went for dir->d_inode instead - made
      for more "obvious" patch.
      
      Analysis:
      
       - at the entry into do_last() and all the way to step_into(): dir (aka
         nd->path.dentry) is known not to have been freed; so's nd->inode and
         it's equal to dir->d_inode unless we are already doomed to -ECHILD.
         inode of the file to get opened is not known.
      
       - after step_into(): inode of the file to get opened is known; dir
         might be pointing to freed memory/be negative/etc.
      
       - at the call of may_create_in_sticky(): guaranteed to be out of RCU
         mode; inode of the file to get opened is known and pinned; dir might
         be garbage.
      
      The last was the reason for the original patch.  Except that at the
      do_last() entry we can be in RCU mode and it is possible that
      nd->path.dentry->d_inode has already changed under us.
      
      In that case we are going to fail with -ECHILD, but we need to be
      careful; nd->inode is pointing to valid struct inode and it's the same
      as nd->path.dentry->d_inode in "won't fail with -ECHILD" case, so we
      should use that.
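      
      The fix itself is small; roughly (not the literal diff):
      
      	/* in do_last(), before the lookup: */
      	kuid_t dir_uid = nd->inode->i_uid;	/* was: dir->d_inode->i_uid */
      	umode_t dir_mode = nd->inode->i_mode;	/* was: dir->d_inode->i_mode */
      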
      Reported-by: N"Rantala, Tommi T. (Nokia - FI/Espoo)" <tommi.t.rantala@nokia.com>
      Reported-by: syzbot+190005201ced78a74ad6@syzkaller.appspotmail.com
      Wearing-brown-paperbag: Al Viro <viro@zeniv.linux.org.uk>
      Cc: stable@kernel.org
      Fixes: d0cb50185ae9 ("do_last(): fetch directory ->i_mode and ->i_uid before it's too late")
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
  4. 18 March 2020, 1 commit
  5. 29 December 2018, 1 commit
    • Revert "vfs: Allow userns root to call mknod on owned filesystems." · 9c5ccadb
      Christian Brauner authored
      commit 94f82008ce30e2624537d240d64ce718255e0b80 upstream.
      
      This reverts commit 55956b59.
      
      commit 55956b59 ("vfs: Allow userns root to call mknod on owned filesystems.")
      enabled mknod() in user namespaces for userns root if CAP_MKNOD is
      available. However, these device nodes are useless since any filesystem
      mounted from a non-initial user namespace will set the SB_I_NODEV flag on
      the filesystem. Now, when a device node is created in a non-initial user
      namespace a call to open() on said device node will fail due to:
      
      bool may_open_dev(const struct path *path)
      {
              return !(path->mnt->mnt_flags & MNT_NODEV) &&
                      !(path->mnt->mnt_sb->s_iflags & SB_I_NODEV);
      }
      
      The problem with this is that as of the aforementioned commit mknod()
      creates partially functional device nodes in non-initial user namespaces.
      In particular, it has the consequence that as of the aforementioned commit
      open() will be more privileged with respect to device nodes than mknod().
      Before it was the other way around. Specifically, if mknod() succeeded
      then it was transparent for any userspace application that a fatal error
      must have occurred when open() failed.
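      
      As a hypothetical userspace illustration of that asymmetry (not part of
      the commit): run inside a non-initial user namespace, on a kernel with
      commit 55956b59 applied and a filesystem carrying SB_I_NODEV, the
      mknod() below succeeds while the open() fails.
      
      	#include <sys/stat.h>
      	#include <sys/sysmacros.h>
      	#include <fcntl.h>
      	#include <stdio.h>
      	#include <errno.h>
      	#include <string.h>
      
      	int main(void)
      	{
      		/* create a char device node for /dev/null (1:3) */
      		if (mknod("./null", S_IFCHR | 0600, makedev(1, 3)) == 0)
      			printf("mknod: ok\n");
      		if (open("./null", O_RDWR) < 0)
      			printf("open: %s\n", strerror(errno));	/* e.g. EACCES */
      		return 0;
      	}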
      
      All of this breaks multiple userspace workloads and a widespread assumption
      about how to handle mknod(). Basically, all container runtimes and systemd
      live by the slogan "ask for forgiveness not permission" when running user
      namespace workloads. For mknod() the assumption is that if the syscall
      succeeds the device nodes are useable irrespective of whether it succeeds
      in a non-initial user namespace or not. This logic was chosen explicitly
      to allow for the glorious day when mknod() will actually be able to create
      fully functional device nodes in user namespaces.

      A specific problem people are already running into when running 4.18 rc
      kernels is failing systemd services. For any distro that is run in a
      container systemd services started with the PrivateDevices= property set
      will fail to start since the device nodes in question cannot be
      opened (cf. the arguments in [1]).
      
      Full disclosure, Seth made the very sound argument that it is already
      possible to end up with partially functional device nodes. Any filesystem
      mounted with MS_NODEV set will allow mknod() to succeed but will not allow
      open() to succeed. The difference to the case here is that the MS_NODEV
      case is transparent to userspace since it is an explicitly set mount option
      while the SB_I_NODEV case is an implicit property enforced by the kernel
      and hence opaque to userspace.
      
      [1]: https://github.com/systemd/systemd/pull/9483
      Signed-off-by: Christian Brauner <christian@brauner.io>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Seth Forshee <seth.forshee@canonical.com>
      Cc: Serge Hallyn <serge@hallyn.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  6. 24 August 2018, 1 commit
  7. 20 July 2018, 1 commit
  8. 18 July 2018, 1 commit
  9. 12 July 2018, 17 commits
  10. 15 June 2018, 1 commit
    • afs: Display manually added cells in dynamic root mount · 0da0b7fd
      David Howells authored
      Alter the dynroot mount so that cells created by manipulation of
      /proc/fs/afs/cells and /proc/fs/afs/rootcell and by specification of a root
      cell as a module parameter will cause directories for those cells to be
      created in the dynamic root superblock for the network namespace[*].
      
      To this end:
      
       (1) Only one dynamic root superblock is now created per network namespace
           and this is shared between all attempts to mount it.  This makes it
           easier to find the superblock to modify.
      
       (2) When a dynamic root superblock is created, the list of cells is walked
           and directories created for each cell already defined.
      
       (3) When a new cell is added, if a dynamic root superblock exists, a
           directory is created for it.
      
       (4) When a cell is destroyed, the directory is removed.
      
       (5) These directories are created by calling lookup_one_len() on the root
           dir which automatically creates them if they don't exist.
      
      [*] Inasmuch as network namespaces are currently supported here.
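      
      Point (5) could look roughly like this (a sketch with approximate field
      names, not necessarily the exact upstream code):
      
      	/* create (or find) the directory for one cell in the dynamic root;
      	 * the dynroot ->lookup instantiates it if it doesn't exist yet */
      	static int afs_dynroot_mkdir(struct super_block *sb, struct afs_cell *cell)
      	{
      		struct dentry *root = sb->s_root, *subdir;
      
      		inode_lock(d_inode(root));
      		subdir = lookup_one_len(cell->name, root, strlen(cell->name));
      		inode_unlock(d_inode(root));
      		if (IS_ERR(subdir))
      			return PTR_ERR(subdir);
      		dput(subdir);
      		return 0;
      	}
      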
      Signed-off-by: David Howells <dhowells@redhat.com>