1. 04 Jun, 2020 1 commit
    • ovl: make private mounts longterm · df820f8d
      Miklos Szeredi committed
      Overlayfs is using clone_private_mount() to create internal mounts for
      underlying layers.  These are used for operations requiring a path, such as
      dentry_open().
      
      Since these private mounts are not in any namespace they are treated as
      short term, "detached" mounts and mntput() involves taking the global
      mount_lock, which can result in serious cacheline pingpong.
      
      Make these private mounts longterm instead, which trades the penalty on
      mntput() for a slightly longer shutdown time due to an added RCU grace
      period when putting these mounts.
      
      Introduce a new helper kern_unmount_many() that can take care of multiple
      longterm mounts with a single RCU grace period.
      
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      df820f8d
  2. 29 May, 2020 1 commit
  3. 14 May, 2020 1 commit
    • proc/mounts: add cursor · 9f6c61f9
      Miklos Szeredi committed
      If mounts are deleted after a read(2) call on /proc/self/mounts (or its
      kin), the subsequent read(2) could miss a mount that comes after the
      deleted one in the list.  This is because the file position is interpreted
      as the number of mount entries from the start of the list.
      
      E.g. first read gets entries #0 to #9; the seq file index will be 10.  Then
      entry #5 is deleted, resulting in #10 becoming #9 and #11 becoming #10,
      etc...  The next read will continue from entry #10, and #9 is missed.
      
      Solve this by adding a cursor entry for each open instance.  Taking the
      global namespace_sem for write seems excessive, since we are only dealing
      with a per-namespace list.  Instead add a per-namespace spinlock and use
      that together with namespace_sem taken for read to protect against
      concurrent modification of the mount list.  This may reduce parallelism of
      is_local_mountpoint(), but it's hardly a big contention point.  We could
      also use RCU freeing of cursors to make traversal not need additional
      locks, if that turns out to be necessary.
      
      Only move the cursor once for each read (cursor is not added on open) to
      minimize cacheline invalidation.  When EOF is reached, the cursor is taken
      off the list, in order to prevent an excessive number of cursors due to
      inactive open file descriptors.
      Reported-by: Karel Zak <kzak@redhat.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      9f6c61f9
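The skipped-entry problem described in this commit can be reproduced with a small userspace model. The sketch below is only an illustration under stated assumptions: `struct entry`, `read_by_index()`, `read_by_cursor()` and `first_id_after_delete()` are hypothetical names, the list is a toy circular list rather than the kernel's mount list, and all locking is left out.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a mount list entry; not the kernel's struct mount. */
struct entry {
	struct entry *prev, *next;
	int id;          /* entry number, for illustration only */
	bool is_cursor;  /* cursors are skipped when listing */
};

static void entry_init(struct entry *e, int id, bool is_cursor)
{
	e->prev = e->next = e;   /* self-linked means "not on any list" */
	e->id = id;
	e->is_cursor = is_cursor;
}

static void list_del(struct entry *e)
{
	e->prev->next = e->next;
	e->next->prev = e->prev;
	e->prev = e->next = e;
}

static void list_add_before(struct entry *e, struct entry *pos)
{
	e->prev = pos->prev;
	e->next = pos;
	pos->prev->next = e;
	pos->prev = e;
}

/* Old behaviour: the file position is a count of entries from the list start. */
static int read_by_index(struct entry *head, size_t *pos, int *out, int max)
{
	struct entry *e = head->next;
	size_t skipped = 0;
	int n = 0;

	while (e != head && skipped < *pos) {
		e = e->next;
		skipped++;
	}
	while (e != head && n < max) {
		out[n++] = e->id;
		e = e->next;
	}
	*pos += (size_t)n;
	return n;
}

/*
 * Cursor behaviour: resume after the cursor, then park the cursor in front
 * of the first unread entry. The cursor is not on the list before the first
 * read and is left off the list once the end is reached. (This toy does not
 * distinguish "fresh" from "at EOF"; a real reader tracks that separately.)
 */
static int read_by_cursor(struct entry *head, struct entry *cursor,
			  int *out, int max)
{
	struct entry *e;
	int n = 0;

	if (cursor->next != cursor) {
		e = cursor->next;
		list_del(cursor);
	} else {
		e = head->next;
	}
	while (e != head && n < max) {
		if (!e->is_cursor)
			out[n++] = e->id;
		e = e->next;
	}
	if (e != head)
		list_add_before(cursor, e);
	return n;
}

/* Entries #0..#11: read ten, delete #5, return the first id of the next read. */
static int first_id_after_delete(bool use_cursor)
{
	struct entry head, items[12], cursor;
	int out[12], n, i;
	size_t pos = 0;

	entry_init(&head, -1, false);
	entry_init(&cursor, -1, true);
	for (i = 0; i < 12; i++) {
		entry_init(&items[i], i, false);
		list_add_before(&items[i], &head);   /* append at the tail */
	}
	if (use_cursor)
		read_by_cursor(&head, &cursor, out, 10);
	else
		read_by_index(&head, &pos, out, 10);
	list_del(&items[5]);
	n = use_cursor ? read_by_cursor(&head, &cursor, out, 12)
		       : read_by_index(&head, &pos, out, 12);
	return n > 0 ? out[0] : -1;
}
```

With the index-based reader the second read starts at the original entry #11, so #10 is never reported; with the cursor it starts at #10.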
  4. 13 May, 2020 1 commit
    • nsproxy: attach to namespaces via pidfds · 303cc571
      Christian Brauner committed
      For quite a while we have been thinking about using pidfds to attach to
      namespaces. This patchset has existed for about a year already but we've
      wanted to wait to see how the general api would be received and adopted.
      Now that more and more programs in userspace have started using pidfds
      for process management it's time to send this one out.
      
      This patch makes it possible to use pidfds to attach to the namespaces
      of another process, i.e. they can be passed as the first argument to the
      setns() syscall. When only a single namespace type is specified the
      semantics are equivalent to passing an nsfd. That means
      setns(nsfd, CLONE_NEWNET) equals setns(pidfd, CLONE_NEWNET). However,
      when a pidfd is passed, multiple namespace flags can be specified in the
      second setns() argument and setns() will attach the caller to all the
      specified namespaces all at once or to none of them. Specifying 0 is not
      valid together with a pidfd.
      
      Here are just two obvious examples:
      setns(pidfd, CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWNET);
      setns(pidfd, CLONE_NEWUSER);
      Allowing to also attach subsets of namespaces supports various use-cases
      where callers setns to a subset of namespaces to retain privilege, perform
      an action and then re-attach another subset of namespaces.
      
      If the need arises, as Eric suggested, we can extend this patchset to
      assume even more context than just attaching all namespaces. His suggestion
      specifically was about assuming the process' root directory when
      setns(pidfd, 0) or setns(pidfd, SETNS_PIDFD) is specified. For now, just
      keep it flexible in terms of supporting subsets of namespaces but let's
      wait until we have users asking for even more context to be assumed. At
      that point we can add an extension.
      
      The obvious example where this is useful is a standard container
      manager interacting with a running container: pushing and pulling files
      or directories, injecting mounts, attaching/execing any kind of process,
      managing network devices: all these operations require attaching to all
      or at least multiple namespaces at the same time. Given that nowadays
      most containers are spawned with all namespaces enabled we're currently
      looking at at least 14 syscalls, 7 to open the /proc/<pid>/ns/<ns>
      nsfds, another 7 to actually perform the namespace switch. With time
      namespaces we're looking at about 16 syscalls.
      (We could amortize the first 7 or 8 syscalls for opening the nsfds by
       stashing them in each container's monitor process but that would mean
       we need to send around those file descriptors through unix sockets
       every time we want to interact with the container or keep on-disk
       state. Even in scenarios where a caller wants to join a particular
       namespace in a particular order callers still profit from batching
       other namespaces. That mostly applies to the user namespace but
       all container runtimes I found join the user namespace first no matter
       if it privileges or deprivileges the container similar to how unshare
       behaves.)
      With pidfds this becomes a single syscall no matter how many namespaces
      are supposed to be attached to.
      
      A decently designed, large-scale container manager usually isn't the
      parent of any of the containers it spawns so the containers don't die
      when it crashes or needs to update or reinitialize. This means that
      for the manager to interact with containers through pids is inherently
      racy especially on systems where the maximum pid number is not
      significantly bumped. This is even more problematic since we often spawn
      and manage thousands or ten-thousands of containers. Interacting with a
      container through a pid thus can become risky quite quickly. Especially
      since we allow for an administrator to enable advanced features such as
      syscall interception where we're performing syscalls in lieu of the
      container. In all of those cases we use pidfds if they are available and
      we pass them around as stable references. Using them to setns() to the
      target process' namespaces is as reliable as using nsfds. Either the
      target process is already dead and we get ESRCH or we manage to attach
      to its namespaces but we can't accidentally attach to another process'
      namespaces. So pidfds lend themselves to be used with this api.
      The other main advantage is that with this change the pidfd becomes the
      only relevant token for most container interactions and it's the only
      token we need to create and send around.
      
      Apart from significantly reducing the number of syscalls from double
      digit to single digit which is a decent reason post-spectre/meltdown
      this also allows to switch to a set of namespaces atomically, i.e.
      either attaching to all the specified namespaces succeeds or we fail. If
      we fail we haven't changed a single namespace. There are currently three
      namespaces that can fail (other than for ENOMEM which really is not
      very interesting since we then have other problems anyway) for
      non-trivial reasons, user, mount, and pid namespaces. We can fail to
      attach to a pid namespace if it is not our current active pid namespace
      or a descendant of it. We can fail to attach to a user namespace because
      we are multi-threaded or because our current mount namespace shares
      filesystem state with other tasks, or because we're trying to setns()
      to the same user namespace, i.e. the target task has the same user
      namespace as we do. We can fail to attach to a mount namespace because
      it shares filesystem state with other tasks or because we fail to lookup
      the new root for the new mount namespace. In most non-pathological
      scenarios these issues can be somewhat mitigated. But there are cases where
      we're half-attached to some namespace and failing to attach to another one.
      I've talked about some of these problems during the hallway track (something
      only the pre-COVID-19 generation will remember) of Plumbers in Los Angeles
      in 2018(?). Even if all these issues could be avoided with super careful
      userspace coding it would be nicer to have this done in-kernel. Pidfds seem
      to lend themselves nicely for this.
      
      The other neat thing about this is that setns() becomes an actual
      counterpart to the namespace bits of unshare().
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      Reviewed-by: Serge Hallyn <serge@hallyn.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Serge Hallyn <serge@hallyn.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Aleksa Sarai <cyphar@cyphar.com>
      Link: https://lore.kernel.org/r/20200505140432.181565-3-christian.brauner@ubuntu.com
      303cc571
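A minimal userspace sketch of the single-call attach described in this commit, assuming a kernel that accepts a pidfd as the first setns() argument. The helper name `attach_to_namespaces()` is hypothetical, and the fallback syscall number 434 for pidfd_open is an assumption that holds on most architectures. Raw syscall() is used only so the block does not depend on _GNU_SOURCE; the setns() wrapper from <sched.h> would work equally well.

```c
#include <errno.h>
#include <linux/sched.h>   /* CLONE_NEWNS, CLONE_NEWNET, CLONE_NEWPID, ... */
#include <sys/syscall.h>   /* SYS_setns, SYS_pidfd_open (if headers are new enough) */
#include <sys/types.h>
#include <unistd.h>

#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434 /* assumption: number used by most architectures */
#endif

/*
 * Hypothetical helper: attach the caller to the namespaces of `pid`
 * selected by `nstypes` (an OR of CLONE_NEW* flags) with one setns() call.
 * Returns 0 on success, -1 on failure with errno set.
 */
static int attach_to_namespaces(pid_t pid, int nstypes)
{
	int pidfd, saved_errno;
	long ret;

	pidfd = (int)syscall(SYS_pidfd_open, pid, 0);
	if (pidfd < 0)
		return -1;         /* e.g. ESRCH if the process is already gone */

	/* Either every requested namespace is switched, or none is. */
	ret = syscall(SYS_setns, pidfd, nstypes);

	saved_errno = errno;       /* close() must not clobber the setns() error */
	close(pidfd);
	errno = saved_errno;
	return ret < 0 ? -1 : 0;
}
```

A container manager would call, for instance, `attach_to_namespaces(pid, CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWNET)` in a helper process; the call normally requires the appropriate privileges over the target namespaces.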
  5. 09 May, 2020 1 commit
    • nsproxy: add struct nsset · f2a8d52e
      Christian Brauner committed
      Add a simple struct nsset. It holds all necessary pieces to switch to a new
      set of namespaces without leaving a task in a half-switched state which we
      will make use of in the next patch. This patch switches the existing setns
      logic over without causing a change in setns() behavior. This brings
      setns() closer to how unshare() works. The prepare_ns() function is
      responsible for preparing all necessary information. There are two reasons
      for this. First, it minimizes dependencies between individual namespaces,
      i.e. all install handlers can expect that all fields are properly
      initialized regardless of the order in which they are called. Second, it
      makes the code easier to maintain and easier to follow if it needs to be
      changed.
      
      The prepare_ns() helper will only be switched over to use a flags argument
      in the next patch. Here it will still use nstype as a simple integer
      argument which was argued would be clearer. I'm not particularly
      opinionated about this if it really helps or not. The struct nsset itself
      already contains the flags field since its name already indicates that it
      can contain information required by different namespaces. None of this
      should have functional consequences.
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      Reviewed-by: Serge Hallyn <serge@hallyn.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Serge Hallyn <serge@hallyn.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Aleksa Sarai <cyphar@cyphar.com>
      Link: https://lore.kernel.org/r/20200505140432.181565-2-christian.brauner@ubuntu.com
      f2a8d52e
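The "no half-switched state" idea can be illustrated with a toy prepare-then-commit model. This is not the kernel's struct nsset or its install handlers: every name below (`toy_task`, `toy_nsset`, `toy_install()`, `toy_setns()`) is hypothetical, namespaces are reduced to integer ids, and "a negative id cannot be joined" is an invented failure rule.

```c
#include <stdbool.h>

#define TOY_NS_USER 0x1u
#define TOY_NS_MNT  0x2u
#define TOY_NS_PID  0x4u

/* Toy task: one integer id per namespace type. */
struct toy_task {
	int user_ns, mnt_ns, pid_ns;
};

/* Toy analogue of a namespace set: staged values plus the types installed. */
struct toy_nsset {
	unsigned int flags;
	struct toy_task staged;
};

/* Stage one namespace; only the staging area is touched, never the task. */
static bool toy_install(struct toy_nsset *set, unsigned int type, int target)
{
	if (target < 0)            /* invented rule standing in for real checks */
		return false;
	switch (type) {
	case TOY_NS_USER: set->staged.user_ns = target; break;
	case TOY_NS_MNT:  set->staged.mnt_ns = target;  break;
	case TOY_NS_PID:  set->staged.pid_ns = target;  break;
	default:          return false;
	}
	set->flags |= type;
	return true;
}

/* Switch to the requested namespaces of `target`, all at once or not at all. */
static bool toy_setns(struct toy_task *task, unsigned int flags,
		      const struct toy_task *target)
{
	/* Prepare: start from the task's current state. */
	struct toy_nsset set = { .flags = 0, .staged = *task };

	if ((flags & TOY_NS_USER) &&
	    !toy_install(&set, TOY_NS_USER, target->user_ns))
		return false;
	if ((flags & TOY_NS_MNT) &&
	    !toy_install(&set, TOY_NS_MNT, target->mnt_ns))
		return false;
	if ((flags & TOY_NS_PID) &&
	    !toy_install(&set, TOY_NS_PID, target->pid_ns))
		return false;

	*task = set.staged;        /* commit: the only write to the task */
	return true;
}
```

Because every failure path returns before the commit step, a failed call leaves the task exactly as it was.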
  6. 21 Apr, 2020 1 commit
  7. 14 Mar, 2020 1 commit
  8. 28 Feb, 2020 2 commits
    • follow_automount(): get rid of dead^Wstillborn code · 25e195aa
      Al Viro committed
      1) no instances of ->d_automount() have ever made use of the "return
      ERR_PTR(-EISDIR) if you don't feel like mounting anything" - that's
      a rudiment of plans that got superseded before the thing went into
      the tree.  Despite the comment in follow_automount(), autofs has
      never done that.
      
      2) if there's no ->d_automount() in dentry_operations, filesystems
      should not set DCACHE_NEED_AUTOMOUNT in the first place.  None have
      ever done so...
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      25e195aa
    • fix automount/automount race properly · 26df6034
      Al Viro committed
      Protection against automount/automount races (two threads hitting the same
      referral point at the same time) is based upon do_add_mount() prevention of
      identical overmounts - trying to overmount the root of mounted tree with
      the same tree fails with -EBUSY.  It's unreliable (the other thread might've
      mounted something on top of the automount it has triggered) *and* causes
      no end of headache for follow_automount() and its caller, since
      finish_automount() behaves like do_new_mount() - if the mountpoint-to-be is
      overmounted, it mounts on top of what's overmounting it.  It's not only wrong
      (we want to go into what's overmounting the automount point and quietly
      discard what we planned to mount there), it introduces the possibility of
      original parent mount getting dropped.  That's what 8aef1884 (VFS: Fix
      vfsmount overput on simultaneous automount) deals with, but it can't do
      anything about the reliability of conflict detection - if something had
      been mounted over the other thread's automount (e.g. that other thread
      having stepped into automount in mount(2)), we don't get that -EBUSY and
      the result is
      	 referral point under automounted NFS under explicit overmount
      	 under another copy of automounted NFS
      
      What we need is finish_automount() *NOT* digging into overmounts - if it
      finds one, it should just quietly discard the thing it was asked to mount.
      And don't bother with actually crossing into the results of finish_automount() -
      the same loop that calls follow_automount() will do that just fine on the
      next iteration.
      
      IOW, instead of calling lock_mount() have finish_automount() do it manually,
      _without_ the "move into overmount and retry" part.  And leave crossing into
      the results to the caller of follow_automount(), which simplifies it a lot.
      
      Moral: if you end up with a lot of glue working around the calling conventions
      of something, perhaps these calling conventions are simply wrong...
      
      Fixes: 8aef1884 (VFS: Fix vfsmount overput on simultaneous automount)
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      26df6034
  9. 11 Feb, 2020 1 commit
  10. 04 Feb, 2020 1 commit
  11. 05 Jan, 2020 1 commit
  12. 12 Dec, 2019 1 commit
    • init: use do_mount() instead of ksys_mount() · cccaa5e3
      Dominik Brodowski committed
      In prepare_namespace(), do_mount() can be used instead of ksys_mount()
      as the first and third argument are const strings in the kernel, the
      second and fourth argument are passed through anyway, and the fifth
      argument is NULL.
      
      In do_mount_root(), ksys_mount() is called with the first and third
      argument being already kernelspace strings, which do not need to be
      copied over from userspace to kernelspace (again). The second and
      fourth arguments are passed through to do_mount() anyway. The fifth
      argument, while already residing in kernelspace, needs to be put into
      a page of its own. Then, do_mount() can be used instead of
      ksys_mount().
      
      Once this is done, there are no in-kernel users of ksys_mount() left,
      which can therefore be removed.
      Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
      cccaa5e3
  13. 22 Oct, 2019 1 commit
    • fs/namespace: add __user to open_tree and move_mount syscalls · 2658ce09
      Ben Dooks committed
      The open_tree and move_mount syscalls take names from the
      user, so add the __user to these to ensure the following
      warnings from sparse are fixed:
      
      fs/namespace.c:2392:35: warning: incorrect type in argument 2 (different address spaces)
      fs/namespace.c:2392:35:    expected char const [noderef] <asn:1> *name
      fs/namespace.c:2392:35:    got char const *filename
      fs/namespace.c:3541:38: warning: incorrect type in argument 2 (different address spaces)
      fs/namespace.c:3541:38:    expected char const [noderef] <asn:1> *name
      fs/namespace.c:3541:38:    got char const *from_pathname
      fs/namespace.c:3550:36: warning: incorrect type in argument 2 (different address spaces)
      fs/namespace.c:3550:36:    expected char const [noderef] <asn:1> *name
      fs/namespace.c:3550:36:    got char const *to_pathname
      Signed-off-by: Ben Dooks <ben.dooks@codethink.co.uk>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      2658ce09
  14. 17 Oct, 2019 1 commit
  15. 26 Sep, 2019 1 commit
  16. 07 Sep, 2019 1 commit
  17. 31 Aug, 2019 1 commit
  18. 30 Aug, 2019 1 commit
    • mount: Add mount warning for impending timestamp expiry · f8b92ba6
      Deepa Dinamani committed
      The warning reuses the uptime max of 30 years used by
      settimeofday().
      
      Note that the warning is only emitted for writable filesystem mounts
      through the mount syscall. Automounts do not have the same warning.
      
      Print out the warning in human readable format using the struct tm.
      After discussion with Arnd Bergmann, we chose to print only the year number.
      The raw s_time_max is also displayed, and the user can easily decode
      it e.g. "date -u -d @$((0x7fffffff))". We did not want to consolidate
      struct rtc_tm and struct tm just to print the date using a format specifier
      as part of this series.
      Given that the rtc_tm is not compiled on all architectures, this is not a
      trivial patch. This can be added in the future.
      Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
      Acked-by: Jeff Layton <jlayton@kernel.org>
      f8b92ba6
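The check and the year decoding described in this commit can be sketched in userspace C. This is an illustration under assumptions, not the kernel code: `expires_within_30_years()` and `expiry_year()` are hypothetical names, and the 30-year window is approximated as 30 * 365 days.

```c
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

/* Assumption: 30 years approximated as 30 * 365 days, leap days ignored. */
#define THIRTY_YEARS_SEC (30LL * 365 * 24 * 3600)

/* True if the filesystem's maximum timestamp falls inside the 30-year window. */
static bool expires_within_30_years(int64_t now, int64_t s_time_max)
{
	return now + THIRTY_YEARS_SEC > s_time_max;
}

/* Decode s_time_max to a calendar year in UTC, like `date -u -d @...`. */
static int expiry_year(int64_t s_time_max)
{
	time_t t = (time_t)s_time_max;
	struct tm *tm = gmtime(&t);   /* not thread-safe; fine for a sketch */

	return tm ? tm->tm_year + 1900 : -1;
}
```

For a filesystem limited to signed 32-bit timestamps, `expiry_year(0x7fffffff)` yields 2038, matching the `date` command quoted above.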
  19. 17 Aug, 2019 1 commit
  20. 26 Jul, 2019 1 commit
  21. 22 Jul, 2019 1 commit
  22. 17 Jul, 2019 3 commits
  23. 05 Jul, 2019 5 commits
  24. 01 Jul, 2019 1 commit
    • vfs: move_mount: reject moving kernel internal mounts · 570d7a98
      Eric Biggers committed
      sys_move_mount() crashes by dereferencing the pointer MNT_NS_INTERNAL,
      a.k.a. ERR_PTR(-EINVAL), if the old mount is specified by fd for a
      kernel object with an internal mount, such as a pipe or memfd.
      
      Fix it by checking for this case and returning -EINVAL.
      
      [AV: what we want is is_mounted(); use that instead of making the
      condition even more convoluted]
      
      Reproducer:
      
          #include <unistd.h>
      
          #define __NR_move_mount         429
          #define MOVE_MOUNT_F_EMPTY_PATH 0x00000004
      
          int main()
          {
          	int fds[2];
      
          	pipe(fds);
          	syscall(__NR_move_mount, fds[0], "", -1, "/", MOVE_MOUNT_F_EMPTY_PATH);
          }
      
      Reported-by: syzbot+6004acbaa1893ad013f0@syzkaller.appspotmail.com
      Fixes: 2db154b3 ("vfs: syscall: Add move_mount(2) to move mounts around")
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      570d7a98
  25. 18 Jun, 2019 2 commits
    • fs/namespace: fix unprivileged mount propagation · d728cf79
      Christian Brauner committed
      When propagating mounts across mount namespaces owned by different user
      namespaces it is not possible anymore to move or umount the mount in the
      less privileged mount namespace.
      
      Here is a reproducer:
      
        sudo mount -t tmpfs tmpfs /mnt
        sudo mount --make-rshared /mnt
      
        # create unprivileged user + mount namespace and preserve propagation
        unshare -U -m --map-root --propagation=unchanged
      
        # now change back to the original mount namespace in another terminal:
        sudo mkdir /mnt/aaa
        sudo mount -t tmpfs tmpfs /mnt/aaa
      
        # now in the unprivileged user + mount namespace
        mount --move /mnt/aaa /opt
      
      Unfortunately, this is a pretty big deal for userspace since this is
      e.g. used to inject mounts into running unprivileged containers.
      So this regression really needs to go away rather quickly.
      
      The problem is that a recent change falsely locked the root of the newly
      added mounts by setting MNT_LOCKED. Fix this by only locking the mounts
      on copy_mnt_ns() and not when adding a new mount.
      
      Fixes: 3bd045cc ("separate copying and locking mount tree on cross-userns copies")
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: <stable@vger.kernel.org>
      Tested-by: Christian Brauner <christian@brauner.io>
      Acked-by: Christian Brauner <christian@brauner.io>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: Christian Brauner <christian@brauner.io>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      d728cf79
    • vfs: fsmount: add missing mntget() · 1b0b9cc8
      Eric Biggers committed
      sys_fsmount() needs to take a reference to the new mount when adding it
      to the anonymous mount namespace.  Otherwise the filesystem can be
      unmounted while it's still in use, as found by syzkaller.
      Reported-by: Mark Rutland <mark.rutland@arm.com>
      Reported-by: syzbot+99de05d099a170867f22@syzkaller.appspotmail.com
      Reported-by: syzbot+7008b8b8ba7df475fdc8@syzkaller.appspotmail.com
      Fixes: 93766fbd ("vfs: syscall: Add fsmount() to create a mount for a superblock")
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      1b0b9cc8
  26. 31 May, 2019 1 commit
  27. 26 May, 2019 1 commit
    • move mount_capable() further out · c3aabf07
      Al Viro committed
      Call graph of vfs_get_tree():
      	vfs_fsconfig_locked()	# neither kernmount, nor submount
      	do_new_mount()		# neither kernmount, nor submount
      	fc_mount()
      		afs_mntpt_do_automount()	# submount
      		mount_one_hugetlbfs()		# kernmount
      		pid_ns_prepare_proc()		# kernmount
      		mq_create_mount()		# kernmount
      		vfs_kern_mount()
      			simple_pin_fs()		# kernmount
      			vfs_submount()		# submount
      			kern_mount()		# kernmount
      			init_mount_tree()
      			btrfs_mount()
      			nfs_do_root_mount()
      
      	The first two need the check (unconditionally).
      init_mount_tree() is setting rootfs up; any capability
      checks make zero sense for that one.  And btrfs_mount()/
      nfs_do_root_mount() have the checks already done in their
      callers.
      
      	IOW, we can shift mount_capable() handling into
      the two callers - one in the normal case of mount(2),
      another - in fsconfig(2) handling of FSCONFIG_CMD_CREATE.
      I.e. the syscalls that set a new filesystem up.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      c3aabf07
  28. 09 May, 2019 1 commit
    • do_move_mount(): fix an unsafe use of is_anon_ns() · 05883eee
      Al Viro committed
      What triggers it is a race between mount --move and umount -l
      of the source; we should reject it (the source is parentless *and*
      not the root of anon namespace at that), but the check for namespace
      being an anon one is broken in that case - is_anon_ns() needs
      ns to be non-NULL.  Better fixed here than in is_anon_ns(), since
      the rest of the callers is guaranteed to get a non-NULL argument...
      
      Reported-by: syzbot+494c7ddf66acac0ad747@syzkaller.appspotmail.com
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      05883eee
  29. 21 Mar, 2019 4 commits
    • vfs: syscall: Add fsmount() to create a mount for a superblock · 93766fbd
      David Howells committed
      Provide a system call by which a filesystem opened with fsopen() and
      configured by a series of fsconfig() calls can have a detached mount object
      created for it.  This mount object can then be attached to the VFS mount
      hierarchy using move_mount() by passing the returned file descriptor as the
      from directory fd.
      
      The system call looks like:
      
      	int mfd = fsmount(int fsfd, unsigned int flags,
      			  unsigned int attr_flags);
      
      where fsfd is the file descriptor returned by fsopen().  flags can be 0 or
      FSMOUNT_CLOEXEC.  attr_flags is a bitwise-OR of the following flags:
      
      	MOUNT_ATTR_RDONLY	Mount read-only
      	MOUNT_ATTR_NOSUID	Ignore suid and sgid bits
      	MOUNT_ATTR_NODEV	Disallow access to device special files
      	MOUNT_ATTR_NOEXEC	Disallow program execution
      	MOUNT_ATTR__ATIME	Setting on how atime should be updated
      	MOUNT_ATTR_RELATIME	- Update atime relative to mtime/ctime
      	MOUNT_ATTR_NOATIME	- Do not update access times
      	MOUNT_ATTR_STRICTATIME	- Always perform atime updates
      	MOUNT_ATTR_NODIRATIME	Do not update directory access times
      
      In the event that fsmount() fails, it may be possible to get an error
      message by calling read() on fsfd.  If no message is available, ENODATA
      will be reported.
      Signed-off-by: David Howells <dhowells@redhat.com>
      cc: linux-api@vger.kernel.org
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      93766fbd
    • teach move_mount(2) to work with OPEN_TREE_CLONE · 44dfd84a
      David Howells committed
      Allow a detached tree created by open_tree(..., OPEN_TREE_CLONE) to be
      attached by move_mount(2).
      
      If by the time of final fput() of OPEN_TREE_CLONE-opened file its tree is
      not detached anymore, it won't be dissolved.  move_mount(2) is adjusted
      to handle detached source.
      
      That gives us equivalents of mount --bind and mount --rbind.
      
      Thanks also to Alan Jenkins <alan.christopher.jenkins@gmail.com> for
      providing a whole bunch of ways to break things using this interface.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      44dfd84a
    • vfs: syscall: Add move_mount(2) to move mounts around · 2db154b3
      David Howells committed
      Add a move_mount() system call that will move a mount from one place to
      another and, in the next commit, allow to attach an unattached mount tree.
      
      The new system call looks like the following:
      
      	int move_mount(int from_dfd, const char *from_path,
      		       int to_dfd, const char *to_path,
      		       unsigned int flags);
      Signed-off-by: David Howells <dhowells@redhat.com>
      cc: linux-api@vger.kernel.org
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      2db154b3
    • vfs: syscall: Add open_tree(2) to reference or clone a mount · a07b2000
      Al Viro committed
      open_tree(dfd, pathname, flags)
      
      Returns an O_PATH-opened file descriptor or an error.
      dfd and pathname specify the location to open, in usual
      fashion (see e.g. fstatat(2)).  flags should be an OR of
      some of the following:
      	* AT_EMPTY_PATH, AT_NO_AUTOMOUNT, AT_SYMLINK_NOFOLLOW -
      	  same meanings as usual
      	* OPEN_TREE_CLOEXEC - make the resulting descriptor
      	  close-on-exec
      	* OPEN_TREE_CLONE or OPEN_TREE_CLONE | AT_RECURSIVE -
      	  instead of opening the location in question, create a detached
      	  mount tree matching the subtree rooted at the location specified by
      	  dfd/pathname.  With AT_RECURSIVE the entire subtree is cloned;
      	  without it, only the part within the mount containing the
      	  location in question.  In other words, the same as mount --rbind
      	  or mount --bind would've taken.  The detached tree will be
      	  dissolved on the final close of the obtained file.  Creation of such
      	  detached trees requires the same capabilities as doing mount --bind.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: David Howells <dhowells@redhat.com>
      cc: linux-api@vger.kernel.org
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      a07b2000