1. 26 Oct, 2019 1 commit
  2. 17 Jul, 2019 1 commit
  3. 10 Jul, 2019 1 commit
    • Teach shrink_dcache_parent() to cope with mixed-filesystem shrink lists · 9bdebc2b
      Committed by Al Viro
      Currently, running into a shrink list that contains dentries from different
      filesystems can cause several unpleasant things for shrink_dcache_parent()
      and for umount(2).
      
      The first problem is that there's a window during shrink_dentry_list() between
      __dentry_kill() taking a victim out and the reference to its parent being dropped.  During
      that window the parent looks like a genuine busy dentry.  shrink_dcache_parent()
      (or, worse yet, shrink_dcache_for_umount()) coming at that time will see no
      eviction candidates and no indication that it needs to wait for some
      shrink_dentry_list() to proceed further.
      
      That applies to any shrink list that might intersect with the subtree we are
      trying to shrink; the only reason it does not blow up on umount(2) in the mainline
      is that we unregister the memory shrinker before hitting shrink_dcache_for_umount().
      
      Another problem happens if something in a mixed-filesystem shrink list gets
      stuck in e.g. iput(), causing umount of an unrelated fs to spin waiting for
      the stuck shrinker to get around to our dentries.
      
      Solution:
              1) have shrink_dentry_list() decrement the parent's refcount and
      make sure it's on a shrink list (ours unless it already had been on some
      other) before calling __dentry_kill().  That eliminates the window when
      shrink_dcache_parent() would've blown past the entire subtree without
      noticing anything with zero refcount not on shrink lists.
      	2) when shrink_dcache_parent() has found no eviction candidates,
      but some dentries are still sitting on shrink lists, rather than
      repeating the scan in hope that shrinkers have progressed, scan looking
      for something on shrink lists with zero refcount.  If such a thing is
      found, grab rcu_read_lock() and stop the scan, with caller locking
      it for eviction, dropping out of RCU and doing __dentry_kill(), with
      the same treatment for parent as shrink_dentry_list() would do.
      
      Note that right now mixed-filesystem shrink lists do not occur, so this
      is not a mainline bug.  However, there's a bunch of uses for such
      beasts (e.g. the "try and evict everything we can out of given page"
      patches; there are potential uses in mount-related code, considerably
      simplifying the life in fs/namespace.c, etc.)
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  4. 28 Jun, 2019 1 commit
  5. 31 May, 2019 1 commit
  6. 26 May, 2019 2 commits
    • switch mount_capable() to fs_context · 20284ab7
      Committed by Al Viro
      	now both callers of mount_capable() have access to fs_context;
      the only difference is that for sget_fc() we have the possibility
      of fc->global being true, while for legacy_get_tree() it's guaranteed
      to be impossible.  Unify to more generic variant...
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • move the capability checks from sget_userns() to legacy_get_tree() · 2527b284
      Committed by Al Viro
      1) all call chains leading to sget_userns() pass through ->mount()
      instances.
      2) none of ->mount() instances is ever called directly - the only
      call site is legacy_get_tree()
      3) all remaining ->mount() instances end up calling sget_userns()
      
      IOW, we might as well do the capability checks just before calling
      ->mount().  As for the arguments passed to mount_capable(),
      in case of call chains to sget_userns() going through sget(),
      we either don't call mount_capable() at all, or pass current_user_ns()
      to it.  The call chains going through mount_pseudo_xattr() don't
      call mount_capable() at all (SB_KERNMOUNT in flags on those).
      
      That could've been split into smaller steps (lifting the checks
      into sget(), then callers of sget(), then all the way to the
      entries of every ->mount() out there, then to the sole caller),
      but that would be too much churn for little benefit...
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  7. 21 May, 2019 1 commit
  8. 01 May, 2019 1 commit
  9. 10 Apr, 2019 1 commit
  10. 05 Apr, 2019 1 commit
    • acct_on(): don't mess with freeze protection · 9419a319
      Committed by Al Viro
      What happens there is that we are replacing file->path.mnt of
      a file we'd just opened with a clone and we need the write
      count contribution to be transferred from original mount to
      new one.  That's it.  We do *NOT* want any kind of freeze
      protection for the duration of switchover.
      
      IOW, we should just use __mnt_{want,drop}_write() for that
      switchover; no need to bother with mnt_{want,drop}_write()
      there.
      Tested-by: Amir Goldstein <amir73il@gmail.com>
      Reported-by: syzbot+2a73a6ea9507b7112141@syzkaller.appspotmail.com
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  11. 21 Mar, 2019 2 commits
    • vfs: syscall: Add fsconfig() for configuring and managing a context · ecdab150
      Committed by David Howells
      Add a syscall for configuring a filesystem creation context and triggering
      actions upon it, to be used in conjunction with fsopen, fspick and fsmount.
      
          long fsconfig(int fs_fd, unsigned int cmd, const char *key,
      		  const void *value, int aux);
      
      Where fs_fd indicates the context, cmd indicates the action to take, key
      indicates the parameter name for parameter-setting actions and, if needed,
      value points to a buffer containing the value and aux can give more
      information for the value.
      
      The following command IDs are proposed:
      
       (*) FSCONFIG_SET_FLAG: No value is specified.  The parameter must be
           boolean in nature.  The key may be prefixed with "no" to invert the
           setting. value must be NULL and aux must be 0.
      
       (*) FSCONFIG_SET_STRING: A string value is specified.  The parameter can
           be expecting boolean, integer, string or take a path.  A conversion to
           an appropriate type will be attempted (which may include looking up as
           a path).  value points to a NUL-terminated string and aux must be 0.
      
       (*) FSCONFIG_SET_BINARY: A binary blob is specified.  value points to
           the blob and aux indicates its size.  The parameter must be expecting
           a blob.
      
       (*) FSCONFIG_SET_PATH: A non-empty path is specified.  The parameter must
           be expecting a path object.  value points to a NUL-terminated string
           that is the path and aux is a file descriptor at which to start a
           relative lookup or AT_FDCWD.
      
       (*) FSCONFIG_SET_PATH_EMPTY: As FSCONFIG_SET_PATH, but with AT_EMPTY_PATH
           implied.
      
       (*) FSCONFIG_SET_FD: An open file descriptor is specified.  value must
           be NULL and aux indicates the file descriptor.
      
       (*) FSCONFIG_CMD_CREATE: Trigger superblock creation.
      
       (*) FSCONFIG_CMD_RECONFIGURE: Trigger superblock reconfiguration.
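
      The per-command rules for value and aux listed above can be sketched as
      a small user-space checker.  This is only an illustration of the
      argument shapes described in this message, not the kernel's validation
      code: the EX_ names and ex_fsconfig_check_args() are hypothetical, and
      the rules for the two CMD_* commands and for an empty blob are
      assumptions taken from the example usage rather than from the list.

```c
#include <errno.h>
#include <stddef.h>

/* Command IDs in the order listed above.  The EX_ prefix marks these as
 * illustrative stand-ins, not the kernel's own definitions. */
enum ex_fsconfig_cmd {
	EX_FSCONFIG_SET_FLAG,
	EX_FSCONFIG_SET_STRING,
	EX_FSCONFIG_SET_BINARY,
	EX_FSCONFIG_SET_PATH,
	EX_FSCONFIG_SET_PATH_EMPTY,
	EX_FSCONFIG_SET_FD,
	EX_FSCONFIG_CMD_CREATE,
	EX_FSCONFIG_CMD_RECONFIGURE,
};

/* Return 0 if (key, value, aux) has the shape described for cmd,
 * -EINVAL otherwise. */
int ex_fsconfig_check_args(enum ex_fsconfig_cmd cmd, const char *key,
			   const void *value, int aux)
{
	switch (cmd) {
	case EX_FSCONFIG_SET_FLAG:
		/* Boolean parameter: key only, no value, aux must be 0. */
		return (key && !value && aux == 0) ? 0 : -EINVAL;
	case EX_FSCONFIG_SET_STRING:
		/* NUL-terminated string in value, aux must be 0. */
		return (key && value && aux == 0) ? 0 : -EINVAL;
	case EX_FSCONFIG_SET_BINARY:
		/* aux carries the blob size; assumption: size must be > 0. */
		return (key && value && aux > 0) ? 0 : -EINVAL;
	case EX_FSCONFIG_SET_PATH:
	case EX_FSCONFIG_SET_PATH_EMPTY:
		/* value is the path; aux is a dirfd or AT_FDCWD (not checked). */
		return (key && value) ? 0 : -EINVAL;
	case EX_FSCONFIG_SET_FD:
		/* value must be NULL; aux is the file descriptor. */
		return (key && !value && aux >= 0) ? 0 : -EINVAL;
	case EX_FSCONFIG_CMD_CREATE:
	case EX_FSCONFIG_CMD_RECONFIGURE:
		/* Assumption from the example usage: no key, value or aux. */
		return (!key && !value && aux == 0) ? 0 : -EINVAL;
	}
	return -EINVAL;
}
```

      A checker like this only looks at the shape of the arguments; whether
      a given key actually expects a flag, string, blob, path or fd is up to
      the filesystem's parameter description.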
      
      For the "set" command IDs, the idea is that the file_system_type will point
      to a list of parameters and the types of value that those parameters expect
      to take.  The core code can then do the parse and argument conversion and
      then give the LSM and FS a cooked option or array of options to use.
      
      Source specification is also done the same way, using special keys
      "source", "source1", "source2", etc.
      
      [!] Note that, for the moment, the key and value are just glued back
      together and handed to the filesystem.  Every filesystem that uses options
      uses match_token() and co. to do this, and this will need to be changed -
      but not all at once.
      
      Example usage:
      
          fd = fsopen("ext4", FSOPEN_CLOEXEC);
          fsconfig(fd, FSCONFIG_SET_PATH, "source", "/dev/sda1", AT_FDCWD);
          fsconfig(fd, FSCONFIG_SET_PATH_EMPTY, "journal_path", "", journal_fd);
          fsconfig(fd, FSCONFIG_SET_FD, "journal_fd", NULL, journal_fd);
          fsconfig(fd, FSCONFIG_SET_FLAG, "user_xattr", NULL, 0);
          fsconfig(fd, FSCONFIG_SET_FLAG, "noacl", NULL, 0);
          fsconfig(fd, FSCONFIG_SET_STRING, "sb", "1", 0);
          fsconfig(fd, FSCONFIG_SET_STRING, "errors", "continue", 0);
          fsconfig(fd, FSCONFIG_SET_STRING, "data", "journal", 0);
          fsconfig(fd, FSCONFIG_SET_STRING, "context", "unconfined_u:...", 0);
          fsconfig(fd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
          mfd = fsmount(fd, FSMOUNT_CLOEXEC, MS_NOEXEC);
      
      or:
      
          fd = fsopen("ext4", FSOPEN_CLOEXEC);
          fsconfig(fd, FSCONFIG_SET_STRING, "source", "/dev/sda1", 0);
          fsconfig(fd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
          mfd = fsmount(fd, FSMOUNT_CLOEXEC, MS_NOEXEC);
      
      or:
      
          fd = fsopen("afs", FSOPEN_CLOEXEC);
          fsconfig(fd, FSCONFIG_SET_STRING, "source", "#grand.central.org:root.cell", 0);
          fsconfig(fd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
          mfd = fsmount(fd, FSMOUNT_CLOEXEC, MS_NOEXEC);
      
      or:
      
          fd = fsopen("jffs2", FSOPEN_CLOEXEC);
          fsconfig(fd, FSCONFIG_SET_STRING, "source", "mtd0", 0);
          fsconfig(fd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
          mfd = fsmount(fd, FSMOUNT_CLOEXEC, MS_NOEXEC);
      Signed-off-by: David Howells <dhowells@redhat.com>
      cc: linux-api@vger.kernel.org
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • vfs: syscall: Add open_tree(2) to reference or clone a mount · a07b2000
      Committed by Al Viro
      open_tree(dfd, pathname, flags)
      
      Returns an O_PATH-opened file descriptor or an error.
      dfd and pathname specify the location to open, in usual
      fashion (see e.g. fstatat(2)).  flags should be an OR of
      some of the following:
      	* AT_EMPTY_PATH, AT_NO_AUTOMOUNT, AT_SYMLINK_NOFOLLOW -
      same meanings as usual
      	* OPEN_TREE_CLOEXEC - make the resulting descriptor
      close-on-exec
      	* OPEN_TREE_CLONE or OPEN_TREE_CLONE | AT_RECURSIVE -
      instead of opening the location in question, create a detached
      mount tree matching the subtree rooted at location specified by
      dfd/pathname.  With AT_RECURSIVE the entire subtree is cloned,
      without it - only the part within the mount containing the
      location in question.  In other words, the same as mount --rbind
      or mount --bind would've taken.  The detached tree will be
      dissolved on the final close of obtained file.  Creation of such
      detached trees requires the same capabilities as doing mount --bind.
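
      The flag combinations described above can be summarised with a small
      classifier.  This is a user-space sketch of the description, not the
      syscall's implementation: ex_open_tree_mode() and the EX_TREE_* names
      are hypothetical, the #ifndef fallbacks are there only for older
      headers, and rejecting AT_RECURSIVE without OPEN_TREE_CLONE is an
      assumption based on the wording "OPEN_TREE_CLONE or OPEN_TREE_CLONE |
      AT_RECURSIVE".

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <fcntl.h>

/* Fallbacks for older headers; values as in the Linux UAPI headers. */
#ifndef AT_NO_AUTOMOUNT
#define AT_NO_AUTOMOUNT 0x800
#endif
#ifndef AT_EMPTY_PATH
#define AT_EMPTY_PATH 0x1000
#endif
#ifndef AT_RECURSIVE
#define AT_RECURSIVE 0x8000
#endif
#ifndef OPEN_TREE_CLONE
#define OPEN_TREE_CLONE 1
#endif
#ifndef OPEN_TREE_CLOEXEC
#define OPEN_TREE_CLOEXEC O_CLOEXEC
#endif

/* What a given flags word asks for (hypothetical names). */
enum ex_tree_mode {
	EX_TREE_INVALID,	/* unknown bits or a combination not described */
	EX_TREE_OPEN,		/* plain O_PATH-style open of the location */
	EX_TREE_CLONE_MOUNT,	/* detached copy, as mount --bind would take */
	EX_TREE_CLONE_SUBTREE,	/* detached copy, as mount --rbind would take */
};

enum ex_tree_mode ex_open_tree_mode(unsigned int flags)
{
	const unsigned int known = AT_EMPTY_PATH | AT_NO_AUTOMOUNT |
				   AT_SYMLINK_NOFOLLOW | AT_RECURSIVE |
				   OPEN_TREE_CLONE | OPEN_TREE_CLOEXEC;

	if (flags & ~known)
		return EX_TREE_INVALID;
	if (!(flags & OPEN_TREE_CLONE)) {
		/* Assumption: AT_RECURSIVE only makes sense with a clone. */
		return (flags & AT_RECURSIVE) ? EX_TREE_INVALID : EX_TREE_OPEN;
	}
	return (flags & AT_RECURSIVE) ? EX_TREE_CLONE_SUBTREE
				      : EX_TREE_CLONE_MOUNT;
}
```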
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: David Howells <dhowells@redhat.com>
      cc: linux-api@vger.kernel.org
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  12. 28 Feb, 2019 1 commit
    • vfs: Add configuration parser helpers · 31d921c7
      Committed by David Howells
      Because the new API passes in key,value parameters, match_token() cannot be
      used with it.  Instead, provide three new helpers to aid with parsing:
      
       (1) fs_parse().  This takes a parameter and a simple static description of
           all the parameters and maps the key name to an ID.  It returns 1 on a
           match, 0 on no match if unknowns should be ignored and a negative
           error code on a parse error.
      
           The parameter description includes a list of key names to IDs, desired
           parameter types and a list of enumeration name -> ID mappings.
      
           [!] Note that for the moment I've required that the key->ID mapping
           array be sorted and unterminated.  The size of the
           array is noted in the fsconfig_parser struct.  This allows me to use
           bsearch(), but I'm not sure any performance gain is worth the hassle
           of requiring people to keep the array sorted.
      
           The parameter type array is sized according to the number of parameter
           IDs and is indexed directly.  The optional enum mapping array is an
           unterminated, unsorted list and the size goes into the fsconfig_parser
           struct.
      
           The function can do some additional things:
      
      	(a) If it's not ambiguous and no value is given, the prefix "no" on
      	    a key name is permitted to indicate that the parameter should
      	    be considered negatory.
      
      	(b) If the desired type is a single simple integer, it will perform
      	    an appropriate conversion and store the result in a union in
      	    the parse result.
      
      	(c) If the desired type is an enumeration, {key ID, name} will be
      	    looked up in the enumeration list and the matching value will
      	    be stored in the parse result union.
      
      	(d) Optionally generate an error if the key is unrecognised.
      
           This is called something like:
      
      	enum rdt_param {
      		Opt_cdp,
      		Opt_cdpl2,
      		Opt_mba_mpbs,
      		nr__rdt_params
      	};
      
      	const struct fs_parameter_spec rdt_param_specs[nr__rdt_params] = {
      		[Opt_cdp]	= { fs_param_is_bool },
      		[Opt_cdpl2]	= { fs_param_is_bool },
      		[Opt_mba_mpbs]	= { fs_param_is_bool },
      	};
      
      	const char *const rdt_param_keys[nr__rdt_params] = {
      		[Opt_cdp]	= "cdp",
      		[Opt_cdpl2]	= "cdpl2",
      		[Opt_mba_mpbs]	= "mba_mbps",
      	};
      
      	const struct fs_parameter_description rdt_parser = {
      		.name		= "rdt",
      		.nr_params	= nr__rdt_params,
      		.keys		= rdt_param_keys,
      		.specs		= rdt_param_specs,
      		.no_source	= true,
      	};
      
      	int rdt_parse_param(struct fs_context *fc,
      			    struct fs_parameter *param)
      	{
      		struct fs_parse_result parse;
      		struct rdt_fs_context *ctx = rdt_fc2context(fc);
      		int ret;
      
      		ret = fs_parse(fc, &rdt_parser, param, &parse);
      		if (ret < 0)
      			return ret;
      
      		switch (parse.key) {
      		case Opt_cdp:
      			ctx->enable_cdpl3 = true;
      			return 0;
      		case Opt_cdpl2:
      			ctx->enable_cdpl2 = true;
      			return 0;
      		case Opt_mba_mpbs:
      			ctx->enable_mba_mbps = true;
      			return 0;
      		}
      
      		return -EINVAL;
      	}
      
       (2) fs_lookup_param().  This takes a { dirfd, path, LOOKUP_EMPTY? } or
           string value and performs an appropriate path lookup to convert it
           into a path object, which it will then return.
      
           If the desired type was a blockdev, the type of the looked up inode
           will be checked to make sure it is one.
      
           This can be used like:
      
      	enum foo_param {
      		Opt_source,
      		nr__foo_params
      	};
      
      	const struct fs_parameter_spec foo_param_specs[nr__foo_params] = {
      		[Opt_source]	= { fs_param_is_blockdev },
      	};
      
      	const char *const foo_param_keys[nr__foo_params] = {
      		[Opt_source]	= "source",
      	};
      
      	const struct constant_table foo_param_alt_keys[] = {
      		{ "device",	Opt_source },
      	};
      
      	const struct fs_parameter_description foo_parser = {
      		.name		= "foo",
      		.nr_params	= nr__foo_params,
      		.nr_alt_keys	= ARRAY_SIZE(foo_param_alt_keys),
      		.keys		= foo_param_keys,
      		.alt_keys	= foo_param_alt_keys,
      		.specs		= foo_param_specs,
      	};
      
      	int foo_parse_param(struct fs_context *fc,
      			    struct fs_parameter *param)
      	{
      		struct fs_parse_result parse;
      		struct foo_fs_context *ctx = foo_fc2context(fc);
      		int ret;
      
      		ret = fs_parse(fc, &foo_parser, param, &parse);
      		if (ret < 0)
      			return ret;
      
      		switch (parse.key) {
      		case Opt_source:
      			return fs_lookup_param(fc, &foo_parser, param,
      					       &parse, &ctx->source);
      		default:
      			return -EINVAL;
      		}
      	}
      
       (3) lookup_constant().  This takes a table of named constants and looks up
           the given name within it.  The table is expected to be sorted such
           that bsearch() can be used upon it.
      
           Possibly I should require the table be terminated and just use a
           for-loop to scan it instead of using bsearch() to reduce hassle.
      
           Tables look something like:
      
      	static const struct constant_table bool_names[] = {
      		{ "0",		false },
      		{ "1",		true },
      		{ "false",	false },
      		{ "no",		false },
      		{ "true",	true },
      		{ "yes",	true },
      	};
      
           and a lookup is done with something like:
      
      	b = lookup_constant(bool_names, param->string, -1);
      
      Additionally, optional validation routines for the parameter description
      are provided that can be enabled at compile time.  A later patch will
      invoke these when a filesystem is registered.
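
      As a rough user-space illustration of two ideas described above (a
      bsearch() lookup over a name-sorted constant table, and accepting a
      "no" prefix on a valueless key), something like the following could be
      used.  It is a sketch under assumptions, not the kernel helpers: the
      ex_ names, the explicit table-size argument and the two sample keys
      are all hypothetical.

```c
#include <stdlib.h>
#include <string.h>

struct ex_constant {
	const char *name;
	int value;
};

/* bsearch() comparator: key is the name being looked up. */
static int ex_cmp_name(const void *key, const void *elem)
{
	return strcmp(key, ((const struct ex_constant *)elem)->name);
}

/* Look name up in a table sorted by name; return not_found if absent. */
int ex_lookup_constant(const struct ex_constant *tbl, size_t n,
		       const char *name, int not_found)
{
	const struct ex_constant *p =
		bsearch(name, tbl, n, sizeof(*tbl), ex_cmp_name);

	return p ? p->value : not_found;
}

/* Must stay sorted by name for bsearch() to work. */
const struct ex_constant ex_bool_names[] = {
	{ "0", 0 }, { "1", 1 }, { "false", 0 },
	{ "no", 0 }, { "true", 1 }, { "yes", 1 },
};
#define EX_NR_BOOL_NAMES (sizeof(ex_bool_names) / sizeof(ex_bool_names[0]))

/* Hypothetical option IDs and their sorted key table. */
enum { EX_OPT_ACL, EX_OPT_USER_XATTR };
const struct ex_constant ex_keys[] = {
	{ "acl", EX_OPT_ACL }, { "user_xattr", EX_OPT_USER_XATTR },
};
#define EX_NR_KEYS (sizeof(ex_keys) / sizeof(ex_keys[0]))

struct ex_key_match {
	int id;		/* option ID, or -1 if the key is unknown */
	int negated;	/* 1 if the key matched only after stripping "no" */
};

/* Map a key to its ID.  The exact key is tried first; a "no" prefix is
 * honoured only when no value was supplied. */
struct ex_key_match ex_parse_key(const char *key, int has_value)
{
	struct ex_key_match m = { -1, 0 };

	m.id = ex_lookup_constant(ex_keys, EX_NR_KEYS, key, -1);
	if (m.id < 0 && !has_value && strncmp(key, "no", 2) == 0) {
		m.id = ex_lookup_constant(ex_keys, EX_NR_KEYS, key + 2, -1);
		m.negated = (m.id >= 0);
	}
	return m;
}
```

      Trying the exact key before stripping "no" is one way to keep a key
      that genuinely starts with "no" from being misread as a negation.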
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  13. 31 Jan, 2019 4 commits
    • introduce fs_context methods · f3a09c92
      Committed by Al Viro
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • convert do_remount_sb() to fs_context · 8d0347f6
      Committed by David Howells
      Replace do_remount_sb() with a function, reconfigure_super(), that's
      fs_context aware.  The fs_context is expected to be parameterised already
      and have ->root pointing to the superblock to be reconfigured.
      
      A legacy wrapper is provided that is intended to be called from the
      fs_context ops when those appear, but for now is called directly from
      reconfigure_super().  This wrapper invokes the ->remount_fs() superblock op
      for the moment.  It is intended that the remount_fs() op will be phased
      out.
      
      The fs_context->purpose is set to FS_CONTEXT_FOR_RECONFIGURE to indicate
      that the context is being used for reconfiguration.
      
      do_umount_root() is provided to consolidate remount-to-R/O for umount and
      emergency remount by creating a context and invoking reconfiguration.
      
      do_remount(), do_umount() and do_emergency_remount_callback() are switched
      to use the new process.
      
      [AV -- fold UMOUNT and EMERGENCY_REMOUNT in; fixes the
      umount / bug, gets rid of pointless complexity]
      [AV -- set ->net_ns in all cases; nfs remount will need that]
      [AV -- shift security_sb_remount() call into reconfigure_super(); the callers
      that didn't do security_sb_remount() have NULL fc->security anyway, so it's
      a no-op for them]
      Signed-off-by: David Howells <dhowells@redhat.com>
      Co-developed-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • vfs_get_tree(): evict the call of security_sb_kern_mount() · c9ce29ed
      Committed by Al Viro
      Right now vfs_get_tree() calls security_sb_kern_mount() (i.e.
      mount MAC) unless it gets MS_KERNMOUNT or MS_SUBMOUNT in flags.
      Doing it that way is both clumsy and imprecise.
      
      Consider the callers' tree of vfs_get_tree():
      vfs_get_tree()
      	<- do_new_mount()
      	<- vfs_kern_mount()
      		<- simple_pin_fs()
      		<- vfs_submount()
      		<- kern_mount_data()
      		<- init_mount_tree()
      		<- btrfs_mount()
      			<- vfs_get_tree()
      		<- nfs_do_root_mount()
      			<- nfs4_try_mount()
      				<- nfs_fs_mount()
      					<- vfs_get_tree()
      			<- nfs4_referral_mount()
      
      do_new_mount() always does need MAC (we are guaranteed that neither
      MS_KERNMOUNT nor MS_SUBMOUNT will be passed there).
      
      simple_pin_fs(), vfs_submount() and kern_mount_data() pass explicit
      flags inhibiting that check.  So does nfs4_referral_mount() (the
      flags there are ultimately coming from vfs_submount()).
      
      init_mount_tree() is called too early for anything LSM-related; it
      doesn't matter whether we attempt those checks, they'll do nothing.
      
      Finally, in case of btrfs_mount() and nfs_fs_mount(), doing MAC
      is pointless - either the caller will do it, or the flags are
      such that we wouldn't have done it either.
      
      In other words, the one and only case when we want that check
      done is when we are called from do_new_mount(), and there we
      want it unconditionally.
      
      So let's simply move it there.  The superblock is still locked,
      so nobody is going to get access to it (via ustat(2), etc.)
      until we get a chance to apply the checks - we are free to
      move them to any point up to where we drop ->s_umount (in
      do_new_mount_fc()).
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • vfs: Introduce fs_context, switch vfs_kern_mount() to it. · 9bc61ab1
      Committed by David Howells
      Introduce a filesystem context concept to be used during superblock
      creation for mount and superblock reconfiguration for remount.  This is
      allocated at the beginning of the mount procedure and into it is placed:
      
       (1) Filesystem type.
      
       (2) Namespaces.
      
       (3) Source/Device names (there may be multiple).
      
       (4) Superblock flags (SB_*).
      
       (5) Security details.
      
       (6) Filesystem-specific data, as set by the mount options.
      
      Accessor functions are then provided to set up a context, parameterise it
      from monolithic mount data (the data page passed to mount(2)) and tear it
      down again.
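
      To illustrate what "parameterise it from monolithic mount data" can
      mean in the simplest case, here is a user-space sketch that splits a
      comma-separated option string into individual key/value parameters.
      It is not the kernel's code (the legacy wrapper may simply hand the
      data page through unchanged); struct ex_param, its field sizes and
      ex_parse_monolithic() are hypothetical.

```c
#include <stddef.h>
#include <string.h>

struct ex_param {
	char key[32];
	char value[64];
	int has_value;	/* 0 for a bare flag such as "ro" */
};

/* Split "k1=v1,k2,k3=v3" into at most max parameters.
 * Returns the number of parameters stored, or -1 if an option does not
 * fit (too many options, empty key, or key/value too long). */
int ex_parse_monolithic(const char *data, struct ex_param *out, int max)
{
	int n = 0;

	while (*data) {
		size_t len = strcspn(data, ",");	/* one option */
		const char *eq = memchr(data, '=', len);
		size_t klen = eq ? (size_t)(eq - data) : len;
		size_t vlen = eq ? len - klen - 1 : 0;

		if (len) {	/* skip empty options such as ",," */
			if (n == max || klen == 0 ||
			    klen >= sizeof(out->key) ||
			    vlen >= sizeof(out->value))
				return -1;
			memcpy(out[n].key, data, klen);
			out[n].key[klen] = '\0';
			memcpy(out[n].value, eq ? eq + 1 : "", vlen);
			out[n].value[vlen] = '\0';
			out[n].has_value = (eq != NULL);
			n++;
		}
		data += len;
		if (*data == ',')
			data++;
	}
	return n;
}
```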
      
      A legacy wrapper is provided that implements what will be the basic
      operations, wrapping access to filesystems that aren't yet aware of the
      fs_context.
      
      Finally, vfs_kern_mount() is changed to make use of the fs_context and
      mount_fs() is replaced by vfs_get_tree(), called from vfs_kern_mount().
      [AV -- add missing kstrdup()]
      [AV -- put_cred() can be unconditional - fc->cred can't be NULL]
      [AV -- take legacy_validate() contents into legacy_parse_monolithic()]
      [AV -- merge KERNEL_MOUNT and USER_MOUNT]
      [AV -- don't unlock superblock on success return from vfs_get_tree()]
      [AV -- kill 'reference' argument of init_fs_context()]
      Signed-off-by: David Howells <dhowells@redhat.com>
      Co-developed-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  14. 18 Jul, 2018 4 commits
  15. 12 Jul, 2018 4 commits
  16. 11 Jul, 2018 1 commit
  17. 20 Jun, 2018 1 commit
  18. 04 Jun, 2018 1 commit
    • Revert "fs: fold open_check_o_direct into do_dentry_open" · af04fadc
      Committed by Al Viro
      This reverts commit cab64df1.
      
      Having vfs_open() in some cases drop the reference to
      struct file combined with
      
      	error = vfs_open(path, f, cred);
      	if (error) {
      		put_filp(f);
      		return ERR_PTR(error);
      	}
      	return f;
      
      is flat-out wrong.  It used to be
      
      		error = vfs_open(path, f, cred);
      		if (!error) {
      			/* from now on we need fput() to dispose of f */
      			error = open_check_o_direct(f);
      			if (error) {
      				fput(f);
      				f = ERR_PTR(error);
      			}
      		} else {
      			put_filp(f);
      			f = ERR_PTR(error);
      		}
      
      and sure, having that open_check_o_direct() boilerplate gotten rid of is
      nice, but not that way...
      
      Worse, another call chain (via finish_open()) is FUBAR now wrt
      FILE_OPENED handling - in that case we get error returned, with file
      already hit by fput() *AND* FILE_OPENED not set.  Guess what happens in
      path_openat(), when it hits
      
      	if (!(opened & FILE_OPENED)) {
      		BUG_ON(!error);
      		put_filp(file);
      	}
      
      The root cause of all that crap is that the callers of do_dentry_open()
      have no way to tell which way it failed; while that could be fixed up
      (by passing something like int *opened to do_dentry_open() and have it
      marked if we'd called ->open()), it's probably much too late in the
      cycle to do so right now.
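
      The "int *opened" idea mentioned above can be modelled in user space
      to show why the caller needs that extra bit of information.  This is
      a toy sketch, not kernel code and not a patch that was applied: all
      ex_ names are hypothetical, and the two counters merely stand in for
      fput() and put_filp().

```c
/* Where the simulated open fails, if at all. */
enum ex_fail { EX_OK, EX_FAIL_BEFORE_OPEN, EX_FAIL_AFTER_OPEN };

struct ex_file {
	int via_fput;		/* disposed of the way fput() would */
	int via_put_filp;	/* disposed of the way put_filp() would */
};

/* Stand-in for do_dentry_open(): *opened is set once the point where
 * ->open() would have been called has been passed. */
int ex_do_open(enum ex_fail mode, int *opened)
{
	*opened = 0;
	if (mode == EX_FAIL_BEFORE_OPEN)
		return -1;
	*opened = 1;		/* ->open() has been called */
	return mode == EX_FAIL_AFTER_OPEN ? -1 : 0;
}

/* Caller: picks the disposal path based on how far the open got. */
int ex_open_file(struct ex_file *f, enum ex_fail mode)
{
	int opened;
	int error = ex_do_open(mode, &opened);

	if (error) {
		if (opened)
			f->via_fput++;		/* needs the full fput() path */
		else
			f->via_put_filp++;	/* never opened: put_filp() */
	}
	return error;
}
```

      With the flag, each failure is disposed of exactly once and by the
      matching path, which is the property the folded-in version lost.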
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  19. 03 Apr, 2018 9 commits
  20. 28 Mar, 2018 1 commit
  21. 10 Nov, 2017 1 commit