1. 09 12月, 2019 7 次提交
  2. 16 11月, 2019 3 次提交
    • A
      fs/namei.c: fix missing barriers when checking positivity · 2fa6b1e0
      Al Viro 提交于
      Pinned negative dentries can, generally, be made positive
      by another thread.  Conditions that prevent that are
      	* ->d_lock on dentry in question
      	* parent directory held at least shared
      	* nobody else could have observed the address of dentry
      Most of the places working with those fall into one of those
      categories; however, d_lookup() and friends need to be used
      with some care.  Fortunately, there's not a lot of call sites,
      and with few exceptions all of those fall under one of the
      cases above.
      
      Exceptions are all in fs/namei.c - in lookup_fast(), lookup_dcache()
      and mountpoint_last().  Another one is lookup_slow() - there
      dcache lookup is done with parent held shared, but the result
      is used after we'd drop the lock.  The same happens in do_last() -
      the lookup (in lookup_one()) is done with parent locked, but
      result is used after unlocking.
      
      lookup_fast(), do_last() and mountpoint_last() flat-out reject
      negatives.
      
      Most of lookup_dcache() calls are made with parent locked at least
      shared; the only exception is lookup_one_len_unlocked().  It might
      return pinned negative, needs serious care from callers.  Fortunately,
      almost nobody calls it directly anymore; all but two callers have
      converted to lookup_positive_unlocked(), which rejects negatives.
      
      lookup_slow() is called by the same lookup_one_len_unlocked() (see
      above), mountpoint_last() and walk_component().  In those two negatives
      are rejected.
      
      In other words, there is a small set of places where we need to
      check carefully if a pinned potentially negative dentry is, in
      fact, positive.  After that check we want to be sure that both
      ->d_inode and type bits in ->d_flags are stable and observed.
      The set consists of follow_managed() (where the rejection happens
      for lookup_fast(), walk_component() and do_last()), last_mountpoint()
      and lookup_positive_unlocked().
      
      Solution:
      	1) transition from negative to positive (in __d_set_inode_and_type())
      stores ->d_inode, then uses smp_store_release() to set ->d_flags type bits.
      	2) aforementioned 3 places in fs/namei.c fetch ->d_flags with
      smp_load_acquire() and bugger off if it type bits say "negative".
      That way anyone downstream of those checks has dentry know positive pinned,
      with ->d_inode and type bits of ->d_flags stable and observed.
      
      I considered splitting off d_lookup_positive(), so that the checks could
      be done right there, under ->d_lock.  However, that leads to massive
      duplication of rather subtle code in fs/namei.c and fs/dcache.c.  It's
      worse than it might seem, thanks to autofs ->d_manage() getting involved ;-/
      No matter what, autofs_d_manage()/autofs_d_automount() must live with
      the possibility of pinned negative dentry passed their way, becoming
      positive under them - that's the intended behaviour when lookup comes
      in the middle of automount in progress, so we can't keep them out of
      the area that has to deal with those, more's the pity...
      Reported-by: NRitesh Harjani <riteshh@linux.ibm.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      2fa6b1e0
    • A
      new helper: lookup_positive_unlocked() · 6c2d4798
      Al Viro 提交于
      Most of the callers of lookup_one_len_unlocked() treat negatives are
      ERR_PTR(-ENOENT).  Provide a helper that would do just that.  Note
      that a pinned positive dentry remains positive - it's ->d_inode is
      stable, etc.; a pinned _negative_ dentry can become positive at any
      point as long as you are not holding its parent at least shared.
      So using lookup_one_len_unlocked() needs to be careful;
      lookup_positive_unlocked() is safer and that's what the callers
      end up open-coding anyway.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      6c2d4798
    • A
      fs/namei.c: pull positivity check into follow_managed() · d41efb52
      Al Viro 提交于
      There are 4 callers; two proceed to check if result is positive and
      fail with ENOENT if it isn't; one (in handle_lookup_down()) is
      guaranteed to yield positive and one (in lookup_fast()) is _preceded_
      by positivity check.
      
      However, follow_managed() on a negative dentry is a (fairly cheap)
      no-op on anything other than autofs.  And negative autofs dentries
      are never hashed, so lookup_fast() is not going to run into one
      of those.  Moreover, successful follow_managed() on a _positive_
      dentry never yields a negative one (and we significantly rely upon
      that in callers of lookup_fast()).
      
      In other words, we can easily transpose the positivity check and
      the call of follow_managed() in lookup_fast().  And that allows
      to fold the positivity check *into* follow_managed(), simplifying
      life for the code downstream of its calls.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      d41efb52
  3. 04 10月, 2019 1 次提交
  4. 03 9月, 2019 1 次提交
    • A
      fs/namei.c: keep track of nd->root refcount status · 84a2bd39
      Al Viro 提交于
      The rules for nd->root are messy:
      	* if we have LOOKUP_ROOT, it doesn't contribute to refcounts
      	* if we have LOOKUP_RCU, it doesn't contribute to refcounts
      	* if nd->root.mnt is NULL, it doesn't contribute to refcounts
      	* otherwise it does contribute
      
      terminate_walk() needs to drop the references if they are contributing.
      So everything else should be careful not to confuse it, leading to
      rather convoluted code.
      
      It's easier to keep track of whether we'd grabbed the reference(s)
      explicitly.  Use a new flag for that.  Don't bother with zeroing
      nd->root.mnt on unlazy failures and in terminate_walk - it's not
      needed anymore (terminate_walk() won't care and the next path_init()
      will zero nd->root in !LOOKUP_ROOT case anyway).
      
      Resulting rules for nd->root refcounts are much simpler: they are
      contributing iff LOOKUP_ROOT_GRABBED is set in nd->flags.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      84a2bd39
  5. 31 8月, 2019 1 次提交
  6. 22 7月, 2019 3 次提交
  7. 20 6月, 2019 1 次提交
    • A
      fsnotify: add empty fsnotify_{unlink,rmdir}() hooks · 116b9731
      Amir Goldstein 提交于
      We would like to move fsnotify_nameremove() calls from d_delete()
      into a higher layer where the hook makes more sense and so we can
      consider every d_delete() call site individually.
      
      Start by creating empty hook fsnotify_{unlink,rmdir}() and place
      them in the proper VFS call sites.  After all d_delete() call sites
      will be converted to use the new hook, the new hook will generate the
      delete events and fsnotify_nameremove() hook will be removed.
      Signed-off-by: NAmir Goldstein <amir73il@gmail.com>
      Signed-off-by: NJan Kara <jack@suse.cz>
      116b9731
  8. 27 4月, 2019 2 次提交
  9. 18 4月, 2019 1 次提交
    • E
      vfs: use READ_ONCE() to access ->i_link · 4c4f7c19
      Eric Biggers 提交于
      Use 'READ_ONCE(inode->i_link)' to explicitly support filesystems caching
      the symlink target in ->i_link later if it was unavailable at iget()
      time, or wasn't easily available.  I'll be doing this in fscrypt, to
      improve the performance of encrypted symlinks on ext4, f2fs, and ubifs.
      
      ->i_link will start NULL and may later be set to a non-NULL value by a
      smp_store_release() or cmpxchg_release().  READ_ONCE() is needed on the
      read side.  smp_load_acquire() is unnecessary because only a data
      dependency barrier is required.  (Thanks to Al for pointing this out.)
      Acked-by: NAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NEric Biggers <ebiggers@google.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      4c4f7c19
  10. 08 3月, 2019 1 次提交
  11. 28 2月, 2019 1 次提交
    • D
      vfs: Add configuration parser helpers · 31d921c7
      David Howells 提交于
      Because the new API passes in key,value parameters, match_token() cannot be
      used with it.  Instead, provide three new helpers to aid with parsing:
      
       (1) fs_parse().  This takes a parameter and a simple static description of
           all the parameters and maps the key name to an ID.  It returns 1 on a
           match, 0 on no match if unknowns should be ignored and some other
           negative error code on a parse error.
      
           The parameter description includes a list of key names to IDs, desired
           parameter types and a list of enumeration name -> ID mappings.
      
           [!] Note that for the moment I've required that the key->ID mapping
           array is expected to be sorted and unterminated.  The size of the
           array is noted in the fsconfig_parser struct.  This allows me to use
           bsearch(), but I'm not sure any performance gain is worth the hassle
           of requiring people to keep the array sorted.
      
           The parameter type array is sized according to the number of parameter
           IDs and is indexed directly.  The optional enum mapping array is an
           unterminated, unsorted list and the size goes into the fsconfig_parser
           struct.
      
           The function can do some additional things:
      
      	(a) If it's not ambiguous and no value is given, the prefix "no" on
      	    a key name is permitted to indicate that the parameter should
      	    be considered negatory.
      
      	(b) If the desired type is a single simple integer, it will perform
      	    an appropriate conversion and store the result in a union in
      	    the parse result.
      
      	(c) If the desired type is an enumeration, {key ID, name} will be
      	    looked up in the enumeration list and the matching value will
      	    be stored in the parse result union.
      
      	(d) Optionally generate an error if the key is unrecognised.
      
           This is called something like:
      
      	enum rdt_param {
      		Opt_cdp,
      		Opt_cdpl2,
      		Opt_mba_mpbs,
      		nr__rdt_params
      	};
      
      	const struct fs_parameter_spec rdt_param_specs[nr__rdt_params] = {
      		[Opt_cdp]	= { fs_param_is_bool },
      		[Opt_cdpl2]	= { fs_param_is_bool },
      		[Opt_mba_mpbs]	= { fs_param_is_bool },
      	};
      
      	const const char *const rdt_param_keys[nr__rdt_params] = {
      		[Opt_cdp]	= "cdp",
      		[Opt_cdpl2]	= "cdpl2",
      		[Opt_mba_mpbs]	= "mba_mbps",
      	};
      
      	const struct fs_parameter_description rdt_parser = {
      		.name		= "rdt",
      		.nr_params	= nr__rdt_params,
      		.keys		= rdt_param_keys,
      		.specs		= rdt_param_specs,
      		.no_source	= true,
      	};
      
      	int rdt_parse_param(struct fs_context *fc,
      			    struct fs_parameter *param)
      	{
      		struct fs_parse_result parse;
      		struct rdt_fs_context *ctx = rdt_fc2context(fc);
      		int ret;
      
      		ret = fs_parse(fc, &rdt_parser, param, &parse);
      		if (ret < 0)
      			return ret;
      
      		switch (parse.key) {
      		case Opt_cdp:
      			ctx->enable_cdpl3 = true;
      			return 0;
      		case Opt_cdpl2:
      			ctx->enable_cdpl2 = true;
      			return 0;
      		case Opt_mba_mpbs:
      			ctx->enable_mba_mbps = true;
      			return 0;
      		}
      
      		return -EINVAL;
      	}
      
       (2) fs_lookup_param().  This takes a { dirfd, path, LOOKUP_EMPTY? } or
           string value and performs an appropriate path lookup to convert it
           into a path object, which it will then return.
      
           If the desired type was a blockdev, the type of the looked up inode
           will be checked to make sure it is one.
      
           This can be used like:
      
      	enum foo_param {
      		Opt_source,
      		nr__foo_params
      	};
      
      	const struct fs_parameter_spec foo_param_specs[nr__foo_params] = {
      		[Opt_source]	= { fs_param_is_blockdev },
      	};
      
      	const char *char foo_param_keys[nr__foo_params] = {
      		[Opt_source]	= "source",
      	};
      
      	const struct constant_table foo_param_alt_keys[] = {
      		{ "device",	Opt_source },
      	};
      
      	const struct fs_parameter_description foo_parser = {
      		.name		= "foo",
      		.nr_params	= nr__foo_params,
      		.nr_alt_keys	= ARRAY_SIZE(foo_param_alt_keys),
      		.keys		= foo_param_keys,
      		.alt_keys	= foo_param_alt_keys,
      		.specs		= foo_param_specs,
      	};
      
      	int foo_parse_param(struct fs_context *fc,
      			    struct fs_parameter *param)
      	{
      		struct fs_parse_result parse;
      		struct foo_fs_context *ctx = foo_fc2context(fc);
      		int ret;
      
      		ret = fs_parse(fc, &foo_parser, param, &parse);
      		if (ret < 0)
      			return ret;
      
      		switch (parse.key) {
      		case Opt_source:
      			return fs_lookup_param(fc, &foo_parser, param,
      					       &parse, &ctx->source);
      		default:
      			return -EINVAL;
      		}
      	}
      
       (3) lookup_constant().  This takes a table of named constants and looks up
           the given name within it.  The table is expected to be sorted such
           that bsearch() be used upon it.
      
           Possibly I should require the table be terminated and just use a
           for-loop to scan it instead of using bsearch() to reduce hassle.
      
           Tables look something like:
      
      	static const struct constant_table bool_names[] = {
      		{ "0",		false },
      		{ "1",		true },
      		{ "false",	false },
      		{ "no",		false },
      		{ "true",	true },
      		{ "yes",	true },
      	};
      
           and a lookup is done with something like:
      
      	b = lookup_constant(bool_names, param->string, -1);
      
      Additionally, optional validation routines for the parameter description
      are provided that can be enabled at compile time.  A later patch will
      invoke these when a filesystem is registered.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      31d921c7
  12. 05 2月, 2019 1 次提交
  13. 31 1月, 2019 1 次提交
    • R
      audit: ignore fcaps on umount · 57d46577
      Richard Guy Briggs 提交于
      Don't fetch fcaps when umount2 is called to avoid a process hang while
      it waits for the missing resource to (possibly never) re-appear.
      
      Note the comment above user_path_mountpoint_at():
       * A umount is a special case for path walking. We're not actually interested
       * in the inode in this situation, and ESTALE errors can be a problem.  We
       * simply want track down the dentry and vfsmount attached at the mountpoint
       * and avoid revalidating the last component.
      
      This can happen on ceph, cifs, 9p, lustre, fuse (gluster) or NFS.
      
      Please see the github issue tracker
      https://github.com/linux-audit/audit-kernel/issues/100Signed-off-by: NRichard Guy Briggs <rgb@redhat.com>
      [PM: merge fuzz in audit_log_fcaps()]
      Signed-off-by: NPaul Moore <paul@paul-moore.com>
      57d46577
  14. 23 12月, 2018 1 次提交
    • C
      Revert "vfs: Allow userns root to call mknod on owned filesystems." · 94f82008
      Christian Brauner 提交于
      This reverts commit 55956b59.
      
      commit 55956b59 ("vfs: Allow userns root to call mknod on owned filesystems.")
      enabled mknod() in user namespaces for userns root if CAP_MKNOD is
      available. However, these device nodes are useless since any filesystem
      mounted from a non-initial user namespace will set the SB_I_NODEV flag on
      the filesystem. Now, when a device node s created in a non-initial user
      namespace a call to open() on said device node will fail due to:
      
      bool may_open_dev(const struct path *path)
      {
              return !(path->mnt->mnt_flags & MNT_NODEV) &&
                      !(path->mnt->mnt_sb->s_iflags & SB_I_NODEV);
      }
      
      The problem with this is that as of the aforementioned commit mknod()
      creates partially functional device nodes in non-initial user namespaces.
      In particular, it has the consequence that as of the aforementioned commit
      open() will be more privileged with respect to device nodes than mknod().
      Before it was the other way around. Specifically, if mknod() succeeded
      then it was transparent for any userspace application that a fatal error
      must have occured when open() failed.
      
      All of this breaks multiple userspace workloads and a widespread assumption
      about how to handle mknod(). Basically, all container runtimes and systemd
      live by the slogan "ask for forgiveness not permission" when running user
      namespace workloads. For mknod() the assumption is that if the syscall
      succeeds the device nodes are useable irrespective of whether it succeeds
      in a non-initial user namespace or not. This logic was chosen explicitly
      to allow for the glorious day when mknod() will actually be able to create
      fully functional device nodes in user namespaces.
      A specific problem people are already running into when running 4.18 rc
      kernels are failing systemd services. For any distro that is run in a
      container systemd services started with the PrivateDevices= property set
      will fail to start since the device nodes in question cannot be
      opened (cf. the arguments in [1]).
      
      Full disclosure, Seth made the very sound argument that it is already
      possible to end up with partially functional device nodes. Any filesystem
      mounted with MS_NODEV set will allow mknod() to succeed but will not allow
      open() to succeed. The difference to the case here is that the MS_NODEV
      case is transparent to userspace since it is an explicitly set mount option
      while the SB_I_NODEV case is an implicit property enforced by the kernel
      and hence opaque to userspace.
      
      [1]: https://github.com/systemd/systemd/pull/9483Signed-off-by: NChristian Brauner <christian@brauner.io>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Seth Forshee <seth.forshee@canonical.com>
      Cc: Serge Hallyn <serge@hallyn.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      94f82008
  15. 24 8月, 2018 1 次提交
  16. 20 7月, 2018 1 次提交
  17. 18 7月, 2018 1 次提交
  18. 12 7月, 2018 12 次提交