1. 28 5月, 2020 2 次提交
  2. 29 4月, 2020 4 次提交
    • A
      fix autofs regression caused by follow_managed() changes · 23b92780
      Al Viro 提交于
      fix #27211210
      
      commit 508c8772760d4ef9c1a044519b564710c3684fc5 upstream.
      
      we need to reload ->d_flags after the call of ->d_manage() - the thing
      might've been called with dentry still negative and have the damn thing
      turned positive while we'd waited.
      
      Fixes: d41efb522e90 "fs/namei.c: pull positivity check into follow_managed()"
      Reported-by: NIan Kent <raven@themaw.net>
      Tested-by: NIan Kent <raven@themaw.net>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      23b92780
    • A
      fs/namei.c: fix missing barriers when checking positivity · e2bdaed8
      Al Viro 提交于
      fix #27211210
      
      commit 2fa6b1e01a9b1a54769c394f06cd72c3d12a2d48 upstream.
      
      Pinned negative dentries can, generally, be made positive
      by another thread.  Conditions that prevent that are
      	* ->d_lock on dentry in question
      	* parent directory held at least shared
      	* nobody else could have observed the address of dentry
      Most of the places working with those fall into one of those
      categories; however, d_lookup() and friends need to be used
      with some care.  Fortunately, there's not a lot of call sites,
      and with few exceptions all of those fall under one of the
      cases above.
      
      Exceptions are all in fs/namei.c - in lookup_fast(), lookup_dcache()
      and mountpoint_last().  Another one is lookup_slow() - there
      dcache lookup is done with parent held shared, but the result
      is used after we'd drop the lock.  The same happens in do_last() -
      the lookup (in lookup_one()) is done with parent locked, but
      result is used after unlocking.
      
      lookup_fast(), do_last() and mountpoint_last() flat-out reject
      negatives.
      
      Most of lookup_dcache() calls are made with parent locked at least
      shared; the only exception is lookup_one_len_unlocked().  It might
      return pinned negative, needs serious care from callers.  Fortunately,
      almost nobody calls it directly anymore; all but two callers have
      converted to lookup_positive_unlocked(), which rejects negatives.
      
      lookup_slow() is called by the same lookup_one_len_unlocked() (see
      above), mountpoint_last() and walk_component().  In those two negatives
      are rejected.
      
      In other words, there is a small set of places where we need to
      check carefully if a pinned potentially negative dentry is, in
      fact, positive.  After that check we want to be sure that both
      ->d_inode and type bits in ->d_flags are stable and observed.
      The set consists of follow_managed() (where the rejection happens
      for lookup_fast(), walk_component() and do_last()), last_mountpoint()
      and lookup_positive_unlocked().
      
      Solution:
      	1) transition from negative to positive (in __d_set_inode_and_type())
      stores ->d_inode, then uses smp_store_release() to set ->d_flags type bits.
      	2) aforementioned 3 places in fs/namei.c fetch ->d_flags with
      smp_load_acquire() and bugger off if it type bits say "negative".
      That way anyone downstream of those checks has dentry know positive pinned,
      with ->d_inode and type bits of ->d_flags stable and observed.
      
      I considered splitting off d_lookup_positive(), so that the checks could
      be done right there, under ->d_lock.  However, that leads to massive
      duplication of rather subtle code in fs/namei.c and fs/dcache.c.  It's
      worse than it might seem, thanks to autofs ->d_manage() getting involved ;-/
      No matter what, autofs_d_manage()/autofs_d_automount() must live with
      the possibility of pinned negative dentry passed their way, becoming
      positive under them - that's the intended behaviour when lookup comes
      in the middle of automount in progress, so we can't keep them out of
      the area that has to deal with those, more's the pity...
      Reported-by: NRitesh Harjani <riteshh@linux.ibm.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      e2bdaed8
    • A
      new helper: lookup_positive_unlocked() · ec6880e8
      Al Viro 提交于
      fix #27211210
      
      commit 6c2d4798a8d16cf4f3a28c3cd4af4f1dcbbb4d04 upstream.
      
      Most of the callers of lookup_one_len_unlocked() treat negatives are
      ERR_PTR(-ENOENT).  Provide a helper that would do just that.  Note
      that a pinned positive dentry remains positive - it's ->d_inode is
      stable, etc.; a pinned _negative_ dentry can become positive at any
      point as long as you are not holding its parent at least shared.
      So using lookup_one_len_unlocked() needs to be careful;
      lookup_positive_unlocked() is safer and that's what the callers
      end up open-coding anyway.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      ec6880e8
    • A
      fs/namei.c: pull positivity check into follow_managed() · da2c0773
      Al Viro 提交于
      fix #27211210
      
      commit d41efb522e902364ab09c782d511c1bedc388ddd upstream.
      
      There are 4 callers; two proceed to check if result is positive and
      fail with ENOENT if it isn't; one (in handle_lookup_down()) is
      guaranteed to yield positive and one (in lookup_fast()) is _preceded_
      by positivity check.
      
      However, follow_managed() on a negative dentry is a (fairly cheap)
      no-op on anything other than autofs.  And negative autofs dentries
      are never hashed, so lookup_fast() is not going to run into one
      of those.  Moreover, successful follow_managed() on a _positive_
      dentry never yields a negative one (and we significantly rely upon
      that in callers of lookup_fast()).
      
      In other words, we can easily transpose the positivity check and
      the call of follow_managed() in lookup_fast().  And that allows
      to fold the positivity check *into* follow_managed(), simplifying
      life for the code downstream of its calls.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      da2c0773
  3. 20 3月, 2020 1 次提交
    • A
      vfs: fix do_last() regression · 6073719d
      Al Viro 提交于
      commit 6404674acd596de41fd3ad5f267b4525494a891a upstream
      
      Brown paperbag time: fetching ->i_uid/->i_mode really should've been
      done from nd->inode.  I even suggested that, but the reason for that has
      slipped through the cracks and I went for dir->d_inode instead - made
      for more "obvious" patch.
      
      Analysis:
      
       - at the entry into do_last() and all the way to step_into(): dir (aka
         nd->path.dentry) is known not to have been freed; so's nd->inode and
         it's equal to dir->d_inode unless we are already doomed to -ECHILD.
         inode of the file to get opened is not known.
      
       - after step_into(): inode of the file to get opened is known; dir
         might be pointing to freed memory/be negative/etc.
      
       - at the call of may_create_in_sticky(): guaranteed to be out of RCU
         mode; inode of the file to get opened is known and pinned; dir might
         be garbage.
      
      The last was the reason for the original patch.  Except that at the
      do_last() entry we can be in RCU mode and it is possible that
      nd->path.dentry->d_inode has already changed under us.
      
      In that case we are going to fail with -ECHILD, but we need to be
      careful; nd->inode is pointing to valid struct inode and it's the same
      as nd->path.dentry->d_inode in "won't fail with -ECHILD" case, so we
      should use that.
      Reported-by: N"Rantala, Tommi T. (Nokia - FI/Espoo)" <tommi.t.rantala@nokia.com>
      Reported-by: syzbot+190005201ced78a74ad6@syzkaller.appspotmail.com
      Wearing-brown-paperbag: Al Viro <viro@zeniv.linux.org.uk>
      Cc: stable@kernel.org
      Fixes: d0cb50185ae9 ("do_last(): fetch directory ->i_mode and ->i_uid before it's too late")
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      6073719d
  4. 18 3月, 2020 1 次提交
  5. 29 12月, 2018 1 次提交
    • C
      Revert "vfs: Allow userns root to call mknod on owned filesystems." · 9c5ccadb
      Christian Brauner 提交于
      commit 94f82008ce30e2624537d240d64ce718255e0b80 upstream.
      
      This reverts commit 55956b59.
      
      commit 55956b59 ("vfs: Allow userns root to call mknod on owned filesystems.")
      enabled mknod() in user namespaces for userns root if CAP_MKNOD is
      available. However, these device nodes are useless since any filesystem
      mounted from a non-initial user namespace will set the SB_I_NODEV flag on
      the filesystem. Now, when a device node s created in a non-initial user
      namespace a call to open() on said device node will fail due to:
      
      bool may_open_dev(const struct path *path)
      {
              return !(path->mnt->mnt_flags & MNT_NODEV) &&
                      !(path->mnt->mnt_sb->s_iflags & SB_I_NODEV);
      }
      
      The problem with this is that as of the aforementioned commit mknod()
      creates partially functional device nodes in non-initial user namespaces.
      In particular, it has the consequence that as of the aforementioned commit
      open() will be more privileged with respect to device nodes than mknod().
      Before it was the other way around. Specifically, if mknod() succeeded
      then it was transparent for any userspace application that a fatal error
      must have occured when open() failed.
      
      All of this breaks multiple userspace workloads and a widespread assumption
      about how to handle mknod(). Basically, all container runtimes and systemd
      live by the slogan "ask for forgiveness not permission" when running user
      namespace workloads. For mknod() the assumption is that if the syscall
      succeeds the device nodes are useable irrespective of whether it succeeds
      in a non-initial user namespace or not. This logic was chosen explicitly
      to allow for the glorious day when mknod() will actually be able to create
      fully functional device nodes in user namespaces.
      A specific problem people are already running into when running 4.18 rc
      kernels are failing systemd services. For any distro that is run in a
      container systemd services started with the PrivateDevices= property set
      will fail to start since the device nodes in question cannot be
      opened (cf. the arguments in [1]).
      
      Full disclosure, Seth made the very sound argument that it is already
      possible to end up with partially functional device nodes. Any filesystem
      mounted with MS_NODEV set will allow mknod() to succeed but will not allow
      open() to succeed. The difference to the case here is that the MS_NODEV
      case is transparent to userspace since it is an explicitly set mount option
      while the SB_I_NODEV case is an implicit property enforced by the kernel
      and hence opaque to userspace.
      
      [1]: https://github.com/systemd/systemd/pull/9483Signed-off-by: NChristian Brauner <christian@brauner.io>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Seth Forshee <seth.forshee@canonical.com>
      Cc: Serge Hallyn <serge@hallyn.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      9c5ccadb
  6. 24 8月, 2018 1 次提交
  7. 20 7月, 2018 1 次提交
  8. 18 7月, 2018 1 次提交
  9. 12 7月, 2018 17 次提交
  10. 15 6月, 2018 1 次提交
    • D
      afs: Display manually added cells in dynamic root mount · 0da0b7fd
      David Howells 提交于
      Alter the dynroot mount so that cells created by manipulation of
      /proc/fs/afs/cells and /proc/fs/afs/rootcell and by specification of a root
      cell as a module parameter will cause directories for those cells to be
      created in the dynamic root superblock for the network namespace[*].
      
      To this end:
      
       (1) Only one dynamic root superblock is now created per network namespace
           and this is shared between all attempts to mount it.  This makes it
           easier to find the superblock to modify.
      
       (2) When a dynamic root superblock is created, the list of cells is walked
           and directories created for each cell already defined.
      
       (3) When a new cell is added, if a dynamic root superblock exists, a
           directory is created for it.
      
       (4) When a cell is destroyed, the directory is removed.
      
       (5) These directories are created by calling lookup_one_len() on the root
           dir which automatically creates them if they don't exist.
      
      [*] Inasmuch as network namespaces are currently supported here.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      0da0b7fd
  11. 13 6月, 2018 1 次提交
    • K
      treewide: kmalloc() -> kmalloc_array() · 6da2ec56
      Kees Cook 提交于
      The kmalloc() function has a 2-factor argument form, kmalloc_array(). This
      patch replaces cases of:
      
              kmalloc(a * b, gfp)
      
      with:
              kmalloc_array(a * b, gfp)
      
      as well as handling cases of:
      
              kmalloc(a * b * c, gfp)
      
      with:
      
              kmalloc(array3_size(a, b, c), gfp)
      
      as it's slightly less ugly than:
      
              kmalloc_array(array_size(a, b), c, gfp)
      
      This does, however, attempt to ignore constant size factors like:
      
              kmalloc(4 * 1024, gfp)
      
      though any constants defined via macros get caught up in the conversion.
      
      Any factors with a sizeof() of "unsigned char", "char", and "u8" were
      dropped, since they're redundant.
      
      The tools/ directory was manually excluded, since it has its own
      implementation of kmalloc().
      
      The Coccinelle script used for this was:
      
      // Fix redundant parens around sizeof().
      @@
      type TYPE;
      expression THING, E;
      @@
      
      (
        kmalloc(
      -	(sizeof(TYPE)) * E
      +	sizeof(TYPE) * E
        , ...)
      |
        kmalloc(
      -	(sizeof(THING)) * E
      +	sizeof(THING) * E
        , ...)
      )
      
      // Drop single-byte sizes and redundant parens.
      @@
      expression COUNT;
      typedef u8;
      typedef __u8;
      @@
      
      (
        kmalloc(
      -	sizeof(u8) * (COUNT)
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(__u8) * (COUNT)
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(char) * (COUNT)
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(unsigned char) * (COUNT)
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(u8) * COUNT
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(__u8) * COUNT
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(char) * COUNT
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(unsigned char) * COUNT
      +	COUNT
        , ...)
      )
      
      // 2-factor product with sizeof(type/expression) and identifier or constant.
      @@
      type TYPE;
      expression THING;
      identifier COUNT_ID;
      constant COUNT_CONST;
      @@
      
      (
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(TYPE) * (COUNT_ID)
      +	COUNT_ID, sizeof(TYPE)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(TYPE) * COUNT_ID
      +	COUNT_ID, sizeof(TYPE)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(TYPE) * (COUNT_CONST)
      +	COUNT_CONST, sizeof(TYPE)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(TYPE) * COUNT_CONST
      +	COUNT_CONST, sizeof(TYPE)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(THING) * (COUNT_ID)
      +	COUNT_ID, sizeof(THING)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(THING) * COUNT_ID
      +	COUNT_ID, sizeof(THING)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(THING) * (COUNT_CONST)
      +	COUNT_CONST, sizeof(THING)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(THING) * COUNT_CONST
      +	COUNT_CONST, sizeof(THING)
        , ...)
      )
      
      // 2-factor product, only identifiers.
      @@
      identifier SIZE, COUNT;
      @@
      
      - kmalloc
      + kmalloc_array
        (
      -	SIZE * COUNT
      +	COUNT, SIZE
        , ...)
      
      // 3-factor product with 1 sizeof(type) or sizeof(expression), with
      // redundant parens removed.
      @@
      expression THING;
      identifier STRIDE, COUNT;
      type TYPE;
      @@
      
      (
        kmalloc(
      -	sizeof(TYPE) * (COUNT) * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kmalloc(
      -	sizeof(TYPE) * (COUNT) * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kmalloc(
      -	sizeof(TYPE) * COUNT * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kmalloc(
      -	sizeof(TYPE) * COUNT * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kmalloc(
      -	sizeof(THING) * (COUNT) * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        kmalloc(
      -	sizeof(THING) * (COUNT) * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        kmalloc(
      -	sizeof(THING) * COUNT * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        kmalloc(
      -	sizeof(THING) * COUNT * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      )
      
      // 3-factor product with 2 sizeof(variable), with redundant parens removed.
      @@
      expression THING1, THING2;
      identifier COUNT;
      type TYPE1, TYPE2;
      @@
      
      (
        kmalloc(
      -	sizeof(TYPE1) * sizeof(TYPE2) * COUNT
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
        , ...)
      |
        kmalloc(
      -	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
        , ...)
      |
        kmalloc(
      -	sizeof(THING1) * sizeof(THING2) * COUNT
      +	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
        , ...)
      |
        kmalloc(
      -	sizeof(THING1) * sizeof(THING2) * (COUNT)
      +	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
        , ...)
      |
        kmalloc(
      -	sizeof(TYPE1) * sizeof(THING2) * COUNT
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
        , ...)
      |
        kmalloc(
      -	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
        , ...)
      )
      
      // 3-factor product, only identifiers, with redundant parens removed.
      @@
      identifier STRIDE, SIZE, COUNT;
      @@
      
      (
        kmalloc(
      -	(COUNT) * STRIDE * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	COUNT * (STRIDE) * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	COUNT * STRIDE * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	(COUNT) * (STRIDE) * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	COUNT * (STRIDE) * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	(COUNT) * STRIDE * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	(COUNT) * (STRIDE) * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	COUNT * STRIDE * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      )
      
      // Any remaining multi-factor products, first at least 3-factor products,
      // when they're not all constants...
      @@
      expression E1, E2, E3;
      constant C1, C2, C3;
      @@
      
      (
        kmalloc(C1 * C2 * C3, ...)
      |
        kmalloc(
      -	(E1) * E2 * E3
      +	array3_size(E1, E2, E3)
        , ...)
      |
        kmalloc(
      -	(E1) * (E2) * E3
      +	array3_size(E1, E2, E3)
        , ...)
      |
        kmalloc(
      -	(E1) * (E2) * (E3)
      +	array3_size(E1, E2, E3)
        , ...)
      |
        kmalloc(
      -	E1 * E2 * E3
      +	array3_size(E1, E2, E3)
        , ...)
      )
      
      // And then all remaining 2 factors products when they're not all constants,
      // keeping sizeof() as the second factor argument.
      @@
      expression THING, E1, E2;
      type TYPE;
      constant C1, C2, C3;
      @@
      
      (
        kmalloc(sizeof(THING) * C2, ...)
      |
        kmalloc(sizeof(TYPE) * C2, ...)
      |
        kmalloc(C1 * C2 * C3, ...)
      |
        kmalloc(C1 * C2, ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(TYPE) * (E2)
      +	E2, sizeof(TYPE)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(TYPE) * E2
      +	E2, sizeof(TYPE)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(THING) * (E2)
      +	E2, sizeof(THING)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(THING) * E2
      +	E2, sizeof(THING)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	(E1) * E2
      +	E1, E2
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	(E1) * (E2)
      +	E1, E2
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	E1 * E2
      +	E1, E2
        , ...)
      )
      Signed-off-by: NKees Cook <keescook@chromium.org>
      6da2ec56
  12. 04 6月, 2018 1 次提交
    • A
      Revert "fs: fold open_check_o_direct into do_dentry_open" · af04fadc
      Al Viro 提交于
      This reverts commit cab64df1.
      
      Having vfs_open() in some cases drop the reference to
      struct file combined with
      
      	error = vfs_open(path, f, cred);
      	if (error) {
      		put_filp(f);
      		return ERR_PTR(error);
      	}
      	return f;
      
      is flat-out wrong.  It used to be
      
      		error = vfs_open(path, f, cred);
      		if (!error) {
      			/* from now on we need fput() to dispose of f */
      			error = open_check_o_direct(f);
      			if (error) {
      				fput(f);
      				f = ERR_PTR(error);
      			}
      		} else {
      			put_filp(f);
      			f = ERR_PTR(error);
      		}
      
      and sure, having that open_check_o_direct() boilerplate gotten rid of is
      nice, but not that way...
      
      Worse, another call chain (via finish_open()) is FUBAR now wrt
      FILE_OPENED handling - in that case we get error returned, with file
      already hit by fput() *AND* FILE_OPENED not set.  Guess what happens in
      path_openat(), when it hits
      
      	if (!(opened & FILE_OPENED)) {
      		BUG_ON(!error);
      		put_filp(file);
      	}
      
      The root cause of all that crap is that the callers of do_dentry_open()
      have no way to tell which way did it fail; while that could be fixed up
      (by passing something like int *opened to do_dentry_open() and have it
      marked if we'd called ->open()), it's probably much too late in the
      cycle to do so right now.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      af04fadc
  13. 28 5月, 2018 1 次提交
    • A
      rmdir(),rename(): do shrink_dcache_parent() only on success · 8767712f
      Al Viro 提交于
      Once upon a time ->rmdir() instances used to check if victim inode
      had more than one (in-core) reference and failed with -EBUSY if it
      had.  The reason was race avoidance - emptiness check is worthless
      if somebody could just go and create new objects in the victim
      directory afterwards.
      
      With introduction of dcache the checks had been replaced with
      checking the refcount of dentry.  However, since a cached negative
      lookup leaves a negative child dentry, such check had lead to false
      positives - with empty foo/ doing stat foo/bar before rmdir foo
      ended up with -EBUSY unless the negative dentry of foo/bar happened
      to be evicted by the time of rmdir(2).  That had been fixed by
      doing shrink_dcache_parent() just before the refcount check.
      
      At the same time, ext2_rmdir() has grown a private solution that
      eliminated those -EBUSY - it did something (setting ->i_size to 0)
      which made any subsequent ext2_add_entry() fail.
      
      Unfortunately, even with shrink_dcache_parent() the check had been
      racy - after all, the victim itself could be found by dcache lookup
      just after we'd checked its refcount.  That got fixed by a new
      helper (dentry_unhash()) that did shrink_dcache_parent() and unhashed
      the sucker if its refcount ended up equal to 1.  That got called before
      ->rmdir(), turning the checks in ->rmdir() instances into "if not
      unhashed fail with -EBUSY".  Which reduced the boilerplate nicely, but
      had an unpleasant side effect - now shrink_dcache_parent() had been
      done before the emptiness checks, leading to easily triggerable calls
      of shrink_dcache_parent() on arbitrary large subtrees, quite possibly
      nested into each other.
      
      Several years later the ext2-private trick had been generalized -
      (in-core) inodes of dead directories are flagged and calls of
      lookup, readdir and all directory-modifying methods were prevented
      in so marked directories.  Remaining boilerplate in ->rmdir() instances
      became redundant and some instances got rid of it.
      
      In 2011 the call of dentry_unhash() got shifted into ->rmdir() instances
      and then killed off in all of them.  That has lead to another problem,
      though - in case of successful rmdir we *want* any (negative) child
      dentries dropped and the victim itself made negative.  There's no point
      keeping cached negative lookups in foo when we can get the negative
      lookup of foo itself cached.  So shrink_dcache_parent() call had been
      restored; unfortunately, it went into the place where dentry_unhash()
      used to be, i.e. before the ->rmdir() call.  Note that we don't unhash
      anymore, so any "is it busy" checks would be racy; fortunately, all of
      them are gone.
      
      We should've done that call right *after* successful ->rmdir().  That
      reduces contention caused by tree-walking in shrink_dcache_parent()
      and, especially, contention caused by evictions in two nested subtrees
      going on in parallel.  The same goes for directory-overwriting rename() -
      the story there had been parallel to that of rmdir().
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      8767712f
  14. 25 5月, 2018 2 次提交
  15. 18 5月, 2018 1 次提交
  16. 08 4月, 2018 1 次提交
  17. 07 4月, 2018 3 次提交