1. 26 5月, 2016 24 次提交
  2. 13 5月, 2016 2 次提交
    • J
      ocfs2: fix posix_acl_create deadlock · c25a1e06
      Junxiao Bi 提交于
      Commit 702e5bc6 ("ocfs2: use generic posix ACL infrastructure")
      refactored code to use posix_acl_create.  The problem with this function
      is that it is not mindful of the cluster wide inode lock making it
      unsuitable for use with ocfs2 inode creation with ACLs.  For example,
      when used in ocfs2_mknod, this function can cause deadlock as follows.
      The parent dir inode lock is taken when calling posix_acl_create ->
      get_acl -> ocfs2_iop_get_acl which takes the inode lock again.  This can
      cause deadlock if there is a blocked remote lock request waiting for the
      lock to be downconverted.  And same deadlock happened in ocfs2_reflink.
      This fix is to revert back using ocfs2_init_acl.
      
      Fixes: 702e5bc6 ("ocfs2: use generic posix ACL infrastructure")
      Signed-off-by: NTariq Saeed <tariq.x.saeed@oracle.com>
      Signed-off-by: NJunxiao Bi <junxiao.bi@oracle.com>
      Cc: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Joseph Qi <joseph.qi@huawei.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c25a1e06
    • J
      ocfs2: revert using ocfs2_acl_chmod to avoid inode cluster lock hang · 5ee0fbd5
      Junxiao Bi 提交于
      Commit 743b5f14 ("ocfs2: take inode lock in ocfs2_iop_set/get_acl()")
      introduced this issue.  ocfs2_setattr called by chmod command holds
      cluster wide inode lock when calling posix_acl_chmod.  This latter
      function in turn calls ocfs2_iop_get_acl and ocfs2_iop_set_acl.  These
      two are also called directly from vfs layer for getfacl/setfacl commands
      and therefore acquire the cluster wide inode lock.  If a remote
      conversion request comes after the first inode lock in ocfs2_setattr,
      OCFS2_LOCK_BLOCKED will be set.  And this will cause the second call to
      inode lock from the ocfs2_iop_get_acl() to block indefinetly.
      
      The deleted version of ocfs2_acl_chmod() calls __posix_acl_chmod() which
      does not call back into the filesystem.  Therefore, we restore
      ocfs2_acl_chmod(), modify it slightly for locking as needed, and use that
      instead.
      
      Fixes: 743b5f14 ("ocfs2: take inode lock in ocfs2_iop_set/get_acl()")
      Signed-off-by: NTariq Saeed <tariq.x.saeed@oracle.com>
      Signed-off-by: NJunxiao Bi <junxiao.bi@oracle.com>
      Cc: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Joseph Qi <joseph.qi@huawei.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5ee0fbd5
  3. 12 5月, 2016 1 次提交
  4. 11 5月, 2016 4 次提交
    • M
      ovl: ignore permissions on underlying lookup · 38b78a5f
      Miklos Szeredi 提交于
      Generally permission checking is not necessary when overlayfs looks up a
      dentry on one of the underlying layers, since search permission on base
      directory was already checked in ovl_permission().
      
      More specifically using lookup_one_len() causes a problem when the lower
      directory lacks search permission for a specific user while the upper
      directory does have search permission.  Since lookups are cached, this
      causes inconsistency in behavior: success depends on who did the first
      lookup.
      
      So instead use lookup_hash() which doesn't do the permission check.
      Reported-by: NIgnacy Gawędzki <ignacy.gawedzki@green-communications.fr>
      Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>
      38b78a5f
    • M
      vfs: add lookup_hash() helper · 3c9fe8cd
      Miklos Szeredi 提交于
      Overlayfs needs lookup without inode_permission() and already has the name
      hash (in form of dentry->d_name on overlayfs dentry).  It also doesn't
      support filesystems with d_op->d_hash() so basically it only needs
      the actual hashed lookup from lookup_one_len_unlocked()
      
      So add a new helper that does unlocked lookup of a hashed name.
      Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>
      3c9fe8cd
    • M
      vfs: rename: check backing inode being equal · 9409e22a
      Miklos Szeredi 提交于
      If a file is renamed to a hardlink of itself POSIX specifies that rename(2)
      should do nothing and return success.
      
      This condition is checked in vfs_rename().  However it won't detect hard
      links on overlayfs where these are given separate inodes on the overlayfs
      layer.
      
      Overlayfs itself detects this condition and returns success without doing
      anything, but then vfs_rename() will proceed as if this was a successful
      rename (detach_mounts(), d_move()).
      
      The correct thing to do is to detect this condition before even calling
      into overlayfs.  This patch does this by calling vfs_select_inode() to get
      the underlying inodes.
      Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>
      Cc: <stable@vger.kernel.org> # v4.2+
      9409e22a
    • M
      vfs: add vfs_select_inode() helper · 54d5ca87
      Miklos Szeredi 提交于
      Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>
      Cc: <stable@vger.kernel.org> # v4.2+
      54d5ca87
  5. 10 5月, 2016 2 次提交
    • R
      Revert "proc/base: make prompt shell start from new line after executing "cat /proc/$pid/wchan"" · 1e92a61c
      Robin Humble 提交于
      This reverts the 4.6-rc1 commit 7e2bc81d ("proc/base: make prompt
      shell start from new line after executing "cat /proc/$pid/wchan")
      because it breaks /proc/$PID/whcan formatting in ps and top.
      
      Revert also because the patch is inconsistent - it adds a newline at the
      end of only the '0' wchan, and does not add a newline when
      /proc/$PID/wchan contains a symbol name.
      
      eg.
      $ ps -eo pid,stat,wchan,comm
      PID STAT WCHAN  COMMAND
      ...
      1189 S    -      dbus-launch
      1190 Ssl  0
      dbus-daemon
      1198 Sl   0
      lightdm
      1299 Ss   ep_pol systemd
      1301 S    -      (sd-pam)
      1304 Ss   wait   sh
      Signed-off-by: NRobin Humble <plaguedbypenguins@gmail.com>
      Cc: Minfei Huang <mnfhuang@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1e92a61c
    • S
      cgroup, kernfs: make mountinfo show properly scoped path for cgroup namespaces · 4f41fc59
      Serge E. Hallyn 提交于
      Patch summary:
      
      When showing a cgroupfs entry in mountinfo, show the path of the mount
      root dentry relative to the reader's cgroup namespace root.
      
      Short explanation (courtesy of mkerrisk):
      
      If we create a new cgroup namespace, then we want both /proc/self/cgroup
      and /proc/self/mountinfo to show cgroup paths that are correctly
      virtualized with respect to the cgroup mount point.  Previous to this
      patch, /proc/self/cgroup shows the right info, but /proc/self/mountinfo
      does not.
      
      Long version:
      
      When a uid 0 task which is in freezer cgroup /a/b, unshares a new cgroup
      namespace, and then mounts a new instance of the freezer cgroup, the new
      mount will be rooted at /a/b.  The root dentry field of the mountinfo
      entry will show '/a/b'.
      
       cat > /tmp/do1 << EOF
       mount -t cgroup -o freezer freezer /mnt
       grep freezer /proc/self/mountinfo
       EOF
      
       unshare -Gm  bash /tmp/do1
       > 330 160 0:34 / /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,freezer
       > 355 133 0:34 /a/b /mnt rw,relatime - cgroup freezer rw,freezer
      
      The task's freezer cgroup entry in /proc/self/cgroup will simply show
      '/':
      
       grep freezer /proc/self/cgroup
       9:freezer:/
      
      If instead the same task simply bind mounts the /a/b cgroup directory,
      the resulting mountinfo entry will again show /a/b for the dentry root.
      However in this case the task will find its own cgroup at /mnt/a/b,
      not at /mnt:
      
       mount --bind /sys/fs/cgroup/freezer/a/b /mnt
       130 25 0:34 /a/b /mnt rw,nosuid,nodev,noexec,relatime shared:21 - cgroup cgroup rw,freezer
      
      In other words, there is no way for the task to know, based on what is
      in mountinfo, which cgroup directory is its own.
      
      Example (by mkerrisk):
      
      First, a little script to save some typing and verbiage:
      
      echo -e "\t/proc/self/cgroup:\t$(cat /proc/self/cgroup | grep freezer)"
      cat /proc/self/mountinfo | grep freezer |
              awk '{print "\tmountinfo:\t\t" $4 "\t" $5}'
      
      Create cgroup, place this shell into the cgroup, and look at the state
      of the /proc files:
      
      2653
      2653                         # Our shell
      14254                        # cat(1)
              /proc/self/cgroup:      10:freezer:/a/b
              mountinfo:              /       /sys/fs/cgroup/freezer
      
      Create a shell in new cgroup and mount namespaces. The act of creating
      a new cgroup namespace causes the process's current cgroups directories
      to become its cgroup root directories. (Here, I'm using my own version
      of the "unshare" utility, which takes the same options as the util-linux
      version):
      
      Look at the state of the /proc files:
      
              /proc/self/cgroup:      10:freezer:/
              mountinfo:              /       /sys/fs/cgroup/freezer
      
      The third entry in /proc/self/cgroup (the pathname of the cgroup inside
      the hierarchy) is correctly virtualized w.r.t. the cgroup namespace, which
      is rooted at /a/b in the outer namespace.
      
      However, the info in /proc/self/mountinfo is not for this cgroup
      namespace, since we are seeing a duplicate of the mount from the
      old mount namespace, and the info there does not correspond to the
      new cgroup namespace. However, trying to create a new mount still
      doesn't show us the right information in mountinfo:
      
                                            # propagating to other mountns
              /proc/self/cgroup:      7:freezer:/
              mountinfo:              /a/b    /mnt/freezer
      
      The act of creating a new cgroup namespace caused the process's
      current freezer directory, "/a/b", to become its cgroup freezer root
      directory. In other words, the pathname directory of the directory
      within the newly mounted cgroup filesystem should be "/",
      but mountinfo wrongly shows us "/a/b". The consequence of this is
      that the process in the cgroup namespace cannot correctly construct
      the pathname of its cgroup root directory from the information in
      /proc/PID/mountinfo.
      
      With this patch, the dentry root field in mountinfo is shown relative
      to the reader's cgroup namespace.  So the same steps as above:
      
              /proc/self/cgroup:      10:freezer:/a/b
              mountinfo:              /       /sys/fs/cgroup/freezer
              /proc/self/cgroup:      10:freezer:/
              mountinfo:              /../..  /sys/fs/cgroup/freezer
              /proc/self/cgroup:      10:freezer:/
              mountinfo:              /       /mnt/freezer
      
      cgroup.clone_children  freezer.parent_freezing  freezer.state      tasks
      cgroup.procs           freezer.self_freezing    notify_on_release
      3164
      2653                   # First shell that placed in this cgroup
      3164                   # Shell started by 'unshare'
      14197                  # cat(1)
      Signed-off-by: NSerge Hallyn <serge.hallyn@ubuntu.com>
      Tested-by: NMichael Kerrisk <mtk.manpages@gmail.com>
      Acked-by: NMichael Kerrisk <mtk.manpages@gmail.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      4f41fc59
  6. 08 5月, 2016 1 次提交
    • A
      get_rock_ridge_filename(): handle malformed NM entries · 99d82582
      Al Viro 提交于
      Payloads of NM entries are not supposed to contain NUL.  When we run
      into such, only the part prior to the first NUL goes into the
      concatenation (i.e. the directory entry name being encoded by a bunch
      of NM entries).  We do stop when the amount collected so far + the
      claimed amount in the current NM entry exceed 254.  So far, so good,
      but what we return as the total length is the sum of *claimed*
      sizes, not the actual amount collected.  And that can grow pretty
      large - not unlimited, since you'd need to put CE entries in
      between to be able to get more than the maximum that could be
      contained in one isofs directory entry / continuation chunk and
      we are stop once we'd encountered 32 CEs, but you can get about 8Kb
      easily.  And that's what will be passed to readdir callback as the
      name length.  8Kb __copy_to_user() from a buffer allocated by
      __get_free_page()
      
      Cc: stable@vger.kernel.org # 0.98pl6+ (yes, really)
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      99d82582
  7. 06 5月, 2016 1 次提交
    • M
      proc: prevent accessing /proc/<PID>/environ until it's ready · 8148a73c
      Mathias Krause 提交于
      If /proc/<PID>/environ gets read before the envp[] array is fully set up
      in create_{aout,elf,elf_fdpic,flat}_tables(), we might end up trying to
      read more bytes than are actually written, as env_start will already be
      set but env_end will still be zero, making the range calculation
      underflow, allowing to read beyond the end of what has been written.
      
      Fix this as it is done for /proc/<PID>/cmdline by testing env_end for
      zero.  It is, apparently, intentionally set last in create_*_tables().
      
      This bug was found by the PaX size_overflow plugin that detected the
      arithmetic underflow of 'this_len = env_end - (env_start + src)' when
      env_end is still zero.
      
      The expected consequence is that userland trying to access
      /proc/<PID>/environ of a not yet fully set up process may get
      inconsistent data as we're in the middle of copying in the environment
      variables.
      
      Fixes: https://forums.grsecurity.net/viewtopic.php?f=3&t=4363
      Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=116461Signed-off-by: NMathias Krause <minipli@googlemail.com>
      Cc: Emese Revfy <re.emese@gmail.com>
      Cc: Pax Team <pageexec@freemail.hu>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Mateusz Guzik <mguzik@redhat.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Jarod Wilson <jarod@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8148a73c
  8. 05 5月, 2016 2 次提交
    • E
      propogate_mnt: Handle the first propogated copy being a slave · 5ec0811d
      Eric W. Biederman 提交于
      When the first propgated copy was a slave the following oops would result:
      > BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
      > IP: [<ffffffff811fba4e>] propagate_one+0xbe/0x1c0
      > PGD bacd4067 PUD bac66067 PMD 0
      > Oops: 0000 [#1] SMP
      > Modules linked in:
      > CPU: 1 PID: 824 Comm: mount Not tainted 4.6.0-rc5userns+ #1523
      > Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007
      > task: ffff8800bb0a8000 ti: ffff8800bac3c000 task.ti: ffff8800bac3c000
      > RIP: 0010:[<ffffffff811fba4e>]  [<ffffffff811fba4e>] propagate_one+0xbe/0x1c0
      > RSP: 0018:ffff8800bac3fd38  EFLAGS: 00010283
      > RAX: 0000000000000000 RBX: ffff8800bb77ec00 RCX: 0000000000000010
      > RDX: 0000000000000000 RSI: ffff8800bb58c000 RDI: ffff8800bb58c480
      > RBP: ffff8800bac3fd48 R08: 0000000000000001 R09: 0000000000000000
      > R10: 0000000000001ca1 R11: 0000000000001c9d R12: 0000000000000000
      > R13: ffff8800ba713800 R14: ffff8800bac3fda0 R15: ffff8800bb77ec00
      > FS:  00007f3c0cd9b7e0(0000) GS:ffff8800bfb00000(0000) knlGS:0000000000000000
      > CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      > CR2: 0000000000000010 CR3: 00000000bb79d000 CR4: 00000000000006e0
      > Stack:
      >  ffff8800bb77ec00 0000000000000000 ffff8800bac3fd88 ffffffff811fbf85
      >  ffff8800bac3fd98 ffff8800bb77f080 ffff8800ba713800 ffff8800bb262b40
      >  0000000000000000 0000000000000000 ffff8800bac3fdd8 ffffffff811f1da0
      > Call Trace:
      >  [<ffffffff811fbf85>] propagate_mnt+0x105/0x140
      >  [<ffffffff811f1da0>] attach_recursive_mnt+0x120/0x1e0
      >  [<ffffffff811f1ec3>] graft_tree+0x63/0x70
      >  [<ffffffff811f1f6b>] do_add_mount+0x9b/0x100
      >  [<ffffffff811f2c1a>] do_mount+0x2aa/0xdf0
      >  [<ffffffff8117efbe>] ? strndup_user+0x4e/0x70
      >  [<ffffffff811f3a45>] SyS_mount+0x75/0xc0
      >  [<ffffffff8100242b>] do_syscall_64+0x4b/0xa0
      >  [<ffffffff81988f3c>] entry_SYSCALL64_slow_path+0x25/0x25
      > Code: 00 00 75 ec 48 89 0d 02 22 22 01 8b 89 10 01 00 00 48 89 05 fd 21 22 01 39 8e 10 01 00 00 0f 84 e0 00 00 00 48 8b 80 d8 00 00 00 <48> 8b 50 10 48 89 05 df 21 22 01 48 89 15 d0 21 22 01 8b 53 30
      > RIP  [<ffffffff811fba4e>] propagate_one+0xbe/0x1c0
      >  RSP <ffff8800bac3fd38>
      > CR2: 0000000000000010
      > ---[ end trace 2725ecd95164f217 ]---
      
      This oops happens with the namespace_sem held and can be triggered by
      non-root users.  An all around not pleasant experience.
      
      To avoid this scenario when finding the appropriate source mount to
      copy stop the walk up the mnt_master chain when the first source mount
      is encountered.
      
      Further rewrite the walk up the last_source mnt_master chain so that
      it is clear what is going on.
      
      The reason why the first source mount is special is that it it's
      mnt_parent is not a mount in the dest_mnt propagation tree, and as
      such termination conditions based up on the dest_mnt mount propgation
      tree do not make sense.
      
      To avoid other kinds of confusion last_dest is not changed when
      computing last_source.  last_dest is only used once in propagate_one
      and that is above the point of the code being modified, so changing
      the global variable is meaningless and confusing.
      
      Cc: stable@vger.kernel.org
      fixes: f2ebb3a9 ("smarter propagate_mnt()")
      Reported-by: NTycho Andersen <tycho.andersen@canonical.com>
      Reviewed-by: NSeth Forshee <seth.forshee@canonical.com>
      Tested-by: NSeth Forshee <seth.forshee@canonical.com>
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      5ec0811d
    • A
      ecryptfs: fix handling of directory opening · 6a480a78
      Al Viro 提交于
      First of all, trying to open them r/w is idiocy; it's guaranteed to fail.
      Moreover, assigning ->f_pos and assuming that everything will work is
      blatantly broken - try that with e.g. tmpfs as underlying layer and watch
      the fireworks.  There may be a non-trivial amount of state associated with
      current IO position, well beyond the numeric offset.  Using the single
      struct file associated with underlying inode is really not a good idea;
      we ought to open one for each ecryptfs directory struct file.
      
      Additionally, file_operations both for directories and non-directories are
      full of pointless methods; non-directories should *not* have ->iterate(),
      directories should not have ->flush(), ->fasync() and ->splice_read().
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      6a480a78
  9. 03 5月, 2016 1 次提交
    • S
      kernfs_path_from_node_locked: don't overwrite nlen · e99ed4de
      Serge Hallyn 提交于
      We've calculated @len to be the bytes we need for '/..' entries from
      @kn_from to the common ancestor, and calculated @nlen to be the extra
      bytes we need to get from the common ancestor to @kn_to.  We use them
      as such at the end.  But in the loop copying the actual entries, we
      overwrite @nlen.  Use a temporary variable for that instead.
      
      Without this, the return length, when the buffer is large enough, is
      wrong.  (When the buffer is NULL or too small, the returned value is
      correct. The buffer contents are also correct.)
      
      Interestingly, no callers of this function are affected by this as of
      yet.  However the upcoming cgroup_show_path() will be.
      Signed-off-by: NSerge Hallyn <serge.hallyn@ubuntu.com>
      e99ed4de
  10. 01 5月, 2016 1 次提交
    • A
      atomic_open(): fix the handling of create_error · 10c64cea
      Al Viro 提交于
      * if we have a hashed negative dentry and either CREAT|EXCL on
      r/o filesystem, or CREAT|TRUNC on r/o filesystem, or CREAT|EXCL
      with failing may_o_create(), we should fail with EROFS or the
      error may_o_create() has returned, but not ENOENT.  Which is what
      the current code ends up returning.
      
      * if we have CREAT|TRUNC hitting a regular file on a read-only
      filesystem, we can't fail with EROFS here.  At the very least,
      not until we'd done follow_managed() - we might have a writable
      file (or a device, for that matter) bound on top of that one.
      Moreover, the code downstream will see that O_TRUNC and attempt
      to grab the write access (*after* following possible mount), so
      if we really should fail with EROFS, it will happen.  No need
      to do that inside atomic_open().
      
      The real logics is much simpler than what the current code is
      trying to do - if we decided to go for simple lookup, ended
      up with a negative dentry *and* had create_error set, fail with
      create_error.  No matter whether we'd got that negative dentry
      from lookup_real() or had found it in dcache.
      
      Cc: stable@vger.kernel.org # v3.6+
      Acked-by: NMiklos Szeredi <mszeredi@redhat.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      10c64cea
  11. 29 4月, 2016 1 次提交