1. 10 4月, 2013 3 次提交
    • A
      mnt: release locks on error path in do_loopback · e9c5d8a5
      Andrey Vagin 提交于
      do_loopback calls lock_mount(path) and forget to unlock_mount
      if clone_mnt or copy_mnt fails.
      
      [   77.661566] ================================================
      [   77.662939] [ BUG: lock held when returning to user space! ]
      [   77.664104] 3.9.0-rc5+ #17 Not tainted
      [   77.664982] ------------------------------------------------
      [   77.666488] mount/514 is leaving the kernel with locks still held!
      [   77.668027] 2 locks held by mount/514:
      [   77.668817]  #0:  (&sb->s_type->i_mutex_key#7){+.+.+.}, at: [<ffffffff811cca22>] lock_mount+0x32/0xe0
      [   77.671755]  #1:  (&namespace_sem){+++++.}, at: [<ffffffff811cca3a>] lock_mount+0x4a/0xe0
      Signed-off-by: NAndrey Vagin <avagin@openvz.org>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      e9c5d8a5
    • A
      procfs: add proc_remove_subtree() · 8ce584c7
      Al Viro 提交于
      just what it sounds like; do that only to procfs subtrees you've
      created - doing that to something shared with another driver is
      not only antisocial, but might cause interesting races with
      proc_create() and its ilk.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      8ce584c7
    • A
      ecryptfs: close rmmod race · 52f21999
      Al Viro 提交于
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      52f21999
  2. 06 4月, 2013 1 次提交
    • B
      GFS2: Issue discards in 512b sectors · b2c87cae
      Bob Peterson 提交于
      This patch changes GFS2's discard issuing code so that it calls
      function sb_issue_discard rather than blkdev_issue_discard. The
      code was calling blkdev_issue_discard and specifying the correct
      sector offset and sector size, but blkdev_issue_discard expects
      these values to be in terms of 512 byte sectors, even if the native
      sector size for the device is different. Calling sb_issue_discard
      with the BLOCK size instead ensures the correct block-to-512b-sector
      translation. I verified that "minlen" is specified in blocks, so
      comparing it to a number of blocks is correct.
      Signed-off-by: NBob Peterson <rpeterso@redhat.com>
      Signed-off-by: NSteven Whitehouse <swhiteho@redhat.com>
      b2c87cae
  3. 04 4月, 2013 5 次提交
  4. 02 4月, 2013 1 次提交
    • A
      loop: prevent bdev freeing while device in use · c1681bf8
      Anatol Pomozov 提交于
      struct block_device lifecycle is defined by its inode (see fs/block_dev.c) -
      block_device allocated first time we access /dev/loopXX and deallocated on
      bdev_destroy_inode. When we create the device "losetup /dev/loopXX afile"
      we want that block_device stay alive until we destroy the loop device
      with "losetup -d".
      
      But because we do not hold /dev/loopXX inode its counter goes 0, and
      inode/bdev can be destroyed at any moment. Usually it happens at memory
      pressure or when user drops inode cache (like in the test below). When later in
      loop_clr_fd() we want to use bdev we have use-after-free error with following
      stack:
      
      BUG: unable to handle kernel NULL pointer dereference at 0000000000000280
        bd_set_size+0x10/0xa0
        loop_clr_fd+0x1f8/0x420 [loop]
        lo_ioctl+0x200/0x7e0 [loop]
        lo_compat_ioctl+0x47/0xe0 [loop]
        compat_blkdev_ioctl+0x341/0x1290
        do_filp_open+0x42/0xa0
        compat_sys_ioctl+0xc1/0xf20
        do_sys_open+0x16e/0x1d0
        sysenter_dispatch+0x7/0x1a
      
      To prevent use-after-free we need to grab the device in loop_set_fd()
      and put it later in loop_clr_fd().
      
      The issue is reprodusible on current Linus head and v3.3. Here is the test:
      
        dd if=/dev/zero of=loop.file bs=1M count=1
        while [ true ]; do
          losetup /dev/loop0 loop.file
          echo 2 > /proc/sys/vm/drop_caches
          losetup -d /dev/loop0
        done
      
      [ Doing bdgrab/bput in loop_set_fd/loop_clr_fd is safe, because every
        time we call loop_set_fd() we check that loop_device->lo_state is
        Lo_unbound and set it to Lo_bound If somebody will try to set_fd again
        it will get EBUSY.  And if we try to loop_clr_fd() on unbound loop
        device we'll get ENXIO.
      
        loop_set_fd/loop_clr_fd (and any other loop ioctl) is called under
        loop_device->lo_ctl_mutex. ]
      Signed-off-by: NAnatol Pomozov <anatol.pomozov@gmail.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c1681bf8
  5. 30 3月, 2013 1 次提交
  6. 29 3月, 2013 1 次提交
  7. 28 3月, 2013 9 次提交
  8. 27 3月, 2013 7 次提交
    • E
      userns: Restrict when proc and sysfs can be mounted · 87a8ebd6
      Eric W. Biederman 提交于
      Only allow unprivileged mounts of proc and sysfs if they are already
      mounted when the user namespace is created.
      
      proc and sysfs are interesting because they have content that is
      per namespace, and so fresh mounts are needed when new namespaces
      are created while at the same time proc and sysfs have content that
      is shared between every instance.
      
      Respect the policy of who may see the shared content of proc and sysfs
      by only allowing new mounts if there was an existing mount at the time
      the user namespace was created.
      
      In practice there are only two interesting cases: proc and sysfs are
      mounted at their usual places, proc and sysfs are not mounted at all
      (some form of mount namespace jail).
      
      Cc: stable@vger.kernel.org
      Acked-by: NSerge Hallyn <serge.hallyn@canonical.com>
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      87a8ebd6
    • E
      vfs: Carefully propogate mounts across user namespaces · 132c94e3
      Eric W. Biederman 提交于
      As a matter of policy MNT_READONLY should not be changable if the
      original mounter had more privileges than creator of the mount
      namespace.
      
      Add the flag CL_UNPRIVILEGED to note when we are copying a mount from
      a mount namespace that requires more privileges to a mount namespace
      that requires fewer privileges.
      
      When the CL_UNPRIVILEGED flag is set cause clone_mnt to set MNT_NO_REMOUNT
      if any of the mnt flags that should never be changed are set.
      
      This protects both mount propagation and the initial creation of a less
      privileged mount namespace.
      
      Cc: stable@vger.kernel.org
      Acked-by: NSerge Hallyn <serge.hallyn@canonical.com>
      Reported-by: NAndy Lutomirski <luto@amacapital.net>
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      132c94e3
    • E
      vfs: Add a mount flag to lock read only bind mounts · 90563b19
      Eric W. Biederman 提交于
      When a read-only bind mount is copied from mount namespace in a higher
      privileged user namespace to a mount namespace in a lesser privileged
      user namespace, it should not be possible to remove the the read-only
      restriction.
      
      Add a MNT_LOCK_READONLY mount flag to indicate that a mount must
      remain read-only.
      
      CC: stable@vger.kernel.org
      Acked-by: NSerge Hallyn <serge.hallyn@canonical.com>
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      90563b19
    • E
      userns: Don't allow creation if the user is chrooted · 3151527e
      Eric W. Biederman 提交于
      Guarantee that the policy of which files may be access that is
      established by setting the root directory will not be violated
      by user namespaces by verifying that the root directory points
      to the root of the mount namespace at the time of user namespace
      creation.
      
      Changing the root is a privileged operation, and as a matter of policy
      it serves to limit unprivileged processes to files below the current
      root directory.
      
      For reasons of simplicity and comprehensibility the privilege to
      change the root directory is gated solely on the CAP_SYS_CHROOT
      capability in the user namespace.  Therefore when creating a user
      namespace we must ensure that the policy of which files may be access
      can not be violated by changing the root directory.
      
      Anyone who runs a processes in a chroot and would like to use user
      namespace can setup the same view of filesystems with a mount
      namespace instead.  With this result that this is not a practical
      limitation for using user namespaces.
      
      Cc: stable@vger.kernel.org
      Acked-by: NSerge Hallyn <serge.hallyn@canonical.com>
      Reported-by: NAndy Lutomirski <luto@amacapital.net>
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      3151527e
    • A
      Nest rename_lock inside vfsmount_lock · 7ea600b5
      Al Viro 提交于
      ... lest we get livelocks between path_is_under() and d_path() and friends.
      
      The thing is, wrt fairness lglocks are more similar to rwsems than to rwlocks;
      it is possible to have thread B spin on attempt to take lock shared while thread
      A is already holding it shared, if B is on lower-numbered CPU than A and there's
      a thread C spinning on attempt to take the same lock exclusive.
      
      As the result, we need consistent ordering between vfsmount_lock (lglock) and
      rename_lock (seq_lock), even though everything that takes both is going to take
      vfsmount_lock only shared.
      Spotted-by: NBrad Spengler <spender@grsecurity.net>
      Cc: stable@vger.kernel.org
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      7ea600b5
    • J
      nfsd4: reject "negative" acl lengths · 64a817cf
      J. Bruce Fields 提交于
      Since we only enforce an upper bound, not a lower bound, a "negative"
      length can get through here.
      
      The symptom seen was a warning when we attempt to a kmalloc with an
      excessive size.
      Reported-by: NToralf Förster <toralf.foerster@gmx.de>
      Cc: stable@kernel.org
      Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>
      64a817cf
    • C
      Btrfs: fix race between mmap writes and compression · 4adaa611
      Chris Mason 提交于
      Btrfs uses page_mkwrite to ensure stable pages during
      crc calculations and mmap workloads.  We call clear_page_dirty_for_io
      before we do any crcs, and this forces any application with the file
      mapped to wait for the crc to finish before it is allowed to change
      the file.
      
      With compression on, the clear_page_dirty_for_io step is happening after
      we've compressed the pages.  This means the applications might be
      changing the pages while we are compressing them, and some of those
      modifications might not hit the disk.
      
      This commit adds the clear_page_dirty_for_io before compression starts
      and makes sure to redirty the page if we have to fallback to
      uncompressed IO as well.
      Signed-off-by: NChris Mason <chris.mason@fusionio.com>
      Reported-by: NAlexandre Oliva <oliva@gnu.org>
      cc: stable@vger.kernel.org
      4adaa611
  9. 23 3月, 2013 2 次提交
    • K
      nfsd: fix bad offset use · e49dbbf3
      Kent Overstreet 提交于
      vfs_writev() updates the offset argument - but the code then passes the
      offset to vfs_fsync_range(). Since offset now points to the offset after
      what was just written, this is probably not what was intended
      
      Introduced by face1502 "nfsd: use
      vfs_fsync_range(), not O_SYNC, for stable writes".
      Signed-off-by: NKent Overstreet <koverstreet@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: stable@vger.kernel.org
      Reviewed-by: NZach Brown <zab@redhat.com>
      Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>
      e49dbbf3
    • L
      vfs,proc: guarantee unique inodes in /proc · 51f0885e
      Linus Torvalds 提交于
      Dave Jones found another /proc issue with his Trinity tool: thanks to
      the namespace model, we can have multiple /proc dentries that point to
      the same inode, aliasing directories in /proc/<pid>/net/ for example.
      
      This ends up being a total disaster, because it acts like hardlinked
      directories, and causes locking problems.  We rely on the topological
      sort of the inodes pointed to by dentries, and if we have aliased
      directories, that odering becomes unreliable.
      
      In short: don't do this.  Multiple dentries with the same (directory)
      inode is just a bad idea, and the namespace code should never have
      exposed things this way.  But we're kind of stuck with it.
      
      This solves things by just always allocating a new inode during /proc
      dentry lookup, instead of using "iget_locked()" to look up existing
      inodes by superblock and number.  That actually simplies the code a bit,
      at the cost of potentially doing more inode [de]allocations.
      
      That said, the inode lookup wasn't free either (and did a lot of locking
      of inodes), so it is probably not that noticeable.  We could easily keep
      the old lookup model for non-directory entries, but rather than try to
      be excessively clever this just implements the minimal and simplest
      workaround for the problem.
      Reported-and-tested-by: NDave Jones <davej@redhat.com>
      Analyzed-by: NAl Viro <viro@zeniv.linux.org.uk>
      Cc: stable@vger.kernel.org
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      51f0885e
  10. 22 3月, 2013 7 次提交
  11. 21 3月, 2013 3 次提交