1. 01 Jul, 2015 8 commits
    • E
      mnt: Update fs_fully_visible to test for permanently empty directories · 7236c85e
      Eric W. Biederman committed
      fs_fully_visible attempts to make fresh mounts of proc and sysfs give
      the mounter no more access to proc and sysfs than they could have
      obtained by creating a bind mount.  One aspect of proc and sysfs that
      makes this particularly tricky is that there are other filesystems that
      typically mount on top of proc and sysfs.  As those filesystems are
      mounted on empty directories in practice, it is safe to ignore them.
      However, testing to ensure filesystems are mounted on empty directories
      has not been something the in-kernel data structures have supported, so
      the current test for an empty directory, which checks to see
      if nlink <= 2, is a bit lacking.
      
      proc and sysfs have recently been modified to use the new empty_dir
      infrastructure to create all of their dedicated mount points.  Instead
      of testing for S_ISDIR(inode->i_mode) && i_nlink <= 2 to see if a
      directory is empty, test for is_empty_dir_inode(inode).  That small
      change guarantees mounts found on proc and sysfs really are safe to
      ignore, because the directories are not only empty but nothing can
      ever be added to them.  This guarantees there is nothing to worry
      about when mounting proc and sysfs.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      7236c85e
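The difference between the two emptiness tests described in this commit can be illustrated with a small userspace sketch. It is not kernel code: `struct mock_inode` and both helper names are hypothetical stand-ins, and the `permanently_empty` field only models whatever state `is_empty_dir_inode` inspects. The sketch assumes the usual convention that a directory's link count is 2 plus its number of subdirectories, so regular files do not raise it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct inode; only the
 * fields this sketch needs. */
struct mock_inode {
    bool is_dir;            /* models S_ISDIR(inode->i_mode) */
    unsigned int nlink;     /* models inode->i_nlink */
    bool permanently_empty; /* models a directory that can never gain entries */
};

/* Old heuristic: nlink <= 2 only shows there are no subdirectories.
 * The directory may still hold files, and entries can be added later. */
static bool old_looks_empty(const struct mock_inode *inode)
{
    assert(inode != NULL);
    return inode->is_dir && inode->nlink <= 2;
}

/* New test, modeling is_empty_dir_inode(): true only for directories
 * that were created as permanently empty. */
static bool new_is_permanently_empty(const struct mock_inode *inode)
{
    assert(inode != NULL);
    return inode->is_dir && inode->permanently_empty;
}
```

A directory that contains only regular files passes the old heuristic but fails the new test, which is the gap the commit message calls "a bit lacking".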
    • E
      sysfs: Create mountpoints with sysfs_create_mount_point · f9bb4882
      Eric W. Biederman committed
      This allows for better documentation in the code and
      it allows for a simpler and fully correct version of
      fs_fully_visible to be written.
      
      The mount points converted and their filesystems are:
      /sys/hypervisor/s390/       s390_hypfs
      /sys/kernel/config/         configfs
      /sys/kernel/debug/          debugfs
      /sys/firmware/efi/efivars/  efivarfs
      /sys/fs/fuse/connections/   fusectl
      /sys/fs/pstore/             pstore
      /sys/kernel/tracing/        tracefs
      /sys/fs/cgroup/             cgroup
      /sys/kernel/security/       securityfs
      /sys/fs/selinux/            selinuxfs
      /sys/fs/smackfs/            smackfs
      
      Cc: stable@vger.kernel.org
      Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      f9bb4882
    • E
      sysfs: Add support for permanently empty directories to serve as mount points. · 87d2846f
      Eric W. Biederman committed
      Add two functions, sysfs_create_mount_point and
      sysfs_remove_mount_point, that hang a permanently empty directory off
      of a kobject or remove a permanently empty directory hanging from a
      kobject.  Export these new functions so modular filesystems can use
      them.
      
      Cc: stable@vger.kernel.org
      Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      87d2846f
    • E
      kernfs: Add support for always empty directories. · ea015218
      Eric W. Biederman committed
      Add a new function kernfs_create_empty_dir that can be used to create
      a directory that cannot be modified.
      
      Update the code to use make_empty_dir_inode when reporting a
      permanently empty directory to the vfs.
      
      Update the code to not allow adding to permanently empty directories.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      ea015218
    • E
      proc: Allow creating permanently empty directories that serve as mount points · eb6d38d5
      Eric W. Biederman committed
      Add a new function proc_create_mount_point that creates a
      directory that cannot be added to.
      
      Add a new function is_empty_pde to test if a proc directory entry is a
      mount point.
      
      Update the code to use make_empty_dir_inode when reporting
      a permanently empty directory to the vfs.
      
      Update the code to not allow adding to permanently empty directories.
      
      Update /proc/openprom and /proc/fs/nfsd to be permanently empty directories.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      eb6d38d5
    • E
      sysctl: Allow creating permanently empty directories that serve as mountpoints. · f9bd6733
      Eric W. Biederman committed
      Add a magic sysctl table sysctl_mount_point that, when used to
      create a directory, forces that directory to be permanently empty.
      
      Update the code to use make_empty_dir_inode when accessing permanently
      empty directories.
      
      Update the code to not allow adding to permanently empty directories.
      
      Update /proc/sys/fs/binfmt_misc to be a permanently empty directory.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      f9bd6733
    • E
      fs: Add helper functions for permanently empty directories. · fbabfd0f
      Eric W. Biederman committed
      To ensure it is safe to mount proc and sysfs I need to check if
      filesystems that are mounted on top of them are mounted on truly empty
      directories.  Given that some directories can gain entries over time,
      knowing that a directory is empty right now is insufficient.
      
      Therefore add supporting infrastructure for permanently empty
      directories that proc and sysfs can use when they create mount points
      for filesystems, and that fs_fully_visible can use to test for
      permanently empty directories to ensure that nothing will be gained by
      mounting a fresh copy of proc or sysfs.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      fbabfd0f
    • E
      vfs: Ignore unlocked mounts in fs_fully_visible · ceeb0e5d
      Eric W. Biederman committed
      Limit the mounts fs_fully_visible considers to locked mounts.
      Unlocked mounts can always be unmounted, so considering them adds hassle
      but no security benefit.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      ceeb0e5d
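The filtering rule in this commit can be sketched in userspace C as follows. This is an illustration under assumptions, not the kernel implementation: `struct mock_child_mount` and `mock_fully_visible` are hypothetical names, and the two booleans stand in for the child mount's lock state and for whatever emptiness test is applied to its mountpoint directory.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a filesystem mounted on top of proc or sysfs. */
struct mock_child_mount {
    bool locked;           /* models a child mount that cannot be unmounted */
    bool covers_empty_dir; /* its mountpoint directory is considered empty */
};

/* A parent mount counts as fully visible when no locked child hides
 * a non-empty directory. Unlocked children are skipped, since the
 * caller could simply unmount them. */
static bool mock_fully_visible(const struct mock_child_mount *children,
                               size_t n)
{
    assert(children != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (!children[i].locked)
            continue;               /* only worry about locked mounts */
        if (!children[i].covers_empty_dir)
            return false;           /* something is permanently hidden */
    }
    return true;
}
```

An unlocked child covering a non-empty directory no longer blocks the check, while a locked one still does.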
  2. 04 Jun, 2015 1 commit
    • E
      mnt: Modify fs_fully_visible to deal with locked ro nodev and atime · 8c6cf9cc
      Eric W. Biederman committed
      Ignore an existing mount if its locked readonly, nodev or atime
      attributes are less permissive than the desired attributes
      of the new mount.
      
      On success ensure the new mount locks all of the same readonly, nodev and
      atime attributes as the old mount.
      
      The nosuid and noexec attributes are not checked here, as this change
      is destined for stable and enforcing those attributes causes a
      regression in lxc and libvirt-lxc where those applications will not
      start.  There are no known executables on sysfs or proc and no known
      way to create executables without code modifications.
      
      Cc: stable@vger.kernel.org
      Fixes: e51db735 ("userns: Better restrictions on when proc and sysfs can be mounted")
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      8c6cf9cc
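The two rules in this commit (skip an existing mount whose locked attributes are stricter than the new mount requests, and carry the locks over on success) can be sketched as below. The flag names and values are hypothetical and deliberately prefixed `MOCK_`; they are not the kernel's `MNT_*` constants. Only readonly and nodev are modeled, and the atime comparison is left out to keep the sketch short.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag bits for this sketch only. */
enum {
    MOCK_READONLY      = 1u << 0,
    MOCK_NODEV         = 1u << 1,
    MOCK_LOCK_READONLY = 1u << 2, /* readonly may not be cleared */
    MOCK_LOCK_NODEV    = 1u << 3, /* nodev may not be cleared */
};

/* Returns true if the existing mount (old_flags) may justify the new
 * mount. On success the locked attributes are copied into *new_flags
 * so the new mount cannot be more permissive than the old one. */
static bool mock_check_and_lock(unsigned int old_flags,
                                unsigned int *new_flags)
{
    assert(new_flags != NULL);
    if ((old_flags & MOCK_LOCK_READONLY) && !(*new_flags & MOCK_READONLY))
        return false; /* old mount is locked read-only, new one is not */
    if ((old_flags & MOCK_LOCK_NODEV) && !(*new_flags & MOCK_NODEV))
        return false; /* old mount is locked nodev, new one is not */

    /* Preserve the locked attributes on the new mount. */
    *new_flags |= old_flags & (MOCK_LOCK_READONLY | MOCK_LOCK_NODEV);
    return true;
}
```

In this model a request for a read-write mount is rejected when the only visible mount is locked read-only, and an accepted request inherits the lock bits.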
  3. 14 May, 2015 1 commit
    • E
      mnt: Refactor the logic for mounting sysfs and proc in a user namespace · 1b852bce
      Eric W. Biederman committed
      Fresh mounts of proc and sysfs are a very special case that works very
      much like a bind mount.  Unfortunately, the current structure cannot
      preserve the MNT_LOCK... mount flags.  Therefore refactor the logic
      into a form that can be modified to preserve those lock bits.
      
      Add a new filesystem flag FS_USERNS_VISIBLE that requires some mount
      of the filesystem be fully visible in the current mount namespace,
      before the filesystem may be mounted.
      
      Move the logic for calling fs_fully_visible from proc and sysfs into
      fs/namespace.c where it has greater access to mount namespace state.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      1b852bce
  4. 10 May, 2015 1 commit
  5. 25 Apr, 2015 4 commits
    • A
      RCU pathwalk breakage when running into a symlink overmounting something · 3cab989a
      Al Viro committed
      Calling unlazy_walk() in walk_component() and do_last() when we find
      a symlink that needs to be followed doesn't acquire a reference to vfsmount.
      That's fine when the symlink is on the same vfsmount as the parent directory
      (which is almost always the case), but it's not always true - one _can_
      manage to bind a symlink on top of something.  And in such cases we end up
      with excessive mntput().
      
      Cc: stable@vger.kernel.org # since 2.6.39
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      3cab989a
    • J
      direct-io: only inc/dec inode->i_dio_count for file systems · fe0f07d0
      Jens Axboe committed
      do_blockdev_direct_IO() increments and decrements the inode
      ->i_dio_count for each IO operation. It does this to protect against
      truncate of a file. Block devices don't need this sort of protection.
      
      For a capable multiqueue setup, this atomic int is the only shared
      state between applications accessing the device for O_DIRECT, and it
      presents a scaling wall for that. In my testing, as much as 30% of
      system time is spent incrementing and decrementing this value. A mixed
      read/write workload improved from ~2.5M IOPS to ~9.6M IOPS, with
      better latencies too. Before:
      
      clat percentiles (usec):
       |  1.00th=[   33],  5.00th=[   34], 10.00th=[   34], 20.00th=[   34],
       | 30.00th=[   34], 40.00th=[   34], 50.00th=[   35], 60.00th=[   35],
       | 70.00th=[   35], 80.00th=[   35], 90.00th=[   37], 95.00th=[   80],
       | 99.00th=[   98], 99.50th=[  151], 99.90th=[  155], 99.95th=[  155],
       | 99.99th=[  165]
      
      After:
      
      clat percentiles (usec):
       |  1.00th=[   95],  5.00th=[  108], 10.00th=[  129], 20.00th=[  149],
       | 30.00th=[  155], 40.00th=[  161], 50.00th=[  167], 60.00th=[  171],
       | 70.00th=[  177], 80.00th=[  185], 90.00th=[  201], 95.00th=[  270],
       | 99.00th=[  390], 99.50th=[  398], 99.90th=[  418], 99.95th=[  422],
       | 99.99th=[  438]
      
      In other setups, Robert Elliott reported seeing good performance
      improvements:
      
      https://lkml.org/lkml/2015/4/3/557
      
      The more applications accessing the device, the worse it gets.
      
      Add a new direct-io flag, DIO_SKIP_DIO_COUNT, which tells
      do_blockdev_direct_IO() that it need not worry about incrementing
      or decrementing the inode i_dio_count for this caller.
      
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Elliott, Robert (Server Storage) <elliott@hp.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      fe0f07d0
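The effect of the new flag can be sketched in userspace C with C11 atomics. All names here (`mock_dio_inode`, `mock_direct_io`, `MOCK_DIO_SKIP_DIO_COUNT`) are hypothetical; the sketch only models the idea that callers passing the flag never touch the shared counter, and the `counter_updates` field exists purely so the sketch can show how many shared-counter updates occurred.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical flag bit; models DIO_SKIP_DIO_COUNT. */
enum { MOCK_DIO_SKIP_DIO_COUNT = 1 << 0 };

struct mock_dio_inode {
    atomic_int dio_count; /* models inode->i_dio_count */
    int counter_updates;  /* sketch-only bookkeeping, not thread-safe */
};

/* Models one direct I/O operation. Without the flag, the shared
 * counter is bumped before the I/O and dropped afterwards, which is
 * the protection against truncate that files need. With the flag,
 * the counter is left alone. */
static void mock_direct_io(struct mock_dio_inode *inode, int flags)
{
    assert(inode != NULL);
    int count_io = !(flags & MOCK_DIO_SKIP_DIO_COUNT);

    if (count_io) {
        atomic_fetch_add(&inode->dio_count, 1);
        inode->counter_updates++;
    }
    /* ... the I/O itself would be submitted and completed here ... */
    if (count_io) {
        atomic_fetch_sub(&inode->dio_count, 1);
        inode->counter_updates++;
    }
}
```

Each unflagged operation costs two updates of the shared atomic, which is the contention the commit message describes; a flagged operation costs none.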
    • J
      fs/9p: fix readdir() · 8e3c5005
      Johannes Berg committed
      Al Viro's IOV changes broke 9p readdir() because the new code
      didn't abort the read when it returned nothing. The original
      code checked if the combined error/length was <= 0 but in the
      new code that accidentally got changed to just an error check.
      
      Add back the return from the function when nothing is read.
      
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Fixes: e1200fe6 ("9p: switch p9_client_read() to passing struct iov_iter *")
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      8e3c5005
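The control-flow point of this fix can be shown with a small userspace sketch. The reader callback, `mock_readdir`, and `mock_chunk_reader` are all hypothetical; the sketch only assumes a read primitive that reports an error separately from the number of bytes returned, as the commit message describes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical read primitive: returns bytes read, sets *err on failure. */
typedef int (*mock_read_fn)(void *ctx, int *err);

/* Reads chunks until an error occurs or nothing more is returned. */
static int mock_readdir(mock_read_fn read_chunk, void *ctx, int *total)
{
    assert(read_chunk != NULL && total != NULL);
    *total = 0;
    for (;;) {
        int err = 0;
        int n = read_chunk(ctx, &err);

        if (err)
            return err; /* the error check alone is not enough */
        if (n == 0)
            return 0;   /* the restored check: stop when nothing is read */
        *total += n;
    }
}

/* Example reader: hands out *remaining chunks of 16 bytes, then 0. */
static int mock_chunk_reader(void *ctx, int *err)
{
    int *remaining = ctx;

    *err = 0;
    if (*remaining == 0)
        return 0;
    (*remaining)--;
    return 16;
}
```

Without the `n == 0` test, this loop would keep calling the reader after end of directory, since an empty read is not an error.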
    • C
      Btrfs: prevent list corruption during free space cache processing · a3bdccc4
      Chris Mason committed
      __btrfs_write_out_cache is holding the ctl->tree_lock while it prepares
      a list of bitmaps to record in the free space cache.  It was dropping
      the lock while it worked on other components, which made a window for
      free_bitmap() to free the bitmap struct without removing it from the
      list.
      
      This changes things to hold the lock the whole time, and also makes sure
      we hold the lock during enospc cleanup.
      
      Reported-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      a3bdccc4
  6. 24 Apr, 2015 15 commits
  7. 22 Apr, 2015 9 commits
  8. 20 Apr, 2015 1 commit