1. 07 1月, 2012 1 次提交
    • M
      vfs: protect remounting superblock read-only · 4ed5e82f
      Miklos Szeredi 提交于
      Currently remouting superblock read-only is racy in a major way.
      
      With the per mount read-only infrastructure it is now possible to
      prevent most races, which this patch attempts.
      
      Before starting the remount read-only, iterate through all mounts
      belonging to the superblock and if none of them have any pending
      writes, set sb->s_readonly_remount.  This indicates that remount is in
      progress and no further write requests are allowed.  If the remount
      succeeds set MS_RDONLY and reset s_readonly_remount.
      
      If the remounting is unsuccessful just reset s_readonly_remount.
      This can result in transient EROFS errors, despite the fact the
      remount failed.  Unfortunately hodling off writes is difficult as
      remount itself may touch the filesystem (e.g. through load_nls())
      which would deadlock.
      
      A later patch deals with delayed writes due to nlink going to zero.
      Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz>
      Tested-by: NToshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      4ed5e82f
  2. 04 1月, 2012 4 次提交
  3. 20 7月, 2011 2 次提交
    • D
      superblock: move pin_sb_for_writeback() to fs/super.c · 12ad3ab6
      Dave Chinner 提交于
      The per-sb shrinker has the same requirement as the writeback
      threads of ensuring that the superblock is usable and pinned for the
      time it takes to run the work. Both need to take a passive reference
      to the sb, take a read lock on the s_umount lock and then only
      continue if an unmount is not in progress.
      
      pin_sb_for_writeback() does this exactly, so move it to fs/super.c
      and rename it to grab_super_passive() and exporting it via
      fs/internal.h for all the VFS code to be able to use.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      12ad3ab6
    • A
      Make ->d_sb assign-once and always non-NULL · a4464dbc
      Al Viro 提交于
      New helper (non-exported, fs/internal.h-only): __d_alloc(sb, name).
      Allocates dentry, sets its ->d_sb to given superblock and sets
      ->d_op accordingly.  Old d_alloc(NULL, name) callers are converted
      to that (all of them know what superblock they want).  d_alloc()
      itself is left only for parent != NULl case; uses __d_alloc(),
      inserts result into the list of parent's children.
      
      Note that now ->d_sb is assign-once and never NULL *and*
      ->d_parent is never NULL either.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      a4464dbc
  4. 25 3月, 2011 2 次提交
  5. 22 3月, 2011 1 次提交
  6. 18 3月, 2011 1 次提交
    • A
      vfs: split off vfsmount-related parts of vfs_kern_mount() · 9d412a43
      Al Viro 提交于
      new function: mount_fs().  Does all work done by vfs_kern_mount()
      except the allocation and filling of vfsmount; returns root dentry
      or ERR_PTR().
      
      vfs_kern_mount() switched to using it and taken to fs/namespace.c,
      along with its wrappers.
      
      alloc_vfsmnt()/free_vfsmnt() made static.
      
      functions in namespace.c slightly reordered.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      9d412a43
  7. 15 3月, 2011 1 次提交
  8. 14 3月, 2011 2 次提交
    • A
      open-style analog of vfs_path_lookup() · 73d049a4
      Al Viro 提交于
      new function: file_open_root(dentry, mnt, name, flags) opens the file
      vfs_path_lookup would arrive to.
      
      Note that name can be empty; in that case the usual requirement that
      dentry should be a directory is lifted.
      
      open-coded equivalents switched to it, may_open() got down exactly
      one caller and became static.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      73d049a4
    • A
      switch do_filp_open() to struct open_flags · 47c805dc
      Al Viro 提交于
      take calculation of open_flags by open(2) arguments into new helper
      in fs/open.c, move filp_open() over there, have it and do_sys_open()
      use that helper, switch exec.c callers of do_filp_open() to explicit
      (and constant) struct open_flags.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      47c805dc
  9. 24 2月, 2011 1 次提交
    • N
      Fix over-zealous flush_disk when changing device size. · 93b270f7
      NeilBrown 提交于
      There are two cases when we call flush_disk.
      In one, the device has disappeared (check_disk_change) so any
      data will hold becomes irrelevant.
      In the oter, the device has changed size (check_disk_size_change)
      so data we hold may be irrelevant.
      
      In both cases it makes sense to discard any 'clean' buffers,
      so they will be read back from the device if needed.
      
      In the former case it makes sense to discard 'dirty' buffers
      as there will never be anywhere safe to write the data.  In the
      second case it *does*not* make sense to discard dirty buffers
      as that will lead to file system corruption when you simply enlarge
      the containing devices.
      
      flush_disk calls __invalidate_devices.
      __invalidate_device calls both invalidate_inodes and invalidate_bdev.
      
      invalidate_inodes *does* discard I_DIRTY inodes and this does lead
      to fs corruption.
      
      invalidate_bev *does*not* discard dirty pages, but I don't really care
      about that at present.
      
      So this patch adds a flag to __invalidate_device (calling it
      __invalidate_device2) to indicate whether dirty buffers should be
      killed, and this is passed to invalidate_inodes which can choose to
      skip dirty inodes.
      
      flusk_disk then passes true from check_disk_change and false from
      check_disk_size_change.
      
      dm avoids tripping over this problem by calling i_size_write directly
      rathher than using check_disk_size_change.
      
      md does use check_disk_size_change and so is affected.
      
      This regression was introduced by commit 608aeef1 which causes
      check_disk_size_change to call flush_disk, so it is suitable for any
      kernel since 2.6.27.
      
      Cc: stable@kernel.org
      Acked-by: NJeff Moyer <jmoyer@redhat.com>
      Cc: Andrew Patterson <andrew.patterson@hp.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      93b270f7
  10. 17 1月, 2011 3 次提交
    • A
      tidy up around finish_automount() · b1e75df4
      Al Viro 提交于
      do_add_mount() and mnt_clear_expiry() are not needed outside of
      namespace.c anymore, now that namei has finish_automount() to
      use.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      b1e75df4
    • A
      Take the completion of automount into new helper · 19a167af
      Al Viro 提交于
      ... and shift it from namei.c to namespace.c
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      19a167af
    • A
      sanitize vfsmount refcounting changes · f03c6599
      Al Viro 提交于
      Instead of splitting refcount between (per-cpu) mnt_count
      and (SMP-only) mnt_longrefs, make all references contribute
      to mnt_count again and keep track of how many are longterm
      ones.
      
      Accounting rules for longterm count:
      	* 1 for each fs_struct.root.mnt
      	* 1 for each fs_struct.pwd.mnt
      	* 1 for having non-NULL ->mnt_ns
      	* decrement to 0 happens only under vfsmount lock exclusive
      
      That allows nice common case for mntput() - since we can't drop the
      final reference until after mnt_longterm has reached 0 due to the rules
      above, mntput() can grab vfsmount lock shared and check mnt_longterm.
      If it turns out to be non-zero (which is the common case), we know
      that this is not the final mntput() and can just blindly decrement
      percpu mnt_count.  Otherwise we grab vfsmount lock exclusive and
      do usual decrement-and-check of percpu mnt_count.
      
      For fs_struct.c we have mnt_make_longterm() and mnt_make_shortterm();
      namespace.c uses the latter in places where we don't already hold
      vfsmount lock exclusive and opencodes a few remaining spots where
      we need to manipulate mnt_longterm.
      
      Note that we mostly revert the code outside of fs/namespace.c back
      to what we used to have; in particular, normal code doesn't need
      to care about two kinds of references, etc.  And we get to keep
      the optimization Nick's variant had bought us...
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      f03c6599
  11. 16 1月, 2011 1 次提交
    • D
      Unexport do_add_mount() and add in follow_automount(), not ->d_automount() · ea5b778a
      David Howells 提交于
      Unexport do_add_mount() and make ->d_automount() return the vfsmount to be
      added rather than calling do_add_mount() itself.  follow_automount() will then
      do the addition.
      
      This slightly complicates things as ->d_automount() normally wants to add the
      new vfsmount to an expiration list and start an expiration timer.  The problem
      with that is that the vfsmount will be deleted if it has a refcount of 1 and
      the timer will not repeat if the expiration list is empty.
      
      To this end, we require the vfsmount to be returned from d_automount() with a
      refcount of (at least) 2.  One of these refs will be dropped unconditionally.
      In addition, follow_automount() must get a 3rd ref around the call to
      do_add_mount() lest it eat a ref and return an error, leaving the mount we
      have open to being expired as we would otherwise have only 1 ref on it.
      
      d_automount() should also add the the vfsmount to the expiration list (by
      calling mnt_set_expiry()) and start the expiration timer before returning, if
      this mechanism is to be used.  The vfsmount will be unlinked from the
      expiration list by follow_automount() if do_add_mount() fails.
      
      This patch also fixes the call to do_add_mount() for AFS to propagate the mount
      flags from the parent vfsmount.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      ea5b778a
  12. 07 1月, 2011 1 次提交
    • N
      fs: scale mntget/mntput · b3e19d92
      Nick Piggin 提交于
      The problem that this patch aims to fix is vfsmount refcounting scalability.
      We need to take a reference on the vfsmount for every successful path lookup,
      which often go to the same mount point.
      
      The fundamental difficulty is that a "simple" reference count can never be made
      scalable, because any time a reference is dropped, we must check whether that
      was the last reference. To do that requires communication with all other CPUs
      that may have taken a reference count.
      
      We can make refcounts more scalable in a couple of ways, involving keeping
      distributed counters, and checking for the global-zero condition less
      frequently.
      
      - check the global sum once every interval (this will delay zero detection
        for some interval, so it's probably a showstopper for vfsmounts).
      
      - keep a local count and only taking the global sum when local reaches 0 (this
        is difficult for vfsmounts, because we can't hold preempt off for the life of
        a reference, so a counter would need to be per-thread or tied strongly to a
        particular CPU which requires more locking).
      
      - keep a local difference of increments and decrements, which allows us to sum
        the total difference and hence find the refcount when summing all CPUs. Then,
        keep a single integer "long" refcount for slow and long lasting references,
        and only take the global sum of local counters when the long refcount is 0.
      
      This last scheme is what I implemented here. Attached mounts and process root
      and working directory references are "long" references, and everything else is
      a short reference.
      
      This allows scalable vfsmount references during path walking over mounted
      subtrees and unattached (lazy umounted) mounts with processes still running
      in them.
      
      This results in one fewer atomic op in the fastpath: mntget is now just a
      per-CPU inc, rather than an atomic inc; and mntput just requires a spinlock
      and non-atomic decrement in the common case. However code is otherwise bigger
      and heavier, so single threaded performance is basically a wash.
      Signed-off-by: NNick Piggin <npiggin@kernel.dk>
      b3e19d92
  13. 29 10月, 2010 1 次提交
  14. 26 10月, 2010 3 次提交
  15. 18 8月, 2010 2 次提交
    • N
      fs: brlock vfsmount_lock · 99b7db7b
      Nick Piggin 提交于
      fs: brlock vfsmount_lock
      
      Use a brlock for the vfsmount lock. It must be taken for write whenever
      modifying the mount hash or associated fields, and may be taken for read when
      performing mount hash lookups.
      
      A new lock is added for the mnt-id allocator, so it doesn't need to take
      the heavy vfsmount write-lock.
      
      The number of atomics should remain the same for fastpath rlock cases, though
      code would be slightly slower due to per-cpu access. Scalability is not not be
      much improved in common cases yet, due to other locks (ie. dcache_lock) getting
      in the way. However path lookups crossing mountpoints should be one case where
      scalability is improved (currently requiring the global lock).
      
      The slowpath is slower due to use of brlock. On a 64 core, 64 socket, 32 node
      Altix system (high latency to remote nodes), a simple umount microbenchmark
      (mount --bind mnt mnt2 ; umount mnt2 loop 1000 times), before this patch it
      took 6.8s, afterwards took 7.1s, about 5% slower.
      
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Signed-off-by: NNick Piggin <npiggin@kernel.dk>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      99b7db7b
    • N
      tty: fix fu_list abuse · d996b62a
      Nick Piggin 提交于
      tty: fix fu_list abuse
      
      tty code abuses fu_list, which causes a bug in remount,ro handling.
      
      If a tty device node is opened on a filesystem, then the last link to the inode
      removed, the filesystem will be allowed to be remounted readonly. This is
      because fs_may_remount_ro does not find the 0 link tty inode on the file sb
      list (because the tty code incorrectly removed it to use for its own purpose).
      This can result in a filesystem with errors after it is marked "clean".
      
      Taking idea from Christoph's initial patch, allocate a tty private struct
      at file->private_data and put our required list fields in there, linking
      file and tty. This makes tty nodes behave the same way as other device nodes
      and avoid meddling with the vfs, and avoids this bug.
      
      The error handling is not trivial in the tty code, so for this bugfix, I take
      the simple approach of using __GFP_NOFAIL and don't worry about memory errors.
      This is not a problem because our allocator doesn't fail small allocs as a rule
      anyway. So proper error handling is left as an exercise for tty hackers.
      
      [ Arguably filesystem's device inode would ideally be divorced from the
      driver's pseudo inode when it is opened, but in practice it's not clear whether
      that will ever be worth implementing. ]
      
      Cc: linux-kernel@vger.kernel.org
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
      Cc: Greg Kroah-Hartman <gregkh@suse.de>
      Signed-off-by: NNick Piggin <npiggin@kernel.dk>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      d996b62a
  16. 22 5月, 2010 1 次提交
  17. 04 3月, 2010 1 次提交
  18. 23 12月, 2009 1 次提交
  19. 17 12月, 2009 1 次提交
  20. 24 9月, 2009 1 次提交
    • V
      fs: fix overflow in sys_mount() for in-kernel calls · eca6f534
      Vegard Nossum 提交于
      sys_mount() reads/copies a whole page for its "type" parameter.  When
      do_mount_root() passes a kernel address that points to an object which is
      smaller than a whole page, copy_mount_options() will happily go past this
      memory object, possibly dereferencing "wild" pointers that could be in any
      state (hence the kmemcheck warning, which shows that parts of the next
      page are not even allocated).
      
      (The likelihood of something going wrong here is pretty low -- first of
      all this only applies to kernel calls to sys_mount(), which are mostly
      found in the boot code.  Secondly, I guess if the page was not mapped,
      exact_copy_from_user() _would_ in fact handle it correctly because of its
      access_ok(), etc.  checks.)
      
      But it is much nicer to avoid the dubious reads altogether, by stopping as
      soon as we find a NUL byte.  Is there a good reason why we can't do
      something like this, using the already existing strndup_from_user()?
      
      [akpm@linux-foundation.org: make copy_mount_string() static]
      [AV: fix compat mount breakage, which involves undoing akpm's change above]
      Reported-by: NIngo Molnar <mingo@elte.hu>
      Signed-off-by: NVegard Nossum <vegard.nossum@gmail.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Nal <al@dizzy.pdmi.ras.ru>
      eca6f534
  21. 12 6月, 2009 4 次提交
    • A
      Trim a bit of crap from fs.h · 62c6943b
      Al Viro 提交于
      do_remount_sb() is fs/internal.h fodder, fsync_no_super() is long gone.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      62c6943b
    • J
      vfs: Make sys_sync() use fsync_super() (version 4) · 5cee5815
      Jan Kara 提交于
      It is unnecessarily fragile to have two places (fsync_super() and do_sync())
      doing data integrity sync of the filesystem. Alter __fsync_super() to
      accommodate needs of both callers and use it. So after this patch
      __fsync_super() is the only place where we gather all the calls needed to
      properly send all data on a filesystem to disk.
      
      Nice bonus is that we get a complete livelock avoidance and write_supers()
      is now only used for periodic writeback of superblocks.
      
      sync_blockdevs() introduced a couple of patches ago is gone now.
      
      [build fixes folded]
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      5cee5815
    • J
      vfs: Fix sys_sync() and fsync_super() reliability (version 4) · 5a3e5cb8
      Jan Kara 提交于
      So far, do_sync() called:
        sync_inodes(0);
        sync_supers();
        sync_filesystems(0);
        sync_filesystems(1);
        sync_inodes(1);
      
      This ordering makes it kind of hard for filesystems as sync_inodes(0) need not
      submit all the IO (for example it skips inodes with I_SYNC set) so e.g. forcing
      transaction to disk in ->sync_fs() is not really enough. Therefore sys_sync has
      not been completely reliable on some filesystems (ext3, ext4, reiserfs, ocfs2
      and others are hit by this) when racing e.g. with background writeback. A
      similar problem hits also other filesystems (e.g. ext2) because of
      write_supers() being called before the sync_inodes(1).
      
      Change the ordering of calls in do_sync() - this requires a new function
      sync_blockdevs() to preserve the property that block devices are always synced
      after write_super() / sync_fs() call.
      
      The same issue is fixed in __fsync_super() function used on umount /
      remount read-only.
      
      [AV: build fixes]
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      5a3e5cb8
    • N
      fs: move mark_files_ro into file_table.c · 864d7c4c
      npiggin@suse.de 提交于
      This function walks the s_files lock, and operates primarily on the
      files in a superblock, so it better belongs here (eg. see also
      fs_may_remount_ro).
      
      [AV: ... and it shouldn't be static after that move]
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      864d7c4c
  22. 01 4月, 2009 2 次提交
    • A
      New locking/refcounting for fs_struct · 498052bb
      Al Viro 提交于
      * all changes of current->fs are done under task_lock and write_lock of
        old fs->lock
      * refcount is not atomic anymore (same protection)
      * its decrements are done when removing reference from current; at the
        same time we decide whether to free it.
      * put_fs_struct() is gone
      * new field - ->in_exec.  Set by check_unsafe_exec() if we are trying to do
        execve() and only subthreads share fs_struct.  Cleared when finishing exec
        (success and failure alike).  Makes CLONE_FS fail with -EAGAIN if set.
      * check_unsafe_exec() may fail with -EAGAIN if another execve() from subthread
        is in progress.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      498052bb
    • A
      Take fs_struct handling to new file (fs/fs_struct.c) · 3e93cd67
      Al Viro 提交于
      Pure code move; two new helper functions for nfsd and daemonize
      (unshare_fs_struct() and daemonize_fs_struct() resp.; for now -
      the same code as used to be in callers).  unshare_fs_struct()
      exported (for nfsd, as copy_fs_struct()/exit_fs() used to be),
      copy_fs_struct() and exit_fs() don't need exports anymore.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      3e93cd67
  23. 29 3月, 2009 1 次提交
    • H
      fix setuid sometimes doesn't · e426b64c
      Hugh Dickins 提交于
      Joe Malicki reports that setuid sometimes doesn't: very rarely,
      a setuid root program does not get root euid; and, by the way,
      they have a health check running lsof every few minutes.
      
      Right, check_unsafe_exec() notes whether the files_struct is being
      shared by more threads than will get killed by the exec, and if so
      sets LSM_UNSAFE_SHARE to make bprm_set_creds() careful about euid.
      But /proc/<pid>/fd and /proc/<pid>/fdinfo lookups make transient
      use of get_files_struct(), which also raises that sharing count.
      
      There's a rather simple fix for this: exec's check on files->count
      has been redundant ever since 2.6.1 made it unshare_files() (except
      while compat_do_execve() omitted to do so) - just remove that check.
      
      [Note to -stable: this patch will not apply before 2.6.29: earlier
      releases should just remove the files->count line from unsafe_exec().]
      Reported-by: NJoe Malicki <jmalicki@metacarta.com>
      Narrowed-down-by: NMichael Itz <mitz@metacarta.com>
      Tested-by: NJoe Malicki <jmalicki@metacarta.com>
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Cc: stable@kernel.org
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e426b64c
  24. 07 2月, 2009 1 次提交
    • D
      CRED: Fix SUID exec regression · 0bf2f3ae
      David Howells 提交于
      The patch:
      
      	commit a6f76f23
      	CRED: Make execve() take advantage of copy-on-write credentials
      
      moved the place in which the 'safeness' of a SUID/SGID exec was performed to
      before de_thread() was called.  This means that LSM_UNSAFE_SHARE is now
      calculated incorrectly.  This flag is set if any of the usage counts for
      fs_struct, files_struct and sighand_struct are greater than 1 at the time the
      determination is made.  All of which are true for threads created by the
      pthread library.
      
      However, since we wish to make the security calculation before irrevocably
      damaging the process so that we can return it an error code in the case where
      we decide we want to reject the exec request on this basis, we have to make the
      determination before calling de_thread().
      
      So, instead, we count up the number of threads (CLONE_THREAD) that are sharing
      our fs_struct (CLONE_FS), files_struct (CLONE_FILES) and sighand_structs
      (CLONE_SIGHAND/CLONE_THREAD) with us.  These will be killed by de_thread() and
      so can be discounted by check_unsafe_exec().
      
      We do have to be careful because CLONE_THREAD does not imply FS or FILES.
      
      We _assume_ that there will be no extra references to these structs held by the
      threads we're going to kill.
      
      This can be tested with the attached pair of programs.  Build the two programs
      using the Makefile supplied, and run ./test1 as a non-root user.  If
      successful, you should see something like:
      
      	[dhowells@andromeda tmp]$ ./test1
      	--TEST1--
      	uid=4043, euid=4043 suid=4043
      	exec ./test2
      	--TEST2--
      	uid=4043, euid=0 suid=0
      	SUCCESS - Correct effective user ID
      
      and if unsuccessful, something like:
      
      	[dhowells@andromeda tmp]$ ./test1
      	--TEST1--
      	uid=4043, euid=4043 suid=4043
      	exec ./test2
      	--TEST2--
      	uid=4043, euid=4043 suid=4043
      	ERROR - Incorrect effective user ID!
      
      The non-root user ID you see will depend on the user you run as.
      
      [test1.c]
      #include <stdio.h>
      #include <stdlib.h>
      #include <unistd.h>
      #include <pthread.h>
      
      static void *thread_func(void *arg)
      {
      	while (1) {}
      }
      
      int main(int argc, char **argv)
      {
      	pthread_t tid;
      	uid_t uid, euid, suid;
      
      	printf("--TEST1--\n");
      	getresuid(&uid, &euid, &suid);
      	printf("uid=%d, euid=%d suid=%d\n", uid, euid, suid);
      
      	if (pthread_create(&tid, NULL, thread_func, NULL) < 0) {
      		perror("pthread_create");
      		exit(1);
      	}
      
      	printf("exec ./test2\n");
      	execlp("./test2", "test2", NULL);
      	perror("./test2");
      	_exit(1);
      }
      
      [test2.c]
      #include <stdio.h>
      #include <stdlib.h>
      #include <unistd.h>
      
      int main(int argc, char **argv)
      {
      	uid_t uid, euid, suid;
      
      	getresuid(&uid, &euid, &suid);
      	printf("--TEST2--\n");
      	printf("uid=%d, euid=%d suid=%d\n", uid, euid, suid);
      
      	if (euid != 0) {
      		fprintf(stderr, "ERROR - Incorrect effective user ID!\n");
      		exit(1);
      	}
      	printf("SUCCESS - Correct effective user ID\n");
      	exit(0);
      }
      
      [Makefile]
      CFLAGS = -D_GNU_SOURCE -Wall -Werror -Wunused
      all: test1 test2
      
      test1: test1.c
      	gcc $(CFLAGS) -o test1 test1.c -lpthread
      
      test2: test2.c
      	gcc $(CFLAGS) -o test2 test2.c
      	sudo chown root.root test2
      	sudo chmod +s test2
      Reported-by: NDavid Smith <dsmith@redhat.com>
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Acked-by: NDavid Smith <dsmith@redhat.com>
      Signed-off-by: NJames Morris <jmorris@namei.org>
      0bf2f3ae
  25. 14 11月, 2008 1 次提交
    • D
      CRED: Make execve() take advantage of copy-on-write credentials · a6f76f23
      David Howells 提交于
      Make execve() take advantage of copy-on-write credentials, allowing it to set
      up the credentials in advance, and then commit the whole lot after the point
      of no return.
      
      This patch and the preceding patches have been tested with the LTP SELinux
      testsuite.
      
      This patch makes several logical sets of alteration:
      
       (1) execve().
      
           The credential bits from struct linux_binprm are, for the most part,
           replaced with a single credentials pointer (bprm->cred).  This means that
           all the creds can be calculated in advance and then applied at the point
           of no return with no possibility of failure.
      
           I would like to replace bprm->cap_effective with:
      
      	cap_isclear(bprm->cap_effective)
      
           but this seems impossible due to special behaviour for processes of pid 1
           (they always retain their parent's capability masks where normally they'd
           be changed - see cap_bprm_set_creds()).
      
           The following sequence of events now happens:
      
           (a) At the start of do_execve, the current task's cred_exec_mutex is
           	 locked to prevent PTRACE_ATTACH from obsoleting the calculation of
           	 creds that we make.
      
           (a) prepare_exec_creds() is then called to make a copy of the current
           	 task's credentials and prepare it.  This copy is then assigned to
           	 bprm->cred.
      
        	 This renders security_bprm_alloc() and security_bprm_free()
           	 unnecessary, and so they've been removed.
      
           (b) The determination of unsafe execution is now performed immediately
           	 after (a) rather than later on in the code.  The result is stored in
           	 bprm->unsafe for future reference.
      
           (c) prepare_binprm() is called, possibly multiple times.
      
           	 (i) This applies the result of set[ug]id binaries to the new creds
           	     attached to bprm->cred.  Personality bit clearance is recorded,
           	     but now deferred on the basis that the exec procedure may yet
           	     fail.
      
               (ii) This then calls the new security_bprm_set_creds().  This should
      	     calculate the new LSM and capability credentials into *bprm->cred.
      
      	     This folds together security_bprm_set() and parts of
      	     security_bprm_apply_creds() (these two have been removed).
      	     Anything that might fail must be done at this point.
      
               (iii) bprm->cred_prepared is set to 1.
      
      	     bprm->cred_prepared is 0 on the first pass of the security
      	     calculations, and 1 on all subsequent passes.  This allows SELinux
      	     in (ii) to base its calculations only on the initial script and
      	     not on the interpreter.
      
           (d) flush_old_exec() is called to commit the task to execution.  This
           	 performs the following steps with regard to credentials:
      
      	 (i) Clear pdeath_signal and set dumpable on certain circumstances that
      	     may not be covered by commit_creds().
      
               (ii) Clear any bits in current->personality that were deferred from
                   (c.i).
      
           (e) install_exec_creds() [compute_creds() as was] is called to install the
           	 new credentials.  This performs the following steps with regard to
           	 credentials:
      
               (i) Calls security_bprm_committing_creds() to apply any security
                   requirements, such as flushing unauthorised files in SELinux, that
                   must be done before the credentials are changed.
      
      	     This is made up of bits of security_bprm_apply_creds() and
      	     security_bprm_post_apply_creds(), both of which have been removed.
      	     This function is not allowed to fail; anything that might fail
      	     must have been done in (c.ii).
      
               (ii) Calls commit_creds() to apply the new credentials in a single
                   assignment (more or less).  Possibly pdeath_signal and dumpable
                   should be part of struct creds.
      
      	 (iii) Unlocks the task's cred_replace_mutex, thus allowing
      	     PTRACE_ATTACH to take place.
      
               (iv) Clears The bprm->cred pointer as the credentials it was holding
                   are now immutable.
      
               (v) Calls security_bprm_committed_creds() to apply any security
                   alterations that must be done after the creds have been changed.
                   SELinux uses this to flush signals and signal handlers.
      
           (f) If an error occurs before (d.i), bprm_free() will call abort_creds()
           	 to destroy the proposed new credentials and will then unlock
           	 cred_replace_mutex.  No changes to the credentials will have been
           	 made.
      
       (2) LSM interface.
      
           A number of functions have been changed, added or removed:
      
           (*) security_bprm_alloc(), ->bprm_alloc_security()
           (*) security_bprm_free(), ->bprm_free_security()
      
           	 Removed in favour of preparing new credentials and modifying those.
      
           (*) security_bprm_apply_creds(), ->bprm_apply_creds()
           (*) security_bprm_post_apply_creds(), ->bprm_post_apply_creds()
      
           	 Removed; split between security_bprm_set_creds(),
           	 security_bprm_committing_creds() and security_bprm_committed_creds().
      
           (*) security_bprm_set(), ->bprm_set_security()
      
           	 Removed; folded into security_bprm_set_creds().
      
           (*) security_bprm_set_creds(), ->bprm_set_creds()
      
           	 New.  The new credentials in bprm->creds should be checked and set up
           	 as appropriate.  bprm->cred_prepared is 0 on the first call, 1 on the
           	 second and subsequent calls.
      
           (*) security_bprm_committing_creds(), ->bprm_committing_creds()
           (*) security_bprm_committed_creds(), ->bprm_committed_creds()
      
           	 New.  Apply the security effects of the new credentials.  This
           	 includes closing unauthorised files in SELinux.  This function may not
           	 fail.  When the former is called, the creds haven't yet been applied
           	 to the process; when the latter is called, they have.
      
       	 The former may access bprm->cred, the latter may not.
      
       (3) SELinux.
      
           SELinux has a number of changes, in addition to those to support the LSM
           interface changes mentioned above:
      
           (a) The bprm_security_struct struct has been removed in favour of using
           	 the credentials-under-construction approach.
      
           (c) flush_unauthorized_files() now takes a cred pointer and passes it on
           	 to inode_has_perm(), file_has_perm() and dentry_open().
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Acked-by: NJames Morris <jmorris@namei.org>
      Acked-by: NSerge Hallyn <serue@us.ibm.com>
      Signed-off-by: NJames Morris <jmorris@namei.org>
      a6f76f23