1. 06 7月, 2016 3 次提交
    • S
      fs: Update i_[ug]id_(read|write) to translate relative to s_user_ns · 81754357
      Seth Forshee 提交于
      For filesystems mounted from a user namespace on-disk ids should
      be translated relative to s_users_ns rather than init_user_ns.
      
      When an id in the filesystem doesn't exist in s_user_ns the
      associated id in the inode will be set to INVALID_[UG]ID, which
      turns these into de facto "nobody" ids. This actually maps pretty
      well into the way most code already works, and those places where
      it didn't were fixed in previous patches. Moving forward vfs code
      needs to be careful to handle instances where ids in inodes may
      be invalid.
      Signed-off-by: NSeth Forshee <seth.forshee@canonical.com>
      Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>
      81754357
    • E
      quota: Ensure qids map to the filesystem · d49d3762
      Eric W. Biederman 提交于
      Introduce the helper qid_has_mapping and use it to ensure that the
      quota system only considers qids that map to the filesystems
      s_user_ns.
      
      In practice for quota supporting filesystems today this is the exact
      same check as qid_valid.  As only 0xffffffff aka (qid_t)-1 does not
      map into init_user_ns.
      
      Replace the qid_valid calls with qid_has_mapping as values come in
      from userspace.  This is harmless today and it prepares the quota
      system to work on filesystems with quotas but mounted by unprivileged
      users.
      
      Call qid_has_mapping from dqget.  This ensures the passed in qid has a
      prepresentation on the underlying filesystem.  Previously this was
      unnecessary as filesystesm never had qids that could not map.  With
      the introduction of filesystems outside of s_user_ns this will not
      remain true.
      
      All of this ensures the quota code never has to deal with qids that
      don't map to the underlying filesystem.
      
      Cc: Jan Kara <jack@suse.cz>
      Acked-by: NSeth Forshee <seth.forshee@canonical.com>
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      d49d3762
    • E
      vfs: Don't modify inodes with a uid or gid unknown to the vfs · 0bd23d09
      Eric W. Biederman 提交于
      When a filesystem outside of init_user_ns is mounted it could have
      uids and gids stored in it that do not map to init_user_ns.
      
      The plan is to allow those filesystems to set i_uid to INVALID_UID and
      i_gid to INVALID_GID for unmapped uids and gids and then to handle
      that strange case in the vfs to ensure there is consistent robust
      handling of the weirdness.
      
      Upon a careful review of the vfs and filesystems about the only case
      where there is any possibility of confusion or trouble is when the
      inode is written back to disk.  In that case filesystems typically
      read the inode->i_uid and inode->i_gid and write them to disk even
      when just an inode timestamp is being updated.
      
      Which leads to a rule that is very simple to implement and understand
      inodes whose i_uid or i_gid is not valid may not be written.
      
      In dealing with access times this means treat those inodes as if the
      inode flag S_NOATIME was set.  Reads of the inodes appear safe and
      useful, but any write or modification is disallowed.  The only inode
      write that is allowed is a chown that sets the uid and gid on the
      inode to valid values.  After such a chown the inode is normal and may
      be treated as such.
      
      Denying all writes to inodes with uids or gids unknown to the vfs also
      prevents several oddball cases where corruption would have occurred
      because the vfs does not have complete information.
      
      One problem case that is prevented is attempting to use the gid of a
      directory for new inodes where the directories sgid bit is set but the
      directories gid is not mapped.
      
      Another problem case avoided is attempting to update the evm hash
      after setxattr, removexattr, and setattr.  As the evm hash includeds
      the inode->i_uid or inode->i_gid not knowning the uid or gid prevents
      a correct evm hash from being computed.  evm hash verification also
      fails when i_uid or i_gid is unknown but that is essentially harmless
      as it does not cause filesystem corruption.
      Acked-by: NSeth Forshee <seth.forshee@canonical.com>
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      0bd23d09
  2. 01 7月, 2016 2 次提交
  3. 24 6月, 2016 7 次提交
    • A
      fs: Treat foreign mounts as nosuid · 380cf5ba
      Andy Lutomirski 提交于
      If a process gets access to a mount from a different user
      namespace, that process should not be able to take advantage of
      setuid files or selinux entrypoints from that filesystem.  Prevent
      this by treating mounts from other mount namespaces and those not
      owned by current_user_ns() or an ancestor as nosuid.
      
      This will make it safer to allow more complex filesystems to be
      mounted in non-root user namespaces.
      
      This does not remove the need for MNT_LOCK_NOSUID.  The setuid,
      setgid, and file capability bits can no longer be abused if code in
      a user namespace were to clear nosuid on an untrusted filesystem,
      but this patch, by itself, is insufficient to protect the system
      from abuse of files that, when execed, would increase MAC privilege.
      
      As a more concrete explanation, any task that can manipulate a
      vfsmount associated with a given user namespace already has
      capabilities in that namespace and all of its descendents.  If they
      can cause a malicious setuid, setgid, or file-caps executable to
      appear in that mount, then that executable will only allow them to
      elevate privileges in exactly the set of namespaces in which they
      are already privileges.
      
      On the other hand, if they can cause a malicious executable to
      appear with a dangerous MAC label, running it could change the
      caller's security context in a way that should not have been
      possible, even inside the namespace in which the task is confined.
      
      As a hardening measure, this would have made CVE-2014-5207 much
      more difficult to exploit.
      Signed-off-by: NAndy Lutomirski <luto@amacapital.net>
      Signed-off-by: NSeth Forshee <seth.forshee@canonical.com>
      Acked-by: NJames Morris <james.l.morris@oracle.com>
      Acked-by: NSerge Hallyn <serge.hallyn@canonical.com>
      Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>
      380cf5ba
    • S
      fs: Limit file caps to the user namespace of the super block · d07b846f
      Seth Forshee 提交于
      Capability sets attached to files must be ignored except in the
      user namespaces where the mounter is privileged, i.e. s_user_ns
      and its descendants. Otherwise a vector exists for gaining
      privileges in namespaces where a user is not already privileged.
      
      Add a new helper function, current_in_user_ns(), to test whether a user
      namespace is the same as or a descendant of another namespace.
      Use this helper to determine whether a file's capability set
      should be applied to the caps constructed during exec.
      
      --EWB Replaced in_userns with the simpler current_in_userns.
      Acked-by: NSerge Hallyn <serge.hallyn@canonical.com>
      Signed-off-by: NSeth Forshee <seth.forshee@canonical.com>
      Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>
      d07b846f
    • E
      userns: Remove the now unnecessary FS_USERNS_DEV_MOUNT flag · cc50a07a
      Eric W. Biederman 提交于
      Now that SB_I_NODEV controls the nodev behavior devpts can just clear
      this flag during mount.  Simplifying the code and making it easier
      to audit how the code works.  While still preserving the invariant
      that s_iflags is only modified during mount.
      Acked-by: NSeth Forshee <seth.forshee@canonical.com>
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      cc50a07a
    • E
      vfs: Generalize filesystem nodev handling. · a2982cc9
      Eric W. Biederman 提交于
      Introduce a function may_open_dev that tests MNT_NODEV and a new
      superblock flab SB_I_NODEV.  Use this new function in all of the
      places where MNT_NODEV was previously tested.
      
      Add the new SB_I_NODEV s_iflag to proc, sysfs, and mqueuefs as those
      filesystems should never support device nodes, and a simple superblock
      flags makes that very hard to get wrong.  With SB_I_NODEV set if any
      device nodes somehow manage to show up on on a filesystem those
      device nodes will be unopenable.
      Acked-by: NSeth Forshee <seth.forshee@canonical.com>
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      a2982cc9
    • E
      fs: Add user namespace member to struct super_block · 6e4eab57
      Eric W. Biederman 提交于
      Start marking filesystems with a user namespace owner, s_user_ns.  In
      this change this is only used for permission checks of who may mount a
      filesystem.  Ultimately s_user_ns will be used for translating ids and
      checking capabilities for filesystems mounted from user namespaces.
      
      The default policy for setting s_user_ns is implemented in sget(),
      which arranges for s_user_ns to be set to current_user_ns() and to
      ensure that the mounter of the filesystem has CAP_SYS_ADMIN in that
      user_ns.
      
      The guts of sget are split out into another function sget_userns().
      The function sget_userns calls alloc_super with the specified user
      namespace or it verifies the existing superblock that was found
      has the expected user namespace, and fails with EBUSY when it is not.
      This failing prevents users with the wrong privileges mounting a
      filesystem.
      
      The reason for the split of sget_userns from sget is that in some
      cases such as mount_ns and kernfs_mount_ns a different policy for
      permission checking of mounts and setting s_user_ns is necessary, and
      the existence of sget_userns() allows those policies to be
      implemented.
      
      The helper mount_ns is expected to be used for filesystems such as
      proc and mqueuefs which present per namespace information.  The
      function mount_ns is modified to call sget_userns instead of sget to
      ensure the user namespace owner of the namespace whose information is
      presented by the filesystem is used on the superblock.
      
      For sysfs and cgroup the appropriate permission checks are already in
      place, and kernfs_mount_ns is modified to call sget_userns so that
      the init_user_ns is the only user namespace used.
      
      For the cgroup filesystem cgroup namespace mounts are bind mounts of a
      subset of the full cgroup filesystem and as such s_user_ns must be the
      same for all of them as there is only a single superblock.
      
      Mounts of sysfs that vary based on the network namespace could in principle
      change s_user_ns but it keeps the analysis and implementation of kernfs
      simpler if that is not supported, and at present there appear to be no
      benefits from supporting a different s_user_ns on any sysfs mount.
      
      Getting the details of setting s_user_ns correct has been
      a long process.  Thanks to Pavel Tikhorirorv who spotted a leak
      in sget_userns.  Thanks to Seth Forshee who has kept the work alive.
      
      Thanks-to: Seth Forshee <seth.forshee@canonical.com>
      Thanks-to: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
      Acked-by: NSeth Forshee <seth.forshee@canonical.com>
      Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>
      6e4eab57
    • E
      vfs: Pass data, ns, and ns->userns to mount_ns · d91ee87d
      Eric W. Biederman 提交于
      Today what is normally called data (the mount options) is not passed
      to fill_super through mount_ns.
      
      Pass the mount options and the namespace separately to mount_ns so
      that filesystems such as proc that have mount options, can use
      mount_ns.
      
      Pass the user namespace to mount_ns so that the standard permission
      check that verifies the mounter has permissions over the namespace can
      be performed in mount_ns instead of in each filesystems .mount method.
      Thus removing the duplication between mqueuefs and proc in terms of
      permission checks.  The extra permission check does not currently
      affect the rpc_pipefs filesystem and the nfsd filesystem as those
      filesystems do not currently allow unprivileged mounts.  Without
      unpvileged mounts it is guaranteed that the caller has already passed
      capable(CAP_SYS_ADMIN) which guarantees extra permission check will
      pass.
      
      Update rpc_pipefs and the nfsd filesystem to ensure that the network
      namespace reference is always taken in fill_super and always put in kill_sb
      so that the logic is simpler and so that errors originating inside of
      fill_super do not cause a network namespace leak.
      Acked-by: NSeth Forshee <seth.forshee@canonical.com>
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      d91ee87d
    • E
      mnt: Refactor fs_fully_visible into mount_too_revealing · 8654df4e
      Eric W. Biederman 提交于
      Replace the call of fs_fully_visible in do_new_mount from before the
      new superblock is allocated with a call of mount_too_revealing after
      the superblock is allocated.   This winds up being a much better location
      for maintainability of the code.
      
      The first change this enables is the replacement of FS_USERNS_VISIBLE
      with SB_I_USERNS_VISIBLE.  Moving the flag from struct filesystem_type
      to sb_iflags on the superblock.
      
      Unfortunately mount_too_revealing fundamentally needs to touch
      mnt_flags adding several MNT_LOCKED_XXX flags at the appropriate
      times.  If the mnt_flags did not need to be touched the code
      could be easily moved into the filesystem specific mount code.
      Acked-by: NSeth Forshee <seth.forshee@canonical.com>
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      8654df4e
  4. 06 6月, 2016 1 次提交
    • E
      devpts: Make each mount of devpts an independent filesystem. · eedf265a
      Eric W. Biederman 提交于
      The /dev/ptmx device node is changed to lookup the directory entry "pts"
      in the same directory as the /dev/ptmx device node was opened in.  If
      there is a "pts" entry and that entry is a devpts filesystem /dev/ptmx
      uses that filesystem.  Otherwise the open of /dev/ptmx fails.
      
      The DEVPTS_MULTIPLE_INSTANCES configuration option is removed, so that
      userspace can now safely depend on each mount of devpts creating a new
      instance of the filesystem.
      
      Each mount of devpts is now a separate and equal filesystem.
      
      Reserved ttys are now available to all instances of devpts where the
      mounter is in the initial mount namespace.
      
      A new vfs helper path_pts is introduced that finds a directory entry
      named "pts" in the directory of the passed in path, and changes the
      passed in path to point to it.  The helper path_pts uses a function
      path_parent_directory that was factored out of follow_dotdot.
      
      In the implementation of devpts:
       - devpts_mnt is killed as it is no longer meaningful if all mounts of
         devpts are equal.
       - pts_sb_from_inode is replaced by just inode->i_sb as all cached
         inodes in the tty layer are now from the devpts filesystem.
       - devpts_add_ref is rolled into the new function devpts_ptmx.  And the
         unnecessary inode hold is removed.
       - devpts_del_ref is renamed devpts_release and reduced to just a
         deacrivate_super.
       - The newinstance mount option continues to be accepted but is now
         ignored.
      
      In devpts_fs.h definitions for when !CONFIG_UNIX98_PTYS are removed as
      they are never used.
      
      Documentation/filesystems/devices.txt is updated to describe the current
      situation.
      
      This has been verified to work properly on openwrt-15.05, centos5,
      centos6, centos7, debian-6.0.2, debian-7.9, debian-8.2, ubuntu-14.04.3,
      ubuntu-15.10, fedora23, magia-5, mint-17.3, opensuse-42.1,
      slackware-14.1, gentoo-20151225 (13.0?), archlinux-2015-12-01.  With the
      caveat that on centos6 and on slackware-14.1 that there wind up being
      two instances of the devpts filesystem mounted on /dev/pts, the lower
      copy does not end up getting used.
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Greg KH <greg@kroah.com>
      Cc: Peter Hurley <peter@hurleysoftware.com>
      Cc: Peter Anvin <hpa@zytor.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Serge Hallyn <serge.hallyn@ubuntu.com>
      Cc: Willy Tarreau <w@1wt.eu>
      Cc: Aurelien Jarno <aurelien@aurel32.net>
      Cc: One Thousand Gnomes <gnomes@lxorguk.ukuu.org.uk>
      Cc: Jann Horn <jann@thejh.net>
      Cc: Jiri Slaby <jslaby@suse.com>
      Cc: Florian Weimer <fw@deneb.enyo.de>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      eedf265a
  5. 04 6月, 2016 1 次提交
  6. 03 6月, 2016 2 次提交
  7. 02 6月, 2016 1 次提交
  8. 01 6月, 2016 5 次提交
  9. 31 5月, 2016 1 次提交
  10. 29 5月, 2016 6 次提交
    • G
      <linux/hash.h>: Add support for architecture-specific functions · 468a9428
      George Spelvin 提交于
      This is just the infrastructure; there are no users yet.
      
      This is modelled on CONFIG_ARCH_RANDOM; a CONFIG_ symbol declares
      the existence of <asm/hash.h>.
      
      That file may define its own versions of various functions, and define
      HAVE_* symbols (no CONFIG_ prefix!) to suppress the generic ones.
      
      Included is a self-test (in lib/test_hash.c) that verifies the basics.
      It is NOT in general required that the arch-specific functions compute
      the same thing as the generic, but if a HAVE_* symbol is defined with
      the value 1, then equality is tested.
      Signed-off-by: NGeorge Spelvin <linux@sciencehorizons.net>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greg Ungerer <gerg@linux-m68k.org>
      Cc: Andreas Schwab <schwab@linux-m68k.org>
      Cc: Philippe De Muyter <phdm@macq.eu>
      Cc: linux-m68k@lists.linux-m68k.org
      Cc: Alistair Francis <alistai@xilinx.com>
      Cc: Michal Simek <michal.simek@xilinx.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: uclinux-h8-devel@lists.sourceforge.jp
      468a9428
    • G
      Eliminate bad hash multipliers from hash_32() and hash_64() · ef703f49
      George Spelvin 提交于
      The "simplified" prime multipliers made very bad hash functions, so get rid
      of them.  This completes the work of 689de1d6.
      
      To avoid the inefficiency which was the motivation for the "simplified"
      multipliers, hash_64() on 32-bit systems is changed to use a different
      algorithm.  It makes two calls to hash_32() instead.
      
      drivers/media/usb/dvb-usb-v2/af9015.c uses the old GOLDEN_RATIO_PRIME_32
      for some horrible reason, so it inherits a copy of the old definition.
      Signed-off-by: NGeorge Spelvin <linux@sciencehorizons.net>
      Cc: Antti Palosaari <crope@iki.fi>
      Cc: Mauro Carvalho Chehab <m.chehab@samsung.com>
      ef703f49
    • G
      Change hash_64() return value to 32 bits · 92d56774
      George Spelvin 提交于
      That's all that's ever asked for, and it makes the return
      type of hash_long() consistent.
      
      It also allows (upcoming patch) an optimized implementation
      of hash_64 on 32-bit machines.
      
      I tried adding a BUILD_BUG_ON to ensure the number of bits requested
      was never more than 32 (most callers use a compile-time constant), but
      adding <linux/bug.h> to <linux/hash.h> breaks the tools/perf compiler
      unless tools/perf/MANIFEST is updated, and understanding that code base
      well enough to update it is too much trouble.  I did the rest of an
      allyesconfig build with such a check, and nothing tripped.
      Signed-off-by: NGeorge Spelvin <linux@sciencehorizons.net>
      92d56774
    • G
      <linux/sunrpc/svcauth.h>: Define hash_str() in terms of hashlen_string() · 917ea166
      George Spelvin 提交于
      Finally, the first use of previous two patches: eliminate the
      separate ad-hoc string hash functions in the sunrpc code.
      
      Now hash_str() is a wrapper around hash_string(), and hash_mem() is
      likewise a wrapper around full_name_hash().
      
      Note that sunrpc code *does* call hash_mem() with a zero length, which
      is why the previous patch needed to handle that in full_name_hash().
      (Thanks, Bruce, for finding that!)
      
      This also eliminates the only caller of hash_long which asks for
      more than 32 bits of output.
      
      The comment about the quality of hashlen_string() and full_name_hash()
      is jumping the gun by a few patches; they aren't very impressive now,
      but will be improved greatly later in the series.
      Signed-off-by: NGeorge Spelvin <linux@sciencehorizons.net>
      Tested-by: NJ. Bruce Fields <bfields@redhat.com>
      Acked-by: NJ. Bruce Fields <bfields@redhat.com>
      Cc: Jeff Layton <jlayton@poochiereds.net>
      Cc: linux-nfs@vger.kernel.org
      917ea166
    • G
      fs/namei.c: Add hashlen_string() function · fcfd2fbf
      George Spelvin 提交于
      We'd like to make more use of the highly-optimized dcache hash functions
      throughout the kernel, rather than have every subsystem create its own,
      and a function that hashes basic null-terminated strings is required
      for that.
      
      (The name is to emphasize that it returns both hash and length.)
      
      It's actually useful in the dcache itself, specifically d_alloc_name().
      Other uses in the next patch.
      
      full_name_hash() is also tweaked to make it more generally useful:
      1) Take a "char *" rather than "unsigned char *" argument, to
         be consistent with hash_name().
      2) Handle zero-length inputs.  If we want more callers, we don't want
         to make them worry about corner cases.
      Signed-off-by: NGeorge Spelvin <linux@sciencehorizons.net>
      fcfd2fbf
    • G
      Pull out string hash to <linux/stringhash.h> · f4bcbe79
      George Spelvin 提交于
      ... so they can be used without the rest of <linux/dcache.h>
      
      The hashlen_* macros will make sense next patch.
      Signed-off-by: NGeorge Spelvin <linux@sciencehorizons.net>
      f4bcbe79
  11. 28 5月, 2016 5 次提交
  12. 27 5月, 2016 4 次提交
  13. 26 5月, 2016 2 次提交