1. 02 Nov, 2017 1 commit
    • License cleanup: add SPDX GPL-2.0 license identifier to files with no license · b2441318
      Committed by Greg Kroah-Hartman
      Many source files in the tree are missing licensing information, which
      makes it harder for compliance tools to determine the correct license.
      
      By default all files without license information are under the default
      license of the kernel, which is GPL version 2.
      
      Update the files which contain no license information with the 'GPL-2.0'
      SPDX license identifier.  The SPDX identifier is a legally binding
      shorthand, which can be used instead of the full boiler plate text.
      
      This patch is based on work done by Thomas Gleixner and Kate Stewart and
      Philippe Ombredanne.
      
      How this work was done:
      
      Patches were generated and checked against linux-4.14-rc6 for a subset of
      the use cases:
       - file had no licensing information in it,
       - file was a */uapi/* one with no licensing information in it,
       - file was a */uapi/* one with existing licensing information.
      
      Further patches will be generated in subsequent months to fix up cases
      where non-standard license headers were used, and references to license
      had to be inferred by heuristics based on keywords.
      
      The analysis to determine which SPDX License Identifier should be applied to
      a file was done in a spreadsheet of side by side results from the
      output of two independent scanners (ScanCode & Windriver) producing SPDX
      tag:value files created by Philippe Ombredanne.  Philippe prepared the
      base worksheet, and did an initial spot review of a few thousand files.
      
      The 4.13 kernel was the starting point of the analysis with 60,537 files
      assessed.  Kate Stewart did a file by file comparison of the scanner
      results in the spreadsheet to determine which SPDX license identifier(s)
      to be applied to the file. She confirmed any determination that was not
      immediately clear with lawyers working with the Linux Foundation.
      
      Criteria used to select files for SPDX license identifier tagging were:
       - Files considered eligible had to be source code files.
       - Make and config files were included as candidates if they contained >5
         lines of source
       - File already had some variant of a license header in it (even if <5
         lines).
      
      All documentation files were explicitly excluded.
      
      The following heuristics were used to determine which SPDX license
      identifiers to apply.
      
       - when both scanners couldn't find any license traces, file was
         considered to have no license information in it, and the top level
         COPYING file license applied.
      
         For non */uapi/* files that summary was:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|-------
         GPL-2.0                                              11139
      
         and resulted in the first patch in this series.
      
         If that file was a */uapi/* path one, it was "GPL-2.0 WITH
         Linux-syscall-note", otherwise it was "GPL-2.0".  Results of that were:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|-------
         GPL-2.0 WITH Linux-syscall-note                        930
      
         and resulted in the second patch in this series.
      
       - if a file had some form of licensing information in it, and was one
         of the */uapi/* ones, it was denoted with the Linux-syscall-note if
         any GPL family license was found in the file or had no licensing in
         it (per prior point).  Results summary:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|------
         GPL-2.0 WITH Linux-syscall-note                       270
         GPL-2.0+ WITH Linux-syscall-note                      169
         ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)    21
         ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)    17
         LGPL-2.1+ WITH Linux-syscall-note                      15
         GPL-1.0+ WITH Linux-syscall-note                       14
         ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)    5
         LGPL-2.0+ WITH Linux-syscall-note                       4
         LGPL-2.1 WITH Linux-syscall-note                        3
         ((GPL-2.0 WITH Linux-syscall-note) OR MIT)              3
         ((GPL-2.0 WITH Linux-syscall-note) AND MIT)             1
      
         and that resulted in the third patch in this series.
      
       - when the two scanners agreed on the detected license(s), that became
         the concluded license(s).
      
       - when there was disagreement between the two scanners (one detected a
         license but the other didn't, or they both detected different
         licenses) a manual inspection of the file occurred.
      
       - In most cases a manual inspection of the information in the file
         resulted in a clear resolution of the license that should apply (and
         which scanner probably needed to revisit its heuristics).
      
       - When it was not immediately clear, the license identifier was
         confirmed with lawyers working with the Linux Foundation.
      
       - If there was any question as to the appropriate license identifier,
         the file was flagged for further research and to be revisited later
         in time.
      
      In total, over 70 hours of logged manual review was done on the
      spreadsheet to determine the SPDX license identifiers to apply to the
      source files by Kate, Philippe, Thomas and, in some cases, confirmation
      by lawyers working with the Linux Foundation.
      
      Kate also obtained a third independent scan of the 4.13 code base from
      FOSSology, and compared selected files where the other two scanners
      disagreed against that SPDX file, to see if there were new insights.  The
      Windriver scanner is based in part on an older version of FOSSology, so
      they are related.
      
      Thomas did random spot checks in about 500 files from the spreadsheets
      for the uapi headers and agreed with SPDX license identifier in the
      files he inspected. For the non-uapi files Thomas did random spot checks
      in about 15000 files.
      
      In the initial set of patches against 4.14-rc6, 3 files were found to have
      copy/paste license identifier errors, and have been fixed to reflect the
      correct identifier.
      
      Additionally Philippe spent 10 hours this week doing a detailed manual
      inspection and review of the 12,461 patched files from the initial patch
      version early this week with:
       - a full scancode scan run, collecting the matched texts, detected
         license ids and scores
       - reviewing anything where there was a license detected (about 500+
         files) to ensure that the applied SPDX license was correct
       - reviewing anything where there was no detection but the patch license
         was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
         SPDX license was correct
      
      This produced a worksheet with 20 files needing minor correction.  This
      worksheet was then exported into 3 different .csv files for the
      different types of files to be modified.
      
      These .csv files were then reviewed by Greg.  Thomas wrote a script to
      parse the csv files and add the proper SPDX tag to the file, in the
      format that the file expected.  This script was further refined by Greg
      based on the output to detect more types of files automatically and to
      distinguish between header and source .c files (which need different
      comment types.)  Finally Greg ran the script using the .csv files to
      generate the patches.
      Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
      Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      b2441318
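The last step described above, adding the tag "in the format that the file expected", can be pictured with a small sketch. The script written by Thomas and Greg is not shown in the commit, so `spdx_line` below is a hypothetical helper, not their code. It assumes the kernel's usual convention of a `//` comment for .c files and a `/* */` comment for headers, and it handles only those two extensions.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: format the SPDX tag line for a file, choosing the
 * comment style from the file extension (.c uses //, .h uses a block comment).
 * Returns 0 on success, -1 for an unhandled file type or a too-small buffer.
 */
static int spdx_line(const char *filename, const char *license,
                     char *out, size_t outlen)
{
    const char *ext = strrchr(filename, '.');
    int n;

    if (!ext)
        return -1;

    if (strcmp(ext, ".c") == 0)
        n = snprintf(out, outlen, "// SPDX-License-Identifier: %s", license);
    else if (strcmp(ext, ".h") == 0)
        n = snprintf(out, outlen, "/* SPDX-License-Identifier: %s */", license);
    else
        return -1;

    /* snprintf reports truncation by returning the length it needed. */
    return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}
```

A caller would pass each path and identifier from the .csv rows and prepend the resulting line to the file; other file types (Makefiles, assembly, scripts) would need their own comment styles.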
  2. 01 Nov, 2017 2 commits
    • userns: bump idmap limits to 340 · 6397fac4
      Committed by Christian Brauner
      There are quite a few use cases where users run into the current limit for
      {g,u}id mappings. Consider a user requesting us to map everything but 999 and
      1001 for a given range of 1000000000 with a sub{g,u}id layout of:
      
      some-user:100000:1000000000
      some-user:999:1
      some-user:1000:1
      some-user:1001:1
      some-user:1002:1
      
      This translates to:
      
      MAPPING-TYPE | CONTAINER |    HOST |     RANGE |
      -------------|-----------|---------|-----------|
               uid |       999 |     999 |         1 |
               uid |      1001 |    1001 |         1 |
               uid |         0 | 1000000 |       999 |
               uid |      1000 | 1001000 |         1 |
               uid |      1002 | 1001002 | 999998998 |
      ------------------------------------------------
               gid |       999 |     999 |         1 |
               gid |      1001 |    1001 |         1 |
               gid |         0 | 1000000 |       999 |
               gid |      1000 | 1001000 |         1 |
               gid |      1002 | 1001002 | 999998998 |
      
      which is already the current limit.
      
      As discussed at LPC simply bumping the number of limits is not going to work
      since this would mean that struct uid_gid_map won't fit into a single cache-line
      anymore thereby regressing performance for the base-cases. The same problem
      seems to arise when using a single pointer. So the idea is to use
      
      struct uid_gid_extent {
      	u32 first;
      	u32 lower_first;
      	u32 count;
      };
      
      struct uid_gid_map { /* 64 bytes -- 1 cache line */
      	u32 nr_extents;
      	union {
      		struct uid_gid_extent extent[UID_GID_MAP_MAX_BASE_EXTENTS];
      		struct {
      			struct uid_gid_extent *forward;
      			struct uid_gid_extent *reverse;
      		};
      	};
      };
      
      For the base cases we will only use the struct uid_gid_extent extent member. If
      we go over UID_GID_MAP_MAX_BASE_EXTENTS mappings we perform a single 4k
      kmalloc() which means we can have a maximum of 340 mappings
      (340 * size(struct uid_gid_extent) = 4080). For the latter case we use two
      pointers "forward" and "reverse". The forward pointer points to an array sorted
      by "first" and the reverse pointer points to an array sorted by "lower_first".
      We can then perform binary search on those arrays.
      
      Performance Testing:
      When Eric introduced the extent-based struct uid_gid_map approach he measured
      the performance impact of his idmap changes:
      
      > My benchmark consisted of going to single user mode where nothing else was
      > running. On an ext4 filesystem opening 1,000,000 files and looping through all
      > of the files 1000 times and calling fstat on the individuals files. This was
      > to ensure I was benchmarking stat times where the inodes were in the kernels
      > cache, but the inode values were not in the processors cache. My results:
      
      > v3.4-rc1:         ~= 156ns (unmodified v3.4-rc1 with user namespace support disabled)
      > v3.4-rc1-userns-: ~= 155ns (v3.4-rc1 with my user namespace patches and user namespace support disabled)
      > v3.4-rc1-userns+: ~= 164ns (v3.4-rc1 with my user namespace patches and user namespace support enabled)
      
      I used an identical approach on my laptop. Here's a thorough description of what
      I did. I built a 4.14.0-rc4 mainline kernel with my new idmap patches applied. I
      booted into single user mode and used an ext4 filesystem to open/create
      1,000,000 files. Then I looped through all of the files calling fstat() on each
      of them 1000 times and calculated the mean fstat() time for a single file. (The
      test program can be found below.)
      
      Here are the results. For fun, I compared the first version of my patch which
      scaled linearly with the new version of the patch:
      
      |   # MAPPINGS |   PATCH-V1 | PATCH-NEW |
      |--------------|------------|-----------|
      |   0 mappings |     158 ns |   158 ns  |
      |   1 mappings |     164 ns |   157 ns  |
      |   2 mappings |     170 ns |   158 ns  |
      |   3 mappings |     175 ns |   161 ns  |
      |   5 mappings |     187 ns |   165 ns  |
      |  10 mappings |     218 ns |   199 ns  |
      |  50 mappings |     528 ns |   218 ns  |
      | 100 mappings |     980 ns |   229 ns  |
      | 200 mappings |    1880 ns |   239 ns  |
      | 300 mappings |    2760 ns |   240 ns  |
      | 340 mappings | not tested |   248 ns  |
      
      Here's the test program I used. I asked Eric what he did and this is a more
      "advanced" implementation of the idea. It's pretty straight-forward:
      
       #define _GNU_SOURCE
       #define __STDC_FORMAT_MACROS
       #include <errno.h>
       #include <dirent.h>
       #include <fcntl.h>
       #include <inttypes.h>
       #include <stdio.h>
       #include <stdlib.h>
       #include <string.h>
       #include <unistd.h>
       #include <sys/stat.h>
       #include <sys/time.h>
       #include <sys/types.h>
      
       int main(int argc, char *argv[])
       {
       	int ret;
       	size_t i, k;
       	int fd[1000000];
       	int times[1000];
       	char pathname[4096];
       	struct stat st;
       	struct timeval t1, t2;
       	uint64_t time_in_mcs;
       	uint64_t sum = 0;
      
       	if (argc != 2) {
       		fprintf(stderr, "Please specify a directory where to create "
       				"the test files\n");
       		exit(EXIT_FAILURE);
       	}
      
       	for (i = 0; i < sizeof(fd) / sizeof(fd[0]); i++) {
       		sprintf(pathname, "%s/idmap_test_%zu", argv[1], i);
       		fd[i]= open(pathname, O_RDWR | O_CREAT, S_IXUSR | S_IXGRP | S_IXOTH);
       		if (fd[i] < 0) {
       			ssize_t j;
       			for (j = i; j >= 0; j--)
       				close(fd[j]);
       			exit(EXIT_FAILURE);
       		}
       	}
      
       	for (k = 0; k < 1000; k++) {
       		ret = gettimeofday(&t1, NULL);
       		if (ret < 0)
       			goto close_all;
      
       		for (i = 0; i < sizeof(fd) / sizeof(fd[0]); i++) {
       			ret = fstat(fd[i], &st);
       			if (ret < 0)
       				goto close_all;
       		}
      
       		ret = gettimeofday(&t2, NULL);
       		if (ret < 0)
       			goto close_all;
      
       		time_in_mcs = (1000000 * t2.tv_sec + t2.tv_usec) -
       			      (1000000 * t1.tv_sec + t1.tv_usec);
       		printf("Total time in micro seconds:       %" PRIu64 "\n",
       		       time_in_mcs);
       		printf("Total time in nanoseconds:         %" PRIu64 "\n",
       		       time_in_mcs * 1000);
       		printf("Time per file in nanoseconds:      %" PRIu64 "\n",
       		       (time_in_mcs * 1000) / 1000000);
       		times[k] = (time_in_mcs * 1000) / 1000000;
       	}
      
       close_all:
       	for (i = 0; i < sizeof(fd) / sizeof(fd[0]); i++)
       		close(fd[i]);
      
       	if (ret < 0)
       		exit(EXIT_FAILURE);
      
       	for (k = 0; k < 1000; k++) {
       		sum += times[k];
       	}
      
       	printf("Mean time per file in nanoseconds: %" PRIu64 "\n", sum / 1000);
      
       	exit(EXIT_SUCCESS);
       }
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      CC: Serge Hallyn <serge@hallyn.com>
      CC: Eric Biederman <ebiederm@xmission.com>
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
      6397fac4
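The size arithmetic and the binary search described in this commit can be sketched in userspace C. This is an illustration only, not the kernel code: `struct extent`, `example_uid_map` and `map_id_down` are names chosen here, and the table holds the five uid rows from the commit message, sorted by `first` as the "forward" array would be.

```c
#include <stddef.h>
#include <stdint.h>

/* Mirrors the layout of struct uid_gid_extent: three 32-bit fields, 12 bytes. */
struct extent {
    uint32_t first;        /* first id as seen inside the namespace */
    uint32_t lower_first;  /* first id it maps to in the parent (host) */
    uint32_t count;        /* number of consecutive ids covered */
};

/* The uid rows from the commit message, sorted by .first. */
static const struct extent example_uid_map[] = {
    {    0, 1000000,       999 },
    {  999,     999,         1 },
    { 1000, 1001000,         1 },
    { 1001,    1001,         1 },
    { 1002, 1001002, 999998998 },
};

/*
 * Binary search over extents sorted by .first.
 * Returns the mapped id, or (uint32_t)-1 if no extent covers the id.
 */
static uint32_t map_id_down(const struct extent *sorted, size_t n, uint32_t id)
{
    size_t lo = 0, hi = n;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        const struct extent *e = &sorted[mid];

        if (id < e->first)
            hi = mid;                       /* id lies before this extent */
        else if (id - e->first >= e->count)
            lo = mid + 1;                   /* id lies after this extent */
        else
            return e->lower_first + (id - e->first);
    }
    return (uint32_t)-1;
}
```

With 12-byte extents, 340 entries occupy 4080 bytes, which fits in the single 4k allocation mentioned above. A reverse lookup would run the same search over a second array sorted by `lower_first`.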
    • userns: use union in {g,u}idmap struct · aa4bf44d
      Committed by Christian Brauner
      - Add a struct containing two pointers to extents and wrap both the static extent
        array and the struct into a union. This is done in preparation for bumping the
        {g,u}idmap limits for user namespaces.
      - Add brackets around anonymous union when using designated initializers to
        initialize members in order to please gcc <= 4.4.
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
      aa4bf44d
  3. 20 Jul, 2017 1 commit
    • userns,pidns: Verify the userns for new pid namespaces · a2b42626
      Committed by Eric W. Biederman
      It is pointless and confusing to allow a pid namespace hierarchy and
      the user namespace hierarchy to get out of sync.  The owner of a child
      pid namespace should be the owner of the parent pid namespace or
      a descendant of the owner of the parent pid namespace.
      
      Otherwise it is possible to construct scenarios where a process has a
      capability over a parent pid namespace but does not have the
      capability over a child pid namespace, which confusingly makes
      permission checks non-transitive.
      
      It requires use of setns into a pid namespace (but not into a user
      namespace) to create such a scenario.
      
      Add the function in_userns to help in making this determination.
      
      v2: Optimized in_userns by using level as suggested
          by: Kirill Tkhai <ktkhai@virtuozzo.com>
      
      Ref: 49f4d8b9 ("pidns: Capture the user namespace and filter ns_last_pid")
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      a2b42626
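The ancestry test that in_userns performs, including the v2 optimization of using the nesting level, can be sketched as follows. `struct userns` is a hypothetical stand-in with only the two fields the check needs; it is not the kernel's struct user_namespace.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct user_namespace. */
struct userns {
    struct userns *parent;  /* NULL for the initial namespace */
    int level;              /* 0 for the initial namespace, parent->level + 1 otherwise */
};

/*
 * Returns true if `child` is `ancestor` itself or one of its descendants.
 * Walking up stops as soon as the candidate is no deeper than `ancestor`,
 * so the loop never goes further up the hierarchy than necessary.
 */
static bool in_userns(const struct userns *ancestor, const struct userns *child)
{
    const struct userns *ns = child;

    while (ns->level > ancestor->level)
        ns = ns->parent;

    return ns == ancestor;
}
```

Because a namespace at level N always has exactly N ancestors, comparing levels first means two unrelated namespaces at the same depth are rejected without walking to the root.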
  4. 01 Jul, 2017 1 commit
    • randstruct: Mark various structs for randomization · 3859a271
      Committed by Kees Cook
      This marks many critical kernel structures for randomization. These are
      structures that have been targeted in the past in security exploits, or
      contain function pointers, pointers to function pointer tables, lists,
      workqueues, ref-counters, credentials, permissions, or are otherwise
      sensitive. This initial list was extracted from Brad Spengler/PaX Team's
      code in the last public patch of grsecurity/PaX based on my understanding
      of the code. Changes or omissions from the original code are mine and
      don't reflect the original grsecurity/PaX code.
      
      Left out of this list is task_struct, which requires special handling
      and will be covered in a subsequent patch.
      Signed-off-by: Kees Cook <keescook@chromium.org>
      3859a271
  5. 07 Mar, 2017 1 commit
  6. 03 Mar, 2017 2 commits
  7. 02 Mar, 2017 1 commit
    • sched/headers: Prepare for the removal of various unrelated headers from <linux/sched.h> · cc5efc23
      Committed by Ingo Molnar
      We are going to remove the following header inclusions from <linux/sched.h>:
      
      	#include <asm/param.h>
      	#include <linux/threads.h>
      	#include <linux/kernel.h>
      	#include <linux/types.h>
      	#include <linux/timex.h>
      	#include <linux/jiffies.h>
      	#include <linux/rbtree.h>
      	#include <linux/thread_info.h>
      	#include <linux/cpumask.h>
      	#include <linux/errno.h>
      	#include <linux/nodemask.h>
      	#include <linux/preempt.h>
      	#include <asm/page.h>
      	#include <linux/smp.h>
      	#include <linux/compiler.h>
      	#include <linux/completion.h>
      	#include <linux/percpu.h>
      	#include <linux/topology.h>
      	#include <linux/rcupdate.h>
      	#include <linux/time.h>
      	#include <linux/timer.h>
      	#include <linux/llist.h>
      	#include <linux/uidgid.h>
      	#include <asm/processor.h>
      
      Fix up a single .h file that got hold of <linux/sysctl.h> via one of these headers.
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      cc5efc23
  8. 24 Jan, 2017 1 commit
    • inotify: Convert to using per-namespace limits · 1cce1eea
      Committed by Nikolay Borisov
      This patchset converts inotify to using the newly introduced
      per-userns sysctl infrastructure.
      
      Currently the inotify instances/watches are being accounted in the
      user_struct structure. This means that in setups where multiple
      users in unprivileged containers map to the same underlying
      real user (i.e. pointing to the same user_struct) the inotify limits
      are going to be shared as well, allowing one user (or application) to exhaust
      everyone else's limits.
      
      Fix this by switching the inotify sysctls to using the
      per-namespace/per-user limits. This will allow the server admin to
      set sensible global limits, which can further be tuned inside every
      individual user namespace. Additionally, in order to preserve the
      sysctl ABI make the existing inotify instances/watches sysctls
      modify the values of the initial user namespace.
      Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
      Acked-by: Jan Kara <jack@suse.cz>
      Acked-by: Serge Hallyn <serge@hallyn.com>
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
      1cce1eea
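The idea of a global limit that can be tightened inside each user namespace can be illustrated with a hierarchical counter. This is a simplified sketch under stated assumptions: it keeps one counter per namespace and ignores the per-user dimension, and `struct ns_counter`, `charge` and `uncharge` are hypothetical names, not the kernel's interface.

```c
#include <stddef.h>

/* Hypothetical per-namespace counter: one object (e.g. an inotify watch)
 * is charged against this namespace and every ancestor namespace. */
struct ns_counter {
    struct ns_counter *parent;  /* NULL for the initial namespace */
    long count;                 /* objects currently charged here */
    long max;                   /* limit configured for this namespace */
};

/* Charge one object. Returns 0 on success, -1 if any level is at its limit.
 * On failure the levels already incremented are rolled back. */
static int charge(struct ns_counter *c)
{
    struct ns_counter *iter, *undo;

    for (iter = c; iter; iter = iter->parent) {
        if (iter->count >= iter->max)
            goto fail;
        iter->count++;
    }
    return 0;

fail:
    for (undo = c; undo != iter; undo = undo->parent)
        undo->count--;
    return -1;
}

/* Release one previously charged object at every level. */
static void uncharge(struct ns_counter *c)
{
    for (; c; c = c->parent)
        c->count--;
}
```

In this model a container that exhausts its own limit fails at its own level, while the limit on the initial namespace still bounds the total across all containers.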
  9. 23 Sep, 2016 1 commit
  10. 31 Aug, 2016 1 commit
  11. 09 Aug, 2016 9 commits
  12. 08 Aug, 2016 1 commit
  13. 24 Jun, 2016 1 commit
  14. 12 Dec, 2014 1 commit
    • userns: Add a knob to disable setgroups on a per user namespace basis · 9cc46516
      Committed by Eric W. Biederman
      - Expose the knob to user space through a proc file /proc/<pid>/setgroups
      
        A value of "deny" means the setgroups system call is disabled in the
        current process's user namespace and can not be enabled in the
        future in this user namespace.

        A value of "allow" means the setgroups system call is enabled.
      
      - Descendant user namespaces inherit the value of setgroups from
        their parents.
      
      - A proc file is used (instead of a sysctl) as sysctls currently do
        not allow checking the permissions at open time.
      
      - Writing to the proc file is restricted to before the gid_map
        for the user namespace is set.
      
        This ensures that disabling setgroups at a user namespace
        level will never remove the ability to call setgroups
        from a process that already has that ability.
      
        A process may opt in to the setgroups disable for itself by
        creating, entering and configuring a user namespace or by calling
        setns on an existing user namespace with setgroups disabled.
        Processes without privileges already can not call setgroups so this
        is a no-op.  Processes with privilege become processes without
        privilege when entering a user namespace and as with any other path
        to dropping privilege they would not have the ability to call
        setgroups.  So this remains within the bounds of what is possible
        without a knob to disable setgroups permanently in a user namespace.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      9cc46516
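Using the knob from user space amounts to writing one of the two accepted strings to the proc file before the gid_map is set. The helper below is a minimal sketch; `write_setgroups` is a name chosen here, and the path is passed in by the caller (for example "/proc/self/setgroups") rather than built from a pid.

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: write "allow" or "deny" to a setgroups proc file.
 * Returns 0 on success, -1 on an invalid value or any I/O error
 * (the kernel refuses the write once gid_map has been set).
 */
static int write_setgroups(const char *path, const char *value)
{
    size_t len = strlen(value);
    ssize_t n;
    int fd;

    if (strcmp(value, "allow") != 0 && strcmp(value, "deny") != 0)
        return -1;

    fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;

    n = write(fd, value, len);
    close(fd);

    return n == (ssize_t)len ? 0 : -1;
}
```

A typical sequence in a freshly created user namespace would be to call this with "deny" first and only then write the gid_map.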
  15. 10 Dec, 2014 1 commit
  16. 05 Dec, 2014 1 commit
  17. 09 Aug, 2014 1 commit
  18. 24 Sep, 2013 1 commit
    • KEYS: Add per-user_namespace registers for persistent per-UID kerberos caches · f36f8c75
      Committed by David Howells
      Add support for per-user_namespace registers of persistent per-UID kerberos
      caches held within the kernel.
      
      This allows the kerberos cache to be retained beyond the life of all a user's
      processes so that the user's cron jobs can work.
      
      The kerberos cache is envisioned as a keyring/key tree looking something like:
      
      	struct user_namespace
      	  \___ .krb_cache keyring		- The register
      		\___ _krb.0 keyring		- Root's Kerberos cache
      		\___ _krb.5000 keyring		- User 5000's Kerberos cache
      		\___ _krb.5001 keyring		- User 5001's Kerberos cache
      			\___ tkt785 big_key	- A ccache blob
      			\___ tkt12345 big_key	- Another ccache blob
      
      Or possibly:
      
      	struct user_namespace
      	  \___ .krb_cache keyring		- The register
      		\___ _krb.0 keyring		- Root's Kerberos cache
      		\___ _krb.5000 keyring		- User 5000's Kerberos cache
      		\___ _krb.5001 keyring		- User 5001's Kerberos cache
      			\___ tkt785 keyring	- A ccache
      				\___ krbtgt/REDHAT.COM@REDHAT.COM big_key
      				\___ http/REDHAT.COM@REDHAT.COM user
      				\___ afs/REDHAT.COM@REDHAT.COM user
      				\___ nfs/REDHAT.COM@REDHAT.COM user
      				\___ krbtgt/KERNEL.ORG@KERNEL.ORG big_key
      				\___ http/KERNEL.ORG@KERNEL.ORG big_key
      
      What goes into a particular Kerberos cache is entirely up to userspace.  Kernel
      support is limited to giving you the Kerberos cache keyring that you want.
      
      The user asks for their Kerberos cache by:
      
      	krb_cache = keyctl_get_krbcache(uid, dest_keyring);
      
      The uid is -1 or the user's own UID for the user's own cache or the uid of some
      other user's cache (requires CAP_SETUID).  This permits rpc.gssd or whatever to
      mess with the cache.
      
      The cache returned is a keyring named "_krb.<uid>" that the possessor can read,
      search, clear, invalidate, unlink from and add links to.  Active LSMs get a
      chance to rule on whether the caller is permitted to make a link.
      
      Each uid's cache keyring is created when it is first accessed and is given a
      timeout that is extended each time this function is called so that the keyring
      goes away after a while.  The timeout is configurable by sysctl but defaults to
      three days.
      
      Each user_namespace struct gets a lazily-created keyring that serves as the
      register.  The cache keyrings are added to it.  This means that standard key
      search and garbage collection facilities are available.
      
      The user_namespace struct's register goes away when it does and anything left
      in it is then automatically gc'd.
      Signed-off-by: David Howells <dhowells@redhat.com>
      Tested-by: Simo Sorce <simo@redhat.com>
      cc: Serge E. Hallyn <serge.hallyn@ubuntu.com>
      cc: Eric W. Biederman <ebiederm@xmission.com>
      f36f8c75
  19. 27 Aug, 2013 1 commit
    • userns: Better restrictions on when proc and sysfs can be mounted · e51db735
      Committed by Eric W. Biederman
      Rely on the fact that another flavor of the filesystem is already
      mounted and do not rely on state in the user namespace.
      
      Verify that the mounted filesystem is not covered in any significant
      way.  I would love to verify that the previously mounted filesystem
      has no mounts on top but there are at least the directories
      /proc/sys/fs/binfmt_misc and /sys/fs/cgroup/ that exist explicitly
      for other filesystems to mount on top of.
      
      Refactor the test into a function named fs_fully_visible and call that
      function from the mount routines of proc and sysfs.  This makes this
      test local to the filesystems involved and the results current of when
      the mounts take place, removing a weird threading of the user
      namespace, the mount namespace and the filesystems themselves.
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      e51db735
  20. 09 Aug, 2013 1 commit
  21. 27 Mar, 2013 1 commit
    • userns: Restrict when proc and sysfs can be mounted · 87a8ebd6
      Committed by Eric W. Biederman
      Only allow unprivileged mounts of proc and sysfs if they are already
      mounted when the user namespace is created.
      
      proc and sysfs are interesting because they have content that is
      per namespace, and so fresh mounts are needed when new namespaces
      are created while at the same time proc and sysfs have content that
      is shared between every instance.
      
      Respect the policy of who may see the shared content of proc and sysfs
      by only allowing new mounts if there was an existing mount at the time
      the user namespace was created.
      
      In practice there are only two interesting cases: proc and sysfs are
      mounted at their usual places, proc and sysfs are not mounted at all
      (some form of mount namespace jail).
      
      Cc: stable@vger.kernel.org
      Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      87a8ebd6
  22. 27 Jan, 2013 1 commit
    • userns: Avoid recursion in put_user_ns · c61a2810
      Committed by Eric W. Biederman
      When freeing a deeply nested user namespace free_user_ns calls
      put_user_ns on its parent which may in turn call free_user_ns again.
      When -fno-optimize-sibling-calls is passed to gcc one stack frame per
      user namespace is left on the stack, potentially overflowing the
      kernel stack.  CONFIG_FRAME_POINTER forces -fno-optimize-sibling-calls
      so we can't count on gcc to optimize this code.
      
      Remove struct kref and use a plain atomic_t, making the code more
      flexible and easier to comprehend.  Make the loop in free_user_ns
      explicit to guarantee that the stack does not overflow with
      CONFIG_FRAME_POINTER enabled.
      
      I have tested this fix with a simple program that uses unshare to
      create a deeply nested user namespace structure and then calls exit.
      With 1000 nested user namespaces before this change, running my test
      program causes the kernel to die a horrible death.  With 10,000,000
      nested user namespaces after this change, my test program runs to
      completion and causes no harm.
      Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
      Pointed-out-by: Vasily Kulikov <segoon@openwall.com>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      c61a2810
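The change from recursion to an explicit loop can be shown with a userspace model. This is a sketch, not the kernel code: `struct userns` is a hypothetical stand-in, a plain int replaces the atomic_t, and the `freed` counter exists only so the behaviour can be observed.

```c
#include <stdlib.h>

/* Hypothetical stand-in for a reference-counted user namespace. */
struct userns {
    struct userns *parent;  /* NULL for the outermost namespace */
    int count;              /* plain reference count (atomic_t in the kernel) */
};

static unsigned long freed;  /* demonstration only: number of namespaces freed */

/* Free ns and then, in a loop rather than by recursion, every ancestor
 * whose last reference was the one held by the namespace just freed. */
static void free_userns(struct userns *ns)
{
    struct userns *parent;

    do {
        parent = ns->parent;
        free(ns);
        freed++;
        ns = parent;
    } while (ns && --ns->count == 0);
}

/* Drop one reference; free the namespace when it was the last one. */
static void put_userns(struct userns *ns)
{
    if (ns && --ns->count == 0)
        free_userns(ns);
}

/* Build `depth` nested namespaces and return the innermost one.
 * Each parent is referenced once, by its child; the caller owns
 * the single reference on the innermost namespace. */
static struct userns *make_chain(size_t depth)
{
    struct userns *ns = NULL;

    while (depth--) {
        struct userns *child = malloc(sizeof(*child));

        if (!child)
            abort();
        child->parent = ns;
        child->count = 1;
        ns = child;
    }
    return ns;
}
```

Dropping the last reference on the innermost namespace of a long chain releases the whole chain with constant stack depth, which is the property the commit is after.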
  23. 20 Nov, 2012 2 commits
    • proc: Usable inode numbers for the namespace file descriptors. · 98f842e6
      Committed by Eric W. Biederman
      Assign a unique proc inode to each namespace, and use that
      inode number to ensure we only allocate at most one proc
      inode for every namespace in proc.
      
      A single proc inode per namespace allows userspace to test
      to see if two processes are in the same namespace.
      
      This has been a long requested feature and only blocked because
      a naive implementation would put the id in a global space and
      would ultimately require having a namespace for the names of
      namespaces, making migration and certain virtualization tricks
      impossible.
      
      We still don't have per superblock inode numbers for proc, which
      appears necessary for application unaware checkpoint/restart and
      migrations (if the application is using namespace file descriptors)
      but that is now allowed by the design if it becomes important.
      
      I have preallocated the ipc and uts initial proc inode numbers so
      their structures can be statically initialized.
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
      98f842e6
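The userspace test this commit enables, checking whether two processes share a namespace, comes down to comparing the inodes behind two namespace files. A minimal sketch follows; `same_inode` is a name chosen here, and the caller is assumed to pass paths such as /proc/<pid>/ns/user.

```c
#include <sys/stat.h>

/*
 * Returns 1 if both paths resolve to the same inode on the same device,
 * 0 if they differ, and -1 if either path cannot be stat'ed.
 */
static int same_inode(const char *a, const char *b)
{
    struct stat sa, sb;

    if (stat(a, &sa) < 0 || stat(b, &sb) < 0)
        return -1;

    return sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino;
}
```

For example, `same_inode("/proc/self/ns/user", "/proc/1/ns/user")` would report whether the caller shares a user namespace with pid 1, provided the caller is allowed to inspect that process.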
    • userns: Implement unshare of the user namespace · b2e0d987
      Committed by Eric W. Biederman
      - Add CLONE_THREAD to the unshare flags if CLONE_NEWUSER is selected,
        as changing user namespaces is only valid if there is only
        a single thread.
      - Restore the code to add CLONE_VM if CLONE_THREAD is selected and
        the code to add CLONE_SIGHAND if CLONE_VM is selected,
        making the constraints in the code clear.
      Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      b2e0d987
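The chain of implied flags described in this commit can be written as a small function. This is a sketch of the rule as stated in the message, not a copy of the kernel's unshare code; `implied_unshare_flags` is a hypothetical name, and the fallback definitions only apply if <sched.h> does not provide the CLONE_* constants.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <sched.h>

/* Fallbacks using the Linux UAPI values, in case <sched.h> hides them. */
#ifndef CLONE_VM
#define CLONE_VM      0x00000100
#endif
#ifndef CLONE_SIGHAND
#define CLONE_SIGHAND 0x00000800
#endif
#ifndef CLONE_THREAD
#define CLONE_THREAD  0x00010000
#endif
#ifndef CLONE_NEWUSER
#define CLONE_NEWUSER 0x10000000
#endif

/* Expand the requested unshare flags with everything they imply:
 * CLONE_NEWUSER -> CLONE_THREAD -> CLONE_VM -> CLONE_SIGHAND. */
static unsigned long implied_unshare_flags(unsigned long flags)
{
    if (flags & CLONE_NEWUSER)
        flags |= CLONE_THREAD;
    if (flags & CLONE_THREAD)
        flags |= CLONE_VM;
    if (flags & CLONE_VM)
        flags |= CLONE_SIGHAND;
    return flags;
}
```

The order of the three checks matters: each one can set the bit that the next one tests, so a request for CLONE_NEWUSER alone ends up carrying all four flags.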
  24. 18 Sep, 2012 1 commit
    • E
      userns: Add kprojid_t and associated infrastructure in projid.h · f76d207a
      Eric W. Biederman committed
      Implement kprojid_t, a cousin of kuid_t and kgid_t.
      
      The per-user-namespace mapping of project id values can be set with
      /proc/<pid>/projid_map.
      
      A full complement of helpers is provided: make_kprojid, from_kprojid,
      from_kprojid_munged, kprojid_has_mapping, projid_valid, projid_eq,
      projid_lt.
      
      Project identifiers are part of the generic disk quota interface,
      although it appears only xfs implements project identifiers currently.
      
      The xfs code allows anyone who has permission to set the project
      identifier on a file to use any project identifier, so when
      setting up the user namespace project identifier mappings I do
      not require a capability.
      
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      f76d207a
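The idea of a dedicated project id type with comparison helpers can be illustrated with a small userspace model. Everything below is an assumption made for illustration: the `_model` names are hypothetical, the wrapper ignores user namespaces entirely (unlike make_kprojid), and treating the all-ones value as invalid is a convention chosen for this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Raw project id as seen by userspace (model only). */
typedef uint32_t projid_model_t;

/*
 * Wrapping the value in a struct makes the compiler reject code that
 * mixes wrapped ids with plain integers by accident.
 */
typedef struct {
    projid_model_t val;
} kprojid_model_t;

/* Hypothetical constructor; a real mapping would consult a namespace. */
static inline kprojid_model_t wrap_projid_model(projid_model_t id)
{
    kprojid_model_t k = { id };
    return k;
}

static inline bool model_projid_eq(kprojid_model_t a, kprojid_model_t b)
{
    return a.val == b.val;
}

static inline bool model_projid_lt(kprojid_model_t a, kprojid_model_t b)
{
    return a.val < b.val;
}

/* Assumption of this sketch: (projid_model_t)-1 marks "no valid id". */
static inline bool model_projid_valid(kprojid_model_t a)
{
    return a.val != (projid_model_t)-1;
}
```

With this shape, an expression such as `a == b` on two wrapped ids does not compile, so every comparison has to go through a named helper.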
  25. 03 May, 2012 2 commits
  26. 26 Apr, 2012 2 commits
    • E
      userns: Rework the user_namespace adding uid/gid mapping support · 22d917d8
      Eric W. Biederman committed
      - Convert the old uid mapping functions into compatibility wrappers
      - Add a uid/gid mapping layer from user space uids and gids to kernel
        internal uids and gids that is extent based for simplicity and speed.
        * Working with the number space after mapping uids/gids into their
          kernel internal version adds only mapping complexity over what we
          have today, leaving the kernel code easy to understand and test.
      - Add proc files /proc/self/uid_map and /proc/self/gid_map.
        These files display the mapping and allow a mapping to be added
        if a mapping does not exist.
      - Allow entering the user namespace without a uid or gid mapping.
        Since we are starting with an existing user, our uids and gids
        still have global mappings, so they are still valid and useful; they
        just don't have local mappings.  The requirement for things to work
        is global uids and gids, so it is odd but perfectly fine not to have
        a local uid and gid mapping.
        Not requiring local uid and gid mappings greatly simplifies
        the logic of setting up the uid and gid mappings by allowing
        the mappings to be set after the namespace is created, which makes the
        slight weirdness worth it.
      - Make the mappings in the initial user namespace to the global
        uid/gid space explicit.  Today it is an identity mapping
        but in the future we may want to twist this for debugging, similar
        to what we do with jiffies.
      - Document the memory ordering requirements of setting the uid and
        gid mappings.  We only allow the mappings to be set once
        and there are no pointers involved, so the requirements are
        trivial but a little atypical.
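
An extent-based id mapping of the kind described in the list above can be modelled in a few lines. This is a userspace sketch under stated assumptions, not the kernel implementation: the struct layout, the `_model` function names and the `ID_UNMAPPED` sentinel are all hypothetical choices for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* One extent: ids [first, first + count) inside the namespace map to
 * [lower_first, lower_first + count) outside it. */
struct id_extent {
    uint32_t first;
    uint32_t lower_first;
    uint32_t count;
};

/* Sentinel returned when no extent covers the id (assumption of this sketch). */
#define ID_UNMAPPED ((uint32_t)-1)

/* Translate a namespace-local id to the outer id space. */
uint32_t map_id_down_model(const struct id_extent *ext, size_t n, uint32_t id)
{
    for (size_t i = 0; i < n; i++) {
        /* id - first is evaluated only when id >= first, so it cannot wrap. */
        if (id >= ext[i].first && id - ext[i].first < ext[i].count)
            return ext[i].lower_first + (id - ext[i].first);
    }
    return ID_UNMAPPED;
}

/* Translate an outer id back to the namespace-local id space. */
uint32_t map_id_up_model(const struct id_extent *ext, size_t n, uint32_t id)
{
    for (size_t i = 0; i < n; i++) {
        if (id >= ext[i].lower_first && id - ext[i].lower_first < ext[i].count)
            return ext[i].first + (id - ext[i].lower_first);
    }
    return ID_UNMAPPED;
}
```

For example, with a single extent `{0, 100000, 65536}`, local id 1000 maps down to 101000, and outer id 100000 maps back up to 0. Each lookup is a short linear scan with two comparisons per extent, which fits the stated goal of simplicity and speed.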
      
      Performance:
      
      In this scheme, the performance of the permission checks is expected to
      stay the same, as the actual machine instructions should remain the same.
      
      The worst case I could think of is ls -l on a large directory, where
      all of the stat results need to be translated from kuids and
      kgids to uids and gids.  So I benchmarked that case on my laptop
      with a dual-core, hyperthreaded Intel i5-2520M CPU with 3 MB of CPU cache.
      
      My benchmark consisted of going to single user mode, where nothing else
      was running, then, on an ext4 filesystem, opening 1,000,000 files, looping
      through all of the files 1000 times and calling fstat on the
      individual files.  This was to ensure I was benchmarking stat times
      where the inodes were in the kernel's cache, but the inode values were
      not in the processor's cache.  My results:
      
      v3.4-rc1:         ~= 156ns (unmodified v3.4-rc1 with user namespace support disabled)
      v3.4-rc1-userns-: ~= 155ns (v3.4-rc1 with my user namespace patches and user namespace support disabled)
      v3.4-rc1-userns+: ~= 164ns (v3.4-rc1 with my user namespace patches and user namespace support enabled)
      
      All of the configurations ran in roughly 120ns when I performed tests
      that ran in the cpu cache.
      
      So in summary the performance impact is:
      1ns improvement in the worst case with user namespace support compiled out.
      8ns aka 5% slowdown in the worst case with user namespace support compiled in.
      Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
      22d917d8
    • E
      userns: Simplify the user_namespace by making userns->creator a kuid. · 783291e6
      Eric W. Biederman committed
      - Transform userns->creator from a user_struct reference to a simple
        kuid_t, kgid_t pair.
      
        In cap_capable this allows the check to see if we are the creator of
        a namespace to become the classic suser style euid permission check.
      
        This allows us to remove the need for a struct cred in the mapping
        functions and still be able to display the user namespace creator's
        uid and gid as 0.
      
      - Remove the now unnecessary delayed_work in free_user_ns.
      
        All that is left for free_user_ns to do is to call kmem_cache_free
        and put_user_ns.  Those functions can be called in any context,
        so call them directly from free_user_ns, removing the need for
        delayed work.
      Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
      783291e6
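The effect of storing the creator as a plain uid/gid pair can be shown with a toy model. This is only a sketch under assumptions: the struct and function names are hypothetical, and the real capability check in the kernel involves more conditions than the single comparison shown here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy namespace record: the creator is stored by value as a uid/gid
 * pair instead of as a reference to a separate per-user structure. */
struct userns_model {
    uint32_t owner_uid;
    uint32_t owner_gid;
};

/* Toy credentials: only the effective uid matters for this check. */
struct cred_model {
    uint32_t euid;
};

/*
 * With the creator stored by value, "is this task the creator of the
 * namespace?" reduces to comparing two integers, much like a classic
 * effective-uid permission test.
 */
bool is_ns_creator_model(const struct cred_model *cred,
                         const struct userns_model *ns)
{
    return cred->euid == ns->owner_uid;
}
```

Storing the pair by value also means the namespace holds no reference that must be dropped when it is freed, which is consistent with the simpler teardown described in the second bullet of the commit message.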
  27. 08 Apr, 2012 1 commit