1. 23 8月, 2018 2 次提交
  2. 23 2月, 2018 1 次提交
    • L
      efivarfs: Limit the rate for non-root to read files · bef3efbe
      Luck, Tony 提交于
      Each read from a file in efivarfs results in two calls to EFI
      (one to get the file size, another to get the actual data).
      
      On X86 these EFI calls result in broadcast system management
      interrupts (SMI) which affect performance of the whole system.
      A malicious user can loop performing reads from efivarfs bringing
      the system to its knees.
      
      Linus suggested per-user rate limit to solve this.
      
      So we add a ratelimit structure to "user_struct" and initialize
      it for the root user for no limit. When allocating user_struct for
      other users we set the limit to 100 per second. This could be used
      for other places that want to limit the rate of some detrimental
      user action.
      
      In efivarfs if the limit is exceeded when reading, we take an
      interruptible nap for 50ms and check the rate limit again.
      Signed-off-by: NTony Luck <tony.luck@intel.com>
      Acked-by: NArd Biesheuvel <ard.biesheuvel@linaro.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bef3efbe
  3. 01 11月, 2017 1 次提交
  4. 02 3月, 2017 1 次提交
  5. 12 12月, 2014 1 次提交
    • E
      userns: Add a knob to disable setgroups on a per user namespace basis · 9cc46516
      Eric W. Biederman 提交于
      - Expose the knob to user space through a proc file /proc/<pid>/setgroups
      
        A value of "deny" means the setgroups system call is disabled in the
        current processes user namespace and can not be enabled in the
        future in this user namespace.
      
        A value of "allow" means the segtoups system call is enabled.
      
      - Descendant user namespaces inherit the value of setgroups from
        their parents.
      
      - A proc file is used (instead of a sysctl) as sysctls currently do
        not allow checking the permissions at open time.
      
      - Writing to the proc file is restricted to before the gid_map
        for the user namespace is set.
      
        This ensures that disabling setgroups at a user namespace
        level will never remove the ability to call setgroups
        from a process that already has that ability.
      
        A process may opt in to the setgroups disable for itself by
        creating, entering and configuring a user namespace or by calling
        setns on an existing user namespace with setgroups disabled.
        Processes without privileges already can not call setgroups so this
        is a noop.  Prodcess with privilege become processes without
        privilege when entering a user namespace and as with any other path
        to dropping privilege they would not have the ability to call
        setgroups.  So this remains within the bounds of what is possible
        without a knob to disable setgroups permanently in a user namespace.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      9cc46516
  6. 05 12月, 2014 2 次提交
  7. 05 6月, 2014 1 次提交
  8. 04 4月, 2014 1 次提交
    • P
      kernel: audit/fix non-modular users of module_init in core code · c96d6660
      Paul Gortmaker 提交于
      Code that is obj-y (always built-in) or dependent on a bool Kconfig
      (built-in or absent) can never be modular.  So using module_init as an
      alias for __initcall can be somewhat misleading.
      
      Fix these up now, so that we can relocate module_init from init.h into
      module.h in the future.  If we don't do this, we'd have to add module.h
      to obviously non-modular code, and that would be a worse thing.
      
      The audit targets the following module_init users for change:
       kernel/user.c                  obj-y
       kernel/kexec.c                 bool KEXEC (one instance per arch)
       kernel/profile.c               bool PROFILING
       kernel/hung_task.c             bool DETECT_HUNG_TASK
       kernel/sched/stats.c           bool SCHEDSTATS
       kernel/user_namespace.c        bool USER_NS
      
      Note that direct use of __initcall is discouraged, vs.  one of the
      priority categorized subgroups.  As __initcall gets mapped onto
      device_initcall, our use of subsys_initcall (which makes sense for these
      files) will thus change this registration from level 6-device to level
      4-subsys (i.e.  slightly earlier).  However no observable impact of that
      difference has been observed during testing.
      
      Also, two instances of missing ";" at EOL are fixed in kexec.
      Signed-off-by: NPaul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c96d6660
  9. 13 12月, 2013 1 次提交
    • X
      KEYS: fix uninitialized persistent_keyring_register_sem · 6bd364d8
      Xiao Guangrong 提交于
      We run into this bug:
      [ 2736.063245] Unable to handle kernel paging request for data at address 0x00000000
      [ 2736.063293] Faulting instruction address: 0xc00000000037efb0
      [ 2736.063300] Oops: Kernel access of bad area, sig: 11 [#1]
      [ 2736.063303] SMP NR_CPUS=2048 NUMA pSeries
      [ 2736.063310] Modules linked in: sg nfsv3 rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache nf_conntrack_netbios_ns nf_conntrack_broadcast ipt_MASQUERADE ip6table_mangle ip6table_security ip6table_raw ip6t_REJECT iptable_nat nf_nat_ipv4 iptable_mangle iptable_security iptable_raw ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack ebtable_filter ebtables ip6table_filter iptable_filter ip_tables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 nf_nat nf_conntrack ip6_tables ibmveth pseries_rng nx_crypto nfsd auth_rpcgss nfs_acl lockd sunrpc binfmt_misc xfs libcrc32c dm_service_time sd_mod crc_t10dif crct10dif_common ibmvfc scsi_transport_fc scsi_tgt dm_mirror dm_region_hash dm_log dm_multipath dm_mod
      [ 2736.063383] CPU: 1 PID: 7128 Comm: ssh Not tainted 3.10.0-48.el7.ppc64 #1
      [ 2736.063389] task: c000000131930120 ti: c0000001319a0000 task.ti: c0000001319a0000
      [ 2736.063394] NIP: c00000000037efb0 LR: c0000000006c40f8 CTR: 0000000000000000
      [ 2736.063399] REGS: c0000001319a3870 TRAP: 0300   Not tainted  (3.10.0-48.el7.ppc64)
      [ 2736.063403] MSR: 8000000000009032 <SF,EE,ME,IR,DR,RI>  CR: 28824242  XER: 20000000
      [ 2736.063415] SOFTE: 0
      [ 2736.063418] CFAR: c00000000000908c
      [ 2736.063421] DAR: 0000000000000000, DSISR: 40000000
      [ 2736.063425]
      GPR00: c0000000006c40f8 c0000001319a3af0 c000000001074788 c0000001319a3bf0
      GPR04: 0000000000000000 0000000000000000 0000000000000020 000000000000000a
      GPR08: fffffffe00000002 00000000ffff0000 0000000080000001 c000000000924888
      GPR12: 0000000028824248 c000000007e00400 00001fffffa0f998 0000000000000000
      GPR16: 0000000000000022 00001fffffa0f998 0000010022e92470 0000000000000000
      GPR20: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
      GPR24: 0000000000000000 c000000000f4a828 00003ffffe527108 0000000000000000
      GPR28: c000000000f4a730 c000000000f4a828 0000000000000000 c0000001319a3bf0
      [ 2736.063498] NIP [c00000000037efb0] .__list_add+0x30/0x110
      [ 2736.063504] LR [c0000000006c40f8] .rwsem_down_write_failed+0x78/0x264
      [ 2736.063508] PACATMSCRATCH [800000000280f032]
      [ 2736.063511] Call Trace:
      [ 2736.063516] [c0000001319a3af0] [c0000001319a3b80] 0xc0000001319a3b80 (unreliable)
      [ 2736.063523] [c0000001319a3b80] [c0000000006c40f8] .rwsem_down_write_failed+0x78/0x264
      [ 2736.063530] [c0000001319a3c50] [c0000000006c1bb0] .down_write+0x70/0x78
      [ 2736.063536] [c0000001319a3cd0] [c0000000002e5ffc] .keyctl_get_persistent+0x20c/0x320
      [ 2736.063542] [c0000001319a3dc0] [c0000000002e2388] .SyS_keyctl+0x238/0x260
      [ 2736.063548] [c0000001319a3e30] [c000000000009e7c] syscall_exit+0x0/0x7c
      [ 2736.063553] Instruction dump:
      [ 2736.063556] 7c0802a6 fba1ffe8 fbc1fff0 fbe1fff8 7cbd2b78 7c9e2378 7c7f1b78 f8010010
      [ 2736.063566] f821ff71 e8a50008 7fa52040 40de00c0 <e8be0000> 7fbd2840 40de0094 7fbff040
      [ 2736.063579] ---[ end trace 2708241785538296 ]---
      
      It's caused by uninitialized persistent_keyring_register_sem.
      
      The bug was introduced by commit f36f8c75, two typos are in that commit:
      CONFIG_KEYS_KERBEROS_CACHE should be CONFIG_PERSISTENT_KEYRINGS and
      krb_cache_register_sem should be persistent_keyring_register_sem.
      Signed-off-by: NXiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      6bd364d8
  10. 24 9月, 2013 1 次提交
    • D
      KEYS: Add per-user_namespace registers for persistent per-UID kerberos caches · f36f8c75
      David Howells 提交于
      Add support for per-user_namespace registers of persistent per-UID kerberos
      caches held within the kernel.
      
      This allows the kerberos cache to be retained beyond the life of all a user's
      processes so that the user's cron jobs can work.
      
      The kerberos cache is envisioned as a keyring/key tree looking something like:
      
      	struct user_namespace
      	  \___ .krb_cache keyring		- The register
      		\___ _krb.0 keyring		- Root's Kerberos cache
      		\___ _krb.5000 keyring		- User 5000's Kerberos cache
      		\___ _krb.5001 keyring		- User 5001's Kerberos cache
      			\___ tkt785 big_key	- A ccache blob
      			\___ tkt12345 big_key	- Another ccache blob
      
      Or possibly:
      
      	struct user_namespace
      	  \___ .krb_cache keyring		- The register
      		\___ _krb.0 keyring		- Root's Kerberos cache
      		\___ _krb.5000 keyring		- User 5000's Kerberos cache
      		\___ _krb.5001 keyring		- User 5001's Kerberos cache
      			\___ tkt785 keyring	- A ccache
      				\___ krbtgt/REDHAT.COM@REDHAT.COM big_key
      				\___ http/REDHAT.COM@REDHAT.COM user
      				\___ afs/REDHAT.COM@REDHAT.COM user
      				\___ nfs/REDHAT.COM@REDHAT.COM user
      				\___ krbtgt/KERNEL.ORG@KERNEL.ORG big_key
      				\___ http/KERNEL.ORG@KERNEL.ORG big_key
      
      What goes into a particular Kerberos cache is entirely up to userspace.  Kernel
      support is limited to giving you the Kerberos cache keyring that you want.
      
      The user asks for their Kerberos cache by:
      
      	krb_cache = keyctl_get_krbcache(uid, dest_keyring);
      
      The uid is -1 or the user's own UID for the user's own cache or the uid of some
      other user's cache (requires CAP_SETUID).  This permits rpc.gssd or whatever to
      mess with the cache.
      
      The cache returned is a keyring named "_krb.<uid>" that the possessor can read,
      search, clear, invalidate, unlink from and add links to.  Active LSMs get a
      chance to rule on whether the caller is permitted to make a link.
      
      Each uid's cache keyring is created when it first accessed and is given a
      timeout that is extended each time this function is called so that the keyring
      goes away after a while.  The timeout is configurable by sysctl but defaults to
      three days.
      
      Each user_namespace struct gets a lazily-created keyring that serves as the
      register.  The cache keyrings are added to it.  This means that standard key
      search and garbage collection facilities are available.
      
      The user_namespace struct's register goes away when it does and anything left
      in it is then automatically gc'd.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Tested-by: NSimo Sorce <simo@redhat.com>
      cc: Serge E. Hallyn <serge.hallyn@ubuntu.com>
      cc: Eric W. Biederman <ebiederm@xmission.com>
      f36f8c75
  11. 27 8月, 2013 1 次提交
    • E
      userns: Better restrictions on when proc and sysfs can be mounted · e51db735
      Eric W. Biederman 提交于
      Rely on the fact that another flavor of the filesystem is already
      mounted and do not rely on state in the user namespace.
      
      Verify that the mounted filesystem is not covered in any significant
      way.  I would love to verify that the previously mounted filesystem
      has no mounts on top but there are at least the directories
      /proc/sys/fs/binfmt_misc and /sys/fs/cgroup/ that exist explicitly
      for other filesystems to mount on top of.
      
      Refactor the test into a function named fs_fully_visible and call that
      function from the mount routines of proc and sysfs.  This makes this
      test local to the filesystems involved and the results current of when
      the mounts take place, removing a weird threading of the user
      namespace, the mount namespace and the filesystems themselves.
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      e51db735
  12. 02 5月, 2013 1 次提交
  13. 27 3月, 2013 1 次提交
    • E
      userns: Restrict when proc and sysfs can be mounted · 87a8ebd6
      Eric W. Biederman 提交于
      Only allow unprivileged mounts of proc and sysfs if they are already
      mounted when the user namespace is created.
      
      proc and sysfs are interesting because they have content that is
      per namespace, and so fresh mounts are needed when new namespaces
      are created while at the same time proc and sysfs have content that
      is shared between every instance.
      
      Respect the policy of who may see the shared content of proc and sysfs
      by only allowing new mounts if there was an existing mount at the time
      the user namespace was created.
      
      In practice there are only two interesting cases: proc and sysfs are
      mounted at their usual places, proc and sysfs are not mounted at all
      (some form of mount namespace jail).
      
      Cc: stable@vger.kernel.org
      Acked-by: NSerge Hallyn <serge.hallyn@canonical.com>
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      87a8ebd6
  14. 28 2月, 2013 1 次提交
    • S
      hlist: drop the node parameter from iterators · b67bfe0d
      Sasha Levin 提交于
      I'm not sure why, but the hlist for each entry iterators were conceived
      
              list_for_each_entry(pos, head, member)
      
      The hlist ones were greedy and wanted an extra parameter:
      
              hlist_for_each_entry(tpos, pos, head, member)
      
      Why did they need an extra pos parameter? I'm not quite sure. Not only
      they don't really need it, it also prevents the iterator from looking
      exactly like the list iterator, which is unfortunate.
      
      Besides the semantic patch, there was some manual work required:
      
       - Fix up the actual hlist iterators in linux/list.h
       - Fix up the declaration of other iterators based on the hlist ones.
       - A very small amount of places were using the 'node' parameter, this
       was modified to use 'obj->member' instead.
       - Coccinelle didn't handle the hlist_for_each_entry_safe iterator
       properly, so those had to be fixed up manually.
      
      The semantic patch which is mostly the work of Peter Senna Tschudin is here:
      
      @@
      iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;
      
      type T;
      expression a,c,d,e;
      identifier b;
      statement S;
      @@
      
      -T b;
          <+... when != b
      (
      hlist_for_each_entry(a,
      - b,
      c, d) S
      |
      hlist_for_each_entry_continue(a,
      - b,
      c) S
      |
      hlist_for_each_entry_from(a,
      - b,
      c) S
      |
      hlist_for_each_entry_rcu(a,
      - b,
      c, d) S
      |
      hlist_for_each_entry_rcu_bh(a,
      - b,
      c, d) S
      |
      hlist_for_each_entry_continue_rcu_bh(a,
      - b,
      c) S
      |
      for_each_busy_worker(a, c,
      - b,
      d) S
      |
      ax25_uid_for_each(a,
      - b,
      c) S
      |
      ax25_for_each(a,
      - b,
      c) S
      |
      inet_bind_bucket_for_each(a,
      - b,
      c) S
      |
      sctp_for_each_hentry(a,
      - b,
      c) S
      |
      sk_for_each(a,
      - b,
      c) S
      |
      sk_for_each_rcu(a,
      - b,
      c) S
      |
      sk_for_each_from
      -(a, b)
      +(a)
      S
      + sk_for_each_from(a) S
      |
      sk_for_each_safe(a,
      - b,
      c, d) S
      |
      sk_for_each_bound(a,
      - b,
      c) S
      |
      hlist_for_each_entry_safe(a,
      - b,
      c, d, e) S
      |
      hlist_for_each_entry_continue_rcu(a,
      - b,
      c) S
      |
      nr_neigh_for_each(a,
      - b,
      c) S
      |
      nr_neigh_for_each_safe(a,
      - b,
      c, d) S
      |
      nr_node_for_each(a,
      - b,
      c) S
      |
      nr_node_for_each_safe(a,
      - b,
      c, d) S
      |
      - for_each_gfn_sp(a, c, d, b) S
      + for_each_gfn_sp(a, c, d) S
      |
      - for_each_gfn_indirect_valid_sp(a, c, d, b) S
      + for_each_gfn_indirect_valid_sp(a, c, d) S
      |
      for_each_host(a,
      - b,
      c) S
      |
      for_each_host_safe(a,
      - b,
      c, d) S
      |
      for_each_mesh_entry(a,
      - b,
      c, d) S
      )
          ...+>
      
      [akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
      [akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
      [akpm@linux-foundation.org: checkpatch fixes]
      [akpm@linux-foundation.org: fix warnings]
      [akpm@linux-foudnation.org: redo intrusive kvm changes]
      Tested-by: NPeter Senna Tschudin <peter.senna@gmail.com>
      Acked-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: NSasha Levin <sasha.levin@oracle.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Gleb Natapov <gleb@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b67bfe0d
  15. 27 1月, 2013 1 次提交
    • E
      userns: Avoid recursion in put_user_ns · c61a2810
      Eric W. Biederman 提交于
      When freeing a deeply nested user namespace free_user_ns calls
      put_user_ns on it's parent which may in turn call free_user_ns again.
      When -fno-optimize-sibling-calls is passed to gcc one stack frame per
      user namespace is left on the stack, potentially overflowing the
      kernel stack.  CONFIG_FRAME_POINTER forces -fno-optimize-sibling-calls
      so we can't count on gcc to optimize this code.
      
      Remove struct kref and use a plain atomic_t.  Making the code more
      flexible and easier to comprehend.  Make the loop in free_user_ns
      explict to guarantee that the stack does not overflow with
      CONFIG_FRAME_POINTER enabled.
      
      I have tested this fix with a simple program that uses unshare to
      create a deeply nested user namespace structure and then calls exit.
      With 1000 nesteuser namespaces before this change running my test
      program causes the kernel to die a horrible death.  With 10,000,000
      nested user namespaces after this change my test program runs to
      completion and causes no harm.
      Acked-by: NSerge Hallyn <serge.hallyn@canonical.com>
      Pointed-out-by: NVasily Kulikov <segoon@openwall.com>
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      c61a2810
  16. 20 11月, 2012 1 次提交
    • E
      proc: Usable inode numbers for the namespace file descriptors. · 98f842e6
      Eric W. Biederman 提交于
      Assign a unique proc inode to each namespace, and use that
      inode number to ensure we only allocate at most one proc
      inode for every namespace in proc.
      
      A single proc inode per namespace allows userspace to test
      to see if two processes are in the same namespace.
      
      This has been a long requested feature and only blocked because
      a naive implementation would put the id in a global space and
      would ultimately require having a namespace for the names of
      namespaces, making migration and certain virtualization tricks
      impossible.
      
      We still don't have per superblock inode numbers for proc, which
      appears necessary for application unaware checkpoint/restart and
      migrations (if the application is using namespace file descriptors)
      but that is now allowd by the design if it becomes important.
      
      I have preallocated the ipc and uts initial proc inode numbers so
      their structures can be statically initialized.
      Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>
      98f842e6
  17. 18 9月, 2012 1 次提交
    • E
      userns: Add kprojid_t and associated infrastructure in projid.h · f76d207a
      Eric W. Biederman 提交于
      Implement kprojid_t a cousin of the kuid_t and kgid_t.
      
      The per user namespace mapping of project id values can be set with
      /proc/<pid>/projid_map.
      
      A full compliment of helpers is provided: make_kprojid, from_kprojid,
      from_kprojid_munged, kporjid_has_mapping, projid_valid, projid_eq,
      projid_eq, projid_lt.
      
      Project identifiers are part of the generic disk quota interface,
      although it appears only xfs implements project identifiers currently.
      
      The xfs code allows anyone who has permission to set the project
      identifier on a file to use any project identifier so when
      setting up the user namespace project identifier mappings I do
      not require a capability.
      
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      f76d207a
  18. 20 5月, 2012 1 次提交
  19. 26 4月, 2012 2 次提交
    • E
      userns: Rework the user_namespace adding uid/gid mapping support · 22d917d8
      Eric W. Biederman 提交于
      - Convert the old uid mapping functions into compatibility wrappers
      - Add a uid/gid mapping layer from user space uid and gids to kernel
        internal uids and gids that is extent based for simplicty and speed.
        * Working with number space after mapping uids/gids into their kernel
          internal version adds only mapping complexity over what we have today,
          leaving the kernel code easy to understand and test.
      - Add proc files /proc/self/uid_map /proc/self/gid_map
        These files display the mapping and allow a mapping to be added
        if a mapping does not exist.
      - Allow entering the user namespace without a uid or gid mapping.
        Since we are starting with an existing user our uids and gids
        still have global mappings so are still valid and useful they just don't
        have local mappings.  The requirement for things to work are global uid
        and gid so it is odd but perfectly fine not to have a local uid
        and gid mapping.
        Not requiring global uid and gid mappings greatly simplifies
        the logic of setting up the uid and gid mappings by allowing
        the mappings to be set after the namespace is created which makes the
        slight weirdness worth it.
      - Make the mappings in the initial user namespace to the global
        uid/gid space explicit.  Today it is an identity mapping
        but in the future we may want to twist this for debugging, similar
        to what we do with jiffies.
      - Document the memory ordering requirements of setting the uid and
        gid mappings.  We only allow the mappings to be set once
        and there are no pointers involved so the requirments are
        trivial but a little atypical.
      
      Performance:
      
      In this scheme for the permission checks the performance is expected to
      stay the same as the actuall machine instructions should remain the same.
      
      The worst case I could think of is ls -l on a large directory where
      all of the stat results need to be translated with from kuids and
      kgids to uids and gids.  So I benchmarked that case on my laptop
      with a dual core hyperthread Intel i5-2520M cpu with 3M of cpu cache.
      
      My benchmark consisted of going to single user mode where nothing else
      was running. On an ext4 filesystem opening 1,000,000 files and looping
      through all of the files 1000 times and calling fstat on the
      individuals files.  This was to ensure I was benchmarking stat times
      where the inodes were in the kernels cache, but the inode values were
      not in the processors cache.  My results:
      
      v3.4-rc1:         ~= 156ns (unmodified v3.4-rc1 with user namespace support disabled)
      v3.4-rc1-userns-: ~= 155ns (v3.4-rc1 with my user namespace patches and user namespace support disabled)
      v3.4-rc1-userns+: ~= 164ns (v3.4-rc1 with my user namespace patches and user namespace support enabled)
      
      All of the configurations ran in roughly 120ns when I performed tests
      that ran in the cpu cache.
      
      So in summary the performance impact is:
      1ns improvement in the worst case with user namespace support compiled out.
      8ns aka 5% slowdown in the worst case with user namespace support compiled in.
      Acked-by: NSerge Hallyn <serge.hallyn@canonical.com>
      Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>
      22d917d8
    • E
      userns: Simplify the user_namespace by making userns->creator a kuid. · 783291e6
      Eric W. Biederman 提交于
      - Transform userns->creator from a user_struct reference to a simple
        kuid_t, kgid_t pair.
      
        In cap_capable this allows the check to see if we are the creator of
        a namespace to become the classic suser style euid permission check.
      
        This allows us to remove the need for a struct cred in the mapping
        functions and still be able to dispaly the user namespace creators
        uid and gid as 0.
      
      - Remove the now unnecessary delayed_work in free_user_ns.
      
        All that is left for free_user_ns to do is to call kmem_cache_free
        and put_user_ns.  Those functions can be called in any context
        so call them directly from free_user_ns removing the need for delayed work.
      Acked-by: NSerge Hallyn <serge.hallyn@canonical.com>
      Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>
      783291e6
  20. 08 4月, 2012 2 次提交
  21. 31 10月, 2011 1 次提交
  22. 24 3月, 2011 1 次提交
    • S
      userns: add a user_namespace as creator/owner of uts_namespace · 59607db3
      Serge E. Hallyn 提交于
      The expected course of development for user namespaces targeted
      capabilities is laid out at https://wiki.ubuntu.com/UserNamespace.
      
      Goals:
      
      - Make it safe for an unprivileged user to unshare namespaces.  They
        will be privileged with respect to the new namespace, but this should
        only include resources which the unprivileged user already owns.
      
      - Provide separate limits and accounting for userids in different
        namespaces.
      
      Status:
      
        Currently (as of 2.6.38) you can clone with the CLONE_NEWUSER flag to
        get a new user namespace if you have the CAP_SYS_ADMIN, CAP_SETUID, and
        CAP_SETGID capabilities.  What this gets you is a whole new set of
        userids, meaning that user 500 will have a different 'struct user' in
        your namespace than in other namespaces.  So any accounting information
        stored in struct user will be unique to your namespace.
      
        However, throughout the kernel there are checks which
      
        - simply check for a capability.  Since root in a child namespace
          has all capabilities, this means that a child namespace is not
          constrained.
      
        - simply compare uid1 == uid2.  Since these are the integer uids,
          uid 500 in namespace 1 will be said to be equal to uid 500 in
          namespace 2.
      
        As a result, the lxc implementation at lxc.sf.net does not use user
        namespaces.  This is actually helpful because it leaves us free to
        develop user namespaces in such a way that, for some time, user
        namespaces may be unuseful.
      
      Bugs aside, this patchset is supposed to not at all affect systems which
      are not actively using user namespaces, and only restrict what tasks in
      child user namespace can do.  They begin to limit privilege to a user
      namespace, so that root in a container cannot kill or ptrace tasks in the
      parent user namespace, and can only get world access rights to files.
      Since all files currently belong to the initila user namespace, that means
      that child user namespaces can only get world access rights to *all*
      files.  While this temporarily makes user namespaces bad for system
      containers, it starts to get useful for some sandboxing.
      
      I've run the 'runltplite.sh' with and without this patchset and found no
      difference.
      
      This patch:
      
      copy_process() handles CLONE_NEWUSER before the rest of the namespaces.
      So in the case of clone(CLONE_NEWUSER|CLONE_NEWUTS) the new uts namespace
      will have the new user namespace as its owner.  That is what we want,
      since we want root in that new userns to be able to have privilege over
      it.
      
      Changelog:
      	Feb 15: don't set uts_ns->user_ns if we didn't create
      		a new uts_ns.
      	Feb 23: Move extern init_user_ns declaration from
      		init/version.c to utsname.h.
      Signed-off-by: NSerge E. Hallyn <serge.hallyn@canonical.com>
      Acked-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Acked-by: NDaniel Lezcano <daniel.lezcano@free.fr>
      Acked-by: NDavid Howells <dhowells@redhat.com>
      Cc: James Morris <jmorris@namei.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      59607db3
  23. 30 12月, 2010 1 次提交
  24. 27 10月, 2010 1 次提交
  25. 10 5月, 2010 1 次提交
  26. 03 4月, 2010 1 次提交
  27. 16 3月, 2010 1 次提交
  28. 21 1月, 2010 1 次提交
  29. 02 11月, 2009 1 次提交
    • T
      uids: Prevent tear down race · b00bc0b2
      Thomas Gleixner 提交于
      Ingo triggered the following warning:
      
      WARNING: at lib/debugobjects.c:255 debug_print_object+0x42/0x50()
      Hardware name: System Product Name
      ODEBUG: init active object type: timer_list
      Modules linked in:
      Pid: 2619, comm: dmesg Tainted: G        W  2.6.32-rc5-tip+ #5298
      Call Trace:
       [<81035443>] warn_slowpath_common+0x6a/0x81
       [<8120e483>] ? debug_print_object+0x42/0x50
       [<81035498>] warn_slowpath_fmt+0x29/0x2c
       [<8120e483>] debug_print_object+0x42/0x50
       [<8120ec2a>] __debug_object_init+0x279/0x2d7
       [<8120ecb3>] debug_object_init+0x13/0x18
       [<810409d2>] init_timer_key+0x17/0x6f
       [<81041526>] free_uid+0x50/0x6c
       [<8104ed2d>] put_cred_rcu+0x61/0x72
       [<81067fac>] rcu_do_batch+0x70/0x121
      
      debugobjects warns about an enqueued timer being initialized. If
      CONFIG_USER_SCHED=y the user management code uses delayed work to
      remove the user from the hash table and tear down the sysfs objects.
      
      free_uid is called from RCU and initializes/schedules delayed work if
      the usage count of the user_struct is 0. The init/schedule happens
      outside of the uidhash_lock protected region which allows a concurrent
      caller of find_user() to reference the about to be destroyed
      user_struct w/o preventing the work from being scheduled. If the next
      free_uid call happens before the work timer expired then the active
      timer is initialized and the work scheduled again.
      
      The race was introduced in commit 5cb350ba (sched: group scheduling,
      sysfs tunables) and made more prominent by commit 3959214f (sched:
      delayed cleanup of user_struct)
      
      Move the init/schedule_delayed_work inside of the uidhash_lock
      protected region to prevent the race.
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Acked-by: NDhaval Giani <dhaval@linux.vnet.ibm.com>
      Cc: Paul E. McKenney <paulmck@us.ibm.com>
      Cc: Kay Sievers <kay.sievers@vrfy.org>
      Cc: stable@kernel.org
      b00bc0b2
  30. 16 6月, 2009 1 次提交
    • K
      sched: delayed cleanup of user_struct · 3959214f
      Kay Sievers 提交于
      During bootup performance tracing we see repeated occurrences of
      /sys/kernel/uid/* events for the same uid, leading to a,
      in this case, rather pointless userspace processing for the
      same uid over and over.
      
      This is usually caused by tools which change their uid to "nobody",
      to run without privileges to read data supplied by untrusted users.
      
      This change delays the execution of the (already existing) scheduled
      work, to cleanup the uid after one second, so the allocated and announced
      uid can possibly be re-used by another process.
      
      This is the current behavior, where almost every invocation of a
      binary, which changes the uid, creates two events:
        $ read START < /sys/kernel/uevent_seqnum; \
        for i in `seq 100`; do su --shell=/bin/true bin; done; \
        read END < /sys/kernel/uevent_seqnum; \
        echo $(($END - $START))
        178
      
      With the delayed cleanup, we get only two events, and userspace finishes
      a bit faster too:
        $ read START < /sys/kernel/uevent_seqnum; \
        for i in `seq 100`; do su --shell=/bin/true bin; done; \
        read END < /sys/kernel/uevent_seqnum; \
        echo $(($END - $START))
        1
      Acked-by: NDhaval Giani <dhaval@linux.vnet.ibm.com>
      Signed-off-by: NKay Sievers <kay.sievers@vrfy.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@suse.de>
      3959214f
  31. 11 3月, 2009 1 次提交
  32. 27 2月, 2009 2 次提交
  33. 14 2月, 2009 1 次提交
  34. 09 12月, 2008 1 次提交
  35. 08 12月, 2008 1 次提交