1. 19 11月, 2012 4 次提交
    • E
      pidns: Make the pidns proc mount/umount logic obvious. · 0a01f2cc
      Eric W. Biederman 提交于
      Track the number of pids in the proc hash table.  When the number of
      pids goes to 0 schedule work to unmount the kernel mount of proc.
      
      Move the mount of proc into alloc_pid when we allocate the pid for
      init.
      
      Remove the surprising calls of pid_ns_release proc in fork and
      proc_flush_task.  Those code paths really shouldn't know about proc
      namespace implementation details and people have demonstrated several
      times that finding and understanding those code paths is difficult and
      non-obvious.
      
      Because of the call path detach pid is alwasy called with the
      rtnl_lock held free_pid is not allowed to sleep, so the work to
      unmounting proc is moved to a work queue.  This has the side benefit
      of not blocking the entire world waiting for the unnecessary
      rcu_barrier in deactivate_locked_super.
      
      In the process of making the code clear and obvious this fixes a bug
      reported by Gao feng <gaofeng@cn.fujitsu.com> where we would leak a
      mount of proc during clone(CLONE_NEWPID|CLONE_NEWNET) if copy_pid_ns
      succeeded and copy_net_ns failed.
      Acked-by: N"Serge E. Hallyn" <serge@hallyn.com>
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      0a01f2cc
    • E
      pidns: Use task_active_pid_ns where appropriate · 17cf22c3
      Eric W. Biederman 提交于
      The expressions tsk->nsproxy->pid_ns and task_active_pid_ns
      aka ns_of_pid(task_pid(tsk)) should have the same number of
      cache line misses with the practical difference that
      ns_of_pid(task_pid(tsk)) is released later in a processes life.
      
      Furthermore by using task_active_pid_ns it becomes trivial
      to write an unshare implementation for the the pid namespace.
      
      So I have used task_active_pid_ns everywhere I can.
      
      In fork since the pid has not yet been attached to the
      process I use ns_of_pid, to achieve the same effect.
      Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>
      17cf22c3
    • E
      procfs: Don't cache a pid in the root inode. · ae06c7c8
      Eric W. Biederman 提交于
      Now that we have s_fs_info pointing to our pid namespace
      the original reason for the proc root inode having a struct
      pid is gone.
      
      Caching a pid in the root inode has led to some complicated
      code.  Now that we don't need the struct pid, just remove it.
      Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>
      ae06c7c8
    • E
      procfs: Use the proc generic infrastructure for proc/self. · e656d8a6
      Eric W. Biederman 提交于
      I had visions at one point of splitting proc into two filesystems.  If
      that had happened proc/self being the the part of proc that actually deals
      with pids would have been a nice cleanup.  As it is proc/self requires
      a lot of unnecessary infrastructure for a single file.
      
      The only user visible change is that a mounted /proc for a pid namespace
      that is dead now shows a broken proc symlink, instead of being completely
      invisible.  I don't think anyone will notice or care.
      Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>
      e656d8a6
  2. 20 10月, 2012 1 次提交
  3. 17 10月, 2012 1 次提交
    • D
      mm, mempolicy: fix printing stack contents in numa_maps · 32f8516a
      David Rientjes 提交于
      When reading /proc/pid/numa_maps, it's possible to return the contents of
      the stack where the mempolicy string should be printed if the policy gets
      freed from beneath us.
      
      This happens because mpol_to_str() may return an error the
      stack-allocated buffer is then printed without ever being stored.
      
      There are two possible error conditions in mpol_to_str():
      
       - if the buffer allocated is insufficient for the string to be stored,
         and
      
       - if the mempolicy has an invalid mode.
      
      The first error condition is not triggered in any of the callers to
      mpol_to_str(): at least 50 bytes is always allocated on the stack and this
      is sufficient for the string to be written.  A future patch should convert
      this into BUILD_BUG_ON() since we know the maximum strlen possible, but
      that's not -rc material.
      
      The second error condition is possible if a race occurs in dropping a
      reference to a task's mempolicy causing it to be freed during the read().
      The slab poison value is then used for the mode and mpol_to_str() returns
      -EINVAL.
      
      This race is only possible because get_vma_policy() believes that
      mm->mmap_sem protects task->mempolicy, which isn't true.  The exit path
      does not hold mm->mmap_sem when dropping the reference or setting
      task->mempolicy to NULL: it uses task_lock(task) instead.
      
      Thus, it's required for the caller of a task mempolicy to hold
      task_lock(task) while grabbing the mempolicy and reading it.  Callers with
      a vma policy store their mempolicy earlier and can simply increment the
      reference count so it's guaranteed not to be freed.
      Reported-by: NDave Jones <davej@redhat.com>
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      32f8516a
  4. 13 10月, 2012 1 次提交
  5. 10 10月, 2012 1 次提交
  6. 09 10月, 2012 5 次提交
    • N
      kpageflags: fix wrong KPF_THP on non-huge compound pages · 7a71932d
      Naoya Horiguchi 提交于
      KPF_THP can be set on non-huge compound pages (like slab pages or pages
      allocated by drivers with __GFP_COMP) because PageTransCompound only
      checks PG_head and PG_tail.  Obviously this is a bug and breaks user space
      applications which look for thp via /proc/kpageflags.
      
      This patch rules out setting KPF_THP wrongly by additionally checking
      PageLRU on the head pages.
      Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Reviewed-by: NFengguang Wu <fengguang.wu@intel.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7a71932d
    • M
      rbtree: fix incorrect rbtree node insertion in fs/proc/proc_sysctl.c · ea5272f5
      Michel Lespinasse 提交于
      The recently added code to use rbtrees in sysctl did not follow the proper
      rbtree interface on insertion - it was calling rb_link_node() which
      inserts a new node into the binary tree, but missed the call to
      rb_insert_color() which properly balances the rbtree and establishes all
      expected rbtree invariants.
      
      I found out about this only because faulty commit also used
      rb_init_node(), which I am removing within this patchset.  But I think
      it's an easy mistake to make, and it makes me wonder if we should change
      the rbtree API so that insertions would be done with a single rb_insert()
      call (even if its implementation could still inline the rb_link_node()
      part and call a private __rb_insert_color function to do the rebalancing).
      Signed-off-by: NMichel Lespinasse <walken@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: NDavid Woodhouse <David.Woodhouse@intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Daniel Santos <daniel.santos@pobox.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ea5272f5
    • M
      rbtree: empty nodes have no color · 4c199a93
      Michel Lespinasse 提交于
      Empty nodes have no color.  We can make use of this property to simplify
      the code emitted by the RB_EMPTY_NODE and RB_CLEAR_NODE macros.  Also,
      we can get rid of the rb_init_node function which had been introduced by
      commit 88d19cf3 ("timers: Add rb_init_node() to allow for stack
      allocated rb nodes") to avoid some issue with the empty node's color not
      being initialized.
      
      I'm not sure what the RB_EMPTY_NODE checks in rb_prev() / rb_next() are
      doing there, though.  axboe introduced them in commit 10fd48f2
      ("rbtree: fixed reversed RB_EMPTY_NODE and rb_next/prev").  The way I
      see it, the 'empty node' abstraction is only used by rbtree users to
      flag nodes that they haven't inserted in any rbtree, so asking the
      predecessor or successor of such nodes doesn't make any sense.
      
      One final rb_init_node() caller was recently added in sysctl code to
      implement faster sysctl name lookups.  This code doesn't make use of
      RB_EMPTY_NODE at all, and from what I could see it only called
      rb_init_node() under the mistaken assumption that such initialization was
      required before node insertion.
      
      [sfr@canb.auug.org.au: fix net/ceph/osd_client.c build]
      Signed-off-by: NMichel Lespinasse <walken@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: NDavid Woodhouse <David.Woodhouse@intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Daniel Santos <daniel.santos@pobox.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Signed-off-by: NStephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4c199a93
    • D
      oom: remove deprecated oom_adj · 01dc52eb
      Davidlohr Bueso 提交于
      The deprecated /proc/<pid>/oom_adj is scheduled for removal this month.
      Signed-off-by: NDavidlohr Bueso <dave@gnu.org>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      01dc52eb
    • K
      mm: kill vma flag VM_RESERVED and mm->reserved_vm counter · 314e51b9
      Konstantin Khlebnikov 提交于
      A long time ago, in v2.4, VM_RESERVED kept swapout process off VMA,
      currently it lost original meaning but still has some effects:
      
       | effect                 | alternative flags
      -+------------------------+---------------------------------------------
      1| account as reserved_vm | VM_IO
      2| skip in core dump      | VM_IO, VM_DONTDUMP
      3| do not merge or expand | VM_IO, VM_DONTEXPAND, VM_HUGETLB, VM_PFNMAP
      4| do not mlock           | VM_IO, VM_DONTEXPAND, VM_HUGETLB, VM_PFNMAP
      
      This patch removes reserved_vm counter from mm_struct.  Seems like nobody
      cares about it, it does not exported into userspace directly, it only
      reduces total_vm showed in proc.
      
      Thus VM_RESERVED can be replaced with VM_IO or pair VM_DONTEXPAND | VM_DONTDUMP.
      
      remap_pfn_range() and io_remap_pfn_range() set VM_IO|VM_DONTEXPAND|VM_DONTDUMP.
      remap_vmalloc_range() set VM_DONTEXPAND | VM_DONTDUMP.
      
      [akpm@linux-foundation.org: drivers/vfio/pci/vfio_pci.c fixup]
      Signed-off-by: NKonstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Carsten Otte <cotte@de.ibm.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Morris <james.l.morris@oracle.com>
      Cc: Jason Baron <jbaron@redhat.com>
      Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Robert Richter <robert.richter@amd.com>
      Cc: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Venkatesh Pallipadi <venki@google.com>
      Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      314e51b9
  7. 06 10月, 2012 6 次提交
  8. 27 9月, 2012 5 次提交
  9. 18 9月, 2012 3 次提交
    • E
      userns: Add kprojid_t and associated infrastructure in projid.h · f76d207a
      Eric W. Biederman 提交于
      Implement kprojid_t a cousin of the kuid_t and kgid_t.
      
      The per user namespace mapping of project id values can be set with
      /proc/<pid>/projid_map.
      
      A full compliment of helpers is provided: make_kprojid, from_kprojid,
      from_kprojid_munged, kporjid_has_mapping, projid_valid, projid_eq,
      projid_eq, projid_lt.
      
      Project identifiers are part of the generic disk quota interface,
      although it appears only xfs implements project identifiers currently.
      
      The xfs code allows anyone who has permission to set the project
      identifier on a file to use any project identifier so when
      setting up the user namespace project identifier mappings I do
      not require a capability.
      
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      f76d207a
    • E
      userns: Convert the audit loginuid to be a kuid · e1760bd5
      Eric W. Biederman 提交于
      Always store audit loginuids in type kuid_t.
      
      Print loginuids by converting them into uids in the appropriate user
      namespace, and then printing the resulting uid.
      
      Modify audit_get_loginuid to return a kuid_t.
      
      Modify audit_set_loginuid to take a kuid_t.
      
      Modify /proc/<pid>/loginuid on read to convert the loginuid into the
      user namespace of the opener of the file.
      
      Modify /proc/<pid>/loginud on write to convert the loginuid
      rom the user namespace of the opener of the file.
      
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: Paul Moore <paul@paul-moore.com> ?
      Cc: David Miller <davem@davemloft.net>
      Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>
      e1760bd5
    • F
      fs/proc: fix potential unregister_sysctl_table hang · 6bf61045
      Francesco Ruggeri 提交于
      The unregister_sysctl_table() function hangs if all references to its
      ctl_table_header structure are not dropped.
      
      This can happen sometimes because of a leak in proc_sys_lookup():
      proc_sys_lookup() gets a reference to the table via lookup_entry(), but
      it does not release it when a subsequent call to sysctl_follow_link()
      fails.
      
      This patch fixes this leak by making sure the reference is always
      dropped on return.
      
      See also commit 076c3eed ("sysctl: Rewrite proc_sys_lookup
      introducing find_entry and lookup_entry") which reorganized this code in
      3.4.
      
      Tested in Linux 3.4.4.
      Signed-off-by: NFrancesco Ruggeri <fruggeri@aristanetworks.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6bf61045
  10. 31 7月, 2012 2 次提交
    • D
      proc: do not allow negative offsets on /proc/<pid>/environ · bc452b4b
      Djalal Harouni 提交于
      __mem_open() which is called by both /proc/<pid>/environ and
      /proc/<pid>/mem ->open() handlers will allow the use of negative offsets.
      /proc/<pid>/mem has negative offsets but not /proc/<pid>/environ.
      
      Clean this by moving the 'force FMODE_UNSIGNED_OFFSET flag' to mem_open()
      to allow negative offsets only on /proc/<pid>/mem.
      Signed-off-by: NDjalal Harouni <tixxdz@opendz.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Brad Spengler <spender@grsecurity.net>
      Acked-by: NKees Cook <keescook@chromium.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bc452b4b
    • D
      proc: environ_read() make sure offset points to environment address range · e8905ec2
      Djalal Harouni 提交于
      Currently the following offset and environment address range check in
      environ_read() of /proc/<pid>/environ is buggy:
      
        int this_len = mm->env_end - (mm->env_start + src);
        if (this_len <= 0)
          break;
      
      Large or negative offsets on /proc/<pid>/environ converted to 'unsigned
      long' may pass this check since '(mm->env_start + src)' can overflow and
      'this_len' will be positive.
      
      This can turn /proc/<pid>/environ to act like /proc/<pid>/mem since
      (mm->env_start + src) will point and read from another VMA.
      
      There are two fixes here plus some code cleaning:
      
      1) Fix the overflow by checking if the offset that was converted to
         unsigned long will always point to the [mm->env_start, mm->env_end]
         address range.
      
      2) Remove the truncation that was made to the result of the check,
         storing the result in 'int this_len' will alter its value and we can
         not depend on it.
      
      For kernels that have commit b409e578 ("proc: clean up
      /proc/<pid>/environ handling") which adds the appropriate ptrace check and
      saves the 'mm' at ->open() time, this is not a security issue.
      
      This patch is taken from the grsecurity patch since it was just made
      available.
      Signed-off-by: NDjalal Harouni <tixxdz@opendz.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Brad Spengler <spender@grsecurity.net>
      Acked-by: NKees Cook <keescook@chromium.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e8905ec2
  11. 14 7月, 2012 5 次提交
  12. 11 7月, 2012 1 次提交
  13. 05 6月, 2012 1 次提交
    • L
      vfs: Fix /proc/<tid>/fdinfo/<fd> file handling · 0640113b
      Linus Torvalds 提交于
      Cyrill Gorcunov reports that I broke the fdinfo files with commit
      30a08bf2 ("proc: move fd symlink i_mode calculations into
      tid_fd_revalidate()"), and he's quite right.
      
      The tid_fd_revalidate() function is not just used for the <tid>/fd
      symlinks, it's also used for the <tid>/fdinfo/<fd> files, and the
      permission model for those are different.
      
      So do the dynamic symlink permission handling just for symlinks, making
      the fdinfo files once more appear as the proper regular files they are.
      
      Of course, Al Viro argued (probably correctly) that we shouldn't do the
      symlink permission games at all, and make the symlinks always just be
      the normal 'lrwxrwxrwx'.  That would have avoided this issue too, but
      since somebody noticed that the permissions had changed (which was the
      reason for that original commit 30a08bf2 in the first place), people
      do apparently use this feature.
      
      [ Basically, you can use the symlink permission data as a cheap "fdinfo"
        replacement, since you see whether the file is open for reading and/or
        writing by just looking at st_mode of the symlink.  So the feature
        does make sense, even if the pain it has caused means we probably
        shouldn't have done it to begin with. ]
      Reported-and-tested-by: NCyrill Gorcunov <gorcunov@openvz.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0640113b
  14. 01 6月, 2012 4 次提交