1. 28 May, 2010 1 commit
  2. 12 May, 2010 1 commit
    • revert "procfs: provide stack information for threads" and its fixup commits · 34441427
      Robin Holt committed
      Originally, commit d899bf7b ("procfs: provide stack information for
      threads") attempted to introduce a new feature for showing where the
      thread stack was located and how many pages are being utilized by the
      stack.
      
      Commit c44972f1 ("procfs: disable per-task stack usage on NOMMU") was
      applied to fix the NO_MMU case.
      
      Commit 89240ba0 ("x86, fs: Fix x86 procfs stack information for threads on
      64-bit") was applied to fix a bug in ia32 executables being loaded.
      
      Commit 9ebd4eba ("procfs: fix /proc/<pid>/stat stack pointer for kernel
      threads") was applied to fix a bug which had kernel threads printing a
      userland stack address.
      
      Commit 1306d603 ('proc: partially revert "procfs: provide stack
      information for threads"') was then applied to revert the stack pages
      being used to solve a significant performance regression.
      
      This patch nearly undoes the effect of all these patches.
      
      The reason for reverting these is that they provide an unusable value in
      field 28.  For x86_64, a fork will result in the task->stack_start
      value being updated to the current user top of stack and not the stack
      start address.  This unpredictability of the stack_start value makes
      it worthless.  That includes the intended use of showing how much stack
      space a thread has.
      
      Other architectures will get different values.  As an example, ia64
      gets 0.  The do_fork() and copy_process() functions appear to treat the
      stack_start and stack_size parameters as architecture specific.
      
      I only partially reverted c44972f1 ("procfs: disable per-task stack usage
      on NOMMU").  If I had completely reverted it, I would have had to change
      mm/Makefile to only build pagewalk.o when CONFIG_PROC_PAGE_MONITOR is
      configured.  Since I could not test the builds without significant effort,
      I decided to not change mm/Makefile.
      
      I only partially reverted 89240ba0 ("x86, fs: Fix x86 procfs stack
      information for threads on 64-bit").  I left the KSTK_ESP() change in
      place as that seemed worthwhile.
      Signed-off-by: Robin Holt <holt@sgi.com>
      Cc: Stefani Seibold <stefani@seibold.net>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
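The "field 28" discussed in the revert above is the `startstack` entry of `/proc/<pid>/stat`. As a minimal userspace sketch (the helper name `stat_field` is made up for illustration and is not part of any kernel or libc API), a numeric field can be pulled out of one stat line like this. Counting starts after the last `)` because field 2 (the command name) may itself contain spaces.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: extract numeric field `field` (1-based, >= 3) from
 * one line in /proc/<pid>/stat format. Returns 0 on success, -1 on error
 * (too few fields, or the field is not a number, e.g. the state letter).
 */
static int stat_field(const char *line, int field, unsigned long long *out)
{
	const char *p = strrchr(line, ')');	/* end of field 2 (comm) */
	char *end;
	int cur;

	if (p == NULL || field < 3)
		return -1;
	p++;
	for (cur = 3; ; cur++) {
		while (*p == ' ')		/* skip separators */
			p++;
		if (*p == '\0' || *p == '\n')
			return -1;		/* line ended too early */
		if (cur == field)
			break;
		while (*p != ' ' && *p != '\0' && *p != '\n')
			p++;			/* skip this token */
	}
	*out = strtoull(p, &end, 10);
	return end == p ? -1 : 0;
}
```

A caller would read the single line of `/proc/self/stat` with `fgets()` and pass it in with `field` set to 28. Per the commit message, the value found there was not a reliable stack start address on kernels carrying the reverted patches.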
  3. 30 Mar, 2010 1 commit
    • include cleanup: Update gfp.h and slab.h includes to prepare for breaking... · 5a0e3ad6
      Tejun Heo committed
      include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
      
      percpu.h is included by sched.h and module.h and thus ends up being
      included when building most .c files.  percpu.h includes slab.h which
      in turn includes gfp.h making everything defined by the two files
      universally available and complicating inclusion dependencies.
      
      percpu.h -> slab.h dependency is about to be removed.  Prepare for
      this change by updating users of gfp and slab facilities to include those
      headers directly instead of assuming availability.  As this conversion
      needs to touch a large number of source files, the following script is
      used as the basis of conversion.
      
        http://userweb.kernel.org/~tj/misc/slabh-sweep.py
      
      The script does the following.
      
      * Scan files for gfp and slab usages and update includes such that
        only the necessary includes are there.  ie. if only gfp is used,
        gfp.h, if slab is used, slab.h.
      
      * When the script inserts a new include, it looks at the include
        blocks and tries to put the new include such that its order conforms
        to its surroundings.  It's put in the include block which contains
        core kernel includes, in the same order that the rest are ordered -
        alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
        doesn't seem to be any matching order.
      
      * If the script can't find a place to put a new include (mostly
        because the file doesn't have a fitting include block), it prints out
        an error message indicating which .h file needs to be added to the
        file.
      
      The conversion was done in the following steps.
      
      1. The initial automatic conversion of all .c files updated slightly
         over 4000 files, deleting around 700 includes and adding ~480 gfp.h
         and ~3000 slab.h inclusions.  The script emitted errors for ~400
         files.
      
      2. Each error was manually checked.  Some didn't need the inclusion,
         some needed manual addition while adding it to implementation .h or
         embedding .c file was more appropriate for others.  This step added
         inclusions to around 150 files.
      
      3. The script was run again and the output was compared to the edits
         from #2 to make sure no file was left behind.
      
      4. Several build tests were done and a couple of problems were fixed.
         e.g. lib/decompress_*.c used malloc/free() wrappers around slab
         APIs requiring slab.h to be added manually.
      
      5. The script was run on all .h files but without automatically
         editing them as sprinkling gfp.h and slab.h inclusions around .h
         files could easily lead to inclusion dependency hell.  Most gfp.h
         inclusion directives were ignored as stuff from gfp.h was usually
         widely available and often used in preprocessor macros.  Each
         slab.h inclusion directive was examined and added manually as
         necessary.
      
      6. percpu.h was updated not to include slab.h.
      
      7. Build tests were done on the following configurations and failures
         were fixed.  CONFIG_GCOV_KERNEL was turned off for all tests (as my
         distributed build env didn't work with gcov compiles) and a few
         more options had to be turned off depending on archs to make things
         build (like ipr on powerpc/64 which failed due to missing writeq).
      
         * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
         * powerpc and powerpc64 SMP allmodconfig
         * sparc and sparc64 SMP allmodconfig
         * ia64 SMP allmodconfig
         * s390 SMP allmodconfig
         * alpha SMP allmodconfig
         * um on x86_64 SMP allmodconfig
      
      8. percpu.h modifications were reverted so that it could be applied as
         a separate patch and serve as bisection point.
      
      Given the fact that I had only a couple of failures from tests on step
      6, I'm fairly confident about the coverage of this conversion patch.
      If there is a breakage, it's likely to be something in one of the arch
      headers which should be easily discoverable on most builds of
      the specific arch.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
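The rule the conversion script applies ("if only gfp is used, gfp.h, if slab is used, slab.h") can be illustrated with a deliberately tiny sketch. This is not the slabh-sweep.py logic: the function name `needed_header` and the short symbol list are assumptions chosen for illustration, and the real script handles far more identifiers as well as include ordering.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical classifier: given the text of a source file, return the
 * header it should include directly, or NULL if neither is needed.
 * slab.h takes priority because it already pulls in gfp.h.
 */
static const char *needed_header(const char *src)
{
	static const char *const slab_syms[] = {
		"kmalloc(", "kzalloc(", "kfree(", "kmem_cache_",
	};
	size_t i;

	for (i = 0; i < sizeof(slab_syms) / sizeof(slab_syms[0]); i++)
		if (strstr(src, slab_syms[i]) != NULL)
			return "linux/slab.h";
	if (strstr(src, "GFP_") != NULL)
		return "linux/gfp.h";
	return NULL;
}
```

A plain substring scan like this would misfire on comments and string literals, which is one reason the commit describes a manual review pass after the automatic one.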
  4. 13 Mar, 2010 1 commit
  5. 15 Dec, 2009 1 commit
  6. 04 Nov, 2009 1 commit
    • x86, fs: Fix x86 procfs stack information for threads on 64-bit · 89240ba0
      Stefani Seibold committed
      This patch fixes two issues in the procfs stack information on
      x86-64 linux.
      
      The 32 bit loader compat_do_execve did not store stack
      start. (this was figured out by Alexey Dobriyan).
      
      The stack information on an x86_64 kernel always shows 0 kbyte
      stack usage, because of a missing implementation of the KSTK_ESP
      macro which always returned -1.
      
      The new implementation now returns the right value.
      Signed-off-by: Stefani Seibold <stefani@seibold.net>
      Cc: Americo Wang <xiyou.wangcong@gmail.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      LKML-Reference: <1257240160.4889.24.camel@wall-e>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  7. 24 Sep, 2009 1 commit
    • fs: fix overflow in sys_mount() for in-kernel calls · eca6f534
      Vegard Nossum committed
      sys_mount() reads/copies a whole page for its "type" parameter.  When
      do_mount_root() passes a kernel address that points to an object which is
      smaller than a whole page, copy_mount_options() will happily go past this
      memory object, possibly dereferencing "wild" pointers that could be in any
      state (hence the kmemcheck warning, which shows that parts of the next
      page are not even allocated).
      
      (The likelihood of something going wrong here is pretty low -- first of
      all this only applies to kernel calls to sys_mount(), which are mostly
      found in the boot code.  Secondly, I guess if the page was not mapped,
      exact_copy_from_user() _would_ in fact handle it correctly because of its
      access_ok(), etc.  checks.)
      
      But it is much nicer to avoid the dubious reads altogether, by stopping as
      soon as we find a NUL byte.  Is there a good reason why we can't do
      something like this, using the already existing strndup_from_user()?
      
      [akpm@linux-foundation.org: make copy_mount_string() static]
      [AV: fix compat mount breakage, which involves undoing akpm's change above]
      Reported-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Vegard Nossum <vegard.nossum@gmail.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: al <al@dizzy.pdmi.ras.ru>
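The idea in the sys_mount() fix, stopping at the first NUL byte instead of copying a whole page, has a simple userspace analogue. The sketch below is not the kernel's strndup_user() or copy_mount_string(); `dup_bounded` is a made-up name, and it only shows the bounded-scan pattern with the standard `strnlen()`.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: duplicate a NUL-terminated string, reading at most
 * `max` bytes of the source. Returns NULL if no NUL is found within the
 * limit or if allocation fails. Bytes past the terminator are never read.
 */
static char *dup_bounded(const char *src, size_t max)
{
	size_t len = strnlen(src, max);
	char *copy;

	if (len == max)
		return NULL;		/* unterminated within the limit */
	copy = malloc(len + 1);
	if (copy == NULL)
		return NULL;
	memcpy(copy, src, len + 1);	/* includes the terminating NUL */
	return copy;
}
```

With a page-sized `max`, a short string such as a filesystem type name is copied without touching the rest of the page it happens to live in.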
  8. 23 Sep, 2009 1 commit
    • fix compat_sys_utimensat() · d7d7561c
      Suzuki Poulose committed
      Compat utimensat() returns EINVAL when the tv_nsec is one of UTIME_OMIT or
      UTIME_NOW and the tv_sec is set to non-zero.  As per man pages, the tv_sec
      field should be ignored.
      
      sys_utimensat() works fine in this case.
      
      Test case:
      
      #define _GNU_SOURCE
      #define _ATFILE_SOURCE
      #include <stdio.h>
      #include <fcntl.h>
      #include <unistd.h>
      #include <sys/stat.h>
      #include <stdlib.h>
      
      int main(int argc, char *argv[])
      {
      	struct timespec ts[2];
      	struct timespec *tsp;
      
      	if (argc < 2) {
      		fprintf(stderr, "Usage : %s filename\n", argv[0]);
      		exit (-1);
      	}
      
      	ts[0].tv_nsec = ts[1].tv_nsec = UTIME_NOW;
      	ts[0].tv_sec = ts[1].tv_sec = 1;
      
      	tsp = ts;
      
      	if (utimensat(AT_FDCWD, argv[1],tsp,0) == -1)
      		perror("utimensat");
      	else
      		fprintf(stdout, "utimensat success\n");
      	return 0;
      }
      mjs22lp5:~ # cc -m64 utimensat-test.c -o utimensat_test64
      mjs22lp5:~ # cc -m32 utimensat-test.c -o utimensat_test32
      mjs22lp5:~ # ./utimensat_test32 /tmp/utimensat_test
      utimensat: Invalid argument
      mjs22lp5:~ # ./utimensat_test64 /tmp/utimensat_test
      utimensat success
      mjs22lp5:~ # uname -r
      2.6.31-rc8
      
      With the patch:
      
      mjs22lp5:~ # ./utimensat_test64 /tmp/utimensat_test
      utimensat success
      mjs22lp5:~ # ./utimensat_test32 /tmp/utimensat_test
      utimensat success
      mjs22lp5:~ # uname -r
      2.6.31-rc8utimensat
      Signed-off-by: Suzuki K P <suzuki@in.ibm.com>
      Cc: Ulrich Drepper <drepper@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
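The rule this fix restores, that `tv_sec` is ignored when `tv_nsec` is `UTIME_NOW` or `UTIME_OMIT`, can be written as a small validity check. This is a sketch of the documented behaviour rather than the kernel's code; the function name `utimens_entry_ok` is an assumption, and the fallback macro values are only used if the system headers do not provide them.

```c
#include <stdbool.h>
#include <time.h>
#include <sys/stat.h>

#ifndef UTIME_NOW			/* fallback, values as used on Linux */
#define UTIME_NOW	((1L << 30) - 1L)
#define UTIME_OMIT	((1L << 30) - 2L)
#endif

/*
 * Hypothetical check for one utimensat() timestamp: the special tv_nsec
 * values are accepted regardless of tv_sec; otherwise tv_nsec must be a
 * normal nanosecond count.
 */
static bool utimens_entry_ok(const struct timespec *ts)
{
	if (ts->tv_nsec == UTIME_NOW || ts->tv_nsec == UTIME_OMIT)
		return true;		/* tv_sec is ignored */
	return ts->tv_nsec >= 0 && ts->tv_nsec < 1000000000L;
}
```

The bug reported above corresponds to a check that rejected the first case whenever `tv_sec` was non-zero.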
  9. 06 Sep, 2009 1 commit
    • exec: do not sleep in TASK_TRACED under ->cred_guard_mutex · a2a8474c
      Oleg Nesterov committed
      Tom Horsley reports that his debugger hangs when it tries to read
      /proc/pid_of_tracee/maps, this happens since
      
      	"mm_for_maps: take ->cred_guard_mutex to fix the race with exec"
      	04b836cbf19e885f8366bccb2e4b0474346c02d
      
      commit in 2.6.31.
      
      But the root of the problem lies in the fact that do_execve() path calls
      tracehook_report_exec() which can stop if the tracer sets PT_TRACE_EXEC.
      
      The tracee must not sleep in TASK_TRACED holding this mutex.  Even if we
      remove ->cred_guard_mutex from mm_for_maps() and proc_pid_attr_write(),
      another task doing PTRACE_ATTACH should not hang until it is killed or the
      tracee resumes.
      
      With this patch do_execve() does not use ->cred_guard_mutex directly and
      we do not hold it throughout, instead:
      
      	- introduce prepare_bprm_creds() helper, it locks the mutex
      	  and calls prepare_exec_creds() to initialize bprm->cred.
      
      	- install_exec_creds() drops the mutex after commit_creds(),
      	  and thus before tracehook_report_exec()->ptrace_stop().
      
      	  or, if exec fails,
      
      	  free_bprm() drops this mutex when bprm->cred != NULL which
      	  indicates install_exec_creds() was not called.
      Reported-by: Tom Horsley <tom.horsley@att.net>
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: David Howells <dhowells@redhat.com>
      Cc: Roland McGrath <roland@redhat.com>
      Cc: James Morris <jmorris@namei.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
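The locking hand-off described in this commit follows one invariant: the mutex is held exactly while `bprm->cred` is non-NULL. The model below is a single-threaded toy that only tracks that invariant. All names ending in `_model`, the `struct guard` flag standing in for `->cred_guard_mutex`, and `dummy_cred` are assumptions for illustration, not kernel code.

```c
#include <stdbool.h>
#include <stddef.h>

struct guard { bool held; };		/* stands in for ->cred_guard_mutex */
struct bprm_model {
	struct guard *guard;
	int *cred;			/* non-NULL while creds are pending */
};

static int dummy_cred;			/* placeholder for prepared creds */

/* Like prepare_bprm_creds(): take the mutex, then set up bprm->cred. */
static void prepare_creds_model(struct bprm_model *b)
{
	b->guard->held = true;
	b->cred = &dummy_cred;
}

/* Like install_exec_creds(): commit, then drop the mutex before any stop. */
static void install_creds_model(struct bprm_model *b)
{
	b->cred = NULL;			/* creds committed */
	b->guard->held = false;
}

/* Like free_bprm(): drop the mutex only if install never happened. */
static void free_bprm_model(struct bprm_model *b)
{
	if (b->cred != NULL) {
		b->guard->held = false;
		b->cred = NULL;
	}
}
```

On the success path the mutex is released in `install_creds_model()`, and on the failure path in `free_bprm_model()`, so it is never held at the point where the tracee could stop.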
  10. 13 Jul, 2009 1 commit
  11. 07 Jul, 2009 1 commit
  12. 13 Jun, 2009 1 commit
  13. 12 Jun, 2009 1 commit
  14. 11 May, 2009 1 commit
  15. 24 Apr, 2009 1 commit
    • do_execve() must not clear fs->in_exec if it was set by another thread · 8c652f96
      Oleg Nesterov committed
      If do_execve() fails after check_unsafe_exec(), it clears fs->in_exec
      unconditionally. This is wrong if we race with our sub-thread which
      also does do_execve:
      
      	Two threads T1 and T2 and another process P, all share the same
      	->fs.
      
      	T1 starts do_execve(BAD_FILE). It calls check_unsafe_exec(), since
      	->fs is shared, we set LSM_UNSAFE but not ->in_exec.
      
      	P exits and decrements fs->users.
      
      	T2 starts do_execve(), calls check_unsafe_exec(), now ->fs is not
      	shared, we set fs->in_exec.
      
      	T1 continues, open_exec(BAD_FILE) fails, we clear ->in_exec and
      	return to the user-space.
      
      	T1 does clone(CLONE_FS /* without CLONE_THREAD */).
      
      	T2 continues without LSM_UNSAFE_SHARE while ->fs is shared with
      	another process.
      
      Change check_unsafe_exec() to return res = 1 if we set ->in_exec, and change
      do_execve() to clear ->in_exec depending on res.
      
      When do_execve() succeeds, it is safe to clear ->in_exec unconditionally.
      It can be set only if we don't share ->fs with another process, and since
      we already killed all sub-threads either ->in_exec == 0 or we are the
      only user of this ->fs.
      
      Also, we do not need fs->lock to clear fs->in_exec.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Roland McGrath <roland@redhat.com>
      Acked-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
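The race and its fix can be replayed with a toy model in which the check reports whether this particular caller set `in_exec`, and the failure path clears the flag only in that case. The struct and function names are hypothetical, the sharing test is reduced to a simple count comparison, and the -EAGAIN path is omitted.

```c
/* Toy stand-in for fs_struct: only the two fields the race involves. */
struct fs_model {
	int users;	/* tasks sharing this fs */
	int in_exec;	/* set while an exec owns the fs */
};

/*
 * Returns 1 if this call set in_exec, 0 otherwise. `own_threads` is the
 * number of the caller's own threads sharing fs; any extra user means the
 * fs is shared with another process, so the exec is only marked unsafe.
 */
static int check_unsafe_exec_model(struct fs_model *fs, int own_threads,
				   int *unsafe_share)
{
	if (fs->users > own_threads) {
		*unsafe_share = 1;
		return 0;
	}
	*unsafe_share = 0;
	fs->in_exec = 1;
	return 1;
}

/* Failure path after the fix: clear in_exec only if we were the setter. */
static void exec_failed_model(struct fs_model *fs, int res)
{
	if (res)
		fs->in_exec = 0;
}
```

Replaying the scenario from the commit message: T1 gets `res == 0`, P exits, T2 sets `in_exec`, and T1's failure no longer clears T2's flag.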
  16. 21 Apr, 2009 2 commits
  17. 05 Apr, 2009 1 commit
    • Make non-compat preadv/pwritev use native register size · 601cc11d
      Linus Torvalds committed
      Instead of always splitting the file offset into 32-bit 'high' and 'low'
      parts, just split them into the largest natural word-size - which in C
      terms is 'unsigned long'.
      
      This allows 64-bit architectures to avoid the unnecessary 32-bit
      shifting and masking for native format (while the compat interfaces will
      obviously always have to do it).
      
      This also changes the order of 'high' and 'low' to be "low first".  Why?
      Because when we have it like this, the 64-bit system calls now don't use
      the "pos_high" argument at all, and it makes more sense for the native
      system call to simply match the user-mode prototype.
      
      This results in a much more natural calling convention, and allows the
      compiler to generate much more straightforward code.  On x86-64, we now
      generate
      
              testq   %rcx, %rcx      # pos_l
              js      .L122   #,
              movq    %rcx, -48(%rbp) # pos_l, pos
      
      from the C source
      
              loff_t pos = pos_from_hilo(pos_h, pos_l);
      	...
              if (pos < 0)
                      return -EINVAL;
      
      and the 'pos_h' register isn't even touched.  It used to generate code
      like
      
              mov     %r8d, %r8d      # pos_low, pos_low
              salq    $32, %rcx       #, tmp71
              movq    %r8, %rax       # pos_low, pos.386
              orq     %rcx, %rax      # tmp71, pos.386
              js      .L122   #,
              movq    %rax, -48(%rbp) # pos.386, pos
      
      which isn't _that_ horrible, but it does show how the natural word size
      is just a more sensible interface (same arguments will hold in the user
      level glibc wrapper function, of course, so the kernel side is just half
      of the equation!)
      
      Note: in all cases the user code wrapper can again be the same. You can
      just do
      
      	#define HALF_BITS (sizeof(unsigned long)*4)
      	__syscall(PWRITEV, fd, iov, count, offset, (offset >> HALF_BITS) >> HALF_BITS);
      
      or something like that.  That way the user mode wrapper will also be
      nicely passing in a zero (it won't actually have to do the shifts, the
      compiler will understand what is going on) for the last argument.
      
      And that is a good idea, even if nobody will necessarily ever care: if
      we ever do move to a 128-bit lloff_t, this particular system call might
      be left alone.  Of course, that will be the least of our worries if we
      really ever need to care, so this may not be worth really caring about.
      
      [ Fixed for lost 'loff_t' cast noticed by Andrew Morton ]
      Acked-by: Gerd Hoffmann <kraxel@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: linux-api@vger.kernel.org
      Cc: linux-arch@vger.kernel.org
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
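The double half-width shift from the message above can be tried out in userspace. The sketch below mirrors the idea behind `pos_from_hilo()` using `long long` in place of `loff_t`; `pos_to_hilo` is a made-up helper showing what a wrapper would pass in. Shifting twice by half the width of `unsigned long` keeps each shift count below the operand width, so the same source is well defined with both 32-bit and 64-bit `unsigned long`.

```c
#include <limits.h>

#define HALF_LONG_BITS (sizeof(unsigned long) * CHAR_BIT / 2)

/* Combine the two register-sized halves into one 64-bit position. */
static long long pos_from_hilo(unsigned long high, unsigned long low)
{
	unsigned long long h = (unsigned long long)high;

	return (long long)(((h << HALF_LONG_BITS) << HALF_LONG_BITS) | low);
}

/* Hypothetical wrapper side: split a position into the two arguments. */
static void pos_to_hilo(long long pos, unsigned long *high,
			unsigned long *low)
{
	unsigned long long u = (unsigned long long)pos;

	*low = (unsigned long)u;
	*high = (unsigned long)((u >> HALF_LONG_BITS) >> HALF_LONG_BITS);
}
```

When `unsigned long` is 64 bits wide, `high` always comes out as zero and `low` carries the whole offset, which matches the commit's observation that the high argument is not used there.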
  18. 03 Apr, 2009 4 commits
    • preadv/pwritev: switch compat readv/preadv/writev/pwritev from fget to fget_light · 10c7db27
      Gerd Hoffmann committed
      Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: <linux-api@vger.kernel.org>
      Cc: <linux-arch@vger.kernel.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • preadv/pwritev: Add preadv and pwritev system calls. · f3554f4b
      Gerd Hoffmann committed
      This patch adds preadv and pwritev system calls.  These syscalls are a
      pretty straightforward combination of pread and readv (same for write).
      They are quite useful for doing vectored I/O in threaded applications.
      Using lseek+readv instead opens race windows you'll have to plug with
      locking.
      
      Other systems have such system calls too, for example NetBSD, check
      here: http://www.daemon-systems.org/man/preadv.2.html
      
      The application-visible interface provided by glibc should look like
      this to be compatible to the existing implementations in the *BSD family:
      
        ssize_t preadv(int d, const struct iovec *iov, int iovcnt, off_t offset);
        ssize_t pwritev(int d, const struct iovec *iov, int iovcnt, off_t offset);
      
      This prototype has one problem though: On 32bit archs the (64bit)
      offset argument is unaligned, which the syscall ABI of several archs doesn't
      allow.  At least s390 needs a wrapper in glibc to handle this.  As
      we'll need wrappers in glibc anyway I've decided to push the problem to
      glibc entirely and use a syscall prototype which works without
      arch-specific wrappers inside the kernel: The offset argument is
      explicitly split into two 32bit values.
      
      The patch sports the actual system call implementation and the windup in
      the x86 system call tables.  Other archs follow as separate patches.
      Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: <linux-api@vger.kernel.org>
      Cc: <linux-arch@vger.kernel.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
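From an application's point of view, the benefit described above is a positioned scatter read that never touches the shared file offset. The sketch below assumes a glibc that already exports `preadv()` with the BSD-compatible prototype quoted in the commit message; `read_two_parts` and `preadv_demo` are made-up names for illustration.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1	/* expose preadv() in glibc's <sys/uio.h> */
#endif
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Read a_len bytes into a, then b_len bytes into b, starting at offset. */
static ssize_t read_two_parts(int fd, off_t offset,
			      void *a, size_t a_len, void *b, size_t b_len)
{
	struct iovec iov[2] = {
		{ .iov_base = a, .iov_len = a_len },
		{ .iov_base = b, .iov_len = b_len },
	};

	return preadv(fd, iov, 2, offset);
}

/* Write "0123456789" to a temp file, read 3 + 2 bytes from offset 2. */
static int preadv_demo(void)
{
	char head[4] = "", tail[3] = "";
	FILE *f = tmpfile();
	ssize_t n;

	if (f == NULL)
		return -1;
	if (fwrite("0123456789", 1, 10, f) != 10 || fflush(f) != 0) {
		fclose(f);
		return -1;
	}
	n = read_two_parts(fileno(f), 2, head, 3, tail, 2);
	fclose(f);
	if (n != 5)
		return -1;
	/* head should now hold "234" and tail "56" */
	return (memcmp(head, "234", 3) == 0 &&
		memcmp(tail, "56", 2) == 0) ? 0 : -1;
}
```

Because the offset is an argument of the call, two threads can issue such reads on the same descriptor without the locking that an lseek+readv pair would need.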
    • preadv/pwritev: create compat_writev() · 6949a631
      Gerd Hoffmann committed
      Factor out some code from compat_sys_writev() which can be shared with the
      upcoming compat_sys_pwritev().
      Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: <linux-api@vger.kernel.org>
      Cc: <linux-arch@vger.kernel.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • preadv/pwritev: create compat_readv() · dac12138
      Gerd Hoffmann committed
      This patch series:
      
      Implement the preadv() and pwritev() syscalls.  *BSD has had these
      syscalls for quite some time.
      
      Test code:
      
      #if 0
      set -x
      gcc -Wall -O2 -o preadv $0
      exit 0
      #endif
      /*
       * preadv demo / test
       *
       * (c) 2008 Gerd Hoffmann <kraxel@redhat.com>
       *
       * build with "sh $thisfile"
       */
      
      #include <stdio.h>
      #include <stdlib.h>
      #include <unistd.h>
      #include <errno.h>
      #include <inttypes.h>
      #include <sys/uio.h>
      
      /* ----------------------------------------------------------------- */
      /* syscall windup                                                    */
      
      #include <sys/syscall.h>
      #if 0
      /* WARNING: Be sure you know what you are doing if you enable this.
       * linux syscall code isn't upstream yet, syscall numbers are subject
       * to change */
      # ifndef __NR_preadv
      #  ifdef __i386__
      #   define __NR_preadv  333
      #   define __NR_pwritev 334
      #  endif
      #  ifdef __x86_64__
      #   define __NR_preadv  295
      #   define __NR_pwritev 296
      #  endif
      # endif
      #endif
      #ifndef __NR_preadv
      # error preadv/pwritev syscall numbers are unknown
      #endif
      
      static ssize_t preadv(int fd, const struct iovec *iov, int iovcnt, off_t offset)
      {
          uint32_t pos_high = (offset >> 32) & 0xffffffff;
          uint32_t pos_low  =  offset        & 0xffffffff;
      
          return syscall(__NR_preadv, fd, iov, iovcnt, pos_high, pos_low);
      }
      
      static ssize_t pwritev(int fd, const struct iovec *iov, int iovcnt, off_t offset)
      {
          uint32_t pos_high = (offset >> 32) & 0xffffffff;
          uint32_t pos_low  =  offset        & 0xffffffff;
      
          return syscall(__NR_pwritev, fd, iov, iovcnt, pos_high, pos_low);
      }
      
      /* ----------------------------------------------------------------- */
      /* demo/test app                                                     */
      
      static char filename[] = "/tmp/preadv-XXXXXX";
      static char outbuf[11] = "0123456789";
      static char inbuf[11]  = "----------";
      
      static struct iovec ovec[2] = {{
              .iov_base = outbuf + 5,
              .iov_len  = 5,
          },{
              .iov_base = outbuf + 0,
              .iov_len  = 5,
          }};
      
      static struct iovec ivec[3] = {{
              .iov_base = inbuf + 6,
              .iov_len  = 2,
          },{
              .iov_base = inbuf + 4,
              .iov_len  = 2,
          },{
              .iov_base = inbuf + 2,
              .iov_len  = 2,
          }};
      
      void cleanup(void)
      {
          unlink(filename);
      }
      
      int main(int argc, char **argv)
      {
          int fd, rc;
      
          fd = mkstemp(filename);
          if (-1 == fd) {
              perror("mkstemp");
              exit(1);
          }
          atexit(cleanup);
      
          /* write to file: "56789-01234" */
          rc = pwritev(fd, ovec, 2, 0);
          if (rc < 0) {
              perror("pwritev");
              exit(1);
          }
      
          /* read from file: "78-90-12" */
          rc = preadv(fd, ivec, 3, 2);
          if (rc < 0) {
              perror("preadv");
              exit(1);
          }
      
          printf("result  : %s\n", inbuf);
          printf("expected: %s\n", "--129078--");
          exit(0);
      }
      
      This patch:
      
      Factor out some code from compat_sys_readv() which can be shared with the
      upcoming compat_sys_preadv().
      Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: <linux-api@vger.kernel.org>
      Cc: <linux-arch@vger.kernel.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  19. 01 Apr, 2009 1 commit
    • New locking/refcounting for fs_struct · 498052bb
      Al Viro committed
      * all changes of current->fs are done under task_lock and write_lock of
        old fs->lock
      * refcount is not atomic anymore (same protection)
      * its decrements are done when removing reference from current; at the
        same time we decide whether to free it.
      * put_fs_struct() is gone
      * new field - ->in_exec.  Set by check_unsafe_exec() if we are trying to do
        execve() and only subthreads share fs_struct.  Cleared when finishing exec
        (success and failure alike).  Makes CLONE_FS fail with -EAGAIN if set.
      * check_unsafe_exec() may fail with -EAGAIN if another execve() from subthread
        is in progress.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  20. 29 Mar, 2009 2 commits
    • fix setuid sometimes doesn't · e426b64c
      Hugh Dickins committed
      Joe Malicki reports that setuid sometimes doesn't: very rarely,
      a setuid root program does not get root euid; and, by the way,
      they have a health check running lsof every few minutes.
      
      Right, check_unsafe_exec() notes whether the files_struct is being
      shared by more threads than will get killed by the exec, and if so
      sets LSM_UNSAFE_SHARE to make bprm_set_creds() careful about euid.
      But /proc/<pid>/fd and /proc/<pid>/fdinfo lookups make transient
      use of get_files_struct(), which also raises that sharing count.
      
      There's a rather simple fix for this: exec's check on files->count
      has been redundant ever since 2.6.1 made it unshare_files() (except
      while compat_do_execve() omitted to do so) - just remove that check.
      
      [Note to -stable: this patch will not apply before 2.6.29: earlier
      releases should just remove the files->count line from unsafe_exec().]
      Reported-by: Joe Malicki <jmalicki@metacarta.com>
      Narrowed-down-by: Michael Itz <mitz@metacarta.com>
      Tested-by: Joe Malicki <jmalicki@metacarta.com>
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Cc: stable@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • compat_do_execve should unshare_files · 53e9309e
      Hugh Dickins committed
      2.6.26's commit fd8328be
      "sanitize handling of shared descriptor tables in failing execve()"
      moved the unshare_files() from flush_old_exec() and several binfmts
      to the head of do_execve(); but forgot to make the same change to
      compat_do_execve(), leaving a CLONE_FILES files_struct shared across
      exec from a 32-bit process on a 64-bit kernel.
      
      It's arguable whether the files_struct really ought to be unshared
      across exec; but 2.6.1 made that so to stop the loading binary's fd
      leaking into other threads, and a 32-bit process on a 64-bit kernel
      ought to behave in the same way as 32 on 32 and 64 on 64.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Cc: stable@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  21. 28 Mar, 2009 1 commit
    • generic compat_sys_ustat · 2b1c6bd7
      Christoph Hellwig committed
      Due to a different size of ino_t ustat needs a compat handler, but
      currently only x86 and mips provide one.  Add a generic compat_sys_ustat
      and switch all architectures over to it.  Instead of doing various
      user copy hacks compat_sys_ustat just reimplements sys_ustat as
      it's trivial.  This was suggested by Arnd Bergmann.
      
      Found by Eric Sandeen when running xfstests/017 on ppc64, which causes
      stack smashing warnings on RHEL/Fedora due to the too large amount of
      data written by the syscall.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
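Why a compat handler is needed can be seen from the structure layout: a 32-bit caller provides a buffer sized for 32-bit fields, so writing a native structure with wider fields overruns it. The sketch below models the 32-bit layout with fixed-width types. The struct name `compat_ustat_model` and the helper `fill_compat_ustat` are assumptions for illustration, not the kernel's definitions.

```c
#include <stdint.h>
#include <string.h>

/* Layout a 32-bit caller expects: two 32-bit fields and two name arrays. */
struct compat_ustat_model {
	int32_t  f_tfree;	/* free blocks */
	uint32_t f_tinode;	/* free inodes, 32-bit ino_t */
	char     f_fname[6];
	char     f_fpack[6];
};

/*
 * Hypothetical fill routine: build the 32-bit structure directly from the
 * native values instead of converting a native struct after the fact.
 */
static void fill_compat_ustat(struct compat_ustat_model *dst,
			      long tfree, unsigned long tinode)
{
	memset(dst, 0, sizeof(*dst));
	dst->f_tfree = (int32_t)tfree;
	dst->f_tinode = (uint32_t)tinode;
}
```

The model structure is 20 bytes, whereas a native structure with 64-bit `f_tfree` and `f_tinode` fields is larger, which is consistent with the stack smashing warnings mentioned in the commit message.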
  22. 12 Feb, 2009 1 commit
  23. 07 Feb, 2009 1 commit
    • CRED: Fix SUID exec regression · 0bf2f3ae
      David Howells committed
      The patch:
      
      	commit a6f76f23
      	CRED: Make execve() take advantage of copy-on-write credentials
      
      moved the place in which the 'safeness' of a SUID/SGID exec was performed to
      before de_thread() was called.  This means that LSM_UNSAFE_SHARE is now
      calculated incorrectly.  This flag is set if any of the usage counts for
      fs_struct, files_struct and sighand_struct are greater than 1 at the time the
      determination is made.  All of which are true for threads created by the
      pthread library.
      
      However, since we wish to make the security calculation before irrevocably
      damaging the process so that we can return it an error code in the case where
      we decide we want to reject the exec request on this basis, we have to make the
      determination before calling de_thread().
      
      So, instead, we count up the number of threads (CLONE_THREAD) that are sharing
      our fs_struct (CLONE_FS), files_struct (CLONE_FILES) and sighand_structs
      (CLONE_SIGHAND/CLONE_THREAD) with us.  These will be killed by de_thread() and
      so can be discounted by check_unsafe_exec().
      
      We do have to be careful because CLONE_THREAD does not imply FS or FILES.
      
      We _assume_ that there will be no extra references to these structs held by the
      threads we're going to kill.
      
      This can be tested with the attached pair of programs.  Build the two programs
      using the Makefile supplied, and run ./test1 as a non-root user.  If
      successful, you should see something like:
      
      	[dhowells@andromeda tmp]$ ./test1
      	--TEST1--
      	uid=4043, euid=4043 suid=4043
      	exec ./test2
      	--TEST2--
      	uid=4043, euid=0 suid=0
      	SUCCESS - Correct effective user ID
      
      and if unsuccessful, something like:
      
      	[dhowells@andromeda tmp]$ ./test1
      	--TEST1--
      	uid=4043, euid=4043 suid=4043
      	exec ./test2
      	--TEST2--
      	uid=4043, euid=4043 suid=4043
      	ERROR - Incorrect effective user ID!
      
      The non-root user ID you see will depend on the user you run as.
      
      [test1.c]
      #include <stdio.h>
      #include <stdlib.h>
      #include <unistd.h>
      #include <pthread.h>
      
      static void *thread_func(void *arg)
      {
      	while (1) {}
      }
      
      int main(int argc, char **argv)
      {
      	pthread_t tid;
      	uid_t uid, euid, suid;
      
      	printf("--TEST1--\n");
      	getresuid(&uid, &euid, &suid);
      	printf("uid=%d, euid=%d suid=%d\n", uid, euid, suid);
      
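      	/* pthread_create() returns an error number on failure and does not set errno */
      	if (pthread_create(&tid, NULL, thread_func, NULL) != 0) {
      		fprintf(stderr, "pthread_create failed\n");
      		exit(1);
      	}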
      
      	printf("exec ./test2\n");
      	execlp("./test2", "test2", NULL);
      	perror("./test2");
      	_exit(1);
      }
      
      [test2.c]
      #include <stdio.h>
      #include <stdlib.h>
      #include <unistd.h>
      
      int main(int argc, char **argv)
      {
      	uid_t uid, euid, suid;
      
      	getresuid(&uid, &euid, &suid);
      	printf("--TEST2--\n");
      	printf("uid=%d, euid=%d suid=%d\n", uid, euid, suid);
      
      	if (euid != 0) {
      		fprintf(stderr, "ERROR - Incorrect effective user ID!\n");
      		exit(1);
      	}
      	printf("SUCCESS - Correct effective user ID\n");
      	exit(0);
      }
      
      [Makefile]
      CFLAGS = -D_GNU_SOURCE -Wall -Werror -Wunused
      all: test1 test2
      
      test1: test1.c
      	gcc $(CFLAGS) -o test1 test1.c -lpthread
      
      test2: test2.c
      	gcc $(CFLAGS) -o test2 test2.c
      	sudo chown root:root test2
      	sudo chmod +s test2
      Reported-by: David Smith <dsmith@redhat.com>
      Signed-off-by: David Howells <dhowells@redhat.com>
      Acked-by: David Smith <dsmith@redhat.com>
      Signed-off-by: James Morris <jmorris@namei.org>
  24. 14 Jan, 2009 1 commit
  25. 07 Jan, 2009 1 commit
  26. 14 Nov, 2008 1 commit
    • D
      CRED: Make execve() take advantage of copy-on-write credentials · a6f76f23
      David Howells committed
      Make execve() take advantage of copy-on-write credentials, allowing it to set
      up the credentials in advance, and then commit the whole lot after the point
      of no return.
      
      This patch and the preceding patches have been tested with the LTP SELinux
      testsuite.
      
      This patch makes several logical sets of alterations:
      
       (1) execve().
      
           The credential bits from struct linux_binprm are, for the most part,
           replaced with a single credentials pointer (bprm->cred).  This means that
           all the creds can be calculated in advance and then applied at the point
           of no return with no possibility of failure.
      
           I would like to replace bprm->cap_effective with:
      
      	cap_isclear(bprm->cap_effective)
      
           but this seems impossible due to special behaviour for processes of pid 1
           (they always retain their parent's capability masks where normally they'd
           be changed - see cap_bprm_set_creds()).
      
           The following sequence of events now happens:
      
           (a) At the start of do_execve, the current task's cred_exec_mutex is
               locked to prevent PTRACE_ATTACH from obsoleting the calculation of
               creds that we make.
      
           (b) prepare_exec_creds() is then called to make a copy of the current
               task's credentials and prepare it.  This copy is then assigned to
               bprm->cred.
      
               This renders security_bprm_alloc() and security_bprm_free()
               unnecessary, and so they've been removed.
      
           (c) The determination of unsafe execution is now performed immediately
               after (b) rather than later on in the code.  The result is stored in
               bprm->unsafe for future reference.
      
           (d) prepare_binprm() is called, possibly multiple times.
      
               (i) This applies the result of set[ug]id binaries to the new creds
                   attached to bprm->cred.  Personality bit clearance is recorded,
                   but now deferred on the basis that the exec procedure may yet
                   fail.
      
              (ii) This then calls the new security_bprm_set_creds().  This should
                   calculate the new LSM and capability credentials into *bprm->cred.
      
                   This folds together security_bprm_set() and parts of
                   security_bprm_apply_creds() (these two have been removed).
                   Anything that might fail must be done at this point.
      
             (iii) bprm->cred_prepared is set to 1.
      
                   bprm->cred_prepared is 0 on the first pass of the security
                   calculations, and 1 on all subsequent passes.  This allows SELinux
                   in (ii) to base its calculations only on the initial script and
                   not on the interpreter.
      
           (e) flush_old_exec() is called to commit the task to execution.  This
               performs the following steps with regard to credentials:
      
               (i) Clear pdeath_signal and set dumpable in certain circumstances that
                   may not be covered by commit_creds().
      
              (ii) Clear any bits in current->personality that were deferred from
                   (d.i).
      
           (f) install_exec_creds() [compute_creds() as was] is called to install the
               new credentials.  This performs the following steps with regard to
               credentials:
      
               (i) Calls security_bprm_committing_creds() to apply any security
                   requirements, such as flushing unauthorised files in SELinux, that
                   must be done before the credentials are changed.
      
                   This is made up of bits of security_bprm_apply_creds() and
                   security_bprm_post_apply_creds(), both of which have been removed.
                   This function is not allowed to fail; anything that might fail
                   must have been done in (d.ii).
      
              (ii) Calls commit_creds() to apply the new credentials in a single
                   assignment (more or less).  Possibly pdeath_signal and dumpable
                   should be part of struct cred.
      
             (iii) Unlocks the task's cred_exec_mutex, thus allowing
                   PTRACE_ATTACH to take place.
      
              (iv) Clears the bprm->cred pointer as the credentials it was holding
                   are now immutable.
      
               (v) Calls security_bprm_committed_creds() to apply any security
                   alterations that must be done after the creds have been changed.
                   SELinux uses this to flush signals and signal handlers.
      
           (g) If an error occurs before (e.i), bprm_free() will call abort_creds()
               to destroy the proposed new credentials and will then unlock
               cred_exec_mutex.  No changes to the credentials will have been
               made.
      
       (2) LSM interface.
      
           A number of functions have been changed, added or removed:
      
           (*) security_bprm_alloc(), ->bprm_alloc_security()
           (*) security_bprm_free(), ->bprm_free_security()
      
               Removed in favour of preparing new credentials and modifying those.
      
           (*) security_bprm_apply_creds(), ->bprm_apply_creds()
           (*) security_bprm_post_apply_creds(), ->bprm_post_apply_creds()
      
               Removed; split between security_bprm_set_creds(),
               security_bprm_committing_creds() and security_bprm_committed_creds().
      
           (*) security_bprm_set(), ->bprm_set_security()
      
               Removed; folded into security_bprm_set_creds().
      
           (*) security_bprm_set_creds(), ->bprm_set_creds()
      
               New.  The new credentials in bprm->cred should be checked and set up
               as appropriate.  bprm->cred_prepared is 0 on the first call, 1 on the
               second and subsequent calls.
      
           (*) security_bprm_committing_creds(), ->bprm_committing_creds()
           (*) security_bprm_committed_creds(), ->bprm_committed_creds()
      
               New.  Apply the security effects of the new credentials.  This
               includes closing unauthorised files in SELinux.  These functions may
               not fail.  When the former is called, the creds haven't yet been
               applied to the process; when the latter is called, they have.
      
               The former may access bprm->cred, the latter may not.
      
       (3) SELinux.
      
           SELinux has a number of changes, in addition to those to support the LSM
           interface changes mentioned above:
      
           (a) The bprm_security_struct struct has been removed in favour of using
               the credentials-under-construction approach.
      
           (c) flush_unauthorized_files() now takes a cred pointer and passes it on
               to inode_has_perm(), file_has_perm() and dentry_open().
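      
      The prepare, modify, then commit-or-abort pattern described above can be
      illustrated with a small userspace sketch.  It is not the kernel
      implementation: the reduced struct cred, struct task and every function with
      a _sketch suffix are hypothetical, and locking and LSM hooks are left out.
      
      ```c
      #include <stdio.h>
      #include <stdlib.h>
      
      /* Hypothetical, heavily reduced credential set. */
      struct cred { unsigned int uid, euid; };
      
      /* Live credentials are never modified in place, hence the const pointer. */
      struct task { const struct cred *cred; };
      
      /* Copy the current credentials so they can be altered privately. */
      static struct cred *prepare_creds_sketch(const struct task *t)
      {
      	struct cred *new = malloc(sizeof(*new));
      
      	if (new)
      		*new = *t->cred;
      	return new;
      }
      
      /* Install the prepared copy with one pointer assignment; cannot fail. */
      static void commit_creds_sketch(struct task *t, struct cred *new)
      {
      	const struct cred *old = t->cred;
      
      	t->cred = new;
      	free((void *)old);
      }
      
      /* Discard the prepared copy; the task itself is left untouched. */
      static void abort_creds_sketch(struct cred *new)
      {
      	free(new);
      }
      
      /* All fallible work is done on the copy, before the point of no return. */
      static int exec_sketch(struct task *t, unsigned int file_owner,
      		       int is_setuid, int fail_before_commit)
      {
      	struct cred *new = prepare_creds_sketch(t);
      
      	if (!new)
      		return -1;
      	if (is_setuid)
      		new->euid = file_owner;
      	if (fail_before_commit) {
      		abort_creds_sketch(new);
      		return -1;
      	}
      	commit_creds_sketch(t, new);	/* point of no return */
      	return 0;
      }
      
      int main(void)
      {
      	struct cred *init = malloc(sizeof(*init));
      	struct task t;
      
      	if (!init)
      		return 1;
      	init->uid = 4043;
      	init->euid = 4043;
      	t.cred = init;
      
      	exec_sketch(&t, 0, 1, 1);
      	printf("after failed exec:     euid=%u\n", t.cred->euid);
      	exec_sketch(&t, 0, 1, 0);
      	printf("after successful exec: euid=%u\n", t.cred->euid);
      
      	free((void *)t.cred);
      	return 0;
      }
      ```
      
      Because the task only ever sees a fully prepared credential set, a failure
      before the commit leaves the old credentials exactly as they were.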
      Signed-off-by: David Howells <dhowells@redhat.com>
      Acked-by: James Morris <jmorris@namei.org>
      Acked-by: Serge Hallyn <serue@us.ibm.com>
      Signed-off-by: James Morris <jmorris@namei.org>
  27. 27 Oct, 2008 1 commit
  28. 23 Oct, 2008 1 commit
  29. 17 Oct, 2008 2 commits
  30. 06 Sep, 2008 2 commits
  31. 25 Aug, 2008 1 commit
  32. 27 Jul, 2008 1 commit
    • A
      [PATCH] sanitize __user_walk_fd() et al. · 2d8f3038
      Al Viro committed
      * do not pass nameidata; struct path is all the callers want.
      * switch to new helpers:
      	user_path_at(dfd, pathname, flags, &path)
      	user_path(pathname, &path)
      	user_lpath(pathname, &path)
      	user_path_dir(pathname, &path)  (fail if not a directory)
        The last 3 are trivial macro wrappers for the first one.
      * remove nameidata in callers.
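      
      The relationship between the four helpers can be sketched as follows.  This is
      a userspace illustration only: the _sketch names, struct path_sketch and the
      numeric flag values are hypothetical, and the flags each wrapper passes are an
      assumption about how such wrappers would plausibly be defined, not a copy of
      the kernel headers.
      
      ```c
      #include <stdio.h>
      
      /* Hypothetical stand-ins; values are chosen for illustration only. */
      #define AT_FDCWD_SKETCH          (-100)
      #define LOOKUP_FOLLOW_SKETCH     0x1u
      #define LOOKUP_DIRECTORY_SKETCH  0x2u
      
      struct path_sketch {
      	int dfd;
      	unsigned int flags;
      	const char *name;
      };
      
      /* The one real helper: records what it was asked to look up. */
      static int user_path_at_sketch(int dfd, const char *name, unsigned int flags,
      			       struct path_sketch *path)
      {
      	path->dfd = dfd;
      	path->flags = flags;
      	path->name = name;
      	return 0;
      }
      
      /* The other three are thin macro wrappers around the first. */
      #define user_path_sketch(name, path) \
      	user_path_at_sketch(AT_FDCWD_SKETCH, name, LOOKUP_FOLLOW_SKETCH, path)
      #define user_lpath_sketch(name, path) \
      	user_path_at_sketch(AT_FDCWD_SKETCH, name, 0, path)
      #define user_path_dir_sketch(name, path) \
      	user_path_at_sketch(AT_FDCWD_SKETCH, name, \
      			    LOOKUP_FOLLOW_SKETCH | LOOKUP_DIRECTORY_SKETCH, path)
      
      int main(void)
      {
      	struct path_sketch p;
      
      	user_path_sketch("/tmp/link", &p);
      	printf("user_path:     flags=%#x\n", p.flags);
      	user_lpath_sketch("/tmp/link", &p);
      	printf("user_lpath:    flags=%#x\n", p.flags);
      	user_path_dir_sketch("/tmp", &p);
      	printf("user_path_dir: flags=%#x\n", p.flags);
      	return 0;
      }
      ```
      
      Callers receive only the resulting path structure, which is the point of the
      change: no lookup state has to be set up or torn down by the caller.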
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  33. 25 Jul, 2008 1 commit
    • U
      flag parameters: signalfd · 9deb27ba
      Ulrich Drepper committed
      This patch adds the new signalfd4 syscall.  It extends the old signalfd
      syscall by one parameter which is meant to hold a flag value.  In this
      patch the only flag supported is SFD_CLOEXEC, which causes the close-on-exec
      flag for the returned file descriptor to be set.
      
      A new name SFD_CLOEXEC is introduced which in this implementation must
      have the same value as O_CLOEXEC.
      
      The following test must be adjusted for architectures other than x86 and
      x86-64, and if the syscall numbers have changed.
      
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      #include <fcntl.h>
      #include <signal.h>
      #include <stdio.h>
      #include <unistd.h>
      #include <sys/syscall.h>
      
      #ifndef __NR_signalfd4
      # ifdef __x86_64__
      #  define __NR_signalfd4 289
      # elif defined __i386__
      #  define __NR_signalfd4 327
      # else
      #  error "need __NR_signalfd4"
      # endif
      #endif
      
      #define SFD_CLOEXEC O_CLOEXEC
      
      int
      main (void)
      {
        sigset_t ss;
        sigemptyset (&ss);
        sigaddset (&ss, SIGUSR1);
        int fd = syscall (__NR_signalfd4, -1, &ss, 8, 0);
        if (fd == -1)
          {
            puts ("signalfd4(0) failed");
            return 1;
          }
        int coe = fcntl (fd, F_GETFD);
        if (coe == -1)
          {
            puts ("fcntl failed");
            return 1;
          }
        if (coe & FD_CLOEXEC)
          {
            puts ("signalfd4(0) set close-on-exec flag");
            return 1;
          }
        close (fd);
      
        fd = syscall (__NR_signalfd4, -1, &ss, 8, SFD_CLOEXEC);
        if (fd == -1)
          {
            puts ("signalfd4(SFD_CLOEXEC) failed");
            return 1;
          }
        coe = fcntl (fd, F_GETFD);
        if (coe == -1)
          {
            puts ("fcntl failed");
            return 1;
          }
        if ((coe & FD_CLOEXEC) == 0)
          {
            puts ("signalfd4(SFD_CLOEXEC) does not set close-on-exec flag");
            return 1;
          }
        close (fd);
      
        puts ("OK");
      
        return 0;
      }
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
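      
      On systems whose C library provides a signalfd() wrapper with a flags argument
      and defines SFD_CLOEXEC in <sys/signalfd.h> (as glibc does), the same check
      can be written without raw syscall numbers.  The sketch below assumes such a
      library; signalfd_cloexec_state() is a hypothetical helper name.
      
      ```c
      #define _GNU_SOURCE
      #include <fcntl.h>
      #include <signal.h>
      #include <stdio.h>
      #include <sys/signalfd.h>
      #include <unistd.h>
      
      /* Return 1 if close-on-exec is set on a fresh signalfd created with the
       * given flags, 0 if it is not, and -1 on error. */
      static int signalfd_cloexec_state(int flags)
      {
      	sigset_t ss;
      	int fd, coe;
      
      	sigemptyset(&ss);
      	sigaddset(&ss, SIGUSR1);
      
      	fd = signalfd(-1, &ss, flags);
      	if (fd == -1)
      		return -1;
      	coe = fcntl(fd, F_GETFD);
      	close(fd);
      	if (coe == -1)
      		return -1;
      	return (coe & FD_CLOEXEC) != 0;
      }
      
      int main(void)
      {
      	printf("flags=0:           %d\n", signalfd_cloexec_state(0));
      	printf("flags=SFD_CLOEXEC: %d\n", signalfd_cloexec_state(SFD_CLOEXEC));
      	return 0;
      }
      ```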
      
      [akpm@linux-foundation.org: add sys_ni stub]
      Signed-off-by: Ulrich Drepper <drepper@redhat.com>
      Acked-by: Davide Libenzi <davidel@xmailserver.org>
      Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
      Cc: <linux-arch@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>