1. 01 5月, 2013 28 次提交
    • A
      kernel/sys.c: make prctl(PR_SET_MM) generally available · 52b36941
      Amnon Shiloh 提交于
      The purpose of this patch is to allow privileged processes to set
      their own per-memory memory-region fields:
      
            start_code, end_code, start_data, end_data, start_brk, brk,
            start_stack, arg_start, arg_end, env_start, env_end.
      
      This functionality is needed by any application or package that needs to
      reconstruct Linux processes, that is, to start them in any way other than
      by means of an "execve()" from an executable file.  This includes:
      
      1. Restoring processes from a checkpoint-file (by all potential
         user-level checkpointing packages, not only CRIU's).
      2. Restarting processes on another node after process migration.
      3. Starting duplicated copies of a running process (for reliability
         and high-availablity).
      4. Starting a process from an executable format that is not supported
         by Linux, thus requiring a "manual execve" by a user-level utility.
      5. Similarly, starting a process from a networked and/or crypted
         executable that, for confidentiality, licensing or other reasons,
         may not be written to the local file-systems.
      
      The code that does that was already included in the Linux kernel by the
      CRIU group, in the form of "prctl(PR_SET_MM)", but prior to this was
      enclosed within their private "#ifdef CONFIG_CHECKPOINT_RESTORE", which is
      normally disabled.  The patch removes those ifdefs.
      Signed-off-by: NAmnon Shiloh <u3557@miso.sublimeip.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      52b36941
    • Z
      relay: use macro PAGE_ALIGN instead of FIX_SIZE · a05342cb
      zhangwei(Jovi) 提交于
      Macro FIX_SIZE is same as PAGE_ALIGN at present, so use PAGE_ALIGN
      instead.
      
      Thanks Andrew found this.
      Signed-off-by: Nzhangwei(Jovi) <jovi.zhangwei@huawei.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Eric Dumazet <edumazet@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a05342cb
    • Z
      kernel/relay.c: move FIX_SIZE macro into relay.c · 536b39ec
      zhangwei(Jovi) 提交于
      It's better to place FIX_SIZE macro in relay.c, instead of relay.h
      Signed-off-by: Nzhangwei(Jovi) <jovi.zhangwei@huawei.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Eric Dumazet <edumazet@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      536b39ec
    • Z
      kernel/relay.c: remove unused function argument actor · 8359f689
      zhangwei(Jovi) 提交于
      Currently argument `actor' is never used in the relay reading path, so
      remove it.
      Signed-off-by: Nzhangwei(Jovi) <jovi.zhangwei@huawei.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Eric Dumazet <edumazet@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8359f689
    • L
      semaphore: use `bool' type for semaphore_waiter's up · 06a6ea37
      liguang 提交于
      Signed-off-by: Nliguang <lig.fnst@cn.fujitsu.com>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      06a6ea37
    • L
      semaphore: use unlikely() for down's timeout · c74f66ce
      liguang 提交于
      Signed-off-by: Nliguang <lig.fnst@cn.fujitsu.com>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c74f66ce
    • R
      pid_namespace.c/.h: simplify defines · 5cc54451
      Raphael S.Carvalho 提交于
      Move BITS_PER_PAGE from pid_namespace.c to pid_namespace.h, since we can
      simplify the define PID_MAP_ENTRIES by using the BITS_PER_PAGE.
      
      [akpm@linux-foundation.org: kernel/pid.c:54:1: warning: "BITS_PER_PAGE" redefined]
      Signed-off-by: NRaphael S.Carvalho <raphael.scarv@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5cc54451
    • R
      kernel/pid.c: improve flow of a loop inside alloc_pidmap. · 8db049b3
      Raphael S. Carvalho 提交于
      find_next_offset() searches for an available "cleaned bit" in the
      respective pid bitmap (page), so returns the offset if found, otherwise
      it returns a value equals to BITS_PER_PAGE.
      
      For example, suppose find_next_offset didn't find any available bit, so
      there's no purpose to call mk_pid (Wasteful Cpu Cycles).
      
      Therefore, I found it could be better to call mk_pid after the checking
      (offset < BITS_PER_PAGE) returned sucessfully! Another point: If (offset
      < BITS_PER_PAGE) results in a "failure", then mk_pid would be called
      again afterwards.
      
      [akpm@linux-foundation.org: simplify code]
      Signed-off-by: NRaphael S. Carvalho <raphael.scarv@gmail.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Serge Hallyn <serge.hallyn@canonical.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8db049b3
    • Z
      kexec: Use min() and min_t() to simplify logic · 31c3a3fe
      Zhang Yanfei 提交于
      Simplify the logic of variable assignments.
      
      [akpm@linux-foundation.org: replace min_t with min, remove unneeded casts]
      Signed-off-by: NZhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Reviewed-by: NSimon Horman <horms@verge.net.au>
      Cc: Joe Perches <joe@perches.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      31c3a3fe
    • Z
      kexec: fix wrong types of some local variables · 310faaa9
      Zhang Yanfei 提交于
      The types of the following local variables:
      
      - ubytes/mbytes in kimage_load_crash_segment()/kimage_load_normal_segment()
      
      - r in vmcoreinfo_append_str()
      
      are wrong, so fix them.
      Signed-off-by: NZhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Simon Horman <horms@verge.net.au>
      Cc: Joe Perches <joe@perches.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      310faaa9
    • O
      coredump: only SIGKILL should interrupt the coredumping task · 403bad72
      Oleg Nesterov 提交于
      There are 2 well known and ancient problems with coredump/signals, and a
      lot of related bug reports:
      
      - do_coredump() clears TIF_SIGPENDING but of course this can't help
        if, say, SIGCHLD comes after that.
      
        In this case the coredump can fail unexpectedly. See for example
        wait_for_dump_helper()->signal_pending() check but there are other
        reasons.
      
      - At the same time, dumping a huge core on the slow media can take a
        lot of time/resources and there is no way to kill the coredumping
        task reliably. In particular this is not oom_kill-friendly.
      
      This patch tries to fix the 1st problem, and makes the preparation for the
      next changes.
      
      We add the new SIGNAL_GROUP_COREDUMP flag set by zap_threads() to indicate
      that this process dumps the core.  prepare_signal() checks this flag and
      nacks any signal except SIGKILL.
      
      Note that this check tries to be conservative, in the long term we should
      probably treat the SIGNAL_GROUP_EXIT case equally but this needs more
      discussion.  See marc.info/?l=linux-kernel&m=120508897917439
      
      Notes:
      	- recalc_sigpending() doesn't check SIGNAL_GROUP_COREDUMP.
      	  The patch assumes that dump_write/etc paths should never
      	  call it, but we can change it as well.
      
      	- There is another source of TIF_SIGPENDING, freezer. This
      	  will be addressed separately.
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Tested-by: NMandeep Singh Baines <msb@chromium.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Neil Horman <nhorman@redhat.com>
      Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
      Cc: Roland McGrath <roland@hack.frob.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      403bad72
    • L
      kmod: remove call_usermodehelper_fns() · 66e5b7e1
      Lucas De Marchi 提交于
      This function suffers from not being able to determine if the cleanup is
      called in case it returns -ENOMEM.  Nobody is using it anymore, so let's
      remove it.
      Signed-off-by: NLucas De Marchi <lucas.demarchi@profusion.mobi>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: James Morris <james.l.morris@oracle.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      66e5b7e1
    • L
      kmod: split call to call_usermodehelper_fns() · f634460c
      Lucas De Marchi 提交于
      Use call_usermodehelper_setup() + call_usermodehelper_exec() instead of
      calling call_usermodehelper_fns().  In case the latter returns -ENOMEM the
      cleanup function may had not been called - in this case we would not free
      argv and module_name.
      Signed-off-by: NLucas De Marchi <lucas.demarchi@profusion.mobi>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: James Morris <james.l.morris@oracle.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f634460c
    • L
      usermodehelper: export call_usermodehelper_exec() and call_usermodehelper_setup() · 938e4b22
      Lucas De Marchi 提交于
      call_usermodehelper_setup() + call_usermodehelper_exec() need to be
      called instead of call_usermodehelper_fns() when the cleanup function
      needs to be called even when an ENOMEM error occurs.  In this case using
      call_usermodehelper_fns() the user can't distinguish if the cleanup
      function was called or not.
      
      [akpm@linux-foundation.org: export call_usermodehelper_setup() to modules]
      Signed-off-by: NLucas De Marchi <lucas.demarchi@profusion.mobi>
      Reviewed-by: NOleg Nesterov <oleg@redhat.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: James Morris <james.l.morris@oracle.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      938e4b22
    • A
      ptrace: add ability to retrieve signals without removing from a queue (v4) · 84c751bd
      Andrey Vagin 提交于
      This patch adds a new ptrace request PTRACE_PEEKSIGINFO.
      
      This request is used to retrieve information about pending signals
      starting with the specified sequence number.  Siginfo_t structures are
      copied from the child into the buffer starting at "data".
      
      The argument "addr" is a pointer to struct ptrace_peeksiginfo_args.
      struct ptrace_peeksiginfo_args {
      	u64 off;	/* from which siginfo to start */
      	u32 flags;
      	s32 nr;		/* how may siginfos to take */
      };
      
      "nr" has type "s32", because ptrace() returns "long", which has 32 bits on
      i386 and a negative values is used for errors.
      
      Currently here is only one flag PTRACE_PEEKSIGINFO_SHARED for dumping
      signals from process-wide queue.  If this flag is not set, signals are
      read from a per-thread queue.
      
      The request PTRACE_PEEKSIGINFO returns a number of dumped signals.  If a
      signal with the specified sequence number doesn't exist, ptrace returns
      zero.  The request returns an error, if no signal has been dumped.
      
      Errors:
      EINVAL - one or more specified flags are not supported or nr is negative
      EFAULT - buf or addr is outside your accessible address space.
      
      A result siginfo contains a kernel part of si_code which usually striped,
      but it's required for queuing the same siginfo back during restore of
      pending signals.
      
      This functionality is required for checkpointing pending signals.  Pedro
      Alves suggested using it in "gdb" to peek at pending signals.  gdb already
      uses PTRACE_GETSIGINFO to get the siginfo for the signal which was already
      dequeued.  This functionality allows gdb to look at the pending signals
      which were not reported yet.
      
      The prototype of this code was developed by Oleg Nesterov.
      Signed-off-by: NAndrew Vagin <avagin@openvz.org>
      Cc: Roland McGrath <roland@redhat.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Pedro Alves <palves@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      84c751bd
    • S
      kernel/timer.c: move some non timer related syscalls to kernel/sys.c · 4a22f166
      Stephen Rothwell 提交于
      Andrew Morton noted:
      
      	akpm3:/usr/src/25> grep SYSCALL kernel/timer.c
      	SYSCALL_DEFINE1(alarm, unsigned int, seconds)
      	SYSCALL_DEFINE0(getpid)
      	SYSCALL_DEFINE0(getppid)
      	SYSCALL_DEFINE0(getuid)
      	SYSCALL_DEFINE0(geteuid)
      	SYSCALL_DEFINE0(getgid)
      	SYSCALL_DEFINE0(getegid)
      	SYSCALL_DEFINE0(gettid)
      	SYSCALL_DEFINE1(sysinfo, struct sysinfo __user *, info)
      	COMPAT_SYSCALL_DEFINE1(sysinfo, struct compat_sysinfo __user *, info)
      
      	Only one of those should be in kernel/timer.c.  Who wrote this thing?
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: NStephen Rothwell <sfr@canb.auug.org.au>
      Acked-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4a22f166
    • S
      kernel/timer.c: convert compat_sys_sysinfo to COMPAT_SYSCALL_DEFINE · 1043f65a
      Stephen Rothwell 提交于
      Signed-off-by: NStephen Rothwell <sfr@canb.auug.org.au>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1043f65a
    • S
      kernel/compat.c: make do_sysinfo() static · 1a0df594
      Stephen Rothwell 提交于
      The only use outside of kernel/timer.c was in kernel/compat.c, so move
      compat_sys_sysinfo() next to sys_sysinfo() in kernel/timer.c.
      Signed-off-by: NStephen Rothwell <sfr@canb.auug.org.au>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Acked-by: NArnd Bergmann <arnd@arndb.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1a0df594
    • A
      kernel/smp.c: cleanups · e1d12f32
      Andrew Morton 提交于
      We sometimes use "struct call_single_data *data" and sometimes "struct
      call_single_data *csd".  Use "csd" consistently.
      
      We sometimes use "struct call_function_data *data" and sometimes "struct
      call_function_data *cfd".  Use "cfd" consistently.
      
      Also, avoid some 80-col layout tricks.
      
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Shaohua Li <shli@fusionio.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e1d12f32
    • L
      kernel/smp.c: remove 'priv' of call_single_data · 3440a1ca
      liguang 提交于
      The 'priv' field is redundant; we can pass data via 'info'.
      Signed-off-by: Nliguang <lig.fnst@cn.fujitsu.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3440a1ca
    • L
      kernel/smp.c: use '|=' for csd_lock · 1def1dc9
      liguang 提交于
      csd_lock() uses assignment to data->flags rather than |=.  That is not
      buggy at present because only one bit (CSD_FLAG_LOCK) is defined in
      call_single_data.flags.
      
      But it will become buggy if we later add another flag, so fix it now.
      Signed-off-by: Nliguang <lig.fnst@cn.fujitsu.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1def1dc9
    • T
      workqueue: include workqueue info when printing debug dump of a worker task · 3d1cb205
      Tejun Heo 提交于
      One of the problems that arise when converting dedicated custom
      threadpool to workqueue is that the shared worker pool used by workqueue
      anonimizes each worker making it more difficult to identify what the
      worker was doing on which target from the output of sysrq-t or debug
      dump from oops, BUG() and friends.
      
      This patch implements set_worker_desc() which can be called from any
      workqueue work function to set its description.  When the worker task is
      dumped for whatever reason - sysrq-t, WARN, BUG, oops, lockdep assertion
      and so on - the description will be printed out together with the
      workqueue name and the worker function pointer.
      
      The printing side is implemented by print_worker_info() which is called
      from functions in task dump paths - sched_show_task() and
      dump_stack_print_info().  print_worker_info() can be safely called on
      any task in any state as long as the task struct itself is accessible.
      It uses probe_*() functions to access worker fields.  It may print
      garbage if something went very wrong, but it wouldn't cause (another)
      oops.
      
      The description is currently limited to 24bytes including the
      terminating \0.  worker->desc_valid and workder->desc[] are added and
      the 64 bytes marker which was already incorrect before adding the new
      fields is moved to the correct position.
      
      Here's an example dump with writeback updated to set the bdi name as
      worker desc.
      
       Hardware name: Bochs
       Modules linked in:
       Pid: 7, comm: kworker/u9:0 Not tainted 3.9.0-rc1-work+ #1
       Workqueue: writeback bdi_writeback_workfn (flush-8:0)
        ffffffff820a3ab0 ffff88000f6e9cb8 ffffffff81c61845 ffff88000f6e9cf8
        ffffffff8108f50f 0000000000000000 0000000000000000 ffff88000cde16b0
        ffff88000cde1aa8 ffff88001ee19240 ffff88000f6e9fd8 ffff88000f6e9d08
       Call Trace:
        [<ffffffff81c61845>] dump_stack+0x19/0x1b
        [<ffffffff8108f50f>] warn_slowpath_common+0x7f/0xc0
        [<ffffffff8108f56a>] warn_slowpath_null+0x1a/0x20
        [<ffffffff81200150>] bdi_writeback_workfn+0x2a0/0x3b0
       ...
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Acked-by: NJan Kara <jack@suse.cz>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Dave Chinner <david@fromorbit.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3d1cb205
    • T
      kthread: implement probe_kthread_data() · cd42d559
      Tejun Heo 提交于
      One of the problems that arise when converting dedicated custom threadpool
      to workqueue is that the shared worker pool used by workqueue anonimizes
      each worker making it more difficult to identify what the worker was doing
      on which target from the output of sysrq-t or debug dump from oops, BUG()
      and friends.
      
      For example, after writeback is converted to use workqueue instead of
      priviate thread pool, there's no easy to tell which backing device a
      writeback work item was working on at the time of task dump, which,
      according to our writeback brethren, is important in tracking down issues
      with a lot of mounted file systems on a lot of different devices.
      
      This patchset implements a way for a work function to mark its execution
      instance so that task dump of the worker task includes information to
      indicate what the work item was doing.
      
      An example WARN dump would look like the following.
      
       WARNING: at fs/fs-writeback.c:1015 bdi_writeback_workfn+0x2b4/0x3c0()
       Modules linked in:
       CPU: 0 Pid: 28 Comm: kworker/u18:0 Not tainted 3.9.0-rc1-work+ #24
       Hardware name: empty empty/S3992, BIOS 080011  10/26/2007
       Workqueue: writeback bdi_writeback_workfn (flush-8:16)
        ffffffff820a3a98 ffff88015b927cb8 ffffffff81c61855 ffff88015b927cf8
        ffffffff8108f500 0000000000000000 ffff88007a171948 ffff88007a1716b0
        ffff88015b49df00 ffff88015b8d3940 0000000000000000 ffff88015b927d08
       Call Trace:
        [<ffffffff81c61855>] dump_stack+0x19/0x1b
        [<ffffffff8108f500>] warn_slowpath_common+0x70/0xa0
        ...
      
      This patch:
      
      Implement probe_kthread_data() which returns kthread_data if accessible.
      The function is equivalent to kthread_data() except that the specified
      @task may not be a kthread or its vfork_done is already cleared rendering
      struct kthread inaccessible.  In the former case, probe_kthread_data() may
      return any value.  In the latter, NULL.
      
      This will be used to safely print debug information without affecting
      synchronization in the normal paths.  Workqueue debug info printing on
      dump_stack() and friends will make use of it.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Acked-by: NJan Kara <jack@suse.cz>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      cd42d559
    • V
      arc, print-fatal-signals: reduce duplicated information · 681a90ff
      Vineet Gupta 提交于
      After the recent generic debug info on dump_stack() and friends, arc
      is printing duplicate information on debug dumps.
      
       [ARCLinux]$ ./crash
       crash/50: potentially unexpected fatal signal 11.	<-- [1]
       /sbin/crash, TGID 50					<-- [2]
       Pid: 50, comm: crash Not tainted 3.9.0-rc4+ #132 	<-- [3]
       ...
      
      Remove them.
      
      [tj@kernel.org: updated patch desc]
      Signed-off-by: NVineet Gupta <vgupta@synopsys.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Jesper Nilsson <jesper.nilsson@axis.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Mike Frysinger <vapier@gentoo.org>
      Cc: Sam Ravnborg <sam@ravnborg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      681a90ff
    • T
      dump_stack: unify debug information printed by show_regs() · a43cb95d
      Tejun Heo 提交于
      show_regs() is inherently arch-dependent but it does make sense to print
      generic debug information and some archs already do albeit in slightly
      different forms.  This patch introduces a generic function to print debug
      information from show_regs() so that different archs print out the same
      information and it's much easier to modify what's printed.
      
      show_regs_print_info() prints out the same debug info as dump_stack()
      does plus task and thread_info pointers.
      
      * Archs which didn't print debug info now do.
      
        alpha, arc, blackfin, c6x, cris, frv, h8300, hexagon, ia64, m32r,
        metag, microblaze, mn10300, openrisc, parisc, score, sh64, sparc,
        um, xtensa
      
      * Already prints debug info.  Replaced with show_regs_print_info().
        The printed information is superset of what used to be there.
      
        arm, arm64, avr32, mips, powerpc, sh32, tile, unicore32, x86
      
      * s390 is special in that it used to print arch-specific information
        along with generic debug info.  Heiko and Martin think that the
        arch-specific extra isn't worth keeping s390 specfic implementation.
        Converted to use the generic version.
      
      Note that now all archs print the debug info before actual register
      dumps.
      
      An example BUG() dump follows.
      
       kernel BUG at /work/os/work/kernel/workqueue.c:4841!
       invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
       Modules linked in:
       CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.9.0-rc1-work+ #7
       Hardware name: empty empty/S3992, BIOS 080011  10/26/2007
       task: ffff88007c85e040 ti: ffff88007c860000 task.ti: ffff88007c860000
       RIP: 0010:[<ffffffff8234a07e>]  [<ffffffff8234a07e>] init_workqueues+0x4/0x6
       RSP: 0000:ffff88007c861ec8  EFLAGS: 00010246
       RAX: ffff88007c861fd8 RBX: ffffffff824466a8 RCX: 0000000000000001
       RDX: 0000000000000046 RSI: 0000000000000001 RDI: ffffffff8234a07a
       RBP: ffff88007c861ec8 R08: 0000000000000000 R09: 0000000000000000
       R10: 0000000000000001 R11: 0000000000000000 R12: ffffffff8234a07a
       R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
       FS:  0000000000000000(0000) GS:ffff88007dc00000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
       CR2: ffff88015f7ff000 CR3: 00000000021f1000 CR4: 00000000000007f0
       DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
       Stack:
        ffff88007c861ef8 ffffffff81000312 ffffffff824466a8 ffff88007c85e650
        0000000000000003 0000000000000000 ffff88007c861f38 ffffffff82335e5d
        ffff88007c862080 ffffffff8223d8c0 ffff88007c862080 ffffffff81c47760
       Call Trace:
        [<ffffffff81000312>] do_one_initcall+0x122/0x170
        [<ffffffff82335e5d>] kernel_init_freeable+0x9b/0x1c8
        [<ffffffff81c47760>] ? rest_init+0x140/0x140
        [<ffffffff81c4776e>] kernel_init+0xe/0xf0
        [<ffffffff81c6be9c>] ret_from_fork+0x7c/0xb0
        [<ffffffff81c47760>] ? rest_init+0x140/0x140
        ...
      
      v2: Typo fix in x86-32.
      
      v3: CPU number dropped from show_regs_print_info() as
          dump_stack_print_info() has been updated to print it.  s390
          specific implementation dropped as requested by s390 maintainers.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NDavid S. Miller <davem@davemloft.net>
      Acked-by: NJesper Nilsson <jesper.nilsson@axis.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Mike Frysinger <vapier@gentoo.org>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Sam Ravnborg <sam@ravnborg.org>
      Acked-by: Chris Metcalf <cmetcalf@tilera.com>		[tile bits]
      Acked-by: Richard Kuo <rkuo@codeaurora.org>		[hexagon bits]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a43cb95d
    • T
      dump_stack: implement arch-specific hardware description in task dumps · 98e5e1bf
      Tejun Heo 提交于
      x86 and ia64 can acquire extra hardware identification information
      from DMI and print it along with task dumps; however, the usage isn't
      consistent.
      
      * x86 show_regs() collects vendor, product and board strings and print
        them out with PID, comm and utsname.  Some of the information is
        printed again later in the same dump.
      
      * warn_slowpath_common() explicitly accesses the DMI board and prints
        it out with "Hardware name:" label.  This applies to both x86 and
        ia64 but is irrelevant on all other archs.
      
      * ia64 doesn't show DMI information on other non-WARN dumps.
      
      This patch introduces arch-specific hardware description used by
      dump_stack().  It can be set by calling dump_stack_set_arch_desc()
      during boot and, if exists, printed out in a separate line with
      "Hardware name:" label.
      
      dmi_set_dump_stack_arch_desc() is added which sets arch-specific
      description from DMI data.  It uses dmi_ids_string[] which is set from
      dmi_present() used for DMI debug message.  It is superset of the
      information x86 show_regs() is using.  The function is called from x86
      and ia64 boot code right after dmi_scan_machine().
      
      This makes the explicit DMI handling in warn_slowpath_common()
      unnecessary.  Removed.
      
      show_regs() isn't yet converted to use generic debug information
      printing and this patch doesn't remove the duplicate DMI handling in
      x86 show_regs().  The next patch will unify show_regs() handling and
      remove the duplication.
      
      An example WARN dump follows.
      
       WARNING: at kernel/workqueue.c:4841 init_workqueues+0x35/0x505()
       Modules linked in:
       CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.9.0-rc1-work+ #3
       Hardware name: empty empty/S3992, BIOS 080011  10/26/2007
        0000000000000009 ffff88007c861e08 ffffffff81c614dc ffff88007c861e48
        ffffffff8108f500 ffffffff82228240 0000000000000040 ffffffff8234a08e
        0000000000000000 0000000000000000 0000000000000000 ffff88007c861e58
       Call Trace:
        [<ffffffff81c614dc>] dump_stack+0x19/0x1b
        [<ffffffff8108f500>] warn_slowpath_common+0x70/0xa0
        [<ffffffff8108f54a>] warn_slowpath_null+0x1a/0x20
        [<ffffffff8234a0c3>] init_workqueues+0x35/0x505
        ...
      
      v2: Use the same string as the debug message from dmi_present() which
          also contains BIOS information.  Move hardware name into its own
          line as warn_slowpath_common() did.  This change was suggested by
          Bjorn Helgaas.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Jesper Nilsson <jesper.nilsson@axis.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Mike Frysinger <vapier@gentoo.org>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Sam Ravnborg <sam@ravnborg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      98e5e1bf
    • T
      dump_stack: consolidate dump_stack() implementations and unify their behaviors · 196779b9
      Tejun Heo 提交于
      Both dump_stack() and show_stack() are currently implemented by each
      architecture.  show_stack(NULL, NULL) dumps the backtrace for the
      current task as does dump_stack().  On some archs, dump_stack() prints
      extra information - pid, utsname and so on - in addition to the
      backtrace while the two are identical on other archs.
      
      The usages in arch-independent code of the two functions indicate
      show_stack(NULL, NULL) should print out bare backtrace while
      dump_stack() is used for debugging purposes when something went wrong,
      so it does make sense to print additional information on the task which
      triggered dump_stack().
      
      There's no reason to require archs to implement two separate but mostly
      identical functions.  It leads to unnecessary subtle information.
      
      This patch expands the dummy fallback dump_stack() implementation in
      lib/dump_stack.c such that it prints out debug information (taken from
      x86) and invokes show_stack(NULL, NULL) and drops arch-specific
      dump_stack() implementations in all archs except blackfin.  Blackfin's
      dump_stack() does something wonky that I don't understand.
      
      Debug information can be printed separately by calling
      dump_stack_print_info() so that arch-specific dump_stack()
      implementation can still emit the same debug information.  This is used
      in blackfin.
      
      This patch brings the following behavior changes.
      
      * On some archs, an extra level in backtrace for show_stack() could be
        printed.  This is because the top frame was determined in
        dump_stack() on those archs while generic dump_stack() can't do that
        reliably.  It can be compensated by inlining dump_stack() but not
        sure whether that'd be necessary.
      
      * Most archs didn't use to print debug info on dump_stack().  They do
        now.
      
      An example WARN dump follows.
      
       WARNING: at kernel/workqueue.c:4841 init_workqueues+0x35/0x505()
       Hardware name: empty
       Modules linked in:
       CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.9.0-rc1-work+ #9
        0000000000000009 ffff88007c861e08 ffffffff81c614dc ffff88007c861e48
        ffffffff8108f50f ffffffff82228240 0000000000000040 ffffffff8234a03c
        0000000000000000 0000000000000000 0000000000000000 ffff88007c861e58
       Call Trace:
        [<ffffffff81c614dc>] dump_stack+0x19/0x1b
        [<ffffffff8108f50f>] warn_slowpath_common+0x7f/0xc0
        [<ffffffff8108f56a>] warn_slowpath_null+0x1a/0x20
        [<ffffffff8234a071>] init_workqueues+0x35/0x505
        ...
      
      v2: CPU number added to the generic debug info as requested by s390
          folks and dropped the s390 specific dump_stack().  This loses %ksp
          from the debug message which the maintainers think isn't important
          enough to keep the s390-specific dump_stack() implementation.
      
          dump_stack_print_info() is moved to kernel/printk.c from
          lib/dump_stack.c.  Because linkage is per objecct file,
          dump_stack_print_info() living in the same lib file as generic
          dump_stack() means that archs which implement custom dump_stack()
          - at this point, only blackfin - can't use dump_stack_print_info()
          as that will bring in the generic version of dump_stack() too.  v1
          The v1 patch broke build on blackfin due to this issue.  The build
          breakage was reported by Fengguang Wu.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NDavid S. Miller <davem@davemloft.net>
      Acked-by: NVineet Gupta <vgupta@synopsys.com>
      Acked-by: NJesper Nilsson <jesper.nilsson@axis.com>
      Acked-by: NVineet Gupta <vgupta@synopsys.com>
      Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>	[s390 bits]
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Mike Frysinger <vapier@gentoo.org>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: Sam Ravnborg <sam@ravnborg.org>
      Acked-by: Richard Kuo <rkuo@codeaurora.org>		[hexagon bits]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      196779b9
    • L
      kernel/range.c: subtract_range: fix the broken phrase issued by printk · 7fba2c27
      Lin Feng 提交于
      Also replace deprecated printk(KERN_ERR...) with pr_err() as suggested
      by Yinghai, attaching the function name to provide plenty info.
      Signed-off-by: NLin Feng <linfeng@cn.fujitsu.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7fba2c27
  2. 30 4月, 2013 12 次提交
    • S
      tracing: Fix small merge bug · 6c24499f
      Steven Rostedt 提交于
      During the 3.10 merge, a conflict happened and the resolution was
      almost, but not quite, correct. An if statement was reversed.
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      [ Duh. That was just silly of me  - Linus ]
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6c24499f
    • A
      kernel/: rename random32() to prandom_u32() · 6d65df33
      Akinobu Mita 提交于
      Use preferable function name which implies using a pseudo-random
      number generator.
      Signed-off-by: NAkinobu Mita <akinobu.mita@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6d65df33
    • N
      printk: fix failure to return error in devkmsg_poll() · 0a285317
      Nicolas Kaiser 提交于
      Error value got overwritten instantly.
      Signed-off-by: NNicolas Kaiser <nikai@nikai.net>
      Cc: Kay Sievers <kay.sievers@vrfy.org>
      Cc: Greg KH <greg@kroah.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0a285317
    • T
      early_printk: consolidate random copies of identical code · d0380e6c
      Thomas Gleixner 提交于
      The early console implementations are the same all over the place.  Move
      the print function to kernel/printk and get rid of the copies.
      
      [akpm@linux-foundation.org: arch/mips/kernel/early_printk.c needs kernel.h for va_list]
      [paul.gortmaker@windriver.com: sh4: make the bios early console support depend on EARLY_PRINTK]
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NPaul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Russell King <linux@arm.linux.org.uk>
      Acked-by: NMike Frysinger <vapier@gentoo.org>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Richard Weinberger <richard@nod.at>
      Reviewed-by: NIngo Molnar <mingo@kernel.org>
      Tested-by: NPaul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d0380e6c
    • Z
      printk/tracing: rework console tracing · 07c65f4d
      zhangwei(Jovi) 提交于
      Commit 7ff9554b ("printk: convert byte-buffer to variable-length
      record buffer") removed start and end parameters from
      call_console_drivers, but those parameters still exist in
      include/trace/events/printk.h.
      
      Without start and end parameters handling, printk tracing became more
      simple as: trace_console(text, len);
      Signed-off-by: Nzhangwei(Jovi) <jovi.zhangwei@huawei.com>
      Acked-by: NSteven Rostedt <rostedt@goodmis.org>
      Acked-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Kay Sievers <kay@vrfy.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      07c65f4d
    • Y
      mem hotunplug: fix kfree() of bootmem memory · ebff7d8f
      Yasuaki Ishimatsu 提交于
      When hot removing memory presented at boot time, following messages are shown:
      
        kernel BUG at mm/slub.c:3409!
        invalid opcode: 0000 [#1] SMP
        Modules linked in: ebtable_nat ebtables xt_CHECKSUM iptable_mangle bridge stp llc ipmi_devintf ipmi_msghandler sunrpc ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables binfmt_misc vfat fat dm_mirror dm_region_hash dm_log dm_mod vhost_net macvtap macvlan tun uinput iTCO_wdt iTCO_vendor_support coretemp kvm_intel kvm crc32c_intel ghash_clmulni_intel microcode pcspkr sg i2c_i801 lpc_ich mfd_core igb i2c_algo_bit i2c_core e1000e ptp pps_core tpm_infineon ioatdma dca sr_mod cdrom sd_mod crc_t10dif usb_storage megaraid_sas lpfc scsi_transport_fc scsi_tgt scsi_mod
        CPU 0
        Pid: 5091, comm: kworker/0:2 Tainted: G        W    3.9.0-rc6+ #15
        RIP: kfree+0x232/0x240
        Process kworker/0:2 (pid: 5091, threadinfo ffff88084678c000, task ffff88083928ca80)
        Call Trace:
          __release_region+0xd4/0xe0
          __remove_pages+0x52/0x110
          arch_remove_memory+0x89/0xd0
          remove_memory+0xc4/0x100
          acpi_memory_device_remove+0x6d/0xb1
          acpi_device_remove+0x89/0xab
          __device_release_driver+0x7c/0xf0
          device_release_driver+0x2f/0x50
          acpi_bus_device_detach+0x6c/0x70
          acpi_ns_walk_namespace+0x11a/0x250
          acpi_walk_namespace+0xee/0x137
          acpi_bus_trim+0x33/0x7a
          acpi_bus_hot_remove_device+0xc4/0x1a1
          acpi_os_execute_deferred+0x27/0x34
          process_one_work+0x1f7/0x590
          worker_thread+0x11a/0x370
          kthread+0xee/0x100
          ret_from_fork+0x7c/0xb0
        RIP  [<ffffffff811c41d2>] kfree+0x232/0x240
         RSP <ffff88084678d968>
      
      The reason why the messages are shown is to release a resource
      structure, allocated by bootmem, by kfree().  So when we release a
      resource structure, we should check whether it is allocated by bootmem
      or not.
      
      But even if we know a resource structure is allocated by bootmem, we
      cannot release it since SLxB cannot treat it.  So for reusing a resource
      structure, this patch remembers it by using bootmem_resource as follows:
      
      When releasing a resource structure by free_resource(), free_resource()
      checks whether the resource structure is allocated by bootmem or not.
      If it is allocated by bootmem, free_resource() adds it to
      bootmem_resource.  If it is not allocated by bootmem, free_resource()
      release it by kfree().
      
      And when getting a new resource structure by get_resource(),
      get_resource() checks whether bootmem_resource has released resource
      structures or not.  If there is a released resource structure,
      get_resource() returns it.  If there is not a releaed resource
      structure, get_resource() returns new resource structure allocated by
      kzalloc().
      
      [akpm@linux-foundation.org: s/get_resource/alloc_resource/]
      Signed-off-by: NYasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Reviewed-by: NToshi Kani <toshi.kani@hp.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Ram Pai <linuxram@us.ibm.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ebff7d8f
    • T
      resource: add release_mem_region_adjustable() · 825f787b
      Toshi Kani 提交于
      Add release_mem_region_adjustable(), which releases a requested region
      from a currently busy memory resource.  This interface adjusts the
      matched memory resource accordingly even if the requested region does
      not match exactly but still fits into.
      
      This new interface is intended for memory hot-delete.  During bootup,
      memory resources are inserted from the boot descriptor table, such as
      EFI Memory Table and e820.  Each memory resource entry usually covers
      the whole contigous memory range.  Memory hot-delete request, on the
      other hand, may target to a particular range of memory resource, and its
      size can be much smaller than the whole contiguous memory.  Since the
      existing release interfaces like __release_region() require a requested
      region to be exactly matched to a resource entry, they do not allow a
      partial resource to be released.
      
      This new interface is restrictive (i.e.  release under certain
      conditions), which is consistent with other release interfaces,
      __release_region() and __release_resource().  Additional release
      conditions, such as an overlapping region to a resource entry, can be
      supported after they are confirmed as valid cases.
      
      There is no change to the existing interfaces since their restriction is
      valid for I/O resources.
      
      [akpm@linux-foundation.org: use GFP_ATOMIC under write_lock()]
      [akpm@linux-foundation.org: switch back to GFP_KERNEL, less buggily]
      [akpm@linux-foundation.org: remove unneeded and wrong kfree(), per Toshi]
      Signed-off-by: NToshi Kani <toshi.kani@hp.com>
      Reviewed-by : Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Reviewed-by: NRam Pai <linuxram@us.ibm.com>
      Cc: T Makphaibulchoke <tmac@hp.com>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      825f787b
    • T
      resource: add __adjust_resource() for internal use · ae8e3a91
      Toshi Kani 提交于
      Add __adjust_resource(), which is called by adjust_resource() internally
      after the resource_lock is held.  There is no interface change to
      adjust_resource().  This change allows other functions to call
      __adjust_resource() internally while the resource_lock is held.
      Signed-off-by: NToshi Kani <toshi.kani@hp.com>
      Reviewed-by: NYasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Cc: Ram Pai <linuxram@us.ibm.com>
      Cc: T Makphaibulchoke <tmac@hp.com>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ae8e3a91
    • A
      mm: replace hardcoded 3% with admin_reserve_pages knob · 4eeab4f5
      Andrew Shewmaker 提交于
      Add an admin_reserve_kbytes knob to allow admins to change the hardcoded
      memory reserve to something other than 3%, which may be multiple
      gigabytes on large memory systems.  Only about 8MB is necessary to
      enable recovery in the default mode, and only a few hundred MB are
      required even when overcommit is disabled.
      
      This affects OVERCOMMIT_GUESS and OVERCOMMIT_NEVER.
      
      admin_reserve_kbytes is initialized to min(3% free pages, 8MB)
      
      I arrived at 8MB by summing the RSS of sshd or login, bash, and top.
      
      Please see first patch in this series for full background, motivation,
      testing, and full changelog.
      
      [akpm@linux-foundation.org: coding-style fixes]
      [akpm@linux-foundation.org: make init_admin_reserve() static]
      Signed-off-by: NAndrew Shewmaker <agshew@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4eeab4f5
    • A
      mm: limit growth of 3% hardcoded other user reserve · c9b1d098
      Andrew Shewmaker 提交于
      Add user_reserve_kbytes knob.
      
      Limit the growth of the memory reserved for other user processes to
      min(3% current process size, user_reserve_pages).  Only about 8MB is
      necessary to enable recovery in the default mode, and only a few hundred
      MB are required even when overcommit is disabled.
      
      user_reserve_pages defaults to min(3% free pages, 128MB)
      
      I arrived at 128MB by taking the max VSZ of sshd, login, bash, and top ...
      then adding the RSS of each.
      
      This only affects OVERCOMMIT_NEVER mode.
      
      Background
      
      1. user reserve
      
      __vm_enough_memory reserves a hardcoded 3% of the current process size for
      other applications when overcommit is disabled.  This was done so that a
      user could recover if they launched a memory hogging process.  Without the
      reserve, a user would easily run into a message such as:
      
      bash: fork: Cannot allocate memory
      
      2. admin reserve
      
      Additionally, a hardcoded 3% of free memory is reserved for root in both
      overcommit 'guess' and 'never' modes.  This was intended to prevent a
      scenario where root-cant-log-in and perform recovery operations.
      
      Note that this reserve shrinks, and doesn't guarantee a useful reserve.
      
      Motivation
      
      The two hardcoded memory reserves should be updated to account for current
      memory sizes.
      
      Also, the admin reserve would be more useful if it didn't shrink too much.
      
      When the current code was originally written, 1GB was considered
      "enterprise".  Now the 3% reserve can grow to multiple GB on large memory
      systems, and it only needs to be a few hundred MB at most to enable a user
      or admin to recover a system with an unwanted memory hogging process.
      
      I've found that reducing these reserves is especially beneficial for a
      specific type of application load:
      
       * single application system
       * one or few processes (e.g. one per core)
       * allocating all available memory
       * not initializing every page immediately
       * long running
      
      I've run scientific clusters with this sort of load.  A long running job
      sometimes failed many hours (weeks of CPU time) into a calculation.  They
      weren't initializing all of their memory immediately, and they weren't
      using calloc, so I put systems into overcommit 'never' mode.  These
      clusters run diskless and have no swap.
      
      However, with the current reserves, a user wishing to allocate as much
      memory as possible to one process may be prevented from using, for
      example, almost 2GB out of 32GB.
      
      The effect is less, but still significant when a user starts a job with
      one process per core.  I have repeatedly seen a set of processes
      requesting the same amount of memory fail because one of them could not
      allocate the amount of memory a user would expect to be able to allocate.
      For example, Message Passing Interfce (MPI) processes, one per core.  And
      it is similar for other parallel programming frameworks.
      
      Changing this reserve code will make the overcommit never mode more useful
      by allowing applications to allocate nearly all of the available memory.
      
      Also, the new admin_reserve_kbytes will be safer than the current behavior
      since the hardcoded 3% of available memory reserve can shrink to something
      useless in the case where applications have grabbed all available memory.
      
      Risks
      
      * "bash: fork: Cannot allocate memory"
      
        The downside of the first patch-- which creates a tunable user reserve
        that is only used in overcommit 'never' mode--is that an admin can set
        it so low that a user may not be able to kill their process, even if
        they already have a shell prompt.
      
        Of course, a user can get in the same predicament with the current 3%
        reserve--they just have to launch processes until 3% becomes negligible.
      
      * root-cant-log-in problem
      
        The second patch, adding the tunable rootuser_reserve_pages, allows
        the admin to shoot themselves in the foot by setting it too small.  They
        can easily get the system into a state where root-can't-log-in.
      
        However, the new admin_reserve_kbytes will be safer than the current
        behavior since the hardcoded 3% of available memory reserve can shrink
        to something useless in the case where applications have grabbed all
        available memory.
      
      Alternatives
      
       * Memory cgroups provide a more flexible way to limit application memory.
      
         Not everyone wants to set up cgroups or deal with their overhead.
      
       * We could create a fourth overcommit mode which provides smaller reserves.
      
         The size of useful reserves may be drastically different depending
         on the whether the system is embedded or enterprise.
      
       * Force users to initialize all of their memory or use calloc.
      
         Some users don't want/expect the system to overcommit when they malloc.
         Overcommit 'never' mode is for this scenario, and it should work well.
      
      The new user and admin reserve tunables are simple to use, with low
      overhead compared to cgroups.  The patches preserve current behavior where
      3% of memory is less than 128MB, except that the admin reserve doesn't
      shrink to an unusable size under pressure.  The code allows admins to tune
      for embedded and enterprise usage.
      
      FAQ
      
       * How is the root-cant-login problem addressed?
         What happens if admin_reserve_pages is set to 0?
      
         Root is free to shoot themselves in the foot by setting
         admin_reserve_kbytes too low.
      
         On x86_64, the minimum useful reserve is:
           8MB for overcommit 'guess'
         128MB for overcommit 'never'
      
         admin_reserve_pages defaults to min(3% free memory, 8MB)
      
         So, anyone switching to 'never' mode needs to adjust
         admin_reserve_pages.
      
       * How do you calculate a minimum useful reserve?
      
         A user or the admin needs enough memory to login and perform
         recovery operations, which includes, at a minimum:
      
         sshd or login + bash (or some other shell) + top (or ps, kill, etc.)
      
         For overcommit 'guess', we can sum resident set sizes (RSS)
         because we only need enough memory to handle what the recovery
         programs will typically use. On x86_64 this is about 8MB.
      
         For overcommit 'never', we can take the max of their virtual sizes (VSZ)
         and add the sum of their RSS. We use VSZ instead of RSS because mode
         forces us to ensure we can fulfill all of the requested memory allocations--
         even if the programs only use a fraction of what they ask for.
         On x86_64 this is about 128MB.
      
         When swap is enabled, reserves are useful even when they are as
         small as 10MB, regardless of overcommit mode.
      
         When both swap and overcommit are disabled, then the admin should
         tune the reserves higher to be absolutley safe. Over 230MB each
         was safest in my testing.
      
       * What happens if user_reserve_pages is set to 0?
      
         Note, this only affects overcomitt 'never' mode.
      
         Then a user will be able to allocate all available memory minus
         admin_reserve_kbytes.
      
         However, they will easily see a message such as:
      
         "bash: fork: Cannot allocate memory"
      
         And they won't be able to recover/kill their application.
         The admin should be able to recover the system if
         admin_reserve_kbytes is set appropriately.
      
       * What's the difference between overcommit 'guess' and 'never'?
      
         "Guess" allows an allocation if there are enough free + reclaimable
         pages. It has a hardcoded 3% of free pages reserved for root.
      
         "Never" allows an allocation if there is enough swap + a configurable
         percentage (default is 50) of physical RAM. It has a hardcoded 3% of
         free pages reserved for root, like "Guess" mode. It also has a
         hardcoded 3% of the current process size reserved for additional
         applications.
      
       * Why is overcommit 'guess' not suitable even when an app eventually
         writes to every page? It takes free pages, file pages, available
         swap pages, reclaimable slab pages into consideration. In other words,
         these are all pages available, then why isn't overcommit suitable?
      
         Because it only looks at the present state of the system. It
         does not take into account the memory that other applications have
         malloced, but haven't initialized yet. It overcommits the system.
      
      Test Summary
      
      There was little change in behavior in the default overcommit 'guess'
      mode with swap enabled before and after the patch. This was expected.
      
      Systems run most predictably (i.e. no oom kills) in overcommit 'never'
      mode with swap enabled. This also allowed the most memory to be allocated
      to a user application.
      
      Overcommit 'guess' mode without swap is a bad idea. It is easy to
      crash the system. None of the other tested combinations crashed.
      This matches my experience on the Roadrunner supercomputer.
      
      Without the tunable user reserve, a system in overcommit 'never' mode
      and without swap does not allow the admin to recover, although the
      admin can.
      
      With the new tunable reserves, a system in overcommit 'never' mode
      and without swap can be configured to:
      
      1. maximize user-allocatable memory, running close to the edge of
      recoverability
      
      2. maximize recoverability, sacrificing allocatable memory to
      ensure that a user cannot take down a system
      
      Test Description
      
      Fedora 18 VM - 4 x86_64 cores, 5725MB RAM, 4GB Swap
      
      System is booted into multiuser console mode, with unnecessary services
      turned off. Caches were dropped before each test.
      
      Hogs are user memtester processes that attempt to allocate all free memory
      as reported by /proc/meminfo
      
      In overcommit 'never' mode, memory_ratio=100
      
      Test Results
      
      3.9.0-rc1-mm1
      
      Overcommit | Swap | Hogs | MB Got/Wanted | OOMs | User Recovery | Admin Recovery
      ----------   ----   ----   -------------   ----   -------------   --------------
      guess        yes    1      5432/5432       no     yes             yes
      guess        yes    4      5444/5444       1      yes             yes
      guess        no     1      5302/5449       no     yes             yes
      guess        no     4      -               crash  no              no
      
      never        yes    1      5460/5460       1      yes             yes
      never        yes    4      5460/5460       1      yes             yes
      never        no     1      5218/5432       no     no              yes
      never        no     4      5203/5448       no     no              yes
      
      3.9.0-rc1-mm1-tunablereserves
      
      User and Admin Recovery show their respective reserves, if applicable.
      
      Overcommit | Swap | Hogs | MB Got/Wanted | OOMs | User Recovery | Admin Recovery
      ----------   ----   ----   -------------   ----   -------------   --------------
      guess        yes    1      5419/5419       no     - yes           8MB yes
      guess        yes    4      5436/5436       1      - yes           8MB yes
      guess        no     1      5440/5440       *      - yes           8MB yes
      guess        no     4      -               crash  - no            8MB no
      
      * process would successfully mlock, then the oom killer would pick it
      
      never        yes    1      5446/5446       no     10MB yes        20MB yes
      never        yes    4      5456/5456       no     10MB yes        20MB yes
      never        no     1      5387/5429       no     128MB no        8MB barely
      never        no     1      5323/5428       no     226MB barely    8MB barely
      never        no     1      5323/5428       no     226MB barely    8MB barely
      
      never        no     1      5359/5448       no     10MB no         10MB barely
      
      never        no     1      5323/5428       no     0MB no          10MB barely
      never        no     1      5332/5428       no     0MB no          50MB yes
      never        no     1      5293/5429       no     0MB no          90MB yes
      
      never        no     1      5001/5427       no     230MB yes       338MB yes
      never        no     4*     4998/5424       no     230MB yes       338MB yes
      
      * more memtesters were launched, able to allocate approximately another 100MB
      
      Future Work
      
       - Test larger memory systems.
      
       - Test an embedded image.
      
       - Test other architectures.
      
       - Time malloc microbenchmarks.
      
       - Would it be useful to be able to set overcommit policy for
         each memory cgroup?
      
       - Some lines are slightly above 80 chars.
         Perhaps define a macro to convert between pages and kb?
         Other places in the kernel do this.
      
      [akpm@linux-foundation.org: coding-style fixes]
      [akpm@linux-foundation.org: make init_user_reserve() static]
      Signed-off-by: NAndrew Shewmaker <agshew@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c9b1d098
    • A
      kernel/cpuset.c: use register_hotmemory_notifier() · d8f10cb3
      Andrew Morton 提交于
      Use the new interface, remove one ifdef.  No code size changes.
      
      We could/should have been using __meminit/__meminitdata here but there's
      now no point in doing that because all this code is elided at compile time.
      
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d8f10cb3
    • A
      kexec, vmalloc: export additional vmalloc layer information · 13ba3fcb
      Atsushi Kumagai 提交于
      Now, vmap_area_list is exported as VMCOREINFO for makedumpfile to get
      the start address of vmalloc region (vmalloc_start).  The address which
      contains vmalloc_start value is represented as below:
      
        vmap_area_list.next - OFFSET(vmap_area.list) + OFFSET(vmap_area.va_start)
      
      However, both OFFSET(vmap_area.va_start) and OFFSET(vmap_area.list)
      aren't exported as VMCOREINFO.
      
      So this patch exports them externally with small cleanup.
      
      [akpm@linux-foundation.org: vmalloc.h should include list.h for list_head]
      Signed-off-by: NAtsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Dave Anderson <anderson@redhat.com>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      13ba3fcb