1. 28 Jan, 2013 6 commits
    • F
      cputime: Safely read cputime of full dynticks CPUs · 6a61671b
      Frederic Weisbecker committed
      While remotely reading the cputime of a task running on a
      full dynticks CPU, the values stored in the utime/stime fields
      of struct task_struct may be stale. They may be those
      of the last kernel <-> user transition time snapshot, and
      we need to add the tickless time spent since that snapshot.
      
      To fix this, flush the cputime of the dynticks CPUs on
      kernel <-> user transition and record the time / context
      where we did this. Then on top of this snapshot and the current
      time, perform the fixup on the reader side from task_times()
      accessors.
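
The snapshot-and-fixup idea described above can be sketched in plain C. This is a minimal illustration, not the kernel code: the struct, the enum and both function names are hypothetical, time is an abstract counter, and the reader/writer synchronization that a real implementation would need is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical context states; the names are illustrative only. */
enum ctx_state { CTX_KERNEL, CTX_USER };

/* Hypothetical per-task record, standing in for the relevant task_struct fields. */
struct task_cputime_sketch {
    uint64_t utime;          /* user time flushed at the last transition */
    uint64_t stime;          /* system time flushed at the last transition */
    uint64_t snap_time;      /* timestamp of the last kernel <-> user transition */
    enum ctx_state snap_ctx; /* context entered at that transition */
};

/* Writer side: on a transition, flush the elapsed time into the context
 * being left, then record the new snapshot time and context. */
static void transition(struct task_cputime_sketch *t, enum ctx_state next,
                       uint64_t now)
{
    assert(now >= t->snap_time);
    uint64_t delta = now - t->snap_time;

    if (t->snap_ctx == CTX_USER)
        t->utime += delta;
    else
        t->stime += delta;
    t->snap_time = now;
    t->snap_ctx = next;
}

/* Reader side: start from the possibly stale fields and add the time
 * elapsed since the snapshot to whichever context the task is in. */
static void task_times_sketch(const struct task_cputime_sketch *t, uint64_t now,
                              uint64_t *ut, uint64_t *st)
{
    uint64_t delta = now > t->snap_time ? now - t->snap_time : 0;

    *ut = t->utime;
    *st = t->stime;
    if (t->snap_ctx == CTX_USER)
        *ut += delta;
    else
        *st += delta;
}
```

The stored fields are only updated on transitions, so a remote reader that skipped the fixup would miss all the time spent since the last one.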
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Li Zhong <zhong@linux.vnet.ibm.com>
      Cc: Namhyung Kim <namhyung.kim@lge.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      [fixed kvm module related build errors]
      Signed-off-by: Sedat Dilek <sedat.dilek@gmail.com>
      6a61671b
    • F
      kvm: Prepare to add generic guest entry/exit callbacks · c11f11fc
      Frederic Weisbecker committed
      Do some preparatory groundwork before adding the guest_enter()
      and guest_exit() context tracking callbacks. Those will
      later be used to read the guest cputime safely when we
      run in full dynticks mode.
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Gleb Natapov <gleb@redhat.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Li Zhong <zhong@linux.vnet.ibm.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Namhyung Kim <namhyung.kim@lge.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      c11f11fc
    • F
      cputime: Use accessors to read task cputime stats · 6fac4829
      Frederic Weisbecker committed
      This is in preparation for the full dynticks feature. While
      remotely reading the cputime of a task running on a full
      dynticks CPU, we'll need to do some extra computation. This
      way we can account for the time it spent tickless in userspace
      since its last cputime snapshot.
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Li Zhong <zhong@linux.vnet.ibm.com>
      Cc: Namhyung Kim <namhyung.kim@lge.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      6fac4829
    • F
      cputime: Allow dynamic switch between tick/virtual based cputime accounting · 3f4724ea
      Frederic Weisbecker committed
      Allow switching dynamically between tick and virtual based
      cputime accounting. This way we can provide a kind of "on-demand"
      virtual based cputime accounting. In this mode, the kernel relies
      on the context tracking subsystem to dynamically probe on kernel
      boundaries.
      
      This is in preparation for being able to stop the timer tick in
      more places than just the idle state. Doing so will depend on
      CONFIG_VIRT_CPU_ACCOUNTING_GEN which makes it possible to account
      the cputime without the tick by hooking on kernel/user boundaries.
      
      Depending on whether the tick is stopped or not, we can switch between
      tick and vtime based accounting anytime in order to minimize the
      overhead associated with user hooks.
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Li Zhong <zhong@linux.vnet.ibm.com>
      Cc: Namhyung Kim <namhyung.kim@lge.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      3f4724ea
    • F
      cputime: Generic on-demand virtual cputime accounting · abf917cd
      Frederic Weisbecker committed
      If we want to stop the tick beyond idle, we need to be
      able to account the cputime without using the tick.
      
      Virtual based cputime accounting solves that problem by
      hooking into kernel/user boundaries.
      
      However, implementing CONFIG_VIRT_CPU_ACCOUNTING requires
      low level hooks and involves more overhead. But we already
      have a generic context tracking subsystem that is required
      for RCU needs by archs which plan to shut down the tick
      outside idle.
      
      This patch implements a generic virtual based cputime
      accounting that relies on these generic kernel/user hooks.
      
      There are some upsides of doing this:
      
      - This requires no arch code to implement CONFIG_VIRT_CPU_ACCOUNTING
      if context tracking is already built (already necessary for RCU in full
      tickless mode).
      
      - We can rely on the generic context tracking subsystem to dynamically
      (de)activate the hooks, so that we can switch anytime between virtual
      and tick based accounting. This way we don't have the overhead
      of the virtual accounting when the tick is running periodically.
      
      And one downside:
      
      - There is probably more overhead than a native virtual based cputime
      accounting. But this relies on hooks that are already set anyway.
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Li Zhong <zhong@linux.vnet.ibm.com>
      Cc: Namhyung Kim <namhyung.kim@lge.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      abf917cd
    • F
      cputime: Move default nsecs_to_cputime() to jiffies based cputime file · ae8dda5c
      Frederic Weisbecker committed
      If the architecture doesn't provide an implementation of
      nsecs_to_cputime(), the cputime accounting core uses a
      default one that converts the nanoseconds to jiffies. However
      this only makes sense if we use the jiffies based cputime.
      
      For now it doesn't matter much because this API is only
      called on code that uses jiffies based cputime accounting.
      
      But the code may evolve and this API may be used more
      broadly in the future. Keeping this default implementation
      around is very error prone as it may introduce a bug and
      hide it on architectures that don't override this API.
      
      Fix this by moving this definition to the jiffies based
      cputime headers, as that is the only place where it belongs.
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Li Zhong <zhong@linux.vnet.ibm.com>
      Cc: Namhyung Kim <namhyung.kim@lge.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      ae8dda5c
  2. 27 Jan, 2013 1 commit
    • F
      context_tracking: Export context state for generic vtime · 95a79fd4
      Frederic Weisbecker committed
      Export the context state: whether we run in user / kernel
      from the context tracking subsystem point of view.
      
      This is going to be used by the generic virtual cputime
      accounting subsystem that is needed to implement the full
      dynticks.
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Li Zhong <zhong@linux.vnet.ibm.com>
      Cc: Namhyung Kim <namhyung.kim@lge.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      95a79fd4
  3. 17 Jan, 2013 1 commit
    • T
      module, async: async_synchronize_full() on module init iff async is used · 774a1221
      Tejun Heo committed
      If the default iosched is built as module, the kernel may deadlock
      while trying to load the iosched module on device probe if the probing
      was running off async.  This is because async_synchronize_full() at
      the end of module init ends up waiting for the async job which
      initiated the module loading.
      
       async A				modprobe
      
       1. finds a device
       2. registers the block device
       3. request_module(default iosched)
      					4. modprobe in userland
      					5. load and init module
      					6. async_synchronize_full()
      
      Async A waits for modprobe to finish in request_module() and modprobe
      waits for async A to finish in async_synchronize_full().
      
      Because there's no easy way to track dependencies once control goes out to
      userland, implementing properly nested flushing is difficult.  For
      now, make module init perform async_synchronize_full() iff module init
      has queued async jobs, as suggested by Linus.
      
      This avoids the described deadlock because iosched module doesn't use
      async and thus wouldn't invoke async_synchronize_full().  This is
      hacky and incomplete.  It will deadlock if async module loading nests;
      however, this works around the known problem case and seems to be the
      best of bad options.
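
The "synchronize only if async was used" rule can be modelled in a few lines of user-space C. This is a rough sketch under stated assumptions: every name ending in `_sketch` is hypothetical, the flag is a plain boolean on a stand-in task structure, and nothing here reproduces the actual kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for per-task state. */
struct task_sketch {
    bool used_async; /* set when this task queues an async job */
    int sync_calls;  /* counts full synchronizations, for illustration */
};

/* Queuing an async job marks the calling task. */
static void async_schedule_sketch(struct task_sketch *cur)
{
    cur->used_async = true;
}

/* Stand-in for the full flush; here it only counts invocations. */
static void async_synchronize_full_sketch(struct task_sketch *cur)
{
    cur->sync_calls++;
}

typedef void (*module_init_fn)(struct task_sketch *);

/* Run a module's init function and flush only if it queued async work. */
static void do_init_module_sketch(struct task_sketch *cur, module_init_fn init)
{
    assert(init != NULL);
    cur->used_async = false;
    init(cur);
    if (cur->used_async)
        async_synchronize_full_sketch(cur);
}

/* An init function that queues nothing, like the iosched module above. */
static void iosched_init_sketch(struct task_sketch *cur)
{
    (void)cur;
}

/* An init function that does queue async work. */
static void driver_init_sketch(struct task_sketch *cur)
{
    async_schedule_sketch(cur);
}
```

In this model the iosched-style module never reaches the flush, so it cannot wait on the async job that triggered its own loading. A module that does queue async work still flushes, which is why the nested case remains unsolved.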
      
      For more details, please refer to the following thread.
      
        http://thread.gmane.org/gmane.linux.kernel/1420814
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Alex Riesen <raa.lkml@gmail.com>
      Tested-by: Ming Lei <ming.lei@canonical.com>
      Tested-by: Alex Riesen <raa.lkml@gmail.com>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      774a1221
  4. 15 Jan, 2013 1 commit
  5. 12 Jan, 2013 5 commits
    • A
      kernel/audit.c: avoid negative sleep durations · 82919919
      Andrew Morton committed
      audit_log_start() performs the same jiffies comparison in two places.
      If sufficient time has elapsed between the two comparisons, the second
      one produces a negative sleep duration:
      
        schedule_timeout: wrong timeout value fffffffffffffff0
        Pid: 6606, comm: trinity-child1 Not tainted 3.8.0-rc1+ #43
        Call Trace:
          schedule_timeout+0x305/0x340
          audit_log_start+0x311/0x470
          audit_log_exit+0x4b/0xfb0
          __audit_syscall_exit+0x25f/0x2c0
          sysret_audit+0x17/0x21
      
      Fix it by performing the comparison a single time.
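
The difference between comparing twice and comparing once can be shown with a small user-space model. This sketch makes several assumptions: `fake_jiffies` and `read_jiffies()` are hypothetical stand-ins for the kernel's tick counter, the clock is made to advance by 20 ticks on every read to force the race, and the unsigned-to-signed cast relies on the usual two's-complement behaviour.

```c
#include <assert.h>

/* Hypothetical tick counter that advances every time it is read. */
static unsigned long fake_jiffies;

static unsigned long read_jiffies(void)
{
    unsigned long v = fake_jiffies;

    fake_jiffies += 20; /* time passes between two reads */
    return v;
}

/* Racy pattern: the clock is sampled once for the check and again for
 * the sleep length, so the second result can be negative. */
static long sleep_ticks_racy(unsigned long deadline)
{
    if ((long)(deadline - read_jiffies()) > 0)
        return (long)(deadline - read_jiffies());
    return 0;
}

/* Fixed pattern: sample once, then reuse that single value. */
static long sleep_ticks_once(unsigned long deadline)
{
    long remaining = (long)(deadline - read_jiffies());

    return remaining > 0 ? remaining : 0;
}
```

With the counter at 90 and a deadline of 100, the racy version passes its check at tick 90 and then computes the sleep length at tick 110, which is negative. The single-sample version returns 10.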
      Reported-by: Dave Jones <davej@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Eric Paris <eparis@redhat.com>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      82919919
    • K
      audit: catch possible NULL audit buffers · 0644ec0c
      Kees Cook committed
      It's possible for audit_log_start() to return NULL.  Handle it in the
      various callers.
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: Jeff Layton <jlayton@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Julien Tinnes <jln@google.com>
      Cc: Will Drewry <wad@google.com>
      Cc: Steve Grubb <sgrubb@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0644ec0c
    • K
      audit: create explicit AUDIT_SECCOMP event type · 7b9205bd
      Kees Cook committed
      The seccomp path was using AUDIT_ANOM_ABEND from when seccomp mode 1
      could only kill a process.  While we still want to make sure an audit
      record is forced on a kill, this should use a separate record type since
      seccomp mode 2 introduces other behaviors.
      
      In the case of "handled" behaviors (process wasn't killed), only emit a
      record if the process is under inspection.  This change also fixes
      userspace examination of seccomp audit events, since it was considered
      malformed due to missing fields of the AUDIT_ANOM_ABEND event type.
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: Jeff Layton <jlayton@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Julien Tinnes <jln@google.com>
      Acked-by: Will Drewry <wad@chromium.org>
      Acked-by: Steve Grubb <sgrubb@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7b9205bd
    • J
      lockdep, rwsem: provide down_write_nest_lock() · 1b963c81
      Jiri Kosina committed
      down_write_nest_lock() provides a means to annotate a locking scenario
      where an outer lock is guaranteed to serialize the order in which nested
      locks are acquired.
      
      This is analogous to the already existing mutex_lock_nest_lock() and
      spin_lock_nest_lock().
      Signed-off-by: Jiri Kosina <jkosina@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1b963c81
    • S
      tracing: Fix regression with irqsoff tracer and tracing_on file · 2df8f8a6
      Steven Rostedt committed
      Commit 02404baf "tracing: Remove deprecated tracing_enabled file"
      removed the tracing_enabled file as it never worked properly and
      the tracing_on file should be used instead. But the tracing_on file
      didn't call into the tracer's start/stop routines like the
      tracing_enabled file did. This caused trace-cmd to break when it
      enabled the irqsoff tracer.
      
      If you just did "echo irqsoff > current_tracer" then it would work
      properly. But the tool trace-cmd disables tracing first by writing
      "0" into the tracing_on file. Then it writes "irqsoff" into
      current_tracer and then writes "1" into tracing_on. Unfortunately,
      the above commit changed the irqsoff tracer to check the tracing_on
      status instead of the tracing_enabled status. If it's disabled then
      it does not start the tracer internals.
      
      The problem is that writing "1" into tracing_on does not call the
      tracer's "start" routine like writing "1" into tracing_enabled did.
      This makes the irqsoff tracer not start when using the trace-cmd
      tool, and is a regression for userspace.
      
      The simple fix is to have the tracing_on file call the tracer's start()
      method when being enabled (and the stop() method when disabled).
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      2df8f8a6
  6. 11 Jan, 2013 1 commit
  7. 10 Jan, 2013 1 commit
    • S
      tracing: Fix regression of trace_options file setting · a8dd2176
      Steven Rostedt committed
      The latest change to allow trace options to be set on the command
      line also broke the trace_options file.
      
      The zeroing of the last byte of the option name that is echoed into
      the trace_option file was removed with the consolidation of some
      of the code. The compare between the option and what was written to
      the trace_options file fails because the string holding the data
      written doesn't terminate with a null character.
      
      A zero needs to be added to the end of the string copied from
      user space.
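
Why the missing terminator breaks the comparison can be illustrated in user-space C. This is a sketch under assumptions: `option_matches()` is a hypothetical helper, `memcpy()` stands in for the copy from user space, and the 64-byte buffer size is arbitrary.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy cnt raw bytes, terminate them, then compare
 * the result against an option name. Returns 1 on a match, 0 otherwise. */
static int option_matches(const char *src, size_t cnt, const char *option)
{
    char buf[64];

    if (cnt >= sizeof(buf))
        return 0;           /* leave room for the terminator */
    memcpy(buf, src, cnt);  /* stands in for the copy from user space */
    buf[cnt] = '\0';        /* the zero byte that had been dropped */
    return strcmp(buf, option) == 0;
}
```

Without the `buf[cnt] = '\0'` line, `strcmp()` would keep reading whatever bytes happen to follow the copied data, so the comparison could fail even when the written text matches the option name.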
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      a8dd2176
  8. 06 Jan, 2013 2 commits
  9. 05 Jan, 2013 1 commit
    • R
      printk: fix incorrect length from print_time() when seconds > 99999 · 35dac27c
      Roland Dreier committed
      print_prefix() passes a NULL buf to print_time() to get the length of
      the time prefix; when printk times are enabled, the current code just
      returns the constant 15, which matches the format "[%5lu.%06lu] " used
      to print the time value.  However, this is obviously incorrect when the
      whole seconds part of the time gets beyond 5 digits (100000 seconds is a
      bit more than a day of uptime).
      
      The simple fix is to use snprintf(NULL, 0, ...) to calculate the actual
      length of the time prefix.  This could be micro-optimized but it seems
      better to have simpler, more readable code here.
      
      The bug leads to the syslog system call miscomputing which messages fit
      into the userspace buffer.  If there are enough messages to fill
      log_buf_len and some have a timestamp >= 100000, dmesg may fail with:
      
          # dmesg
          klogctl: Bad address
      
      When this happens, strace shows that the failure is indeed EFAULT due to
      the kernel mistakenly accessing past the end of dmesg's buffer, since
      dmesg asks the kernel how big a buffer it needs, allocates a bit more,
      and then gets an error when it asks the kernel to fill it:
      
          syslog(0xa, 0, 0)                       = 1048576
          mmap(NULL, 1052672, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fa4d25d2000
          syslog(0x3, 0x7fa4d25d2010, 0x100008)   = -1 EFAULT (Bad address)
      
      As far as I can see, the bug has been there as long as print_time(),
      which comes from commit 084681d1 ("printk: flush continuation lines
      immediately to console") in 3.5-rc5.
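
The snprintf(NULL, 0, ...) technique mentioned above can be tried in user space with the same format string. This is a minimal sketch: `time_prefix_len()` and its seconds/microseconds parameters are hypothetical, and only the format "[%5lu.%06lu] " comes from the commit message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: length of the time prefix for a given timestamp.
 * With a NULL buffer and size 0, snprintf() writes nothing and returns
 * the number of characters the formatted output would need. */
static size_t time_prefix_len(unsigned long secs, unsigned long usecs)
{
    int n = snprintf(NULL, 0, "[%5lu.%06lu] ", secs, usecs);

    assert(n >= 0); /* a negative value would signal an encoding error */
    return (size_t)n;
}
```

For any seconds value up to 99999 the helper returns 15, matching the old constant. At 100000 seconds it returns 16, which is the case the hard-coded length got wrong.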
      Signed-off-by: Roland Dreier <roland@purestorage.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Joe Perches <joe@perches.com>
      Cc: Sylvain Munaut <s.munaut@whatever-company.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      35dac27c
  10. 26 Dec, 2012 1 commit
    • E
      pidns: Stop pid allocation when init dies · c876ad76
      Eric W. Biederman committed
      Oleg pointed out that in a pid namespace the sequence:
      - pid 1 becomes a zombie
      - setns(thepidns), fork,...
      - reaping pid 1.
      - The injected processes exiting.
      
      can lead to processes attempting to access their child reaper and
      instead following a stale pointer.
      
      That waitpid for init can return before all of the processes in
      the pid namespace have exited is also unfortunate.
      
      Avoid these problems by disabling the allocation of new pids in a pid
      namespace when init dies, instead of when the last process in a pid
      namespace is reaped.
      Pointed-out-by: Oleg Nesterov <oleg@redhat.com>
      Reviewed-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      c876ad76
  11. 25 Dec, 2012 1 commit
  12. 21 Dec, 2012 2 commits
  13. 20 Dec, 2012 7 commits
  14. 19 Dec, 2012 3 commits
  15. 18 Dec, 2012 7 commits