1. 07 8月, 2014 7 次提交
    • L
      printk: move power of 2 practice of ring buffer size to a helper · c0a318a3
      Luis R. Rodriguez 提交于
      In practice the power of 2 practice of the size of the kernel ring
      buffer remains purely historical but not a requirement, specially now
      that we have LOG_ALIGN and use it for both static and dynamic
      allocations.  It could have helped with implicit alignment back in the
      days given the even the dynamically sized ring buffer was guaranteed to
      be aligned so long as CONFIG_LOG_BUF_SHIFT was set to produce a
      __LOG_BUF_LEN which is architecture aligned, since log_buf_len=n would
      be allowed only if it was > __LOG_BUF_LEN and we always ended up
      rounding the log_buf_len=n to the next power of 2 with
      roundup_pow_of_two(), any multiple of 2 then should be also architecture
      aligned.  These assumptions of course relied heavily on
      CONFIG_LOG_BUF_SHIFT producing an aligned value but users can always
      change this.
      
      We now have precise alignment requirements set for the log buffer size
      for both static and dynamic allocations, but lets upkeep the old
      practice of using powers of 2 for its size to help with easy expected
      scalable values and the allocators for dynamic allocations.  We'll reuse
      this later so move this into a helper.
      Signed-off-by: NLuis R. Rodriguez <mcgrof@suse.com>
      Cc: Andrew Lunn <andrew@lunn.ch>
      Cc: Stephen Warren <swarren@wwwdotorg.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Petr Mladek <pmladek@suse.cz>
      Cc: Joe Perches <joe@perches.com>
      Cc: Arun KS <arunks.linux@gmail.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Davidlohr Bueso <davidlohr@hp.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c0a318a3
    • L
      printk: make dynamic kernel ring buffer alignment explicit · 70300177
      Luis R. Rodriguez 提交于
      We have to consider alignment for the ring buffer both for the default
      static size, and then also for when an dynamic allocation is made when
      the log_buf_len=n kernel parameter is passed to set the size
      specifically to a size larger than the default size set by the
      architecture through CONFIG_LOG_BUF_SHIFT.
      
      The default static kernel ring buffer can be aligned properly if
      architectures set CONFIG_LOG_BUF_SHIFT properly, we provide ranges for
      the size though so even if CONFIG_LOG_BUF_SHIFT has a sensible aligned
      value it can be reduced to a non aligned value.  Commit 6ebb017d
      ("printk: Fix alignment of buf causing crash on ARM EABI") by Andrew
      Lunn ensures the static buffer is always aligned and the decision of
      alignment is done by the compiler by using __alignof__(struct log).
      
      When log_buf_len=n is used we allocate the ring buffer dynamically.
      Dynamic allocation varies, for the early allocation called before
      setup_arch() memblock_virt_alloc() requests a page aligment and for the
      default kernel allocation memblock_virt_alloc_nopanic() requests no
      special alignment, which in turn ends up aligning the allocation to
      SMP_CACHE_BYTES, which is L1 cache aligned.
      
      Since we already have the required alignment for the kernel ring buffer
      though we can do better and request explicit alignment for LOG_ALIGN.
      This does that to be safe and make dynamic allocation alignment
      explicit.
      Signed-off-by: NLuis R. Rodriguez <mcgrof@suse.com>
      Tested-by: NPetr Mladek <pmladek@suse.cz>
      Acked-by: NPetr Mladek <pmladek@suse.cz>
      Cc: Andrew Lunn <andrew@lunn.ch>
      Cc: Stephen Warren <swarren@wwwdotorg.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Petr Mladek <pmladek@suse.cz>
      Cc: Joe Perches <joe@perches.com>
      Cc: Arun KS <arunks.linux@gmail.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Davidlohr Bueso <davidlohr@hp.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      70300177
    • S
      kernel/smp.c:on_each_cpu_cond(): fix warning in fallback path · 618fde87
      Sasha Levin 提交于
      The rarely-executed memry-allocation-failed callback path generates a
      WARN_ON_ONCE() when smp_call_function_single() succeeds.  Presumably
      it's supposed to warn on failures.
      Signed-off-by: NSasha Levin <sasha.levin@oracle.com>
      Cc: Christoph Lameter <cl@gentwo.org>
      Cc: Gilad Ben-Yossef <gilad@benyossef.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Tejun Heo <htejun@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      618fde87
    • D
      mm, oom: remove unnecessary exit_state check · fb794bcb
      David Rientjes 提交于
      The oom killer scans each process and determines whether it is eligible
      for oom kill or whether the oom killer should abort because of
      concurrent memory freeing.  It will abort when an eligible process is
      found to have TIF_MEMDIE set, meaning it has already been oom killed and
      we're waiting for it to exit.
      
      Processes with task->mm == NULL should not be considered because they
      are either kthreads or have already detached their memory and killing
      them would not lead to memory freeing.  That memory is only freed after
      exit_mm() has returned, however, and not when task->mm is first set to
      NULL.
      
      Clear TIF_MEMDIE after exit_mm()'s mmput() so that an oom killed process
      is no longer considered for oom kill, but only until exit_mm() has
      returned.  This was fragile in the past because it relied on
      exit_notify() to be reached before no longer considering TIF_MEMDIE
      processes.
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fb794bcb
    • D
      mm, hugetlb: remove hugetlb_zero and hugetlb_infinity · ed4d4902
      David Rientjes 提交于
      They are unnecessary: "zero" can be used in place of "hugetlb_zero" and
      passing extra2 == NULL is equivalent to infinity.
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Reviewed-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reviewed-by: NLuiz Capitulino <lcapitulino@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ed4d4902
    • F
      kernel/watchdog.c: convert printk/pr_warning to pr_foo() · 656c3b79
      Fabian Frederick 提交于
      Replace some obsolete functions.
      Signed-off-by: NFabian Frederick <fabf@skynet.be>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      656c3b79
    • F
      kernel/auditfilter.c: replace count*size kmalloc by kcalloc · bab5e2d6
      Fabian Frederick 提交于
      kcalloc manages count*sizeof overflow.
      Signed-off-by: NFabian Frederick <fabf@skynet.be>
      Cc: Eric Paris <eparis@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bab5e2d6
  2. 03 8月, 2014 3 次提交
    • A
      net: filter: split 'struct sk_filter' into socket and bpf parts · 7ae457c1
      Alexei Starovoitov 提交于
      clean up names related to socket filtering and bpf in the following way:
      - everything that deals with sockets keeps 'sk_*' prefix
      - everything that is pure BPF is changed to 'bpf_*' prefix
      
      split 'struct sk_filter' into
      struct sk_filter {
      	atomic_t        refcnt;
      	struct rcu_head rcu;
      	struct bpf_prog *prog;
      };
      and
      struct bpf_prog {
              u32                     jited:1,
                                      len:31;
              struct sock_fprog_kern  *orig_prog;
              unsigned int            (*bpf_func)(const struct sk_buff *skb,
                                                  const struct bpf_insn *filter);
              union {
                      struct sock_filter      insns[0];
                      struct bpf_insn         insnsi[0];
                      struct work_struct      work;
              };
      };
      so that 'struct bpf_prog' can be used independent of sockets and cleans up
      'unattached' bpf use cases
      
      split SK_RUN_FILTER macro into:
          SK_RUN_FILTER to be used with 'struct sk_filter *' and
          BPF_PROG_RUN to be used with 'struct bpf_prog *'
      
      __sk_filter_release(struct sk_filter *) gains
      __bpf_prog_release(struct bpf_prog *) helper function
      
      also perform related renames for the functions that work
      with 'struct bpf_prog *', since they're on the same lines:
      
      sk_filter_size -> bpf_prog_size
      sk_filter_select_runtime -> bpf_prog_select_runtime
      sk_filter_free -> bpf_prog_free
      sk_unattached_filter_create -> bpf_prog_create
      sk_unattached_filter_destroy -> bpf_prog_destroy
      sk_store_orig_filter -> bpf_prog_store_orig_filter
      sk_release_orig_filter -> bpf_release_orig_filter
      __sk_migrate_filter -> bpf_migrate_filter
      __sk_prepare_filter -> bpf_prepare_filter
      
      API for attaching classic BPF to a socket stays the same:
      sk_attach_filter(prog, struct sock *)/sk_detach_filter(struct sock *)
      and SK_RUN_FILTER(struct sk_filter *, ctx) to execute a program
      which is used by sockets, tun, af_packet
      
      API for 'unattached' BPF programs becomes:
      bpf_prog_create(struct bpf_prog **)/bpf_prog_destroy(struct bpf_prog *)
      and BPF_PROG_RUN(struct bpf_prog *, ctx) to execute a program
      which is used by isdn, ppp, team, seccomp, ptp, xt_bpf, cls_bpf, test_bpf
      Signed-off-by: NAlexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7ae457c1
    • A
      net: filter: rename sk_convert_filter() -> bpf_convert_filter() · 8fb575ca
      Alexei Starovoitov 提交于
      to indicate that this function is converting classic BPF into eBPF
      and not related to sockets
      Signed-off-by: NAlexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8fb575ca
    • A
      net: filter: rename sk_chk_filter() -> bpf_check_classic() · 4df95ff4
      Alexei Starovoitov 提交于
      trivial rename to indicate that this functions performs classic BPF checking
      Signed-off-by: NAlexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4df95ff4
  3. 01 8月, 2014 3 次提交
    • J
      timer: Fix lock inversion between hrtimer_bases.lock and scheduler locks · 504d5874
      Jan Kara 提交于
      clockevents_increase_min_delta() calls printk() from under
      hrtimer_bases.lock. That causes lock inversion on scheduler locks because
      printk() can call into the scheduler. Lockdep puts it as:
      
      ======================================================
      [ INFO: possible circular locking dependency detected ]
      3.15.0-rc8-06195-g939f04be #2 Not tainted
      -------------------------------------------------------
      trinity-main/74 is trying to acquire lock:
       (&port_lock_key){-.....}, at: [<811c60be>] serial8250_console_write+0x8c/0x10c
      
      but task is already holding lock:
       (hrtimer_bases.lock){-.-...}, at: [<8103caeb>] hrtimer_try_to_cancel+0x13/0x66
      
      which lock already depends on the new lock.
      
      the existing dependency chain (in reverse order) is:
      
      -> #5 (hrtimer_bases.lock){-.-...}:
             [<8104a942>] lock_acquire+0x92/0x101
             [<8142f11d>] _raw_spin_lock_irqsave+0x2e/0x3e
             [<8103c918>] __hrtimer_start_range_ns+0x1c/0x197
             [<8107ec20>] perf_swevent_start_hrtimer.part.41+0x7a/0x85
             [<81080792>] task_clock_event_start+0x3a/0x3f
             [<810807a4>] task_clock_event_add+0xd/0x14
             [<8108259a>] event_sched_in+0xb6/0x17a
             [<810826a2>] group_sched_in+0x44/0x122
             [<81082885>] ctx_sched_in.isra.67+0x105/0x11f
             [<810828e6>] perf_event_sched_in.isra.70+0x47/0x4b
             [<81082bf6>] __perf_install_in_context+0x8b/0xa3
             [<8107eb8e>] remote_function+0x12/0x2a
             [<8105f5af>] smp_call_function_single+0x2d/0x53
             [<8107e17d>] task_function_call+0x30/0x36
             [<8107fb82>] perf_install_in_context+0x87/0xbb
             [<810852c9>] SYSC_perf_event_open+0x5c6/0x701
             [<810856f9>] SyS_perf_event_open+0x17/0x19
             [<8142f8ee>] syscall_call+0x7/0xb
      
      -> #4 (&ctx->lock){......}:
             [<8104a942>] lock_acquire+0x92/0x101
             [<8142f04c>] _raw_spin_lock+0x21/0x30
             [<81081df3>] __perf_event_task_sched_out+0x1dc/0x34f
             [<8142cacc>] __schedule+0x4c6/0x4cb
             [<8142cae0>] schedule+0xf/0x11
             [<8142f9a6>] work_resched+0x5/0x30
      
      -> #3 (&rq->lock){-.-.-.}:
             [<8104a942>] lock_acquire+0x92/0x101
             [<8142f04c>] _raw_spin_lock+0x21/0x30
             [<81040873>] __task_rq_lock+0x33/0x3a
             [<8104184c>] wake_up_new_task+0x25/0xc2
             [<8102474b>] do_fork+0x15c/0x2a0
             [<810248a9>] kernel_thread+0x1a/0x1f
             [<814232a2>] rest_init+0x1a/0x10e
             [<817af949>] start_kernel+0x303/0x308
             [<817af2ab>] i386_start_kernel+0x79/0x7d
      
      -> #2 (&p->pi_lock){-.-...}:
             [<8104a942>] lock_acquire+0x92/0x101
             [<8142f11d>] _raw_spin_lock_irqsave+0x2e/0x3e
             [<810413dd>] try_to_wake_up+0x1d/0xd6
             [<810414cd>] default_wake_function+0xb/0xd
             [<810461f3>] __wake_up_common+0x39/0x59
             [<81046346>] __wake_up+0x29/0x3b
             [<811b8733>] tty_wakeup+0x49/0x51
             [<811c3568>] uart_write_wakeup+0x17/0x19
             [<811c5dc1>] serial8250_tx_chars+0xbc/0xfb
             [<811c5f28>] serial8250_handle_irq+0x54/0x6a
             [<811c5f57>] serial8250_default_handle_irq+0x19/0x1c
             [<811c56d8>] serial8250_interrupt+0x38/0x9e
             [<810510e7>] handle_irq_event_percpu+0x5f/0x1e2
             [<81051296>] handle_irq_event+0x2c/0x43
             [<81052cee>] handle_level_irq+0x57/0x80
             [<81002a72>] handle_irq+0x46/0x5c
             [<810027df>] do_IRQ+0x32/0x89
             [<8143036e>] common_interrupt+0x2e/0x33
             [<8142f23c>] _raw_spin_unlock_irqrestore+0x3f/0x49
             [<811c25a4>] uart_start+0x2d/0x32
             [<811c2c04>] uart_write+0xc7/0xd6
             [<811bc6f6>] n_tty_write+0xb8/0x35e
             [<811b9beb>] tty_write+0x163/0x1e4
             [<811b9cd9>] redirected_tty_write+0x6d/0x75
             [<810b6ed6>] vfs_write+0x75/0xb0
             [<810b7265>] SyS_write+0x44/0x77
             [<8142f8ee>] syscall_call+0x7/0xb
      
      -> #1 (&tty->write_wait){-.....}:
             [<8104a942>] lock_acquire+0x92/0x101
             [<8142f11d>] _raw_spin_lock_irqsave+0x2e/0x3e
             [<81046332>] __wake_up+0x15/0x3b
             [<811b8733>] tty_wakeup+0x49/0x51
             [<811c3568>] uart_write_wakeup+0x17/0x19
             [<811c5dc1>] serial8250_tx_chars+0xbc/0xfb
             [<811c5f28>] serial8250_handle_irq+0x54/0x6a
             [<811c5f57>] serial8250_default_handle_irq+0x19/0x1c
             [<811c56d8>] serial8250_interrupt+0x38/0x9e
             [<810510e7>] handle_irq_event_percpu+0x5f/0x1e2
             [<81051296>] handle_irq_event+0x2c/0x43
             [<81052cee>] handle_level_irq+0x57/0x80
             [<81002a72>] handle_irq+0x46/0x5c
             [<810027df>] do_IRQ+0x32/0x89
             [<8143036e>] common_interrupt+0x2e/0x33
             [<8142f23c>] _raw_spin_unlock_irqrestore+0x3f/0x49
             [<811c25a4>] uart_start+0x2d/0x32
             [<811c2c04>] uart_write+0xc7/0xd6
             [<811bc6f6>] n_tty_write+0xb8/0x35e
             [<811b9beb>] tty_write+0x163/0x1e4
             [<811b9cd9>] redirected_tty_write+0x6d/0x75
             [<810b6ed6>] vfs_write+0x75/0xb0
             [<810b7265>] SyS_write+0x44/0x77
             [<8142f8ee>] syscall_call+0x7/0xb
      
      -> #0 (&port_lock_key){-.....}:
             [<8104a62d>] __lock_acquire+0x9ea/0xc6d
             [<8104a942>] lock_acquire+0x92/0x101
             [<8142f11d>] _raw_spin_lock_irqsave+0x2e/0x3e
             [<811c60be>] serial8250_console_write+0x8c/0x10c
             [<8104e402>] call_console_drivers.constprop.31+0x87/0x118
             [<8104f5d5>] console_unlock+0x1d7/0x398
             [<8104fb70>] vprintk_emit+0x3da/0x3e4
             [<81425f76>] printk+0x17/0x19
             [<8105bfa0>] clockevents_program_min_delta+0x104/0x116
             [<8105c548>] clockevents_program_event+0xe7/0xf3
             [<8105cc1c>] tick_program_event+0x1e/0x23
             [<8103c43c>] hrtimer_force_reprogram+0x88/0x8f
             [<8103c49e>] __remove_hrtimer+0x5b/0x79
             [<8103cb21>] hrtimer_try_to_cancel+0x49/0x66
             [<8103cb4b>] hrtimer_cancel+0xd/0x18
             [<8107f102>] perf_swevent_cancel_hrtimer.part.60+0x2b/0x30
             [<81080705>] task_clock_event_stop+0x20/0x64
             [<81080756>] task_clock_event_del+0xd/0xf
             [<81081350>] event_sched_out+0xab/0x11e
             [<810813e0>] group_sched_out+0x1d/0x66
             [<81081682>] ctx_sched_out+0xaf/0xbf
             [<81081e04>] __perf_event_task_sched_out+0x1ed/0x34f
             [<8142cacc>] __schedule+0x4c6/0x4cb
             [<8142cae0>] schedule+0xf/0x11
             [<8142f9a6>] work_resched+0x5/0x30
      
      other info that might help us debug this:
      
      Chain exists of:
        &port_lock_key --> &ctx->lock --> hrtimer_bases.lock
      
       Possible unsafe locking scenario:
      
             CPU0                    CPU1
             ----                    ----
        lock(hrtimer_bases.lock);
                                     lock(&ctx->lock);
                                     lock(hrtimer_bases.lock);
        lock(&port_lock_key);
      
       *** DEADLOCK ***
      
      4 locks held by trinity-main/74:
       #0:  (&rq->lock){-.-.-.}, at: [<8142c6f3>] __schedule+0xed/0x4cb
       #1:  (&ctx->lock){......}, at: [<81081df3>] __perf_event_task_sched_out+0x1dc/0x34f
       #2:  (hrtimer_bases.lock){-.-...}, at: [<8103caeb>] hrtimer_try_to_cancel+0x13/0x66
       #3:  (console_lock){+.+...}, at: [<8104fb5d>] vprintk_emit+0x3c7/0x3e4
      
      stack backtrace:
      CPU: 0 PID: 74 Comm: trinity-main Not tainted 3.15.0-rc8-06195-g939f04be #2
       00000000 81c3a310 8b995c14 81426f69 8b995c44 81425a99 8161f671 8161f570
       8161f538 8161f559 8161f538 8b995c78 8b142bb0 00000004 8b142fdc 8b142bb0
       8b995ca8 8104a62d 8b142fac 000016f2 81c3a310 00000001 00000001 00000003
      Call Trace:
       [<81426f69>] dump_stack+0x16/0x18
       [<81425a99>] print_circular_bug+0x18f/0x19c
       [<8104a62d>] __lock_acquire+0x9ea/0xc6d
       [<8104a942>] lock_acquire+0x92/0x101
       [<811c60be>] ? serial8250_console_write+0x8c/0x10c
       [<811c6032>] ? wait_for_xmitr+0x76/0x76
       [<8142f11d>] _raw_spin_lock_irqsave+0x2e/0x3e
       [<811c60be>] ? serial8250_console_write+0x8c/0x10c
       [<811c60be>] serial8250_console_write+0x8c/0x10c
       [<8104af87>] ? lock_release+0x191/0x223
       [<811c6032>] ? wait_for_xmitr+0x76/0x76
       [<8104e402>] call_console_drivers.constprop.31+0x87/0x118
       [<8104f5d5>] console_unlock+0x1d7/0x398
       [<8104fb70>] vprintk_emit+0x3da/0x3e4
       [<81425f76>] printk+0x17/0x19
       [<8105bfa0>] clockevents_program_min_delta+0x104/0x116
       [<8105cc1c>] tick_program_event+0x1e/0x23
       [<8103c43c>] hrtimer_force_reprogram+0x88/0x8f
       [<8103c49e>] __remove_hrtimer+0x5b/0x79
       [<8103cb21>] hrtimer_try_to_cancel+0x49/0x66
       [<8103cb4b>] hrtimer_cancel+0xd/0x18
       [<8107f102>] perf_swevent_cancel_hrtimer.part.60+0x2b/0x30
       [<81080705>] task_clock_event_stop+0x20/0x64
       [<81080756>] task_clock_event_del+0xd/0xf
       [<81081350>] event_sched_out+0xab/0x11e
       [<810813e0>] group_sched_out+0x1d/0x66
       [<81081682>] ctx_sched_out+0xaf/0xbf
       [<81081e04>] __perf_event_task_sched_out+0x1ed/0x34f
       [<8104416d>] ? __dequeue_entity+0x23/0x27
       [<81044505>] ? pick_next_task_fair+0xb1/0x120
       [<8142cacc>] __schedule+0x4c6/0x4cb
       [<81047574>] ? trace_hardirqs_off_caller+0xd7/0x108
       [<810475b0>] ? trace_hardirqs_off+0xb/0xd
       [<81056346>] ? rcu_irq_exit+0x64/0x77
      
      Fix the problem by using printk_deferred() which does not call into the
      scheduler.
      Reported-by: NFengguang Wu <fengguang.wu@intel.com>
      Signed-off-by: NJan Kara <jack@suse.cz>
      Cc: stable@vger.kernel.org
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      504d5874
    • T
      Revert "irq: Warn when shared interrupts do not match on NO_SUSPEND" · c6f12245
      Thomas Gleixner 提交于
      This reverts commit 4fae4e76.
      
      Undo because it breaks working systems.
      Requested-by: NRafael J. Wysocki <rjw@rjwysocki.net>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      c6f12245
    • T
      Revert "PM / sleep / irq: Do not suspend wakeup interrupts" · 21d1f908
      Thomas Gleixner 提交于
      This reverts commit d709f7bc.
      
      Undo, because it might break exisiting functionality.
      Requested-by: NRafael J. Wysocki <rjw@rjwysocki.net>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      21d1f908
  4. 31 7月, 2014 3 次提交
  5. 30 7月, 2014 1 次提交
  6. 29 7月, 2014 1 次提交
  7. 28 7月, 2014 5 次提交
  8. 25 7月, 2014 1 次提交
  9. 24 7月, 2014 16 次提交
    • P
      irq: Warn when shared interrupts do not match on NO_SUSPEND · 4fae4e76
      Peter Zijlstra 提交于
      When suspend_device_irqs() iterates all descriptors, its pointless if
      one has NO_SUSPEND set while another has not.
      
      Validate on request_irq() that NO_SUSPEND state maches for SHARED
      interrupts.
      Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
      Acked-by: N"Rafael J. Wysocki" <rjw@rjwysocki.net>
      Link: http://lkml.kernel.org/r/20140724133921.GY6758@twins.programming.kicks-ass.netSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      4fae4e76
    • S
      ftrace: Add warning if tramp hash does not match nr_trampolines · dc6f03f2
      Steven Rostedt (Red Hat) 提交于
      After adding all the records to the tramp_hash, add a check that makes
      sure that the number of records added matches the number of records
      expected to match and do a WARN_ON and disable ftrace if they do
      not match.
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      dc6f03f2
    • S
      ftrace: Fix trampoline hash update check on rec->flags · 2a0343ba
      Steven Rostedt (Red Hat) 提交于
      In the loop of ftrace_save_ops_tramp_hash(), it adds all the recs
      to the ops hash if the rec has only one callback attached and the
      ops is connected to the rec. It gives a nasty warning and shuts down
      ftrace if the rec doesn't have a trampoline set for it. But this
      can happen with the following scenario:
      
        # cd /sys/kernel/debug/tracing
        # echo schedule do_IRQ > set_ftrace_filter
        # mkdir instances/foo
        # echo schedule > instances/foo/set_ftrace_filter
        # echo function_graph > current_function
        # echo function > instances/foo/current_function
        # echo nop > instances/foo/current_function
      
      The above would then trigger the following warning and disable
      ftrace:
      
       ------------[ cut here ]------------
       WARNING: CPU: 0 PID: 3145 at kernel/trace/ftrace.c:2212 ftrace_run_update_code+0xe4/0x15b()
       Modules linked in: ipt_MASQUERADE sunrpc ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ip [...]
       CPU: 1 PID: 3145 Comm: bash Not tainted 3.16.0-rc3-test+ #136
       Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./To be filled by O.E.M., BIOS SDBLI944.86P 05/08/2007
        0000000000000000 ffffffff81808a88 ffffffff81502130 0000000000000000
        ffffffff81040ca1 ffff880077c08000 ffffffff810bd286 0000000000000001
        ffffffff81a56830 ffff88007a041be0 ffff88007a872d60 00000000000001be
       Call Trace:
        [<ffffffff81502130>] ? dump_stack+0x4a/0x75
        [<ffffffff81040ca1>] ? warn_slowpath_common+0x7e/0x97
        [<ffffffff810bd286>] ? ftrace_run_update_code+0xe4/0x15b
        [<ffffffff810bd286>] ? ftrace_run_update_code+0xe4/0x15b
        [<ffffffff810bda1a>] ? ftrace_shutdown+0x11c/0x16b
        [<ffffffff810bda87>] ? unregister_ftrace_function+0x1e/0x38
        [<ffffffff810cc7e1>] ? function_trace_reset+0x1a/0x28
        [<ffffffff810c924f>] ? tracing_set_tracer+0xc1/0x276
        [<ffffffff810c9477>] ? tracing_set_trace_write+0x73/0x91
        [<ffffffff81132383>] ? __sb_start_write+0x9a/0xcc
        [<ffffffff8120478f>] ? security_file_permission+0x1b/0x31
        [<ffffffff81130e49>] ? vfs_write+0xac/0x11c
        [<ffffffff8113115d>] ? SyS_write+0x60/0x8e
        [<ffffffff81508112>] ? system_call_fastpath+0x16/0x1b
       ---[ end trace 938c4415cbc7dc96 ]---
       ------------[ cut here ]------------
      
      Link: http://lkml.kernel.org/r/20140723120805.GB21376@redhat.comReported-by: NOleg Nesterov <oleg@redhat.com>
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      2a0343ba
    • E
      CAPABILITIES: remove undefined caps from all processes · 7d8b6c63
      Eric Paris 提交于
      This is effectively a revert of 7b9a7ec5
      plus fixing it a different way...
      
      We found, when trying to run an application from an application which
      had dropped privs that the kernel does security checks on undefined
      capability bits.  This was ESPECIALLY difficult to debug as those
      undefined bits are hidden from /proc/$PID/status.
      
      Consider a root application which drops all capabilities from ALL 4
      capability sets.  We assume, since the application is going to set
      eff/perm/inh from an array that it will clear not only the defined caps
      less than CAP_LAST_CAP, but also the higher 28ish bits which are
      undefined future capabilities.
      
      The BSET gets cleared differently.  Instead it is cleared one bit at a
      time.  The problem here is that in security/commoncap.c::cap_task_prctl()
      we actually check the validity of a capability being read.  So any task
      which attempts to 'read all things set in bset' followed by 'unset all
      things set in bset' will not even attempt to unset the undefined bits
      higher than CAP_LAST_CAP.
      
      So the 'parent' will look something like:
      CapInh:	0000000000000000
      CapPrm:	0000000000000000
      CapEff:	0000000000000000
      CapBnd:	ffffffc000000000
      
      All of this 'should' be fine.  Given that these are undefined bits that
      aren't supposed to have anything to do with permissions.  But they do...
      
      So lets now consider a task which cleared the eff/perm/inh completely
      and cleared all of the valid caps in the bset (but not the invalid caps
      it couldn't read out of the kernel).  We know that this is exactly what
      the libcap-ng library does and what the go capabilities library does.
      They both leave you in that above situation if you try to clear all of
      you capapabilities from all 4 sets.  If that root task calls execve()
      the child task will pick up all caps not blocked by the bset.  The bset
      however does not block bits higher than CAP_LAST_CAP.  So now the child
      task has bits in eff which are not in the parent.  These are
      'meaningless' undefined bits, but still bits which the parent doesn't
      have.
      
      The problem is now in cred_cap_issubset() (or any operation which does a
      subset test) as the child, while a subset for valid cap bits, is not a
      subset for invalid cap bits!  So now we set durring commit creds that
      the child is not dumpable.  Given it is 'more priv' than its parent.  It
      also means the parent cannot ptrace the child and other stupidity.
      
      The solution here:
      1) stop hiding capability bits in status
      	This makes debugging easier!
      
      2) stop giving any task undefined capability bits.  it's simple, it you
      don't put those invalid bits in CAP_FULL_SET you won't get them in init
      and you won't get them in any other task either.
      	This fixes the cap_issubset() tests and resulting fallout (which
      	made the init task in a docker container untraceable among other
      	things)
      
      3) mask out undefined bits when sys_capset() is called as it might use
      ~0, ~0 to denote 'all capabilities' for backward/forward compatibility.
      	This lets 'capsh --caps="all=eip" -- -c /bin/bash' run.
      
      4) mask out undefined bit when we read a file capability off of disk as
      again likely all bits are set in the xattr for forward/backward
      compatibility.
      	This lets 'setcap all+pe /bin/bash; /bin/bash' run
      Signed-off-by: NEric Paris <eparis@redhat.com>
      Reviewed-by: NKees Cook <keescook@chromium.org>
      Cc: Andrew Vagin <avagin@openvz.org>
      Cc: Andrew G. Morgan <morgan@kernel.org>
      Cc: Serge E. Hallyn <serge.hallyn@canonical.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Steve Grubb <sgrubb@redhat.com>
      Cc: Dan Walsh <dwalsh@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NJames Morris <james.l.morris@oracle.com>
      7d8b6c63
    • S
      sched_clock: Avoid corrupting hrtimer tree during suspend · f723aa18
      Stephen Boyd 提交于
      During suspend we call sched_clock_poll() to update the epoch and
      accumulated time and reprogram the sched_clock_timer to fire
      before the next wrap-around time. Unfortunately,
      sched_clock_poll() doesn't restart the timer, instead it relies
      on the hrtimer layer to do that and during suspend we aren't
      calling that function from the hrtimer layer. Instead, we're
      reprogramming the expires time while the hrtimer is enqueued,
      which can cause the hrtimer tree to be corrupted. Furthermore, we
      restart the timer during suspend but we update the epoch during
      resume which seems counter-intuitive.
      
      Let's fix this by saving the accumulated state and canceling the
      timer during suspend. On resume we can update the epoch and
      restart the timer similar to what we would do if we were starting
      the clock for the first time.
      
      Fixes: a08ca5d1 "sched_clock: Use an hrtimer instead of timer"
      Signed-off-by: NStephen Boyd <sboyd@codeaurora.org>
      Signed-off-by: NJohn Stultz <john.stultz@linaro.org>
      Link: http://lkml.kernel.org/r/1406174630-23458-1-git-send-email-john.stultz@linaro.org
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: stable <stable@vger.kernel.org>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      f723aa18
    • A
      net: filter: split filter.c into two files · f5bffecd
      Alexei Starovoitov 提交于
      BPF is used in several kernel components. This split creates logical boundary
      between generic eBPF core and the rest
      
      kernel/bpf/core.c: eBPF interpreter
      
      net/core/filter.c: classic->eBPF converter, classic verifiers, socket filters
      
      This patch only moves functions.
      Signed-off-by: NAlexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f5bffecd
    • S
      ring-buffer: Use rb_page_size() instead of open coded head_page size · 10e83fd0
      Steven Rostedt (Red Hat) 提交于
      There's a helper function to get a ring buffer page size (the number
      of bytes of data recorded on the page), called rb_page_size().
      Use that instead of open coding it.
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      10e83fd0
    • J
      timekeeping: Use cached ntp_tick_length when accumulating error · 375f45b5
      John Stultz 提交于
      By caching the ntp_tick_length() when we correct the frequency error,
      and then using that cached value to accumulate error, we avoid large
      initial errors when the tick length is changed.
      
      This makes convergence happen much faster in the simulator, since the
      initial error doesn't have to be slowly whittled away.
      
      This initially seems like an accounting error, but Miroslav pointed out
      that ntp_tick_length() can change mid-tick, so when we apply it in the
      error accumulation, we are applying any recent change to the entire tick.
      
      This approach chooses to apply changes in the ntp_tick_length() only to
      the next tick, which allows us to calculate the freq correction before
      using the new tick length, which avoids accummulating error.
      
      Credit to Miroslav for pointing this out and providing the original patch
      this functionality has been pulled out from, along with the rational.
      
      Cc: Miroslav Lichvar <mlichvar@redhat.com>
      Cc: Richard Cochran <richardcochran@gmail.com>
      Cc: Prarit Bhargava <prarit@redhat.com>
      Reported-by: NMiroslav Lichvar <mlichvar@redhat.com>
      Signed-off-by: NJohn Stultz <john.stultz@linaro.org>
      375f45b5
    • J
      timekeeping: Rework frequency adjustments to work better w/ nohz · dc491596
      John Stultz 提交于
      The existing timekeeping_adjust logic has always been complicated
      to understand. Further, since it was developed prior to NOHZ becoming
      common, its not surprising it performs poorly when NOHZ is enabled.
      
      Since Miroslav pointed out the problematic nature of the existing code
      in the NOHZ case, I've tried to refactor the code to perform better.
      
      The problem with the previous approach was that it tried to adjust
      for the total cumulative error using a scaled dampening factor. This
      resulted in large errors to be corrected slowly, while small errors
      were corrected quickly. With NOHZ the timekeeping code doesn't know
      how far out the next tick will be, so this results in bad
      over-correction to small errors, and insufficient correction to large
      errors.
      
      Inspired by Miroslav's patch, I've refactored the code to try to
      address the correction in two steps.
      
      1) Check the future freq error for the next tick, and if the frequency
      error is large, try to make sure we correct it so it doesn't cause
      much accumulated error.
      
      2) Then make a small single unit adjustment to correct any cumulative
      error that has collected over time.
      
      This method performs fairly well in the simulator Miroslav created.
      
      Major credit to Miroslav for pointing out the issue, providing the
      original patch to resolve this, a simulator for testing, as well as
      helping debug and resolve issues in my implementation so that it
      performed closer to his original implementation.
      
      Cc: Miroslav Lichvar <mlichvar@redhat.com>
      Cc: Richard Cochran <richardcochran@gmail.com>
      Cc: Prarit Bhargava <prarit@redhat.com>
      Reported-by: NMiroslav Lichvar <mlichvar@redhat.com>
      Signed-off-by: NJohn Stultz <john.stultz@linaro.org>
      dc491596
    • J
      timekeeping: Minor fixup for timespec64->timespec assignment · e2dff1ec
      John Stultz 提交于
      In the GENERIC_TIME_VSYSCALL_OLD update_vsyscall implementation,
      we take the tk_xtime() value, which returns a timespec64, and
      store it in a timespec.
      
      This luckily is ok, since the only architectures that use
      GENERIC_TIME_VSYSCALL_OLD are ia64 and ppc64, which are both
      64 bit systems where timespec64 is the same as a timespec.
      
      Even so, for cleanliness reasons, use the conversion function
      to assign the proper type.
      Signed-off-by: NJohn Stultz <john.stultz@linaro.org>
      e2dff1ec
    • T
      ftrace: Provide trace clocks monotonic · 1b3e5c09
      Thomas Gleixner 提交于
      Expose the new NMI safe accessor to clock monotonic to the tracer.
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: NJohn Stultz <john.stultz@linaro.org>
      1b3e5c09
    • T
      timekeeping: Provide fast and NMI safe access to CLOCK_MONOTONIC · 4396e058
      Thomas Gleixner 提交于
      Tracers want a correlated time between the kernel instrumentation and
      user space. We really do not want to export sched_clock() to user
      space, so we need to provide something sensible for this.
      
      Using separate data structures with an non blocking sequence count
      based update mechanism allows us to do that. The data structure
      required for the readout has a sequence counter and two copies of the
      timekeeping data.
      
      On the update side:
      
        smp_wmb();
        tkf->seq++;
        smp_wmb();
        update(tkf->base[0], tk);
        smp_wmb();
        tkf->seq++;
        smp_wmb();
        update(tkf->base[1], tk);
      
      On the reader side:
      
        do {
           seq = tkf->seq;
           smp_rmb();
           idx = seq & 0x01;
           now = now(tkf->base[idx]);
           smp_rmb();
        } while (seq != tkf->seq)
      
      So if a NMI hits the update of base[0] it will use base[1] which is
      still consistent, but this timestamp is not guaranteed to be monotonic
      across an update.
      
      The timestamp is calculated by:
      
      	now = base_mono + clock_delta * slope
      
      So if the update lowers the slope, readers who are forced to the
      not yet updated second array are still using the old steeper slope.
      
       tmono
       ^
       |    o  n
       |   o n
       |  u
       | o
       |o
       |12345678---> reader order
      
       o = old slope
       u = update
       n = new slope
      
      So reader 6 will observe time going backwards versus reader 5.
      
      While other CPUs are likely to be able observe that, the only way
      for a CPU local observation is when an NMI hits in the middle of
      the update. Timestamps taken from that NMI context might be ahead
      of the following timestamps. Callers need to be aware of that and
      deal with it.
      
      V2: Got rid of clock monotonic raw and reorganized the data
          structures. Folded in the barrier fix from Mathieu.
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: NJohn Stultz <john.stultz@linaro.org>
      4396e058
    • T
      timekeeping: Use tk_read_base as argument for timekeeping_get_ns() · 0e5ac3a8
      Thomas Gleixner 提交于
      All the function needs is in the tk_read_base struct. No functional
      change for the current code, just a preparatory patch for the NMI safe
      accessor to clock monotonic which will use struct tk_read_base as well.
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: NJohn Stultz <john.stultz@linaro.org>
      0e5ac3a8
    • T
      timekeeping: Create struct tk_read_base and use it in struct timekeeper · d28ede83
      Thomas Gleixner 提交于
      The members of the new struct are the required ones for the new NMI
      safe accessor to clcok monotonic. In order to reuse the existing
      timekeeping code and to make the update of the fast NMI safe
      timekeepers a simple memcpy use the struct for the timekeeper as well
      and convert all users.
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: NJohn Stultz <john.stultz@linaro.org>
      d28ede83
    • T
      timekeeping: Restructure the timekeeper some more · 6d3aadf3
      Thomas Gleixner 提交于
      Access to time requires to touch two cachelines at minimum
      
         1) The timekeeper data structure
      
         2) The clocksource data structure
      
      The access to the clocksource data structure can be avoided as almost
      all clocksource implementations ignore the argument to the read
      callback, which is a pointer to the clocksource.
      
      But the core needs to touch it to access the members @read and @mask.
      
      So we are better off by copying the @read function pointer and the
      @mask from the clocksource to the core data structure itself.
      
      For the most used ktime_get() access all required data including the
      @read and @mask copies fits together with the sequence counter into a
      single 64 byte cacheline.
      
      For the other time access functions we touch in the current code three
      cache lines in the worst case. But with the clocksource data copies we
      can reduce that to two adjacent cachelines, which is more efficient
      than disjunct cache lines.
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NJohn Stultz <john.stultz@linaro.org>
      6d3aadf3
    • T
      clocksource: Get rid of cycle_last · 4a0e6377
      Thomas Gleixner 提交于
      cycle_last was added to the clocksource to support the TSC
      validation. We moved that to the core code, so we can get rid of the
      extra copy.
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NJohn Stultz <john.stultz@linaro.org>
      4a0e6377