1. 14 11月, 2014 6 次提交
    • R
      tracing: Merge consecutive seq_puts calls · d79ac28f
      Rasmus Villemoes 提交于
      Consecutive seq_puts calls with literal strings can be merged to a
      single call. This reduces the size of the generated code, and can also
      lead to slight .rodata reduction (because of fewer nul and padding
      bytes). It should also shave a off a few clock cycles.
      
      Link: http://lkml.kernel.org/r/1415479332-25944-3-git-send-email-linux@rasmusvillemoes.dkSigned-off-by: NRasmus Villemoes <linux@rasmusvillemoes.dk>
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      d79ac28f
    • R
      tracing: Replace seq_printf by simpler equivalents · fa6f0cc7
      Rasmus Villemoes 提交于
      Using seq_printf to print a simple string or a single character is a
      lot more expensive than it needs to be, since seq_puts and seq_putc
      exist.
      
      These patches do
      
        seq_printf(m, s) -> seq_puts(m, s)
        seq_printf(m, "%s", s) -> seq_puts(m, s)
        seq_printf(m, "%c", c) -> seq_putc(m, c)
      
      Subsequent patches will simplify further.
      
      Link: http://lkml.kernel.org/r/1415479332-25944-2-git-send-email-linux@rasmusvillemoes.dkSigned-off-by: NRasmus Villemoes <linux@rasmusvillemoes.dk>
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      fa6f0cc7
    • D
      tracing: kdb: Fix kernel livelock with empty buffers · 8520dedb
      Daniel Thompson 提交于
      Currently kdb's ftdump command will livelock by constantly printk'ing
      the empty string at KERN_EMERG level if it run when the ftrace system is
      not in use. This occurs because trace_empty() never returns false when
      the ring buffers are left at the start of a non-consuming read [launched
      by ring_buffer_read_start()].
      
      This patch changes the loop exit condition to use the result of
      trace_find_next_entry_inc(). Effectively this switches the non-consuming
      kdb dumper to follow the approach of the non-consuming userspace
      interface [s_next()] rather than the consuming ftrace_dump().
      
      Link: http://lkml.kernel.org/r/1415277716-19419-3-git-send-email-daniel.thompson@linaro.org
      
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Sumit Semwal <sumit.semwal@linaro.org>
      Cc: Jason Wessel <jason.wessel@windriver.com>
      Signed-off-by: NDaniel Thompson <daniel.thompson@linaro.org>
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      8520dedb
    • D
      tracing: kdb: Fix kernel panic during ftdump · c270cc75
      Daniel Thompson 提交于
      Currently kdb's ftdump command unconditionally crashes due to a null
      pointer de-reference whenever the command is run. This in turn causes
      the kernel to panic.
      
      The abridged stacktrace (gathered with ARCH=arm) is:
      
      --- cut here ---
      [<c09535ac>] (panic) from [<c02132dc>] (die+0x264/0x440)
      [<c02132dc>] (die) from [<c0952eb8>]
      (__do_kernel_fault.part.11+0x74/0x84)
      [<c0952eb8>] (__do_kernel_fault.part.11) from [<c021f954>]
      (do_page_fault+0x1d0/0x3c4)
      [<c021f954>] (do_page_fault) from [<c020846c>] (do_DataAbort+0x48/0xac)
      
      [<c020846c>] (do_DataAbort) from [<c0213c58>] (__dabt_svc+0x38/0x60)
      Exception stack(0xc0deba88 to 0xc0debad0)
      ba80:                   e8c29180 00000001 e9854304 e9854300 c0f567d8
      c0df2580
      baa0: 00000000 00000000 00000000 c0f117b8 c0e3a3c0 c0debb0c 00000000
      c0debad0
      bac0: 0000672e c02f4d60 60000193 ffffffff
      [<c0213c58>] (__dabt_svc) from [<c02f4d60>] (kdb_ftdump+0x1e4/0x3d8)
      [<c02f4d60>] (kdb_ftdump) from [<c02ce328>] (kdb_parse+0x2b8/0x698)
      [<c02ce328>] (kdb_parse) from [<c02ceef0>] (kdb_main_loop+0x52c/0x784)
      [<c02ceef0>] (kdb_main_loop) from [<c02d1b0c>] (kdb_stub+0x238/0x490)
      --- cut here ---
      
      The NULL deref occurs due to the initialized use of struct trace_iter's
      buffer_iter member.
      
      This is a regression, albeit a fairly elderly one. It was introduced
      by commit 6d158a81 ("tracing: Remove NR_CPUS array from
      trace_iterator").
      
      This patch solves this by providing a collection of ring_buffer_iter(s)
      and using this to initialize buffer_iter. Note that static allocation
      is used solely because the trace_iter itself is also static allocated.
      Static allocation also means that we have to NULL-ify the pointer during
      cleanup to avoid use-after-free problems.
      
      Link: http://lkml.kernel.org/r/1415277716-19419-2-git-send-email-daniel.thompson@linaro.org
      
      Cc: Jason Wessel <jason.wessel@windriver.com>
      Signed-off-by: NDaniel Thompson <daniel.thompson@linaro.org>
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      c270cc75
    • L
      tracing: Fix traceoff_on_warning handling on boot command line · 933ff9f2
      Luis Claudio R. Goncalves 提交于
      According to the documentation, adding "traceoff_on_warning" to the boot
      command line should be enough to enable the feature. But right now it is
      necessary to specify "traceoff_on_warning=". Along with fixing that, also
      verify if the value passed, if any, is either "0" or "off".
      
      Link: http://lkml.kernel.org/r/20141112231400.GL12281@uudg.orgSigned-off-by: NLuis Claudio R. Goncalves <lgoncalv@redhat.com>
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      933ff9f2
    • S
      ftrace: Have the control_ops get a trampoline · fe578ba3
      Steven Rostedt (Red Hat) 提交于
      With the new logic, if only a single user of ftrace function hooks is
      used, it will get its own trampoline assigned to it.
      
      The problem is that the control_ops is an indirect ops that perf ops
      uses. What that means is that when perf registers its ops with
      register_ftrace_function(), it has the CONTROL flag set and gets added
      to the control list instead of the global ftrace list. The control_ops
      gets added to that instead and the mcount trampoline calls the control_ops
      function. The control_ops function will iterate the control list and
      call the ops functions that are attached to it.
      
      But currently the trampoline is added to the perf ops and not the
      control ops, and when ftrace tries to find a trampoline hook for it,
      it fails to find one and gives the following splat:
      
       ------------[ cut here ]------------
       WARNING: CPU: 0 PID: 10133 at kernel/trace/ftrace.c:2033 ftrace_get_addr_new+0x6f/0xc0()
       Modules linked in: [...]
       CPU: 0 PID: 10133 Comm: perf Tainted: P               3.18.0-rc1-test+ #388
       Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v02.05 05/07/2012
        00000000000007f1 ffff8800c2643bc8 ffffffff814fca6e ffff88011ea0ed01
        0000000000000000 ffff8800c2643c08 ffffffff81041ffd 0000000000000000
        ffffffff810c388c ffffffff81a5a350 ffff880119b00000 ffffffff810001c8
       Call Trace:
        [<ffffffff814fca6e>] dump_stack+0x46/0x58
        [<ffffffff81041ffd>] warn_slowpath_common+0x81/0x9b
        [<ffffffff810c388c>] ? ftrace_get_addr_new+0x6f/0xc0
        [<ffffffff810001c8>] ? 0xffffffff810001c8
        [<ffffffff81042031>] warn_slowpath_null+0x1a/0x1c
        [<ffffffff810c388c>] ftrace_get_addr_new+0x6f/0xc0
        [<ffffffff8102e938>] ftrace_replace_code+0xd6/0x334
        [<ffffffff810c4116>] ftrace_modify_all_code+0x41/0xc5
        [<ffffffff8102eba6>] arch_ftrace_update_code+0x10/0x19
        [<ffffffff810c293c>] ftrace_run_update_code+0x21/0x42
        [<ffffffff810c298f>] ftrace_startup_enable+0x32/0x34
        [<ffffffff810c3049>] ftrace_startup+0x14e/0x15a
        [<ffffffff810c307c>] register_ftrace_function+0x27/0x40
        [<ffffffff810dc118>] perf_ftrace_event_register+0x3e/0xee
        [<ffffffff810dbfbe>] perf_trace_init+0x29d/0x2a9
        [<ffffffff810eb422>] perf_tp_event_init+0x27/0x3a
        [<ffffffff810f18bc>] perf_init_event+0x9e/0xed
        [<ffffffff810f1ba4>] perf_event_alloc+0x299/0x330
        [<ffffffff810f236b>] SYSC_perf_event_open+0x3ee/0x816
        [<ffffffff8115a066>] ? mntput+0x2d/0x2f
        [<ffffffff81142b00>] ? __fput+0xa7/0x1b2
        [<ffffffff81091300>] ? do_gettimeofday+0x22/0x3a
        [<ffffffff810f279c>] SyS_perf_event_open+0x9/0xb
        [<ffffffff81502a92>] system_call_fastpath+0x12/0x17
       ---[ end trace 81a53565150e4982 ]---
       Bad trampoline accounting at: ffffffff810001c8 (run_init_process+0x0/0x2d) (10000001)
      
      Update the control_ops trampoline instead of the perf ops one.
      
      Reported-by: lkp@01.org
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      fe578ba3
  2. 12 11月, 2014 6 次提交
  3. 01 11月, 2014 2 次提交
    • S
      ftrace/x86: Show trampoline call function in enabled_functions · 15d5b02c
      Steven Rostedt (Red Hat) 提交于
      The file /sys/kernel/debug/tracing/eneabled_functions is used to debug
      ftrace function hooks. Add to the output what function is being called
      by the trampoline if the arch supports it.
      
      Add support for this feature in x86_64.
      
      Cc: H. Peter Anvin <hpa@linux.intel.com>
      Tested-by: NMasami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Tested-by: NJiri Kosina <jkosina@suse.cz>
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      15d5b02c
    • S
      ftrace/x86: Add dynamic allocated trampoline for ftrace_ops · f3bea491
      Steven Rostedt (Red Hat) 提交于
      The current method of handling multiple function callbacks is to register
      a list function callback that calls all the other callbacks based on
      their hash tables and compare it to the function that the callback was
      called on. But this is very inefficient.
      
      For example, if you are tracing all functions in the kernel and then
      add a kprobe to a function such that the kprobe uses ftrace, the
      mcount trampoline will switch from calling the function trace callback
      to calling the list callback that will iterate over all registered
      ftrace_ops (in this case, the function tracer and the kprobes callback).
      That means for every function being traced it checks the hash of the
      ftrace_ops for function tracing and kprobes, even though the kprobes
      is only set at a single function. The kprobes ftrace_ops is checked
      for every function being traced!
      
      Instead of calling the list function for functions that are only being
      traced by a single callback, we can call a dynamically allocated
      trampoline that calls the callback directly. The function graph tracer
      already uses a direct call trampoline when it is being traced by itself
      but it is not dynamically allocated. It's trampoline is static in the
      kernel core. The infrastructure that called the function graph trampoline
      can also be used to call a dynamically allocated one.
      
      For now, only ftrace_ops that are not dynamically allocated can have
      a trampoline. That is, users such as function tracer or stack tracer.
      kprobes and perf allocate their ftrace_ops, and until there's a safe
      way to free the trampoline, it can not be used. The dynamically allocated
      ftrace_ops may, although, use the trampoline if the kernel is not
      compiled with CONFIG_PREEMPT. But that will come later.
      Tested-by: NMasami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Tested-by: NJiri Kosina <jkosina@suse.cz>
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      f3bea491
  4. 25 10月, 2014 2 次提交
    • S
      ftrace: Fix checking of trampoline ftrace_ops in finding trampoline · 4fc40904
      Steven Rostedt (Red Hat) 提交于
      When modifying code, ftrace has several checks to make sure things
      are being done correctly. One of them is to make sure any code it
      modifies is exactly what it expects it to be before it modifies it.
      In order to do so with the new trampoline logic, it must be able
      to find out what trampoline a function is hooked to in order to
      see if the code that hooks to it is what's expected.
      
      The logic to find the trampoline from a record (accounting descriptor
      for a function that is hooked) needs to only look at the "old_hash"
      of an ops that is being modified. The old_hash is the list of function
      an ops is hooked to before its update. Since a record would only be
      pointing to an ops that is being modified if it was already hooked
      before.
      
      Currently, it can pick a modified ops based on its new functions it
      will be hooked to, and this picks the wrong trampoline and causes
      the check to fail, disabling ftrace.
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      
      ftrace: squash into ordering of ops for modification
      4fc40904
    • S
      ftrace: Set ops->old_hash on modifying what an ops hooks to · 8252ecf3
      Steven Rostedt (Red Hat) 提交于
      The code that checks for trampolines when modifying function hooks
      tests against a modified ops "old_hash". But the ops old_hash pointer
      is not being updated before the changes are made, making it possible
      to not find the right hash to the callback and possibly causing
      ftrace to break in accounting and disable itself.
      
      Have the ops set its old_hash before the modifying takes place.
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      8252ecf3
  5. 19 10月, 2014 1 次提交
    • C
      futex: Ensure get_futex_key_refs() always implies a barrier · 76835b0e
      Catalin Marinas 提交于
      Commit b0c29f79 (futexes: Avoid taking the hb->lock if there's
      nothing to wake up) changes the futex code to avoid taking a lock when
      there are no waiters. This code has been subsequently fixed in commit
      11d4616b (futex: revert back to the explicit waiter counting code).
      Both the original commit and the fix-up rely on get_futex_key_refs() to
      always imply a barrier.
      
      However, for private futexes, none of the cases in the switch statement
      of get_futex_key_refs() would be hit and the function completes without
      a memory barrier as required before checking the "waiters" in
      futex_wake() -> hb_waiters_pending(). The consequence is a race with a
      thread waiting on a futex on another CPU, allowing the waker thread to
      read "waiters == 0" while the waiter thread to have read "futex_val ==
      locked" (in kernel).
      
      Without this fix, the problem (user space deadlocks) can be seen with
      Android bionic's mutex implementation on an arm64 multi-cluster system.
      Signed-off-by: NCatalin Marinas <catalin.marinas@arm.com>
      Reported-by: NMatteo Franchin <Matteo.Franchin@arm.com>
      Fixes: b0c29f79 (futexes: Avoid taking the hb->lock if there's nothing to wake up)
      Acked-by: NDavidlohr Bueso <dave@stgolabs.net>
      Tested-by: NMike Galbraith <umgwanakikbuti@gmail.com>
      Cc: <stable@vger.kernel.org>
      Cc: Darren Hart <dvhart@linux.intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      76835b0e
  6. 15 10月, 2014 1 次提交
    • P
      modules, lock around setting of MODULE_STATE_UNFORMED · d3051b48
      Prarit Bhargava 提交于
      A panic was seen in the following sitation.
      
      There are two threads running on the system. The first thread is a system
      monitoring thread that is reading /proc/modules. The second thread is
      loading and unloading a module (in this example I'm using my simple
      dummy-module.ko).  Note, in the "real world" this occurred with the qlogic
      driver module.
      
      When doing this, the following panic occurred:
      
       ------------[ cut here ]------------
       kernel BUG at kernel/module.c:3739!
       invalid opcode: 0000 [#1] SMP
       Modules linked in: binfmt_misc sg nfsv3 rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache intel_powerclamp coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel lrw igb gf128mul glue_helper iTCO_wdt iTCO_vendor_support ablk_helper ptp sb_edac cryptd pps_core edac_core shpchp i2c_i801 pcspkr wmi lpc_ich ioatdma mfd_core dca ipmi_si nfsd ipmi_msghandler auth_rpcgss nfs_acl lockd sunrpc xfs libcrc32c sr_mod cdrom sd_mod crc_t10dif crct10dif_common mgag200 syscopyarea sysfillrect sysimgblt i2c_algo_bit drm_kms_helper ttm isci drm libsas ahci libahci scsi_transport_sas libata i2c_core dm_mirror dm_region_hash dm_log dm_mod [last unloaded: dummy_module]
       CPU: 37 PID: 186343 Comm: cat Tainted: GF          O--------------   3.10.0+ #7
       Hardware name: Intel Corporation S2600CP/S2600CP, BIOS RMLSDP.86I.00.29.D696.1311111329 11/11/2013
       task: ffff8807fd2d8000 ti: ffff88080fa7c000 task.ti: ffff88080fa7c000
       RIP: 0010:[<ffffffff810d64c5>]  [<ffffffff810d64c5>] module_flags+0xb5/0xc0
       RSP: 0018:ffff88080fa7fe18  EFLAGS: 00010246
       RAX: 0000000000000003 RBX: ffffffffa03b5200 RCX: 0000000000000000
       RDX: 0000000000001000 RSI: ffff88080fa7fe38 RDI: ffffffffa03b5000
       RBP: ffff88080fa7fe28 R08: 0000000000000010 R09: 0000000000000000
       R10: 0000000000000000 R11: 000000000000000f R12: ffffffffa03b5000
       R13: ffffffffa03b5008 R14: ffffffffa03b5200 R15: ffffffffa03b5000
       FS:  00007f6ae57ef740(0000) GS:ffff88101e7a0000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 0000000000404f70 CR3: 0000000ffed48000 CR4: 00000000001407e0
       DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
       Stack:
        ffffffffa03b5200 ffff8810101e4800 ffff88080fa7fe70 ffffffff810d666c
        ffff88081e807300 000000002e0f2fbf 0000000000000000 ffff88100f257b00
        ffffffffa03b5008 ffff88080fa7ff48 ffff8810101e4800 ffff88080fa7fee0
       Call Trace:
        [<ffffffff810d666c>] m_show+0x19c/0x1e0
        [<ffffffff811e4d7e>] seq_read+0x16e/0x3b0
        [<ffffffff812281ed>] proc_reg_read+0x3d/0x80
        [<ffffffff811c0f2c>] vfs_read+0x9c/0x170
        [<ffffffff811c1a58>] SyS_read+0x58/0xb0
        [<ffffffff81605829>] system_call_fastpath+0x16/0x1b
       Code: 48 63 c2 83 c2 01 c6 04 03 29 48 63 d2 eb d9 0f 1f 80 00 00 00 00 48 63 d2 c6 04 13 2d 41 8b 0c 24 8d 50 02 83 f9 01 75 b2 eb cb <0f> 0b 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 41
       RIP  [<ffffffff810d64c5>] module_flags+0xb5/0xc0
        RSP <ffff88080fa7fe18>
      
          Consider the two processes running on the system.
      
          CPU 0 (/proc/modules reader)
          CPU 1 (loading/unloading module)
      
          CPU 0 opens /proc/modules, and starts displaying data for each module by
          traversing the modules list via fs/seq_file.c:seq_open() and
          fs/seq_file.c:seq_read().  For each module in the modules list, seq_read
          does
      
                  op->start()  <-- this is a pointer to m_start()
                  op->show()   <- this is a pointer to m_show()
                  op->stop()   <-- this is a pointer to m_stop()
      
          The m_start(), m_show(), and m_stop() module functions are defined in
          kernel/module.c. The m_start() and m_stop() functions acquire and release
          the module_mutex respectively.
      
          ie) When reading /proc/modules, the module_mutex is acquired and released
          for each module.
      
          m_show() is called with the module_mutex held.  It accesses the module
          struct data and attempts to write out module data.  It is in this code
          path that the above BUG_ON() warning is encountered, specifically m_show()
          calls
      
          static char *module_flags(struct module *mod, char *buf)
          {
                  int bx = 0;
      
                  BUG_ON(mod->state == MODULE_STATE_UNFORMED);
          ...
      
          The other thread, CPU 1, in unloading the module calls the syscall
          delete_module() defined in kernel/module.c.  The module_mutex is acquired
          for a short time, and then released.  free_module() is called without the
          module_mutex.  free_module() then sets mod->state = MODULE_STATE_UNFORMED,
          also without the module_mutex.  Some additional code is called and then the
          module_mutex is reacquired to remove the module from the modules list:
      
              /* Now we can delete it from the lists */
              mutex_lock(&module_mutex);
              stop_machine(__unlink_module, mod, NULL);
              mutex_unlock(&module_mutex);
      
      This is the sequence of events that leads to the panic.
      
      CPU 1 is removing dummy_module via delete_module().  It acquires the
      module_mutex, and then releases it.  CPU 1 has NOT set dummy_module->state to
      MODULE_STATE_UNFORMED yet.
      
      CPU 0, which is reading the /proc/modules, acquires the module_mutex and
      acquires a pointer to the dummy_module which is still in the modules list.
      CPU 0 calls m_show for dummy_module.  The check in m_show() for
      MODULE_STATE_UNFORMED passed for dummy_module even though it is being
      torn down.
      
      Meanwhile CPU 1, which has been continuing to remove dummy_module without
      holding the module_mutex, now calls free_module() and sets
      dummy_module->state to MODULE_STATE_UNFORMED.
      
      CPU 0 now calls module_flags() with dummy_module and ...
      
      static char *module_flags(struct module *mod, char *buf)
      {
              int bx = 0;
      
              BUG_ON(mod->state == MODULE_STATE_UNFORMED);
      
      and BOOM.
      
      Acquire and release the module_mutex lock around the setting of
      MODULE_STATE_UNFORMED in the teardown path, which should resolve the
      problem.
      
      Testing: In the unpatched kernel I can panic the system within 1 minute by
      doing
      
      while (true) do insmod dummy_module.ko; rmmod dummy_module.ko; done
      
      and
      
      while (true) do cat /proc/modules; done
      
      in separate terminals.
      
      In the patched kernel I was able to run just over one hour without seeing
      any issues.  I also verified the output of panic via sysrq-c and the output
      of /proc/modules looks correct for all three states for the dummy_module.
      
              dummy_module 12661 0 - Unloading 0xffffffffa03a5000 (OE-)
              dummy_module 12661 0 - Live 0xffffffffa03bb000 (OE)
              dummy_module 14015 1 - Loading 0xffffffffa03a5000 (OE+)
      Signed-off-by: NPrarit Bhargava <prarit@redhat.com>
      Reviewed-by: NOleg Nesterov <oleg@redhat.com>
      Signed-off-by: NRusty Russell <rusty@rustcorp.com.au>
      Cc: stable@kernel.org
      d3051b48
  7. 14 10月, 2014 9 次提交
  8. 11 10月, 2014 4 次提交
  9. 10 10月, 2014 9 次提交
    • S
      kernel/sys.c: compat sysinfo syscall: fix undefined behavior · 0baae41e
      Scotty Bauer 提交于
      Fix undefined behavior and compiler warning by replacing right shift 32
      with upper_32_bits macro
      Signed-off-by: NScotty Bauer <sbauer@eng.utah.edu>
      Cc: Clemens Ladisch <clemens@ladisch.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0baae41e
    • V
      kernel/sys.c: whitespace fixes · ec94fc3d
      vishnu.ps 提交于
      Fix minor errors and warning messages in kernel/sys.c.  These errors were
      reported by checkpatch while working with some modifications in sys.c
      file.  Fixing this first will help me to improve my further patches.
      
      ERROR: trailing whitespace - 9
      ERROR: do not use assignment in if condition - 4
      ERROR: spaces required around that '?' (ctx:VxO) - 10
      ERROR: switch and case should be at the same indent - 3
      
      total 26 errors & 3 warnings fixed.
      Signed-off-by: Nvishnu.ps <vishnu.ps@samsung.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ec94fc3d
    • Y
      acct: eliminate compile warning · 067b722f
      Ying Xue 提交于
      If ACCT_VERSION is not defined to 3, below warning appears:
        CC      kernel/acct.o
        kernel/acct.c: In function `do_acct_process':
        kernel/acct.c:475:24: warning: unused variable `ns' [-Wunused-variable]
      
      [akpm@linux-foundation.org: retain the local for code size improvements
      Signed-off-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      067b722f
    • I
      kernel/async.c: switch to pr_foo() · 27fb10ed
      Ionut Alexa 提交于
      Signed-off-by: NIonut Alexa <ionut.m.alexa@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      27fb10ed
    • S
      mm: use VM_BUG_ON_MM where possible · 96dad67f
      Sasha Levin 提交于
      Dump the contents of the relevant struct_mm when we hit the bug condition.
      Signed-off-by: NSasha Levin <sasha.levin@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      96dad67f
    • O
      mempolicy: remove the "task" arg of vma_policy_mof() and simplify it · 6b6482bb
      Oleg Nesterov 提交于
      1. vma_policy_mof(task) is simply not safe unless task == current,
         it can race with do_exit()->mpol_put(). Remove this arg and update
         its single caller.
      
      2. vma can not be NULL, remove this check and simplify the code.
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6b6482bb
    • J
      mm: remove noisy remainder of the scan_unevictable interface · 1f13ae39
      Johannes Weiner 提交于
      The deprecation warnings for the scan_unevictable interface triggers by
      scripts doing `sysctl -a | grep something else'.  This is annoying and not
      helpful.
      
      The interface has been defunct since 264e56d8 ("mm: disable user
      interface to manually rescue unevictable pages"), which was in 2011, and
      there haven't been any reports of usecases for it, only reports that the
      deprecation warnings are annying.  It's unlikely that anybody is using
      this interface specifically at this point, so remove it.
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1f13ae39
    • C
      prctl: PR_SET_MM -- introduce PR_SET_MM_MAP operation · f606b77f
      Cyrill Gorcunov 提交于
      During development of c/r we've noticed that in case if we need to support
      user namespaces we face a problem with capabilities in prctl(PR_SET_MM,
      ...) call, in particular once new user namespace is created
      capable(CAP_SYS_RESOURCE) no longer passes.
      
      A approach is to eliminate CAP_SYS_RESOURCE check but pass all new values
      in one bundle, which would allow the kernel to make more intensive test
      for sanity of values and same time allow us to support checkpoint/restore
      of user namespaces.
      
      Thus a new command PR_SET_MM_MAP introduced. It takes a pointer of
      prctl_mm_map structure which carries all the members to be updated.
      
      	prctl(PR_SET_MM, PR_SET_MM_MAP, struct prctl_mm_map *, size)
      
      	struct prctl_mm_map {
      		__u64	start_code;
      		__u64	end_code;
      		__u64	start_data;
      		__u64	end_data;
      		__u64	start_brk;
      		__u64	brk;
      		__u64	start_stack;
      		__u64	arg_start;
      		__u64	arg_end;
      		__u64	env_start;
      		__u64	env_end;
      		__u64	*auxv;
      		__u32	auxv_size;
      		__u32	exe_fd;
      	};
      
      All members except @exe_fd correspond ones of struct mm_struct.  To figure
      out which available values these members may take here are meanings of the
      members.
      
       - start_code, end_code: represent bounds of executable code area
       - start_data, end_data: represent bounds of data area
       - start_brk, brk: used to calculate bounds for brk() syscall
       - start_stack: used when accounting space needed for command
         line arguments, environment and shmat() syscall
       - arg_start, arg_end, env_start, env_end: represent memory area
         supplied for command line arguments and environment variables
       - auxv, auxv_size: carries auxiliary vector, Elf format specifics
       - exe_fd: file descriptor number for executable link (/proc/self/exe)
      
      Thus we apply the following requirements to the values
      
      1) Any member except @auxv, @auxv_size, @exe_fd is rather an address
         in user space thus it must be laying inside [mmap_min_addr, mmap_max_addr)
         interval.
      
      2) While @[start|end]_code and @[start|end]_data may point to an nonexisting
         VMAs (say a program maps own new .text and .data segments during execution)
         the rest of members should belong to VMA which must exist.
      
      3) Addresses must be ordered, ie @start_ member must not be greater or
         equal to appropriate @end_ member.
      
      4) As in regular Elf loading procedure we require that @start_brk and
         @brk be greater than @end_data.
      
      5) If RLIMIT_DATA rlimit is set to non-infinity new values should not
         exceed existing limit. Same applies to RLIMIT_STACK.
      
      6) Auxiliary vector size must not exceed existing one (which is
         predefined as AT_VECTOR_SIZE and depends on architecture).
      
      7) File descriptor passed in @exe_file should be pointing
         to executable file (because we use existing prctl_set_mm_exe_file_locked
         helper it ensures that the file we are going to use as exe link has all
         required permission granted).
      
      Now about where these members are involved inside kernel code:
      
       - @start_code and @end_code are used in /proc/$pid/[stat|statm] output;
      
       - @start_data and @end_data are used in /proc/$pid/[stat|statm] output,
         also they are considered if there enough space for brk() syscall
         result if RLIMIT_DATA is set;
      
       - @start_brk shown in /proc/$pid/stat output and accounted in brk()
         syscall if RLIMIT_DATA is set; also this member is tested to
         find a symbolic name of mmap event for perf system (we choose
         if event is generated for "heap" area); one more aplication is
         selinux -- we test if a process has PROCESS__EXECHEAP permission
         if trying to make heap area being executable with mprotect() syscall;
      
       - @brk is a current value for brk() syscall which lays inside heap
         area, it's shown in /proc/$pid/stat. When syscall brk() succesfully
         provides new memory area to a user space upon brk() completion the
         mm::brk is updated to carry new value;
      
         Both @start_brk and @brk are actively used in /proc/$pid/maps
         and /proc/$pid/smaps output to find a symbolic name "heap" for
         VMA being scanned;
      
       - @start_stack is printed out in /proc/$pid/stat and used to
         find a symbolic name "stack" for task and threads in
         /proc/$pid/maps and /proc/$pid/smaps output, and as the same
         as with @start_brk -- perf system uses it for event naming.
         Also kernel treat this member as a start address of where
         to map vDSO pages and to check if there is enough space
         for shmat() syscall;
      
       - @arg_start, @arg_end, @env_start and @env_end are printed out
         in /proc/$pid/stat. Another access to the data these members
         represent is to read /proc/$pid/environ or /proc/$pid/cmdline.
         Any attempt to read these areas kernel tests with access_process_vm
         helper so a user must have enough rights for this action;
      
       - @auxv and @auxv_size may be read from /proc/$pid/auxv. Strictly
         speaking kernel doesn't care much about which exactly data is
         sitting there because it is solely for userspace;
      
       - @exe_fd is referred from /proc/$pid/exe and when generating
         coredump. We uses prctl_set_mm_exe_file_locked helper to update
         this member, so exe-file link modification remains one-shot
         action.
      
      Still note that updating exe-file link now doesn't require sys-resource
      capability anymore, after all there is no much profit in preventing setup
      own file link (there are a number of ways to execute own code -- ptrace,
      ld-preload, so that the only reliable way to find which exactly code is
      executed is to inspect running program memory).  Still we require the
      caller to be at least user-namespace root user.
      
      I believe the old interface should be deprecated and ripped off in a
      couple of kernel releases if no one against.
      
      To test if new interface is implemented in the kernel one can pass
      PR_SET_MM_MAP_SIZE opcode and the kernel returns the size of currently
      supported struct prctl_mm_map.
      
      [akpm@linux-foundation.org: fix 80-col wordwrap in macro definitions]
      Signed-off-by: NCyrill Gorcunov <gorcunov@openvz.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Tejun Heo <tj@kernel.org>
      Acked-by: NAndrew Vagin <avagin@openvz.org>
      Tested-by: NAndrew Vagin <avagin@openvz.org>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Acked-by: NSerge Hallyn <serge.hallyn@canonical.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Vasiliy Kulikov <segoon@openwall.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Julien Tinnes <jln@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f606b77f
    • C
      prctl: PR_SET_MM -- factor out mmap_sem when updating mm::exe_file · 71fe97e1
      Cyrill Gorcunov 提交于
      Instead of taking mm->mmap_sem inside prctl_set_mm_exe_file() move it out
      and rename the helper to prctl_set_mm_exe_file_locked().  This will allow
      to reuse this function in a next patch.
      Signed-off-by: NCyrill Gorcunov <gorcunov@openvz.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Andrew Vagin <avagin@openvz.org>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Acked-by: NSerge Hallyn <serge.hallyn@canonical.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Vasiliy Kulikov <segoon@openwall.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Julien Tinnes <jln@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      71fe97e1