1. 14 10月, 2008 31 次提交
    • P
      ftrace: make ftrace_printk usable with the other tracers · f09ce573
      Peter Zijlstra 提交于
      Currently ftrace_printk only works with the ftrace tracer, switch it to an
      iter_ctrl setting so we can make us of them with other tracers too.
      
      [rostedt@redhat.com: tweak to the disable condition]
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      f09ce573
    • S
      ftrace: print continue index fix · 5a90f577
      Steven Rostedt 提交于
      An item in the trace buffer that is bigger than one entry may be split
      up using the TRACE_CONT entry. This makes it a virtual single entry.
      The current code increments the iterator index even while traversing
      TRACE_CONT entries, making it look like the iterator is further than
      it actually is.
      
      This patch adds code to not increment the iterator index while skipping
      over TRACE_CONT entries.
      Signed-off-by: NSteven Rostedt <srostedt@redhat.com>
      Acked-by: NPeter Zijlstra <peterz@infradead.org>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      5a90f577
    • S
      ftrace: binary and not logical for continue test · 652567aa
      Steven Rostedt 提交于
      Peter Zijlstra provided me with a nice brown paper bag while letting me know
      that I was doing a logical AND and not a binary one, making a condition
      true more often than it should be.
      
      Luckily, a false true is handled by the calling function and no harm is
      done. But this needs to be fixed regardless.
      Signed-off-by: NSteven Rostedt <srostedt@redhat.com>
      Acked-by: NPeter Zijlstra <peterz@infradead.org>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      652567aa
    • M
      ftrace: make output nicely spaced for up to 999 cpus · a6168353
      Michael Ellerman 提交于
      Currently some of the ftrace output goes skewiff if you have more
      than 9 cpus, and some if you have more than 99.
      
      Twiddle with the headers and format strings to make up to 999 cpus
      display without causing spacing problems.
      Signed-off-by: NMichael Ellerman <michael@ellerman.id.au>
      Acked-by: NSteven Rostedt <srostedt@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      a6168353
    • I
      stack tracer: depends on DEBUG_KERNEL · 2ff01c6a
      Ingo Molnar 提交于
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      2ff01c6a
    • S
      ftrace: stack trace add indexes · 1b6cced6
      Steven Rostedt 提交于
      This patch adds indexes into the stack that the functions in the
      stack dump were found at. As an added bonus, I also added a diff
      to show which function is the most notorious consumer of the stack.
      
      The output now looks like this:
      
      # cat /debug/tracing/stack_trace
              Depth   Size      Location    (48 entries)
              -----   ----      --------
        0)     2476     212   blk_recount_segments+0x39/0x59
        1)     2264      12   bio_phys_segments+0x16/0x1d
        2)     2252      20   blk_rq_bio_prep+0x23/0xaf
        3)     2232      12   init_request_from_bio+0x74/0x77
        4)     2220      56   __make_request+0x294/0x331
        5)     2164     136   generic_make_request+0x34f/0x37d
        6)     2028      56   submit_bio+0xe7/0xef
        7)     1972      28   submit_bh+0xd1/0xf0
        8)     1944     112   block_read_full_page+0x299/0x2a9
        9)     1832       8   blkdev_readpage+0x14/0x16
       10)     1824      28   read_cache_page_async+0x7e/0x109
       11)     1796      16   read_cache_page+0x11/0x49
       12)     1780      32   read_dev_sector+0x3c/0x72
       13)     1748      48   read_lba+0x4d/0xaa
       14)     1700     168   efi_partition+0x85/0x61b
       15)     1532      72   rescan_partitions+0x10e/0x266
       16)     1460      40   do_open+0x1c7/0x24e
       17)     1420     292   __blkdev_get+0x79/0x84
       18)     1128      12   blkdev_get+0x12/0x14
       19)     1116      20   register_disk+0xd1/0x11e
       20)     1096      28   add_disk+0x34/0x90
       21)     1068      52   sd_probe+0x2b1/0x366
       22)     1016      20   driver_probe_device+0xa5/0x120
       23)      996       8   __device_attach+0xd/0xf
       24)      988      32   bus_for_each_drv+0x3e/0x68
       25)      956      24   device_attach+0x56/0x6c
       26)      932      16   bus_attach_device+0x26/0x4d
       27)      916      64   device_add+0x380/0x4b4
       28)      852      28   scsi_sysfs_add_sdev+0xa1/0x1c9
       29)      824     160   scsi_probe_and_add_lun+0x919/0xa2a
       30)      664      36   __scsi_add_device+0x88/0xae
       31)      628      44   ata_scsi_scan_host+0x9e/0x21c
       32)      584      28   ata_host_register+0x1cb/0x1db
       33)      556      24   ata_host_activate+0x98/0xb5
       34)      532     192   ahci_init_one+0x9bd/0x9e9
       35)      340      20   pci_device_probe+0x3e/0x5e
       36)      320      20   driver_probe_device+0xa5/0x120
       37)      300      20   __driver_attach+0x3f/0x5e
       38)      280      36   bus_for_each_dev+0x40/0x62
       39)      244      12   driver_attach+0x19/0x1b
       40)      232      28   bus_add_driver+0x9c/0x1af
       41)      204      28   driver_register+0x76/0xd2
       42)      176      20   __pci_register_driver+0x44/0x71
       43)      156       8   ahci_init+0x14/0x16
       44)      148     100   _stext+0x42/0x122
       45)       48      20   kernel_init+0x175/0x1dc
       46)       28      28   kernel_thread_helper+0x7/0x10
      
      The first column is simply an index starting from the inner most function
      and counting down to the outer most.
      
      The next column is the location that the function was found on the stack.
      
      The next column is the size of the stack for that function.
      Signed-off-by: NSteven Rostedt <srostedt@redhat.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      1b6cced6
    • S
      ftrace: remove direct reference to mcount in trace code · 3b47bfc1
      Steven Rostedt 提交于
      The mcount record method of ftrace scans objdump for references to mcount.
      Using mcount as the reference to test if the calls to mcount being replaced
      are indeed calls to mcount, this use of mcount was also caught as a
      location to change. Using a variable that points to the mcount address
      moves this reference into the data section that is not scanned, and
      we do not use a false location to try and modify.
      
      The warn on code was what was used to detect this bug.
      Signed-off-by: NSteven Rostedt <srostedt@redhat.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      3b47bfc1
    • S
      ftrace: add stack tracer · e5a81b62
      Steven Rostedt 提交于
      This is another tracer using the ftrace infrastructure, that examines
      at each function call the size of the stack. If the stack use is greater
      than the previous max it is recorded.
      
      You can always see (and set) the max stack size seen. By setting it
      to zero will start the recording again. The backtrace is also available.
      
      For example:
      
      # cat /debug/tracing/stack_max_size
      1856
      
      # cat /debug/tracing/stack_trace
      [<c027764d>] stack_trace_call+0x8f/0x101
      [<c021b966>] ftrace_call+0x5/0x8
      [<c02553cc>] clocksource_get_next+0x12/0x48
      [<c02542a5>] update_wall_time+0x538/0x6d1
      [<c0245913>] do_timer+0x23/0xb0
      [<c0257657>] tick_do_update_jiffies64+0xd9/0xf1
      [<c02576b9>] tick_sched_timer+0x4a/0xad
      [<c0250fe6>] __run_hrtimer+0x3e/0x75
      [<c02518ed>] hrtimer_interrupt+0xf1/0x154
      [<c022c870>] smp_apic_timer_interrupt+0x71/0x84
      [<c021b7e9>] apic_timer_interrupt+0x2d/0x34
      [<c0238597>] finish_task_switch+0x29/0xa0
      [<c05abd13>] schedule+0x765/0x7be
      [<c05abfca>] schedule_timeout+0x1b/0x90
      [<c05ab4d4>] wait_for_common+0xab/0x101
      [<c05ab5ac>] wait_for_completion+0x12/0x14
      [<c033cfc3>] blk_execute_rq+0x84/0x99
      [<c0402470>] scsi_execute+0xc2/0x105
      [<c040250a>] scsi_execute_req+0x57/0x7f
      [<c043afe0>] sr_test_unit_ready+0x3e/0x97
      [<c043bbd6>] sr_media_change+0x43/0x205
      [<c046b59f>] media_changed+0x48/0x77
      [<c046b5ff>] cdrom_media_changed+0x31/0x37
      [<c043b091>] sr_block_media_changed+0x16/0x18
      [<c02b9e69>] check_disk_change+0x1b/0x63
      [<c046f4c3>] cdrom_open+0x7a1/0x806
      [<c043b148>] sr_block_open+0x78/0x8d
      [<c02ba4c0>] do_open+0x90/0x257
      [<c02ba869>] blkdev_open+0x2d/0x56
      [<c0296a1f>] __dentry_open+0x14d/0x23c
      [<c0296b32>] nameidata_to_filp+0x24/0x38
      [<c02a1c68>] do_filp_open+0x347/0x626
      [<c02967ef>] do_sys_open+0x47/0xbc
      [<c02968b0>] sys_open+0x23/0x2b
      [<c021aadd>] sysenter_do_call+0x12/0x26
      
      I've tested this on both x86_64 and i386.
      Signed-off-by: NSteven Rostedt <srostedt@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      e5a81b62
    • I
      ftrace: clean up macro usage · ac8825ec
      Ingo Molnar 提交于
      enclose the argument in parenthesis. (especially since we cast it,
      which is a high prio operation)
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      ac8825ec
    • S
      ftrace: fix build failure · 2d7da80f
      Stephen Rothwell 提交于
      After disabling FTRACE_MCOUNT_RECORD via a patch, a dormant build
      failure surfaced:
      
       kernel/trace/ftrace.c: In function 'ftrace_record_ip':
       kernel/trace/ftrace.c:416: error: incompatible type for argument 1 of '_spin_lock_irqsave'
       kernel/trace/ftrace.c:433: error: incompatible type for argument 1 of '_spin_lock_irqsave'
      
      Introduced by commit 6dad8e07f4c10b17b038e84d29f3ca41c2e55cd0 ("ftrace:
      add necessary locking for ftrace records").
      Signed-off-by: NStephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      2d7da80f
    • S
      ftrace: add necessary locking for ftrace records · 99ecdc43
      Steven Rostedt 提交于
      The new design of pre-recorded mcounts and updating the code outside of
      kstop_machine has changed the way the records themselves are protected.
      
      This patch uses the ftrace_lock to protect the records. Note, the lock
      still does not need to be taken within calls that are only called via
      kstop_machine, since the that code can not run while the spin lock is held.
      
      Also removed the hash_lock needed for the daemon when MCOUNT_RECORD is
      configured. Also did a slight cleanup of an unused variable.
      Signed-off-by: NSteven Rostedt <srostedt@redhat.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      99ecdc43
    • S
      ftrace: do not init module on ftrace disabled · 00fd61ae
      Steven Rostedt 提交于
      If one of the self tests of ftrace has disabled the function tracer,
      do not run the code to convert the mcount calls in modules.
      Signed-off-by: NSteven Rostedt <srostedt@redhat.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      00fd61ae
    • F
      ftrace: fix some mistakes in error messages · 98a983aa
      Frédéric Weisbecker 提交于
      This patch fixes some mistakes on the tracer in warning messages when
      debugfs fails to create tracing files.
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: srostedt@redhat.com
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      98a983aa
    • S
      ftrace: dump out ftrace buffers to console on panic · 3f5a54e3
      Steven Rostedt 提交于
      At OLS I had a lot of interest to be able to have the ftrace buffers
      dumped on panic.  Usually one would expect to uses kexec and examine
      the buffers after a new kernel is loaded. But sometimes the resources
      do not permit kdump and kexec, so having an option to still see the
      sequence of events up to the crash is very advantageous.
      
      This patch adds the option to have the ftrace buffers dumped to the
      console in the latency_trace format on a panic. When the option is set,
      the default entries per CPU buffer are lowered to 16384, since the writing
      to the serial (if that is the console) may take an awful long time
      otherwise.
      
      [
       Changes since -v1:
        Got alpine to send correctly (as well as spell check working).
        Removed config option.
        Moved the static variables into ftrace_dump itself.
        Gave printk a log level.
      ]
      Signed-off-by: NSteven Rostedt <srostedt@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      3f5a54e3
    • S
      ftrace: ftrace_printk doc moved · 2f2c99db
      Steven Rostedt 提交于
      Based on Randy Dunlap's suggestion, the ftrace_printk kernel-doc belongs
      with the ftrace_printk macro that should be used. Not with the
      __ftrace_printk internal function.
      Signed-off-by: NSteven Rostedt <srostedt@redhat.com>
      Acked-by: NRandy Dunlap <randy.dunlap@oracle.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      2f2c99db
    • S
      ftrace: printk formatting infrastructure · dd0e545f
      Steven Rostedt 提交于
      This patch adds a feature that can help kernel developers debug their
      code using ftrace.
      
        int ftrace_printk(const char *fmt, ...);
      
      This records into the ftrace buffer using printf formatting. The entry
      size in the buffers are still a fixed length. A new type has been added
      that allows for more entries to be used for a single recording.
      
      The start of the print is still the same as the other entries.
      
      It returns the number of characters written to the ftrace buffer.
      
      For example:
      
      Having a module with the following code:
      
      static int __init ftrace_print_test(void)
      {
              ftrace_printk("jiffies are %ld\n", jiffies);
              return 0;
      }
      
      Gives me:
      
        insmod-5441  3...1 7569us : ftrace_print_test: jiffies are 4296626666
      
      for the latency_trace file and:
      
                insmod-5441  [03]  1959.370498: ftrace_print_test jiffies are 4296626666
      
      for the trace file.
      
      Note: Only the infrastructure should go into the kernel. It is to help
      facilitate debugging for other kernel developers. Calls to ftrace_printk
      is not intended to be left in the kernel, and should be frowned upon just
      like scattering printks around in the code.
      
      But having this easily at your fingertips helps the debugging go faster
      and bugs be solved quicker.
      
      Maybe later on, we can hook this with markers and have their printf format
      be sucked into ftrace output.
      Signed-off-by: NSteven Rostedt <srostedt@redhat.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      dd0e545f
    • S
      ftrace: new continue entry - separate out from trace_entry · 2e2ca155
      Steven Rostedt 提交于
      Some tracers will need to work with more than one entry. In order to do this
      the trace_entry structure was split into two fields. One for the start of
      all entries, and one to continue an existing entry.
      
      The trace_entry structure now has a "field" entry that consists of the previous
      content of the trace_entry, and a "cont" entry that is just a string buffer
      the size of the "field" entry.
      
      Thanks to Andrew Morton for suggesting this idea.
      Signed-off-by: NSteven Rostedt <srostedt@redhat.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      2e2ca155
    • S
      ftrace: remove old pointers to mcount · fed1939c
      Steven Rostedt 提交于
      When a mcount pointer is recorded into a table, it is used to add or
      remove calls to mcount (replacing them with nops). If the code is removed
      via removing a module, the pointers still exist.  At modifying the code
      a check is always made to make sure the code being replaced is the code
      expected. In-other-words, the code being replaced is compared to what
      it is expected to be before being replaced.
      
      There is a very small chance that the code being replaced just happens
      to look like code that calls mcount (very small since the call to mcount
      is relative). To remove this chance, this patch adds ftrace_release to
      allow module unloading to remove the pointers to mcount within the module.
      
      Another change for init calls is made to not trace calls marked with
      __init. The tracing can not be started until after init is done anyway.
      Signed-off-by: NSteven Rostedt <srostedt@redhat.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      fed1939c
    • S
      ftrace: do not show freed records in available_filter_functions · a9fdda33
      Steven Rostedt 提交于
      Seems that freed records can appear in the available_filter_functions list.
      This patch fixes that.
      Signed-off-by: NSteven Rostedt <srostedt@redhat.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      a9fdda33
    • S
      ftrace: enable mcount recording for modules · 90d595fe
      Steven Rostedt 提交于
      This patch enables the loading of the __mcount_section of modules and
      changing all the callers of mcount into nops.
      
      The modification is done before the init_module function is called, so
      again, we do not need to use kstop_machine to make these changes.
      Signed-off-by: NSteven Rostedt <srostedt@redhat.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      90d595fe
    • S
      ftrace: mcount call site on boot nops core · 68bf21aa
      Steven Rostedt 提交于
      This is the infrastructure to the converting the mcount call sites
      recorded by the __mcount_loc section into nops on boot. It also allows
      for using these sites to enable tracing as normal. When the __mcount_loc
      section is used, the "ftraced" kernel thread is disabled.
      
      This uses the current infrastructure to record the mcount call sites
      as well as convert them to nops. The mcount function is kept as a stub
      on boot up and not converted to the ftrace_record_ip function. We use the
      ftrace_record_ip to only record from the table.
      
      This patch does not handle modules. That comes with a later patch.
      Signed-off-by: NSteven Rostedt <srostedt@redhat.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      68bf21aa
    • S
      ftrace: create __mcount_loc section · 8da3821b
      Steven Rostedt 提交于
      This patch creates a section in the kernel called "__mcount_loc".
      This will hold a list of pointers to the mcount relocation for
      each call site of mcount.
      
      For example:
      
      objdump -dr init/main.o
      [...]
      Disassembly of section .text:
      
      0000000000000000 <do_one_initcall>:
         0:   55                      push   %rbp
      [...]
      000000000000017b <init_post>:
       17b:   55                      push   %rbp
       17c:   48 89 e5                mov    %rsp,%rbp
       17f:   53                      push   %rbx
       180:   48 83 ec 08             sub    $0x8,%rsp
       184:   e8 00 00 00 00          callq  189 <init_post+0xe>
                              185: R_X86_64_PC32      mcount+0xfffffffffffffffc
      [...]
      
      We will add a section to point to each function call.
      
         .section __mcount_loc,"a",@progbits
      [...]
         .quad .text + 0x185
      [...]
      
      The offset to of the mcount call site in init_post is an offset from
      the start of the section, and not the start of the function init_post.
      The mcount relocation is at the call site 0x185 from the start of the
      .text section.
      
        .text + 0x185  == init_post + 0xa
      
      We need a way to add this __mcount_loc section in a way that we do not
      lose the relocations after final link.  The .text section here will
      be attached to all other .text sections after final link and the
      offsets will be meaningless.  We need to keep track of where these
      .text sections are.
      
      To do this, we use the start of the first function in the section.
      do_one_initcall.  We can make a tmp.s file with this function as a reference
      to the start of the .text section.
      
         .section __mcount_loc,"a",@progbits
      [...]
         .quad do_one_initcall + 0x185
      [...]
      
      Then we can compile the tmp.s into a tmp.o
      
        gcc -c tmp.s -o tmp.o
      
      And link it into back into main.o.
      
        ld -r main.o tmp.o -o tmp_main.o
        mv tmp_main.o main.o
      
      But we have a problem.  What happens if the first function in a section
      is not exported, and is a static function. The linker will not let
      the tmp.o use it.  This case exists in main.o as well.
      
      Disassembly of section .init.text:
      
      0000000000000000 <set_reset_devices>:
         0:   55                      push   %rbp
         1:   48 89 e5                mov    %rsp,%rbp
         4:   e8 00 00 00 00          callq  9 <set_reset_devices+0x9>
                              5: R_X86_64_PC32        mcount+0xfffffffffffffffc
      
      The first function in .init.text is a static function.
      
      00000000000000a8 t __setup_set_reset_devices
      000000000000105f t __setup_str_set_reset_devices
      0000000000000000 t set_reset_devices
      
      The lowercase 't' means that set_reset_devices is local and is not exported.
      If we simply try to link the tmp.o with the set_reset_devices we end
      up with two symbols: one local and one global.
      
       .section __mcount_loc,"a",@progbits
       .quad set_reset_devices + 0x10
      
      00000000000000a8 t __setup_set_reset_devices
      000000000000105f t __setup_str_set_reset_devices
      0000000000000000 t set_reset_devices
                       U set_reset_devices
      
      We still have an undefined reference to set_reset_devices, and if we try
      to compile the kernel, we will end up with an undefined reference to
      set_reset_devices, or even worst, it could be exported someplace else,
      and then we will have a reference to the wrong location.
      
      To handle this case, we make an intermediate step using objcopy.
      We convert set_reset_devices into a global exported symbol before linking
      it with tmp.o and set it back afterwards.
      
      00000000000000a8 t __setup_set_reset_devices
      000000000000105f t __setup_str_set_reset_devices
      0000000000000000 T set_reset_devices
      
      00000000000000a8 t __setup_set_reset_devices
      000000000000105f t __setup_str_set_reset_devices
      0000000000000000 T set_reset_devices
      
      00000000000000a8 t __setup_set_reset_devices
      000000000000105f t __setup_str_set_reset_devices
      0000000000000000 t set_reset_devices
      
      Now we have a section in main.o called __mcount_loc that we can place
      somewhere in the kernel using vmlinux.ld.S and access it to convert
      all these locations that call mcount into nops before starting SMP
      and thus, eliminating the need to do this with kstop_machine.
      
      Note, A well documented perl script (scripts/recordmcount.pl) is used
      to do all this in one location.
      Signed-off-by: NSteven Rostedt <srostedt@redhat.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      8da3821b
    • I
      ftrace: ignore functions that cannot be kprobe-ed · 36dcd67a
      Ingo Molnar 提交于
      kprobes already has an extensive list of annotations for functions
      that should not be instrumented. Add notrace annotations to these
      functions as well.
      
      This is particularly useful for functions called by the NMI path.
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      36dcd67a
    • M
      tracepoints: use TABLE_SIZE macro · 9795302a
      Mathieu Desnoyers 提交于
      Steven Rostedt suggested:
      
      | Wouldn't it look nicer to have: (TRACEPOINT_TABLE_SIZE - 1) ?
      Signed-off-by: NMathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      9795302a
    • I
      tracing: clean up tracepoints kconfig structure · 5f87f112
      Ingo Molnar 提交于
      do not expose users to CONFIG_TRACEPOINTS - tracers can select it
      just fine.
      
      update ftrace to select CONFIG_TRACEPOINTS.
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      5f87f112
    • M
      ftrace: port to tracepoints · b07c3f19
      Mathieu Desnoyers 提交于
      Porting the trace_mark() used by ftrace to tracepoints. (cleanup)
      
      Changelog :
      - Change error messages : marker -> tracepoint
      
      [ mingo@elte.hu: conflict resolutions ]
      Signed-off-by: NMathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
      Acked-by: N'Peter Zijlstra' <peterz@infradead.org>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      b07c3f19
    • M
      tracing, sched: LTTng instrumentation - scheduler · 0a16b607
      Mathieu Desnoyers 提交于
      Instrument the scheduler activity (sched_switch, migration, wakeups,
      wait for a task, signal delivery) and process/thread
      creation/destruction (fork, exit, kthread stop). Actually, kthread
      creation is not instrumented in this patch because it is architecture
      dependent. It allows to connect tracers such as ftrace which detects
      scheduling latencies, good/bad scheduler decisions. Tools like LTTng can
      export this scheduler information along with instrumentation of the rest
      of the kernel activity to perform post-mortem analysis on the scheduler
      activity.
      
      About the performance impact of tracepoints (which is comparable to
      markers), even without immediate values optimizations, tests done by
      Hideo Aoki on ia64 show no regression. His test case was using hackbench
      on a kernel where scheduler instrumentation (about 5 events in code
      scheduler code) was added. See the "Tracepoints" patch header for
      performance result detail.
      
      Changelog :
      
      - Change instrumentation location and parameter to match ftrace
        instrumentation, previously done with kernel markers.
      
      [ mingo@elte.hu: conflict resolutions ]
      Signed-off-by: NMathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
      Acked-by: N'Peter Zijlstra' <peterz@infradead.org>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      0a16b607
    • M
      tracing: Kernel Tracepoints · 97e1c18e
      Mathieu Desnoyers 提交于
      Implementation of kernel tracepoints. Inspired from the Linux Kernel
      Markers. Allows complete typing verification by declaring both tracing
      statement inline functions and probe registration/unregistration static
      inline functions within the same macro "DEFINE_TRACE". No format string
      is required. See the tracepoint Documentation and Samples patches for
      usage examples.
      
      Taken from the documentation patch :
      
      "A tracepoint placed in code provides a hook to call a function (probe)
      that you can provide at runtime. A tracepoint can be "on" (a probe is
      connected to it) or "off" (no probe is attached). When a tracepoint is
      "off" it has no effect, except for adding a tiny time penalty (checking
      a condition for a branch) and space penalty (adding a few bytes for the
      function call at the end of the instrumented function and adds a data
      structure in a separate section).  When a tracepoint is "on", the
      function you provide is called each time the tracepoint is executed, in
      the execution context of the caller. When the function provided ends its
      execution, it returns to the caller (continuing from the tracepoint
      site).
      
      You can put tracepoints at important locations in the code. They are
      lightweight hooks that can pass an arbitrary number of parameters, which
      prototypes are described in a tracepoint declaration placed in a header
      file."
      
      Addition and removal of tracepoints is synchronized by RCU using the
      scheduler (and preempt_disable) as guarantees to find a quiescent state
      (this is really RCU "classic"). The update side uses rcu_barrier_sched()
      with call_rcu_sched() and the read/execute side uses
      "preempt_disable()/preempt_enable()".
      
      We make sure the previous array containing probes, which has been
      scheduled for deletion by the rcu callback, is indeed freed before we
      proceed to the next update. It therefore limits the rate of modification
      of a single tracepoint to one update per RCU period. The objective here
      is to permit fast batch add/removal of probes on _different_
      tracepoints.
      
      Changelog :
      - Use #name ":" #proto as string to identify the tracepoint in the
        tracepoint table. This will make sure not type mismatch happens due to
        connexion of a probe with the wrong type to a tracepoint declared with
        the same name in a different header.
      - Add tracepoint_entry_free_old.
      - Change __TO_TRACE to get rid of the 'i' iterator.
      
      Masami Hiramatsu <mhiramat@redhat.com> :
      Tested on x86-64.
      
      Performance impact of a tracepoint : same as markers, except that it
      adds about 70 bytes of instructions in an unlikely branch of each
      instrumented function (the for loop, the stack setup and the function
      call). It currently adds a memory read, a test and a conditional branch
      at the instrumentation site (in the hot path). Immediate values will
      eventually change this into a load immediate, test and branch, which
      removes the memory read which will make the i-cache impact smaller
      (changing the memory read for a load immediate removes 3-4 bytes per
      site on x86_32 (depending on mov prefixes), or 7-8 bytes on x86_64, it
      also saves the d-cache hit).
      
      About the performance impact of tracepoints (which is comparable to
      markers), even without immediate values optimizations, tests done by
      Hideo Aoki on ia64 show no regression. His test case was using hackbench
      on a kernel where scheduler instrumentation (about 5 events in code
      scheduler code) was added.
      
      Quoting Hideo Aoki about Markers :
      
      I evaluated overhead of kernel marker using linux-2.6-sched-fixes git
      tree, which includes several markers for LTTng, using an ia64 server.
      
      While the immediate trace mark feature isn't implemented on ia64, there
      is no major performance regression. So, I think that we don't have any
      issues to propose merging marker point patches into Linus's tree from
      the viewpoint of performance impact.
      
      I prepared two kernels to evaluate. The first one was compiled without
      CONFIG_MARKERS. The second one was enabled CONFIG_MARKERS.
      
      I downloaded the original hackbench from the following URL:
      http://devresources.linux-foundation.org/craiger/hackbench/src/hackbench.c
      
      I ran hackbench 5 times in each condition and calculated the average and
      difference between the kernels.
      
          The parameter of hackbench: every 50 from 50 to 800
          The number of CPUs of the server: 2, 4, and 8
      
      Below is the results. As you can see, major performance regression
      wasn't found in any case. Even if number of processes increases,
      differences between marker-enabled kernel and marker- disabled kernel
      doesn't increase. Moreover, if number of CPUs increases, the differences
      doesn't increase either.
      
      Curiously, marker-enabled kernel is better than marker-disabled kernel
      in more than half cases, although I guess it comes from the difference
      of memory access pattern.
      
      * 2 CPUs
      
      Number of | without      | with         | diff     | diff    |
      processes | Marker [Sec] | Marker [Sec] |   [Sec]  |   [%]   |
      --------------------------------------------------------------
             50 |      4.811   |       4.872  |  +0.061  |  +1.27  |
            100 |      9.854   |      10.309  |  +0.454  |  +4.61  |
            150 |     15.602   |      15.040  |  -0.562  |  -3.6   |
            200 |     20.489   |      20.380  |  -0.109  |  -0.53  |
            250 |     25.798   |      25.652  |  -0.146  |  -0.56  |
            300 |     31.260   |      30.797  |  -0.463  |  -1.48  |
            350 |     36.121   |      35.770  |  -0.351  |  -0.97  |
            400 |     42.288   |      42.102  |  -0.186  |  -0.44  |
            450 |     47.778   |      47.253  |  -0.526  |  -1.1   |
            500 |     51.953   |      52.278  |  +0.325  |  +0.63  |
            550 |     58.401   |      57.700  |  -0.701  |  -1.2   |
            600 |     63.334   |      63.222  |  -0.112  |  -0.18  |
            650 |     68.816   |      68.511  |  -0.306  |  -0.44  |
            700 |     74.667   |      74.088  |  -0.579  |  -0.78  |
            750 |     78.612   |      79.582  |  +0.970  |  +1.23  |
            800 |     85.431   |      85.263  |  -0.168  |  -0.2   |
      --------------------------------------------------------------
      
      * 4 CPUs
      
      Number of | without      | with         | diff     | diff    |
      processes | Marker [Sec] | Marker [Sec] |   [Sec]  |   [%]   |
      --------------------------------------------------------------
             50 |      2.586   |       2.584  |  -0.003  |  -0.1   |
            100 |      5.254   |       5.283  |  +0.030  |  +0.56  |
            150 |      8.012   |       8.074  |  +0.061  |  +0.76  |
            200 |     11.172   |      11.000  |  -0.172  |  -1.54  |
            250 |     13.917   |      14.036  |  +0.119  |  +0.86  |
            300 |     16.905   |      16.543  |  -0.362  |  -2.14  |
            350 |     19.901   |      20.036  |  +0.135  |  +0.68  |
            400 |     22.908   |      23.094  |  +0.186  |  +0.81  |
            450 |     26.273   |      26.101  |  -0.172  |  -0.66  |
            500 |     29.554   |      29.092  |  -0.461  |  -1.56  |
            550 |     32.377   |      32.274  |  -0.103  |  -0.32  |
            600 |     35.855   |      35.322  |  -0.533  |  -1.49  |
            650 |     39.192   |      38.388  |  -0.804  |  -2.05  |
            700 |     41.744   |      41.719  |  -0.025  |  -0.06  |
            750 |     45.016   |      44.496  |  -0.520  |  -1.16  |
            800 |     48.212   |      47.603  |  -0.609  |  -1.26  |
      --------------------------------------------------------------
      
      * 8 CPUs
      
      Number of | without      | with         | diff     | diff    |
      processes | Marker [Sec] | Marker [Sec] |   [Sec]  |   [%]   |
      --------------------------------------------------------------
             50 |      2.094   |       2.072  |  -0.022  |  -1.07  |
            100 |      4.162   |       4.273  |  +0.111  |  +2.66  |
            150 |      6.485   |       6.540  |  +0.055  |  +0.84  |
            200 |      8.556   |       8.478  |  -0.078  |  -0.91  |
            250 |     10.458   |      10.258  |  -0.200  |  -1.91  |
            300 |     12.425   |      12.750  |  +0.325  |  +2.62  |
            350 |     14.807   |      14.839  |  +0.032  |  +0.22  |
            400 |     16.801   |      16.959  |  +0.158  |  +0.94  |
            450 |     19.478   |      19.009  |  -0.470  |  -2.41  |
            500 |     21.296   |      21.504  |  +0.208  |  +0.98  |
            550 |     23.842   |      23.979  |  +0.137  |  +0.57  |
            600 |     26.309   |      26.111  |  -0.198  |  -0.75  |
            650 |     28.705   |      28.446  |  -0.259  |  -0.9   |
            700 |     31.233   |      31.394  |  +0.161  |  +0.52  |
            750 |     34.064   |      33.720  |  -0.344  |  -1.01  |
            800 |     36.320   |      36.114  |  -0.206  |  -0.57  |
      --------------------------------------------------------------
      Signed-off-by: NMathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
      Acked-by: NMasami Hiramatsu <mhiramat@redhat.com>
      Acked-by: N'Peter Zijlstra' <peterz@infradead.org>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      97e1c18e
    • A
      tty: Fix abusers of current->sighand->tty · dbda4c0b
      Alan Cox 提交于
      Various people outside the tty layer still stick their noses in behind the
      scenes. We need to make sure they also obey the locking and referencing rules.
      Signed-off-by: NAlan Cox <alan@redhat.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      dbda4c0b
    • A
      tty: Move tty_write_message out of kernel/printk · 95f9bfc6
      Alan Cox 提交于
      This is pure tty code so put it in the tty layer where it can be with the
      locking relevant material it uses
      Signed-off-by: NAlan Cox <alan@redhat.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      95f9bfc6
    • A
      tty: Add a kref count · 9c9f4ded
      Alan Cox 提交于
      Introduce a kref to the tty structure and use it to protect the tty->signal
      tty references. For now we don't introduce it for anything else.
      Signed-off-by: NAlan Cox <alan@redhat.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9c9f4ded
  2. 10 10月, 2008 1 次提交
  3. 09 10月, 2008 1 次提交
    • I
      sched debug: add name to sched_domain sysctl entries · a5d8c348
      Ingo Molnar 提交于
      add /proc/sys/kernel/sched_domain/cpu0/domain0/name, to make
      it easier to see which specific scheduler domain remained at
      that entry.
      
      Since we process the scheduler domain tree and
      simplify it, it's not always immediately clear during debugging
      which domain came from where.
      
      depends on CONFIG_SCHED_DEBUG=y.
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      a5d8c348
  4. 08 10月, 2008 1 次提交
  5. 07 10月, 2008 1 次提交
  6. 06 10月, 2008 1 次提交
  7. 04 10月, 2008 2 次提交
    • D
      sched_rt.c: resch needed in rt_rq_enqueue() for the root rt_rq · f6121f4f
      Dario Faggioli 提交于
      While working on the new version of the code for SCHED_SPORADIC I
      noticed something strange in the present throttling mechanism. More
      specifically in the throttling timer handler in sched_rt.c
      (do_sched_rt_period_timer()) and in rt_rq_enqueue().
      
      The problem is that, when unthrottling a runqueue, rt_rq_enqueue() only
      asks for rescheduling if the runqueue has a sched_entity associated to
      it (i.e., rt_rq->rt_se != NULL).
      Now, if the runqueue is the root rq (which has a rt_se = NULL)
      rescheduling does not take place, and it is delayed to some undefined
      instant in the future.
      
      This imply some random bandwidth usage by the RT tasks under throttling.
      For instance, setting rt_runtime_us/rt_period_us = 950ms/1000ms an RT
      task will get less than 95%. In our tests we got something varying
      between 70% to 95%.
      Using smaller time values, e.g., 95ms/100ms, things are even worse, and
      I can see values also going down to 20-25%!!
      
      The tests we performed are simply running 'yes' as a SCHED_FIFO task,
      and checking the CPU usage with top, but we can investigate thoroughly
      if you think it is needed.
      
      Things go much better, for us, with the attached patch... Don't know if
      it is the best approach, but it solved the issue for us.
      Signed-off-by: NDario Faggioli <raistlin@linux.it>
      Signed-off-by: NMichael Trimarchi <trimarchimichael@yahoo.it>
      Acked-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: <stable@kernel.org>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      f6121f4f
    • T
      clockevents: check broadcast tick device not the clock events device · 07454bff
      Thomas Gleixner 提交于
      Impact: jiffies increment too fast.
      
      Hugh Dickins noted that with NOHZ=n and HIGHRES=n jiffies get
      incremented too fast. The reason is a wrong check in the broadcast
      enter/exit code, which keeps the local apic timer in periodic mode
      when the switch happens.
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      07454bff
  8. 03 10月, 2008 2 次提交