1. 01 Jul 2013, 1 commit
  2. 23 Jun 2013, 1 commit
  3. 21 Jun 2013, 1 commit
    • tracing: Add DEFINE_EVENT_FN() macro · f5abaa1b
      Authored by Steven Rostedt
      Each TRACE_EVENT() adds several helper functions. If two or more trace events
      share the same structure and print format, they can also share most of
      these helper functions and save a lot of space by avoiding duplicated
      code.  This is why DECLARE_EVENT_CLASS() and DEFINE_EVENT() were created.
      
      Some events require a trigger to be called at registration and
      unregistration of the event, and to do so they use TRACE_EVENT_FN().
      
      If multiple events require a trigger, they currently have no choice but to use
      TRACE_EVENT_FN(), as there is no DEFINE_EVENT_FN() available. This unfortunately
      results in a lot of wasted, duplicated code.
      
      By adding a DEFINE_EVENT_FN(), these events can still use a
      DECLARE_EVENT_CLASS() and then define their own triggers.
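
      As a rough illustration (not part of this commit; the class name, the
      fields and the foo_reg()/foo_unreg() callbacks below are made up), two
      events could share one class and still get their own register/unregister
      hooks like this:

        DECLARE_EVENT_CLASS(foo_class,
                TP_PROTO(int value),
                TP_ARGS(value),
                TP_STRUCT__entry(__field(int, value)),
                TP_fast_assign(__entry->value = value;),
                TP_printk("value=%d", __entry->value)
        );

        DEFINE_EVENT_FN(foo_class, foo_enter,
                TP_PROTO(int value), TP_ARGS(value),
                foo_reg, foo_unreg);

        DEFINE_EVENT_FN(foo_class, foo_exit,
                TP_PROTO(int value), TP_ARGS(value),
                foo_reg, foo_unreg);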
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      Link: http://lkml.kernel.org/r/51C3236C.8030508@hds.com
      Signed-off-by: Seiji Aguchi <seiji.aguchi@hds.com>
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
  4. 07 Jun 2013, 1 commit
    • ext4: use ext4_da_writepages() for all modes · 20970ba6
      Authored by Theodore Ts'o
      Rename ext4_da_writepages() to ext4_writepages() and use it for all
      modes.  We still need to iterate over all the pages in the case of
      data=journalling, but in the case of nodelalloc/data=ordered (which is
      what file systems mounted using ext3 backwards compatibility will use)
      this will allow us to use a much more efficient I/O submission path.
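
      As a purely illustrative sketch (other callbacks omitted; not the actual
      diff), after the rename every journalling mode ends up pointing its
      .writepages at the same function:

        static const struct address_space_operations ext4_aops = {
                .writepages     = ext4_writepages,
        };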
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
  5. 05 Jun 2013, 2 commits
    • ext4: restructure writeback path · 4e7ea81d
      Authored by Jan Kara
      There are two issues with the current writeback path in ext4.  For one,
      we don't necessarily map complete pages when blocksize < pagesize and
      thus may not do any writeback in one iteration.  We always map some
      blocks, though, so we will eventually finish mapping the page; but if
      writeback races with other operations on the file, forward progress is
      not really guaranteed.  The second problem is that the current code
      structure makes it hard to associate all the bios to some range of
      pages with one io_end structure so that unwritten extents can be
      converted after all the bios are finished.  This will be especially
      difficult later, when io_end will be associated with a reserved
      transaction handle.
      
      We restructure the writeback path into a relatively simple loop which
      first prepares an extent of pages, then maps one or more extents so
      that no page is partially mapped, and once a page is fully mapped it is
      submitted for IO.  We keep all the mapping and IO submission
      information in the mpage_da_data structure to somewhat reduce stack
      usage.  The resulting code is somewhat shorter than the old one and
      hopefully also easier to read.
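
      A loose sketch of that loop (the helper and field names follow the
      commit, but signatures, error handling and the journalling details are
      condensed here; treat it as an illustration, not the actual code):

        struct mpage_da_data mpd = { .inode = inode, .wbc = wbc };
        int ret;

        for (;;) {
                /* 1. Collect a contiguous range of dirty pages to work on. */
                ret = mpage_prepare_extent_to_map(&mpd);
                if (ret < 0 || mpd.map.m_len == 0)
                        break;
                /* 2. Map the extent so that no page stays partially mapped,
                 *    submitting fully mapped pages for IO as we go. */
                ret = mpage_map_and_submit_extent(handle, &mpd);
                if (ret < 0)
                        break;
        }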
      Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: provide wrappers for transaction reservation calls · 5fe2fe89
      Authored by Jan Kara
      Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
  6. 28 May 2013, 2 commits
    • ext4: make punch hole code path work with bigalloc · d23142c6
      Authored by Lukas Czerner
      Currently, punch hole is disabled on file systems with the bigalloc
      feature enabled.  However, the recent changes in the punch hole patch
      should make it easier to support punching holes on bigalloc-enabled
      file systems.
      
      This commit changes the partial_cluster handling in ext4_remove_blocks(),
      ext4_ext_rm_leaf() and ext4_ext_remove_space().  Currently,
      partial_cluster is of unsigned long long type and it makes sure that we
      free the partial cluster once all extents have been released from that
      cluster.  However, it has been designed specifically for truncate.
      
      With punch hole we can be freeing just some extents in the cluster,
      leaving the rest untouched.  So we have to make sure that we notice a
      cluster which still has some extents.  To do this I've changed
      partial_cluster to be of signed long long type.  The only scenario where
      this could be a problem is when cluster_size == block size; however, in
      that case there would not be any partial clusters, so we're safe.  For
      bigger clusters the signed type is enough.  Now we use a negative value
      in partial_cluster to mark such a cluster as used, so we know that we
      must not free it even if all other extents have been freed from it.
      
      This scenario can be described with a simple diagram:
      
      |FFF...FF..FF.UUU|
       ^----------^
        punch hole
      
      . - free space
      | - cluster boundary
      F - freed extent
      U - used extent
      
      Also update the respective tracepoints to use the signed long long type
      for partial_cluster.
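
      Purely as an illustration of the sign convention (EXT4_B2C() is the
      real block-to-cluster macro, but the surrounding logic is condensed and
      not the actual extent-walking code):

        long long partial_cluster = 0;

        /* Some extents in this cluster survive the punch: remember the
         * cluster, but negated, meaning "must not be freed later". */
        partial_cluster = -(long long)EXT4_B2C(sbi, pblk);

        /* At the end of the removal pass, only a positively marked cluster
         * is actually a candidate for freeing: */
        if (partial_cluster > 0) {
                /* free the remaining blocks of that cluster here */
        }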
      Signed-off-by: Lukas Czerner <lczerner@redhat.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • ext4: update ext4_ext_remove_space trace point · 61801325
      Authored by Lukas Czerner
      Add "end" variable.
      Signed-off-by: Lukas Czerner <lczerner@redhat.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
  7. 22 May 2013, 2 commits
  8. 03 May 2013, 1 commit
    • ext4: fix fio regression · e30b5dca
      Authored by Yan, Zheng
      We (Linux Kernel Performance project) found a regression introduced
      by commit:
      
        f7fec032 ext4: track all extent status in extent status tree
      
      The commit causes about a 20% performance decrease in the fio random
      write test.  The profiler shows that rb_next() uses a lot of CPU time.
      The call stack is:
      
        rb_next
        ext4_es_find_delayed_extent
        ext4_map_blocks
        _ext4_get_block
        ext4_get_block_write
        __blockdev_direct_IO
        ext4_direct_IO
        generic_file_direct_write
        __generic_file_aio_write
        ext4_file_write
        aio_rw_vect_retry
        aio_run_iocb
        do_io_submit
        sys_io_submit
        system_call_fastpath
        io_submit
        td_io_getevents
        io_u_queued_complete
        thread_main
        main
        __libc_start_main
      
      The cause is that ext4_es_find_delayed_extent() doesn't have an
      upper bound; it keeps searching until a delayed extent is found.
      When there are a lot of non-delayed entries in the extent status
      tree, ext4_es_find_delayed_extent() may use a lot of CPU time.
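
      A rough sketch of the bounded lookup the fix aims for (illustrative
      only; 'end' and the surrounding loop setup are condensed and this is
      not the actual patch):

        /* walk the status tree, but never past 'end' */
        while (node && es->es_lblk <= end) {
                if (ext4_es_is_delayed(es))
                        break;          /* found a delayed extent */
                node = rb_next(node);
                es = node ? rb_entry(node, struct extent_status,
                                     rb_node) : NULL;
        }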
      Reported-by: LKP project <lkp@linux.intel.com>
      Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
      Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
  9. 30 Apr 2013, 2 commits
  10. 29 Apr 2013, 1 commit
  11. 27 Apr 2013, 1 commit
  12. 24 Apr 2013, 1 commit
    • nohz: Fix unavailable tick_stop tracepoint in dynticks idle · 2c82d1be
      Authored by Frederic Weisbecker
      The trace_tick_stop() tracepoint is only available in full
      dynticks. But it's also used by dynticks-idle so let's build
      it for the latter config as well.
      
      This fixes:
      
           kernel/time/tick-sched.c: In function tick_nohz_stop_sched_tick:
           kernel/time/tick-sched.c:644: error: implicit declaration of function trace_tick_stop
           make[2]: *** [kernel/time/tick-sched.o] Erreur 1
      Reported-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Geoff Levand <geoff@infradead.org>
      Cc: Gilad Ben Yossef <gilad@benyossef.com>
      Cc: Hakan Akkan <hakanakkan@gmail.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Kevin Hilman <khilman@linaro.org>
      Cc: Li Zhong <zhong@linux.vnet.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
  13. 23 Apr 2013, 8 commits
  14. 22 Apr 2013, 3 commits
  15. 19 Apr 2013, 1 commit
  16. 13 Apr 2013, 1 commit
  17. 12 Apr 2013, 1 commit
    • kthread: Prevent unpark race which puts threads on the wrong cpu · f2530dc7
      Authored by Thomas Gleixner
      The smpboot threads rely on the park/unpark mechanism which binds
      per-cpu threads to a particular core.  However, the functionality is racy:
      
      CPU0	       	 	CPU1  	     	    CPU2
      unpark(T)				    wake_up_process(T)
        clear(SHOULD_PARK)	T runs
      			leave parkme() due to !SHOULD_PARK  
        bind_to(CPU2)		BUG_ON(wrong CPU)						    
      
      We cannot let the tasks move themselves to the target CPU, as one of
      those tasks is actually the migration thread itself, which requires
      that it starts running on the target cpu right away.
      
      The solution to this problem is to prevent wakeups in park mode which
      are not from unpark(). That way we can guarantee that the association
      of the task to the target cpu is working correctly.
      
      Add a new task state (TASK_PARKED) which prevents other wakeups and
      use this state explicitly for the unpark wakeup.
      
      Peter noticed: Also, since the task state is visible to userspace and
      all the parked tasks are still in the PID space, its a good hint in ps
      and friends that these tasks aren't really there for the moment.
      
      The migration thread has another related issue.
      
      CPU0	      	     	 CPU1
      Bring up CPU2
      create_thread(T)
      park(T)
       wait_for_completion()
      			 parkme()
      			 complete()
      sched_set_stop_task()
      			 schedule(TASK_PARKED)
      
      The sched_set_stop_task() call is issued while the task is on the
      runqueue of CPU1 and that confuses the hell out of the stop_task class
      on that CPU.  So we need the same synchronization before
      sched_set_stop_task().
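
      A minimal sketch of the wakeup discipline described above (condensed,
      with helpers and struct details abbreviated; the real changes live in
      kernel/kthread.c and kernel/smpboot.c):

        /* Park side: sleep in TASK_PARKED so ordinary wakeups are ignored. */
        static void parkme(struct kthread *self)
        {
                for (;;) {
                        set_current_state(TASK_PARKED);
                        if (!test_bit(KTHREAD_SHOULD_PARK, &self->flags))
                                break;
                        schedule();
                }
                __set_current_state(TASK_RUNNING);
        }

        /* Unpark side: rebinding to the target CPU happens while the task
         * still sleeps in TASK_PARKED; only then is it woken, and only a
         * TASK_PARKED sleeper reacts to this wakeup. */
        static void unpark(struct kthread *kt, struct task_struct *task)
        {
                clear_bit(KTHREAD_SHOULD_PARK, &kt->flags);
                wake_up_state(task, TASK_PARKED);
        }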
      Reported-by: Dave Jones <davej@redhat.com>
      Reported-and-tested-by: Dave Hansen <dave@sr71.net>
      Reported-and-tested-by: Borislav Petkov <bp@alien8.de>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
      Cc: dhillf@gmail.com
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: stable@vger.kernel.org
      Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1304091635430.21884@ionos
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  18. 10 Apr 2013, 1 commit
  19. 04 Apr 2013, 1 commit
  20. 02 Apr 2013, 1 commit
    • writeback: replace custom worker pool implementation with unbound workqueue · 839a8e86
      Authored by Tejun Heo
      Writeback implements its own worker pool - each bdi can be associated
      with a worker thread which is created and destroyed dynamically.  The
      worker thread for the default bdi is always present and serves as the
      "forker" thread which forks off worker threads for other bdis.
      
      There's no reason for writeback to implement its own worker pool when
      using an unbound workqueue instead is much simpler and more efficient.
      This patch replaces the custom worker pool implementation in writeback
      with an unbound workqueue.
      
      The conversion isn't too complicated, but the following points are worth
      mentioning.
      
      * bdi_writeback->last_active, task and wakeup_timer are removed.
        delayed_work ->dwork is added instead.  Explicit timer handling is
        no longer necessary.  Everything works by either queueing / modding
        / flushing / canceling the delayed_work item.
      
      * bdi_writeback_thread() becomes bdi_writeback_workfn() which runs off
        bdi_writeback->dwork.  On each execution, it processes
        bdi->work_list and reschedules itself if there are more things to
        do.
      
        The function also handles the low-memory condition, which used to be
        handled by the forker thread.  If the function is running off a
        rescuer thread, it only writes out a limited number of pages so that
        the rescuer can serve other bdis too.  This preserves the flusher
        creation failure behavior of the forker thread.
      
      * INIT_LIST_HEAD(&bdi->bdi_list) is used to tell
        bdi_writeback_workfn() about on-going bdi unregistration so that it
        always drains work_list even if it's running off the rescuer.  Note
        that the original code was broken in this regard.  Under memory
        pressure, a bdi could finish unregistration with non-empty
        work_list.
      
      * The default bdi is no longer special.  It now is treated the same as
        any other bdi and bdi_cap_flush_forker() is removed.
      
      * BDI_pending is no longer used.  Removed.
      
      * Some tracepoints become non-applicable.  The following TPs are
        removed - writeback_nothread, writeback_wake_thread,
        writeback_wake_forker_thread, writeback_thread_start,
        writeback_thread_stop.
      
      Everything, including devices coming and going away and rescuer
      operation under simulated memory pressure, seems to work fine in my
      test setup.
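
      A condensed sketch of the resulting queueing pattern (bdi_wq, ->dwork
      and bdi_writeback_workfn() are from the commit; the wakeup helper and
      the exact arguments are simplified, so treat this as illustrative
      rather than the actual diff):

        /* one unbound, memory-reclaim-capable workqueue shared by all bdis */
        static struct workqueue_struct *bdi_wq;

        static void bdi_wakeup_flusher(struct backing_dev_info *bdi)
        {
                /* replaces the old wakeup_timer + forker-thread dance */
                mod_delayed_work(bdi_wq, &bdi->wb.dwork, 0);
        }

        void bdi_writeback_workfn(struct work_struct *work)
        {
                struct bdi_writeback *wb = container_of(to_delayed_work(work),
                                                        struct bdi_writeback,
                                                        dwork);

                wb_do_writeback(wb);            /* drain bdi->work_list */

                /* reschedule ourselves if more work arrived meanwhile */
                if (!list_empty(&wb->bdi->work_list))
                        mod_delayed_work(bdi_wq, &wb->dwork, 0);
        }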
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Jeff Moyer <jmoyer@redhat.com>
  21. 26 Mar 2013, 2 commits
  22. 24 Mar 2013, 2 commits
  23. 18 Mar 2013, 1 commit
  24. 15 Mar 2013, 2 commits
    • tracing: Add a way to soft disable trace events · 417944c4
      Authored by Steven Rostedt (Red Hat)
      In order to let triggers enable or disable events, we need a 'soft'
      method for doing so. For example, if a function probe is added that
      lets a user enable or disable events when a function is called, that
      change must be done without taking locks or a mutex, and it definitely
      can't sleep.  But the full enabling of a tracepoint is expensive.
      
      By adding a 'SOFT_DISABLE' flag, and converting the flags to be updated
      without the protection of a mutex (using set/clear_bit()), this soft
      disable flag can be used to allow critical sections to enable or disable
      events from being traced (after the event has been placed into "SOFT_MODE").
      
      Some caveats though: the comm recorder (to map pids with a comm) cannot
      be soft disabled (yet).  If you disable an event with a "soft" disable
      and wait a while before reading the trace, the comm cache may be
      replaced and you'll get a bunch of <...> for comms in the trace.
      
      Reading the "enable" file for an event that is disabled will now give
      you "0*" where the '*' denotes that the tracepoint is still active but
      the event itself is "disabled".
      
      [ fixed _BIT used in & operation : thanks to Dan Carpenter and smatch ]
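
      Roughly, the check this adds to the record path looks like the fragment
      below (the flag name matches this era of the code, but the surrounding
      macro machinery is omitted, so read it as a sketch):

        /* tracepoint is still registered, but the event is suppressed */
        if (test_bit(FTRACE_EVENT_FL_SOFT_DISABLED_BIT, &ftrace_file->flags))
                return;

        /* triggers flip the bit atomically: no mutex, no sleeping */
        set_bit(FTRACE_EVENT_FL_SOFT_DISABLED_BIT, &ftrace_file->flags);
        clear_bit(FTRACE_EVENT_FL_SOFT_DISABLED_BIT, &ftrace_file->flags);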
      
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: Tom Zanussi <tom.zanussi@linux.intel.com>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
    • tracing: Fix some section mismatch warnings · 523c8113
      Authored by Li Zefan
      As we've added __init annotation to field-defining functions, we should
      add __refdata annotation to event_call variables, which reference those
      functions.
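
      A generic illustration of the pattern (the names below are made up, not
      from this patch): data that legitimately points at __init code is
      marked __refdata so that modpost does not flag the reference.

        static int __init foo_define_fields(void)
        {
                /* runs only at boot, then its text is freed */
                return 0;
        }

        typedef int (*define_fields_fn)(void);

        /* lives on after boot but points at __init code; __refdata tells
         * modpost the reference is intentional */
        static define_fields_fn __refdata foo_fields_hook = foo_define_fields;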
      
      Link: http://lkml.kernel.org/r/51343C1F.2050502@huawei.com
      Reported-by: Fengguang Wu <fengguang.wu@intel.com>
      Signed-off-by: Li Zefan <lizefan@huawei.com>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>