1. 01 November 2011, 1 commit
    • mm: change isolate mode from #define to bitwise type · 4356f21d
      Committed by Minchan Kim
      Replace the ISOLATE_XXX macros with a bitwise isolate_mode_t type.  Macros
      are generally discouraged here because they are not type-safe and they make
      debugging harder, since the symbols cannot be passed through to the debugger.
      
      Quote from Johannes
      " Hmm, it would probably be cleaner to fully convert the isolation mode
      into independent flags.  INACTIVE, ACTIVE, BOTH is currently a
      tri-state among flags, which is a bit ugly."
      
      This patch moves the isolate mode definitions from swap.h to mmzone.h so
      that they can be used by memcontrol.h. (A minimal sketch of this kind of
      conversion follows this entry.)
      Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4356f21d
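
      Below is a minimal sketch of the kind of conversion described, with assumed
      flag names and values (not the exact kernel definitions): a typed, bitwise
      flag set replacing tri-state #define constants.

      /* Illustrative sketch only; assumed names and values. */
      typedef unsigned int isolate_mode_t;

      #define ISOLATE_INACTIVE ((isolate_mode_t)0x1)   /* isolate inactive pages */
      #define ISOLATE_ACTIVE   ((isolate_mode_t)0x2)   /* isolate active pages   */

      /* Callers can request either kind, or both, with one typed argument
       * instead of a tri-state constant. */
      static int demo_should_isolate(isolate_mode_t mode, int page_is_active)
      {
              return page_is_active ? (mode & ISOLATE_ACTIVE) != 0
                                    : (mode & ISOLATE_INACTIVE) != 0;
      }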
  2. 31 October 2011, 4 commits
  3. 27 October 2011, 1 commit
    • ext4: optimize ext4_ext_convert_to_initialized() · 6f91bc5f
      Committed by Eric Gouriou
      This patch introduces a fast path in ext4_ext_convert_to_initialized()
      for the case when the conversion can be performed by transferring
      the newly initialized blocks from the uninitialized extent into
      an adjacent initialized extent. Doing so removes the expensive
      invocations of memmove() which occur during extent insertion and
      the subsequent merge.
      
      In practice this should be the common case for clients performing
      append writes into files pre-allocated via
      fallocate(FALLOC_FL_KEEP_SIZE). For such a workload performed via
      direct IO, and with a suboptimal implementation of memmove()
      (x86_64 prior to the 2.6.39 rewrite), this patch reduces kernel CPU
      consumption by 32%. (The fast-path idea is sketched after this entry.)
      
      Two new trace points are added to ext4_ext_convert_to_initialized()
      to offer visibility into its operations. No exit trace point has
      been added due to the multiplicity of return points. This can be
      revisited once the upstream cleanup is backported.
      Signed-off-by: Eric Gouriou <egouriou@google.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      6f91bc5f
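
      A rough, hypothetical sketch of the fast-path idea using simplified
      in-memory structures (the real ext4 code operates on on-disk extents and
      performs many more checks): when an initialized extent ends exactly where
      the unwritten extent begins, the newly written blocks can be transferred
      by adjusting lengths instead of splitting and re-inserting extents.

      /* Hypothetical, simplified structures for illustration only. */
      struct simple_extent {
              unsigned int start;        /* first logical block covered      */
              unsigned int len;          /* number of blocks covered         */
              int initialized;           /* 1 = initialized, 0 = unwritten   */
      };

      /* Move 'count' newly initialized blocks from the head of 'uninit' into
       * the adjacent, preceding 'init' extent, avoiding the split plus
       * insert/merge (memmove-heavy) path.  Returns 1 if the fast path ran. */
      static int demo_fast_convert(struct simple_extent *init,
                                   struct simple_extent *uninit,
                                   unsigned int count)
      {
              if (!init->initialized || uninit->initialized)
                      return 0;                    /* wrong extent types  */
              if (init->start + init->len != uninit->start)
                      return 0;                    /* not adjacent        */
              if (count == 0 || count > uninit->len)
                      return 0;                    /* nothing to transfer */

              init->len     += count;              /* grow initialized extent  */
              uninit->start += count;              /* shrink unwritten extent  */
              uninit->len   -= count;
              return 1;
      }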
  4. 25 October 2011, 1 commit
    • net/9p: Convert net/9p protocol dumps to tracepoints · 348b5901
      Committed by Aneesh Kumar K.V
      This gives more control over debugging. (A generic tracepoint-definition
      sketch follows this entry.)
      root@qemu-img-64:~# ls /pass/123
      ls: cannot access /pass/123: No such file or directory
      root@qemu-img-64:~# cat /sys/kernel/debug/tracing/trace
      # tracer: nop
      #
      #           TASK-PID    CPU#    TIMESTAMP  FUNCTION
      #              | |       |          |         |
                    ls-1536  [001]    70.928584: 9p_protocol_dump: clnt 18446612132784021504 P9_TWALK(tag = 1)
      000: 16 00 00 00 6e 01 00 01 00 00 00 02 00 00 00 01
      010: 00 03 00 31 32 33 00 00 00 ff ff ff ff 00 00 00
      
                    ls-1536  [001]    70.928587: <stack trace>
       => trace_9p_protocol_dump
       => p9pdu_finalize
       => p9_client_rpc
       => p9_client_walk
       => v9fs_vfs_lookup
       => d_alloc_and_lookup
       => walk_component
       => path_lookupat
                    ls-1536  [000]    70.929696: 9p_protocol_dump: clnt 18446612132784021504 P9_RLERROR(tag = 1)
      000: 0b 00 00 00 07 01 00 02 00 00 00 4e 03 00 02 00
      010: 00 00 00 00 03 00 02 00 00 00 00 00 ff 43 00 00
      
                    ls-1536  [000]    70.929697: <stack trace>
       => trace_9p_protocol_dump
       => p9_client_rpc
       => p9_client_walk
       => v9fs_vfs_lookup
       => d_alloc_and_lookup
       => walk_component
       => path_lookupat
       => do_path_lookup
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
      348b5901
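
      For reference, a hedged sketch of what defining such an event generally
      looks like with the kernel's TRACE_EVENT macro. The names below are
      hypothetical (the real event is 9p_protocol_dump and also records part of
      the PDU for the hex dump above), and the usual TRACE_SYSTEM /
      define_trace.h header boilerplate is omitted for brevity.

      #include <linux/tracepoint.h>

      TRACE_EVENT(demo_protocol_dump,
              TP_PROTO(void *clnt, unsigned char type, unsigned short tag),
              TP_ARGS(clnt, type, tag),

              TP_STRUCT__entry(
                      __field(void *,         clnt)
                      __field(unsigned char,  type)
                      __field(unsigned short, tag)
              ),

              TP_fast_assign(
                      __entry->clnt = clnt;
                      __entry->type = type;
                      __entry->tag  = tag;
              ),

              /* Rendered in /sys/kernel/debug/tracing/trace, much like the
               * output shown above. */
              TP_printk("clnt %p type %u tag %u",
                        __entry->clnt, __entry->type, __entry->tag)
      );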
  5. 04 October 2011, 1 commit
  6. 03 October 2011, 1 commit
    • writeback: IO-less balance_dirty_pages() · 143dfe86
      Committed by Wu Fengguang
      As proposed by Chris, Dave and Jan, don't start foreground writeback IO
      inside balance_dirty_pages(). Instead, simply let it sleep idle for some
      time to throttle the dirtying task. Meanwhile, kick off the
      per-bdi flusher thread to do background writeback IO.
      
      RATIONALE
      =========
      
      - disk seeks on concurrent writeback of multiple inodes (Dave Chinner)
      
        If every thread doing writes and being throttled starts foreground
        writeback, we end up with N IO submitters working on at least N different
        inodes at the same time, issuing N different sets of IO with
        potentially zero locality to each other. This results in much lower
        elevator sort/merge efficiency, and the disk seeks all over the
        place to service the different sets of IO.
        OTOH, if there is only one submission thread, it doesn't jump between
        inodes in the same way when congestion clears - it keeps writing to
        the same inode, resulting in large related chunks of sequential IOs
        being issued to the disk. This is more efficient than the above
        foreground writeback because the elevator works better and the disk
        seeks less.
      
      - lock contention and cache bouncing on concurrent IO submitters (Dave Chinner)
      
        With this patchset, the fs_mark benchmark on a 12-drive software RAID0 goes
        from CPU bound to IO bound, freeing "3-4 CPUs worth of spinlock contention".
      
        * "CPU usage has dropped by ~55%", "it certainly appears that most of
          the CPU time saving comes from the removal of contention on the
          inode_wb_list_lock" (IMHO at least 10% comes from the reduction of
          cacheline bouncing, because the new code is able to call much less
          frequently into balance_dirty_pages() and hence access the global
          page states)
      
        * the user space "App overhead" is reduced by 20%, by avoiding the
          cacheline pollution by the complex writeback code path
      
        * "for a ~5% throughput reduction", "the number of write IOs have
          dropped by ~25%", and the elapsed time reduced from 41:42.17 to
          40:53.23.
      
        * On a simple test of 100 dd, it reduces the CPU %system time from 30% to 3%,
          and improves IO throughput from 38MB/s to 42MB/s.
      
      - IO size too small for fast arrays and too large for slow USB sticks
      
        The write_chunk used by the current balance_dirty_pages() cannot simply
        be set to some large value (eg. 128MB) for better IO efficiency, because
        that could lead to user-perceivable stalls of more than one second.
        Even the current 4MB write size may be too large for slow USB sticks.
        The fact that balance_dirty_pages() starts IO itself couples the
        IO size to the wait time, which makes it hard to pick a suitable IO size
        while keeping the wait time under control.
      
        Now it's possible to increase writeback chunk size proportional to the
        disk bandwidth. In a simple test of 50 dd's on XFS, 1-HDD, 3GB ram,
        the larger writeback size dramatically reduces the seek count to 1/10
        (far beyond my expectation) and improves the write throughput by 24%.
      
      - long block time in balance_dirty_pages() hurts desktop responsiveness
      
        Many of us have had the experience: it often takes a couple of seconds,
        or even longer, to stop a heavily writing dd/cp/tar command with
        Ctrl-C or "kill -9".
      
      - IO pipeline broken by bumpy write() progress
      
        There is a broad class of "loop {read(buf); write(buf);}" applications
        whose read() pipeline will be under-utilized or even come to a stop if
        the write()s have long latencies _or_ don't progress at a constant rate.
        The current threshold-based throttling inherently transfers the large
        low-level IO completion fluctuations into bumpy application write()s,
        and this deteriorates further with an increasing number of dirtiers and/or bdi's.
      
        For example, when doing 50 dd's + 1 remote rsync to an XFS partition,
        the rsync progress is very bumpy on the legacy kernel, and its throughput
        is improved by 67% by this patchset (with the larger write chunk size
        as well, the speedup reaches 93%).
      
        The new rate based throttling can support 1000+ dd's with excellent
        smoothness, low latency and low overheads.
      
      For the above reasons, it's much better to do IO-less and low latency
      pauses in balance_dirty_pages().
      
      Jan Kara, Dave Chinner and I explored a scheme that lets
      balance_dirty_pages() wait for enough writeback IO completions to
      safeguard the dirty limit. However, it was found to have two problems:
      
      - in large NUMA systems, the per-cpu counters may have big accounting
        errors, leading to big throttle wait time and jitters.
      
      - NFS may clear a large number of unstable pages with one single COMMIT.
        Because the NFS server serves COMMITs with expensive fsync() IOs, it is
        desirable to delay and reduce the number of COMMITs. So such bursty IO
        completions are unlikely to be optimized away, and neither are the
        resulting large (and tiny) stall times in IO-completion-based throttling.
      
      So here is a pause-time-oriented approach, which tries to control the
      pause time of each balance_dirty_pages() invocation by controlling
      the number of pages dirtied before calling balance_dirty_pages(), for
      smooth and efficient dirty throttling:
      
      - avoid useless (eg. zero pause time) balance_dirty_pages() calls
      - avoid too small pause time (less than   4ms, which burns CPU power)
      - avoid too large pause time (more than 200ms, which hurts responsiveness)
      - avoid big fluctuations of pause times
      
      It can control pause times at will. The default policy (in a follow-up
      patch) will be to do ~10ms pauses in the 1-dd case, increasing to ~100ms
      in the 1000-dd case. (The clamping idea is sketched after this entry.)
      
      BEHAVIOR CHANGE
      ===============
      
      (1) dirty threshold
      
      Users will notice that applications get throttled once they cross
      the global (background + dirty)/2 = 15% threshold, and are then balanced
      around 17.5%. Before this patch, the behavior was to throttle at 20% of
      dirtyable memory in the 1-dd case.
      
      Since tasks will be soft-throttled earlier than before, end users may
      perceive this as a performance "slow down" if their applications
      happen to dirty more than 15% of dirtyable memory.
      
      (2) smoothness/responsiveness
      
      Users will notice a more responsive system during heavy writeback.
      "killall dd" will take effect instantly.
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      143dfe86
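
      As an illustration of the pause-time clamping described above, here is a
      hypothetical sketch (assumed constants and helper names; the real code
      derives the allowed dirty rate from bdi bandwidth and dirty-limit
      feedback): the pause owed by a task is computed from the pages it dirtied
      and its allowed dirty rate, then kept inside the 4ms-200ms window.

      #define DEMO_MIN_PAUSE_MS   4     /* below this, pausing mostly burns CPU */
      #define DEMO_MAX_PAUSE_MS 200     /* above this, responsiveness suffers   */

      static unsigned int demo_compute_pause_ms(unsigned long pages_dirtied,
                                                unsigned long allowed_pages_per_sec)
      {
              unsigned long pause_ms;

              if (!allowed_pages_per_sec)
                      return DEMO_MAX_PAUSE_MS;

              /* time the task "owes" for the pages it has dirtied */
              pause_ms = pages_dirtied * 1000 / allowed_pages_per_sec;

              if (pause_ms < DEMO_MIN_PAUSE_MS)
                      return 0;                        /* skip useless tiny pauses  */
              if (pause_ms > DEMO_MAX_PAUSE_MS)
                      return DEMO_MAX_PAUSE_MS;        /* cap to bound the latency  */
              return pause_ms;
      }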
  7. 29 September 2011, 5 commits
  8. 28 September 2011, 1 commit
  9. 26 September 2011, 1 commit
  10. 23 September 2011, 1 commit
  11. 21 September 2011, 1 commit
    • ASoC: Trace and collect statistics for DAPM graph walking · de02d078
      Committed by Mark Brown
      One of the longest-standing areas for improvement in ASoC has been the
      DAPM algorithm - it repeats the same checks many times whenever it is run
      and makes no effort to limit the areas of the graph it checks, meaning we
      do an awful lot of walks over the full graph. This has never mattered too
      much, as the size of the graph has generally been small relative to the
      devices supported and the speed of CPUs, but it is annoying.
      
      In preparation for work on improving this, insert a trace point after the
      graph walk has been done. This gives us specific timing information for
      the walk; in order to give quantifiable (non-benchmark) numbers, also
      count every time we check a link or check the power for a widget, and
      report those numbers. Substantial changes in the algorithm may require
      tweaks to the stats, but they should be useful for simpler things. (A
      minimal sketch of this bookkeeping follows this entry.)
      Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
      de02d078
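
      A minimal sketch of the bookkeeping described, with hypothetical names
      (the real counters live in the DAPM context and are reported by an
      ASoC trace event once the walk completes): each link check and each
      widget power check bumps a counter.

      /* Hypothetical per-walk statistics, for illustration only. */
      struct demo_dapm_stats {
              unsigned int power_checks;   /* widget power evaluations */
              unsigned int path_checks;    /* connection (link) checks */
      };

      static void demo_check_path(struct demo_dapm_stats *s)
      {
              s->path_checks++;            /* one graph edge examined */
              /* ... evaluate whether the path is connected ... */
      }

      static void demo_check_power(struct demo_dapm_stats *s)
      {
              s->power_checks++;           /* one widget power decision */
              /* ... decide whether the widget should be powered ... */
      }

      /* After the walk, reporting s->power_checks and s->path_checks together
       * with the elapsed walk time gives the quantifiable numbers mentioned
       * above. */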
  12. 20 September 2011, 1 commit
  13. 10 September 2011, 1 commit
  14. 31 August 2011, 1 commit
  15. 11 August 2011, 1 commit
    • blktrace: add FLUSH/FUA support · c09c47ca
      Committed by Namhyung Kim
      Add FLUSH/FUA support to blktrace. As FLUSH precedes WRITE and/or
      FUA follows WRITE, use the same 'F' flag for both cases and
      distinguish them by their (relative) position. The end results
      look like the following (other flags might also be shown; the ordering
      is sketched after this entry):
      
       - WRITE:            W
       - WRITE_FLUSH:      FW
       - WRITE_FUA:        WF
       - WRITE_FLUSH_FUA:  FWF
      
      Note that we reuse TC_BARRIER due to lack of bit space in act_mask,
      so older versions of the blktrace tools will report flush
      requests as barriers from now on.
      
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Signed-off-by: Namhyung Kim <namhyung@gmail.com>
      Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
      c09c47ca
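
      Conceptually, the flag rendering works like the sketch below. The flag
      bits and helper are hypothetical stand-ins (the real values live in the
      kernel's block layer headers); only the ordering rule is taken from the
      commit: a leading 'F' marks a preceding FLUSH, a trailing 'F' marks FUA.

      #include <stdio.h>

      #define DEMO_WRITE 0x1
      #define DEMO_FLUSH 0x2
      #define DEMO_FUA   0x4

      static void demo_format_rwbs(unsigned int flags, char *buf)
      {
              char *p = buf;

              if (flags & DEMO_FLUSH)
                      *p++ = 'F';          /* FLUSH precedes the WRITE */
              if (flags & DEMO_WRITE)
                      *p++ = 'W';
              if (flags & DEMO_FUA)
                      *p++ = 'F';          /* FUA follows the WRITE    */
              *p = '\0';
      }

      int main(void)
      {
              char buf[4];

              demo_format_rwbs(DEMO_FLUSH | DEMO_WRITE | DEMO_FUA, buf);
              printf("%s\n", buf);         /* prints FWF */
              return 0;
      }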
  16. 09 August 2011, 1 commit
  17. 08 August 2011, 1 commit
    • regmap: Add basic tracepoints · fb2736bb
      Committed by Mark Brown
      Trace single register reads and writes, plus start/stop tracepoints for
      the actual I/O, to see where we're spending time. This makes it easy to
      have always-on logging without overwhelming the logs, and also lets us take
      advantage of all the context and timing information that the trace subsystem
      collects for us. (The pattern is sketched after this entry.)
      
      We don't currently trace register values for bulk operations, as this
      would add complexity and overhead to parse the cooked data that is being
      worked with.
      Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
      fb2736bb
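
      A sketch of the pattern, with hypothetical names and stubbed trace calls
      (in the kernel these would be regmap's own tracepoints): log the
      register/value pair, then bracket the bus transaction with start/done
      events so the time spent in the transport shows up in the trace.

      struct demo_map {
              int (*bus_write)(unsigned int reg, unsigned int val);
      };

      /* Stand-ins for real tracepoints, so the sketch is self-contained. */
      static void trace_demo_reg_write(unsigned int reg, unsigned int val) { }
      static void trace_demo_hw_write_start(unsigned int reg, int count)   { }
      static void trace_demo_hw_write_done(unsigned int reg, int count)    { }

      static int demo_regmap_write(struct demo_map *map, unsigned int reg,
                                   unsigned int val)
      {
              int ret;

              trace_demo_reg_write(reg, val);      /* the logical register write */
              trace_demo_hw_write_start(reg, 1);   /* device I/O begins          */
              ret = map->bus_write(reg, val);      /* the actual bus transaction */
              trace_demo_hw_write_done(reg, 1);    /* device I/O ends            */
              return ret;
      }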
  18. 31 July 2011, 1 commit
  19. 26 July 2011, 1 commit
    • xen/tracing: fix compile errors when tracing is disabled. · b3c4b982
      Committed by Jeremy Fitzhardinge
      When CONFIG_FUNCTION_TRACER is disabled, compilation fails as follows:
        CC      arch/x86/xen/setup.o
      In file included from arch/x86/include/asm/xen/hypercall.h:42,
                       from arch/x86/xen/setup.c:19:
      include/trace/events/xen.h:31: warning: 'struct multicall_entry' declared inside parameter list
      include/trace/events/xen.h:31: warning: its scope is only this definition or declaration, which is probably not what you want
      include/trace/events/xen.h:31: warning: 'struct multicall_entry' declared inside parameter list
      include/trace/events/xen.h:31: warning: 'struct multicall_entry' declared inside parameter list
      include/trace/events/xen.h:31: warning: 'struct multicall_entry' declared inside parameter list
      [...]
      arch/x86/xen/trace.c:5: error: '__HYPERVISOR_set_trap_table' undeclared here (not in a function)
      arch/x86/xen/trace.c:5: error: array index in initializer not of integer type
      arch/x86/xen/trace.c:5: error: (near initialization for 'xen_hypercall_names')
      arch/x86/xen/trace.c:6: error: '__HYPERVISOR_mmu_update' undeclared here (not in a function)
      arch/x86/xen/trace.c:6: error: array index in initializer not of integer type
      arch/x86/xen/trace.c:6: error: (near initialization for 'xen_hypercall_names')
      
      Fix this by making sure struct multicall_entry has a declaration in
      scope at all times, and by not compiling xen/trace.c when tracing
      is disabled. (The shape of the fix is sketched after this entry.)
      Reported-by: Randy Dunlap <rdunlap@xenotime.net>
      Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
      b3c4b982
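
      The shape of the fix, as a hedged sketch (illustrative prototype name;
      the actual header and Makefile details may differ): a forward declaration
      keeps pointer parameters usable in the trace prototypes even when the
      defining header has not been included first, and the trace table is only
      built when the tracer is enabled.

      /* In the trace header: forward-declare the struct so prototypes taking
       * 'struct multicall_entry *' are valid in any include order. */
      struct multicall_entry;

      void demo_trace_multicall(struct multicall_entry *entries, int count);

      /* And, conceptually, only compile the hypercall-name table when the
       * tracer is enabled, e.g. in the Makefile:
       *     obj-$(CONFIG_FUNCTION_TRACER) += trace.o
       */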
  20. 20 July 2011, 1 commit
  21. 19 July 2011, 9 commits
  22. 11 July 2011, 3 commits
  23. 10 July 2011, 1 commit