1. 03 Oct, 2012 1 commit
  2. 21 Sep, 2012 1 commit
  3. 02 Sep, 2012 1 commit
  4. 17 Aug, 2012 2 commits
  5. 01 Aug, 2012 1 commit
  6. 31 Jul, 2012 1 commit
  7. 15 Jul, 2012 1 commit
  8. 13 Jul, 2012 1 commit
    • workqueue: factor out worker_pool from global_cwq · bd7bdd43
      Committed by Tejun Heo
      Move worklist and all worker management fields from global_cwq into
      the new struct worker_pool.  worker_pool points back to the containing
      gcwq.  worker and cpu_workqueue_struct are updated to point to
      worker_pool instead of gcwq too.
      
      This change is mechanical and doesn't introduce any functional
      difference other than the rearrangement of fields and an added level of
      indirection in some places.  This is to prepare for multiple pools per
      gcwq.
      
      v2: Comment typo fixes as suggested by Namhyung.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      bd7bdd43
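
      The shape of the split can be pictured with the simplified structures below; this is an illustrative sketch (field names approximate, not the exact members the patch moves), not the kernel code itself.

      /* Sketch only: worker_pool carries the worklist and worker-management
       * state that used to live directly in global_cwq, and points back to
       * the gcwq that contains it. */
      #include <linux/list.h>

      struct global_cwq;

      struct worker_pool {
              struct global_cwq       *gcwq;          /* back-pointer to containing gcwq */
              struct list_head        worklist;       /* pending work, moved from gcwq */
              int                     nr_workers;     /* worker management, moved from gcwq */
              int                     nr_idle;
              struct list_head        idle_list;
      };

      struct global_cwq {
              unsigned int            cpu;
              unsigned int            flags;
              struct worker_pool      pool;           /* a single pool for now; multiple later */
      };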
  9. 03 Jul, 2012 1 commit
  10. 29 Jun, 2012 1 commit
    • tracing/kvm: Use __print_hex() for kvm_emulate_insn tracepoint · b102f1d0
      Committed by Namhyung Kim
      The kvm_emulate_insn tracepoint used __print_insn()
      for printing its instructions. However, that makes the
      format of the event hard to parse as it reveals TP
      internals.
      
      Fortunately, the kernel provides __print_hex() for almost
      the same purpose, so we can use it instead of open-coding
      it. The user-space side can be changed to parse it later.
      
      That means raw kernel tracing will not be affected
      by this change:
      
       # cd /sys/kernel/debug/tracing/
       # cat events/kvm/kvm_emulate_insn/format
       name: kvm_emulate_insn
       ID: 29
       format:
      	...
       print fmt: "%x:%llx:%s (%s)%s", REC->csbase, REC->rip, __print_hex(REC->insn, REC->len), \
       __print_symbolic(REC->flags, { 0, "real" }, { (1 << 0) | (1 << 1), "vm16" }, \
       { (1 << 0), "prot16" }, { (1 << 0) | (1 << 2), "prot32" }, { (1 << 0) | (1 << 3), "prot64" }), \
       REC->failed ? " failed" : ""
      
       # echo 1 > events/kvm/kvm_emulate_insn/enable
       # cat trace
       # tracer: nop
       #
       # entries-in-buffer/entries-written: 2183/2183   #P:12
       #
       #                              _-----=> irqs-off
       #                             / _----=> need-resched
       #                            | / _---=> hardirq/softirq
       #                            || / _--=> preempt-depth
       #                            ||| /     delay
       #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
       #              | |       |   ||||       |         |
               qemu-kvm-1782  [002] ...1   140.931636: kvm_emulate_insn: 0:c102fa25:89 10 (prot32)
               qemu-kvm-1781  [004] ...1   140.931637: kvm_emulate_insn: 0:c102fa25:89 10 (prot32)
      
      Link: http://lkml.kernel.org/n/tip-wfw6y3b9ugtey8snaow9nmg5@git.kernel.org
      Link: http://lkml.kernel.org/r/1340757701-10711-2-git-send-email-namhyung@kernel.org
      
      Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Namhyung Kim <namhyung.kim@lge.com>
      Cc: kvm@vger.kernel.org
      Acked-by: Avi Kivity <avi@redhat.com>
      Signed-off-by: Namhyung Kim <namhyung@kernel.org>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      b102f1d0
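
      For reference, the TP_printk() behind the exported format above looks roughly like the following when written with __print_hex(); this is a sketch reconstructed from the format output (with __entry-> corresponding to REC->), not the verbatim hunk from the patch.

      TP_printk("%x:%llx:%s (%s)%s",
                __entry->csbase, __entry->rip,
                __print_hex(__entry->insn, __entry->len),     /* opcode bytes as plain hex */
                __print_symbolic(__entry->flags,
                                 { 0,                     "real"   },
                                 { (1 << 0) | (1 << 1),   "vm16"   },
                                 { (1 << 0),              "prot16" },
                                 { (1 << 0) | (1 << 2),   "prot32" },
                                 { (1 << 0) | (1 << 3),   "prot64" }),
                __entry->failed ? " failed" : "")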
  11. 28 Jun, 2012 1 commit
    • x86/flush_tlb: try flush_tlb_single one by one in flush_tlb_range · e7b52ffd
      Committed by Alex Shi
      x86 has no flush_tlb_range support at the instruction level. Currently
      flush_tlb_range is simply implemented by flushing the entire TLB. That is
      not the best solution for all scenarios. In fact, if we just use 'invlpg'
      to flush a few lines from the TLB, we gain performance from the TLB lines
      that remain valid for later accesses.
      
      But the 'invlpg' instruction itself is costly. Its execution time is
      comparable to rewriting cr3, and even a bit higher on SNB CPUs.
      
      So, on a CPU with 512 4KB TLB entries, the balance point is at:
      	(512 - X) * 100ns(assumed TLB refill cost) =
      		X(TLB flush entries) * 100ns(assumed invlpg cost)
      
      Here, X is 256, that is 1/2 of 512 entries.
      
      But with the mysterious CPU prefetcher and page-miss-handler unit, the
      assumed TLB refill cost is far lower than 100ns for sequential access. And
      two HT siblings in one core make memory access even faster when they are
      accessing the same memory. So, in this patch, I only make the change when
      the number of target entries is less than 1/16 of the active TLB entries.
      Actually, I have no data to support the '1/16' fraction, so any
      suggestions are welcome.
      
      As to hugetlb, presumably due to the smaller page tables and fewer active
      TLB entries, I did not see a benefit in my benchmark, so it is not optimized for now.
      
      My micro-benchmark shows that in ideal scenarios read performance improves
      by 70 percent. In the worst scenario, read/write performance is similar
      to the unpatched 3.4-rc4 kernel.
      
      Here is the reading data on my 2P * 4 cores * HT NHM EP machine, with THP
      'always':
      
      multi-thread testing, the '-t' parameter is the thread number:
      	       	        with patch   unpatched 3.4-rc4
      ./mprotect -t 1           14ns		24ns
      ./mprotect -t 2           13ns		22ns
      ./mprotect -t 4           12ns		19ns
      ./mprotect -t 8           14ns		16ns
      ./mprotect -t 16          28ns		26ns
      ./mprotect -t 32          54ns		51ns
      ./mprotect -t 128         200ns		199ns
      
      Single process with sequential flushing and memory accessing:
      
      		       	with patch   unpatched 3.4-rc4
      ./mprotect		    7ns			11ns
      ./mprotect -p 4096  -l 8 -n 10240
      			    21ns		21ns
      
      [ hpa: http://lkml.kernel.org/r/1B4B44D9196EFF41AE41FDA404FC0A100BFF94@SHSMSX101.ccr.corp.intel.com
        has additional performance numbers. ]
      Signed-off-by: Alex Shi <alex.shi@intel.com>
      Link: http://lkml.kernel.org/r/1340845344-27557-3-git-send-email-alex.shi@intel.com
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
      e7b52ffd
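
      A minimal sketch of the heuristic described above; the function is simplified and is not the patch itself, though local_flush_tlb() and __flush_tlb_single() are the usual x86 helpers of that era.

      /* Sketch only: use invlpg per page for small ranges, otherwise fall
       * back to a full TLB flush (cr3 rewrite). */
      static void flush_range_sketch(unsigned long start, unsigned long end,
                                     unsigned long act_entries)
      {
              unsigned long nr_pages = (end - start) >> PAGE_SHIFT;

              if (nr_pages > act_entries / 16) {
                      local_flush_tlb();              /* full flush */
                      return;
              }

              for (; start < end; start += PAGE_SIZE)
                      __flush_tlb_single(start);      /* invlpg, one page at a time */
      }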
  12. 18 Jun, 2012 1 commit
  13. 14 Jun, 2012 1 commit
  14. 07 Jun, 2012 1 commit
  15. 30 May, 2012 4 commits
    • mm: vmscan: remove reclaim_mode_t · 23b9da55
      Committed by Mel Gorman
      There is little motivation for reclaim_mode_t once RECLAIM_MODE_[A]SYNC
      and lumpy reclaim have been removed.  This patch gets rid of
      reclaim_mode_t as well and improves the documentation about what
      reclaim/compaction is and when it is triggered.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ying Han <yinghan@google.com>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      23b9da55
    • mm: vmscan: do not stall on writeback during memory compaction · 41ac1999
      Committed by Mel Gorman
      This patch stops reclaim/compaction from entering sync reclaim, as that
      was only intended for lumpy reclaim and its use here was an oversight.
      Page migration has its own logic for stalling on writeback pages if
      necessary, and memory compaction is already using it.
      
      Waiting on page writeback is bad for a number of reasons but the primary
      one is that waiting on writeback to a slow device like USB can take a
      considerable length of time.  Page reclaim instead uses
      wait_iff_congested() to throttle if too many dirty pages are being
      scanned.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ying Han <yinghan@google.com>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      41ac1999
    • mm: vmscan: remove lumpy reclaim · c53919ad
      Committed by Mel Gorman
      This series removes lumpy reclaim and some stalling logic that was
      unintentionally being used by memory compaction.  The end result is that
      stalling on dirty pages during page reclaim now depends on
      wait_iff_congested().
      
      Four kernels were compared
      
        3.3.0     vanilla
        3.4.0-rc2 vanilla
        3.4.0-rc2 lumpyremove-v2 is patch one from this series
        3.4.0-rc2 nosync-v2r3 is the full series
      
      Removing lumpy reclaim saves almost 900 bytes of text whereas the full
      series removes 1200 bytes.
      
           text     data      bss       dec     hex  filename
        6740375  1927944  2260992  10929311  a6c49f  vmlinux-3.4.0-rc2-vanilla
        6739479  1927944  2260992  10928415  a6c11f  vmlinux-3.4.0-rc2-lumpyremove-v2
        6739159  1927944  2260992  10928095  a6bfdf  vmlinux-3.4.0-rc2-nosync-v2
      
      There are behaviour changes in the series and so tests were run with
      monitoring of ftrace events.  This disrupts results so the performance
      results are distorted but the new behaviour should be clearer.
      
      fs-mark running in a threaded configuration showed little of interest as
      it did not push reclaim aggressively
      
        FS-Mark Multi Threaded
                                3.3.0-vanilla       rc2-vanilla       lumpyremove-v2r3       nosync-v2r3
        Files/s  min           3.20 ( 0.00%)        3.20 ( 0.00%)        3.20 ( 0.00%)        3.20 ( 0.00%)
        Files/s  mean          3.20 ( 0.00%)        3.20 ( 0.00%)        3.20 ( 0.00%)        3.20 ( 0.00%)
        Files/s  stddev        0.00 ( 0.00%)        0.00 ( 0.00%)        0.00 ( 0.00%)        0.00 ( 0.00%)
        Files/s  max           3.20 ( 0.00%)        3.20 ( 0.00%)        3.20 ( 0.00%)        3.20 ( 0.00%)
        Overhead min      508667.00 ( 0.00%)   521350.00 (-2.49%)   544292.00 (-7.00%)   547168.00 (-7.57%)
        Overhead mean     551185.00 ( 0.00%)   652690.73 (-18.42%)   991208.40 (-79.83%)   570130.53 (-3.44%)
        Overhead stddev    18200.69 ( 0.00%)   331958.29 (-1723.88%)  1579579.43 (-8578.68%)     9576.81 (47.38%)
        Overhead max      576775.00 ( 0.00%)  1846634.00 (-220.17%)  6901055.00 (-1096.49%)   585675.00 (-1.54%)
        MMTests Statistics: duration
        Sys Time Running Test (seconds)             309.90    300.95    307.33    298.95
        User+Sys Time Running Test (seconds)        319.32    309.67    315.69    307.51
        Total Elapsed Time (seconds)               1187.85   1193.09   1191.98   1193.73
      
        MMTests Statistics: vmstat
        Page Ins                                       80532       82212       81420       79480
        Page Outs                                  111434984   111456240   111437376   111582628
        Swap Ins                                           0           0           0           0
        Swap Outs                                          0           0           0           0
        Direct pages scanned                           44881       27889       27453       34843
        Kswapd pages scanned                        25841428    25860774    25861233    25843212
        Kswapd pages reclaimed                      25841393    25860741    25861199    25843179
        Direct pages reclaimed                         44881       27889       27453       34843
        Kswapd efficiency                                99%         99%         99%         99%
        Kswapd velocity                            21754.791   21675.460   21696.029   21649.127
        Direct efficiency                               100%        100%        100%        100%
        Direct velocity                               37.783      23.375      23.031      29.188
        Percentage direct scans                           0%          0%          0%          0%
      
      ftrace showed that there was no stalling on writeback or pages submitted
      for IO from reclaim context.
      
      postmark was similar and while it was more interesting, it also did not
      push reclaim heavily.
      
        POSTMARK
                                             3.3.0-vanilla       rc2-vanilla  lumpyremove-v2r3       nosync-v2r3
        Transactions per second:               16.00 ( 0.00%)    20.00 (25.00%)    18.00 (12.50%)    17.00 ( 6.25%)
        Data megabytes read per second:        18.80 ( 0.00%)    24.27 (29.10%)    22.26 (18.40%)    20.54 ( 9.26%)
        Data megabytes written per second:     35.83 ( 0.00%)    46.25 (29.08%)    42.42 (18.39%)    39.14 ( 9.24%)
        Files created alone per second:        28.00 ( 0.00%)    38.00 (35.71%)    34.00 (21.43%)    30.00 ( 7.14%)
        Files create/transact per second:       8.00 ( 0.00%)    10.00 (25.00%)     9.00 (12.50%)     8.00 ( 0.00%)
        Files deleted alone per second:       556.00 ( 0.00%)  1224.00 (120.14%)  3062.00 (450.72%)  6124.00 (1001.44%)
        Files delete/transact per second:       8.00 ( 0.00%)    10.00 (25.00%)     9.00 (12.50%)     8.00 ( 0.00%)
      
        MMTests Statistics: duration
        Sys Time Running Test (seconds)             113.34    107.99    109.73    108.72
        User+Sys Time Running Test (seconds)        145.51    139.81    143.32    143.55
        Total Elapsed Time (seconds)               1159.16    899.23    980.17   1062.27
      
        MMTests Statistics: vmstat
        Page Ins                                    13710192    13729032    13727944    13760136
        Page Outs                                   43071140    42987228    42733684    42931624
        Swap Ins                                           0           0           0           0
        Swap Outs                                          0           0           0           0
        Direct pages scanned                               0           0           0           0
        Kswapd pages scanned                         9941613     9937443     9939085     9929154
        Kswapd pages reclaimed                       9940926     9936751     9938397     9928465
        Direct pages reclaimed                             0           0           0           0
        Kswapd efficiency                                99%         99%         99%         99%
        Kswapd velocity                             8576.567   11051.058   10140.164    9347.109
        Direct efficiency                               100%        100%        100%        100%
        Direct velocity                                0.000       0.000       0.000       0.000
      
      It looks here as if the full series regresses performance, but as
      ftrace showed no usage of wait_iff_congested() or sync reclaim, I am
      assuming it is a disruption due to monitoring.  Other data such as memory
      usage, page IO and swap IO all looked similar.
      
      Running a benchmark with a plain DD showed nothing very interesting.
      The full series stalled in wait_iff_congested() slightly less but stall
      times on vanilla kernels were marginal.
      
      Running a benchmark that hammered on file-backed mappings showed stalls
      due to congestion but not in sync writebacks
      
        MICRO
                                             3.3.0-vanilla       rc2-vanilla  lumpyremove-v2r3       nosync-v2r3
        MMTests Statistics: duration
        Sys Time Running Test (seconds)             308.13    294.50    298.75    299.53
        User+Sys Time Running Test (seconds)        330.45    316.28    318.93    320.79
        Total Elapsed Time (seconds)               1814.90   1833.88   1821.14   1832.91
      
        MMTests Statistics: vmstat
        Page Ins                                      108712      120708       97224      110344
        Page Outs                                  155514576   156017404   155813676   156193256
        Swap Ins                                           0           0           0           0
        Swap Outs                                          0           0           0           0
        Direct pages scanned                         2599253     1550480     2512822     2414760
        Kswapd pages scanned                        69742364    71150694    68839041    69692533
        Kswapd pages reclaimed                      34824488    34773341    34796602    34799396
        Direct pages reclaimed                         53693       94750       61792       75205
        Kswapd efficiency                                49%         48%         50%         49%
        Kswapd velocity                            38427.662   38797.901   37799.972   38022.889
        Direct efficiency                                 2%          6%          2%          3%
        Direct velocity                             1432.174     845.464    1379.807    1317.446
        Percentage direct scans                           3%          2%          3%          3%
        Page writes by reclaim                             0           0           0           0
        Page writes file                                   0           0           0           0
        Page writes anon                                   0           0           0           0
        Page reclaim immediate                             0           0           0        1218
        Page rescued immediate                             0           0           0           0
        Slabs scanned                                  15360       16384       13312       16384
        Direct inode steals                                0           0           0           0
        Kswapd inode steals                             4340        4327        1630        4323
      
        FTrace Reclaim Statistics: congestion_wait
        Direct number congest     waited                 0          0          0          0
        Direct time   congest     waited               0ms        0ms        0ms        0ms
        Direct full   congest     waited                 0          0          0          0
        Direct number conditional waited               900        870        754        789
        Direct time   conditional waited               0ms        0ms        0ms       20ms
        Direct full   conditional waited                 0          0          0          0
        KSwapd number congest     waited              2106       2308       2116       1915
        KSwapd time   congest     waited          139924ms   157832ms   125652ms   132516ms
        KSwapd full   congest     waited              1346       1530       1202       1278
        KSwapd number conditional waited             12922      16320      10943      14670
        KSwapd time   conditional waited               0ms        0ms        0ms        0ms
        KSwapd full   conditional waited                 0          0          0          0
      
      Reclaim statistics are not radically changed.  The stall times in kswapd
      are massive but it is clear that it is due to calls to congestion_wait()
      and that is almost certainly the call in balance_pgdat().  Otherwise
      stalls due to dirty pages are non-existent.
      
      I ran a benchmark that stressed high-order allocation.  This is a very
      artificial load, but it was used in the past to evaluate lumpy reclaim and
      compaction.  Generally I look at allocation success rates and latency
      figures.
      
        STRESS-HIGHALLOC
                         3.3.0-vanilla       rc2-vanilla  lumpyremove-v2r3       nosync-v2r3
        Pass 1          81.00 ( 0.00%)    28.00 (-53.00%)    24.00 (-57.00%)    28.00 (-53.00%)
        Pass 2          82.00 ( 0.00%)    39.00 (-43.00%)    38.00 (-44.00%)    43.00 (-39.00%)
        while Rested    88.00 ( 0.00%)    87.00 (-1.00%)    88.00 ( 0.00%)    88.00 ( 0.00%)
      
        MMTests Statistics: duration
        Sys Time Running Test (seconds)             740.93    681.42    685.14    684.87
        User+Sys Time Running Test (seconds)       2922.65   3269.52   3281.35   3279.44
        Total Elapsed Time (seconds)               1161.73   1152.49   1159.55   1161.44
      
        MMTests Statistics: vmstat
        Page Ins                                     4486020     2807256     2855944     2876244
        Page Outs                                    7261600     7973688     7975320     7986120
        Swap Ins                                       31694           0           0           0
        Swap Outs                                      98179           0           0           0
        Direct pages scanned                           53494       57731       34406      113015
        Kswapd pages scanned                         6271173     1287481     1278174     1219095
        Kswapd pages reclaimed                       2029240     1281025     1260708     1201583
        Direct pages reclaimed                          1468       14564       16649       92456
        Kswapd efficiency                                32%         99%         98%         98%
        Kswapd velocity                             5398.133    1117.130    1102.302    1049.641
        Direct efficiency                                 2%         25%         48%         81%
        Direct velocity                               46.047      50.092      29.672      97.306
        Percentage direct scans                           0%          4%          2%          8%
        Page writes by reclaim                       1616049           0           0           0
        Page writes file                             1517870           0           0           0
        Page writes anon                               98179           0           0           0
        Page reclaim immediate                        103778       27339        9796       17831
        Page rescued immediate                             0           0           0           0
        Slabs scanned                                1096704      986112      980992      998400
        Direct inode steals                              223      215040      216736      247881
        Kswapd inode steals                           175331       61548       68444       63066
        Kswapd skipped wait                            21991           0           1           0
        THP fault alloc                                    1         135         125         134
        THP collapse alloc                               393         311         228         236
        THP splits                                        25          13           7           8
        THP fault fallback                                 0           0           0           0
        THP collapse fail                                  3           5           7           7
        Compaction stalls                                865        1270        1422        1518
        Compaction success                               370         401         353         383
        Compaction failures                              495         869        1069        1135
        Compaction pages moved                        870155     3828868     4036106     4423626
        Compaction move failure                        26429       23865       29742       27514
      
      Success rates are completely hosed for 3.4-rc2, which is almost certainly
      due to commit fe2c2a10 ("vmscan: reclaim at order 0 when compaction
      is enabled").  I expected this would happen for kswapd and impair
      allocation success rates (https://lkml.org/lkml/2012/1/25/166) but I did
      not anticipate this much of a difference: 80% less scanning, 37% less
      reclaim by kswapd.
      
      In comparison, reclaim/compaction is not aggressive and gives up easily,
      which is the intended behaviour.  hugetlbfs uses __GFP_REPEAT and would
      be much more aggressive about reclaim/compaction than THP allocations
      are.  The stress test above allocates like neither THP nor hugetlbfs,
      but is much closer to THP.
      
      Mainline is now impaired in terms of high order allocation under heavy
      load although I do not know to what degree as I did not test with
      __GFP_REPEAT.  Keep this in mind for bugs related to hugepage pool
      resizing, THP allocation and high order atomic allocation failures from
      network devices.
      
      In terms of congestion throttling, I see the following for this test
      
        FTrace Reclaim Statistics: congestion_wait
        Direct number congest     waited                 3          0          0          0
        Direct time   congest     waited               0ms        0ms        0ms        0ms
        Direct full   congest     waited                 0          0          0          0
        Direct number conditional waited               957        512       1081       1075
        Direct time   conditional waited               0ms        0ms        0ms        0ms
        Direct full   conditional waited                 0          0          0          0
        KSwapd number congest     waited                36          4          3          5
        KSwapd time   congest     waited            3148ms      400ms      300ms      500ms
        KSwapd full   congest     waited                30          4          3          5
        KSwapd number conditional waited             88514        197        332        542
        KSwapd time   conditional waited            4980ms        0ms        0ms        0ms
        KSwapd full   conditional waited                49          0          0          0
      
      The "conditional waited" times are the most interesting as this is
      directly impacted by the number of dirty pages encountered during scan.
      As lumpy reclaim is no longer scanning contiguous ranges, it is finding
      fewer dirty pages.  This brings wait times from about 5 seconds to 0.
      kswapd itself is still calling congestion_wait(), so it will still stall,
      but a lot less.
      
      In terms of the type of IO we were doing, I see this
      
        FTrace Reclaim Statistics: mm_vmscan_writepage
        Direct writes anon  sync                         0          0          0          0
        Direct writes anon  async                        0          0          0          0
        Direct writes file  sync                         0          0          0          0
        Direct writes file  async                        0          0          0          0
        Direct writes mixed sync                         0          0          0          0
        Direct writes mixed async                        0          0          0          0
        KSwapd writes anon  sync                         0          0          0          0
        KSwapd writes anon  async                    91682          0          0          0
        KSwapd writes file  sync                         0          0          0          0
        KSwapd writes file  async                   822629          0          0          0
        KSwapd writes mixed sync                         0          0          0          0
        KSwapd writes mixed async                        0          0          0          0
      
      In 3.2, kswapd was doing a bunch of async writes of pages but
      reclaim/compaction was never reaching a point where it was doing sync
      IO.  This does not guarantee that reclaim/compaction was not calling
      wait_on_page_writeback() but I would consider it unlikely.  It indicates
      that merging patches 2 and 3 to stop reclaim/compaction calling
      wait_on_page_writeback() should be safe.
      
      This patch:
      
      Lumpy reclaim had a purpose but in the mind of some, it was to kick the
      system so hard it thrashed.  For others the purpose was to complicate
      vmscan.c.  Over time it was giving softer shoes and a nicer attitude but
      memory compaction needs to step up and replace it so this patch sends
      lumpy reclaim to the farm.
      
      The tracepoint format changes for isolating LRU pages with this patch
      applied.  Furthermore reclaim/compaction can no longer queue dirty pages
      in pageout() if the underlying BDI is congested.  Lumpy reclaim used
      this logic and reclaim/compaction was using it in error.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ying Han <yinghan@google.com>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c53919ad
    • mm: remove swap token code · e709ffd6
      Committed by Rik van Riel
      The swap token code no longer fits in with the current VM model.  It
      does not play well with cgroups or the better NUMA placement code in
      development, since we have only one swap token globally.
      
      It also has the potential to mess with scalability of the system, by
      increasing the number of non-reclaimable pages on the active and
      inactive anon LRU lists.
      
      Last but not least, the swap token code has been broken for a year
      without complaints, as reported by Konstantin Khlebnikov.  This suggests
      we no longer have much use for it.
      
      The days of sub-1G memory systems with heavy use of swap are over.  If
      we ever need thrashing reducing code in the future, we will have to
      implement something that does scale.
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Hugh Dickins <hughd@google.com>
      Acked-by: Bob Picco <bpicco@meloft.net>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e709ffd6
  16. 16 May, 2012 4 commits
  17. 10 May, 2012 1 commit
    • rcu: Make RCU_FAST_NO_HZ handle timer migration · 21e52e15
      Committed by Paul E. McKenney
      The current RCU_FAST_NO_HZ assumes that timers do not migrate unless a
      CPU goes offline, in which case it assumes that the CPU will have to come
      out of dyntick-idle mode (cancelling the timer) in order to go offline.
      This is important because when RCU_FAST_NO_HZ permits a CPU to enter
      dyntick-idle mode despite having RCU callbacks pending, it posts a timer
      on that CPU to force a wakeup on that CPU.  This wakeup ensures that the
      CPU will eventually handle the end of the grace period, including invoking
      its RCU callbacks.
      
      However, Pascal Chapperon's test setup shows that the timer handler
      rcu_idle_gp_timer_func() really does get invoked in some cases: the timer
      can migrate and fire on a different CPU, so the intended CPU never
      receives its wakeup.  This is problematic because it can cause the CPU
      that entered dyntick-idle mode despite still having RCU callbacks pending
      to remain in dyntick-idle mode indefinitely, which means that its RCU callbacks might
      never be invoked.  This situation can result in grace-period delays or
      even system hangs, which matches Pascal's observations of slow boot-up
      and shutdown (https://lkml.org/lkml/2012/4/5/142).  See also the bugzilla:
      
      	https://bugzilla.redhat.com/show_bug.cgi?id=806548
      
      This commit therefore causes the "should never be invoked" timer handler
      rcu_idle_gp_timer_func() to use smp_call_function_single() to wake up
      the CPU for which the timer was intended, allowing that CPU to invoke
      its RCU callbacks in a timely manner.
      Reported-by: Pascal Chapperon <pascal.chapperon@wanadoo.fr>
      Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      21e52e15
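
      The shape of the fix is roughly the following; the handler name, its signature and the do_nothing() helper are illustrative assumptions, not the verbatim patch.

      /* Sketch: the IPI itself is the wakeup; the callback has nothing to do. */
      static void do_nothing(void *unused)
      {
      }

      /* Timer handler: it may run on a CPU other than the one the timer was
       * posted for, so explicitly kick the intended CPU out of dyntick-idle. */
      static void rcu_idle_gp_timer_func_sketch(unsigned long cpu_in)
      {
              int cpu = (int)cpu_in;

              smp_call_function_single(cpu, do_nothing, NULL, 0);
      }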
  18. 06 May, 2012 1 commit
  19. 02 May, 2012 1 commit
  20. 25 Apr, 2012 1 commit
  21. 23 Apr, 2012 1 commit
    • ASoC: dapm: Fix x86_64 build warning. · c97f3bdd
      Committed by Liam Girdwood
      Fixes the following build warning on x86_64.
      
      In file included from include/trace/ftrace.h:567:0,
                       from include/trace/define_trace.h:86,
                       from include/trace/events/asoc.h:410,
                       from sound/soc/soc-core.c:45:
      include/trace/events/asoc.h: In function 'ftrace_raw_event_snd_soc_dapm_output_path':
      include/trace/events/asoc.h:246:1: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]
      include/trace/events/asoc.h: In function 'ftrace_raw_event_snd_soc_dapm_input_path':
      include/trace/events/asoc.h:275:1: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]
      Signed-off-by: Liam Girdwood <lrg@ti.com>
      Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
      c97f3bdd
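
      The warning class is easy to reproduce outside the kernel; the snippet below is a generic userspace illustration of why casting a pointer through a pointer-sized integer (long) rather than int avoids it on x86_64. It is not the actual asoc.h change.

      #include <stdio.h>

      int main(void)
      {
              char buf[16];
              char *p = buf;

              /* int id = (int)p;   <- "cast from pointer to integer of different size" */
              long id = (long)p;    /* long is pointer-sized on x86_64: no warning */

              printf("%ld\n", id);
              return 0;
      }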
  22. 19 Apr, 2012 1 commit
  23. 11 Apr, 2012 1 commit
    • jbd: Refine commit writeout logic · 2db938be
      Committed by Jan Kara
      Currently we write out all journal buffers in WRITE_SYNC mode. This improves
      performance for fsync-heavy workloads but hinders performance when writes
      are mostly asynchronous: most noticeably it slows down readers, and users
      complain about slow desktop response, etc.
      
      So submit writes as asynchronous in the normal case, and only submit writes as
      WRITE_SYNC if we detect that someone is waiting for the current transaction commit.
      
      I've gathered some numbers to back this change. The first is the read latency
      test. It measures the time to read 1 MB after several seconds of sleeping in
      the presence of streaming writes.
      
      Top 10 times (out of 90) in us:
      Before		After
      2131586		697473
      1709932		557487
      1564598		535642
      1480462		347573
      1478579		323153
      1408496		222181
      1388960		181273
      1329565		181070
      1252486		172832
      1223265		172278
      
      Average:
      619377		82180
      
      So the improvement in both maximum and average latency is massive.
      
      I've measured fsync throughput by:
      fs_mark -n 100 -t 1 -s 16384 -d /mnt/fsync/ -S 1 -L 4
      
      in the presence of a streaming reader. The numbers (fsyncs/s) are:
      Before		After
      9.9		6.3
      6.8		6.0
      6.3		6.2
      5.8		6.1
      
      So fsync performance seems unharmed by this change.
      Signed-off-by: Jan Kara <jack@suse.cz>
      2db938be
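
      A minimal sketch of the submit-mode decision described above; the predicate name is illustrative, while WRITE and WRITE_SYNC are the real block-layer request flags.

      /* Sketch: default to plain async writes so streaming writers do not
       * starve readers; upgrade to WRITE_SYNC only when a task is actually
       * waiting for this transaction to commit. */
      static int commit_write_flags_sketch(int someone_waits_for_commit)
      {
              return someone_waits_for_commit ? WRITE_SYNC : WRITE;
      }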
  24. 10 Apr, 2012 1 commit
  25. 31 Mar, 2012 1 commit
  26. 22 Mar, 2012 1 commit
  27. 16 Mar, 2012 1 commit
    • device.h: audit and cleanup users in main include dir · 313162d0
      Committed by Paul Gortmaker
      The <linux/device.h> header includes a lot of stuff, and
      it in turn gets a lot of use just for the basic "struct device"
      which appears so often.
      
      Clean up the users as follows:
      
      1) For those headers only needing "struct device" as a pointer
      in function args, replace the include with exactly that.
      
      2) For headers not really using anything from device.h, simply
      delete the include altogether.
      
      3) For headers relying on getting device.h implicitly before
      being included themselves, now explicitly include device.h
      
      4) For files in which doing #1 or #2 uncovers an implicit
      dependency on some other header, fix by explicitly adding
      the required header(s).
      
      Any C files that were implicitly relying on device.h to be
      present have already been dealt with in advance.
      
      Total removals from #1 and #2: 51.  Total additions coming
      from #3: 9.  Total other implicit dependencies from #4: 7.
      
      As of 3.3-rc1, there were 110, so a net removal of 42 gives
      about a 38% reduction in device.h presence in include/*
      Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
      313162d0
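
      As an illustration of change #1 above, a hypothetical header that only passes struct device through pointer arguments can drop the include entirely:

      /* Hypothetical header, before: */
      /* #include <linux/device.h> */

      /* After: a forward declaration is enough for pointer-only usage. */
      struct device;

      int foo_probe(struct device *dev);      /* illustrative prototypes */
      int foo_remove(struct device *dev);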
  28. 14 Mar, 2012 2 commits
    • jbd2: issue cache flush after checkpointing even with internal journal · 79feb521
      Committed by Jan Kara
      When we reach jbd2_cleanup_journal_tail(), there is no guarantee that
      checkpointed buffers are on stable storage - especially if buffers were
      written out by jbd2_log_do_checkpoint(), they are likely to be only in the
      disk's cache. Thus when we update the journal superblock, effectively removing
      the old transaction from the journal, this superblock write can reach stable
      storage before those checkpointed buffers, which can result in filesystem
      corruption after a crash. Thus we must unconditionally issue a cache flush
      before we update the journal superblock in these cases.
      
      A similar problem can also occur if the journal superblock is written only to
      the disk's cache, another transaction starts reusing the space of the transaction
      cleaned from the log, and a power failure happens. A subsequent journal replay
      would still try to replay the old transaction, but some of its blocks may already
      be overwritten by the new transaction. For this reason we must use WRITE_FUA when
      updating the log tail, and we must first write the new log tail to disk and only
      then update the in-memory information.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      79feb521
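
      The required ordering can be sketched as follows; flush_device_cache() and write_sb_log_tail() are hypothetical stand-ins for the real jbd2 helpers, while WRITE_FUA is the real block-layer flag.

      struct journal_sketch {
              unsigned int    tail_sequence;  /* oldest transaction still in the log */
              unsigned long   tail_block;
      };

      static void update_log_tail_sketch(struct journal_sketch *journal,
                                         unsigned int new_sequence,
                                         unsigned long new_block)
      {
              /* 1. Checkpointed buffers may only sit in the disk cache, so flush
               *    it before the superblock update can drop the old transaction. */
              flush_device_cache(journal);

              /* 2. Write the new log tail with FUA so it cannot linger in the
               *    disk cache while the freed journal space gets reused. */
              write_sb_log_tail(journal, new_sequence, new_block, WRITE_FUA);

              /* 3. Only now update the in-memory tail, so journal space is not
               *    reused before the on-disk tail is durable. */
              journal->tail_sequence = new_sequence;
              journal->tail_block    = new_block;
      }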
    • jbd2: split updating of journal superblock and marking journal empty · 24bcc89c
      Committed by Jan Kara
      There are three cases of updating the journal superblock. In the first case we
      want to mark the journal as empty (setting s_sequence to 0), in the second case
      we want to update the log tail, and in the third case we want to update s_errno.
      Split these cases into separate functions. This makes the code slightly more
      straightforward, and later patches will make the distinction even more important.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      24bcc89c
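
      In effect, one multipurpose update routine becomes three narrowly scoped ones; the prototypes below sketch that split (names approximate the functions this patch introduces, signatures simplified).

      void jbd2_mark_journal_empty(journal_t *journal);            /* case 1: s_sequence = 0 */
      void jbd2_journal_update_sb_log_tail(journal_t *journal);    /* case 2: new log tail   */
      void jbd2_journal_update_sb_errno(journal_t *journal);       /* case 3: record s_errno */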
  29. 24 Feb, 2012 1 commit
  30. 23 Feb, 2012 1 commit
    • tracepoint, vfs, sched: Add exec() tracepoint · 4ff16c25
      Committed by David Smith
      Added a minimal exec tracepoint. Exec is an important major event
      in the life of a task, like fork(), clone() or exit(), all of
      which we already trace.
      
      [ We also do scheduling re-balancing during exec() - so it's useful
        from a scheduler instrumentation POV as well. ]
      
      If you want to watch a task start up, when it gets exec'ed is a good place
      to start.  With the addition of this tracepoint, execs can be monitored
      and a better picture of general system activity can be obtained. This
      tracepoint will also enable better process life tracking, allowing you to
      answer questions like "what process keeps starting up binary X?".
      
      This tracepoint can also be useful in ftrace filtering and trigger
      conditions: i.e. starting or stopping filtering when exec is called.
      Signed-off-by: David Smith <dsmith@redhat.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/4F314D19.7030504@redhat.com
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      4ff16c25
  31. 22 Feb, 2012 2 commits
    • sched/events: Revert trace_sched_stat_sleeptime() · 8c79a045
      Committed by Peter Zijlstra
      Commit 1ac9bc69 ("sched/tracing: Add a new tracepoint for sleeptime")
      added a new sched:sched_stat_sleeptime tracepoint.
      
      It's broken: the first sample we get on a task might be bad because
      of a stale sleep_start value that wasn't reset at the last task switch
      because the tracepoint was not active.
      
      It also breaks the existing schedstat samples due to the side
      effects of:
      
      -               se->statistics.sleep_start = 0;
      ...
      -               se->statistics.block_start = 0;
      
      Nor do I see a way to fix it without adding overhead to the scheduler
      fast path, which I'm not willing to do for the sake of redundant
      instrumentation.
      
      Most importantly, sleep time information can already be constructed
      by tracing context switches and wakeups, and taking the timestamp
      difference between the schedule-out, the wakeup and the schedule-in.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Andrew Vagin <avagin@openvz.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Link: http://lkml.kernel.org/n/tip-pc4c9qhl8q6vg3bs4j6k0rbd@git.kernel.org
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      8c79a045
    • rcu: Avoid waking up CPUs having only kfree_rcu() callbacks · 486e2593
      Committed by Paul E. McKenney
      When CONFIG_RCU_FAST_NO_HZ is enabled, RCU will allow a given CPU to
      enter dyntick-idle mode even if it still has RCU callbacks queued.
      RCU avoids system hangs in this case by scheduling a timer for several
      jiffies in the future.  However, if all of the callbacks on that CPU
      are from kfree_rcu(), there is no reason to wake the CPU up, as it is
      not a problem to defer freeing of memory.
      
      This commit therefore tracks the number of callbacks on a given CPU
      that are from kfree_rcu(), and avoids scheduling the timer if all of
      a given CPU's callbacks are from kfree_rcu().
      Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      486e2593
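
      The core check amounts to something like the sketch below; the structure and field names are illustrative stand-ins for the per-CPU callback counts, not the exact rcu_data members.

      /* Sketch: per-CPU callback bookkeeping. */
      struct rcu_cpu_sketch {
              long    qlen;           /* total callbacks queued on this CPU */
              long    qlen_lazy;      /* of those, posted via kfree_rcu()   */
      };

      /* Arm the dyntick-idle wakeup timer only if at least one callback does
       * more than just free memory. */
      static int needs_wakeup_timer(const struct rcu_cpu_sketch *rcp)
      {
              return rcp->qlen != rcp->qlen_lazy;
      }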