1. 30 May, 2012 8 commits
    • K
      bug: introduce BUILD_BUG_ON_INVALID() macro · baf05aa9
      Konstantin Khlebnikov committed
      Sometimes we want to check the correctness of an expression at compile
      time.  "(void)(e);" or "if (e);" can be dangerous if the expression has
      side-effects, and gcc sometimes generates a lot of code even if the
      expression has no effect.
      
      This patch introduces the macro BUILD_BUG_ON_INVALID() for such checks.
      It forces a compilation error if the expression is invalid, without
      generating any extra code.
      
      [Cast to "long" required because sizeof does not work for bit-fields.]
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      baf05aa9
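The idea in the message above can be tried in userspace. This is a minimal sketch, assuming the macro is built from `sizeof` and a cast to `long` as the message describes; the kernel version additionally carries a sparse annotation on the cast, which is omitted here. The names `sketch_bump`, `sketch_counter`, `sketch_flags` and `sketch_demo` are hypothetical and exist only for illustration.

```c
#include <assert.h>

/* sizeof does not evaluate its operand, so the expression is only
 * type-checked.  The cast to long is what lets bit-fields through. */
#define BUILD_BUG_ON_INVALID(e) ((void)(sizeof((long)(e))))

struct sketch_flags {
	unsigned int dirty : 1;	/* bit-field: sizeof(f.dirty) alone is invalid */
};

int sketch_counter;		/* hypothetical side-effect target */

int sketch_bump(void)
{
	return ++sketch_counter;
}

int sketch_demo(void)
{
	struct sketch_flags f = { 0 };

	BUILD_BUG_ON_INVALID(sketch_bump());	/* checked, never called */
	BUILD_BUG_ON_INVALID(f.dirty);		/* accepted thanks to the cast */
	(void)f;

	assert(sketch_counter == 0);		/* no side effect happened */
	return sketch_counter;
}
```

Replacing the argument with something that does not compile, such as a reference to an undeclared variable, should turn into a build error, which is the whole point of the macro.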
    • J
      mm: memcg: count pte references from every member of the reclaimed hierarchy · c3ac9a8a
      Johannes Weiner committed
      The rmap walker checking page table references has historically ignored
      references from VMAs that were not part of the memcg that was being
      reclaimed during memcg hard limit reclaim.
      
      When transitioning global reclaim to memcg hierarchy reclaim, I missed
      that bit and now references from outside a memcg are ignored even during
      global reclaim.
      
      Reverting back to traditional behaviour - count all references during
      global reclaim and only mind references of the memcg being reclaimed
      during limit reclaim would be one option.
      
      However, the more generic idea is to ignore references exactly when
      they are outside the hierarchy that is currently under reclaim, because
      only then will their reclamation be of any use to help the pressure
      situation.  It makes no sense to ignore references from a sibling memcg
      and then evict a page that will be immediately refaulted by that sibling,
      which contributes to the same usage of the common ancestor under
      reclaim.
      
      The solution: make the rmap walker ignore references from VMAs that are
      not part of the hierarchy that is being reclaimed.
      
      Flat limit reclaim will stay the same, hierarchical limit reclaim will
      mind the references only to pages that the hierarchy owns.  Global
      reclaim, since it reclaims from all memcgs, will be fixed to regard all
      references.
      
      [akpm@linux-foundation.org: name the args in the declaration]
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reported-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Acked-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c3ac9a8a
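The rule described above (count a reference only if the referencing VMA's memcg lies inside the hierarchy under reclaim, and count everything during global reclaim) comes down to an ancestor walk. The following is a rough sketch under that reading, not the kernel's code: `struct sketch_memcg`, `sketch_in_hierarchy()` and the example tree are all hypothetical, and a NULL root is an assumed convention standing for global reclaim.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a memory cgroup: only the parent link matters. */
struct sketch_memcg {
	struct sketch_memcg *parent;
};

/* True if memcg is root itself or one of its descendants.
 * root == NULL is taken to mean global reclaim: every reference counts. */
bool sketch_in_hierarchy(const struct sketch_memcg *root,
			 const struct sketch_memcg *memcg)
{
	if (!root)
		return true;
	for (; memcg; memcg = memcg->parent)
		if (memcg == root)
			return true;
	return false;
}

/* Example tree:  top -> a -> { b, c },  top -> d */
struct sketch_memcg sk_top = { NULL };
struct sketch_memcg sk_a = { &sk_top };
struct sketch_memcg sk_b = { &sk_a };
struct sketch_memcg sk_c = { &sk_a };
struct sketch_memcg sk_d = { &sk_top };

void sketch_selfcheck(void)
{
	/* Reclaiming hierarchy "a": sibling references inside it count ... */
	assert(sketch_in_hierarchy(&sk_a, &sk_b));
	assert(sketch_in_hierarchy(&sk_a, &sk_c));
	/* ... references from outside do not ... */
	assert(!sketch_in_hierarchy(&sk_a, &sk_d));
	/* ... and global reclaim regards all references. */
	assert(sketch_in_hierarchy(NULL, &sk_d));
}
```

With a flat limit (a memcg without children) the walk degenerates to a single pointer comparison, which matches the statement that flat limit reclaim stays the same.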
    • A
      mm: do_migrate_pages(): rename arguments · 0ce72d4f
      Andrew Morton committed
      s/from_nodes/from and s/to_nodes/to/.  The "_nodes" is redundant - it
      duplicates the argument's type.
      
      Done in a fit of irritation over 80-col issues :(
      
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <mkosaki@redhat.com>
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0ce72d4f
    • M
      mm: vmscan: remove reclaim_mode_t · 23b9da55
      Mel Gorman committed
      There is little motivation for reclaim_mode_t once RECLAIM_MODE_[A]SYNC
      and lumpy reclaim have been removed.  This patch gets rid of
      reclaim_mode_t as well and improves the documentation about what
      reclaim/compaction is and when it is triggered.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ying Han <yinghan@google.com>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      23b9da55
    • M
      mm: vmscan: do not stall on writeback during memory compaction · 41ac1999
      Mel Gorman committed
      This patch stops reclaim/compaction entering sync reclaim, as this was
      only intended for lumpy reclaim and was an oversight.  Page migration has
      its own logic for stalling on writeback pages if necessary and memory
      compaction is already using it.
      
      Waiting on page writeback is bad for a number of reasons but the primary
      one is that waiting on writeback to a slow device like USB can take a
      considerable length of time.  Page reclaim instead uses
      wait_iff_congested() to throttle if too many dirty pages are being
      scanned.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ying Han <yinghan@google.com>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      41ac1999
    • M
      mm: vmscan: remove lumpy reclaim · c53919ad
      Mel Gorman committed
      This series removes lumpy reclaim and some stalling logic that was
      unintentionally being used by memory compaction.  The end result is that
      stalling on dirty pages during page reclaim now depends on
      wait_iff_congested().
      
      Four kernels were compared
      
        3.3.0     vanilla
        3.4.0-rc2 vanilla
        3.4.0-rc2 lumpyremove-v2 is patch one from this series
        3.4.0-rc2 nosync-v2r3 is the full series
      
      Removing lumpy reclaim saves almost 900 bytes of text whereas the full
      series removes 1200 bytes.
      
           text     data      bss       dec     hex  filename
        6740375  1927944  2260992  10929311  a6c49f  vmlinux-3.4.0-rc2-vanilla
        6739479  1927944  2260992  10928415  a6c11f  vmlinux-3.4.0-rc2-lumpyremove-v2
        6739159  1927944  2260992  10928095  a6bfdf  vmlinux-3.4.0-rc2-nosync-v2
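The "almost 900 bytes" and "1200 bytes" figures can be checked against the size table. The sketch below takes the vanilla text size as 6740375, the value for which text + data + bss equals the listed dec total of 10929311. The struct and function names are hypothetical.

```c
#include <assert.h>

/* One row of size(1) output: text, data, bss and their decimal sum. */
struct sketch_size {
	long text, data, bss, dec;
};

const struct sketch_size sk_vanilla     = { 6740375, 1927944, 2260992, 10929311 };
const struct sketch_size sk_lumpyremove = { 6739479, 1927944, 2260992, 10928415 };
const struct sketch_size sk_nosync      = { 6739159, 1927944, 2260992, 10928095 };

/* A row is self-consistent when the three sections add up to dec. */
int sketch_size_consistent(struct sketch_size s)
{
	return s.text + s.data + s.bss == s.dec;
}

/* Bytes of text removed going from one build to another. */
long sketch_text_saved(struct sketch_size before, struct sketch_size after)
{
	return before.text - after.text;
}

void sketch_selfcheck(void)
{
	assert(sketch_size_consistent(sk_vanilla));
	assert(sketch_text_saved(sk_vanilla, sk_lumpyremove) == 896);	/* "almost 900" */
	assert(sketch_text_saved(sk_vanilla, sk_nosync) == 1216);	/* "1200" */
}
```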
      
      There are behaviour changes in the series and so tests were run with
      monitoring of ftrace events.  This disrupts results so the performance
      results are distorted but the new behaviour should be clearer.
      
      fs-mark running in a threaded configuration showed little of interest as
      it did not push reclaim aggressively
      
        FS-Mark Multi Threaded
                                3.3.0-vanilla       rc2-vanilla       lumpyremove-v2r3       nosync-v2r3
        Files/s  min           3.20 ( 0.00%)        3.20 ( 0.00%)        3.20 ( 0.00%)        3.20 ( 0.00%)
        Files/s  mean          3.20 ( 0.00%)        3.20 ( 0.00%)        3.20 ( 0.00%)        3.20 ( 0.00%)
        Files/s  stddev        0.00 ( 0.00%)        0.00 ( 0.00%)        0.00 ( 0.00%)        0.00 ( 0.00%)
        Files/s  max           3.20 ( 0.00%)        3.20 ( 0.00%)        3.20 ( 0.00%)        3.20 ( 0.00%)
        Overhead min      508667.00 ( 0.00%)   521350.00 (-2.49%)   544292.00 (-7.00%)   547168.00 (-7.57%)
        Overhead mean     551185.00 ( 0.00%)   652690.73 (-18.42%)   991208.40 (-79.83%)   570130.53 (-3.44%)
        Overhead stddev    18200.69 ( 0.00%)   331958.29 (-1723.88%)  1579579.43 (-8578.68%)     9576.81 (47.38%)
        Overhead max      576775.00 ( 0.00%)  1846634.00 (-220.17%)  6901055.00 (-1096.49%)   585675.00 (-1.54%)
        MMTests Statistics: duration
        Sys Time Running Test (seconds)             309.90    300.95    307.33    298.95
        User+Sys Time Running Test (seconds)        319.32    309.67    315.69    307.51
        Total Elapsed Time (seconds)               1187.85   1193.09   1191.98   1193.73
      
        MMTests Statistics: vmstat
        Page Ins                                       80532       82212       81420       79480
        Page Outs                                  111434984   111456240   111437376   111582628
        Swap Ins                                           0           0           0           0
        Swap Outs                                          0           0           0           0
        Direct pages scanned                           44881       27889       27453       34843
        Kswapd pages scanned                        25841428    25860774    25861233    25843212
        Kswapd pages reclaimed                      25841393    25860741    25861199    25843179
        Direct pages reclaimed                         44881       27889       27453       34843
        Kswapd efficiency                                99%         99%         99%         99%
        Kswapd velocity                            21754.791   21675.460   21696.029   21649.127
        Direct efficiency                               100%        100%        100%        100%
        Direct velocity                               37.783      23.375      23.031      29.188
        Percentage direct scans                           0%          0%          0%          0%
      
      ftrace showed that there was no stalling on writeback or pages submitted
      for IO from reclaim context.
      
      postmark was similar and while it was more interesting, it also did not
      push reclaim heavily.
      
        POSTMARK
                                             3.3.0-vanilla       rc2-vanilla  lumpyremove-v2r3       nosync-v2r3
        Transactions per second:               16.00 ( 0.00%)    20.00 (25.00%)    18.00 (12.50%)    17.00 ( 6.25%)
        Data megabytes read per second:        18.80 ( 0.00%)    24.27 (29.10%)    22.26 (18.40%)    20.54 ( 9.26%)
        Data megabytes written per second:     35.83 ( 0.00%)    46.25 (29.08%)    42.42 (18.39%)    39.14 ( 9.24%)
        Files created alone per second:        28.00 ( 0.00%)    38.00 (35.71%)    34.00 (21.43%)    30.00 ( 7.14%)
        Files create/transact per second:       8.00 ( 0.00%)    10.00 (25.00%)     9.00 (12.50%)     8.00 ( 0.00%)
        Files deleted alone per second:       556.00 ( 0.00%)  1224.00 (120.14%)  3062.00 (450.72%)  6124.00 (1001.44%)
        Files delete/transact per second:       8.00 ( 0.00%)    10.00 (25.00%)     9.00 (12.50%)     8.00 ( 0.00%)
      
        MMTests Statistics: duration
        Sys Time Running Test (seconds)             113.34    107.99    109.73    108.72
        User+Sys Time Running Test (seconds)        145.51    139.81    143.32    143.55
        Total Elapsed Time (seconds)               1159.16    899.23    980.17   1062.27
      
        MMTests Statistics: vmstat
        Page Ins                                    13710192    13729032    13727944    13760136
        Page Outs                                   43071140    42987228    42733684    42931624
        Swap Ins                                           0           0           0           0
        Swap Outs                                          0           0           0           0
        Direct pages scanned                               0           0           0           0
        Kswapd pages scanned                         9941613     9937443     9939085     9929154
        Kswapd pages reclaimed                       9940926     9936751     9938397     9928465
        Direct pages reclaimed                             0           0           0           0
        Kswapd efficiency                                99%         99%         99%         99%
        Kswapd velocity                             8576.567   11051.058   10140.164    9347.109
        Direct efficiency                               100%        100%        100%        100%
        Direct velocity                                0.000       0.000       0.000       0.000
      
      It looks here as if the full series regresses performance, but as
      ftrace showed no usage of wait_iff_congested() or sync reclaim, I am
      assuming it's a disruption due to monitoring.  Other data such as memory
      usage, page IO and swap IO all looked similar.
      
      Running a benchmark with a plain DD showed nothing very interesting.
      The full series stalled in wait_iff_congested() slightly less but stall
      times on vanilla kernels were marginal.
      
      Running a benchmark that hammered on file-backed mappings showed stalls
      due to congestion but not in sync writebacks
      
        MICRO
                                             3.3.0-vanilla       rc2-vanilla  lumpyremove-v2r3       nosync-v2r3
        MMTests Statistics: duration
        Sys Time Running Test (seconds)             308.13    294.50    298.75    299.53
        User+Sys Time Running Test (seconds)        330.45    316.28    318.93    320.79
        Total Elapsed Time (seconds)               1814.90   1833.88   1821.14   1832.91
      
        MMTests Statistics: vmstat
        Page Ins                                      108712      120708       97224      110344
        Page Outs                                  155514576   156017404   155813676   156193256
        Swap Ins                                           0           0           0           0
        Swap Outs                                          0           0           0           0
        Direct pages scanned                         2599253     1550480     2512822     2414760
        Kswapd pages scanned                        69742364    71150694    68839041    69692533
        Kswapd pages reclaimed                      34824488    34773341    34796602    34799396
        Direct pages reclaimed                         53693       94750       61792       75205
        Kswapd efficiency                                49%         48%         50%         49%
        Kswapd velocity                            38427.662   38797.901   37799.972   38022.889
        Direct efficiency                                 2%          6%          2%          3%
        Direct velocity                             1432.174     845.464    1379.807    1317.446
        Percentage direct scans                           3%          2%          3%          3%
        Page writes by reclaim                             0           0           0           0
        Page writes file                                   0           0           0           0
        Page writes anon                                   0           0           0           0
        Page reclaim immediate                             0           0           0        1218
        Page rescued immediate                             0           0           0           0
        Slabs scanned                                  15360       16384       13312       16384
        Direct inode steals                                0           0           0           0
        Kswapd inode steals                             4340        4327        1630        4323
      
        FTrace Reclaim Statistics: congestion_wait
        Direct number congest     waited                 0          0          0          0
        Direct time   congest     waited               0ms        0ms        0ms        0ms
        Direct full   congest     waited                 0          0          0          0
        Direct number conditional waited               900        870        754        789
        Direct time   conditional waited               0ms        0ms        0ms       20ms
        Direct full   conditional waited                 0          0          0          0
        KSwapd number congest     waited              2106       2308       2116       1915
        KSwapd time   congest     waited          139924ms   157832ms   125652ms   132516ms
        KSwapd full   congest     waited              1346       1530       1202       1278
        KSwapd number conditional waited             12922      16320      10943      14670
        KSwapd time   conditional waited               0ms        0ms        0ms        0ms
        KSwapd full   conditional waited                 0          0          0          0
      
      Reclaim statistics are not radically changed.  The stall times in kswapd
      are massive but it is clear that it is due to calls to congestion_wait()
      and that is almost certainly the call in balance_pgdat().  Otherwise
      stalls due to dirty pages are non-existent.
      
      I ran a benchmark that stressed high-order allocation.  This is very
      artificial load but was used in the past to evaluate lumpy reclaim and
      compaction.  Generally I look at allocation success rates and latency
      figures.
      
        STRESS-HIGHALLOC
                         3.3.0-vanilla       rc2-vanilla  lumpyremove-v2r3       nosync-v2r3
        Pass 1          81.00 ( 0.00%)    28.00 (-53.00%)    24.00 (-57.00%)    28.00 (-53.00%)
        Pass 2          82.00 ( 0.00%)    39.00 (-43.00%)    38.00 (-44.00%)    43.00 (-39.00%)
        while Rested    88.00 ( 0.00%)    87.00 (-1.00%)    88.00 ( 0.00%)    88.00 ( 0.00%)
      
        MMTests Statistics: duration
        Sys Time Running Test (seconds)             740.93    681.42    685.14    684.87
        User+Sys Time Running Test (seconds)       2922.65   3269.52   3281.35   3279.44
        Total Elapsed Time (seconds)               1161.73   1152.49   1159.55   1161.44
      
        MMTests Statistics: vmstat
        Page Ins                                     4486020     2807256     2855944     2876244
        Page Outs                                    7261600     7973688     7975320     7986120
        Swap Ins                                       31694           0           0           0
        Swap Outs                                      98179           0           0           0
        Direct pages scanned                           53494       57731       34406      113015
        Kswapd pages scanned                         6271173     1287481     1278174     1219095
        Kswapd pages reclaimed                       2029240     1281025     1260708     1201583
        Direct pages reclaimed                          1468       14564       16649       92456
        Kswapd efficiency                                32%         99%         98%         98%
        Kswapd velocity                             5398.133    1117.130    1102.302    1049.641
        Direct efficiency                                 2%         25%         48%         81%
        Direct velocity                               46.047      50.092      29.672      97.306
        Percentage direct scans                           0%          4%          2%          8%
        Page writes by reclaim                       1616049           0           0           0
        Page writes file                             1517870           0           0           0
        Page writes anon                               98179           0           0           0
        Page reclaim immediate                        103778       27339        9796       17831
        Page rescued immediate                             0           0           0           0
        Slabs scanned                                1096704      986112      980992      998400
        Direct inode steals                              223      215040      216736      247881
        Kswapd inode steals                           175331       61548       68444       63066
        Kswapd skipped wait                            21991           0           1           0
        THP fault alloc                                    1         135         125         134
        THP collapse alloc                               393         311         228         236
        THP splits                                        25          13           7           8
        THP fault fallback                                 0           0           0           0
        THP collapse fail                                  3           5           7           7
        Compaction stalls                                865        1270        1422        1518
        Compaction success                               370         401         353         383
        Compaction failures                              495         869        1069        1135
        Compaction pages moved                        870155     3828868     4036106     4423626
        Compaction move failure                        26429       23865       29742       27514
      
      Success rates are completely hosed for 3.4-rc2 which is almost certainly
      due to commit fe2c2a10 ("vmscan: reclaim at order 0 when compaction
      is enabled").  I expected this would happen for kswapd and impair
      allocation success rates (https://lkml.org/lkml/2012/1/25/166) but I did
      not anticipate this much of a difference: 80% less scanning, 37% less
      reclaim by kswapd.
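The derived rows in the vmstat tables and the percentages quoted above can be reproduced from the raw counters. The formulas below are an assumption inferred from the numbers themselves (efficiency as reclaimed/scanned truncated to a whole percent, velocity as pages scanned per second of elapsed time); the function names are hypothetical.

```c
#include <assert.h>

/* Percentage of scanned pages that were reclaimed, truncated. */
long long sketch_efficiency_pct(long long reclaimed, long long scanned)
{
	if (scanned == 0)
		return 100;	/* the tables print 100% when nothing was scanned */
	return reclaimed * 100 / scanned;
}

/* Pages scanned per second of total elapsed time. */
double sketch_velocity(long long scanned, double elapsed_seconds)
{
	return (double)scanned / elapsed_seconds;
}

/* Relative drop from one kernel to another, in percent. */
double sketch_reduction_pct(long long before, long long after)
{
	return (double)(before - after) * 100.0 / (double)before;
}

void sketch_selfcheck(void)
{
	/* STRESS-HIGHALLOC, kswapd, 3.3.0-vanilla vs rc2-vanilla */
	assert(sketch_efficiency_pct(2029240, 6271173) == 32);
	assert(sketch_efficiency_pct(1281025, 1287481) == 99);

	/* about 79.5% less scanning, about 36.9% less reclaim */
	assert(sketch_reduction_pct(6271173, 1287481) > 79.0);
	assert(sketch_reduction_pct(2029240, 1281025) > 36.5);
}
```

On these inputs the drops work out to roughly 79.5% for scanning and 36.9% for reclaim, in line with the rounded "80%" and "37%" in the text.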
      
      In comparison, reclaim/compaction is not aggressive and gives up easily
      which is the intended behaviour.  hugetlbfs uses __GFP_REPEAT and would
      be much more aggressive about reclaim/compaction than THP allocations
      are.  The stress test above is allocating like neither THP nor hugetlbfs
      but is much closer to THP.
      
      Mainline is now impaired in terms of high order allocation under heavy
      load although I do not know to what degree as I did not test with
      __GFP_REPEAT.  Keep this in mind for bugs related to hugepage pool
      resizing, THP allocation and high order atomic allocation failures from
      network devices.
      
      In terms of congestion throttling, I see the following for this test
      
        FTrace Reclaim Statistics: congestion_wait
        Direct number congest     waited                 3          0          0          0
        Direct time   congest     waited               0ms        0ms        0ms        0ms
        Direct full   congest     waited                 0          0          0          0
        Direct number conditional waited               957        512       1081       1075
        Direct time   conditional waited               0ms        0ms        0ms        0ms
        Direct full   conditional waited                 0          0          0          0
        KSwapd number congest     waited                36          4          3          5
        KSwapd time   congest     waited            3148ms      400ms      300ms      500ms
        KSwapd full   congest     waited                30          4          3          5
        KSwapd number conditional waited             88514        197        332        542
        KSwapd time   conditional waited            4980ms        0ms        0ms        0ms
        KSwapd full   conditional waited                49          0          0          0
      
      The "conditional waited" times are the most interesting as this is
      directly impacted by the number of dirty pages encountered during scan.
      As lumpy reclaim is no longer scanning contiguous ranges, it is finding
      fewer dirty pages.  This brings wait times from about 5 seconds to 0.
      kswapd itself is still calling congestion_wait() so it'll still stall but
      it's a lot less.
      
      In terms of the type of IO we were doing, I see this
      
        FTrace Reclaim Statistics: mm_vmscan_writepage
        Direct writes anon  sync                         0          0          0          0
        Direct writes anon  async                        0          0          0          0
        Direct writes file  sync                         0          0          0          0
        Direct writes file  async                        0          0          0          0
        Direct writes mixed sync                         0          0          0          0
        Direct writes mixed async                        0          0          0          0
        KSwapd writes anon  sync                         0          0          0          0
        KSwapd writes anon  async                    91682          0          0          0
        KSwapd writes file  sync                         0          0          0          0
        KSwapd writes file  async                   822629          0          0          0
        KSwapd writes mixed sync                         0          0          0          0
        KSwapd writes mixed async                        0          0          0          0
      
      In 3.3, kswapd was doing a bunch of async writes of pages but
      reclaim/compaction was never reaching a point where it was doing sync
      IO.  This does not guarantee that reclaim/compaction was not calling
      wait_on_page_writeback() but I would consider it unlikely.  It indicates
      that merging patches 2 and 3 to stop reclaim/compaction calling
      wait_on_page_writeback() should be safe.
      
      This patch:
      
      Lumpy reclaim had a purpose but in the mind of some, it was to kick the
      system so hard it thrashed.  For others the purpose was to complicate
      vmscan.c.  Over time it was given softer shoes and a nicer attitude but
      memory compaction needs to step up and replace it so this patch sends
      lumpy reclaim to the farm.
      
      The tracepoint format changes for isolating LRU pages with this patch
      applied.  Furthermore reclaim/compaction can no longer queue dirty pages
      in pageout() if the underlying BDI is congested.  Lumpy reclaim used
      this logic and reclaim/compaction was using it in error.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ying Han <yinghan@google.com>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c53919ad
    • R
      mm: remove swap token code · e709ffd6
      Rik van Riel committed
      The swap token code no longer fits in with the current VM model.  It
      does not play well with cgroups or the better NUMA placement code in
      development, since we have only one swap token globally.
      
      It also has the potential to mess with scalability of the system, by
      increasing the number of non-reclaimable pages on the active and
      inactive anon LRU lists.
      
      Last but not least, the swap token code has been broken for a year
      without complaints, as reported by Konstantin Khlebnikov.  This suggests
      we no longer have much use for it.
      
      The days of sub-1G memory systems with heavy use of swap are over.  If
      we ever need thrashing reducing code in the future, we will have to
      implement something that does scale.
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Hugh Dickins <hughd@google.com>
      Acked-by: Bob Picco <bpicco@meloft.net>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e709ffd6
    • P
      pagemap.h: fix warning about possibly used before init var · af2e8409
      Paul Gortmaker committed
      Commit f56f821f ("mm: extend prefault helpers to fault in more than
      PAGE_SIZE") added in the new functions: fault_in_multipages_writeable()
      and fault_in_multipages_readable().
      
      However, we currently see:
      
        include/linux/pagemap.h:492: warning: 'ret' may be used uninitialized in this function
        include/linux/pagemap.h:492: note: 'ret' was declared here
      
      Unlike a lot of gcc nags, this one appears somewhat legit.  i.e.  passing
      in an invalid negative value of "size" does make it look like all the
      conditionals in there would be bypassed and the uninitialized value would
      be returned.
      Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      af2e8409
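The warning described above follows a common pattern: a return variable that is assigned only inside conditionals which an invalid size can skip entirely. The sketch below is a hypothetical, much simplified stand-in for a prefault helper, not the code in pagemap.h. `SKETCH_PAGE_SIZE`, `sketch_touch()` and `sketch_fault_in_readable()` are invented names, and initialising `ret` to 0 is assumed to be the kind of fix the commit applies.

```c
#include <assert.h>

#define SKETCH_PAGE_SIZE 4096	/* hypothetical page size */

/* Pretend to fault in one byte; always succeeds in this sketch. */
int sketch_touch(const char *p)
{
	(void)*(const volatile char *)p;
	return 0;
}

int sketch_fault_in_readable(const char *buf, int size)
{
	int ret = 0;	/* without this, size <= 0 would return garbage */
	int off;

	/* A negative size never enters the loop, so ret is never assigned here. */
	for (off = 0; off < size; off += SKETCH_PAGE_SIZE) {
		ret = sketch_touch(buf + off);
		if (ret)
			return ret;
	}
	return ret;
}

void sketch_selfcheck(void)
{
	assert(sketch_fault_in_readable("abc", 3) == 0);
	assert(sketch_fault_in_readable("abc", -1) == 0);	/* loop bypassed */
}
```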
  2. 27 May, 2012 3 commits
    • L
      word-at-a-time: make the interfaces truly generic · 36126f8f
      Linus Torvalds committed
      This changes the interfaces in <asm/word-at-a-time.h> to be a bit more
      complicated, but a lot more generic.
      
      In particular, it allows us to really do the operations efficiently on
      both little-endian and big-endian machines, pretty much regardless of
      machine details.  For example, if you can rely on a fast population
      count instruction on your architecture, this will allow you to make your
      optimized <asm/word-at-a-time.h> file with that.
      
      NOTE! The "generic" version in include/asm-generic/word-at-a-time.h is
      not truly generic, it actually only works on big-endian.  Why? Because
      on little-endian the generic algorithms are wasteful, since you can
      inevitably do better. The x86 implementation is an example of that.
      
      (The only truly non-generic part of the asm-generic implementation is
      the "find_zero()" function, and you could make a little-endian version
      of it.  And if the Kbuild infrastructure allowed us to pick a particular
      header file, that would be lovely)
      
      The <asm/word-at-a-time.h> functions are as follows:
      
       - WORD_AT_A_TIME_CONSTANTS: specific constants that the algorithm
         uses.
      
       - has_zero(): take a word, and determine if it has a zero byte in it.
         It gets the word, the pointer to the constant pool, and a pointer to
         an intermediate "data" field it can set.
      
         This is the "quick-and-dirty" zero tester: it's what is run inside
         the hot loops.
      
       - "prep_zero_mask()": take the word, the data that has_zero() produced,
         and the constant pool, and generate an *exact* mask of which byte had
         the first zero.  This is run directly *outside* the loop, and allows
         the "has_zero()" function to answer the "is there a zero byte"
         question without necessarily getting exactly *which* byte is the
         first one to contain a zero.
      
         If you do multiple byte lookups concurrently (eg "hash_name()", which
         looks for both NUL and '/' bytes), after you've done the prep_zero_mask()
         phase, the result of those can be or'ed together to get the "either
         or" case.
      
       - The result from "prep_zero_mask()" can then be fed into "find_zero()"
         (to find the byte offset of the first byte that was zero) or into
         "zero_bytemask()" (to find the bytemask of the bytes preceding the
         zero byte).
      
         The existence of zero_bytemask() is optional, and is not necessary
         for the normal string routines.  But dentry name hashing needs it, so
         if you enable DENTRY_WORD_AT_A_TIME you need to expose it.
      
      This changes the generic strncpy_from_user() function and the dentry
      hashing functions to use these modified word-at-a-time interfaces.  This
      gets us back to the optimized state of the x86 strncpy that we lost in
      the previous commit when moving over to the generic version.
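
The interface described above can be modelled in a few lines. What follows is a minimal Python sketch of the little-endian case, not the kernel's C code: the function names mirror the ones in this message, but the bodies, the 64-bit word size, the module-level constants (standing in for WORD_AT_A_TIME_CONSTANTS) and the helpers repeat_byte() and first_nul_or_slash() are assumptions made for illustration. The real functions also take a pointer to the constant pool and an intermediate "data" field, which this sketch drops.

```python
# Python model of a little-endian word-at-a-time zero-byte search.
# Python integers are unbounded, so results are masked to 64 bits
# to mimic an unsigned long.

WORD_BYTES = 8
WORD_MASK = (1 << (8 * WORD_BYTES)) - 1


def repeat_byte(x):
    """Replicate one byte value across the whole word."""
    return (WORD_MASK // 0xFF) * x


# Stand-ins for WORD_AT_A_TIME_CONSTANTS (assumed layout).
ONE_BITS = repeat_byte(0x01)
HIGH_BITS = repeat_byte(0x80)


def has_zero(word):
    """Quick test: nonzero if and only if some byte of `word` is zero.

    Bytes above the first zero byte may be flagged spuriously because
    a borrow can ripple upward, so only the lowest set bit is exact.
    """
    return ((word - ONE_BITS) & ~word & HIGH_BITS) & WORD_MASK


def prep_zero_mask(word, bits):
    """Little-endian case: the lowest set bit is already exact."""
    return bits


def find_zero(mask):
    """Byte offset of the first flagged byte (mask must be nonzero)."""
    lowest = mask & -mask
    return (lowest.bit_length() - 1) // 8


def zero_bytemask(mask):
    """0xff in every byte that precedes the first flagged byte."""
    return (((mask - 1) & ~mask) >> 7) & WORD_MASK


def first_nul_or_slash(chunk):
    """Offset of the first NUL or '/' in an 8-byte chunk, or None."""
    word = int.from_bytes(chunk, "little")
    nul = prep_zero_mask(word, has_zero(word))
    slash_word = word ^ repeat_byte(ord("/"))
    slash = prep_zero_mask(slash_word, has_zero(slash_word))
    either = nul | slash  # the "either or" case
    return find_zero(either) if either else None
```

In this model the hot loop would call only has_zero(); the other helpers run once, after the loop has found a word that contains a terminator.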
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      36126f8f
    • T
      NFSv4.1: Don't clobber the seqid if exchange_id returns a confirmed clientid · 32b01310
      Trond Myklebust committed
      If the EXCHGID4_FLAG_CONFIRMED_R flag is set, the client is in theory
      supposed to already know the correct value of the seqid, in which case
      RFC5661 states that it should ignore the value returned.
      
      Also ensure that if the sanity check in nfs4_check_cl_exchange_flags
      fails, then we must not change the nfs_client fields.
      
      Finally, clean up the code: we don't need to retest the value of
      'status' unless it can change.
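
The two rules above can be sketched as follows. This is an illustrative Python model, not the NFS client code: the dict keys, the function name apply_exchange_id_result() and the flags_ok parameter (standing in for the outcome of the sanity check) are all hypothetical. The flag value is the one defined in RFC 5661.

```python
EXCHGID4_FLAG_CONFIRMED_R = 0x80000000  # flag value as defined in RFC 5661


def apply_exchange_id_result(client, reply, flags_ok):
    """Return the client record updated from an EXCHANGE_ID reply.

    `client` and `reply` are plain dicts with hypothetical keys
    ("clientid", "seqid", "flags"); `flags_ok` stands in for the
    result of the sanity check on the returned flags.
    """
    if not flags_ok:
        # Sanity check failed: leave every field untouched.
        return dict(client)
    updated = dict(client)
    updated["clientid"] = reply["clientid"]
    updated["flags"] = reply["flags"]
    if not (reply["flags"] & EXCHGID4_FLAG_CONFIRMED_R):
        # Only an unconfirmed clientid carries a seqid worth storing.
        updated["seqid"] = reply["seqid"]
    return updated
```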
      Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
      32b01310
    • T
      NFSv4.1: Add DESTROY_CLIENTID · 66245539
      Trond Myklebust committed
      Ensure that we destroy our lease on last unmount.
      Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
      66245539
  3. 26 May, 2012 3 commits
  4. 25 May, 2012 8 commits
    • S
      dma-buf: minor documentation fixes. · 12c4727e
      Sumit Semwal committed
      Some minor inline documentation fixes for gaps resulting from new patches.
      Signed-off-by: Sumit Semwal <sumit.semwal@ti.com>
      Signed-off-by: Sumit Semwal <sumit.semwal@linaro.org>
      12c4727e
    • D
      dma-buf: add vmap interface · 98f86c9e
      Dave Airlie committed
      The main requirement I have for this interface is for scanning out
      using the USB gpu devices. Since these devices have to read the
      framebuffer on updates and linearly compress it, using kmaps
      is a major overhead for every update.
      
      v2: fix warn issues pointed out by Sylwester Nawrocki.
      
      v3: fix compile !CONFIG_DMA_SHARED_BUFFER and add _GPL for now
      Signed-off-by: Dave Airlie <airlied@redhat.com>
      Reviewed-by: Rob Clark <rob.clark@linaro.org>
      Signed-off-by: Sumit Semwal <sumit.semwal@linaro.org>
      98f86c9e
    • D
      dma-buf: mmap support · 4c78513e
      Daniel Vetter committed
      Compared to Rob Clark's RFC I've ditched the prepare/finish hooks
      and corresponding ioctls on the dma_buf file. The major reason for
      that is that many people seem to be under the impression that this is
      also for synchronization with outstanding asynchronous processing.
      I'm pretty massively opposed to this because:
      
      - It boils down to reinventing a new rather general-purpose userspace
        synchronization interface. If we look at things like futexes, this
        is hard to get right.
      - Furthermore a lot of kernel code has to interact with this
        synchronization primitive. This smells a lot like the dri1 hw_lock,
        a horror show I prefer not to reinvent.
      - Even more fun is that multiple different subsystems would interact
        here, so we have plenty of opportunities to create funny deadlock
        scenarios.
      
      I think synchronization is a wholesale different problem from data
      sharing and should be tackled as an orthogonal problem.
      
      Now we could demand that prepare/finish may only ensure cache
      coherency (as Rob intended), but that runs up into the next problem:
      We not only need mmap support to facilitate sw-only processing nodes
      in a pipeline (without jumping through hoops by importing the dma_buf
      into some sw-access only importer), which allows for a nicer
      ION->dma-buf upgrade path for existing Android userspace. We also need
      mmap support for existing importing subsystems to support existing
      userspace libraries. And a lot of these subsystems are expected to
      export coherent userspace mappings.
      
      So prepare/finish can only ever be optional and the exporter /needs/
      to support coherent mappings. Given that mmap access is always
      somewhat fallback-y in nature I've decided to drop this optimization,
      instead of just making it optional. If we demonstrate a clear need for
      this, supported by benchmark results, we can always add it in again
      later as an optional extension.
      
      Another difference compared to Rob's RFC is the above-mentioned support
      for mapping a dma-buf through facilities provided by the importer,
      which results in mmap support no longer being optional.
      
      Note that this dma-buf mmap patch does _not_ support every possible
      insanity an existing subsystem could pull off with mmap: because it
      does not allow intercepting pagefaults and shooting down ptes, importing
      subsystems can't add some magic of their own at these points (e.g. to
      automatically synchronize with outstanding rendering or set up some
      special resources). I've done a cursory read through a few mmap
      implementations of various subsystems and I'm hopeful that we can avoid
      this (and the complexity it'd bring with it).
      
      Additionally I've extended the documentation a bit to explain the hows
      and whys of this mmap extension.
      
      In case we ever want to add support for explicitly cache-managed
      userspace mmap with a prepare/finish ioctl pair, we could specify that
      userspace needs to mmap a different part of the dma_buf, e.g. the
      range starting at dma_buf->size up to dma_buf->size*2. This works
      because the size of a dma_buf is invariant over its lifetime. The
      exporter would obviously need to fall back to coherent mappings for
      both ranges if a legacy client maps the coherent range and the
      architecture cannot support conflicting caching policies. Also, this
      would obviously be optional and userspace needs to be able to fall
      back to coherent mappings.
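
The two-window idea in the paragraph above can be illustrated with a small offset check. This Python sketch only models the arithmetic and is entirely hypothetical: no such interface exists in this patch, and the function name, the window labels and the error handling are assumptions.

```python
def classify_mmap_range(offset, length, buf_size):
    """Decide which window of a dma_buf a requested mapping falls into.

    Returns (kind, buffer_offset), where kind is "coherent" for
    [0, buf_size) and "cache-managed" for [buf_size, 2 * buf_size).
    Both windows alias the same underlying buffer.
    """
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    end = offset + length
    if end <= buf_size:
        return ("coherent", offset)
    if offset >= buf_size and end <= 2 * buf_size:
        # Second window: same bytes, explicitly cache-managed access.
        return ("cache-managed", offset - buf_size)
    raise ValueError("mapping crosses or exceeds the two windows")
```

Because the buffer size never changes, the boundary between the two windows stays fixed for the lifetime of the buffer, which is what makes the scheme workable.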
      
      v2:
      - Spelling fixes from Rob Clark.
      - Compile fix for !DMA_BUF from Rob Clark.
      - Extend commit message to explain how explicitly cache managed mmap
        support could be added later.
      - Extend the documentation with implementation notes for exporters
        that need to manually fake coherency.
      
      v3:
      - dma_buf pointer initialization goof-up noticed by Rebecca Schultz
        Zavin.
      
      Cc: Rob Clark <rob.clark@linaro.org>
      Cc: Rebecca Schultz Zavin <rebecca@android.com>
      Acked-by: Rob Clark <rob.clark@linaro.org>
      Signed-Off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
      Signed-off-by: Sumit Semwal <sumit.semwal@linaro.org>
      4c78513e
    • W
      nfs4.1: add BIND_CONN_TO_SESSION operation · 7c44f1ae
      Weston Andros Adamson committed
      This patch adds the BIND_CONN_TO_SESSION operation which is needed for
      upcoming SP4_MACH_CRED work and useful for recovering from broken connections
      without destroying the session.
      Signed-off-by: Weston Andros Adamson <dros@netapp.com>
      Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
      7c44f1ae
    • A
      NFSv4.1 add nfs_inode book keeping for mdsthreshold · 2701d086
      Andy Adamson committed
      Keep track of the number of bytes read or written via buffered, direct, and
      mem-mapped i/o for use by mdsthreshold size_io hints.
      Signed-off-by: Andy Adamson <andros@netapp.com>
      Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
      2701d086
    • A
      82be417a
    • A
      NFSv4.1 mdsthreshold attribute xdr · 88034c3d
      Andy Adamson committed
      We only support one layout type per file system, so one threshold_item4 per
      mdsthreshold4.
      Signed-off-by: Andy Adamson <andros@netapp.com>
      Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
      88034c3d
    • D
      kernel: Move REPEAT_BYTE definition into linux/kernel.h · 44696908
      David S. Miller committed
      And make sure that everything using it explicitly includes
      that header file.
      Signed-off-by: David S. Miller <davem@davemloft.net>
      44696908
  5. 24 May, 2012 2 commits
  6. 23 May, 2012 16 commits
    • J
      Revert "sched, perf: Use a single callback into the scheduler" · ab0cce56
      Jiri Olsa committed
      This reverts commit cb04ff9a ("sched, perf: Use a single
      callback into the scheduler").
      
      Before this change was introduced, the process switch worked
      like this (wrt. perf event scheduling):
      
           schedule (prev, next)
             - schedule out all perf events for prev
             - switch to next
             - schedule in all perf events for current (next)
      
      After the commit, the process switch looks like:
      
           schedule (prev, next)
             - schedule out all perf events for prev
             - schedule in all perf events for (next)
             - switch to next
      
      The problem is that after we schedule perf events in, the pmu
      is enabled and we can receive events even before we make the
      switch to next, so "current" is still the prev process (event
      SAMPLE data are filled based on the value of the "current"
      process).
      
      That's exactly what we see for the test__PERF_RECORD test. We receive
      SAMPLES with the PID of the process that our tracee is scheduled
      from.
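
The difference between the two orderings can be replayed with a toy model. The Python sketch below is an illustration only: run_switch(), the step names and the assumption that a PMU interrupt fires after every step are all hypothetical, chosen to make the window visible. Each recorded sample is a pair (owner of the live perf events, pid of "current" when the sample was taken).

```python
def run_switch(order, prev_pid, next_pid):
    """Replay one context switch and return the samples recorded.

    `order` lists the steps to perform. A (hypothetical) PMU interrupt
    fires after every step; it records a sample only while events are
    scheduled in, tagging it with the pid of `current` at that moment.
    """
    current = prev_pid
    events_owner = prev_pid  # whose perf events are live on the PMU
    samples = []
    for step in order:
        if step == "sched_out":
            events_owner = None
        elif step == "sched_in":
            events_owner = next_pid
        elif step == "switch":
            current = next_pid
        else:
            raise ValueError(step)
        if events_owner is not None:
            samples.append((events_owner, current))
    return samples


def misattributed(samples):
    """Samples whose recorded pid is not the owner of the events."""
    return [s for s in samples if s[0] != s[1]]


# Ordering before the reverted commit: two hooks around the switch.
TWO_HOOKS = ["sched_out", "switch", "sched_in"]
# Ordering after the reverted commit: both hooks before the switch.
SINGLE_HOOK = ["sched_out", "sched_in", "switch"]
```

With the two-hook ordering no sample is taken while "current" is being handed over, so nothing is misattributed. With the single-hook ordering the sample taken between "sched_in" and "switch" carries the pid of prev.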
      
      Discussed with Peter Zijlstra:
      
       > Bah!, yeah I guess reverting is the right thing for now. Sad
       > though.
       >
       > So by having the two hooks we have a black-spot between them
       > where we receive no events at all, this black-spot covers the
       > hand-over of current and we thus don't receive the 'wrong'
       > events.
       >
       > I rather liked we could do away with both that black-spot and
       > clean up the code a little, but apparently people rely on it.
      Signed-off-by: Jiri Olsa <jolsa@redhat.com>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: acme@redhat.com
      Cc: paulus@samba.org
      Cc: cjashfor@linux.vnet.ibm.com
      Cc: fweisbec@gmail.com
      Cc: eranian@google.com
      Link: http://lkml.kernel.org/r/20120523111302.GC1638@m.brq.redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      ab0cce56
    • S
      mfd: Fix max77693 build failure · 78302a19
      Samuel Ortiz committed
      Without it we get:
      
      drivers/mfd/max77693.c: In function ‘max77693_i2c_probe’:
      drivers/mfd/max77693.c:157:2: error: implicit declaration of function
      ‘max77693_irq_init’ [-Werror=implicit-function-declaration]
      drivers/mfd/max77693.c: In function ‘max77693_resume’:
      drivers/mfd/max77693.c:215:2: error: implicit declaration of function
      ‘max77693_irq_resume’ [-Werror=implicit-function-declaration]
      drivers/mfd/max77693-irq.c: In function ‘max77693_irq_lock’:
      drivers/mfd/max77693-irq.c:104:2: error: ‘struct max77693_dev’ has no member
      named ‘irqlock’
      drivers/mfd/max77693-irq.c: In function ‘max77693_irq_sync_unlock’:
      drivers/mfd/max77693-irq.c:119:11: error: ‘struct max77693_dev’ has no member
      named ‘irq_masks_cache’
      drivers/mfd/max77693-irq.c:119:42: error: ‘struct max77693_dev’ has no member
      named ‘irq_masks_cur’
      drivers/mfd/max77693-irq.c:122:13: error: ‘struct max77693_dev’ has no member
      named ‘irq_masks_cur’
      drivers/mfd/max77693-irq.c:125:24: error: ‘struct max77693_dev’ has no member
      named ‘irqlock’
      drivers/mfd/max77693-irq.c: In function ‘max77693_irq_mask’:
      drivers/mfd/max77693-irq.c:141:11: error: ‘struct max77693_dev’ has no member
      named ‘irq_masks_cur’
      drivers/mfd/max77693-irq.c:143:11: error: ‘struct max77693_dev’ has no member
      named ‘irq_masks_cur’
      drivers/mfd/max77693-irq.c: In function ‘max77693_irq_unmask’:
      drivers/mfd/max77693-irq.c:153:11: error: ‘struct max77693_dev’ has no member
      named ‘irq_masks_cur’
      drivers/mfd/max77693-irq.c:155:11: error: ‘struct max77693_dev’ has no member
      named ‘irq_masks_cur’
      drivers/mfd/max77693-irq.c: In function ‘max77693_irq_thread’:
      drivers/mfd/max77693-irq.c:209:26: error: ‘struct max77693_dev’ has no member
      named ‘irq_masks_cur’
      drivers/mfd/max77693-irq.c:211:27: error: ‘struct max77693_dev’ has no member
      named ‘irq_masks_cur’
      drivers/mfd/max77693-irq.c:217:39: error: ‘struct max77693_dev’ has no member
      named ‘irq_domain’
      drivers/mfd/max77693-irq.c: In function ‘max77693_irq_init’:
      drivers/mfd/max77693-irq.c:260:2: error: ‘struct max77693_dev’ has no member
      named ‘irqlock’
      drivers/mfd/max77693-irq.c:268:12: error: ‘struct max77693_dev’ has no member
      named ‘irq_masks_cur’
      drivers/mfd/max77693-irq.c:269:12: error: ‘struct max77693_dev’ has no member
      named ‘irq_masks_cache’
      drivers/mfd/max77693-irq.c:271:12: error: ‘struct max77693_dev’ has no member
      named ‘irq_masks_cur’
      drivers/mfd/max77693-irq.c:272:12: error: ‘struct max77693_dev’ has no member
      named ‘irq_masks_cache’
      drivers/mfd/max77693-irq.c:292:10: error: ‘struct max77693_dev’ has no member
      named ‘irq_domain’
      Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
      78302a19
    • D
      ttm: add prime sharing support to TTM (v2) · 129b78bf
      Dave Airlie committed
      This adds the ability for ttm common code to take an SG table
      and use it as the backing for a slave TTM object.
      
      The drivers can then populate their GTT tables using the SG object.
      
      v2: make sure to setup VM for sg bos as well.
      Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
      Reviewed-by: Jerome Glisse <jglisse@redhat.com>
      Signed-off-by: Dave Airlie <airlied@redhat.com>
      129b78bf
    • D
      drm/prime: introduce sg->pages/addr arrays helper · 51ab7ba2
      Dave Airlie committed
      The TTM drivers currently need this in order to get fault handling
      working and efficient.
      
      It also allows addrs to be NULL for devices like udl.
      Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
      Signed-off-by: Dave Airlie <airlied@redhat.com>
      51ab7ba2
    • L
      mfd: Save device node parsed platform data for tps65910 sub devices · cb8d8654
      Laxman Dewangan committed
      Save the memory allocated for the parsed device node information
      in the global device structure so that sub devices can use this
      pointer directly.
      This way, the sub devices do not need to re-allocate
      memory for storing the sub-device-specific device node information.
      Signed-off-by: Laxman Dewangan <ldewangan@nvidia.com>
      Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
      cb8d8654
    • J
      mfd: Add r_select to lm3533 platform data · 730a3d01
      Johan Hovold committed
      Add resistor-select parameter to the platform data.
      Signed-off-by: Johan Hovold <jhovold@gmail.com>
      Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
      730a3d01
    • S
      if: restore token ring ARP type to header · 37c106d0
      stephen hemminger committed
      Recent removal of Token Ring breaks the build of iproute2.
      
      Even though Token Ring support is gone from the kernel, it is worth
      keeping the definition of the TR ARP type to avoid breaking
      userspace programs that use this file.
      Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      37c106d0
    • C
      NFS: EXCHANGE_ID should save the server major and minor ID · acdeb69d
      Chuck Lever committed
      Save the server major and minor ID results from EXCHANGE_ID, as they
      are needed for detecting server trunking.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
      acdeb69d
    • C
      NFS: Add nfs_client behavior flags · 4bf590e0
      Chuck Lever committed
      "noresvport" and "discrtry" can be passed to nfs_create_rpc_client()
      by setting flags in the passed-in nfs_client.  This change makes it
      easy to add new flags.
      
      Note that these settings are now "sticky" over the lifetime of a
      struct nfs_client, and may even be copied when an nfs_client is
      cloned.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
      4bf590e0
    • C
      NFS: Refactor nfs_get_client(): initialize nfs_client · 8cab4c39
      Chuck Lever committed
      Clean up: Continue to rationalize the locking in nfs_get_client() by
      moving the logic that handles the case where a matching server IP
      address is not found.
      
      When we support server trunking detection, client initialization may
      return a different nfs_client struct than was passed to it.  Change
      the synopsis of the init_client methods to return an nfs_client.
      
      The client initialization logic in nfs_get_client() is not much more
      than a wrapper around ->init_client.  It's simpler to keep the little
      bits of error handling in the version-specific init_client methods.
      
      No behavior change is expected.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
      8cab4c39
    • C
      NFS: Always use the same SETCLIENTID boot verifier · f092075d
      Chuck Lever committed
      Currently our NFS client assigns a unique SETCLIENTID boot verifier
      for each server IP address it knows about.  It's set to CURRENT_TIME
      when the struct nfs_client for that server IP is created.
      
      During the SETCLIENTID operation, our client also presents an
      nfs_client_id4 string to servers, as an identifier on which the server
      can hang all of this client's NFSv4 state.  Our client's
      nfs_client_id4 string is unique for each server IP address.
      
      An NFSv4 server is obligated to wipe all NFSv4 state associated with
      an nfs_client_id4 string when the client presents the same
      nfs_client_id4 string along with a changed SETCLIENTID boot verifier.
      
      When our client unmounts the last of a server's shares, it destroys
      that server's struct nfs_client.  The next time the client mounts that
      NFS server, it creates a fresh struct nfs_client with a fresh boot
      verifier.  On seeing the fresh verifier, the server wipes any previous
      NFSv4 state associated with that nfs_client_id4.
      
      However, NFSv4.1 clients are supposed to present the same
      nfs_client_id4 string to all servers.  And, to support Transparent
      State Migration, the same nfs_client_id4 string should be presented
      to all NFSv4.0 servers so they recognize that migrated state for this
      client belongs with state a server may already have for this client.
      (This is known as the Uniform Client String model).
      
      If the nfs_client_id4 string is the same but the boot verifier changes
      for each server IP address, SETCLIENTID and EXCHANGE_ID operations
      from such a client could unintentionally result in a server wiping a
      client's previously obtained lease.
      
      Thus, if our NFS client is going to use a fixed nfs_client_id4 string,
      either for NFSv4.0 or NFSv4.1 mounts, our NFS client should use a
      boot verifier that does not change depending on server IP address.
      Replace our current per-nfs_client boot verifier with a per-nfs_net
      boot verifier.
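
The failure mode described above can be reproduced with a toy server. This Python sketch is a simplified model, not NFS code: the Server class, the example addresses, the client string and the verifier values are hypothetical, and the only server rule modelled is "same nfs_client_id4 string with a changed boot verifier wipes the state".

```python
class Server:
    """Toy NFSv4 server keyed by the nfs_client_id4 string."""

    def __init__(self):
        self.clients = {}  # client id string -> [boot verifier, state list]
        self.wipes = 0

    def setclientid(self, client_id, verifier):
        record = self.clients.get(client_id)
        if record is not None and record[0] != verifier:
            # Same identity, new verifier: treat as a client reboot.
            self.wipes += 1
            record = None
        if record is None:
            record = [verifier, []]
            self.clients[client_id] = record
        return record[1]


def mount_via_two_addresses(verifier_for_address):
    """One server reached through two IP addresses, one client string.

    Returns (number of wipes, number of state items that survived).
    """
    server = Server()
    client_id = "uniform-client-string"  # hypothetical identifier
    state = server.setclientid(client_id, verifier_for_address("192.0.2.1"))
    state.append("open file")
    server.setclientid(client_id, verifier_for_address("192.0.2.2"))
    return server.wipes, len(server.clients[client_id][1])


# One verifier per server IP address (the old behaviour in this model).
per_address = {"192.0.2.1": 1000, "192.0.2.2": 1005}.__getitem__


# One verifier shared by every address (the per-nfs_net behaviour).
def per_net(address):
    return 1000
```

In this model, mount_via_two_addresses(per_address) loses the state obtained through the first address, while mount_via_two_addresses(per_net) keeps it.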
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
      f092075d
    • C
      NFS: Use proper naming conventions for the nfs_client.net field · 73ea666c
      Chuck Lever committed
      Clean up:  When naming fields and data types, follow established
      conventions to facilitate accurate grep/cscope searches.
      
      Introduced by commit e50a7a1a "NFS: make NFS client allocated per
      network namespace context," Tue Jan 10, 2012.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
      73ea666c
    • C
      NFS: Use proper naming conventions for nfs_client.impl_id field · 59155546
      Chuck Lever committed
      Clean up:  When naming fields and data types, follow established
      conventions to facilitate accurate grep/cscope searches.
      
      Additionally, for consistency, move the impl_id field into the NFSv4-
      specific part of the nfs_client, and free that memory in the logic
      that shuts down NFSv4 nfs_clients.
      
      Introduced by commit 7d2ed9ac "NFSv4: parse and display server
      implementation ids," Fri Feb 17, 2012.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
      59155546
    • C
      NFS: Use proper naming conventions for NFSv4.1 server scope fields · 79d4e1f0
      Chuck Lever committed
      Clean up:  When naming fields and data types, follow established
      conventions to facilitate accurate grep/cscope searches.
      
      Additionally, for consistency, move the scope field into the NFSv4-
      specific part of the nfs_client, and free that memory in the logic
      that shuts down NFSv4 nfs_clients.
      
      Introduced by commit 99fe60d0 "nfs41: exchange_id operation", April
      1 2009.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
      79d4e1f0
    • C
    • C
      NFS: Add NFSDBG_STATE · e3c0fb7e
      Chuck Lever committed
      fs/nfs/nfs4state.c does not yet have any dprintk() call sites, and I'm
      about to introduce some.  We will need a new flag for enabling them.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
      e3c0fb7e