1. 21 May, 2016 40 commits
    • D
      mm/zswap: use workqueue to destroy pool · 200867af
      Dan Streetman committed
      Add a work_struct to struct zswap_pool, and change __zswap_pool_empty to
      use the workqueue instead of using call_rcu().
      
      When zswap destroys a pool no longer in use, it uses call_rcu() to
      perform the destruction/freeing.  Since that executes in softirq
      context, it must not sleep.  However, actually destroying the pool
      involves freeing the per-cpu compressors (which requires locking the
      cpu_add_remove_lock mutex) and freeing the zpool, for which the
      implementation may sleep (e.g.  zsmalloc calls kmem_cache_destroy, which
      locks the slab_mutex).  So if either mutex is currently taken, or any
      other part of the compressor or zpool implementation sleeps, it will
      result in a BUG().
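      
      The deferral described above can be illustrated with a small userspace
      model.  This is only a sketch, not the patch itself: struct work,
      pool_release(), pool_destroy(), run_pending_work() and the
      in_atomic_context flag are hypothetical stand-ins for the work_struct,
      the release callback and the workqueue worker.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace model only: none of these names are kernel APIs. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct work {
	void (*fn)(struct work *w);
	struct work *next;
};

struct pool {
	bool destroyed;
	struct work work;        /* embedded work item, one per pool */
};

struct work *pending;            /* work queued for the worker */
bool in_atomic_context;          /* models running in softirq context */

/* Teardown that may sleep, so it must never run in atomic context. */
void pool_destroy(struct work *w)
{
	struct pool *p = container_of(w, struct pool, work);

	assert(!in_atomic_context);
	p->destroyed = true;
}

/* Last reference dropped, possibly in atomic context: only queue work. */
void pool_release(struct pool *p)
{
	p->work.fn = pool_destroy;
	p->work.next = pending;
	pending = &p->work;
}

/* Worker in process context: drains the queue and may sleep. */
void run_pending_work(void)
{
	in_atomic_context = false;
	while (pending) {
		struct work *w = pending;

		pending = w->next;
		w->fn(w);
	}
}
```

      In this model the release path never performs the teardown itself, so
      it stays safe to call from a context that must not sleep; the pool is
      only destroyed once the worker runs.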
      
      It's not easy to reproduce this when changing zswap's params normally.
      In testing with a loaded system, this does not fail:
      
        $ cd /sys/module/zswap/parameters
        $ echo lz4 > compressor ; echo zsmalloc > zpool
      
      nor does this:
      
        $ while true ; do
        > echo lzo > compressor ; echo zbud > zpool
        > sleep 1
        > echo lz4 > compressor ; echo zsmalloc > zpool
        > sleep 1
        > done
      
      although it's still possible either of those might fail, depending on
      whether anything else besides zswap has locked the mutexes.
      
      However, changing a parameter with no delay immediately causes the
      schedule while atomic BUG:
      
        $ while true ; do
        > echo lzo > compressor ; echo lz4 > compressor
        > done
      
      This is essentially the same as Yu Zhao's proposed patch to zsmalloc,
      but moved to zswap, to cover compressor and zpool freeing.
      
      Fixes: f1c54846 ("zswap: dynamic pool creation")
      Signed-off-by: Dan Streetman <ddstreet@ieee.org>
      Reported-by: Yu Zhao <yuzhao@google.com>
      Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Dan Streetman <dan.streetman@canonical.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      200867af
    • S
      zram: user per-cpu compression streams · da9556a2
      Sergey Senozhatsky committed
      Remove the idle streams list and keep compression streams in per-cpu
      data.  This removes two contended spin_lock()/spin_unlock() calls from
      the write path and also prevents a write OP from being preempted while
      holding the compression stream, which can cause slowdowns.
      
      For instance, let's assume that we have N cpus and N-2
      max_comp_streams.  TASK1 owns the last idle stream, and TASK2 and TASK3
      come in with write requests:
      
        TASK1            TASK2              TASK3
       zram_bvec_write()
        spin_lock
        find stream
        spin_unlock
      
        compress
      
        <<preempted>>   zram_bvec_write()
                         spin_lock
                         find stream
                         spin_unlock
                           no_stream
                             schedule
                                           zram_bvec_write()
                                            spin_lock
                                            find_stream
                                            spin_unlock
                                              no_stream
                                                schedule
         spin_lock
         release stream
         spin_unlock
           wake up TASK2
      
      Not only will TASK2 and TASK3 not get the stream, but TASK1 will also
      be preempted in the middle of its operation, while we would prefer it
      to finish compression and release the stream.
      
      Test environment: x86_64, 4 CPU box, 3G zram, lzo
      
      The following fio tests were executed:
            read, randread, write, randwrite, rw, randrw
      with the increasing number of jobs from 1 to 10.
      
                        4 streams        8 streams       per-cpu
        ===========================================================
        jobs1
        READ:           2520.1MB/s       2566.5MB/s      2491.5MB/s
        READ:           2102.7MB/s       2104.2MB/s      2091.3MB/s
        WRITE:          1355.1MB/s       1320.2MB/s      1378.9MB/s
        WRITE:          1103.5MB/s       1097.2MB/s      1122.5MB/s
        READ:           434013KB/s       435153KB/s      439961KB/s
        WRITE:          433969KB/s       435109KB/s      439917KB/s
        READ:           403166KB/s       405139KB/s      403373KB/s
        WRITE:          403223KB/s       405197KB/s      403430KB/s
        jobs2
        READ:           7958.6MB/s       8105.6MB/s      8073.7MB/s
        READ:           6864.9MB/s       6989.8MB/s      7021.8MB/s
        WRITE:          2438.1MB/s       2346.9MB/s      3400.2MB/s
        WRITE:          1994.2MB/s       1990.3MB/s      2941.2MB/s
        READ:           981504KB/s       973906KB/s      1018.8MB/s
        WRITE:          981659KB/s       974060KB/s      1018.1MB/s
        READ:           937021KB/s       938976KB/s      987250KB/s
        WRITE:          934878KB/s       936830KB/s      984993KB/s
        jobs3
        READ:           13280MB/s        13553MB/s       13553MB/s
        READ:           11534MB/s        11785MB/s       11755MB/s
        WRITE:          3456.9MB/s       3469.9MB/s      4810.3MB/s
        WRITE:          3029.6MB/s       3031.6MB/s      4264.8MB/s
        READ:           1363.8MB/s       1362.6MB/s      1448.9MB/s
        WRITE:          1361.9MB/s       1360.7MB/s      1446.9MB/s
        READ:           1309.4MB/s       1310.6MB/s      1397.5MB/s
        WRITE:          1307.4MB/s       1308.5MB/s      1395.3MB/s
        jobs4
        READ:           20244MB/s        20177MB/s       20344MB/s
        READ:           17886MB/s        17913MB/s       17835MB/s
        WRITE:          4071.6MB/s       4046.1MB/s      6370.2MB/s
        WRITE:          3608.9MB/s       3576.3MB/s      5785.4MB/s
        READ:           1824.3MB/s       1821.6MB/s      1997.5MB/s
        WRITE:          1819.8MB/s       1817.4MB/s      1992.5MB/s
        READ:           1765.7MB/s       1768.3MB/s      1937.3MB/s
        WRITE:          1767.5MB/s       1769.1MB/s      1939.2MB/s
        jobs5
        READ:           18663MB/s        18986MB/s       18823MB/s
        READ:           16659MB/s        16605MB/s       16954MB/s
        WRITE:          3912.4MB/s       3888.7MB/s      6126.9MB/s
        WRITE:          3506.4MB/s       3442.5MB/s      5519.3MB/s
        READ:           1798.2MB/s       1746.5MB/s      1935.8MB/s
        WRITE:          1792.7MB/s       1740.7MB/s      1929.1MB/s
        READ:           1727.6MB/s       1658.2MB/s      1917.3MB/s
        WRITE:          1726.5MB/s       1657.2MB/s      1916.6MB/s
        jobs6
        READ:           21017MB/s        20922MB/s       21162MB/s
        READ:           19022MB/s        19140MB/s       18770MB/s
        WRITE:          3968.2MB/s       4037.7MB/s      6620.8MB/s
        WRITE:          3643.5MB/s       3590.2MB/s      6027.5MB/s
        READ:           1871.8MB/s       1880.5MB/s      2049.9MB/s
        WRITE:          1867.8MB/s       1877.2MB/s      2046.2MB/s
        READ:           1755.8MB/s       1710.3MB/s      1964.7MB/s
        WRITE:          1750.5MB/s       1705.9MB/s      1958.8MB/s
        jobs7
        READ:           21103MB/s        20677MB/s       21482MB/s
        READ:           18522MB/s        18379MB/s       19443MB/s
        WRITE:          4022.5MB/s       4067.4MB/s      6755.9MB/s
        WRITE:          3691.7MB/s       3695.5MB/s      5925.6MB/s
        READ:           1841.5MB/s       1933.9MB/s      2090.5MB/s
        WRITE:          1842.7MB/s       1935.3MB/s      2091.9MB/s
        READ:           1832.4MB/s       1856.4MB/s      1971.5MB/s
        WRITE:          1822.3MB/s       1846.2MB/s      1960.6MB/s
        jobs8
        READ:           20463MB/s        20194MB/s       20862MB/s
        READ:           18178MB/s        17978MB/s       18299MB/s
        WRITE:          4085.9MB/s       4060.2MB/s      7023.8MB/s
        WRITE:          3776.3MB/s       3737.9MB/s      6278.2MB/s
        READ:           1957.6MB/s       1944.4MB/s      2109.5MB/s
        WRITE:          1959.2MB/s       1946.2MB/s      2111.4MB/s
        READ:           1900.6MB/s       1885.7MB/s      2082.1MB/s
        WRITE:          1896.2MB/s       1881.4MB/s      2078.3MB/s
        jobs9
        READ:           19692MB/s        19734MB/s       19334MB/s
        READ:           17678MB/s        18249MB/s       17666MB/s
        WRITE:          4004.7MB/s       4064.8MB/s      6990.7MB/s
        WRITE:          3724.7MB/s       3772.1MB/s      6193.6MB/s
        READ:           1953.7MB/s       1967.3MB/s      2105.6MB/s
        WRITE:          1953.4MB/s       1966.7MB/s      2104.1MB/s
        READ:           1860.4MB/s       1897.4MB/s      2068.5MB/s
        WRITE:          1858.9MB/s       1895.9MB/s      2066.8MB/s
        jobs10
        READ:           19730MB/s        19579MB/s       19492MB/s
        READ:           18028MB/s        18018MB/s       18221MB/s
        WRITE:          4027.3MB/s       4090.6MB/s      7020.1MB/s
        WRITE:          3810.5MB/s       3846.8MB/s      6426.8MB/s
        READ:           1956.1MB/s       1994.6MB/s      2145.2MB/s
        WRITE:          1955.9MB/s       1993.5MB/s      2144.8MB/s
        READ:           1852.8MB/s       1911.6MB/s      2075.8MB/s
        WRITE:          1855.7MB/s       1914.6MB/s      2078.1MB/s
      
      perf stat
      
                                        4 streams                       8 streams                       per-cpu
        ====================================================================================================================
        jobs1
        stalled-cycles-frontend      23,174,811,209 (  38.21%)     23,220,254,188 (  38.25%)       23,061,406,918 (  38.34%)
        stalled-cycles-backend       11,514,174,638 (  18.98%)     11,696,722,657 (  19.27%)       11,370,852,810 (  18.90%)
        instructions                 73,925,005,782 (    1.22)     73,903,177,632 (    1.22)       73,507,201,037 (    1.22)
        branches                     14,455,124,835 ( 756.063)     14,455,184,779 ( 755.281)       14,378,599,509 ( 758.546)
        branch-misses                    69,801,336 (   0.48%)         80,225,529 (   0.55%)           72,044,726 (   0.50%)
        jobs2
        stalled-cycles-frontend      49,912,741,782 (  46.11%)     50,101,189,290 (  45.95%)       32,874,195,633 (  35.11%)
        stalled-cycles-backend       27,080,366,230 (  25.02%)     27,949,970,232 (  25.63%)       16,461,222,706 (  17.58%)
        instructions                122,831,629,690 (    1.13)    122,919,846,419 (    1.13)      121,924,786,775 (    1.30)
        branches                     23,725,889,239 ( 692.663)     23,733,547,140 ( 688.062)       23,553,950,311 ( 794.794)
        branch-misses                    90,733,041 (   0.38%)         96,320,895 (   0.41%)           84,561,092 (   0.36%)
        jobs3
        stalled-cycles-frontend      66,437,834,608 (  45.58%)     63,534,923,344 (  43.69%)       42,101,478,505 (  33.19%)
        stalled-cycles-backend       34,940,799,661 (  23.97%)     34,774,043,148 (  23.91%)       21,163,324,388 (  16.68%)
        instructions                171,692,121,862 (    1.18)    171,775,373,044 (    1.18)      170,353,542,261 (    1.34)
        branches                     32,968,962,622 ( 628.723)     32,987,739,894 ( 630.512)       32,729,463,918 ( 717.027)
        branch-misses                   111,522,732 (   0.34%)        110,472,894 (   0.33%)           99,791,291 (   0.30%)
        jobs4
        stalled-cycles-frontend      98,741,701,675 (  49.72%)     94,797,349,965 (  47.59%)       54,535,655,381 (  33.53%)
        stalled-cycles-backend       54,642,609,615 (  27.51%)     55,233,554,408 (  27.73%)       27,882,323,541 (  17.14%)
        instructions                220,884,807,851 (    1.11)    220,930,887,273 (    1.11)      218,926,845,851 (    1.35)
        branches                     42,354,518,180 ( 592.105)     42,362,770,587 ( 590.452)       41,955,552,870 ( 716.154)
        branch-misses                   138,093,449 (   0.33%)        131,295,286 (   0.31%)          121,794,771 (   0.29%)
        jobs5
        stalled-cycles-frontend     116,219,747,212 (  48.14%)    110,310,397,012 (  46.29%)       66,373,082,723 (  33.70%)
        stalled-cycles-backend       66,325,434,776 (  27.48%)     64,157,087,914 (  26.92%)       32,999,097,299 (  16.76%)
        instructions                270,615,008,466 (    1.12)    270,546,409,525 (    1.14)      268,439,910,948 (    1.36)
        branches                     51,834,046,557 ( 599.108)     51,811,867,722 ( 608.883)       51,412,576,077 ( 729.213)
        branch-misses                   158,197,086 (   0.31%)        142,639,805 (   0.28%)          133,425,455 (   0.26%)
        jobs6
        stalled-cycles-frontend     138,009,414,492 (  48.23%)    139,063,571,254 (  48.80%)       75,278,568,278 (  32.80%)
        stalled-cycles-backend       79,211,949,650 (  27.68%)     79,077,241,028 (  27.75%)       37,735,797,899 (  16.44%)
        instructions                319,763,993,731 (    1.12)    319,937,782,834 (    1.12)      316,663,600,784 (    1.38)
        branches                     61,219,433,294 ( 595.056)     61,250,355,540 ( 598.215)       60,523,446,617 ( 733.706)
        branch-misses                   169,257,123 (   0.28%)        154,898,028 (   0.25%)          141,180,587 (   0.23%)
        jobs7
        stalled-cycles-frontend     162,974,812,119 (  49.20%)    159,290,061,987 (  48.43%)       88,046,641,169 (  33.21%)
        stalled-cycles-backend       92,223,151,661 (  27.84%)     91,667,904,406 (  27.87%)       44,068,454,971 (  16.62%)
        instructions                369,516,432,430 (    1.12)    369,361,799,063 (    1.12)      365,290,380,661 (    1.38)
        branches                     70,795,673,950 ( 594.220)     70,743,136,124 ( 597.876)       69,803,996,038 ( 732.822)
        branch-misses                   181,708,327 (   0.26%)        165,767,821 (   0.23%)          150,109,797 (   0.22%)
        jobs8
        stalled-cycles-frontend     185,000,017,027 (  49.30%)    182,334,345,473 (  48.37%)       99,980,147,041 (  33.26%)
        stalled-cycles-backend      105,753,516,186 (  28.18%)    107,937,830,322 (  28.63%)       51,404,177,181 (  17.10%)
        instructions                418,153,161,055 (    1.11)    418,308,565,828 (    1.11)      413,653,475,581 (    1.38)
        branches                     80,035,882,398 ( 592.296)     80,063,204,510 ( 589.843)       79,024,105,589 ( 730.530)
        branch-misses                   199,764,528 (   0.25%)        177,936,926 (   0.22%)          160,525,449 (   0.20%)
        jobs9
        stalled-cycles-frontend     210,941,799,094 (  49.63%)    204,714,679,254 (  48.55%)      114,251,113,756 (  33.96%)
        stalled-cycles-backend      122,640,849,067 (  28.85%)    122,188,553,256 (  28.98%)       58,360,041,127 (  17.35%)
        instructions                468,151,025,415 (    1.10)    467,354,869,323 (    1.11)      462,665,165,216 (    1.38)
        branches                     89,657,067,510 ( 585.628)     89,411,550,407 ( 588.990)       88,360,523,943 ( 730.151)
        branch-misses                   218,292,301 (   0.24%)        191,701,247 (   0.21%)          178,535,678 (   0.20%)
        jobs10
        stalled-cycles-frontend     233,595,958,008 (  49.81%)    227,540,615,689 (  49.11%)      160,341,979,938 (  43.07%)
        stalled-cycles-backend      136,153,676,021 (  29.03%)    133,635,240,742 (  28.84%)       65,909,135,465 (  17.70%)
        instructions                517,001,168,497 (    1.10)    516,210,976,158 (    1.11)      511,374,038,613 (    1.37)
        branches                     98,911,641,329 ( 585.796)     98,700,069,712 ( 591.583)       97,646,761,028 ( 728.712)
        branch-misses                   232,341,823 (   0.23%)        199,256,308 (   0.20%)          183,135,268 (   0.19%)
      
      per-cpu streams tend to cause significantly fewer stalled cycles,
      execute fewer branches, and hit fewer branch misses.
      
      perf stat reported execution time
      
                                4 streams        8 streams       per-cpu
        ====================================================================
        jobs1
        seconds elapsed        20.909073870     20.875670495    20.817838540
        jobs2
        seconds elapsed        18.529488399     18.720566469    16.356103108
        jobs3
        seconds elapsed        18.991159531     18.991340812    16.766216066
        jobs4
        seconds elapsed        19.560643828     19.551323547    16.246621715
        jobs5
        seconds elapsed        24.746498464     25.221646740    20.696112444
        jobs6
        seconds elapsed        28.258181828     28.289765505    22.885688857
        jobs7
        seconds elapsed        32.632490241     31.909125381    26.272753738
        jobs8
        seconds elapsed        35.651403851     36.027596308    29.108024711
        jobs9
        seconds elapsed        40.569362365     40.024227989    32.898204012
        jobs10
        seconds elapsed        44.673112304     43.874898137    35.632952191
      
      Please see
        Link: http://marc.info/?l=linux-kernel&m=146166970727530
        Link: http://marc.info/?l=linux-kernel&m=146174716719650
      for more test results (under low memory conditions).
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Suggested-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      da9556a2
    • S
      zsmalloc: require GFP in zs_malloc() · d0d8da2d
      Sergey Senozhatsky committed
      Pass GFP flags to zs_malloc() instead of using a fixed mask supplied to
      zs_create_pool(), so we can be more flexible, but, more importantly, we
      need this to switch zram to per-cpu compression streams -- zram will try
      to allocate a handle with preemption disabled in a fast path and switch to
      a slow path (using different gfp mask) if the fast one has failed.
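      
      The fast-path/slow-path pattern mentioned above might look roughly like
      the following userspace sketch.  It is an illustration under stated
      assumptions, not zram code: enum alloc_flags, pool_alloc(),
      alloc_handle() and the under_pressure toggle are hypothetical names
      standing in for GFP masks, zs_malloc() and memory pressure.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical flags standing in for GFP masks; not kernel definitions. */
enum alloc_flags {
	ALLOC_NOWAIT,     /* must not sleep: usable with preemption disabled */
	ALLOC_MAY_SLEEP   /* may block: slow path only */
};

/* Toggle that models memory pressure for the non-sleeping path. */
bool under_pressure;

/* Stand-in allocator that takes its flags per call, not per pool. */
void *pool_alloc(size_t size, enum alloc_flags flags)
{
	if (flags == ALLOC_NOWAIT && under_pressure)
		return NULL;      /* cannot wait, so fail quickly */
	return malloc(size);
}

/* Try the fast path first; fall back to the slow path if it fails. */
void *alloc_handle(size_t size, bool *used_slow_path)
{
	void *handle;

	assert(used_slow_path != NULL);
	*used_slow_path = false;

	handle = pool_alloc(size, ALLOC_NOWAIT);
	if (!handle) {
		/*
		 * Assumption: a real caller would re-enable preemption
		 * before retrying with a mask that is allowed to sleep.
		 */
		*used_slow_path = true;
		handle = pool_alloc(size, ALLOC_MAY_SLEEP);
	}
	return handle;
}
```

      Because the flags are chosen per call, the same pool can serve both
      the non-sleeping attempt and the sleeping retry, which a mask fixed at
      pool creation time would not allow.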
      
      Apart from that, this also aligns the zs_malloc() interface with zpool/zbud.
      
      [sergey.senozhatsky@gmail.com: pass GFP flags to zs_malloc() instead of using a fixed mask]
        Link: http://lkml.kernel.org/r/20160429150942.GA637@swordfish
      Link: http://lkml.kernel.org/r/20160429150942.GA637@swordfish
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d0d8da2d
    • M
      zsmalloc: remove unused pool param in obj_free · 1ee47165
      Minchan Kim committed
      Remove the unused pool param in obj_free().
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1ee47165
    • M
      zsmalloc: reorder function parameters · 251cbb95
      Minchan Kim committed
      Clean up function parameter ordering so that higher-level data
      structures come first.
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      251cbb95
    • M
      zsmalloc: clean up many BUG_ON · 830e4bc5
      Minchan Kim committed
      There are many BUG_ON calls in zsmalloc.c, which is not recommended, so
      replace them with alternatives.
      
      The usual rules are as follows:
      
      1. avoid BUG_ON if possible. Instead, use VM_BUG_ON or VM_BUG_ON_PAGE
      
      2. use VM_BUG_ON_PAGE if we need to see struct page's fields
      
      3. use those assertions in primitive functions so higher functions can
         rely on the assertion in the primitive function.
      
      4. Don't use an assertion if the following instruction can trigger an Oops
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      830e4bc5
    • M
      zsmalloc: use first_page rather than page · a4209467
      Minchan Kim committed
      Clean up the "struct page" function parameter.  Many zsmalloc functions
      expect the page parameter to be the first page, so name it "first_page"
      rather than "page" for code readability.
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a4209467
    • A
      kasan/tests: add tests for user memory access functions · eae08dca
      Andrey Ryabinin committed
      Add some tests for the newly-added user memory access API.
      
      Link: http://lkml.kernel.org/r/1462538722-1574-1-git-send-email-aryabinin@virtuozzo.com
      Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      eae08dca
    • A
      x86/kasan: instrument user memory access API · 1771c6e1
      Andrey Ryabinin committed
      Exchange between user and kernel memory is coded in assembly language,
      which means that such accesses won't be spotted by KASAN, as the
      compiler instruments only C code.
      
      Add explicit KASAN checks to the user memory access API to ensure that
      userspace writes to (or reads from) valid kernel memory.
      
      Note: Unlike the others, strncpy_from_user() is written mostly in C and KASAN
      sees memory accesses in it.  However, it makes sense to add an explicit
      check for all @count bytes that *potentially* could be written to the
      kernel.
      
      [aryabinin@virtuozzo.com: move kasan check under the condition]
        Link: http://lkml.kernel.org/r/1462869209-21096-1-git-send-email-aryabinin@virtuozzo.com
      Link: http://lkml.kernel.org/r/1462538722-1574-4-git-send-email-aryabinin@virtuozzo.com
      Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1771c6e1
    • A
      mm/kasan: add API to check memory regions · 64f8ebaf
      Andrey Ryabinin committed
      Memory access coded in assembly won't be seen by KASAN, as the compiler
      can instrument only C code.  Add a kasan_check_[read,write]() API, which is
      going to be used to check a certain memory range.
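      
      The idea of an explicit range check placed in front of an
      uninstrumented copy can be sketched in userspace as follows.  This is a
      simplified model, not the KASAN implementation: it keeps one shadow
      byte per data byte (real KASAN uses a more compact shadow encoding),
      and region, shadow, poison_range(), check_range() and
      checked_copy_in() are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define REGION_SIZE 64

/* Model: one shadow byte per data byte; nonzero means "poisoned". */
unsigned char region[REGION_SIZE];
unsigned char shadow[REGION_SIZE];

/* Mark [offset, offset + len) as invalid to access. */
void poison_range(size_t offset, size_t len)
{
	assert(offset <= REGION_SIZE && len <= REGION_SIZE - offset);
	memset(shadow + offset, 1, len);
}

/* Return true only if every byte of the range is inside and unpoisoned. */
bool check_range(size_t offset, size_t len)
{
	if (offset > REGION_SIZE || len > REGION_SIZE - offset)
		return false;
	for (size_t i = 0; i < len; i++) {
		if (shadow[offset + i])
			return false;
	}
	return true;
}

/*
 * Stand-in for a copy routine the compiler cannot instrument: the
 * whole destination range is checked explicitly before the copy runs.
 */
bool checked_copy_in(size_t offset, const void *src, size_t len)
{
	if (!check_range(offset, len))
		return false;     /* a real checker would report here */
	memcpy(region + offset, src, len);
	return true;
}
```

      Checking the full range up front means a bad access is caught even if
      the copy itself would only have touched part of the range.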
      
      Link: http://lkml.kernel.org/r/1462538722-1574-3-git-send-email-aryabinin@virtuozzo.com
      Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Acked-by: Alexander Potapenko <glider@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      64f8ebaf
    • A
      mm/kasan: print name of mem[set,cpy,move]() caller in report · 936bb4bb
      Andrey Ryabinin committed
      When a bogus memory access happens in mem[set,cpy,move](), it's usually
      the caller's fault.  So don't blame mem[set,cpy,move]() in the bug
      report; blame the caller instead.
      
      Before:
        BUG: KASAN: out-of-bounds access in memset+0x23/0x40 at <address>
      After:
        BUG: KASAN: out-of-bounds access in <memset_caller> at <address>
      
      Link: http://lkml.kernel.org/r/1462538722-1574-2-git-send-email-aryabinin@virtuozzo.com
      Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Acked-by: Alexander Potapenko <glider@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      936bb4bb
    • A
      mm, kasan: add a ksize() test · 96fe805f
      Alexander Potapenko committed
      Add a test that makes sure ksize() unpoisons the whole chunk.
      Signed-off-by: Alexander Potapenko <glider@google.com>
      Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Andrey Konovalov <adech.fo@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Konstantin Serebryany <kcc@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      96fe805f
    • A
      mm, kasan: don't call kasan_krealloc() from ksize(). · 4ebb31a4
      Alexander Potapenko committed
      Instead of calling kasan_krealloc(), which replaces the memory
      allocation stack ID (if stack depot is used), just unpoison the whole
      memory chunk.
      Signed-off-by: Alexander Potapenko <glider@google.com>
      Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Andrey Konovalov <adech.fo@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Konstantin Serebryany <kcc@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4ebb31a4
    • A
      mm: kasan: initial memory quarantine implementation · 55834c59
      Alexander Potapenko committed
      Quarantine isolates freed objects in a separate queue.  The objects are
      returned to the allocator later, which helps to detect use-after-free
      errors.
      
      When the object is freed, its state changes from KASAN_STATE_ALLOC to
      KASAN_STATE_QUARANTINE.  The object is poisoned and put into quarantine
      instead of being returned to the allocator, therefore every subsequent
      access to that object triggers a KASAN error, and the error handler is
      able to say where the object has been allocated and deallocated.
      
      When it's time for the object to leave quarantine, its state becomes
      KASAN_STATE_FREE and it's returned to the allocator.  From now on the
      allocator may reuse it for another allocation.  Before that happens,
      it's still possible to detect a use-after-free on that object (it
      retains the allocation/deallocation stacks).
      
      When the allocator reuses this object, the shadow is unpoisoned and old
      allocation/deallocation stacks are wiped.  Therefore a use of this
      object, even an incorrect one, won't trigger an ASan warning.
      
      Without the quarantine, it's not guaranteed that the objects aren't
      reused immediately, that's why the probability of catching a
      use-after-free is lower than with quarantine in place.
      
      Freed objects are first added to per-cpu quarantine queues.  When a
      cache is destroyed or memory shrinking is requested, the objects are
      moved into the global quarantine queue.  Whenever a kmalloc call allows
      memory reclaiming, the oldest objects are popped out of the global queue
      until the total size of objects in quarantine is less than 3/4 of the
      maximum quarantine size (which is a fraction of installed physical
      memory).
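      
      The reduction step described above can be modelled in userspace as a
      FIFO of freed-object sizes.  This is a sketch under assumptions, not
      the kernel code: the byte limit, the ring-buffer layout and all names
      are hypothetical, and starting the reduction only once the total
      exceeds the maximum is an assumption of this model rather than
      something stated above.

```c
#include <assert.h>
#include <stddef.h>

#define QUEUE_SLOTS          16
#define MAX_QUARANTINE_BYTES 1000   /* hypothetical limit */
#define LOW_WATERMARK_BYTES  (MAX_QUARANTINE_BYTES * 3 / 4)

/* Ring buffer of freed-object sizes; the oldest entry sits at q_head. */
size_t q_sizes[QUEUE_SLOTS];
size_t q_head, q_count, q_total_bytes, q_returned;

/* Park a freed object in quarantine; returns -1 if the model is full. */
int quarantine_put(size_t size)
{
	if (q_count == QUEUE_SLOTS)
		return -1;
	q_sizes[(q_head + q_count) % QUEUE_SLOTS] = size;
	q_count++;
	q_total_bytes += size;
	return 0;
}

/* Hand the oldest objects back until the total drops below 3/4 of max. */
void quarantine_reduce(void)
{
	/* Assumption: reduction only starts above the maximum size. */
	if (q_total_bytes <= MAX_QUARANTINE_BYTES)
		return;

	while (q_count > 0 && q_total_bytes >= LOW_WATERMARK_BYTES) {
		assert(q_total_bytes >= q_sizes[q_head]);
		q_total_bytes -= q_sizes[q_head];
		q_head = (q_head + 1) % QUEUE_SLOTS;
		q_count--;
		q_returned++;   /* object would go back to the allocator */
	}
}
```

      Draining to a lower watermark rather than to just under the maximum
      leaves headroom, so the next few frees do not immediately trigger
      another reduction.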
      
      As long as an object remains in the quarantine, KASAN is able to report
      accesses to it, so the chance of reporting a use-after-free is
      increased.  Once the object leaves quarantine, the allocator may reuse
      it, in which case the object is unpoisoned and KASAN can't detect
      incorrect accesses to it.
      
      Right now quarantine support is only enabled in SLAB allocator.
      Unification of KASAN features in SLAB and SLUB will be done later.
      
      This patch is based on the "mm: kasan: quarantine" patch originally
      prepared by Dmitry Chernenkov.  A number of improvements have been
      suggested by Andrey Ryabinin.
      
      [glider@google.com: v9]
        Link: http://lkml.kernel.org/r/1462987130-144092-1-git-send-email-glider@google.com
      Signed-off-by: Alexander Potapenko <glider@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Andrey Konovalov <adech.fo@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Konstantin Serebryany <kcc@google.com>
      Cc: Dmitry Chernenkov <dmitryc@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      55834c59
    • Y
      mm: call page_ext_init() after all struct pages are initialized · b8f1a75d
      Yang Shi committed
      When DEFERRED_STRUCT_PAGE_INIT is enabled, only a subset of the memmap
      is initialized at boot; the rest is initialized in parallel by
      starting a one-off "pgdatinitX" kernel thread for each node X.
      
      If page_ext_init() is called before that, some pages will not have a
      valid extension, which may lead to the kernel oops below when booting:
      
        BUG: unable to handle kernel NULL pointer dereference at           (null)
        IP: [<ffffffff8118d982>] free_pcppages_bulk+0x2d2/0x8d0
        PGD 0
        Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
        Modules linked in:
        CPU: 11 PID: 106 Comm: pgdatinit1 Not tainted 4.6.0-rc5-next-20160427 #26
        Hardware name: Intel Corporation S5520HC/S5520HC, BIOS S5500.86B.01.10.0025.030220091519 03/02/2009
        task: ffff88017c080040 ti: ffff88017c084000 task.ti: ffff88017c084000
        RIP: 0010:[<ffffffff8118d982>]  [<ffffffff8118d982>] free_pcppages_bulk+0x2d2/0x8d0
        RSP: 0000:ffff88017c087c48  EFLAGS: 00010046
        RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000001
        RDX: 0000000000000980 RSI: 0000000000000080 RDI: 0000000000660401
        RBP: ffff88017c087cd0 R08: 0000000000000401 R09: 0000000000000009
        R10: ffff88017c080040 R11: 000000000000000a R12: 0000000000000400
        R13: ffffea0019810000 R14: ffffea0019810040 R15: ffff88066cfe6080
        FS:  0000000000000000(0000) GS:ffff88066cd40000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000000000000000 CR3: 0000000002406000 CR4: 00000000000006e0
        Call Trace:
          free_hot_cold_page+0x192/0x1d0
          __free_pages+0x5c/0x90
          __free_pages_boot_core+0x11a/0x14e
          deferred_free_range+0x50/0x62
          deferred_init_memmap+0x220/0x3c3
          kthread+0xf8/0x110
          ret_from_fork+0x22/0x40
        Code: 49 89 d4 48 c1 e0 06 49 01 c5 e9 de fe ff ff 4c 89 f7 44 89 4d b8 4c 89 45 c0 44 89 5d c8 48 89 4d d0 e8 62 c7 07 00 48 8b 4d d0 <48> 8b 00 44 8b 5d c8 4c 8b 45 c0 44 8b 4d b8 a8 02 0f 84 05 ff
        RIP  [<ffffffff8118d982>] free_pcppages_bulk+0x2d2/0x8d0
         RSP <ffff88017c087c48>
        CR2: 0000000000000000
      
      Move page_ext_init() after page_alloc_init_late() to make sure the page
      extension is set up for all pages.
      
      Link: http://lkml.kernel.org/r/1463696006-31360-1-git-send-email-yang.shi@linaro.org
      Signed-off-by: Yang Shi <yang.shi@linaro.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • D
      mm, migrate: increment fail count on ENOMEM · dfef2ef4
      David Rientjes committed
      If page migration fails due to -ENOMEM, nr_failed should still be
      incremented for proper statistics.
      
      This was encountered recently when all page migration vmstats showed 0,
      from which it was inferred that migrate_pages() was never called, although
      in reality the first page migration failed because compaction_alloc()
      failed to find a migration target.
      
      This patch increments nr_failed so the vmstat is properly accounted on
      ENOMEM.
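
      The accounting change can be pictured with a small user-space sketch.
      Everything here is hypothetical (the struct, the function name, and the
      per-page result array are not kernel code), and it assumes that an
      -ENOMEM result ends the loop early; only the counting is the point.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical counters standing in for the migration vmstats. */
struct migrate_stats {
	unsigned long succeeded;
	unsigned long failed;
};

/* rc[] holds made-up per-page results: 0 on success, negative errno on failure. */
static int migrate_pages_sketch(const int *rc, size_t n, struct migrate_stats *st)
{
	assert(st != NULL);

	for (size_t i = 0; i < n; i++) {
		switch (rc[i]) {
		case 0:
			st->succeeded++;
			break;
		case -ENOMEM:
			st->failed++;	/* count the failure before giving up */
			return -ENOMEM;	/* assumption: remaining pages are not tried */
		default:
			st->failed++;
			break;
		}
	}
	return 0;
}
```

      With the increment in the -ENOMEM branch, a run whose very first page
      fails for lack of memory reports one failure instead of all zeros.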
      
      Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1605191510230.32658@chino.kir.corp.google.com
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • C
      mm/compaction.c: fix zoneindex in kcompactd() · 6cd9dc3e
      Chen Feng committed
      While testing kcompactd on my platform (3G of memory, DMA zone only), I
      found that kcompactd never wakes up.  It seems the zone index has already
      been decremented by 1 earlier, so the traversal here should use <=.
      
      This fixes a regression where kswapd could previously compact, but
      kcompactd could not.  It is not a crash fix, though.
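
      The off-by-one can be shown with a user-space sketch.  The function
      names and the suitable[] array are hypothetical stand-ins for whatever
      per-zone test the kernel applies; the only point is that when the upper
      bound is already the index of the highest zone rather than a count, the
      loop has to use <=.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_NR_ZONES 4	/* illustrative value, not tied to any kernel config */

/* Buggy form: stops one short, so zone `classzone_idx` itself is never checked. */
static bool work_needed_exclusive(const bool *suitable, int classzone_idx)
{
	assert(classzone_idx >= 0 && classzone_idx < MAX_NR_ZONES);

	for (int zoneid = 0; zoneid < classzone_idx; zoneid++)
		if (suitable[zoneid])
			return true;
	return false;
}

/* Fixed form: classzone_idx is the index of the highest zone, so include it. */
static bool work_needed_inclusive(const bool *suitable, int classzone_idx)
{
	assert(classzone_idx >= 0 && classzone_idx < MAX_NR_ZONES);

	for (int zoneid = 0; zoneid <= classzone_idx; zoneid++)
		if (suitable[zoneid])
			return true;
	return false;
}
```

      On a system with a single zone the highest index is 0, so the exclusive
      loop body never runs; that matches the "never wakes up" symptom.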
      
      [akpm@linux-foundation.org: fix kcompactd_do_work() as well, per Hugh]
      Link: http://lkml.kernel.org/r/1463659121-84124-1-git-send-email-puck.chen@hisilicon.com
      Fixes: accf6242 ("mm, kswapd: replace kswapd compaction with waking up kcompactd")
      Signed-off-by: Chen Feng <puck.chen@hisilicon.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Zhuangluan Su <suzhuangluan@hisilicon.com>
      Cc: Yiping Xu <xuyiping@hisilicon.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Y
      mm: page_is_guard(): return false when page_ext arrays are not allocated yet · 0bb2fd13
      Yang Shi committed
      When enabling the below kernel configs:
      
      CONFIG_DEFERRED_STRUCT_PAGE_INIT
      CONFIG_DEBUG_PAGEALLOC
      CONFIG_PAGE_EXTENSION
      CONFIG_DEBUG_VM
      
      kernel bootup may fail due to the following oops:
      
        BUG: unable to handle kernel NULL pointer dereference at           (null)
        IP: [<ffffffff8118d982>] free_pcppages_bulk+0x2d2/0x8d0
        PGD 0
        Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
        Modules linked in:
        CPU: 11 PID: 106 Comm: pgdatinit1 Not tainted 4.6.0-rc5-next-20160427 #26
        Hardware name: Intel Corporation S5520HC/S5520HC, BIOS S5500.86B.01.10.0025.030220091519 03/02/2009
        task: ffff88017c080040 ti: ffff88017c084000 task.ti: ffff88017c084000
        RIP: 0010:[<ffffffff8118d982>]  [<ffffffff8118d982>] free_pcppages_bulk+0x2d2/0x8d0
        RSP: 0000:ffff88017c087c48  EFLAGS: 00010046
        RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000001
        RDX: 0000000000000980 RSI: 0000000000000080 RDI: 0000000000660401
        RBP: ffff88017c087cd0 R08: 0000000000000401 R09: 0000000000000009
        R10: ffff88017c080040 R11: 000000000000000a R12: 0000000000000400
        R13: ffffea0019810000 R14: ffffea0019810040 R15: ffff88066cfe6080
        FS:  0000000000000000(0000) GS:ffff88066cd40000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000000000000000 CR3: 0000000002406000 CR4: 00000000000006e0
        Call Trace:
          free_hot_cold_page+0x192/0x1d0
          __free_pages+0x5c/0x90
          __free_pages_boot_core+0x11a/0x14e
          deferred_free_range+0x50/0x62
          deferred_init_memmap+0x220/0x3c3
          kthread+0xf8/0x110
          ret_from_fork+0x22/0x40
        Code: 49 89 d4 48 c1 e0 06 49 01 c5 e9 de fe ff ff 4c 89 f7 44 89 4d b8 4c 89 45 c0 44 89 5d c8 48 89 4d d0 e8 62 c7 07 00 48 8b 4d d0 <48> 8b 00 44 8b 5d c8 4c 8b 45 c0 44 8b 4d b8 a8 02 0f 84 05 ff
        RIP  [<ffffffff8118d982>] free_pcppages_bulk+0x2d2/0x8d0
         RSP <ffff88017c087c48>
        CR2: 0000000000000000
      
      The problem is that lookup_page_ext() returns NULL and page_is_guard()
      then tries to dereference it while freeing the page.
      
      page_is_guard() depends on the PAGE_EXT_DEBUG_GUARD bit of the page
      extension flags, but page freeing might get here before the page_ext
      arrays are allocated, when a range of pages is fed to the allocator for
      the first time during bootup or memory hotplug.
      
      When lookup_page_ext() returns NULL, page_is_guard() should just return
      false instead of checking PAGE_EXT_DEBUG_GUARD unconditionally.
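
      A minimal user-space sketch of that NULL check follows.  The structures
      are simplified stand-ins (the real struct page has no such ext pointer,
      and the bit number is chosen arbitrarily), so this only illustrates the
      shape of the guard, not the kernel implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PAGE_EXT_DEBUG_GUARD 0	/* bit number picked for this sketch */

struct page_ext {
	unsigned long flags;
};

/* Hypothetical: ext stays NULL until the page_ext arrays are allocated. */
struct page {
	struct page_ext *ext;
};

static struct page_ext *lookup_page_ext(const struct page *page)
{
	assert(page != NULL);
	return page->ext;
}

static bool page_is_guard(const struct page *page)
{
	struct page_ext *page_ext = lookup_page_ext(page);

	/* No extension yet: the page cannot have been marked as a guard page. */
	if (page_ext == NULL)
		return false;

	return (page_ext->flags >> PAGE_EXT_DEBUG_GUARD) & 1UL;
}
```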
      
      Link: http://lkml.kernel.org/r/1463610225-29060-1-git-send-email-yang.shi@linaro.org
      Signed-off-by: Yang Shi <yang.shi@linaro.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • D
      mm, thp: khugepaged should scan when sleep value is written · f0508977
      David Rientjes committed
      If a large value is written to scan_sleep_millisecs, for example, that
      period must lapse before khugepaged will wake up for periodic
      collapsing.
      
      If this value is tuned to 1 day, for example, and then re-tuned to its
      default 10s, khugepaged will still wait for a day before scanning again.
      
      This patch causes khugepaged to wake up immediately when the value is
      changed and then sleep until that value is rewritten or the new value
      lapses.
      
      Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1605181453200.4786@chino.kir.corp.google.com
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • N
      MM: increase safety margin provided by PF_LESS_THROTTLE · a53eaff8
      NeilBrown committed
      When nfsd is exporting a filesystem over NFS which is then NFS-mounted
      on the local machine there is a risk of deadlock.  This happens when
      there are lots of dirty pages in the NFS filesystem and they cause NFSD
      to be throttled, either in throttle_vm_writeout() or in
      balance_dirty_pages().
      
      To avoid this problem the PF_LESS_THROTTLE flag is set for NFSD threads
      and it provides a 25% increase to the limits that affect NFSD.  Any
      process writing to an NFS filesystem will be throttled well before the
      number of dirty NFS pages reaches the limit imposed on NFSD, so NFSD
      will not deadlock on pages that it needs to write out.  At least it
      shouldn't.
      
      All processes are allowed a small excess margin to avoid performing too
      many calculations: ratelimit_pages.
      
      ratelimit_pages is set so that if a thread on every CPU uses the entire
      margin, the total will only go 3% over the limit, and this is much less
      than the 25% bonus that PF_LESS_THROTTLE provides, so this margin
      shouldn't be a problem.  But it is.
      
      The "total memory" that these 3% and 25% are calculated against are not
      really total memory but are "global_dirtyable_memory()" which doesn't
      include anonymous memory, just free memory and page-cache memory.
      
      The "ratelimit_pages" number is based on whatever the
      global_dirtyable_memory was on the last CPU hot-plug, which might not be
      what you expect, but is probably close to the total freeable memory.
      
      The throttle threshold uses the global_dirtyable_memory at the moment
      when the throttling happens, which could be much less than at the last
      CPU hotplug.  So if lots of anonymous memory has been allocated, thus
      pushing out lots of page-cache pages, then NFSD might end up being
      throttled due to dirty NFS pages because the "25%" bonus it gets is
      calculated against a rather small amount of dirtyable memory, while the
      "3%" margin that other processes are allowed to dirty without penalty is
      calculated against a much larger number.
      
      To remove this possibility of deadlock we need to make sure that the
      margin granted to PF_LESS_THROTTLE exceeds that rate-limit margin.
      Simply adding ratelimit_pages isn't enough as that should be multiplied
      by the number of cpus.
      
      So add "global_wb_domain.dirty_limit / 32" as that more accurately
      reflects the current total over-shoot margin.  This ensures that the
      number of dirty NFS pages never gets so high that nfsd will be throttled
      waiting for them to be written.
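
      The arithmetic can be sketched as follows.  The function names are
      hypothetical and the numbers in the usage note are made up; the sketch
      only combines the two terms named above (the 25% bonus and
      dirty_limit / 32) and does not reproduce the kernel's threshold code.

```c
#include <assert.h>
#include <limits.h>

/* Threshold with only the 25% PF_LESS_THROTTLE bonus. */
static unsigned long thresh_bonus_only(unsigned long thresh)
{
	assert(thresh <= ULONG_MAX - thresh / 4);	/* guard against overflow */
	return thresh + thresh / 4;
}

/* Threshold with the extra margin derived from the global dirty limit. */
static unsigned long thresh_with_margin(unsigned long thresh,
					unsigned long global_dirty_limit)
{
	return thresh_bonus_only(thresh) + global_dirty_limit / 32;
}
```

      For example, with a threshold of 1000 pages and a global dirty limit of
      3200 pages, the bonus alone gives 1250 pages, while the added margin
      raises it to 1350.  The extra term is taken from the global dirty limit
      rather than from the task's own threshold.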
      
      Link: http://lkml.kernel.org/r/87futgowwv.fsf@notabene.neil.brown.name
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • N
      mm: check_new_page_bad() directly returns in __PG_HWPOISON case · e570f56c
      Naoya Horiguchi committed
      Currently we check page->flags twice for "HWPoisoned" case of
      check_new_page_bad(), which can cause a race with unpoisoning.
      
      This race unnecessarily taints the kernel with "BUG: Bad page state".
      check_new_page_bad() is the only caller of bad_page() which is
      interested in __PG_HWPOISON, so let's move the hwpoison related code in
      bad_page() to it.
      
      Link: http://lkml.kernel.org/r/20160518100949.GA17299@hori1.linux.bs1.fc.nec.co.jp
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • S
      mm, kasan: fix to call kasan_free_pages() after poisoning page · 29b52de1
      seokhoon.yoon committed
      When CONFIG_PAGE_POISONING and CONFIG_KASAN are both enabled, the code
      flow of free_pages_prepare() is as follows:
      
        1) kmemcheck_free_shadow()
        2) kasan_free_pages()
          - marks the page's shadow bytes as freed
        3) kernel_poison_pages()
        3.1) checks via KASAN whether the access to the page is valid
          ---> an error occurs, because KASAN considers it an invalid access
        3.2) poisons the page
        4) kernel_map_pages()
      
      So kasan_free_pages() should be called after poisoning the page.
      
      Link: http://lkml.kernel.org/r/1463220405-7455-1-git-send-email-iamyooon@gmail.com
      Signed-off-by: seokhoon.yoon <iamyooon@gmail.com>
      Cc: Andrey Ryabinin <a.ryabinin@samsung.com>
      Cc: Laura Abbott <labbott@fedoraproject.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • M
      mm: disable fault around on emulated access bit architecture · d0834a6c
      Minchan Kim committed
      fault_around aims to reduce minor faults on file-backed pages by
      speculatively mapping ptes ahead and relying on readahead logic.  However,
      on architectures without a hardware access bit the benefit is very limited
      because they must emulate the young bit with minor faults for reclaim's
      page aging algorithm.  IOW, we cannot reduce minor faults on those
      architectures.
      
      I did a quick test on my ARM machine.
      
      The test mmaps a 512M file on an eSATA drive and reads every word
      sequentially, 4 times.  The stddev is stable.
      
        = fault_around 4096 =
        elapsed time(usec): 6747645
      
        = fault_around 65536 =
        elapsed time(usec): 6709263
      
        0.5% gain.
      
      Even when I tested it with eMMC there was no gain, I guess because with
      slow storage the major fault is the dominant factor.
      
      Also, fault_around has the side effect of shrinking slab more
      aggressively and causing higher vmpressure, so if such speculation fails,
      it can evict more slab, which can result in page I/O (e.g., inode cache).
      In the end, that would void any benefit of fault_around.
      
      So let's make the default "disabled" on those architectures.
      
      Link: http://lkml.kernel.org/r/20160518014229.GB21538@bbox
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • K
      mm: make faultaround produce old ptes · 5c0a85fa
      Kirill A. Shutemov committed
      Currently, the faultaround code produces young ptes.  This can screw up
      vmscan behaviour[1], as it makes vmscan think that these pages are hot
      and not push them out on the first round.
      
      During sparse file access faultaround gets more pages mapped and all of
      them are young.  Under memory pressure, this makes vmscan swap out anon
      pages instead, or to drop other page cache pages which otherwise stay
      resident.
      
      Modify faultaround to produce old ptes, so they can easily be reclaimed
      under memory pressure.
      
      This can to some extent defeat the purpose of faultaround on machines
      without a hardware accessed bit, as it will not help us reduce the
      number of minor page faults.
      
      We may want to disable faultaround on such machines altogether, but
      that's a subject for a separate patchset.
      
      Minchan:
       "I tested 512M mmap sequential word read test on non-HW access bit
        system (i.e., ARM) and confirmed it doesn't increase minor fault any
        more.
      
        old: 4096 fault_around
        minor fault: 131291
        elapsed time: 6747645 usec
      
        new: 65536 fault_around
        minor fault: 131291
        elapsed time: 6709263 usec
      
        0.56% benefit"
      
      [1] https://lkml.kernel.org/r/1460992636-711-1-git-send-email-vinmenon@codeaurora.org
      
      Link: http://lkml.kernel.org/r/1463488366-47723-1-git-send-email-kirill.shutemov@linux.intel.com
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Tested-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vinayak Menon <vinmenon@codeaurora.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • S
      mm: use phys_addr_t for reserve_bootmem_region() arguments · 4b50bcc7
      Stefan Bader committed
      Since commit 92923ca3 ("mm: meminit: only set page reserved in the
      memblock region") the reserved bit is set on reserved memblock regions.
      However start and end address are passed as unsigned long.  This is only
      32bit on i386, so it can end up marking the wrong pages reserved for
      ranges at 4GB and above.
      
      This was observed on a 32bit Xen dom0 which was booted with initial
      memory set to a value below 4G but allowing to balloon in memory
      (dom0_mem=1024M for example).  This would define a reserved bootmem
      region for the additional memory (for example on an 8GB system there was
      a reserved region covering the 4GB-8GB range).  But since the addresses
      were passed on as unsigned long, this was actually marking all pages
      from 0 to 4GB as reserved.
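
      The truncation is easy to reproduce in user space.  In this sketch
      uint32_t models the 32-bit unsigned long of i386, and the range end is
      assumed to be the inclusive last byte (8GB - 1); the struct and function
      names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t phys_addr_t;	/* wide enough for addresses at 4GB and above */

struct range32 {
	uint32_t start;
	uint32_t end;
};

/* Models passing [start, end] through 32-bit `unsigned long` parameters. */
static struct range32 pass_as_ulong32(phys_addr_t start, phys_addr_t end)
{
	struct range32 r;

	assert(start <= end);
	r.start = (uint32_t)start;	/* upper 32 bits are silently dropped */
	r.end = (uint32_t)end;
	return r;
}
```

      Passing the 4GB-8GB range (0x100000000 to 0x1FFFFFFFF) yields
      0x00000000 to 0xFFFFFFFF, which is the 0-4GB range described above.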
      
      Fixes: 92923ca3 ("mm: meminit: only set page reserved in the memblock region")
      Link: http://lkml.kernel.org/r/1463491221-10573-1-git-send-email-stefan.bader@canonical.com
      Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
      Cc: <stable@vger.kernel.org>	[4.2+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • O
      userfaultfd: don't pin the user memory in userfaultfd_file_create() · d2005e3f
      Oleg Nesterov committed
      userfaultfd_file_create() increments mm->mm_users; this means that the
      memory won't be unmapped/freed if mm owner exits/execs, and UFFDIO_COPY
      after that can populate the orphaned mm more.
      
      Change userfaultfd_file_create() and userfaultfd_ctx_put() to use
      mm->mm_count to pin mm_struct.  This means that
      atomic_inc_not_zero(mm->mm_users) is needed when we are going to
      actually play with this memory.  The handle_userfault() path is the
      exception: it doesn't need this because the caller must already hold a
      reference.
      
      The patch adds a new trivial helper, mmget_not_zero(), which may gain
      more users.
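
      The increment-unless-zero idea behind such a helper can be sketched with
      C11 atomics.  This is a user-space analogue under stated assumptions:
      struct mm_sketch, inc_not_zero() and mmget_not_zero_sketch() are
      hypothetical names, and the kernel's own atomic primitives are not used.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct mm_sketch {
	atomic_int mm_users;	/* users of the address space */
	atomic_int mm_count;	/* references to the structure itself */
};

/* Increment *v unless it is zero; returns true if a reference was taken. */
static bool inc_not_zero(atomic_int *v)
{
	int old = atomic_load(v);

	while (old != 0) {
		if (atomic_compare_exchange_weak(v, &old, old + 1))
			return true;
		/* A failed compare-exchange reloaded `old`; retry with it. */
	}
	return false;
}

static bool mmget_not_zero_sketch(struct mm_sketch *mm)
{
	assert(mm != NULL);
	return inc_not_zero(&mm->mm_users);
}
```

      A holder of only mm_count would call this before touching the memory
      and back off if it returns false, since the address space is then
      already being torn down.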
      
      Link: http://lkml.kernel.org/r/20160516172254.GA8595@redhat.com
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • R
      mm/memblock.c: remove unnecessary always-true comparison · cd33a76b
      Richard Leitner committed
      Comparing a u64 variable with >= 0 is always true, so the comparison can
      be removed.  This issue was detected using the -Wtype-limits gcc flag.
      
      This patch fixes the following type-limits warning:
      
        mm/memblock.c: In function `__next_reserved_mem_region':
        mm/memblock.c:843:11: warning: comparison of unsigned expression >= 0 is always true [-Wtype-limits]
          if (*idx >= 0 && *idx < type->cnt) {
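
      Why the upper-bound test alone is enough can be seen in a small sketch.
      The function names are hypothetical; the point is that an unsigned index
      has no negative values, and a "negative" input wraps to a very large
      number that the < cnt check already rejects.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Range check for an unsigned index: only the upper bound is needed. */
static bool idx_in_range(uint64_t idx, uint64_t cnt)
{
	return idx < cnt;
}

/* Small self-check covering both ends of the range. */
static void idx_in_range_examples(void)
{
	assert(idx_in_range(0, 4));
	assert(idx_in_range(3, 4));
	assert(!idx_in_range(4, 4));
	/* -1 converted to uint64_t becomes UINT64_MAX and is rejected. */
	assert(!idx_in_range((uint64_t)-1, 4));
}
```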
      
      Link: http://lkml.kernel.org/r/20160510103625.3a7f8f32@g0hl1n.net
      Signed-off-by: Richard Leitner <dev@g0hl1n.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • V
      z3fold: the 3-fold allocator for compressed pages · 9a001fc1
      Vitaly Wool committed
      This patch introduces z3fold, a special purpose allocator for storing
      compressed pages.  It is designed to store up to three compressed pages
      per physical page.  It is a ZBUD derivative which allows for a higher
      compression ratio while keeping the simplicity and determinism of its
      predecessor.
      
      This patch comes as a follow-up to the discussions at the Embedded Linux
      Conference in San Diego related to the talk [1].  The outcome of these
      discussions was that it would be good to have a compressed page
      allocator as stable and deterministic as zbud with a higher
      compression ratio.
      
      To keep the determinism and simplicity, z3fold, just like zbud, always
      stores an integral number of compressed pages per page, but it can store
      up to 3 pages unlike zbud which can store at most 2.  Therefore the
      compression ratio goes to around 2.6x while zbud's is around 1.7x.
      
      The patch is based on the latest linux.git tree.
      
      This version has been updated after testing on various simulators (e.g.
      ARM Versatile Express, MIPS Malta, x86_64/Haswell) and is based on
      comments from Dan Streetman [3].
      
      [1] https://openiotelc2016.sched.org/event/6DAC/swapping-and-embedded-compression-relieves-the-pressure-vitaly-wool-softprise-consulting-ou
      [2] https://lkml.org/lkml/2016/4/21/799
      [3] https://lkml.org/lkml/2016/5/4/852
      
      Link: http://lkml.kernel.org/r/20160509151753.ec3f9fda3c9898d31ff52a32@gmail.com
      Signed-off-by: Vitaly Wool <vitalywool@gmail.com>
      Cc: Seth Jennings <sjenning@redhat.com>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • A
      mm: thp: split_huge_pmd_address() comment improvement · d5ee7c3b
      Andrea Arcangeli committed
      The comment is partly wrong; this improves it by including the case of
      split_huge_pmd_address() being called by try_to_unmap_one() if
      TTU_SPLIT_HUGE_PMD is set.
      
      Link: http://lkml.kernel.org/r/1462547040-1737-4-git-send-email-aarcange@redhat.com
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • A
      mm: thp: microoptimize compound_mapcount() · 5f527c2b
      Andrea Arcangeli committed
      compound_mapcount() is only called after PageCompound() has already been
      checked by the caller, so there's no point in checking it again.  Gcc may
      optimize it away too because it's inline, but this removes the runtime
      check for sure and adds an assert instead.
      
      Link: http://lkml.kernel.org/r/1462547040-1737-3-git-send-email-aarcange@redhat.com
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • C
      vmstat: get rid of the ugly cpu_stat_off variable · 7b8da4c7
      Christoph Lameter committed
      The cpu_stat_off variable is unnecessary since we can check if a
      workqueue request is pending otherwise.  Removal of cpu_stat_off makes
      it pretty easy for the vmstat shepherd to ensure that the proper things
      happen.
      
      Removing the state also removes all races related to it.  Should a
      workqueue not be scheduled as needed for vmstat_update then the shepherd
      will notice and schedule it as needed.  Should a workqueue be
      unnecessarily scheduled then the vmstat updater will disable it.
      
      [akpm@linux-foundation.org: fix indentation, per Michal]
      Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1605061306460.17934@east.gentwo.org
      Signed-off-by: Christoph Lameter <cl@linux.com>
      Cc: Tejun Heo <htejun@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • G
      memcg: fix stale mem_cgroup_force_empty() comment · 51038171
      Greg Thelen committed
      Commit f61c42a7 ("memcg: remove tasks/children test from
      mem_cgroup_force_empty()") removed memory reparenting from the function.
      
      Fix the function's comment.
      
      Link: http://lkml.kernel.org/r/1462569810-54496-1-git-send-email-gthelen@google.com
      Signed-off-by: Greg Thelen <gthelen@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Y
      mm: use unsigned long constant for page flags · d2a1a1f0
      Yu Zhao committed
      struct page->flags is unsigned long, so when shifting bits we should use
      UL suffix to match it.
      
      Found this problem after I added 64-bit CPU specific page flags and
      failed to compile the kernel:
      
        mm/page_alloc.c: In function '__free_one_page':
        mm/page_alloc.c:672:2: error: integer overflow in expression [-Werror=overflow]
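
      The effect of the suffix can be illustrated in isolation.  The exact
      expression at mm/page_alloc.c:672 is not quoted above, so this is only a
      general sketch with a hypothetical helper name: a plain 1 is an int, and
      shifting it into or past the sign bit overflows, whereas 1UL has the
      same type as page->flags.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical helper: mask for flag bit `nr`, built from an unsigned long. */
static unsigned long flag_mask(unsigned int nr)
{
	/* The shift count must stay below the width of unsigned long. */
	assert(nr < sizeof(unsigned long) * CHAR_BIT);

	/*
	 * 1UL << nr is well defined for every valid bit number.  Writing
	 * 1 << 31 instead would overflow a 32-bit signed int.
	 */
	return 1UL << nr;
}
```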
      
      Link: http://lkml.kernel.org/r/1461971723-16187-1-git-send-email-yuzhao@google.com
      Signed-off-by: Yu Zhao <yuzhao@google.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • M
      mm: use existing helper to convert "on"/"off" to boolean · 2a138dc7
      Minfei Huang committed
      It's more convenient to use the existing helper function to convert the
      string "on"/"off" to a boolean.
      
      Link: http://lkml.kernel.org/r/1461908824-16129-1-git-send-email-mnghuan@gmail.com
      Signed-off-by: Minfei Huang <mnghuan@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • T
      mm,writeback: don't use memory reserves for wb_start_writeback · 78ebc2f7
      Tetsuo Handa committed
      When a writeback operation cannot make forward progress because memory
      allocation requests needed for doing I/O cannot be satisfied (e.g.
      under an OOM-livelock situation), we can observe a flood of order-0 page
      allocation failure messages caused by complete depletion of memory
      reserves.
      
      This is caused by unconditionally allocating "struct wb_writeback_work"
      objects using GFP_ATOMIC from PF_MEMALLOC context.
      
      __alloc_pages_nodemask() {
        __alloc_pages_slowpath() {
          __alloc_pages_direct_reclaim() {
            __perform_reclaim() {
              current->flags |= PF_MEMALLOC;
              try_to_free_pages() {
                do_try_to_free_pages() {
                  wakeup_flusher_threads() {
                    wb_start_writeback() {
                      kzalloc(sizeof(*work), GFP_ATOMIC) {
                        /* ALLOC_NO_WATERMARKS via PF_MEMALLOC */
                      }
                    }
                  }
                }
              }
              current->flags &= ~PF_MEMALLOC;
            }
          }
        }
      }
      
      Since I/O is stalling, allocating writeback requests forever will
      deplete memory reserves.  Fortunately, since wb_start_writeback() can
      fall back to wb_wakeup() when allocating "struct wb_writeback_work"
      fails, we don't need to allow wb_start_writeback() to use memory
      reserves.
      
        Mem-Info:
        active_anon:289393 inactive_anon:2093 isolated_anon:29
         active_file:10838 inactive_file:113013 isolated_file:859
         unevictable:0 dirty:108531 writeback:5308 unstable:0
         slab_reclaimable:5526 slab_unreclaimable:7077
         mapped:9970 shmem:2159 pagetables:2387 bounce:0
         free:3042 free_pcp:0 free_cma:0
        Node 0 DMA free:6968kB min:44kB low:52kB high:64kB active_anon:6056kB inactive_anon:176kB active_file:712kB inactive_file:744kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:756kB writeback:0kB mapped:736kB shmem:184kB slab_reclaimable:48kB slab_unreclaimable:208kB kernel_stack:160kB pagetables:144kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:9708 all_unreclaimable? yes
        lowmem_reserve[]: 0 1732 1732 1732
        Node 0 DMA32 free:5200kB min:5200kB low:6500kB high:7800kB active_anon:1151516kB inactive_anon:8196kB active_file:42640kB inactive_file:451076kB unevictable:0kB isolated(anon):116kB isolated(file):3564kB present:2080640kB managed:1775332kB mlocked:0kB dirty:433368kB writeback:21232kB mapped:39144kB shmem:8452kB slab_reclaimable:22056kB slab_unreclaimable:28100kB kernel_stack:20976kB pagetables:9404kB unstable:0kB bounce:0kB free_pcp:120kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:2701604 all_unreclaimable? no
        lowmem_reserve[]: 0 0 0 0
        Node 0 DMA: 25*4kB (UME) 16*8kB (UME) 3*16kB (UE) 5*32kB (UME) 2*64kB (UM) 2*128kB (ME) 2*256kB (ME) 1*512kB (E) 1*1024kB (E) 2*2048kB (ME) 0*4096kB = 6964kB
        Node 0 DMA32: 925*4kB (UME) 140*8kB (UME) 5*16kB (ME) 5*32kB (M) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 5060kB
        Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
        Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
        126847 total pagecache pages
        0 pages in swap cache
        Swap cache stats: add 0, delete 0, find 0/0
        Free swap  = 0kB
        Total swap = 0kB
        524157 pages RAM
        0 pages HighMem/MovableOnly
        76348 pages reserved
        0 pages hwpoisoned
        Out of memory: Kill process 4450 (file_io.00) score 998 or sacrifice child
        Killed process 4450 (file_io.00) total-vm:4308kB, anon-rss:100kB, file-rss:1184kB, shmem-rss:0kB
        kthreadd: page allocation failure: order:0, mode:0x2200020
        file_io.00: page allocation failure: order:0, mode:0x2200020
        CPU: 0 PID: 4457 Comm: file_io.00 Not tainted 4.5.0-rc7+ #45
        Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
        Call Trace:
          warn_alloc_failed+0xf7/0x150
          __alloc_pages_nodemask+0x23f/0xa60
          alloc_pages_current+0x87/0x110
          new_slab+0x3a1/0x440
          ___slab_alloc+0x3cf/0x590
          __slab_alloc.isra.64+0x18/0x1d
          kmem_cache_alloc+0x11c/0x150
          wb_start_writeback+0x39/0x90
          wakeup_flusher_threads+0x7f/0xf0
          do_try_to_free_pages+0x1f9/0x410
          try_to_free_pages+0x94/0xc0
          __alloc_pages_nodemask+0x566/0xa60
          alloc_pages_current+0x87/0x110
          __page_cache_alloc+0xaf/0xc0
          pagecache_get_page+0x88/0x260
          grab_cache_page_write_begin+0x21/0x40
          xfs_vm_write_begin+0x2f/0xf0
          generic_perform_write+0xca/0x1c0
          xfs_file_buffered_aio_write+0xcc/0x1f0
          xfs_file_write_iter+0x84/0x140
          __vfs_write+0xc7/0x100
          vfs_write+0x9d/0x190
          SyS_write+0x50/0xc0
          entry_SYSCALL_64_fastpath+0x12/0x6a
        Mem-Info:
        active_anon:293335 inactive_anon:2093 isolated_anon:0
         active_file:10829 inactive_file:110045 isolated_file:32
         unevictable:0 dirty:109275 writeback:822 unstable:0
         slab_reclaimable:5489 slab_unreclaimable:10070
         mapped:9999 shmem:2159 pagetables:2420 bounce:0
         free:3 free_pcp:0 free_cma:0
        Node 0 DMA free:12kB min:44kB low:52kB high:64kB active_anon:6060kB inactive_anon:176kB active_file:708kB inactive_file:756kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:756kB writeback:0kB mapped:736kB shmem:184kB slab_reclaimable:48kB slab_unreclaimable:7160kB kernel_stack:160kB pagetables:144kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:9844 all_unreclaimable? yes
        lowmem_reserve[]: 0 1732 1732 1732
        Node 0 DMA32 free:0kB min:5200kB low:6500kB high:7800kB active_anon:1167280kB inactive_anon:8196kB active_file:42608kB inactive_file:439424kB unevictable:0kB isolated(anon):0kB isolated(file):128kB present:2080640kB managed:1775332kB mlocked:0kB dirty:436344kB writeback:3288kB mapped:39260kB shmem:8452kB slab_reclaimable:21908kB slab_unreclaimable:33120kB kernel_stack:20976kB pagetables:9536kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:11073180 all_unreclaimable? yes
        lowmem_reserve[]: 0 0 0 0
        Node 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
        Node 0 DMA32: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
        Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
        Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
        123086 total pagecache pages
        0 pages in swap cache
        Swap cache stats: add 0, delete 0, find 0/0
        Free swap  = 0kB
        Total swap = 0kB
        524157 pages RAM
        0 pages HighMem/MovableOnly
        76348 pages reserved
        0 pages hwpoisoned
        SLUB: Unable to allocate memory on node -1 (gfp=0x2088020)
          cache: kmalloc-64, object size: 64, buffer size: 64, default order: 0, min order: 0
          node 0: slabs: 3218, objs: 205952, free: 0
        file_io.00: page allocation failure: order:0, mode:0x2200020
        CPU: 0 PID: 4457 Comm: file_io.00 Not tainted 4.5.0-rc7+ #45
      
      Assuming that somebody will find a better solution, let's apply this
      patch for now to stop the bleeding, because this problem frequently
      prevents me from testing the OOM livelock condition.
      
      Link: http://lkml.kernel.org/r/20160318131136.GE7152@quack.suse.cz
      Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • E
      Documentation: vm: fix spelling mistakes · 89474d50
      Eric Engestrom committed
      Signed-off-by: Eric Engestrom <eric@engestrom.ch>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • W
      mm fix commmets: if SPARSEMEM, pgdata doesn't have page_ext · 0c9ad804
      Weijie Yang committed
      If SPARSEMEM, use page_ext in mem_section;
      if !SPARSEMEM, use page_ext in pgdata.
      Signed-off-by: Weijie Yang <weijie.yang@samsung.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
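
      The rule stated in the message above can be sketched in plain C. This
      is only an illustration under assumptions: every type, the macro
      SKETCH_SPARSEMEM and the helper sketch_page_ext_base() are hypothetical
      stand-ins, not the kernel's definitions. It shows that only one of the
      two structures carries the page_ext pointer in a given build.

```c
#include <stddef.h>

/* Hypothetical stand-in for the kernel's per-page extension record. */
struct page_ext { unsigned long flags; };

#ifdef SKETCH_SPARSEMEM
/* Sparse model (assumed layout): each memory section owns a page_ext array. */
struct mem_section_sketch { struct page_ext *page_ext; };
struct pgdat_sketch { int node_id; };
#else
/* Non-sparse model (assumed layout): the node descriptor owns the array. */
struct mem_section_sketch { unsigned long section_mem_map; };
struct pgdat_sketch { int node_id; struct page_ext *node_page_ext; };
#endif

/* Return the page_ext array from whichever structure holds it in this build. */
static struct page_ext *sketch_page_ext_base(struct mem_section_sketch *ms,
                                             struct pgdat_sketch *pgdat)
{
#ifdef SKETCH_SPARSEMEM
    (void)pgdat;                  /* node descriptor has no page_ext here */
    return ms->page_ext;
#else
    (void)ms;                     /* section has no page_ext here */
    return pgdat->node_page_ext;
#endif
}
```

      Building with -DSKETCH_SPARSEMEM selects the per-section branch;
      without it, the per-node branch is compiled.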
    • C
      include/linux/hugetlb.h: use bool instead of int for hugepage_migration_supported() · d70c17d4
      Chen Gang committed
      It is used as a pure bool function throughout the kernel source.
      Signed-off-by: Chen Gang <gang.chen.5i5j@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
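
      As a rough illustration of the int-to-bool change described above, the
      sketch below declares a yes/no predicate with a bool return type. The
      names, the constants and the function body are assumptions made for
      the example; the real hugepage_migration_supported() may differ.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-ins; not the kernel definitions. */
#define SKETCH_ARCH_ENABLE_HUGEPAGE_MIGRATION 1
#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PMD_SHIFT  21

struct hstate_sketch { unsigned int order; };

/* Shift of the huge page size described by h (page shift plus order). */
static inline unsigned int sketch_huge_page_shift(const struct hstate_sketch *h)
{
    return h->order + SKETCH_PAGE_SHIFT;
}

/* A pure predicate: declaring it bool documents that only true/false come back. */
static inline bool sketch_hugepage_migration_supported(const struct hstate_sketch *h)
{
#ifdef SKETCH_ARCH_ENABLE_HUGEPAGE_MIGRATION
    return sketch_huge_page_shift(h) == SKETCH_PMD_SHIFT;
#else
    (void)h;
    return false;
#endif
}
```

      Callers that already test the result in an if condition need no change
      when the return type moves from int to bool.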
    • C
      include/linux/hugetlb*.h: clean up code · 7fab358d
      Chen Gang committed
      Macro HUGETLBFS_SB is clear enough, so a single statement is clearer
      than three separate statements.
      
      Remove redundant return statements from void functions, which at
      least saves lines.
      Signed-off-by: Chen Gang <gang.chen.5i5j@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
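
      The two cleanups described above might look like the following sketch.
      All types and names here (including SKETCH_HUGETLBFS_SB) are
      hypothetical stand-ins chosen for the example, not the contents of the
      hugetlb headers.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the structures involved. */
struct subpool_sketch { long max_hpages; };
struct sb_info_sketch { struct subpool_sketch *spool; };
struct super_block_sketch { void *s_fs_info; };
struct inode_sketch { struct super_block_sketch *i_sb; };

/* Assumed shape of the accessor: superblock -> filesystem-private info. */
#define SKETCH_HUGETLBFS_SB(sb) ((struct sb_info_sketch *)(sb)->s_fs_info)

/* Before: a temporary, an assignment and a return. */
static inline struct subpool_sketch *subpool_inode_long(struct inode_sketch *inode)
{
    struct sb_info_sketch *sbinfo;

    sbinfo = SKETCH_HUGETLBFS_SB(inode->i_sb);
    return sbinfo->spool;
}

/* After: the accessor reads clearly enough to use in a single statement. */
static inline struct subpool_sketch *subpool_inode_short(struct inode_sketch *inode)
{
    return SKETCH_HUGETLBFS_SB(inode->i_sb)->spool;
}

/* A void stub needs no trailing "return;" statement. */
static inline void sketch_noop_stub(struct inode_sketch *inode)
{
    (void)inode;
}
```

      Both accessor forms return the same pointer; the shorter one simply
      drops the temporary variable.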
    • M
      mm/swap.c: put activate_page_pvecs and other pagevecs together · a4a921aa
      Ming Li committed
      Put the activate_page_pvecs definition next to those of the other
      pagevecs, for clarity.
      Signed-off-by: Ming Li <mingli199x@qq.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>