1. 26 March 2015: 8 commits
    • mm: numa: slow PTE scan rate if migration failures occur · 074c2381
      Committed by Mel Gorman
      Dave Chinner reported the following on https://lkml.org/lkml/2015/3/1/226
      
        Across the board the 4.0-rc1 numbers are much slower, and the degradation
        is far worse when using the large memory footprint configs. Perf points
        straight at the cause - this is from 4.0-rc1 on the "-o bhash=101073" config:
      
         -   56.07%    56.07%  [kernel]            [k] default_send_IPI_mask_sequence_phys
            - default_send_IPI_mask_sequence_phys
               - 99.99% physflat_send_IPI_mask
                  - 99.37% native_send_call_func_ipi
                       smp_call_function_many
                     - native_flush_tlb_others
                        - 99.85% flush_tlb_page
                             ptep_clear_flush
                             try_to_unmap_one
                             rmap_walk
                             try_to_unmap
                             migrate_pages
                             migrate_misplaced_page
                           - handle_mm_fault
                              - 99.73% __do_page_fault
                                   trace_do_page_fault
                                   do_async_page_fault
                                 + async_page_fault
                    0.63% native_send_call_func_single_ipi
                       generic_exec_single
                       smp_call_function_single
      
      This is showing excessive migration activity even though excessive
      migrations are meant to get throttled.  Normally, the scan rate is tuned
      on a per-task basis depending on the locality of faults.  However, if
      migrations fail for any reason then the PTE scanner may scan faster if
      the faults continue to be remote.  This means there is higher system CPU
      overhead and fault trapping at exactly the time we know that migrations
      cannot happen.  This patch tracks when migration failures occur and
      slows the PTE scanner.
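      A minimal sketch of the idea (hypothetical field and function names,
      and illustrative bounds; not the patch verbatim): failed migrations
      feed back into the per-task scan period, so the scanner backs off
      instead of speeding up:

          struct task_numa_state {
                  unsigned int  scan_period_ms;  /* current PTE scan period */
                  unsigned long faults_remote;   /* recent remote hinting faults */
                  unsigned long migrate_failed;  /* recent failed migrations */
          };

          static void update_scan_period(struct task_numa_state *ts)
          {
                  if (ts->migrate_failed) {
                          /* Migrations are failing: faster scanning only adds
                           * fault-trapping overhead, so slow down. */
                          ts->scan_period_ms = min(ts->scan_period_ms * 2, 60000u);
                  } else if (ts->faults_remote) {
                          /* Remote faults while migrations succeed: speed up. */
                          ts->scan_period_ms = max(ts->scan_period_ms / 2, 1000u);
                  }
          }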
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reported-by: Dave Chinner <david@fromorbit.com>
      Tested-by: Dave Chinner <david@fromorbit.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: numa: preserve PTE write permissions across a NUMA hinting fault · b191f9b1
      Committed by Mel Gorman
      Protecting a PTE to trap a NUMA hinting fault clears the writable bit
      and further faults are needed after trapping a NUMA hinting fault to set
      the writable bit again.  This patch preserves the writable bit when
      trapping NUMA hinting faults.  The impact is obvious from the number of
      minor faults trapped during the basic balancing benchmark and the system
      CPU usage:
      
        autonumabench
                                                   4.0.0-rc4             4.0.0-rc4
                                                    baseline              preserve
        Time System-NUMA01                  107.13 (  0.00%)      103.13 (  3.73%)
        Time System-NUMA01_THEADLOCAL       131.87 (  0.00%)       83.30 ( 36.83%)
        Time System-NUMA02                    8.95 (  0.00%)       10.72 (-19.78%)
        Time System-NUMA02_SMT                4.57 (  0.00%)        3.99 ( 12.69%)
        Time Elapsed-NUMA01                 515.78 (  0.00%)      517.26 ( -0.29%)
        Time Elapsed-NUMA01_THEADLOCAL      384.10 (  0.00%)      384.31 ( -0.05%)
        Time Elapsed-NUMA02                  48.86 (  0.00%)       48.78 (  0.16%)
        Time Elapsed-NUMA02_SMT              47.98 (  0.00%)       48.12 ( -0.29%)
      
                     4.0.0-rc4   4.0.0-rc4
                      baseline    preserve
        User          44383.95    43971.89
        System          252.61      201.24
        Elapsed         998.68     1000.94
      
        Minor Faults   2597249     1981230
        Major Faults       365         364
      
      There is a similar drop in system CPU usage using Dave Chinner's xfsrepair
      workload
      
                                            4.0.0-rc4             4.0.0-rc4
                                             baseline              preserve
        Amean    real-xfsrepair      454.14 (  0.00%)      442.36 (  2.60%)
        Amean    syst-xfsrepair      277.20 (  0.00%)      204.68 ( 26.16%)
      
      The patch looks hacky but the alternatives looked worse.  The tidiest
      was to rewalk the page tables after a hinting fault, but it was more
      complex than this approach and performed worse.  It's not generally
      safe to simply mark the page writable during the fault if it's a write
      fault, as the page may have been read-only for COW, so that approach
      was discarded.
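      A simplified sketch of the approach taken (condensed from the change
      to the PTE protection loop; not the patch verbatim): when the
      protection change is only for NUMA hinting, remember whether the PTE
      was writable and restore that bit afterwards:

          bool preserve_write = prot_numa && pte_write(oldpte);
          pte_t ptent = pte_modify(oldpte, newprot);

          if (preserve_write)
                  ptent = pte_mkwrite(ptent);  /* safe: it was writable before */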
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reported-by: Dave Chinner <david@fromorbit.com>
      Tested-by: Dave Chinner <david@fromorbit.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: numa: group related processes based on VMA flags instead of page table flags · bea66fbd
      Committed by Mel Gorman
      These are three follow-on patches based on the xfsrepair workload Dave
      Chinner reported was problematic in 4.0-rc1 due to changes in page table
      management -- https://lkml.org/lkml/2015/3/1/226.
      
      Much of the problem was reduced by commit 53da3bc2 ("mm: fix up numa
      read-only thread grouping logic") and commit ba68bc01 ("mm: thp:
      Return the correct value for change_huge_pmd").  It was known that the
      performance in 3.19 was still better even though it is far less safe.  This
      series aims to restore the performance without compromising on safety.
      
      For the tests in this mail, I'm comparing 3.19 against 4.0-rc4 with the
      three patches applied on top:
      
        autonumabench
                                                      3.19.0             4.0.0-rc4             4.0.0-rc4             4.0.0-rc4             4.0.0-rc4
                                                     vanilla               vanilla          vmwrite-v5r8         preserve-v5r8         slowscan-v5r8
        Time System-NUMA01                  124.00 (  0.00%)      161.86 (-30.53%)      107.13 ( 13.60%)      103.13 ( 16.83%)      145.01 (-16.94%)
        Time System-NUMA01_THEADLOCAL       115.54 (  0.00%)      107.64 (  6.84%)      131.87 (-14.13%)       83.30 ( 27.90%)       92.35 ( 20.07%)
        Time System-NUMA02                    9.35 (  0.00%)       10.44 (-11.66%)        8.95 (  4.28%)       10.72 (-14.65%)        8.16 ( 12.73%)
        Time System-NUMA02_SMT                3.87 (  0.00%)        4.63 (-19.64%)        4.57 (-18.09%)        3.99 ( -3.10%)        3.36 ( 13.18%)
        Time Elapsed-NUMA01                 570.06 (  0.00%)      567.82 (  0.39%)      515.78 (  9.52%)      517.26 (  9.26%)      543.80 (  4.61%)
        Time Elapsed-NUMA01_THEADLOCAL      393.69 (  0.00%)      384.83 (  2.25%)      384.10 (  2.44%)      384.31 (  2.38%)      380.73 (  3.29%)
        Time Elapsed-NUMA02                  49.09 (  0.00%)       49.33 ( -0.49%)       48.86 (  0.47%)       48.78 (  0.63%)       50.94 ( -3.77%)
        Time Elapsed-NUMA02_SMT              47.51 (  0.00%)       47.15 (  0.76%)       47.98 ( -0.99%)       48.12 ( -1.28%)       49.56 ( -4.31%)
      
                      3.19.0   4.0.0-rc4   4.0.0-rc4   4.0.0-rc4   4.0.0-rc4
                 vanilla     vanilla  vmwrite-v5r8  preserve-v5r8  slowscan-v5r8
        User        46334.60    46391.94    44383.95    43971.89    44372.12
        System        252.84      284.66      252.61      201.24      249.00
        Elapsed      1062.14     1050.96      998.68     1000.94     1026.78
      
      Overall the system CPU usage is comparable and the test is naturally a
      bit variable.  The slowing of the scanner hurts numa01 but on this
      machine it is an adverse workload and patches that dramatically help it
      often hurt absolutely everything else.
      
      Due to patch 2, the fault activity is interesting
      
                                        3.19.0   4.0.0-rc4   4.0.0-rc4   4.0.0-rc4   4.0.0-rc4
                                 vanilla     vanilla  vmwrite-v5r8  preserve-v5r8  slowscan-v5r8
        Minor Faults                   2097811     2656646     2597249     1981230     1636841
        Major Faults                       362         450         365         364         365
      
      Note how preserving the write bit across protection updates and faults
      reduces minor faults.
      
        NUMA alloc hit                 1229008     1217015     1191660     1178322     1199681
        NUMA alloc miss                      0           0           0           0           0
        NUMA interleave hit                  0           0           0           0           0
        NUMA alloc local               1228514     1216317     1190871     1177448     1199021
        NUMA base PTE updates        245706197   240041607   238195516   244704842   115012800
        NUMA huge PMD updates           479530      468448      464868      477573      224487
        NUMA page range updates      491225557   479886983   476207932   489222218   229950144
        NUMA hint faults                659753      656503      641678      656926      294842
        NUMA hint local faults          381604      373963      360478      337585      186249
        NUMA hint local percent             57          56          56          51          63
        NUMA pages migrated            5412140     6374899     6266530     5277468     5755096
        AutoNUMA cost                    5121%       5083%       4994%       5097%       2388%
      
      Here the impact of slowing the PTE scanner on migration failures is
      obvious, as "NUMA base PTE updates" and "NUMA huge PMD updates" are
      massively reduced even though the headline performance is very similar.
      
      As xfsrepair was the reported workload, here is the impact of the series
      on it:
      
        xfsrepair
                                               3.19.0             4.0.0-rc4             4.0.0-rc4             4.0.0-rc4             4.0.0-rc4
                                              vanilla               vanilla          vmwrite-v5r8         preserve-v5r8         slowscan-v5r8
        Min      real-fsmark        1183.29 (  0.00%)     1165.73 (  1.48%)     1152.78 (  2.58%)     1153.64 (  2.51%)     1177.62 (  0.48%)
        Min      syst-fsmark        4107.85 (  0.00%)     4027.75 (  1.95%)     3986.74 (  2.95%)     3979.16 (  3.13%)     4048.76 (  1.44%)
        Min      real-xfsrepair      441.51 (  0.00%)      463.96 ( -5.08%)      449.50 ( -1.81%)      440.08 (  0.32%)      439.87 (  0.37%)
        Min      syst-xfsrepair      195.76 (  0.00%)      278.47 (-42.25%)      262.34 (-34.01%)      203.70 ( -4.06%)      143.64 ( 26.62%)
        Amean    real-fsmark        1188.30 (  0.00%)     1177.34 (  0.92%)     1157.97 (  2.55%)     1158.21 (  2.53%)     1182.22 (  0.51%)
        Amean    syst-fsmark        4111.37 (  0.00%)     4055.70 (  1.35%)     3987.19 (  3.02%)     3998.72 (  2.74%)     4061.69 (  1.21%)
        Amean    real-xfsrepair      450.88 (  0.00%)      468.32 ( -3.87%)      454.14 ( -0.72%)      442.36 (  1.89%)      440.59 (  2.28%)
        Amean    syst-xfsrepair      199.66 (  0.00%)      290.60 (-45.55%)      277.20 (-38.84%)      204.68 ( -2.51%)      150.55 ( 24.60%)
        Stddev   real-fsmark           4.12 (  0.00%)       10.82 (-162.29%)       4.14 ( -0.28%)        5.98 (-45.05%)        4.60 (-11.53%)
        Stddev   syst-fsmark           2.63 (  0.00%)       20.32 (-671.82%)       0.37 ( 85.89%)       16.47 (-525.59%)      15.05 (-471.79%)
        Stddev   real-xfsrepair        6.87 (  0.00%)        4.55 ( 33.75%)        3.46 ( 49.58%)        1.78 ( 74.12%)        0.52 ( 92.50%)
        Stddev   syst-xfsrepair        3.02 (  0.00%)       10.30 (-241.37%)      13.17 (-336.37%)       0.71 ( 76.63%)        5.00 (-65.61%)
        CoeffVar real-fsmark           0.35 (  0.00%)        0.92 (-164.73%)       0.36 ( -2.91%)        0.52 (-48.82%)        0.39 (-12.10%)
        CoeffVar syst-fsmark           0.06 (  0.00%)        0.50 (-682.41%)       0.01 ( 85.45%)        0.41 (-543.22%)       0.37 (-478.78%)
        CoeffVar real-xfsrepair        1.52 (  0.00%)        0.97 ( 36.21%)        0.76 ( 49.94%)        0.40 ( 73.62%)        0.12 ( 92.33%)
        CoeffVar syst-xfsrepair        1.51 (  0.00%)        3.54 (-134.54%)       4.75 (-214.31%)       0.34 ( 77.20%)        3.32 (-119.63%)
        Max      real-fsmark        1193.39 (  0.00%)     1191.77 (  0.14%)     1162.90 (  2.55%)     1166.66 (  2.24%)     1188.50 (  0.41%)
        Max      syst-fsmark        4114.18 (  0.00%)     4075.45 (  0.94%)     3987.65 (  3.08%)     4019.45 (  2.30%)     4082.80 (  0.76%)
        Max      real-xfsrepair      457.80 (  0.00%)      474.60 ( -3.67%)      457.82 ( -0.00%)      444.42 (  2.92%)      441.03 (  3.66%)
        Max      syst-xfsrepair      203.11 (  0.00%)      303.65 (-49.50%)      294.35 (-44.92%)      205.33 ( -1.09%)      155.28 ( 23.55%)
      
      The really relevant lines are syst-xfsrepair, which is the system CPU
      usage when running xfsrepair.  Note that on my machine the overhead was
      45% higher on 4.0-rc4 which may be part of what Dave is seeing.  Once we
      preserve the write bit across faults, it's only 2.51% higher on average.
      With the full series applied, system CPU usage is 24.6% lower on
      average.
      
      Again, the impact of preserving the write bit on minor faults is obvious
      and the impact of slowing scanning after migration failures is obvious
      on the PTE updates.  Note also that the number of pages migrated is much
      reduced even though the headline performance is comparable.
      
                                        3.19.0   4.0.0-rc4   4.0.0-rc4   4.0.0-rc4   4.0.0-rc4
                                 vanilla     vanilla  vmwrite-v5r8  preserve-v5r8  slowscan-v5r8
        Minor Faults                 153466827   254507978   249163829   153501373   105737890
        Major Faults                       610         702         690         649         724
        NUMA base PTE updates        217735049   210756527   217729596   216937111   144344993
        NUMA huge PMD updates           129294       85044      106921      127246       79887
        NUMA pages migrated           21938995    29705270    28594162    22687324    16258075
      
                              3.19.0   4.0.0-rc4   4.0.0-rc4   4.0.0-rc4   4.0.0-rc4
                       vanilla     vanilla  vmwrite-v5r8  preserve-v5r8  slowscan-v5r8
        Mean sdb-avgqusz       13.47        2.54        2.55        2.47        2.49
        Mean sdb-avgrqsz      202.32      140.22      139.50      139.02      138.12
        Mean sdb-await         25.92        5.09        5.33        5.02        5.22
        Mean sdb-r_await        4.71        0.19        0.83        0.51        0.11
        Mean sdb-w_await      104.13        5.21        5.38        5.05        5.32
        Mean sdb-svctm          0.59        0.13        0.14        0.13        0.14
        Mean sdb-rrqm           0.16        0.00        0.00        0.00        0.00
        Mean sdb-wrqm           3.59     1799.43     1826.84     1812.21     1785.67
        Max  sdb-avgqusz      111.06       12.13       14.05       11.66       15.60
        Max  sdb-avgrqsz      255.60      190.34      190.01      187.33      191.78
        Max  sdb-await        168.24       39.28       49.22       44.64       65.62
        Max  sdb-r_await      660.00       52.00      280.00       76.00       12.00
        Max  sdb-w_await     7804.00       39.28       49.22       44.64       65.62
        Max  sdb-svctm          4.00        2.82        2.86        1.98        2.84
        Max  sdb-rrqm           8.30        0.00        0.00        0.00        0.00
        Max  sdb-wrqm          34.20     5372.80     5278.60     5386.60     5546.15
      
      FWIW, I also checked SPECjbb in different configurations with similar
      observations -- minor faults lower, PTE update activity lower, and
      performance roughly comparable against 3.19.
      
      This patch (of 3):
      
      Threads that share writable data within pages are grouped together as
      related tasks.  This decision is based on whether the PTE is marked
      dirty, which is subject to timing races between the PTE scanner update
      and when the application writes the page.  If the page is file-backed,
      then background flushes and sync also affect placement.  This is
      unpredictable behaviour which is impossible to reason about, so this
      patch makes grouping decisions based on the VMA flags, as sketched
      below.
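      A sketch of the new check in the NUMA fault path (simplified):

          /* Avoid grouping on read-only mappings.  Unlike the per-PTE
           * dirty bit used previously, the VMA flags are stable. */
          if (!(vma->vm_flags & VM_WRITE))
                  flags |= TNF_NO_GROUP;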
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reported-by: Dave Chinner <david@fromorbit.com>
      Tested-by: Dave Chinner <david@fromorbit.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/page_alloc.c: call kernel_map_pages in unset_migrateype_isolate · cfa86943
      Committed by Laura Abbott
      Commit 3c605096 ("mm/page_alloc: restrict max order of merging on
      isolated pageblock") changed the logic of unset_migratetype_isolate to
      check the buddy allocator and explicitly call __free_pages to merge.
      
      The page that is being freed in this path never had prep_new_page
      called, so set_page_refcounted is called explicitly, but there is no
      call to kernel_map_pages.  With the default kernel_map_pages this is
      mostly harmless, but if kernel_map_pages does any manipulation of the
      page tables (unmapping or setting pages to read-only) this may trigger
      a fault:
      
          alloc_contig_range test_pages_isolated(ceb00, ced00) failed
          Unable to handle kernel paging request at virtual address ffffffc0cec00000
          pgd = ffffffc045fc4000
          [ffffffc0cec00000] *pgd=0000000000000000
          Internal error: Oops: 9600004f [#1] PREEMPT SMP
          Modules linked in: exfatfs
          CPU: 1 PID: 23237 Comm: TimedEventQueue Not tainted 3.10.49-gc72ad36-dirty #1
          task: ffffffc03de52100 ti: ffffffc015388000 task.ti: ffffffc015388000
          PC is at memset+0xc8/0x1c0
          LR is at kernel_map_pages+0x1ec/0x244
      
      Fix this by calling kernel_map_pages to ensure the page is mapped
      properly in the page table, as sketched below.
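      A simplified sketch of the fixed free path in
      unset_migratetype_isolate():

          /* The page never went through prep_new_page(), so map it back
           * into the kernel page tables before handing it to the buddy
           * allocator. */
          kernel_map_pages(page, 1 << order, 1);
          set_page_refcounted(page);
          __free_pages(page, order);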
      
      Fixes: 3c605096 ("mm/page_alloc: restrict max order of merging on isolated pageblock")
      Signed-off-by: Laura Abbott <lauraa@codeaurora.org>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Gioh Kim <gioh.kim@lge.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/slub: fix lockups on PREEMPT && !SMP kernels · 859b7a0e
      Committed by Mark Rutland
      Commit 9aabf810 ("mm/slub: optimize alloc/free fastpath by removing
      preemption on/off") introduced an occasional hang for kernels built with
      CONFIG_PREEMPT && !CONFIG_SMP.
      
      The problem is the following loop, which the patch introduced to
      slab_alloc_node and slab_free:
      
          do {
              tid = this_cpu_read(s->cpu_slab->tid);
              c = raw_cpu_ptr(s->cpu_slab);
          } while (IS_ENABLED(CONFIG_PREEMPT) && unlikely(tid != c->tid));
      
      GCC 4.9 has been observed to hoist the load of c and c->tid above the
      loop for !SMP kernels (as in this case raw_cpu_ptr(x) is a compile-time
      constant and does not force a reload).  On arm64 the generated assembly
      looks like:
      
               ldr     x4, [x0,#8]
        loop:
               ldr     x1, [x0,#8]
               cmp     x1, x4
               b.ne    loop
      
      If the thread is preempted between the load of c->tid (into x1) and tid
      (into x4), and an allocation or free occurs in another thread (bumping
      the cpu_slab's tid), the thread will be stuck in the loop until
      s->cpu_slab->tid wraps, which may be forever in the absence of
      allocations/frees on the same CPU.
      
      This patch changes the loop condition to access c->tid with READ_ONCE.
      This ensures that the value is reloaded even when the compiler would
      otherwise assume it could cache the value, and also ensures that the
      load will not be torn.
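      Per the description above, the fixed loop condition becomes:

          do {
                  tid = this_cpu_read(s->cpu_slab->tid);
                  c = raw_cpu_ptr(s->cpu_slab);
          } while (IS_ENABLED(CONFIG_PREEMPT) &&
                   unlikely(tid != READ_ONCE(c->tid)));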
      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Steve Capper <steve.capper@linaro.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memory hotplug: postpone the reset of obsolete pgdat · b0dc3a34
      Committed by Gu Zheng
      Qiu Xishi reported the following BUG when testing node hot-add/hot-remove
      under stress conditions:
      
        BUG: unable to handle kernel paging request at 0000000000025f60
        IP: next_online_pgdat+0x1/0x50
        PGD 0
        Oops: 0000 [#1] SMP
        ACPI: Device does not support D3cold
        Modules linked in: fuse nls_iso8859_1 nls_cp437 vfat fat loop dm_mod coretemp mperf crc32c_intel ghash_clmulni_intel aesni_intel ablk_helper cryptd lrw gf128mul glue_helper aes_x86_64 pcspkr microcode igb dca i2c_algo_bit ipv6 megaraid_sas iTCO_wdt i2c_i801 i2c_core iTCO_vendor_support tg3 sg hwmon ptp lpc_ich pps_core mfd_core acpi_pad rtc_cmos button ext3 jbd mbcache sd_mod crc_t10dif scsi_dh_alua scsi_dh_rdac scsi_dh_hp_sw scsi_dh_emc scsi_dh ahci libahci libata scsi_mod [last unloaded: rasf]
        CPU: 23 PID: 238 Comm: kworker/23:1 Tainted: G           O 3.10.15-5885-euler0302 #1
        Hardware name: HUAWEI TECHNOLOGIES CO.,LTD. Huawei N1/Huawei N1, BIOS V100R001 03/02/2015
        Workqueue: events vmstat_update
        task: ffffa800d32c0000 ti: ffffa800d32ae000 task.ti: ffffa800d32ae000
        RIP: 0010: next_online_pgdat+0x1/0x50
        RSP: 0018:ffffa800d32afce8  EFLAGS: 00010286
        RAX: 0000000000001440 RBX: ffffffff81da53b8 RCX: 0000000000000082
        RDX: 0000000000000000 RSI: 0000000000000082 RDI: 0000000000000000
        RBP: ffffa800d32afd28 R08: ffffffff81c93bfc R09: ffffffff81cbdc96
        R10: 00000000000040ec R11: 00000000000000a0 R12: ffffa800fffb3440
        R13: ffffa800d32afd38 R14: 0000000000000017 R15: ffffa800e6616800
        FS:  0000000000000000(0000) GS:ffffa800e6600000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000000000025f60 CR3: 0000000001a0b000 CR4: 00000000001407e0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        Call Trace:
          refresh_cpu_vm_stats+0xd0/0x140
          vmstat_update+0x11/0x50
          process_one_work+0x194/0x3d0
          worker_thread+0x12b/0x410
          kthread+0xc6/0xd0
          ret_from_fork+0x7c/0xb0
      
      The cause is the "memset(pgdat, 0, sizeof(*pgdat))" at the end of
      try_offline_node, which resets the entire content of pgdat to 0.  As
      the pgdat is accessed lock-free, users still using it, such as the
      vmstat_update routine, will panic.
      
      process A:				offline node XX:
      
      vmstat_updat()
         refresh_cpu_vm_stats()
           for_each_populated_zone()
             find online node XX
           cond_resched()
      					offline cpu and memory, then try_offline_node()
      					node_set_offline(nid), and memset(pgdat, 0, sizeof(*pgdat))
             zone = next_zone(zone)
               pg_data_t *pgdat = zone->zone_pgdat;  // here pgdat is NULL now
                 next_online_pgdat(pgdat)
                   next_online_node(pgdat->node_id);  // NULL pointer access
      
      So the solution here is to postpone the reset of the obsolete pgdat
      from try_offline_node() to hotadd_new_pgdat(), and to reset only
      pgdat->nr_zones and pgdat->classzone_idx to 0 rather than memsetting
      the whole structure, to avoid breaking the pointer information in
      pgdat.
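      A simplified sketch of the resulting reset, now done on hot-add rather
      than on offline:

          /* In hotadd_new_pgdat(): clear only the fields that must not be
           * stale, keeping the pointers that lock-free readers may still
           * follow. */
          pgdat->nr_zones = 0;
          pgdat->classzone_idx = 0;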
      Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
      Reported-by: Xishi Qiu <qiuxishi@huawei.com>
      Suggested-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Xie XiuQi <xiexiuqi@huawei.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/pagewalk.c: prevent positive return value of walk_page_test() from being passed to callers · f6837395
      Committed by Naoya Horiguchi
      walk_page_test() is purely pagewalk's internal stuff, and its positive
      return values are not intended to be passed to the callers of pagewalk.
      
      However, in the current code, if the last vma in the do-while loop in
      walk_page_range() happens to return a positive value, it leaks outside
      walk_page_range().  So the user-visible effect is an invalid/unexpected
      return value (according to the reporter, mbind() triggers it).
      
      This patch fixes it simply by reinitializing the return value after it
      has been checked.
      
      Another exposed interface, walk_page_vma(), already returns 0 for such
      cases, so there is no problem there.
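      A simplified sketch of the fix in walk_page_range():

          err = walk_page_test(start, next, walk);
          if (err > 0) {
                  err = 0;  /* internal "skip this vma" value; don't leak it */
                  continue;
          }
          if (err < 0)
                  break;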
      
      Fixes: fafaa426 ("pagewalk: improve vma handling")
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Kazutomo Yoshii <kazutomo.yoshii@gmail.com>
      Reported-by: Kazutomo Yoshii <kazutomo.yoshii@gmail.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: fix anon_vma->degree underflow in anon_vma endless growing prevention · 3fe89b3e
      Committed by Leon Yu
      I have constantly stumbled upon "kernel BUG at mm/rmap.c:399!" after
      upgrading to 3.19 and had no luck with 4.0-rc1 either.
      
      So, after looking into the new logic introduced by commit 7a3ef208 ("mm:
      prevent endless growth of anon_vma hierarchy"), I found that chances are
      unlink_anon_vmas() is called without incrementing dst->anon_vma->degree
      in anon_vma_clone() due to allocation failure.  If dst->anon_vma is not
      NULL in error path, its degree will be incorrectly decremented in
      unlink_anon_vmas() and eventually underflow when exiting as a result of
      another call to unlink_anon_vmas().  That's how "kernel BUG at
      mm/rmap.c:399!" is triggered for me.
      
      This patch fixes the underflow by dropping dst->anon_vma when allocation
      fails.  It's safe to do so regardless of the original value of
      dst->anon_vma because dst->anon_vma doesn't have a valid meaning if
      anon_vma_clone() fails.  Besides, callers don't care about dst->anon_vma
      in such a case either.
      
      Also suggested by Michal Hocko, we can clean up vma_adjust() a bit as
      anon_vma_clone() now does the work.
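      A simplified sketch of the error path in anon_vma_clone():

           enomem_failure:
                  /* Leaving dst->anon_vma set would let unlink_anon_vmas()
                   * decrement a degree that was never incremented; callers
                   * ignore dst->anon_vma on failure anyway. */
                  dst->anon_vma = NULL;
                  unlink_anon_vmas(dst);
                  return -ENOMEM;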
      
      [akpm@linux-foundation.org: tweak comment]
      Fixes: 7a3ef208 ("mm: prevent endless growth of anon_vma hierarchy")
      Signed-off-by: Leon Yu <chianglungyu@gmail.com>
      Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 13 March 2015: 8 commits
  3. 12 March 2015: 1 commit
    • mm: fix up numa read-only thread grouping logic · 53da3bc2
      Committed by Linus Torvalds
      Dave Chinner reported that commit 4d942466 ("mm: convert
      p[te|md]_mknonnuma and remaining page table manipulations") slowed down
      his xfsrepair test enormously.  In particular, it was using more system
      time due to extra TLB flushing.
      
      The ultimate reason turns out to be how the change to use the regular
      page table accessor functions broke the NUMA grouping logic.  The old
      special mknuma/mknonnuma code accessed the page table present bit and
      the magic NUMA bit directly, while the new code just changes the page
      protections using PROT_NONE and the regular vma protections.
      
      That sounds equivalent, and from a fault standpoint it really is, but a
      subtle side effect is that the *other* protection bits of the page table
      entries also change.  And the code to decide how to group the NUMA
      entries together used the writable bit to decide whether a particular
      page was likely to be shared read-only or not.
      
      And with the change to make the NUMA handling use the regular permission
      setting functions, that writable bit was basically always cleared for
      private mappings due to COW.  So even if the page actually ends up being
      written to in the end, the NUMA balancing would act as if it was always
      shared RO.
      
      This code is a heuristic anyway, so the fix - at least for now - is to
      check whether the page is dirty rather than writable.  The dirty bit
      doesn't change with protection changes.
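      A sketch of the adjusted heuristic in the NUMA fault path (simplified):

          /* The dirty bit, unlike the write bit, survives the PROT_NONE
           * protection change used for NUMA hinting. */
          if (!pte_dirty(pte))
                  flags |= TNF_NO_GROUP;  /* treat as shared read-only */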
      
      NOTE! This also adds a FIXME comment to revisit this issue.
      
      Not only should we probably re-visit the whole "is this a shared
      read-only page" heuristic (we might want to take the vma permissions
      into account and base this more on those than the per-page ones, and
      also look at whether the particular access that triggers it is a write
      or not), but the whole COW issue shows that we should think about the
      NUMA fault handling some more.
      
      For example, maybe we should do the early-COW thing that a regular fault
      does.  Or maybe we should accept that while using the same bits as
      PROTNONE was a good thing (and got rid of the special NUMA bit), we
      might still want to just preserve the other protection bits across NUMA
      faulting.
      
      Those are bigger questions, left for later.  This just fixes up the
      heuristic so that it at least approximates working again.  More analysis
      and work needed.
      Reported-by: Dave Chinner <david@fromorbit.com>
      Tested-by: Mel Gorman <mgorman@suse.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. 01 March 2015: 4 commits
    • mm: page_alloc: revert inadvertent !__GFP_FS retry behavior change · cc873177
      Committed by Johannes Weiner
      Historically, !__GFP_FS allocations were not allowed to invoke the OOM
      killer once reclaim had failed, but nevertheless kept looping in the
      allocator.
      
      Commit 9879de73 ("mm: page_alloc: embed OOM killing naturally into
      allocation slowpath"), which should have been a simple cleanup patch,
      accidentally changed the behavior to aborting the allocation at that
      point.  This creates problems with filesystem callers (?) that currently
      rely on the allocator waiting for other tasks to intervene.
      
      Revert the behavior as it shouldn't have been changed as part of a
      cleanup patch.
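      A simplified sketch of the restored behaviour in the allocator's OOM
      path:

          if (!(gfp_mask & __GFP_FS)) {
                  /* Can't invoke the OOM killer, but keep looping in the
                   * allocator as !__GFP_FS callers historically expect. */
                  *did_some_progress = 1;
                  goto out;
          }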
      
      Fixes: 9879de73 ("mm: page_alloc: embed OOM killing naturally into allocation slowpath")
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Dave Chinner <david@fromorbit.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: <stable@vger.kernel.org>	[3.19.x]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcontrol: use "max" instead of "infinity" in control knobs · d2973697
      Committed by Johannes Weiner
      The memcg control knobs indicate the highest possible value using the
      symbolic name "infinity", which is long and awkward to type.
      
      Switch to the string "max", which is just as descriptive but shorter and
      sweeter.
      
      This changes a user interface, so do it before the release and before
      the development flag is dropped from the default hierarchy.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: fix low limit calculation · 4e54dede
      Committed by Michal Hocko
      A memcg is considered low-limited even when its current usage is equal
      to the low limit.  This leads to interesting side effects, e.g.
      groups/hierarchies with no memory accounted are considered protected,
      and so reclaim will emit the MEMCG_LOW event when encountering them.
      
      Another, much bigger issue was reported by Joonsoo Kim.  He hit a NULL
      pointer dereference with the legacy cgroup API, which doesn't even have
      the low limit exposed.  The limit is 0 by default, but the initial check
      fails for a memcg with 0 consumption; parent_mem_cgroup() returns NULL
      if use_hierarchy is 0, and so page_counter_read would dereference NULL.
      
      I suppose that the current implementation is just an oversight because
      documentation in Documentation/cgroups/unified-hierarchy.txt says:
      
        "The memory.low boundary on the other hand is a top-down allocated
        reserve.  A cgroup enjoys reclaim protection when it and all its
        ancestors are below their low boundaries"
      
      Fix the usage and the low limit comparison in mem_cgroup_low
      accordingly, as sketched below.
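      A sketch of the corrected check in mem_cgroup_low() (simplified):

          /* Usage equal to the boundary no longer counts as protected. */
          if (page_counter_read(&memcg->memory) >= memcg->low)
                  return false;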
      
      Fixes: 241994ed (mm: memcontrol: default hierarchy interface for memory)
      Reported-by: Joonsoo Kim <js1304@gmail.com>
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/nommu: fix memory leak · da616534
      Committed by Joonsoo Kim
      Maxime reported the following memory leak regression due to commit
      dbc8358c ("mm/nommu: use alloc_pages_exact() rather than its own
      implementation").
      
      On v3.19, I am facing a memory leak.  Each time I run a command, one page
      is lost.  Here is an example with busybox's free command:
      
        / # free
                     total       used       free     shared    buffers     cached
        Mem:          7928       1972       5956          0          0        492
        -/+ buffers/cache:       1480       6448
        / # free
                     total       used       free     shared    buffers     cached
        Mem:          7928       1976       5952          0          0        492
        -/+ buffers/cache:       1484       6444
        / # free
                     total       used       free     shared    buffers     cached
        Mem:          7928       1980       5948          0          0        492
        -/+ buffers/cache:       1488       6440
        / # free
                     total       used       free     shared    buffers     cached
        Mem:          7928       1984       5944          0          0        492
        -/+ buffers/cache:       1492       6436
        / # free
                     total       used       free     shared    buffers     cached
        Mem:          7928       1988       5940          0          0        492
        -/+ buffers/cache:       1496       6432
      
      At some point, the system fails to satisfy 256KB allocations:
      
        free: page allocation failure: order:6, mode:0xd0
        CPU: 0 PID: 67 Comm: free Not tainted 3.19.0-05389-gacf2cf1-dirty #64
        Hardware name: STM32 (Device Tree Support)
          show_stack+0xb/0xc
          warn_alloc_failed+0x97/0xbc
          __alloc_pages_nodemask+0x295/0x35c
          __get_free_pages+0xb/0x24
          alloc_pages_exact+0x19/0x24
          do_mmap_pgoff+0x423/0x658
          vm_mmap_pgoff+0x3f/0x4e
          load_flat_file+0x20d/0x4f8
          load_flat_binary+0x3f/0x26c
          search_binary_handler+0x51/0xe4
          do_execveat_common+0x271/0x35c
          do_execve+0x19/0x1c
          ret_fast_syscall+0x1/0x4a
        Mem-info:
        Normal per-cpu:
        CPU    0: hi:    0, btch:   1 usd:   0
        active_anon:0 inactive_anon:0 isolated_anon:0
         active_file:0 inactive_file:0 isolated_file:0
         unevictable:123 dirty:0 writeback:0 unstable:0
         free:1515 slab_reclaimable:17 slab_unreclaimable:139
         mapped:0 shmem:0 pagetables:0 bounce:0
         free_cma:0
        Normal free:6060kB min:352kB low:440kB high:528kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:492kB isolated(anon):0ks
        lowmem_reserve[]: 0 0
        Normal: 23*4kB (U) 22*8kB (U) 24*16kB (U) 23*32kB (U) 23*64kB (U) 23*128kB (U) 1*256kB (U) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 6060kB
        123 total pagecache pages
        2048 pages of RAM
        1538 free pages
        66 reserved pages
        109 slab pages
        -46 pages shared
        0 pages swap cached
        nommu: Allocation of length 221184 from process 67 (free) failed
        Normal per-cpu:
        CPU    0: hi:    0, btch:   1 usd:   0
        active_anon:0 inactive_anon:0 isolated_anon:0
         active_file:0 inactive_file:0 isolated_file:0
         unevictable:123 dirty:0 writeback:0 unstable:0
         free:1515 slab_reclaimable:17 slab_unreclaimable:139
         mapped:0 shmem:0 pagetables:0 bounce:0
         free_cma:0
        Normal free:6060kB min:352kB low:440kB high:528kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:492kB isolated(anon):0ks
        lowmem_reserve[]: 0 0
        Normal: 23*4kB (U) 22*8kB (U) 24*16kB (U) 23*32kB (U) 23*64kB (U) 23*128kB (U) 1*256kB (U) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 6060kB
        123 total pagecache pages
        Unable to allocate RAM for process text/data, errno 12 SEGV
      
      This problem happens because in some cases we allocate a high-order
      page through __get_free_pages() in do_mmap_private(), but we try to
      free individual pages rather than the high-order page in
      free_page_series().  In this case, pages whose refcount is not 0 won't
      be returned to the page allocator, so a memory leak happens.
      
      To fix the problem, this patch changes __get_free_pages() to
      alloc_pages_exact(), since alloc_pages_exact() returns
      physically contiguous pages where each page is refcounted.
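      A simplified sketch of the fixed allocation in do_mmap_private():

          /* alloc_pages_exact() returns individually refcounted pages, so
           * free_page_series() can later free them one at a time. */
          base = alloc_pages_exact(len, GFP_KERNEL);
          if (!base)
                  goto enomem;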
      
      Fixes: dbc8358c ("mm/nommu: use alloc_pages_exact() rather than its own implementation").
      Reported-by: Maxime Coquelin <mcoquelin.stm32@gmail.com>
      Tested-by: Maxime Coquelin <mcoquelin.stm32@gmail.com>
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: <stable@vger.kernel.org>	[3.19]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  5. 24 February 2015: 1 commit
    • mm: shmem: check for mapping owner before dereferencing · f0774d88
      Committed by Sasha Levin
      mapping->host can be NULL and shouldn't be dereferenced before being checked.
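      A sketch of the guarded check (simplified; shmem_ops is assumed to be
      tmpfs's super_operations):

          bool shmem_mapping(struct address_space *mapping)
          {
                  if (!mapping->host)
                          return false;
                  return mapping->host->i_sb->s_op == &shmem_ops;
          }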
      
      [ 1295.741844] GPF could be caused by NULL-ptr deref or user memory access
      general protection fault: 0000 [#1] SMP KASAN
      [ 1295.746387] Dumping ftrace buffer:
      [ 1295.748217]    (ftrace buffer empty)
      [ 1295.749527] Modules linked in:
      [ 1295.750268] CPU: 62 PID: 23410 Comm: trinity-c70 Not tainted 3.19.0-next-20150219-sasha-00045-g9130270f #1939
      [ 1295.750268] task: ffff8803a49db000 ti: ffff8803a4dc8000 task.ti: ffff8803a4dc8000
      [ 1295.750268] RIP: shmem_mapping (mm/shmem.c:1458)
      [ 1295.750268] RSP: 0000:ffff8803a4dcfbf8  EFLAGS: 00010206
      [ 1295.750268] RAX: dffffc0000000000 RBX: 0000000000000000 RCX: 00000000000f2804
      [ 1295.750268] RDX: 0000000000000005 RSI: 0400000000000794 RDI: 0000000000000028
      [ 1295.750268] RBP: ffff8803a4dcfc08 R08: 0000000000000000 R09: 00000000031de000
      [ 1295.750268] R10: dffffc0000000000 R11: 00000000031c1000 R12: 0400000000000794
      [ 1295.750268] R13: 00000000031c2000 R14: 00000000031de000 R15: ffff880e3bdc1000
      [ 1295.750268] FS:  00007f8703c7e700(0000) GS:ffff881164800000(0000) knlGS:0000000000000000
      [ 1295.750268] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 1295.750268] CR2: 0000000004e58000 CR3: 00000003a9f3c000 CR4: 00000000000007a0
      [ 1295.750268] DR0: ffffffff81000000 DR1: 0000009494949494 DR2: 0000000000000000
      [ 1295.750268] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 00000000000d0602
      [ 1295.750268] Stack:
      [ 1295.750268]  ffff8803a4dcfec8 ffffffffbb1dc770 ffff8803a4dcfc38 ffffffffad6f230b
      [ 1295.750268]  ffffffffad6f2b0d 0000014100000000 ffff88001e17c08b ffff880d9453fe08
      [ 1295.750268]  ffff8803a4dcfd18 ffffffffad6f2ce2 ffff8803a49dbcd8 ffff8803a49dbce0
      [ 1295.750268] Call Trace:
      [ 1295.750268] mincore_page (mm/mincore.c:61)
      [ 1295.750268] ? mincore_pte_range (include/linux/spinlock.h:312 mm/mincore.c:131)
      [ 1295.750268] mincore_pte_range (mm/mincore.c:151)
      [ 1295.750268] ? mincore_unmapped_range (mm/mincore.c:113)
      [ 1295.750268] __walk_page_range (mm/pagewalk.c:51 mm/pagewalk.c:90 mm/pagewalk.c:116 mm/pagewalk.c:204)
      [ 1295.750268] walk_page_range (mm/pagewalk.c:275)
      [ 1295.750268] SyS_mincore (mm/mincore.c:191 mm/mincore.c:253 mm/mincore.c:220)
      [ 1295.750268] ? mincore_pte_range (mm/mincore.c:220)
      [ 1295.750268] ? mincore_unmapped_range (mm/mincore.c:113)
      [ 1295.750268] ? __mincore_unmapped_range (mm/mincore.c:105)
      [ 1295.750268] ? ptlock_free (mm/mincore.c:24)
      [ 1295.750268] ? syscall_trace_enter (arch/x86/kernel/ptrace.c:1610)
      [ 1295.750268] ia32_do_call (arch/x86/ia32/ia32entry.S:446)
      [ 1295.750268] Code: e5 48 c1 ea 03 53 48 89 fb 48 83 ec 08 80 3c 02 00 75 4f 48 b8 00 00 00 00 00 fc ff df 48 8b 1b 48 8d 7b 28 48 89 fa 48 c1 ea 03 <80> 3c 02 00 75 3f 48 b8 00 00 00 00 00 fc ff df 48 8b 5b 28 48
      
      All code
      ========
         0:	e5 48                	in     $0x48,%eax
         2:	c1 ea 03             	shr    $0x3,%edx
         5:	53                   	push   %rbx
         6:	48 89 fb             	mov    %rdi,%rbx
         9:	48 83 ec 08          	sub    $0x8,%rsp
         d:	80 3c 02 00          	cmpb   $0x0,(%rdx,%rax,1)
        11:	75 4f                	jne    0x62
        13:	48 b8 00 00 00 00 00 	movabs $0xdffffc0000000000,%rax
        1a:	fc ff df
        1d:	48 8b 1b             	mov    (%rbx),%rbx
        20:	48 8d 7b 28          	lea    0x28(%rbx),%rdi
        24:	48 89 fa             	mov    %rdi,%rdx
        27:	48 c1 ea 03          	shr    $0x3,%rdx
        2b:*	80 3c 02 00          	cmpb   $0x0,(%rdx,%rax,1)		<-- trapping instruction
        2f:	75 3f                	jne    0x70
        31:	48 b8 00 00 00 00 00 	movabs $0xdffffc0000000000,%rax
        38:	fc ff df
        3b:	48 8b 5b 28          	mov    0x28(%rbx),%rbx
        3f:	48                   	rex.W
      	...
      
      Code starting with the faulting instruction
      ===========================================
         0:	80 3c 02 00          	cmpb   $0x0,(%rdx,%rax,1)
         4:	75 3f                	jne    0x45
         6:	48 b8 00 00 00 00 00 	movabs $0xdffffc0000000000,%rax
         d:	fc ff df
        10:	48 8b 5b 28          	mov    0x28(%rbx),%rbx
        14:	48                   	rex.W
      	...
      [ 1295.750268] RIP shmem_mapping (mm/shmem.c:1458)
      [ 1295.750268]  RSP <ffff8803a4dcfbf8>
      
      Fixes: 97b713ba ("fs: kill BDI_CAP_SWAP_BACKED")
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  6. 23 February 2015: 1 commit
    • VFS: (Scripted) Convert S_ISLNK/DIR/REG(dentry->d_inode) to d_is_*(dentry) · e36cb0b8
      Committed by David Howells
      Convert the following where appropriate:
      
       (1) S_ISLNK(dentry->d_inode->i_mode) to d_is_symlink(dentry).
      
       (2) S_ISREG(dentry->d_inode->i_mode) to d_is_reg(dentry).
      
       (3) S_ISDIR(dentry->d_inode->i_mode) to d_is_dir(dentry).  This is actually more
           complicated than it appears as some calls should be converted to
           d_can_lookup() instead.  The difference is whether the directory in
           question is a real dir with a ->lookup op or whether it's a fake dir with
           a ->d_automount op.
      
      In some circumstances, we can subsume checks for dentry->d_inode not being
      NULL into this, provided the code isn't in a filesystem that expects
      d_inode to be NULL if the dirent really *is* negative (ie. if we're going to
      use d_inode() rather than d_backing_inode() to get the inode pointer).
      
      Note that the dentry type field may be set to something other than
      DCACHE_MISS_TYPE when d_inode is NULL in the case of unionmount, where the VFS
      manages the fall-through from a negative dentry to a lower layer.  In such a
      case, the dentry type of the negative union dentry is set to the same as the
      type of the lower dentry.
      
      However, if you know d_inode is not NULL at the call site, then you can use
      the d_is_xxx() functions even in a filesystem.
      
      There is one further complication: a 0,0 chardev dentry may be labelled
      DCACHE_WHITEOUT_TYPE rather than DCACHE_SPECIAL_TYPE.  Strictly, this was
      intended for special directory entry types that don't have attached inodes.
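      An illustrative before/after at a hypothetical call site (do_dir_thing()
      is made up for the example):

          /* before: chases d_inode and can oops on a negative dentry */
          if (S_ISDIR(dentry->d_inode->i_mode))
                  do_dir_thing();

          /* after: uses the dentry type field */
          if (d_is_dir(dentry))
                  do_dir_thing();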
      
      The following perl+coccinelle script was used:
      
      use strict;
      
      my @callers;
      my $fd;
      open($fd, 'git grep -l \'S_IS[A-Z].*->d_inode\' |') ||
          die "Can't grep for S_ISDIR and co. callers";
      @callers = <$fd>;
      close($fd);
      unless (@callers) {
          print "No matches\n";
          exit(0);
      }
      
      my @cocci = (
          '@@',
          'expression E;',
          '@@',
          '',
          '- S_ISLNK(E->d_inode->i_mode)',
          '+ d_is_symlink(E)',
          '',
          '@@',
          'expression E;',
          '@@',
          '',
          '- S_ISDIR(E->d_inode->i_mode)',
          '+ d_is_dir(E)',
          '',
          '@@',
          'expression E;',
          '@@',
          '',
          '- S_ISREG(E->d_inode->i_mode)',
          '+ d_is_reg(E)' );
      
      my $coccifile = "tmp.sp.cocci";
      open($fd, ">$coccifile") || die $coccifile;
      print($fd "$_\n") || die $coccifile foreach (@cocci);
      close($fd);
      
      foreach my $file (@callers) {
          chomp $file;
          print "Processing ", $file, "\n";
          system("spatch", "--sp-file", $coccifile, $file, "--in-place", "--no-show-diff") == 0 ||
      	die "spatch failed";
      }
      
      [AV: overlayfs parts skipped]
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  7. 18 February 2015: 2 commits
  8. 17 February 2015: 7 commits
    • vfs: remove get_xip_mem · e748dcd0
      Committed by Matthew Wilcox
      All callers of get_xip_mem() are now gone.  Remove checks for it,
      initialisers of it, documentation of it, and the only implementation of
      it.  Also remove mm/filemap_xip.c as it is now empty, and remove
      documentation of the long-gone get_xip_page().
      Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: Andreas Dilger <andreas.dilger@intel.com>
      Cc: Boaz Harrosh <boaz@plexistor.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • dax,ext2: replace xip_truncate_page with dax_truncate_page · 4c0ccfef
      Committed by Matthew Wilcox
      It takes a get_block parameter just like nobh_truncate_page() and
      block_truncate_page().
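      A sketch of an ext2 call site under this interface (ext2_get_block is
      ext2's get_block_t implementation):

          if (IS_DAX(inode))
                  error = dax_truncate_page(inode, newsize, ext2_get_block);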
      Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Andreas Dilger <andreas.dilger@intel.com>
      Cc: Boaz Harrosh <boaz@plexistor.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • dax,ext2: replace the XIP page fault handler with the DAX page fault handler · f7ca90b1
      Committed by Matthew Wilcox
      Instead of calling aops->get_xip_mem from the fault handler, the
      filesystem passes a get_block_t that is used to find the appropriate
      blocks.
      
      This requires that all architectures implement copy_user_page().  At the
      time of writing, mips and arm do not.  Patches exist and are in progress.
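      A sketch of how a filesystem wires this up (modeled on the ext2
      conversion; simplified):

          static int ext2_dax_fault(struct vm_area_struct *vma,
                                    struct vm_fault *vmf)
          {
                  return dax_fault(vma, vmf, ext2_get_block);
          }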
      
      [akpm@linux-foundation.org: remap_file_pages went away]
      Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Andreas Dilger <andreas.dilger@intel.com>
      Cc: Boaz Harrosh <boaz@plexistor.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • dax,ext2: replace XIP read and write with DAX I/O · d475c634
      Committed by Matthew Wilcox
      Use the generic AIO infrastructure instead of custom read and write
      methods.  In addition to giving us support for AIO, this adds the missing
      locking between read() and truncate().
      Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Andreas Dilger <andreas.dilger@intel.com>
      Cc: Boaz Harrosh <boaz@plexistor.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • vfs,ext2: introduce IS_DAX(inode) · fbbbad4b
      Committed by Matthew Wilcox
      Use an inode flag to tag inodes which should avoid using the page cache.
      Convert ext2 to use it instead of mapping_is_xip().  Prevent I/Os to files
      tagged with the DAX flag from falling back to buffered I/O.
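      The flag test itself is a one-line inode-flag check (a sketch of the
      idea):

          #define IS_DAX(inode)	((inode)->i_flags & S_DAX)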
      Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Andreas Dilger <andreas.dilger@intel.com>
      Cc: Boaz Harrosh <boaz@plexistor.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: allow page fault handlers to perform the COW · 2e4cdab0
      Committed by Matthew Wilcox
      Currently COW of an XIP file is done by first bringing in a read-only
      mapping, then retrying the fault and copying the page.  It is much more
      efficient to tell the fault handler that a COW is being attempted (by
      passing in the pre-allocated page in the vm_fault structure), and allow
      the handler to perform the COW operation itself.
      
      The handler cannot insert the page itself if there is already a read-only
      mapping at that address, so allow the handler to return VM_FAULT_LOCKED
      and set the fault_page to be NULL.  This indicates to the MM code that the
      i_mmap_lock is held instead of the page lock.
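      A sketch of the extended fault interface (field names per the
      description above; simplified):

          struct vm_fault {
                  /* ... existing fields ... */
                  struct page *cow_page;  /* pre-allocated page the handler
                                           * may copy into for a COW fault */
                  struct page *page;      /* NULL with VM_FAULT_LOCKED means
                                           * the i_mmap_lock is held instead
                                           * of a page lock */
          };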
      Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andreas Dilger <andreas.dilger@intel.com>
      Cc: Boaz Harrosh <boaz@plexistor.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: fix XIP fault vs truncate race · 283307c7
      Committed by Matthew Wilcox
      DAX is a replacement for the variation of XIP currently supported by the
      ext2 filesystem.  We have three different things in the tree called 'XIP',
      and the new focus is on access to data rather than executables, so a name
      change was in order.  DAX stands for Direct Access.  The X is for
      eXciting.
      
      The new focus on data access has resulted in more careful attention to
      races that exist in the current XIP code, but are not hit by the use-case
      that it was designed for.  XIP's architecture worked fine for ext2, but
      DAX is architected to work with modern filesystems such as ext4 and XFS.
      DAX is not intended for use with btrfs; the value that btrfs adds relies
      on manipulating data and writing data to different locations, while DAX's
      value is for write-in-place and keeping the kernel from touching the data.
      
      DAX was developed in order to support NV-DIMMs, but it's become clear that
      its usefulness extends beyond NV-DIMMs and there are several potential
      customers including the tracing machinery.  Other people want to place the
      kernel log in an area of memory, as long as they have a BIOS that does not
      clear DRAM on reboot.
      
      Patch 1 is a bug fix, probably worth including in 3.18.
      
      Patches 2 & 3 are infrastructure for DAX.
      
      Patches 4-8 replace the XIP code with its DAX equivalents, transforming
      ext2 to use the DAX code as we go.  Note that patch 10 is the
      Documentation patch.
      
      Patches 9-15 clean up after the XIP code, removing the infrastructure
      that is no longer needed and renaming various XIP things to DAX.
      Most of these patches were added after Jan found things he didn't
      like in an earlier version of the ext4 patch ... that had been copied
      from ext2.  So ext2 is being transformed to do things the same way that
      ext4 will later.  The ability to mount ext2 filesystems with the 'xip'
      option is retained, although the 'dax' option is now preferred.
      
      Patch 16 adds some DAX infrastructure to support ext4.
      
      Patch 17 adds DAX support to ext4.  It is broadly similar to ext2's DAX
      support, but it is more efficient than ext2's due to its support for
      unwritten extents.
      
      Patch 18 is another cleanup patch renaming XIP to DAX.
      
      My thanks to Mathieu Desnoyers for his reviews of the v11 patchset.  Most
      of the changes below were based on his feedback.
      
      This patch (of 18):
      
      Pagecache faults recheck i_size after taking the page lock to ensure that
      the fault didn't race against a truncate.  We don't have a page to lock in
      the XIP case, so use i_mmap_lock_read() instead.  It is locked in the
      truncate path in unmap_mapping_range() after updating i_size.  So while we
      hold it in the fault path, we are guaranteed that either i_size has
      already been updated in the truncate path, or that the truncate will
      subsequently call zap_page_range_single() and so remove the mapping we
      have just inserted.
      
      There is a window of time in which i_size has been reduced and the thread
      has a mapping to a page which will be removed from the file, but this is
      harmless as the page will not be allocated to a different purpose before
      the thread's access to it is revoked.
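      
      A minimal sketch of the fault-side pattern described above (locking
      helpers as used by this patch; the surrounding handler variables such as
      mapping, inode and vmf are assumed context, and the rest of the handler
      is elided):
      
          /* Recheck i_size under i_mmap_lock_read() so the fault cannot race
           * with a truncate that has already updated i_size. */
          i_mmap_lock_read(mapping);
          size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
          if (vmf->pgoff >= size) {
                  i_mmap_unlock_read(mapping);
                  return VM_FAULT_SIGBUS;         /* lost the race to truncate */
          }
          /* ... insert the mapping; if truncate runs after us it will call
           * zap_page_range_single() and remove it again ... */
          i_mmap_unlock_read(mapping);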
      
      [akpm@linux-foundation.org: switch to i_mmap_lock_read(), add comment in unmap_single_vma()]
      Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Andreas Dilger <andreas.dilger@intel.com>
      Cc: Boaz Harrosh <boaz@plexistor.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      283307c7
  9. 14 Feb 2015, 8 commits
    • A
      kasan: enable instrumentation of global variables · bebf56a1
      Committed by Andrey Ryabinin
      This feature lets us detect out-of-bounds accesses to global variables.
      It works both for globals in the kernel image and for globals in
      modules.  Currently it won't work for symbols in user-specified sections
      (e.g.  __init, __read_mostly, ...).
      
      The idea is simple.  The compiler grows each global variable by the
      redzone size and adds constructors invoking the
      __asan_register_globals() function.  Information about each global
      variable (address, size, size with redzone, ...) is passed to
      __asan_register_globals() so we can poison the variable's redzone.
      
      This patch also forces module_alloc() to return an 8*PAGE_SIZE-aligned
      address, making shadow memory handling
      (kasan_module_alloc()/kasan_module_free()) simpler.  Such alignment
      guarantees that each shadow page backing module address space
      corresponds to only one module_alloc() allocation.
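      
      For reference, the descriptor the constructors hand to
      __asan_register_globals() has roughly this shape (a sketch modeled on
      the ASan ABI emitted by GCC 5; the exact layout is dictated by the
      compiler):
      
          struct kasan_global {
                  const void *beg;                /* address of the global      */
                  size_t size;                    /* size of the variable       */
                  size_t size_with_redzone;       /* size including the redzone */
                  const void *name;               /* variable name for reports  */
                  const void *module_name;        /* file it was declared in    */
                  unsigned long has_dynamic_init; /* needed for C++             */
          };
          
          void __asan_register_globals(struct kasan_global *globals, size_t size);
          void __asan_unregister_globals(struct kasan_global *globals, size_t size);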
      Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Konstantin Serebryany <kcc@google.com>
      Cc: Dmitry Chernenkov <dmitryc@google.com>
      Signed-off-by: Andrey Konovalov <adech.fo@gmail.com>
      Cc: Yuri Gribov <tetra2005@gmail.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bebf56a1
    • A
      mm: vmalloc: pass additional vm_flags to __vmalloc_node_range() · cb9e3c29
      Committed by Andrey Ryabinin
      To instrument global variables, KASan needs shadow memory backing the
      memory used for modules.  So on module load we need to allocate memory
      for the shadow and map it at the shadow address corresponding to the
      address returned by module_alloc().
      
      __vmalloc_node_range() could be used for this purpose, except that it
      puts a guard hole after the allocated area.  A guard hole in shadow
      memory would be a problem, because at some future point we might need
      shadow memory at the address occupied by that guard hole; the shadow
      allocation for a later module_alloc() would then fail.
      
      Now that we have the VM_NO_GUARD flag disabling the guard page, we need
      a way to pass it into __vmalloc_node_range().  Add a new parameter
      'vm_flags' to the __vmalloc_node_range() function.
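      
      A sketch of the resulting prototype and of the kind of call the kasan
      module-shadow code can then make (details such as the gfp mask follow
      this series; treat them as illustrative):
      
          void *__vmalloc_node_range(unsigned long size, unsigned long align,
                          unsigned long start, unsigned long end, gfp_t gfp_mask,
                          pgprot_t prot, unsigned long vm_flags, int node,
                          const void *caller);
          
          /* e.g. allocating module shadow with no guard hole (sketch) */
          ret = __vmalloc_node_range(shadow_size, 1, shadow_start,
                          shadow_start + shadow_size,
                          GFP_KERNEL | __GFP_HIGHMEM | __GFP_ZERO,
                          PAGE_KERNEL, VM_NO_GUARD, NUMA_NO_NODE,
                          __builtin_return_address(0));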
      Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Konstantin Serebryany <kcc@google.com>
      Cc: Dmitry Chernenkov <dmitryc@google.com>
      Signed-off-by: Andrey Konovalov <adech.fo@gmail.com>
      Cc: Yuri Gribov <tetra2005@gmail.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cb9e3c29
    • A
      mm: vmalloc: add flag preventing guard hole allocation · 71394fe5
      Committed by Andrey Ryabinin
      To instrument global variables, KASan needs shadow memory backing the
      memory used for modules.  So on module load we need to allocate memory
      for the shadow and map it at the shadow address corresponding to the
      address returned by module_alloc().
      
      __vmalloc_node_range() could be used for this purpose, except that it
      puts a guard hole after the allocated area.  A guard hole in shadow
      memory would be a problem, because at some future point we might need
      shadow memory at the address occupied by that guard hole; the shadow
      allocation for a later module_alloc() would then fail.
      
      Add a new vm_struct flag 'VM_NO_GUARD' indicating that the vm area does
      not have a guard hole.
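      
      A sketch of the flag and of the one helper that has to know about it,
      get_vm_area_size(), which stops subtracting the guard page when the
      flag is set (the flag value is illustrative):
      
          #define VM_NO_GUARD     0x00000040      /* don't add a guard page */
          
          static inline size_t get_vm_area_size(const struct vm_struct *area)
          {
                  if (!(area->flags & VM_NO_GUARD))
                          /* return actual size without the guard page */
                          return area->size - PAGE_SIZE;
                  return area->size;
          }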
      Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Konstantin Serebryany <kcc@google.com>
      Cc: Dmitry Chernenkov <dmitryc@google.com>
      Signed-off-by: Andrey Konovalov <adech.fo@gmail.com>
      Cc: Yuri Gribov <tetra2005@gmail.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      71394fe5
    • A
      kasan: enable stack instrumentation · c420f167
      Committed by Andrey Ryabinin
      Stack instrumentation detects out-of-bounds memory accesses to
      variables allocated on the stack.  The compiler adds redzones around
      every stack variable and poisons the redzones in the function's
      prologue.
      
      This approach significantly increases stack usage, so all in-kernel
      stack sizes were doubled.
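      
      On x86_64 the doubling is a one-order bump of the stack size; a sketch
      of the shape of that change (header location and macro names per this
      series, from memory, so treat as illustrative):
      
          /* arch/x86/include/asm/page_64_types.h (sketch) */
          #ifdef CONFIG_KASAN
          #define KASAN_STACK_ORDER 1
          #else
          #define KASAN_STACK_ORDER 0
          #endif
          
          #define THREAD_SIZE_ORDER (2 + KASAN_STACK_ORDER)
          #define THREAD_SIZE       (PAGE_SIZE << THREAD_SIZE_ORDER)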
      Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Konstantin Serebryany <kcc@google.com>
      Cc: Dmitry Chernenkov <dmitryc@google.com>
      Signed-off-by: Andrey Konovalov <adech.fo@gmail.com>
      Cc: Yuri Gribov <tetra2005@gmail.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c420f167
    • A
      x86_64: kasan: add interceptors for memset/memmove/memcpy functions · 393f203f
      Committed by Andrey Ryabinin
      Recently, instrumentation of calls to built-in functions was removed
      from GCC 5.0.  To check the memory accessed by such functions, userspace
      asan always uses interceptors for them.
      
      So now we should do the same.  This patch declares memset/memmove/memcpy
      as weak symbols.  In mm/kasan/kasan.c we have our own implementations of
      these functions, which check memory before accessing it.
      
      The default memset/memmove/memcpy now always have aliases with a '__'
      prefix.  In files built without kasan instrumentation (e.g.  mm/slub.c)
      the original mem* functions are replaced (via #define) with the prefixed
      variants, because we don't want to check memory accesses there.
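      
      A sketch of one of the interceptors: the checked version validates the
      destination range before delegating to the uninstrumented '__' alias:
      
          /* mm/kasan/kasan.c (sketch) */
          #undef memset
          void *memset(void *addr, int c, size_t len)
          {
                  check_memory_region((unsigned long)addr, len, true); /* write */
                  return __memset(addr, c, len);
          }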
      Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Konstantin Serebryany <kcc@google.com>
      Cc: Dmitry Chernenkov <dmitryc@google.com>
      Signed-off-by: Andrey Konovalov <adech.fo@gmail.com>
      Cc: Yuri Gribov <tetra2005@gmail.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      393f203f
    • A
      kmemleak: disable kasan instrumentation for kmemleak · e79ed2f1
      Committed by Andrey Ryabinin
      kmalloc internally rounds up the allocation size, and kmemleak uses the
      rounded-up size as the object's size.  This makes kasan complain while
      kmemleak scans memory or calculates an object's checksum.  The simplest
      solution here is to disable kasan for these accesses.
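      
      A sketch of the pattern: kmemleak brackets its reads of the object with
      the kasan disable/enable helpers, for example while checksumming:
      
          kasan_disable_current();
          object->checksum = crc32(0, (void *)object->pointer, object->size);
          kasan_enable_current();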
      Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Konstantin Serebryany <kcc@google.com>
      Cc: Dmitry Chernenkov <dmitryc@google.com>
      Signed-off-by: Andrey Konovalov <adech.fo@gmail.com>
      Cc: Yuri Gribov <tetra2005@gmail.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e79ed2f1
    • A
      mm: slub: add kernel address sanitizer support for slub allocator · 0316bec2
      Committed by Andrey Ryabinin
      With this patch kasan will be able to catch bugs in memory allocated by
      slub.  Initially, all objects in a newly allocated slab page are marked
      as redzone.  Later, when a slub object is allocated, the number of bytes
      requested by the caller is marked as accessible, and the rest of the
      object (including slub's metadata) is marked as redzone (inaccessible).
      
      We also mark an object as accessible if ksize() was called for it.
      There are places in the kernel where ksize() is called to find out the
      size of the actually allocated area.  Such callers may validly access
      the whole allocated memory, so it should be marked as accessible.
      
      Code in the slub.c and slab_common.c files can validly access an
      object's metadata, so instrumentation for these files is disabled.
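      
      A sketch of the allocation-side hook: unpoison exactly the bytes the
      caller asked for and re-poison the tail of the object (helper and
      constant names follow the kasan code in this series; treat the exact
      rounding as illustrative):
      
          void kasan_kmalloc(struct kmem_cache *cache, const void *object,
                          size_t size)
          {
                  unsigned long redzone_start, redzone_end;
          
                  if (unlikely(object == NULL))
                          return;
          
                  redzone_start = round_up((unsigned long)object + size,
                                  KASAN_SHADOW_SCALE_SIZE);
                  redzone_end = round_up((unsigned long)object + cache->object_size,
                                  KASAN_SHADOW_SCALE_SIZE);
          
                  kasan_unpoison_shadow(object, size);    /* caller-visible bytes */
                  kasan_poison_shadow((void *)redzone_start,
                                  redzone_end - redzone_start,
                                  KASAN_KMALLOC_REDZONE); /* rest stays inaccessible */
          }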
      Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com>
      Signed-off-by: Dmitry Chernenkov <dmitryc@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Konstantin Serebryany <kcc@google.com>
      Signed-off-by: Andrey Konovalov <adech.fo@gmail.com>
      Cc: Yuri Gribov <tetra2005@gmail.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0316bec2
    • A
      mm: slub: introduce metadata_access_enable()/metadata_access_disable() · a79316c6
      Committed by Andrey Ryabinin
      It's ok for slub to access memory that kasan has marked as inaccessible
      (an object's metadata).  Kasan shouldn't print a report in that case,
      because these accesses are valid.  Disabling instrumentation of the
      slub.c code is not enough to achieve this, because slub passes pointers
      to object metadata into external functions like memchr_inv().
      
      We don't want to disable instrumentation for memchr_inv(), because it is
      a fairly generic function and we don't want to miss bugs elsewhere.
      
      metadata_access_enable()/metadata_access_disable() are used to tell
      KASan where accesses to metadata start and end, so we can temporarily
      suppress KASan reports.
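      
      The helpers are a thin wrapper around the per-task kasan depth counter;
      a sketch, with a typical call site in slub's memory checker:
      
          static inline void metadata_access_enable(void)
          {
                  kasan_disable_current();
          }
          
          static inline void metadata_access_disable(void)
          {
                  kasan_enable_current();
          }
          
          /* usage (sketch) */
          metadata_access_enable();
          fault = memchr_inv(start, value, bytes);
          metadata_access_disable();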
      Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Konstantin Serebryany <kcc@google.com>
      Cc: Dmitry Chernenkov <dmitryc@google.com>
      Signed-off-by: Andrey Konovalov <adech.fo@gmail.com>
      Cc: Yuri Gribov <tetra2005@gmail.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a79316c6