1. 29 12月, 2018 7 次提交
  2. 27 10月, 2018 4 次提交
    • A
      mm: thp: relocate flush_cache_range() in migrate_misplaced_transhuge_page() · 7eef5f97
      Andrea Arcangeli 提交于
      There should be no cache left by the time we overwrite the old transhuge
      pmd with the new one.  It's already too late to flush through the virtual
      address because we already copied the page data to the new physical
      address.
      
      So flush the cache before the data copy.
      
      Also delete the "end" variable to shutoff a "unused variable" warning on
      x86 where flush_cache_range() is a noop.
      
      Link: http://lkml.kernel.org/r/20181015202311.7209-1-aarcange@redhat.comSigned-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Aaron Tomlin <atomlin@redhat.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7eef5f97
    • A
      mm: thp: fix mmu_notifier in migrate_misplaced_transhuge_page() · 7066f0f9
      Andrea Arcangeli 提交于
      change_huge_pmd() after arming the numa/protnone pmd doesn't flush the TLB
      right away.  do_huge_pmd_numa_page() flushes the TLB before calling
      migrate_misplaced_transhuge_page().  By the time do_huge_pmd_numa_page()
      runs some CPU could still access the page through the TLB.
      
      change_huge_pmd() before arming the numa/protnone transhuge pmd calls
      mmu_notifier_invalidate_range_start().  So there's no need of
      mmu_notifier_invalidate_range_start()/mmu_notifier_invalidate_range_only_end()
      sequence in migrate_misplaced_transhuge_page() too, because by the time
      migrate_misplaced_transhuge_page() runs, the pmd mapping has already been
      invalidated in the secondary MMUs.  It has to or if a secondary MMU can
      still write to the page, the migrate_page_copy() would lose data.
      
      However an explicit mmu_notifier_invalidate_range() is needed before
      migrate_misplaced_transhuge_page() starts copying the data of the
      transhuge page or the below can happen for MMU notifier users sharing the
      primary MMU pagetables and only implementing ->invalidate_range:
      
      CPU0		CPU1		GPU sharing linux pagetables using
                                      only ->invalidate_range
      -----------	------------	---------
      				GPU secondary MMU writes to the page
      				mapped by the transhuge pmd
      change_pmd_range()
      mmu..._range_start()
      ->invalidate_range_start() noop
      change_huge_pmd()
      set_pmd_at(numa/protnone)
      pmd_unlock()
      		do_huge_pmd_numa_page()
      		CPU TLB flush globally (1)
      		CPU cannot write to page
      		migrate_misplaced_transhuge_page()
      				GPU writes to the page...
      		migrate_page_copy()
      				...GPU stops writing to the page
      CPU TLB flush (2)
      mmu..._range_end() (3)
      ->invalidate_range_stop() noop
      ->invalidate_range()
      				GPU secondary MMU is invalidated
      				and cannot write to the page anymore
      				(too late)
      
      Just like we need a CPU TLB flush (1) because the TLB flush (2) arrives
      too late, we also need a mmu_notifier_invalidate_range() before calling
      migrate_misplaced_transhuge_page(), because the ->invalidate_range() in
      (3) also arrives too late.
      
      This requirement is the result of the lazy optimization in
      change_huge_pmd() that releases the pmd_lock without first flushing the
      TLB and without first calling mmu_notifier_invalidate_range().
      
      Even converting the removed mmu_notifier_invalidate_range_only_end() into
      a mmu_notifier_invalidate_range_end() would not have been enough to fix
      this, because it run after migrate_page_copy().
      
      After the hugepage data copy is done migrate_misplaced_transhuge_page()
      can proceed and call set_pmd_at without having to flush the TLB nor any
      secondary MMUs because the secondary MMU invalidate, just like the CPU TLB
      flush, has to happen before the migrate_page_copy() is called or it would
      be a bug in the first place (and it was for drivers using
      ->invalidate_range()).
      
      KVM is unaffected because it doesn't implement ->invalidate_range().
      
      The standard PAGE_SIZEd migrate_misplaced_page is less accelerated and
      uses the generic migrate_pages which transitions the pte from
      numa/protnone to a migration entry in try_to_unmap_one() and flushes TLBs
      and all mmu notifiers there before copying the page.
      
      Link: http://lkml.kernel.org/r/20181013002430.698-3-aarcange@redhat.comSigned-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: NAaron Tomlin <atomlin@redhat.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7066f0f9
    • A
      mm: thp: fix MADV_DONTNEED vs migrate_misplaced_transhuge_page race condition · d7c33934
      Andrea Arcangeli 提交于
      Patch series "migrate_misplaced_transhuge_page race conditions".
      
      Aaron found a new instance of the THP MADV_DONTNEED race against
      pmdp_clear_flush* variants, that was apparently left unfixed.
      
      While looking into the race found by Aaron, I may have found two more
      issues in migrate_misplaced_transhuge_page.
      
      These race conditions would not cause kernel instability, but they'd
      corrupt userland data or leave data non zero after MADV_DONTNEED.
      
      I did only minor testing, and I don't expect to be able to reproduce this
      (especially the lack of ->invalidate_range before migrate_page_copy,
      requires the latest iommu hardware or infiniband to reproduce).  The last
      patch is noop for x86 and it needs further review from maintainers of
      archs that implement flush_cache_range() (not in CC yet).
      
      To avoid confusion, it's not the first patch that introduces the bug fixed
      in the second patch, even before removing the
      pmdp_huge_clear_flush_notify, that _notify suffix was called after
      migrate_page_copy already run.
      
      This patch (of 3):
      
      This is a corollary of ced10803 ("thp: fix MADV_DONTNEED vs.  numa
      balancing race"), 58ceeb6b ("thp: fix MADV_DONTNEED vs.  MADV_FREE
      race") and 5b7abeae ("thp: fix MADV_DONTNEED vs clear soft dirty
      race).
      
      When the above three fixes where posted Dave asked
      https://lkml.kernel.org/r/929b3844-aec2-0111-fef7-8002f9d4e2b9@intel.com
      but apparently this was missed.
      
      The pmdp_clear_flush* in migrate_misplaced_transhuge_page() was introduced
      in a54a407f ("mm: Close races between THP migration and PMD numa
      clearing").
      
      The important part of such commit is only the part where the page lock is
      not released until the first do_huge_pmd_numa_page() finished disarming
      the pagenuma/protnone.
      
      The addition of pmdp_clear_flush() wasn't beneficial to such commit and
      there's no commentary about such an addition either.
      
      I guess the pmdp_clear_flush() in such commit was added just in case for
      safety, but it ended up introducing the MADV_DONTNEED race condition found
      by Aaron.
      
      At that point in time nobody thought of such kind of MADV_DONTNEED race
      conditions yet (they were fixed later) so the code may have looked more
      robust by adding the pmdp_clear_flush().
      
      This specific race condition won't destabilize the kernel, but it can
      confuse userland because after MADV_DONTNEED the memory won't be zeroed
      out.
      
      This also optimizes the code and removes a superfluous TLB flush.
      
      [akpm@linux-foundation.org: reflow comment to 80 cols, fix grammar and typo (beacuse)]
      Link: http://lkml.kernel.org/r/20181013002430.698-2-aarcange@redhat.comSigned-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Reported-by: NAaron Tomlin <atomlin@redhat.com>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d7c33934
    • J
      mm: workingset: tell cache transitions from workingset thrashing · 1899ad18
      Johannes Weiner 提交于
      Refaults happen during transitions between workingsets as well as in-place
      thrashing.  Knowing the difference between the two has a range of
      applications, including measuring the impact of memory shortage on the
      system performance, as well as the ability to smarter balance pressure
      between the filesystem cache and the swap-backed workingset.
      
      During workingset transitions, inactive cache refaults and pushes out
      established active cache.  When that active cache isn't stale, however,
      and also ends up refaulting, that's bonafide thrashing.
      
      Introduce a new page flag that tells on eviction whether the page has been
      active or not in its lifetime.  This bit is then stored in the shadow
      entry, to classify refaults as transitioning or thrashing.
      
      How many page->flags does this leave us with on 32-bit?
      
      	20 bits are always page flags
      
      	21 if you have an MMU
      
      	23 with the zone bits for DMA, Normal, HighMem, Movable
      
      	29 with the sparsemem section bits
      
      	30 if PAE is enabled
      
      	31 with this patch.
      
      So on 32-bit PAE, that leaves 1 bit for distinguishing two NUMA nodes.  If
      that's not enough, the system can switch to discontigmem and re-gain the 6
      or 7 sparsemem section bits.
      
      Link: http://lkml.kernel.org/r/20180828172258.3185-3-hannes@cmpxchg.orgSigned-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Tested-by: NDaniel Drake <drake@endlessm.com>
      Tested-by: NSuren Baghdasaryan <surenb@google.com>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Johannes Weiner <jweiner@fb.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Enderborg <peter.enderborg@sony.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vinayak Menon <vinmenon@codeaurora.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1899ad18
  3. 21 10月, 2018 1 次提交
  4. 06 10月, 2018 2 次提交
  5. 02 10月, 2018 2 次提交
    • M
      mm, sched/numa: Remove rate-limiting of automatic NUMA balancing migration · efaffc5e
      Mel Gorman 提交于
      Rate limiting of page migrations due to automatic NUMA balancing was
      introduced to mitigate the worst-case scenario of migrating at high
      frequency due to false sharing or slowly ping-ponging between nodes.
      Since then, a lot of effort was spent on correctly identifying these
      pages and avoiding unnecessary migrations and the safety net may no longer
      be required.
      
      Jirka Hladky reported a regression in 4.17 due to a scheduler patch that
      avoids spreading STREAM tasks wide prematurely. However, once the task
      was properly placed, it delayed migrating the memory due to rate limiting.
      Increasing the limit fixed the problem for him.
      
      Currently, the limit is hard-coded and does not account for the real
      capabilities of the hardware. Even if an estimate was attempted, it would
      not properly account for the number of memory controllers and it could
      not account for the amount of bandwidth used for normal accesses. Rather
      than fudging, this patch simply eliminates the rate limiting.
      
      However, Jirka reports that a STREAM configuration using multiple
      processes achieved similar performance to 4.16. In local tests, this patch
      improved performance of STREAM relative to the baseline but it is somewhat
      machine-dependent. Most workloads show little or not performance difference
      implying that there is not a heavily reliance on the throttling mechanism
      and it is safe to remove.
      
      STREAM on 2-socket machine
                               4.19.0-rc5             4.19.0-rc5
                               numab-v1r1       noratelimit-v1r1
      MB/sec copy     43298.52 (   0.00%)    44673.38 (   3.18%)
      MB/sec scale    30115.06 (   0.00%)    31293.06 (   3.91%)
      MB/sec add      32825.12 (   0.00%)    34883.62 (   6.27%)
      MB/sec triad    32549.52 (   0.00%)    34906.60 (   7.24%
      Signed-off-by: NMel Gorman <mgorman@techsingularity.net>
      Reviewed-by: NRik van Riel <riel@surriel.com>
      Acked-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Jirka Hladky <jhladky@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Linux-MM <linux-mm@kvack.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20181001100525.29789-2-mgorman@techsingularity.netSigned-off-by: NIngo Molnar <mingo@kernel.org>
      efaffc5e
    • S
      mm/migrate: Use spin_trylock() while resetting rate limit · 75346121
      Srikar Dronamraju 提交于
      Since this spinlock will only serialize the migrate rate limiting,
      convert the spin_lock() to a spin_trylock(). If another thread is updating, this
      task can move on.
      
      Specjbb2005 results (8 warehouses)
      Higher bops are better
      
      2 Socket - 2  Node Haswell - X86
      JVMS  Prev    Current  %Change
      4     205332  198512   -3.32145
      1     319785  313559   -1.94693
      
      2 Socket - 4 Node Power8 - PowerNV
      JVMS  Prev    Current  %Change
      8     74912   74761.9  -0.200368
      1     206585  214874   4.01239
      
      2 Socket - 2  Node Power9 - PowerNV
      JVMS  Prev    Current  %Change
      4     189162  180536   -4.56011
      1     213760  210281   -1.62753
      
      4 Socket - 4  Node Power7 - PowerVM
      JVMS  Prev     Current  %Change
      8     58736.8  56511.4  -3.78877
      1     105419   104899   -0.49327
      
      Avoiding stretching of window intervals may be the reason for the
      regression. Also code now uses READ_ONCE/WRITE_ONCE. That may
      also be hurting performance to some extent.
      
      Some events stats before and after applying the patch.
      
      perf stats 8th warehouse Multi JVM 2 Socket - 2  Node Haswell - X86
      Event                     Before          After
      cs                        14,285,708      13,818,546
      migrations                1,180,621       1,149,960
      faults                    339,114         385,583
      cache-misses              55,205,631,894  55,259,546,768
      sched:sched_move_numa     843             2,257
      sched:sched_stick_numa    6               9
      sched:sched_swap_numa     219             512
      migrate:mm_migrate_pages  365             2,225
      
      vmstat 8th warehouse Multi JVM 2 Socket - 2  Node Haswell - X86
      Event                   Before  After
      numa_hint_faults        26907   72692
      numa_hint_faults_local  24279   62270
      numa_hit                239771  238762
      numa_huge_pte_updates   0       48
      numa_interleave         68      75
      numa_local              239688  238676
      numa_other              83      86
      numa_pages_migrated     363     2225
      numa_pte_updates        27415   98557
      
      perf stats 8th warehouse Single JVM 2 Socket - 2  Node Haswell - X86
      Event                     Before          After
      cs                        3,202,779       3,173,490
      migrations                37,186          36,966
      faults                    106,076         108,776
      cache-misses              12,024,873,744  12,200,075,320
      sched:sched_move_numa     931             1,264
      sched:sched_stick_numa    0               0
      sched:sched_swap_numa     1               0
      migrate:mm_migrate_pages  637             899
      
      vmstat 8th warehouse Single JVM 2 Socket - 2  Node Haswell - X86
      Event                   Before  After
      numa_hint_faults        17409   21109
      numa_hint_faults_local  14367   17120
      numa_hit                73953   72934
      numa_huge_pte_updates   20      42
      numa_interleave         25      33
      numa_local              73892   72866
      numa_other              61      68
      numa_pages_migrated     668     915
      numa_pte_updates        27276   42326
      
      perf stats 8th warehouse Multi JVM 2 Socket - 2  Node Power9 - PowerNV
      Event                     Before       After
      cs                        8,474,013    8,312,022
      migrations                254,934      231,705
      faults                    320,506      310,242
      cache-misses              110,580,458  402,324,573
      sched:sched_move_numa     725          193
      sched:sched_stick_numa    0            0
      sched:sched_swap_numa     7            3
      migrate:mm_migrate_pages  145          93
      
      vmstat 8th warehouse Multi JVM 2 Socket - 2  Node Power9 - PowerNV
      Event                   Before  After
      numa_hint_faults        22797   11838
      numa_hint_faults_local  21539   11216
      numa_hit                89308   90689
      numa_huge_pte_updates   0       0
      numa_interleave         865     1579
      numa_local              88955   89634
      numa_other              353     1055
      numa_pages_migrated     149     92
      numa_pte_updates        22930   12109
      
      perf stats 8th warehouse Single JVM 2 Socket - 2  Node Power9 - PowerNV
      Event                     Before     After
      cs                        2,195,628  2,170,481
      migrations                11,179     10,126
      faults                    149,656    160,962
      cache-misses              8,117,515  10,834,845
      sched:sched_move_numa     49         10
      sched:sched_stick_numa    0          0
      sched:sched_swap_numa     0          0
      migrate:mm_migrate_pages  5          2
      
      vmstat 8th warehouse Single JVM 2 Socket - 2  Node Power9 - PowerNV
      Event                   Before  After
      numa_hint_faults        3577    403
      numa_hint_faults_local  3476    358
      numa_hit                26142   25898
      numa_huge_pte_updates   0       0
      numa_interleave         358     207
      numa_local              26042   25860
      numa_other              100     38
      numa_pages_migrated     5       2
      numa_pte_updates        3587    400
      
      perf stats 8th warehouse Multi JVM 4 Socket - 4  Node Power7 - PowerVM
      Event                     Before           After
      cs                        100,602,296      110,339,633
      migrations                4,135,630        4,139,812
      faults                    789,256          863,622
      cache-misses              226,160,621,058  231,838,045,660
      sched:sched_move_numa     1,366            2,196
      sched:sched_stick_numa    16               33
      sched:sched_swap_numa     374              544
      migrate:mm_migrate_pages  1,350            2,469
      
      vmstat 8th warehouse Multi JVM 4 Socket - 4  Node Power7 - PowerVM
      Event                   Before  After
      numa_hint_faults        47857   85748
      numa_hint_faults_local  39768   66831
      numa_hit                240165  242213
      numa_huge_pte_updates   0       0
      numa_interleave         0       0
      numa_local              240165  242211
      numa_other              0       2
      numa_pages_migrated     1224    2376
      numa_pte_updates        48354   86233
      
      perf stats 8th warehouse Single JVM 4 Socket - 4  Node Power7 - PowerVM
      Event                     Before          After
      cs                        58,515,496      59,331,057
      migrations                564,845         552,019
      faults                    245,807         266,586
      cache-misses              73,603,757,976  73,796,312,990
      sched:sched_move_numa     996             981
      sched:sched_stick_numa    10              54
      sched:sched_swap_numa     193             286
      migrate:mm_migrate_pages  646             713
      
      vmstat 8th warehouse Single JVM 4 Socket - 4  Node Power7 - PowerVM
      Event                   Before  After
      numa_hint_faults        13422   14807
      numa_hint_faults_local  5619    5738
      numa_hit                36118   36230
      numa_huge_pte_updates   0       0
      numa_interleave         0       0
      numa_local              36116   36228
      numa_other              2       2
      numa_pages_migrated     616     703
      numa_pte_updates        13374   14742
      Suggested-by: NPeter Zijlstra <peterz@infradead.org>
      Signed-off-by: NSrikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Jirka Hladky <jhladky@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1537552141-27815-6-git-send-email-srikar@linux.vnet.ibm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      75346121
  6. 24 8月, 2018 2 次提交
    • N
      mm: soft-offline: close the race against page allocation · d4ae9916
      Naoya Horiguchi 提交于
      A process can be killed with SIGBUS(BUS_MCEERR_AR) when it tries to
      allocate a page that was just freed on the way of soft-offline.  This is
      undesirable because soft-offline (which is about corrected error) is
      less aggressive than hard-offline (which is about uncorrected error),
      and we can make soft-offline fail and keep using the page for good
      reason like "system is busy."
      
      Two main changes of this patch are:
      
      - setting migrate type of the target page to MIGRATE_ISOLATE. As done
        in free_unref_page_commit(), this makes kernel bypass pcplist when
        freeing the page. So we can assume that the page is in freelist just
        after put_page() returns,
      
      - setting PG_hwpoison on free page under zone->lock which protects
        freelists, so this allows us to avoid setting PG_hwpoison on a page
        that is decided to be allocated soon.
      
      [akpm@linux-foundation.org: tweak set_hwpoison_free_buddy_page() comment]
      Link: http://lkml.kernel.org/r/1531452366-11661-3-git-send-email-n-horiguchi@ah.jp.nec.comSigned-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reported-by: NXishi Qiu <xishi.qiuxishi@alibaba-inc.com>
      Tested-by: NMike Kravetz <mike.kravetz@oracle.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: <zy.zhengyi@alibaba-inc.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d4ae9916
    • N
      mm: fix race on soft-offlining free huge pages · 6bc9b564
      Naoya Horiguchi 提交于
      Patch series "mm: soft-offline: fix race against page allocation".
      
      Xishi recently reported the issue about race on reusing the target pages
      of soft offlining.  Discussion and analysis showed that we need make
      sure that setting PG_hwpoison should be done in the right place under
      zone->lock for soft offline.  1/2 handles free hugepage's case, and 2/2
      hanldes free buddy page's case.
      
      This patch (of 2):
      
      There's a race condition between soft offline and hugetlb_fault which
      causes unexpected process killing and/or hugetlb allocation failure.
      
      The process killing is caused by the following flow:
      
        CPU 0               CPU 1              CPU 2
      
        soft offline
          get_any_page
          // find the hugetlb is free
                            mmap a hugetlb file
                            page fault
                              ...
                                hugetlb_fault
                                  hugetlb_no_page
                                    alloc_huge_page
                                    // succeed
            soft_offline_free_page
            // set hwpoison flag
                                               mmap the hugetlb file
                                               page fault
                                                 ...
                                                   hugetlb_fault
                                                     hugetlb_no_page
                                                       find_lock_page
                                                         return VM_FAULT_HWPOISON
                                                 mm_fault_error
                                                   do_sigbus
                                                   // kill the process
      
      The hugetlb allocation failure comes from the following flow:
      
        CPU 0                          CPU 1
      
                                       mmap a hugetlb file
                                       // reserve all free page but don't fault-in
        soft offline
          get_any_page
          // find the hugetlb is free
            soft_offline_free_page
            // set hwpoison flag
              dissolve_free_huge_page
              // fail because all free hugepages are reserved
                                       page fault
                                         ...
                                           hugetlb_fault
                                             hugetlb_no_page
                                               alloc_huge_page
                                                 ...
                                                   dequeue_huge_page_node_exact
                                                   // ignore hwpoisoned hugepage
                                                   // and finally fail due to no-mem
      
      The root cause of this is that current soft-offline code is written based
      on an assumption that PageHWPoison flag should be set at first to avoid
      accessing the corrupted data.  This makes sense for memory_failure() or
      hard offline, but does not for soft offline because soft offline is about
      corrected (not uncorrected) error and is safe from data lost.  This patch
      changes soft offline semantics where it sets PageHWPoison flag only after
      containment of the error page completes successfully.
      
      Link: http://lkml.kernel.org/r/1531452366-11661-2-git-send-email-n-horiguchi@ah.jp.nec.comSigned-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reported-by: NXishi Qiu <xishi.qiuxishi@alibaba-inc.com>
      Suggested-by: NXishi Qiu <xishi.qiuxishi@alibaba-inc.com>
      Tested-by: NMike Kravetz <mike.kravetz@oracle.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: <zy.zhengyi@alibaba-inc.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6bc9b564
  7. 23 8月, 2018 1 次提交
  8. 18 8月, 2018 1 次提交
  9. 12 5月, 2018 1 次提交
  10. 21 4月, 2018 2 次提交
    • N
      mm: enable thp migration for shmem thp · e71769ae
      Naoya Horiguchi 提交于
      My testing for the latest kernel supporting thp migration showed an
      infinite loop in offlining the memory block that is filled with shmem
      thps.  We can get out of the loop with a signal, but kernel should return
      with failure in this case.
      
      What happens in the loop is that scan_movable_pages() repeats returning
      the same pfn without any progress.  That's because page migration always
      fails for shmem thps.
      
      In memory offline code, memory blocks containing unmovable pages should be
      prevented from being offline targets by has_unmovable_pages() inside
      start_isolate_page_range().  So it's possible to change migratability for
      non-anonymous thps to avoid the issue, but it introduces more complex and
      thp-specific handling in migration code, so it might not good.
      
      So this patch is suggesting to fix the issue by enabling thp migration for
      shmem thp.  Both of anon/shmem thp are migratable so we don't need
      precheck about the type of thps.
      
      Link: http://lkml.kernel.org/r/20180406030706.GA2434@hori1.linux.bs1.fc.nec.co.jp
      Fixes: commit 72b39cfc ("mm, memory_hotplug: do not fail offlining too early")
      Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Zi Yan <zi.yan@sent.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e71769ae
    • M
      mm: fix do_pages_move status handling · 8f175cf5
      Michal Hocko 提交于
      Li Wang has reported that LTP move_pages04 test fails with the current
      tree:
      
      LTP move_pages04:
         TFAIL  :  move_pages04.c:143: status[1] is EPERM, expected EFAULT
      
      The test allocates an array of two pages, one is present while the other
      is not (resp.  backed by zero page) and it expects EFAULT for the second
      page as the man page suggests.  We are reporting EPERM which doesn't make
      any sense and this is a result of a bug from cf5f16b23ec9 ("mm: unclutter
      THP migration").
      
      do_pages_move tries to handle as many pages in one batch as possible so we
      queue all pages with the same node target together and that corresponds to
      [start, i] range which is then used to update status array.
      add_page_for_migration will correctly notice the zero (resp.  !present)
      page and returns with EFAULT which gets written to the status.  But if
      this is the last page in the array we do not update start and so the last
      store_status after the loop will overwrite the range of the last batch
      with NUMA_NO_NODE (which corresponds to EPERM).
      
      Fix this by simply bailing out from the last flush if the pagelist is
      empty as there is clearly nothing more to do.
      
      Link: http://lkml.kernel.org/r/20180418121255.334-1-mhocko@kernel.org
      Fixes: cf5f16b23ec9 ("mm: unclutter THP migration")
      Signed-off-by: NMichal Hocko <mhocko@suse.com>
      Reported-by: NLi Wang <liwang@redhat.com>
      Tested-by: NLi Wang <liwang@redhat.com>
      Cc: Zi Yan <zi.yan@cs.rutgers.edu>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8f175cf5
  11. 12 4月, 2018 6 次提交
    • M
      page cache: use xa_lock · b93b0163
      Matthew Wilcox 提交于
      Remove the address_space ->tree_lock and use the xa_lock newly added to
      the radix_tree_root.  Rename the address_space ->page_tree to ->i_pages,
      since we don't really care that it's a tree.
      
      [willy@infradead.org: fix nds32, fs/dax.c]
        Link: http://lkml.kernel.org/r/20180406145415.GB20605@bombadil.infradead.orgLink: http://lkml.kernel.org/r/20180313132639.17387-9-willy@infradead.orgSigned-off-by: NMatthew Wilcox <mawilcox@microsoft.com>
      Acked-by: NJeff Layton <jlayton@redhat.com>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b93b0163
    • M
      mm: unclutter THP migration · 94723aaf
      Michal Hocko 提交于
      THP migration is hacked into the generic migration with rather
      surprising semantic.  The migration allocation callback is supposed to
      check whether the THP can be migrated at once and if that is not the
      case then it allocates a simple page to migrate.  unmap_and_move then
      fixes that up by spliting the THP into small pages while moving the head
      page to the newly allocated order-0 page.  Remaning pages are moved to
      the LRU list by split_huge_page.  The same happens if the THP allocation
      fails.  This is really ugly and error prone [1].
      
      I also believe that split_huge_page to the LRU lists is inherently wrong
      because all tail pages are not migrated.  Some callers will just work
      around that by retrying (e.g.  memory hotplug).  There are other pfn
      walkers which are simply broken though.  e.g. madvise_inject_error will
      migrate head and then advances next pfn by the huge page size.
      do_move_page_to_node_array, queue_pages_range (migrate_pages, mbind),
      will simply split the THP before migration if the THP migration is not
      supported then falls back to single page migration but it doesn't handle
      tail pages if the THP migration path is not able to allocate a fresh THP
      so we end up with ENOMEM and fail the whole migration which is a
      questionable behavior.  Page compaction doesn't try to migrate large
      pages so it should be immune.
      
      This patch tries to unclutter the situation by moving the special THP
      handling up to the migrate_pages layer where it actually belongs.  We
      simply split the THP page into the existing list if unmap_and_move fails
      with ENOMEM and retry.  So we will _always_ migrate all THP subpages and
      specific migrate_pages users do not have to deal with this case in a
      special way.
      
      [1] http://lkml.kernel.org/r/20171121021855.50525-1-zi.yan@sent.com
      
      Link: http://lkml.kernel.org/r/20180103082555.14592-4-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: NZi Yan <zi.yan@cs.rutgers.edu>
      Cc: Andrea Reale <ar@linux.vnet.ibm.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      94723aaf
    • M
      mm, migrate: remove reason argument from new_page_t · 666feb21
      Michal Hocko 提交于
      No allocation callback is using this argument anymore.  new_page_node
      used to use this parameter to convey node_id resp.  migration error up
      to move_pages code (do_move_page_to_node_array).  The error status never
      made it into the final status field and we have a better way to
      communicate node id to the status field now.  All other allocation
      callbacks simply ignored the argument so we can drop it finally.
      
      [mhocko@suse.com: fix migration callback]
        Link: http://lkml.kernel.org/r/20180105085259.GH2801@dhcp22.suse.cz
      [akpm@linux-foundation.org: fix alloc_misplaced_dst_page()]
      [mhocko@kernel.org: fix build]
        Link: http://lkml.kernel.org/r/20180103091134.GB11319@dhcp22.suse.cz
      Link: http://lkml.kernel.org/r/20180103082555.14592-3-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Reviewed-by: NZi Yan <zi.yan@cs.rutgers.edu>
      Cc: Andrea Reale <ar@linux.vnet.ibm.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      666feb21
    • M
      mm, numa: rework do_pages_move · a49bd4d7
      Michal Hocko 提交于
      Patch series "unclutter thp migration"
      
      Motivation:
      
      THP migration is hacked into the generic migration with rather
      surprising semantic.  The migration allocation callback is supposed to
      check whether the THP can be migrated at once and if that is not the
      case then it allocates a simple page to migrate.  unmap_and_move then
      fixes that up by splitting the THP into small pages while moving the
      head page to the newly allocated order-0 page.  Remaining pages are
      moved to the LRU list by split_huge_page.  The same happens if the THP
      allocation fails.  This is really ugly and error prone [2].
      
      I also believe that split_huge_page to the LRU lists is inherently wrong
      because all tail pages are not migrated.  Some callers will just work
      around that by retrying (e.g.  memory hotplug).  There are other pfn
      walkers which are simply broken though.  e.g. madvise_inject_error will
      migrate head and then advances next pfn by the huge page size.
      do_move_page_to_node_array, queue_pages_range (migrate_pages, mbind),
      will simply split the THP before migration if the THP migration is not
      supported then falls back to single page migration but it doesn't handle
      tail pages if the THP migration path is not able to allocate a fresh THP
      so we end up with ENOMEM and fail the whole migration which is a
      questionable behavior.  Page compaction doesn't try to migrate large
      pages so it should be immune.
      
      The first patch reworks do_pages_move which relies on a very ugly
      calling semantic when the return status is pushed to the migration path
      via private pointer.  It uses pre allocated fixed size batching to
      achieve that.  We simply cannot do the same if a THP is to be split
      during the migration path which is done in the patch 3.  Patch 2 is
      follow up cleanup which removes the mentioned return status calling
      convention ugliness.
      
      On a side note:
      
      There are some semantic issues I have encountered on the way when
      working on patch 1 but I am not addressing them here.  E.g. trying to
      move THP tail pages will result in either success or EBUSY (the later
      one more likely once we isolate head from the LRU list).  Hugetlb
      reports EACCESS on tail pages.  Some errors are reported via status
      parameter but migration failures are not even though the original
      `reason' argument suggests there was an intention to do so.  From a
      quick look into git history this never worked.  I have tried to keep the
      semantic unchanged.
      
      Then there is a relatively minor thing that the page isolation might
      fail because of pages not being on the LRU - e.g. because they are
      sitting on the per-cpu LRU caches.  Easily fixable.
      
      This patch (of 3):
      
      do_pages_move is supposed to move user defined memory (an array of
      addresses) to the user defined numa nodes (an array of nodes one for
      each address).  The user provided status array then contains resulting
      numa node for each address or an error.  The semantic of this function
      is little bit confusing because only some errors are reported back.
      Notably migrate_pages error is only reported via the return value.  This
      patch doesn't try to address these semantic nuances but rather change
      the underlying implementation.
      
      Currently we are processing user input (which can be really large) in
      batches which are stored to a temporarily allocated page.  Each address
      is resolved to its struct page and stored to page_to_node structure
      along with the requested target numa node.  The array of these
      structures is then conveyed down the page migration path via private
      argument.  new_page_node then finds the corresponding structure and
      allocates the proper target page.
      
      What is the problem with the current implementation and why to change
      it? Apart from being quite ugly it also doesn't cope with unexpected
      pages showing up on the migration list inside migrate_pages path.  That
      doesn't happen currently but the follow up patch would like to make the
      thp migration code more clear and that would need to split a THP into
      the list for some cases.
      
      How does the new implementation work? Well, instead of batching into a
      fixed size array we simply batch all pages that should be migrated to
      the same node and isolate all of them into a linked list which doesn't
      require any additional storage.  This should work reasonably well
      because page migration usually migrates larger ranges of memory to a
      specific node.  So the common case should work equally well as the
      current implementation.  Even if somebody constructs an input where the
      target numa nodes would be interleaved we shouldn't see a large
      performance impact because page migration alone doesn't really benefit
      from batching.  mmap_sem batching for the lookup is quite questionable
      and isolate_lru_page which would benefit from batching is not using it
      even in the current implementation.
      
      Link: http://lkml.kernel.org/r/20180103082555.14592-2-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NKirill A. Shutemov <kirill@shutemov.name>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Zi Yan <zi.yan@cs.rutgers.edu>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Reale <ar@linux.vnet.ibm.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a49bd4d7
    • R
      mm/migrate: properly preserve write attribute in special migrate entry · 07707125
      Ralph Campbell 提交于
      Use of pte_write(pte) is only valid for present pte, the common code
      which set the migration entry can be reach for both valid present pte
      and special swap entry (for device memory).  Fix the code to use the
      mpfn value which properly handle both cases.
      
      On x86 this did not have any bad side effect because pte write bit is
      below PAGE_BIT_GLOBAL and thus special swap entry have it set to 0 which
      in turn means we were always creating read only special migration entry.
      
      So once migration did finish we always write protected the CPU page
      table entry (moreover this is only an issue when migrating from device
      memory to system memory).  End effect is that CPU write access would
      fault again and restore write permission.
      
      This behaviour isn't too bad; it just burns CPU cycles by forcing CPU to
      take a second fault on write access. ie, double faulting the same
      address.  There is no corruption or incorrect states (it behaves as a
      COWed page from a fork with a mapcount of 1).
      
      Link: http://lkml.kernel.org/r/20180402023506.12180-1-jglisse@redhat.comSigned-off-by: NRalph Campbell <rcampbell@nvidia.com>
      Signed-off-by: NJérôme Glisse <jglisse@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      07707125
    • M
      sched/numa: avoid trapping faults and attempting migration of file-backed dirty pages · 09a913a7
      Mel Gorman 提交于
      change_pte_range is called from task work context to mark PTEs for
      receiving NUMA faulting hints.  If the marked pages are dirty then
      migration may fail.  Some filesystems cannot migrate dirty pages without
      blocking so are skipped in MIGRATE_ASYNC mode which just wastes CPU.
      Even when they can, it can be a waste of cycles when the pages are
      shared forcing higher scan rates.  This patch avoids marking shared
      dirty pages for hinting faults but also will skip a migration if the
      page was dirtied after the scanner updated a clean page.
      
      This is most noticeable running the NASA Parallel Benchmark when backed
      by btrfs, the default root filesystem for some distributions, but also
      noticeable when using XFS.
      
      The following are results from a 4-socket machine running a 4.16-rc4
      kernel with some scheduler patches that are pending for the next merge
      window.
      
                              4.16.0-rc4             4.16.0-rc4
                       schedtip-20180309          nodirty-v1
        Time cg.D      459.07 (   0.00%)      444.21 (   3.24%)
        Time ep.D       76.96 (   0.00%)       77.69 (  -0.95%)
        Time is.D       25.55 (   0.00%)       27.85 (  -9.00%)
        Time lu.D      601.58 (   0.00%)      596.87 (   0.78%)
        Time mg.D      107.73 (   0.00%)      108.22 (  -0.45%)
      
      is.D regresses slightly in terms of absolute time but note that that
      particular load varies quite a bit from run to run.  The more relevant
      observation is the total system CPU usage.
      
                  4.16.0-rc4  4.16.0-rc4
                schedtip-20180309 nodirty-v1
        User        71471.91    70627.04
        System      11078.96     8256.13
        Elapsed       661.66      632.74
      
      That is a substantial drop in system CPU usage and overall the workload
      completes faster.  The NUMA balancing statistics are also interesting
      
        NUMA base PTE updates        111407972   139848884
        NUMA huge PMD updates           206506      264869
        NUMA page range updates      217139044   275461812
        NUMA hint faults               4300924     3719784
        NUMA hint local faults         3012539     3416618
        NUMA hint local percent             70          91
        NUMA pages migrated            1517487     1358420
      
      While more PTEs are scanned due to changes in what faults are gathered,
      it's clear that a far higher percentage of faults are local as the bulk
      of the remote hits were dirty pages that, in this case with btrfs, had
      no chance of migrating.
      
      The following is a comparison when using XFS as that is a more realistic
      filesystem choice for a data partition
      
                              4.16.0-rc4             4.16.0-rc4
                       schedtip-20180309          nodirty-v1r47
        Time cg.D      485.28 (   0.00%)      442.62 (   8.79%)
        Time ep.D       77.68 (   0.00%)       77.54 (   0.18%)
        Time is.D       26.44 (   0.00%)       24.79 (   6.24%)
        Time lu.D      597.46 (   0.00%)      597.11 (   0.06%)
        Time mg.D      142.65 (   0.00%)      105.83 (  25.81%)
      
      That is a reasonable gain on two relatively long-lived workloads.  While
      not presented, there is also a substantial drop in system CPu usage and
      the NUMA balancing stats show similar improvements in locality as btrfs
      did.
      
      Link: http://lkml.kernel.org/r/20180326094334.zserdec62gwmmfqf@techsingularity.netSigned-off-by: NMel Gorman <mgorman@techsingularity.net>
      Reviewed-by: NRik van Riel <riel@surriel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      09a913a7
  12. 03 4月, 2018 1 次提交
  13. 01 2月, 2018 1 次提交
    • M
      mm, hugetlb: do not rely on overcommit limit during migration · ab5ac90a
      Michal Hocko 提交于
      hugepage migration relies on __alloc_buddy_huge_page to get a new page.
      This has 2 main disadvantages.
      
      1) it doesn't allow to migrate any huge page if the pool is used
         completely which is not an exceptional case as the pool is static and
         unused memory is just wasted.
      
      2) it leads to a weird semantic when migration between two numa nodes
         might increase the pool size of the destination NUMA node while the
         page is in use.  The issue is caused by per NUMA node surplus pages
         tracking (see free_huge_page).
      
      Address both issues by changing the way how we allocate and account
      pages allocated for migration.  Those should temporal by definition.  So
      we mark them that way (we will abuse page flags in the 3rd page) and
      update free_huge_page to free such pages to the page allocator.  Page
      migration path then just transfers the temporal status from the new page
      to the old one which will be freed on the last reference.  The global
      surplus count will never change during this path but we still have to be
      careful when migrating a per-node suprlus page.  This is now handled in
      move_hugetlb_state which is called from the migration path and it copies
      the hugetlb specific page state and fixes up the accounting when needed
      
      Rename __alloc_buddy_huge_page to __alloc_surplus_huge_page to better
      reflect its purpose.  The new allocation routine for the migration path
      is __alloc_migrate_huge_page.
      
      The user visible effect of this patch is that migrated pages are really
      temporal and they travel between NUMA nodes as per the migration
      request:
      
      Before migration
        /sys/devices/system/node/node0/hugepages/hugepages-2048kB/free_hugepages:0
        /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages:1
        /sys/devices/system/node/node0/hugepages/hugepages-2048kB/surplus_hugepages:0
        /sys/devices/system/node/node1/hugepages/hugepages-2048kB/free_hugepages:0
        /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages:0
        /sys/devices/system/node/node1/hugepages/hugepages-2048kB/surplus_hugepages:0
      
      After
        /sys/devices/system/node/node0/hugepages/hugepages-2048kB/free_hugepages:0
        /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages:0
        /sys/devices/system/node/node0/hugepages/hugepages-2048kB/surplus_hugepages:0
        /sys/devices/system/node/node1/hugepages/hugepages-2048kB/free_hugepages:0
        /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages:1
        /sys/devices/system/node/node1/hugepages/hugepages-2048kB/surplus_hugepages:0
      
      with the previous implementation, both nodes would have nr_hugepages:1
      until the page is freed.
      
      Link: http://lkml.kernel.org/r/20180103093213.26329-4-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Reviewed-by: NMike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andrea Reale <ar@linux.vnet.ibm.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Zi Yan <zi.yan@cs.rutgers.edu>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ab5ac90a
  14. 30 11月, 2017 1 次提交
  15. 28 11月, 2017 1 次提交
    • K
      mm, thp: Do not make pmd/pud dirty without a reason · 152e93af
      Kirill A. Shutemov 提交于
      Currently we make page table entries dirty all the time regardless of
      access type and don't even consider if the mapping is write-protected.
      The reasoning is that we don't really need dirty tracking on THP and
      making the entry dirty upfront may save some time on first write to the
      page.
      
      Unfortunately, such approach may result in false-positive
      can_follow_write_pmd() for huge zero page or read-only shmem file.
      
      Let's only make page dirty only if we about to write to the page anyway
      (as we do for small pages).
      
      I've restructured the code to make entry dirty inside
      maybe_p[mu]d_mkwrite(). It also takes into account if the vma is
      write-protected.
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      152e93af
  16. 16 11月, 2017 1 次提交
    • J
      mm/mmu_notifier: avoid call to invalidate_range() in range_end() · 4645b9fe
      Jérôme Glisse 提交于
      This is an optimization patch that only affect mmu_notifier users which
      rely on the invalidate_range() callback.  This patch avoids calling that
      callback twice in a row from inside __mmu_notifier_invalidate_range_end
      
      Existing pattern (before this patch):
          mmu_notifier_invalidate_range_start()
              pte/pmd/pud_clear_flush_notify()
                  mmu_notifier_invalidate_range()
          mmu_notifier_invalidate_range_end()
              mmu_notifier_invalidate_range()
      
      New pattern (after this patch):
          mmu_notifier_invalidate_range_start()
              pte/pmd/pud_clear_flush_notify()
                  mmu_notifier_invalidate_range()
          mmu_notifier_invalidate_range_only_end()
      
      We call the invalidate_range callback after clearing the page table
      under the page table lock and we skip the call to invalidate_range
      inside the __mmu_notifier_invalidate_range_end() function.
      
      Idea from Andrea Arcangeli
      
      Link: http://lkml.kernel.org/r/20171017031003.7481-3-jglisse@redhat.comSigned-off-by: NJérôme Glisse <jglisse@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Joerg Roedel <jroedel@suse.de>
      Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Alistair Popple <alistair@popple.id.au>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4645b9fe
  17. 02 11月, 2017 1 次提交
    • G
      License cleanup: add SPDX GPL-2.0 license identifier to files with no license · b2441318
      Greg Kroah-Hartman 提交于
      Many source files in the tree are missing licensing information, which
      makes it harder for compliance tools to determine the correct license.
      
      By default all files without license information are under the default
      license of the kernel, which is GPL version 2.
      
      Update the files which contain no license information with the 'GPL-2.0'
      SPDX license identifier.  The SPDX identifier is a legally binding
      shorthand, which can be used instead of the full boiler plate text.
      
      This patch is based on work done by Thomas Gleixner and Kate Stewart and
      Philippe Ombredanne.
      
      How this work was done:
      
      Patches were generated and checked against linux-4.14-rc6 for a subset of
      the use cases:
       - file had no licensing information it it.
       - file was a */uapi/* one with no licensing information in it,
       - file was a */uapi/* one with existing licensing information,
      
      Further patches will be generated in subsequent months to fix up cases
      where non-standard license headers were used, and references to license
      had to be inferred by heuristics based on keywords.
      
      The analysis to determine which SPDX License Identifier to be applied to
      a file was done in a spreadsheet of side by side results from of the
      output of two independent scanners (ScanCode & Windriver) producing SPDX
      tag:value files created by Philippe Ombredanne.  Philippe prepared the
      base worksheet, and did an initial spot review of a few 1000 files.
      
      The 4.13 kernel was the starting point of the analysis with 60,537 files
      assessed.  Kate Stewart did a file by file comparison of the scanner
      results in the spreadsheet to determine which SPDX license identifier(s)
      to be applied to the file. She confirmed any determination that was not
      immediately clear with lawyers working with the Linux Foundation.
      
      Criteria used to select files for SPDX license identifier tagging was:
       - Files considered eligible had to be source code files.
       - Make and config files were included as candidates if they contained >5
         lines of source
       - File already had some variant of a license header in it (even if <5
         lines).
      
      All documentation files were explicitly excluded.
      
      The following heuristics were used to determine which SPDX license
      identifiers to apply.
      
       - when both scanners couldn't find any license traces, file was
         considered to have no license information in it, and the top level
         COPYING file license applied.
      
         For non */uapi/* files that summary was:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|-------
         GPL-2.0                                              11139
      
         and resulted in the first patch in this series.
      
         If that file was a */uapi/* path one, it was "GPL-2.0 WITH
         Linux-syscall-note" otherwise it was "GPL-2.0".  Results of that was:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|-------
         GPL-2.0 WITH Linux-syscall-note                        930
      
         and resulted in the second patch in this series.
      
       - if a file had some form of licensing information in it, and was one
         of the */uapi/* ones, it was denoted with the Linux-syscall-note if
         any GPL family license was found in the file or had no licensing in
         it (per prior point).  Results summary:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|------
         GPL-2.0 WITH Linux-syscall-note                       270
         GPL-2.0+ WITH Linux-syscall-note                      169
         ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)    21
         ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)    17
         LGPL-2.1+ WITH Linux-syscall-note                      15
         GPL-1.0+ WITH Linux-syscall-note                       14
         ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)    5
         LGPL-2.0+ WITH Linux-syscall-note                       4
         LGPL-2.1 WITH Linux-syscall-note                        3
         ((GPL-2.0 WITH Linux-syscall-note) OR MIT)              3
         ((GPL-2.0 WITH Linux-syscall-note) AND MIT)             1
      
         and that resulted in the third patch in this series.
      
       - when the two scanners agreed on the detected license(s), that became
         the concluded license(s).
      
       - when there was disagreement between the two scanners (one detected a
         license but the other didn't, or they both detected different
         licenses) a manual inspection of the file occurred.
      
       - In most cases a manual inspection of the information in the file
         resulted in a clear resolution of the license that should apply (and
         which scanner probably needed to revisit its heuristics).
      
       - When it was not immediately clear, the license identifier was
         confirmed with lawyers working with the Linux Foundation.
      
       - If there was any question as to the appropriate license identifier,
         the file was flagged for further research and to be revisited later
         in time.
      
      In total, over 70 hours of logged manual review was done on the
      spreadsheet to determine the SPDX license identifiers to apply to the
      source files by Kate, Philippe, Thomas and, in some cases, confirmation
      by lawyers working with the Linux Foundation.
      
      Kate also obtained a third independent scan of the 4.13 code base from
      FOSSology, and compared selected files where the other two scanners
      disagreed against that SPDX file, to see if there was new insights.  The
      Windriver scanner is based on an older version of FOSSology in part, so
      they are related.
      
      Thomas did random spot checks in about 500 files from the spreadsheets
      for the uapi headers and agreed with SPDX license identifier in the
      files he inspected. For the non-uapi files Thomas did random spot checks
      in about 15000 files.
      
      In initial set of patches against 4.14-rc6, 3 files were found to have
      copy/paste license identifier errors, and have been fixed to reflect the
      correct identifier.
      
      Additionally Philippe spent 10 hours this week doing a detailed manual
      inspection and review of the 12,461 patched files from the initial patch
      version early this week with:
       - a full scancode scan run, collecting the matched texts, detected
         license ids and scores
       - reviewing anything where there was a license detected (about 500+
         files) to ensure that the applied SPDX license was correct
       - reviewing anything where there was no detection but the patch license
         was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
         SPDX license was correct
      
      This produced a worksheet with 20 files needing minor correction.  This
      worksheet was then exported into 3 different .csv files for the
      different types of files to be modified.
      
      These .csv files were then reviewed by Greg.  Thomas wrote a script to
      parse the csv files and add the proper SPDX tag to the file, in the
      format that the file expected.  This script was further refined by Greg
      based on the output to detect more types of files automatically and to
      distinguish between header and source .c files (which need different
      comment types.)  Finally Greg ran the script using the .csv files to
      generate the patches.
      Reviewed-by: NKate Stewart <kstewart@linuxfoundation.org>
      Reviewed-by: NPhilippe Ombredanne <pombredanne@nexb.com>
      Reviewed-by: NThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b2441318
  18. 14 10月, 2017 1 次提交
  19. 09 9月, 2017 4 次提交