1. 10 Jun, 2020 (1 commit)
  2. 03 Jun, 2020 (1 commit)
    • mm, dump_page(): do not crash with invalid mapping pointer · 002ae705
      Authored by Vlastimil Babka
      We have seen the following problem on an RPi4 with 1G RAM:
      
          BUG: Bad page state in process systemd-hwdb  pfn:35601
          page:ffff7e0000d58040 refcount:15 mapcount:131221 mapping:efd8fe765bc80080 index:0x1 compound_mapcount: -32767
          Unable to handle kernel paging request at virtual address efd8fe765bc80080
          Mem abort info:
            ESR = 0x96000004
            Exception class = DABT (current EL), IL = 32 bits
            SET = 0, FnV = 0
            EA = 0, S1PTW = 0
          Data abort info:
            ISV = 0, ISS = 0x00000004
            CM = 0, WnR = 0
          [efd8fe765bc80080] address between user and kernel address ranges
          Internal error: Oops: 96000004 [#1] SMP
          Modules linked in: btrfs libcrc32c xor xor_neon zlib_deflate raid6_pq mmc_block xhci_pci xhci_hcd usbcore sdhci_iproc sdhci_pltfm sdhci mmc_core clk_raspberrypi gpio_raspberrypi_exp pcie_brcmstb bcm2835_dma gpio_regulator phy_generic fixed sg scsi_mod efivarfs
          Supported: No, Unreleased kernel
          CPU: 3 PID: 408 Comm: systemd-hwdb Not tainted 5.3.18-8-default #1 SLE15-SP2 (unreleased)
          Hardware name: raspberrypi rpi/rpi, BIOS 2020.01 02/21/2020
          pstate: 40000085 (nZcv daIf -PAN -UAO)
          pc : __dump_page+0x268/0x368
          lr : __dump_page+0xc4/0x368
          sp : ffff000012563860
          x29: ffff000012563860 x28: ffff80003ddc4300
          x27: 0000000000000010 x26: 000000000000003f
          x25: ffff7e0000d58040 x24: 000000000000000f
          x23: efd8fe765bc80080 x22: 0000000000020095
          x21: efd8fe765bc80080 x20: ffff000010ede8b0
          x19: ffff7e0000d58040 x18: ffffffffffffffff
          x17: 0000000000000001 x16: 0000000000000007
          x15: ffff000011689708 x14: 3030386362353637
          x13: 6566386466653a67 x12: 6e697070616d2031
          x11: 32323133313a746e x10: 756f6370616d2035
          x9 : ffff00001168a840 x8 : ffff00001077a670
          x7 : 000000000000013d x6 : ffff0000118a43b5
          x5 : 0000000000000001 x4 : ffff80003dd9e2c8
          x3 : ffff80003dd9e2c8 x2 : 911c8d7c2f483500
          x1 : dead000000000100 x0 : efd8fe765bc80080
          Call trace:
           __dump_page+0x268/0x368
           bad_page+0xd4/0x168
           check_new_page_bad+0x80/0xb8
           rmqueue_bulk.constprop.26+0x4d8/0x788
           get_page_from_freelist+0x4d4/0x1228
           __alloc_pages_nodemask+0x134/0xe48
           alloc_pages_vma+0x198/0x1c0
           do_anonymous_page+0x1a4/0x4d8
           __handle_mm_fault+0x4e8/0x560
           handle_mm_fault+0x104/0x1e0
           do_page_fault+0x1e8/0x4c0
           do_translation_fault+0xb0/0xc0
           do_mem_abort+0x50/0xb0
           el0_da+0x24/0x28
          Code: f9401025 8b8018a0 9a851005 17ffffca (f94002a0)
      
      Besides the underlying issue of page->mapping containing a bogus value
      for some reason, we can see that __dump_page() crashed by trying to read
      the pointer at mapping->host, turning a recoverable warning into a full
      Oops.
      
      It can be expected that when a page is reported as being in a bad state
      for some reason, the pointers in it should not be trusted blindly.
      
      So this patch treats all data in __dump_page() that depends on
      page->mapping as lava, using probe_kernel_read_strict().  Ideally this
      would also cover dentry->d_parent recursively, but that would mean
      changing the printk handler for %pd.  The chances of reaching the dentry
      printing part with an initially bogus mapping pointer should be rather
      low, though.
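
      As a rough illustration of the resulting pattern (a simplified sketch, not
      the literal mm/debug.c diff; kernel context and the usual mm headers are
      assumed), everything reached through the possibly-bogus mapping pointer is
      first copied into a local variable and only the copy is used:

          struct address_space *mapping = page->mapping;
          struct inode *host;
          const struct address_space_operations *a_ops;
          unsigned long ino = 0;

          /* copy the fields out before trusting them */
          if (probe_kernel_read_strict(&host, &mapping->host, sizeof(host)) ||
              probe_kernel_read_strict(&a_ops, &mapping->a_ops, sizeof(a_ops))) {
                  pr_warn("failed to read mapping contents, not a valid kernel address?\n");
          } else {
                  if (host)
                          probe_kernel_read_strict(&ino, &host->i_ino, sizeof(ino));
                  /* for a bogus a_ops, %ps prints the raw value rather than a symbol name */
                  pr_warn("aops:%ps ino:%lx\n", a_ops, ino);
          }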
      
      Also prefix the printing of mapping->a_ops with a description of what is
      being printed.  In case the value is bogus, %ps will print the raw value
      instead of the symbol name, and then it is not obvious at all that it is
      printing a_ops.
      Reported-by: Petr Tesarik <ptesarik@suse.cz>
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Link: http://lkml.kernel.org/r/20200331165454.12263-1-vbabka@suse.cz
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 03 Apr, 2020 (2 commits)
  4. 01 Feb, 2020 (2 commits)
    • mm/hotplug: silence a lockdep splat with printk() · 4a55c047
      Authored by Qian Cai
      It is not that hard to trigger lockdep splats by calling printk from
      under zone->lock.  Most of them are false positives caused by lock
      chains introduced early in the boot process and they do not cause any
      real problems (although most of the early boot lock dependencies could
      happen after boot as well).  There are some console drivers which do
      allocate from the printk context as well, and those should be fixed.  In
      any case, false positives are not that trivial to work around, and it is
      far from optimal to lose lockdep functionality for something that is a
      non-issue.
      
      So change has_unmovable_pages() so that it no longer calls dump_page()
      itself - instead it returns a "struct page *" of the unmovable page back
      to the caller so that, in the case of a has_unmovable_pages() failure,
      the caller can call dump_page() after releasing zone->lock.  Also, make
      dump_page() able to report a CMA page as well, so the reason string
      from has_unmovable_pages() can be removed.
      
      Even though has_unmovable_pages() doesn't hold any reference to the
      returned page, this should be reasonably safe for the purpose of
      reporting the page (dump_page), because it cannot be hot-removed in the
      context of memory unplug.  The state of the page might change, but that
      is the case even with the existing code, as zone->lock only plays a role
      for free pages.
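
      A sketch of the resulting call pattern (argument lists are abbreviated and
      illustrative, not the exact mm/page_isolation.c code); the point is that
      the zone lock is dropped before anything is printed:

          struct page *unmovable;
          unsigned long flags;

          spin_lock_irqsave(&zone->lock, flags);
          unmovable = has_unmovable_pages(zone, page, migratetype, isol_flags);
          spin_unlock_irqrestore(&zone->lock, flags);

          if (unmovable) {
                  /* safe now: printk can no longer nest under zone->lock */
                  dump_page(unmovable, "unmovable page");
                  return -EBUSY;
          }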
      
      While at it, remove a similar but unnecessary debug-only printk() as
      well.  A sample of one of those lockdep splats is,
      
        WARNING: possible circular locking dependency detected
        ------------------------------------------------------
        test.sh/8653 is trying to acquire lock:
        ffffffff865a4460 (console_owner){-.-.}, at:
        console_unlock+0x207/0x750
      
        but task is already holding lock:
        ffff88883fff3c58 (&(&zone->lock)->rlock){-.-.}, at:
        __offline_isolated_pages+0x179/0x3e0
      
        which lock already depends on the new lock.
      
        the existing dependency chain (in reverse order) is:
      
        -> #3 (&(&zone->lock)->rlock){-.-.}:
               __lock_acquire+0x5b3/0xb40
               lock_acquire+0x126/0x280
               _raw_spin_lock+0x2f/0x40
               rmqueue_bulk.constprop.21+0xb6/0x1160
               get_page_from_freelist+0x898/0x22c0
               __alloc_pages_nodemask+0x2f3/0x1cd0
               alloc_pages_current+0x9c/0x110
               allocate_slab+0x4c6/0x19c0
               new_slab+0x46/0x70
               ___slab_alloc+0x58b/0x960
               __slab_alloc+0x43/0x70
               __kmalloc+0x3ad/0x4b0
               __tty_buffer_request_room+0x100/0x250
               tty_insert_flip_string_fixed_flag+0x67/0x110
               pty_write+0xa2/0xf0
               n_tty_write+0x36b/0x7b0
               tty_write+0x284/0x4c0
               __vfs_write+0x50/0xa0
               vfs_write+0x105/0x290
               redirected_tty_write+0x6a/0xc0
               do_iter_write+0x248/0x2a0
               vfs_writev+0x106/0x1e0
               do_writev+0xd4/0x180
               __x64_sys_writev+0x45/0x50
               do_syscall_64+0xcc/0x76c
               entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
        -> #2 (&(&port->lock)->rlock){-.-.}:
               __lock_acquire+0x5b3/0xb40
               lock_acquire+0x126/0x280
               _raw_spin_lock_irqsave+0x3a/0x50
               tty_port_tty_get+0x20/0x60
               tty_port_default_wakeup+0xf/0x30
               tty_port_tty_wakeup+0x39/0x40
               uart_write_wakeup+0x2a/0x40
               serial8250_tx_chars+0x22e/0x440
               serial8250_handle_irq.part.8+0x14a/0x170
               serial8250_default_handle_irq+0x5c/0x90
               serial8250_interrupt+0xa6/0x130
               __handle_irq_event_percpu+0x78/0x4f0
               handle_irq_event_percpu+0x70/0x100
               handle_irq_event+0x5a/0x8b
               handle_edge_irq+0x117/0x370
               do_IRQ+0x9e/0x1e0
               ret_from_intr+0x0/0x2a
               cpuidle_enter_state+0x156/0x8e0
               cpuidle_enter+0x41/0x70
               call_cpuidle+0x5e/0x90
               do_idle+0x333/0x370
               cpu_startup_entry+0x1d/0x1f
               start_secondary+0x290/0x330
               secondary_startup_64+0xb6/0xc0
      
        -> #1 (&port_lock_key){-.-.}:
               __lock_acquire+0x5b3/0xb40
               lock_acquire+0x126/0x280
               _raw_spin_lock_irqsave+0x3a/0x50
               serial8250_console_write+0x3e4/0x450
               univ8250_console_write+0x4b/0x60
               console_unlock+0x501/0x750
               vprintk_emit+0x10d/0x340
               vprintk_default+0x1f/0x30
               vprintk_func+0x44/0xd4
               printk+0x9f/0xc5
      
        -> #0 (console_owner){-.-.}:
               check_prev_add+0x107/0xea0
               validate_chain+0x8fc/0x1200
               __lock_acquire+0x5b3/0xb40
               lock_acquire+0x126/0x280
               console_unlock+0x269/0x750
               vprintk_emit+0x10d/0x340
               vprintk_default+0x1f/0x30
               vprintk_func+0x44/0xd4
               printk+0x9f/0xc5
               __offline_isolated_pages.cold.52+0x2f/0x30a
               offline_isolated_pages_cb+0x17/0x30
               walk_system_ram_range+0xda/0x160
               __offline_pages+0x79c/0xa10
               offline_pages+0x11/0x20
               memory_subsys_offline+0x7e/0xc0
               device_offline+0xd5/0x110
               state_store+0xc6/0xe0
               dev_attr_store+0x3f/0x60
               sysfs_kf_write+0x89/0xb0
               kernfs_fop_write+0x188/0x240
               __vfs_write+0x50/0xa0
               vfs_write+0x105/0x290
               ksys_write+0xc6/0x160
               __x64_sys_write+0x43/0x50
               do_syscall_64+0xcc/0x76c
               entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
        other info that might help us debug this:
      
        Chain exists of:
          console_owner --> &(&port->lock)->rlock --> &(&zone->lock)->rlock
      
         Possible unsafe locking scenario:
      
               CPU0                    CPU1
               ----                    ----
          lock(&(&zone->lock)->rlock);
                                       lock(&(&port->lock)->rlock);
                                       lock(&(&zone->lock)->rlock);
          lock(console_owner);
      
         *** DEADLOCK ***
      
        9 locks held by test.sh/8653:
         #0: ffff88839ba7d408 (sb_writers#4){.+.+}, at:
        vfs_write+0x25f/0x290
         #1: ffff888277618880 (&of->mutex){+.+.}, at:
        kernfs_fop_write+0x128/0x240
         #2: ffff8898131fc218 (kn->count#115){.+.+}, at:
        kernfs_fop_write+0x138/0x240
         #3: ffffffff86962a80 (device_hotplug_lock){+.+.}, at:
        lock_device_hotplug_sysfs+0x16/0x50
         #4: ffff8884374f4990 (&dev->mutex){....}, at:
        device_offline+0x70/0x110
         #5: ffffffff86515250 (cpu_hotplug_lock.rw_sem){++++}, at:
        __offline_pages+0xbf/0xa10
         #6: ffffffff867405f0 (mem_hotplug_lock.rw_sem){++++}, at:
        percpu_down_write+0x87/0x2f0
         #7: ffff88883fff3c58 (&(&zone->lock)->rlock){-.-.}, at:
        __offline_isolated_pages+0x179/0x3e0
         #8: ffffffff865a4920 (console_lock){+.+.}, at:
        vprintk_emit+0x100/0x340
      
        stack backtrace:
        Hardware name: HPE ProLiant DL560 Gen10/ProLiant DL560 Gen10,
        BIOS U34 05/21/2019
        Call Trace:
         dump_stack+0x86/0xca
         print_circular_bug.cold.31+0x243/0x26e
         check_noncircular+0x29e/0x2e0
         check_prev_add+0x107/0xea0
         validate_chain+0x8fc/0x1200
         __lock_acquire+0x5b3/0xb40
         lock_acquire+0x126/0x280
         console_unlock+0x269/0x750
         vprintk_emit+0x10d/0x340
         vprintk_default+0x1f/0x30
         vprintk_func+0x44/0xd4
         printk+0x9f/0xc5
         __offline_isolated_pages.cold.52+0x2f/0x30a
         offline_isolated_pages_cb+0x17/0x30
         walk_system_ram_range+0xda/0x160
         __offline_pages+0x79c/0xa10
         offline_pages+0x11/0x20
         memory_subsys_offline+0x7e/0xc0
         device_offline+0xd5/0x110
         state_store+0xc6/0xe0
         dev_attr_store+0x3f/0x60
         sysfs_kf_write+0x89/0xb0
         kernfs_fop_write+0x188/0x240
         __vfs_write+0x50/0xa0
         vfs_write+0x105/0x290
         ksys_write+0xc6/0x160
         __x64_sys_write+0x43/0x50
         do_syscall_64+0xcc/0x76c
         entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Link: http://lkml.kernel.org/r/20200117181200.20299-1-cai@lca.pw
      Signed-off-by: Qian Cai <cai@lca.pw>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/debug.c: always print flags in dump_page() · 5b57b8f2
      Authored by Vlastimil Babka
      Commit 76a1850e ("mm/debug.c: __dump_page() prints an extra line")
      inadvertently removed printing of page flags for pages that are neither
      anon nor ksm nor have a mapping.  Fix that.
      
      Using pr_cont() again would be a solution, but the commit explicitly
      removed its use.  Avoiding the danger of mixing up split lines from
      multiple CPUs might be beneficial for near-panic dumps like this, so fix
      this without reintroducing pr_cont().
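
      The fix thus boils down to emitting the flags as part of one complete
      printk() line via the %pGp page-flags format, roughly like this
      (illustrative; "type" is the "anon "/"ksm "/"" prefix computed earlier in
      __dump_page()):

          pr_warn("%sflags: %#lx(%pGp)\n", type, page->flags, &page->flags);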
      
      Link: http://lkml.kernel.org/r/9f884d5c-ca60-dc7b-219c-c081c755fab6@suse.cz
      Fixes: 76a1850e ("mm/debug.c: __dump_page() prints an extra line")
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Reported-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Reported-by: Michal Hocko <mhocko@kernel.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  5. 14 Jan, 2020 (1 commit)
  6. 16 Nov, 2019 (2 commits)
  7. 15 May, 2019 (1 commit)
  8. 30 Mar, 2019 (2 commits)
  9. 22 Feb, 2019 (1 commit)
    • mm/debug.c: fix __dump_page() for poisoned pages · 311ade0e
      Authored by Robin Murphy
      Evaluating page_mapping() on a poisoned page ends up dereferencing junk
      and making PF_POISONED_CHECK() considerably crashier than intended:
      
          Unable to handle kernel NULL pointer dereference at virtual address 0000000000000006
          Mem abort info:
            ESR = 0x96000005
            Exception class = DABT (current EL), IL = 32 bits
            SET = 0, FnV = 0
            EA = 0, S1PTW = 0
          Data abort info:
            ISV = 0, ISS = 0x00000005
            CM = 0, WnR = 0
          user pgtable: 4k pages, 39-bit VAs, pgdp = 00000000c2f6ac38
          [0000000000000006] pgd=0000000000000000, pud=0000000000000000
          Internal error: Oops: 96000005 [#1] PREEMPT SMP
          Modules linked in:
          CPU: 2 PID: 491 Comm: bash Not tainted 5.0.0-rc1+ #1
          Hardware name: ARM LTD ARM Juno Development Platform/ARM Juno Development Platform, BIOS EDK II Dec 17 2018
          pstate: 00000005 (nzcv daif -PAN -UAO)
          pc : page_mapping+0x18/0x118
          lr : __dump_page+0x1c/0x398
          Process bash (pid: 491, stack limit = 0x000000004ebd4ecd)
          Call trace:
           page_mapping+0x18/0x118
           __dump_page+0x1c/0x398
           dump_page+0xc/0x18
           remove_store+0xbc/0x120
           dev_attr_store+0x18/0x28
           sysfs_kf_write+0x40/0x50
           kernfs_fop_write+0x130/0x1d8
           __vfs_write+0x30/0x180
           vfs_write+0xb4/0x1a0
           ksys_write+0x60/0xd0
           __arm64_sys_write+0x18/0x20
           el0_svc_common+0x94/0xf8
           el0_svc_handler+0x68/0x70
           el0_svc+0x8/0xc
          Code: f9400401 d1000422 f240003f 9a801040 (f9400402)
          ---[ end trace cdb5eb5bf435cecb ]---
      
      Fix that by not inspecting the mapping until we've determined that it's
      likely to be valid.  Now the above condition still ends up stopping the
      kernel, but in the correct manner:
      
          page:ffffffbf20000000 is uninitialized and poisoned
          raw: ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff
          raw: ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff
          page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
          ------------[ cut here ]------------
          kernel BUG at ./include/linux/mm.h:1006!
          Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
          Modules linked in:
          CPU: 1 PID: 483 Comm: bash Not tainted 5.0.0-rc1+ #3
          Hardware name: ARM LTD ARM Juno Development Platform/ARM Juno Development Platform, BIOS EDK II Dec 17 2018
          pstate: 40000005 (nZcv daif -PAN -UAO)
          pc : remove_store+0xbc/0x120
          lr : remove_store+0xbc/0x120
          ...
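
      In code terms, the fix amounts to checking for the poison pattern before
      page->mapping is ever inspected; a rough sketch of the ordering (the label
      and message text are illustrative, not the exact diff):

          if (PagePoisoned(page)) {
                  pr_warn("page:%px is uninitialized and poisoned\n", page);
                  goto hex_only;          /* dump the raw words, skip decoding mapping */
          }

          mapping = page_mapping(page);   /* only reached for non-poisoned pages */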
      
      Link: http://lkml.kernel.org/r/03b53ee9d7e76cda4b9b5e1e31eea080db033396.1550071778.git.robin.murphy@arm.com
      Fixes: 1c6fb1d8 ("mm: print more information about mapping in __dump_page")
      Signed-off-by: Robin Murphy <robin.murphy@arm.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  10. 08 Feb, 2019 (1 commit)
  11. 29 Dec, 2018 (3 commits)
  12. 27 Oct, 2018 (1 commit)
    • mm: provide kernel parameter to allow disabling page init poisoning · f682a97a
      Authored by Alexander Duyck
      Patch series "Address issues slowing persistent memory initialization", v5.
      
      The main thing this patch set achieves is that it allows us to initialize
      each node's worth of persistent memory independently.  As a result we
      reduce page init time by about 2 minutes: in the case of a 12TB persistent
      memory setup spread evenly over 4 nodes, instead of taking 30 to 40
      seconds per node and going through the nodes one at a time, we process
      all 4 nodes in parallel.
      
      This patch (of 3):
      
      On systems with a large amount of memory it can take a significant amount
      of time to initialize all of the page structs with the PAGE_POISON_PATTERN
      value.  I have seen it take over 2 minutes to initialize a system with
      over 12TB of RAM.
      
      In order to work around the issue I had to disable CONFIG_DEBUG_VM and
      then the boot time returned to something much more reasonable as the
      arch_add_memory call completed in milliseconds versus seconds.  However in
      doing that I had to disable all of the other VM debugging on the system.
      
      In order to work around a kernel that might have CONFIG_DEBUG_VM enabled
      on a system that has a large amount of memory I have added a new kernel
      parameter named "vm_debug" that can be set to "-" in order to disable it.
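
      A minimal sketch of how such a boot option is typically wired up (the
      handler and variable names here are illustrative, not the exact mm/debug.c
      implementation):

          static bool page_init_poisoning __read_mostly = true;

          static int __init setup_vm_debug(char *str)
          {
                  /* "vm_debug=-" turns off the debug-only page init poisoning */
                  if (str && *str == '-')
                          page_init_poisoning = false;
                  return 0;
          }
          early_param("vm_debug", setup_vm_debug);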
      
      Link: http://lkml.kernel.org/r/20180925201921.3576.84239.stgit@localhost.localdomain
      Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com>
      Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  13. 14 Sep, 2018 (1 commit)
    • mm: get rid of vmacache_flush_all() entirely · 7a9cdebd
      Authored by Linus Torvalds
      Jann Horn points out that the vmacache_flush_all() function is not only
      potentially expensive, it's buggy too.  It also happens to be entirely
      unnecessary, because the sequence number overflow case can be avoided by
      simply making the sequence number be 64-bit.  That doesn't even grow the
      data structures in question, because the other adjacent fields are
      already 64-bit.
      
      So simplify the whole thing by just making the sequence number overflow
      case go away entirely, which gets rid of all the complications and makes
      the code faster too.  Win-win.
      
      [ Oleg Nesterov points out that the VMACACHE_FULL_FLUSHES statistics
        also just goes away entirely with this ]
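
      The resulting invalidation and validity check are trivial; roughly (a
      simplified sketch, omitting the check that the cache belongs to the
      current mm):

          static inline void vmacache_invalidate(struct mm_struct *mm)
          {
                  mm->vmacache_seqnum++;  /* 64-bit: overflow is no longer a practical concern */
          }

          static bool vmacache_valid(struct mm_struct *mm)
          {
                  struct task_struct *curr = current;

                  if (mm->vmacache_seqnum != curr->vmacache.seqnum) {
                          /* stale: resync the sequence number and forget the cached VMAs */
                          curr->vmacache.seqnum = mm->vmacache_seqnum;
                          memset(curr->vmacache.vmas, 0, sizeof(curr->vmacache.vmas));
                          return false;
                  }
                  return true;
          }
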
      Reported-by: Jann Horn <jannh@google.com>
      Suggested-by: Will Deacon <will.deacon@arm.com>
      Acked-by: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: stable@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  14. 04 Jul, 2018 (1 commit)
  15. 05 Jan, 2018 (1 commit)
  16. 16 Nov, 2017 (3 commits)
  17. 02 Nov, 2017 (1 commit)
    • License cleanup: add SPDX GPL-2.0 license identifier to files with no license · b2441318
      Authored by Greg Kroah-Hartman
      Many source files in the tree are missing licensing information, which
      makes it harder for compliance tools to determine the correct license.
      
      By default all files without license information are under the default
      license of the kernel, which is GPL version 2.
      
      Update the files which contain no license information with the 'GPL-2.0'
      SPDX license identifier.  The SPDX identifier is a legally binding
      shorthand, which can be used instead of the full boilerplate text.
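
      For a C source file with no license text at all, the change amounts to a
      single comment line added at the very top of the file, for example:

          // SPDX-License-Identifier: GPL-2.0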
      
      This patch is based on work done by Thomas Gleixner and Kate Stewart and
      Philippe Ombredanne.
      
      How this work was done:
      
      Patches were generated and checked against linux-4.14-rc6 for a subset of
      the use cases:
       - file had no licensing information in it,
       - file was a */uapi/* one with no licensing information in it,
       - file was a */uapi/* one with existing licensing information.
      
      Further patches will be generated in subsequent months to fix up cases
      where non-standard license headers were used, and references to license
      had to be inferred by heuristics based on keywords.
      
      The analysis to determine which SPDX License Identifier to apply to
      a file was done in a spreadsheet of side-by-side results from the
      output of two independent scanners (ScanCode & Windriver) producing SPDX
      tag:value files, created by Philippe Ombredanne.  Philippe prepared the
      base worksheet, and did an initial spot review of a few 1000 files.
      
      The 4.13 kernel was the starting point of the analysis with 60,537 files
      assessed.  Kate Stewart did a file by file comparison of the scanner
      results in the spreadsheet to determine which SPDX license identifier(s)
      should be applied to the file. She confirmed any determination that was not
      immediately clear with lawyers working with the Linux Foundation.
      
      The criteria used to select files for SPDX license identifier tagging were:
       - Files considered eligible had to be source code files.
       - Make and config files were included as candidates if they contained >5
         lines of source
       - File already had some variant of a license header in it (even if <5
         lines).
      
      All documentation files were explicitly excluded.
      
      The following heuristics were used to determine which SPDX license
      identifiers to apply.
      
       - when both scanners couldn't find any license traces, file was
         considered to have no license information in it, and the top level
         COPYING file license applied.
      
         For non */uapi/* files that summary was:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|-------
         GPL-2.0                                              11139
      
         and resulted in the first patch in this series.
      
         If that file was a */uapi/* path one, it was "GPL-2.0 WITH
         Linux-syscall-note" otherwise it was "GPL-2.0".  Results of that was:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|-------
         GPL-2.0 WITH Linux-syscall-note                        930
      
         and resulted in the second patch in this series.
      
       - if a file had some form of licensing information in it, and was one
         of the */uapi/* ones, it was denoted with the Linux-syscall-note if
         any GPL family license was found in the file or had no licensing in
         it (per prior point).  Results summary:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|------
         GPL-2.0 WITH Linux-syscall-note                       270
         GPL-2.0+ WITH Linux-syscall-note                      169
         ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)    21
         ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)    17
         LGPL-2.1+ WITH Linux-syscall-note                      15
         GPL-1.0+ WITH Linux-syscall-note                       14
         ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)    5
         LGPL-2.0+ WITH Linux-syscall-note                       4
         LGPL-2.1 WITH Linux-syscall-note                        3
         ((GPL-2.0 WITH Linux-syscall-note) OR MIT)              3
         ((GPL-2.0 WITH Linux-syscall-note) AND MIT)             1
      
         and that resulted in the third patch in this series.
      
       - when the two scanners agreed on the detected license(s), that became
         the concluded license(s).
      
       - when there was disagreement between the two scanners (one detected a
         license but the other didn't, or they both detected different
         licenses) a manual inspection of the file occurred.
      
       - In most cases a manual inspection of the information in the file
         resulted in a clear resolution of the license that should apply (and
         which scanner probably needed to revisit its heuristics).
      
       - When it was not immediately clear, the license identifier was
         confirmed with lawyers working with the Linux Foundation.
      
       - If there was any question as to the appropriate license identifier,
         the file was flagged for further research and to be revisited later
         in time.
      
      In total, over 70 hours of logged manual review was done on the
      spreadsheet to determine the SPDX license identifiers to apply to the
      source files by Kate, Philippe, Thomas and, in some cases, confirmation
      by lawyers working with the Linux Foundation.
      
      Kate also obtained a third independent scan of the 4.13 code base from
      FOSSology, and compared selected files where the other two scanners
      disagreed against that SPDX file, to see if there were any new insights.
      The Windriver scanner is based in part on an older version of FOSSology,
      so they are related.
      
      Thomas did random spot checks in about 500 files from the spreadsheets
      for the uapi headers and agreed with the SPDX license identifier in the
      files he inspected. For the non-uapi files Thomas did random spot checks
      in about 15000 files.
      
      In the initial set of patches against 4.14-rc6, 3 files were found to have
      copy/paste license identifier errors, and have been fixed to reflect the
      correct identifier.
      
      Additionally Philippe spent 10 hours this week doing a detailed manual
      inspection and review of the 12,461 patched files from the initial patch
      version early this week with:
       - a full scancode scan run, collecting the matched texts, detected
         license ids and scores
       - reviewing anything where there was a license detected (about 500+
         files) to ensure that the applied SPDX license was correct
       - reviewing anything where there was no detection but the patch license
         was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
         SPDX license was correct
      
      This produced a worksheet with 20 files needing minor correction.  This
      worksheet was then exported into 3 different .csv files for the
      different types of files to be modified.
      
      These .csv files were then reviewed by Greg.  Thomas wrote a script to
      parse the csv files and add the proper SPDX tag to the file, in the
      format that the file expected.  This script was further refined by Greg
      based on the output to detect more types of files automatically and to
      distinguish between header and source .c files (which need different
      comment types).  Finally Greg ran the script using the .csv files to
      generate the patches.
      Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
      Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  18. 11 Aug, 2017 (2 commits)
    • mm: make tlb_flush_pending global · 0a2dd266
      Authored by Minchan Kim
      Currently, tlb_flush_pending is used only for CONFIG_[NUMA_BALANCING|
      COMPACTION], but upcoming patches to solve a subtle TLB flush batching
      problem will use it regardless of compaction/NUMA, so this patch makes
      the field unconditional, removing the config dependency.
      
      [akpm@linux-foundation.org: remove more ifdefs from world's ugliest printk statement]
      Link: http://lkml.kernel.org/r/20170802000818.4760-6-namit@vmware.com
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Nadav Amit <namit@vmware.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: migrate: prevent racy access to tlb_flush_pending · 16af97dc
      Authored by Nadav Amit
      Patch series "fixes of TLB batching races", v6.
      
      It turns out that the Linux TLB batching mechanism suffers from various
      races.  Races that are caused by batching during reclamation were
      recently handled by Mel, and this patch set deals with others.  The more
      fundamental issue is that concurrent updates of the page-tables allow
      TLB flushes to be batched on one core while another core changes
      the page-tables.  This other core may assume a PTE change does not
      require a flush based on the updated PTE value, while it is unaware that
      TLB flushes are still pending.
      
      This behavior affects KSM (which may result in memory corruption) and
      MADV_FREE and MADV_DONTNEED (which may result in incorrect behavior).  A
      proof-of-concept can easily produce the wrong behavior of MADV_DONTNEED.
      Memory corruption in KSM is harder to produce in practice, but was
      observed by hacking the kernel and adding a delay before flushing and
      replacing the KSM page.
      
      Finally, there is also one memory barrier missing, which may affect
      architectures with a weak memory model.
      
      This patch (of 7):
      
      Setting and clearing mm->tlb_flush_pending can be performed by multiple
      threads, since mmap_sem may only be acquired for read in
      task_numa_work().  If this happens, tlb_flush_pending might be cleared
      while one of the threads still changes PTEs and batches TLB flushes.
      
      This can lead to the same race between migration and
      change_protection_range() that led to the introduction of
      tlb_flush_pending.  The result of this race was data corruption, which
      means that this patch also addresses a theoretically possible data
      corruption.
      
      An actual data corruption was not observed, yet the race was
      confirmed by adding an assertion to check that tlb_flush_pending is not
      set by two threads, adding artificial latency in change_protection_range()
      and using sysctl to reduce kernel.numa_balancing_scan_delay_ms.
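
      The direction of the fix, heavily simplified (memory barriers and the
      surrounding field layout omitted): turn the boolean flag into a counter so
      that concurrent setters nest instead of clobbering each other.

          static inline void inc_tlb_flush_pending(struct mm_struct *mm)
          {
                  atomic_inc(&mm->tlb_flush_pending);
          }

          static inline void dec_tlb_flush_pending(struct mm_struct *mm)
          {
                  atomic_dec(&mm->tlb_flush_pending);
          }

          static inline bool mm_tlb_flush_pending(struct mm_struct *mm)
          {
                  return atomic_read(&mm->tlb_flush_pending) > 0;
          }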
      
      Link: http://lkml.kernel.org/r/20170802000818.4760-2-namit@vmware.com
      Fixes: 20841405 ("mm: fix TLB flush race between migration, and change_protection_range")
      Signed-off-by: Nadav Amit <namit@vmware.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  19. 13 Dec, 2016 (1 commit)
  20. 08 Oct, 2016 (1 commit)
  21. 20 Sep, 2016 (1 commit)
  22. 18 Mar, 2016 (1 commit)
    • mm: introduce page reference manipulation functions · fe896d18
      Authored by Joonsoo Kim
      The success of CMA allocation largely depends on the success of
      migration, and a key factor in that is the page reference count.  Until
      now, the page reference count has been manipulated by directly calling
      atomic functions, so we cannot track who manipulates it and where.  That
      makes it hard to find the actual reason for a CMA allocation failure.
      CMA allocation should be guaranteed to succeed, so finding the offending
      place is really important.
      
      In this patch, call sites where the page reference count is manipulated
      are converted to the newly introduced wrapper functions.  This is a
      preparation step for adding a tracepoint to each page reference
      manipulation function.  With that facility, we can easily find the reason
      for a CMA allocation failure.  There is no functional change in this
      patch.
      
      In addition, this patch also converts reference read sites.  It will
      help a second step that renames page._count to something else and
      prevents later attempts to access it directly (Suggested by Andrew).
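
      The wrappers have roughly this shape (a simplified sketch; the real
      helpers live in include/linux/page_ref.h and gain tracepoints in a later
      patch of the series):

          static inline int page_ref_count(struct page *page)
          {
                  return atomic_read(&page->_count);
          }

          static inline void page_ref_inc(struct page *page)
          {
                  atomic_inc(&page->_count);
          }

          static inline int page_ref_dec_and_test(struct page *page)
          {
                  return atomic_dec_and_test(&page->_count);
          }
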
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Michal Nazarewicz <mina86@mina86.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  23. 16 Mar, 2016 (6 commits)
    • mm, debug: move bad flags printing to bad_page() · ff8e8116
      Authored by Vlastimil Babka
      Since bad_page() is the only user of the badflags parameter of
      dump_page_badflags(), we can move the code to bad_page() and simplify a
      bit.
      
      The dump_page_badflags() function is renamed to __dump_page() and can
      still be called separately from dump_page() for temporary debug prints
      where page_owner info is not desired.
      
      The only user-visible change is that page->mem_cgroup is printed before
      the bad flags.
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, page_owner: dump page owner info from dump_page() · 4e462112
      Authored by Vlastimil Babka
      The page_owner mechanism is useful for dealing with memory leaks.  By
      reading /sys/kernel/debug/page_owner one can determine the stack traces
      leading to allocations of all pages, and find e.g.  a buggy driver.
      
      This information might also be useful for debugging, for example in the
      VM_BUG_ON_PAGE() calls to dump_page().  So let's print the stored
      info from dump_page().
      
      Example output:
      
        page:ffffea000292f1c0 count:1 mapcount:0 mapping:ffff8800b2f6cc18 index:0x91d
        flags: 0x1fffff8001002c(referenced|uptodate|lru|mappedtodisk)
        page dumped because: VM_BUG_ON_PAGE(1)
        page->mem_cgroup:ffff8801392c5000
        page allocated via order 0, migratetype Movable, gfp_mask 0x24213ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD|__GFP_NOWARN|__GFP_NORETRY)
         [<ffffffff811682c4>] __alloc_pages_nodemask+0x134/0x230
         [<ffffffff811b40c8>] alloc_pages_current+0x88/0x120
         [<ffffffff8115e386>] __page_cache_alloc+0xe6/0x120
         [<ffffffff8116ba6c>] __do_page_cache_readahead+0xdc/0x240
         [<ffffffff8116bd05>] ondemand_readahead+0x135/0x260
         [<ffffffff8116be9c>] page_cache_async_readahead+0x6c/0x70
         [<ffffffff811604c2>] generic_file_read_iter+0x3f2/0x760
         [<ffffffff811e0dc7>] __vfs_read+0xa7/0xd0
        page has been migrated, last migrate reason: compaction
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, page_owner: track and print last migrate reason · 7cd12b4a
      Authored by Vlastimil Babka
      During migration, page_owner info is now copied with the rest of the
      page, so the stacktrace leading to the free page allocation during
      migration is overwritten.  For debugging purposes, it might however be
      useful to know that the page has been migrated since its initial
      allocation.  This might happen many times during the page's lifetime for
      different reasons, and fully tracking this, especially with stacktraces,
      would incur extra memory costs.  As a compromise, store and print the
      migrate_reason of the last migration that occurred to the page.  This is
      enough to distinguish compaction, NUMA balancing etc.
      
      Example page_owner entry after the patch:
      
        Page allocated via order 0, mask 0x24200ca(GFP_HIGHUSER_MOVABLE)
        PFN 628753 type Movable Block 1228 type Movable Flags 0x1fffff80040030(dirty|lru|swapbacked)
         [<ffffffff811682c4>] __alloc_pages_nodemask+0x134/0x230
         [<ffffffff811b6325>] alloc_pages_vma+0xb5/0x250
         [<ffffffff81177491>] shmem_alloc_page+0x61/0x90
         [<ffffffff8117a438>] shmem_getpage_gfp+0x678/0x960
         [<ffffffff8117c2b9>] shmem_fallocate+0x329/0x440
         [<ffffffff811de600>] vfs_fallocate+0x140/0x230
         [<ffffffff811df434>] SyS_fallocate+0x44/0x70
         [<ffffffff8158cc2e>] entry_SYSCALL_64_fastpath+0x12/0x71
        Page has been migrated, last migrate reason: compaction
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, debug: replace dump_flags() with the new printk formats · b8eceeb9
      Authored by Vlastimil Babka
      With the new printk format strings for flags, we can get rid of
      dump_flags() in mm/debug.c.
      
      This also fixes dump_vma(), which used dump_flags() for printing vma
      flags.  However, dump_flags() did page-flags-specific filtering of bits
      higher than NR_PAGEFLAGS in order to remove the zone id part.  For
      dump_vma() this resulted in removing several VM_* flags from the
      symbolic translation.
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, printk: introduce new format string for flags · edf14cdb
      Authored by Vlastimil Babka
      In mm we use several kinds of flags bitfields that are sometimes printed
      for debugging purposes, or exported to userspace via sysfs.  To make
      them easier to interpret independently of kernel version and config, we
      want to also dump the symbolic flag names.  So far this has been done
      with repeated calls to pr_cont(), which is unreliable on SMP, and not
      usable for e.g.  sysfs export.
      
      To get a more reliable and universal solution, this patch extends the
      printk() format string for pointers to handle page flags (%pGp),
      gfp_flags (%pGg) and vma flags (%pGv).  Existing users of
      dump_flag_names() are converted and simplified.
      
      It would be possible to pass flags by value instead of pointer, but the
      %p format string for pointers already has extensions for various kernel
      structures, so it's a good fit, and the extra indirection in a
      non-critical path is negligible.
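
      Usage then looks roughly like this (illustrative): the raw value is
      printed alongside the decoded names, and the format argument is a pointer
      to the flags.

          pr_info("page flags: %#lx(%pGp)\n", page->flags, &page->flags);
          pr_info("gfp_mask:   %#x(%pGg)\n", gfp_mask, &gfp_mask);
          pr_info("vma flags:  %#lx(%pGv)\n", vma->vm_flags, &vma->vm_flags);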
      
      [linux@rasmusvillemoes.dk: lots of good implementation suggestions]
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, tracing: unify mm flags handling in tracepoints and printk · 420adbe9
      Authored by Vlastimil Babka
      In tracepoints, it's possible to print gfp flags in a human-friendly
      format through a macro show_gfp_flags(), which defines a translation
      array and passes it to __print_flags().  Since the following patch will
      introduce support for gfp flags printing in printk(), it would be nice
      to reuse the array.  This is not straightforward, since __print_flags()
      can't simply reference an array defined in a .c file such as mm/debug.c
      - it has to be a macro to allow the macro magic to communicate the
      format to userspace tools such as trace-cmd.
      
      The solution is to create a macro __def_gfpflag_names which is used both
      in show_gfp_flags(), and to define the gfpflag_names[] array in
      mm/debug.c.
      
      On the other hand, mm/debug.c also defines translation tables for page
      flags and vma flags, and desire was expressed (but not implemented in
      this series) to use these also from tracepoints.  Thus, this patch also
      renames the events/gfpflags.h file to events/mmflags.h and moves the
      table definitions there, using the same macro approach as for gfpflags.
      This allows translating all three kinds of mm-specific flags both in
      tracepoints and printk.
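
      The shape of the trick, heavily condensed (the macro name and entries here
      are illustrative, not the full list): one x-macro list expands both into
      the tracepoint's __print_flags() translation and into a plain array that
      printk-side code can walk.

          #define __def_demo_gfpflag_names \
                  { (unsigned long)__GFP_HIGHMEM, "__GFP_HIGHMEM" }, \
                  { (unsigned long)__GFP_ZERO,    "__GFP_ZERO" }

          /* tracepoint side (include/trace/events/mmflags.h) */
          #define show_demo_gfp_flags(flags) \
                  __print_flags(flags, "|", __def_demo_gfpflag_names)

          /* printk side (mm/debug.c) */
          static const struct trace_print_flags demo_gfpflag_names[] = {
                  __def_demo_gfpflag_names
          };
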
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Michal Hocko <mhocko@suse.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  24. 16 Jan, 2016 (2 commits)
    • mm: rework mapcount accounting to enable 4k mapping of THPs · 53f9263b
      Authored by Kirill A. Shutemov
      We're going to allow mapping of individual 4k pages of a THP compound
      page.  This means we need to track the mapcount on a per-small-page basis.
      
      The straightforward approach would be to use ->_mapcount in all subpages
      to track how many times each subpage is mapped with PMDs or PTEs combined.
      But this is rather expensive: mapping or unmapping a THP page with a PMD
      would require HPAGE_PMD_NR atomic operations instead of the single one we
      have now.
      
      The idea is to store separately how many times the page was mapped as a
      whole -- compound_mapcount.  This frees up ->_mapcount in subpages to
      track the PTE mapcount.
      
      We use the same approach as with the compound page destructor and compound
      order to store compound_mapcount: use space in the first tail page,
      ->mapping this time.
      
      Any time we map/unmap a whole compound page (THP or hugetlb) we
      increment/decrement compound_mapcount.  When we map part of a compound
      page with a PTE, we operate on ->_mapcount of the subpage.
      
      page_mapcount() counts both: PTE and PMD mappings of the page.
      
      Basically, the mapcount for a subpage is spread over two counters.  This
      makes it tricky to detect when the last mapcount for a page goes away.
      
      We introduced PageDoubleMap() for this.  When we split a THP PMD for the
      first time and there is another PMD mapping left, we offset ->_mapcount
      in all subpages by one and set PG_double_map on the compound page.
      These additional references go away with the last compound_mapcount.
      
      This approach provides a way to detect when the last mapcount goes away
      on a per-small-page basis without introducing new overhead for the most
      common cases.
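
      A simplified sketch of how the two counters combine for a subpage
      (ignoring the PG_double_map offset described above and the hugetlb special
      cases; not the exact page_mapcount() implementation):

          static inline int subpage_mapcount(struct page *page)
          {
                  /* PTE mappings of this particular 4k subpage */
                  int ret = atomic_read(&page->_mapcount) + 1;

                  /* plus PMD mappings of the whole compound page */
                  if (PageCompound(page))
                          ret += compound_mapcount(compound_head(page));
                  return ret;
          }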
      
      [akpm@linux-foundation.org: fix typo in comment]
      [mhocko@suse.com: ignore partial THP when moving task]
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: Jerome Marchand <jmarchan@redhat.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Steve Capper <steve.capper@linaro.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, thp: remove compound_lock() · 3ac808fd
      Authored by Kirill A. Shutemov
      We are going to use migration entries to stabilize page counts.  This
      means we don't need compound_lock() for that.
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Tested-by: Sasha Levin <sasha.levin@oracle.com>
      Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Jerome Marchand <jmarchan@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Steve Capper <steve.capper@linaro.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  25. 15 Jan, 2016 (1 commit)
    • mm: rework virtual memory accounting · 84638335
      Authored by Konstantin Khlebnikov
      While inspecting some vague code inside the prctl(PR_SET_MM_MEM) call
      (which tests the RLIMIT_DATA value to figure out whether we're allowed to
      assign new @start_brk, @brk, @start_data and @end_data in mm_struct), it
      became clear that RLIMIT_DATA, in the form it's implemented now, doesn't
      do anything useful, because most user-space libraries use the mmap()
      syscall for dynamic memory allocations.
      
      Linus suggested converting the RLIMIT_DATA rlimit into something suitable
      for anonymous memory accounting.  But in this patch we go further, and
      the changes are bundled together as:
      
       * keep vma counting if CONFIG_PROC_FS=n, will be used for limits
       * replace mm->shared_vm with better defined mm->data_vm
       * account anonymous executable areas as executable
       * account file-backed growsdown/up areas as stack
       * drop struct file* argument from vm_stat_account
       * enforce RLIMIT_DATA for size of data areas
      
      This way code looks cleaner: now code/stack/data classification depends
      only on vm_flags state:
      
       VM_EXEC & ~VM_WRITE            -> code  (VmExe + VmLib in proc)
       VM_GROWSUP | VM_GROWSDOWN      -> stack (VmStk)
       VM_WRITE & ~VM_SHARED & !stack -> data  (VmData)
      
      The rest (VmSize - VmData - VmStk - VmExe - VmLib) could be called
      "shared", but that might be a strange beast like a readonly-private or
      VM_IO area.
      
       - RLIMIT_AS            limits whole address space "VmSize"
       - RLIMIT_STACK         limits stack "VmStk" (but each vma individually)
       - RLIMIT_DATA          now limits "VmData"
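
      A sketch of helpers implementing the code/stack/data classification above
      (close to, though not necessarily identical with, the final mm/internal.h
      code):

          static inline bool is_exec_mapping(vm_flags_t flags)
          {
                  return (flags & (VM_EXEC | VM_WRITE)) == VM_EXEC;
          }

          static inline bool is_stack_mapping(vm_flags_t flags)
          {
                  return flags & (VM_GROWSUP | VM_GROWSDOWN);
          }

          static inline bool is_data_mapping(vm_flags_t flags)
          {
                  return (flags & (VM_WRITE | VM_SHARED | VM_GROWSUP | VM_GROWSDOWN)) == VM_WRITE;
          }
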
      Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>
      Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
      Cc: Vegard Nossum <vegard.nossum@oracle.com>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Willy Tarreau <w@1wt.eu>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Kees Cook <keescook@google.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Cc: Pavel Emelyanov <xemul@virtuozzo.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>