1. 08 Jan, 2022 1 commit
  2. 30 Dec, 2021 1 commit
  3. 13 Oct, 2021 1 commit
  4. 31 Aug, 2021 1 commit
    • mm: add pin memory method for checkpoint add restore · 7dc4c73d
      Jingxian He authored
      hulk inclusion
      category: feature
      bugzilla: 48159
      CVE: N/A
      
      ------------------------------
      
      We can use the checkpoint and restore in userspace (criu) method to dump
      and restore tasks when updating the kernel.
      Currently, criu needs to dump all memory data of tasks to files.
      When the memory size is very large (larger than 1 GB),
      dumping the data takes a very long time (more than 1 minute).
      
      By pinning the memory data of tasks and collecting the corresponding
      physical page mapping info in the checkpoint process, we can remap the
      physical pages to the restored tasks after upgrading the kernel. This
      pin memory method can restore the task data within one second.
      
      The pin memory area info is saved in a reserved memblock,
      which remains usable across the kernel update process.
      
      The pin memory driver provides the following ioctl commands for criu
      (see the usage sketch below):
      1) SET_PIN_MEM_AREA:
      Set a pin memory area, which can be remapped to the restored task.
      2) CLEAR_PIN_MEM_AREA:
      Clear the pin memory area info,
      which enables the user to reset the pin data.
      3) REMAP_PIN_MEM_AREA:
      Remap the pages of the pinned memory to the restored task.
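
      As a usage sketch, criu's checkpoint side might drive SET_PIN_MEM_AREA
      as below. This is illustrative only: the device node name, the ioctl
      numbers and the pin_mem_area layout are assumptions, not taken from the
      driver's actual uapi header.

        #include <fcntl.h>
        #include <sys/ioctl.h>
        #include <unistd.h>

        /* Hypothetical definitions; the real driver's header supplies these. */
        struct pin_mem_area {
        	int pid;                  /* task whose pages get pinned */
        	unsigned long virt_start; /* start of the area to pin */
        	unsigned long virt_end;   /* end of the area to pin */
        };
        #define PIN_MEM_MAGIC      'p'
        #define SET_PIN_MEM_AREA   _IOW(PIN_MEM_MAGIC, 1, struct pin_mem_area)
        #define REMAP_PIN_MEM_AREA _IOW(PIN_MEM_MAGIC, 3, int)

        /* Checkpoint side: mark one task area as pinned. */
        static int pin_task_area(int pid, unsigned long start, unsigned long end)
        {
        	struct pin_mem_area pma = { pid, start, end };
        	int fd = open("/dev/pinmem", O_RDWR); /* assumed node name */
        	int ret;

        	if (fd < 0)
        		return -1;
        	ret = ioctl(fd, SET_PIN_MEM_AREA, &pma);
        	close(fd);
        	return ret;
        }

      On the restore side, REMAP_PIN_MEM_AREA would then be issued for the
      restored task instead of reading the memory contents back from files.
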
      Signed-off-by: Jingxian He <hejingxian@huawei.com>
      Reviewed-by: Chen Wandun <chenwandun@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
  5. 19 Jul, 2021 1 commit
  6. 14 Apr, 2021 2 commits
  7. 09 Apr, 2021 1 commit
  8. 28 Jan, 2021 2 commits
  9. 12 Dec, 2020 1 commit
    • proc: use untagged_addr() for pagemap_read addresses · 40d6366e
      Miles Chen authored
      When we try to access the pagemap of a tagged userspace pointer, we find
      that start_vaddr is not correct because of the tag.
      To fix it, we should untag the userspace pointers in pagemap_read().

      I tested with 5.10-rc4 and the issue remains.
      
      Explanation from Catalin in [1]:
      
       "Arguably, that's a user-space bug since tagged file offsets were never
        supported. In this case it's not even a tag at bit 56 as per the arm64
        tagged address ABI but rather down to bit 47. You could say that the
        problem is caused by the C library (malloc()) or whoever created the
        tagged vaddr and passed it to this function. It's not a kernel
        regression as we've never supported it.
      
        Now, pagemap is a special case where the offset is usually not
        generated as a classic file offset but rather derived by shifting a
        user virtual address. I guess we can make a concession for pagemap
        (only) and allow such offset with the tag at bit (56 - PAGE_SHIFT + 3)"
      
      My test code is based on [2]:
      
      A userspace pointer which has been tagged by 0xb4: 0xb400007662f541c8
      
      userspace program:
      
        uint64 OsLayer::VirtualToPhysical(void *vaddr) {
      	uint64 frame, paddr, pfnmask, pagemask;
      	int pagesize = sysconf(_SC_PAGESIZE);
      	off64_t off = ((uintptr_t)vaddr) / pagesize * 8; // off = 0xb400007662f541c8 / pagesize * 8 = 0x5a00003b317aa0
      	int fd = open(kPagemapPath, O_RDONLY);
      	...
      
      	if (lseek64(fd, off, SEEK_SET) != off || read(fd, &frame, 8) != 8) {
      		int err = errno;
      		string errtxt = ErrorString(err);
      		if (fd >= 0)
      			close(fd);
      		return 0;
      	}
        ...
        }
      
      kernel fs/proc/task_mmu.c:
      
        static ssize_t pagemap_read(struct file *file, char __user *buf,
      		size_t count, loff_t *ppos)
        {
      	...
      	src = *ppos;
      	svpfn = src / PM_ENTRY_BYTES; // svpfn == 0xb400007662f54
      	start_vaddr = svpfn << PAGE_SHIFT; // start_vaddr == 0xb400007662f54000
      	end_vaddr = mm->task_size;
      
      	/* watch out for wraparound */
      	// svpfn == 0xb400007662f54
      	// (mm->task_size >> PAGE_SHIFT) == 0x8000000
      	if (svpfn > mm->task_size >> PAGE_SHIFT) // the condition is true because of the tag 0xb4
      		start_vaddr = end_vaddr;
      
      	ret = 0;
      	while (count && (start_vaddr < end_vaddr)) { // we never reach the correct entry because start_vaddr was set to end_vaddr
      		int len;
      		unsigned long end;
      		...
      	}
      	...
        }
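
      The fix is to untag the address derived from the file offset before
      using it. A sketch of the change in pagemap_read() (abbreviated; the
      guard conditions follow the upstream patch):

        src = *ppos;
        svpfn = src / PM_ENTRY_BYTES;
        end_vaddr = mm->task_size;

        /* watch out for wraparound */
        start_vaddr = end_vaddr;
        if (svpfn <= (ULONG_MAX >> PAGE_SHIFT))
        	start_vaddr = untagged_addr(svpfn << PAGE_SHIFT);

        /* Ensure the address is inside the task */
        if (start_vaddr > mm->task_size)
        	start_vaddr = end_vaddr;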
      
      [1] https://lore.kernel.org/patchwork/patch/1343258/
      [2] https://github.com/stressapptest/stressapptest/blob/master/src/os.cc#L158
      
      Link: https://lkml.kernel.org/r/20201204024347.8295-1-miles.chen@mediatek.com
      Signed-off-by: Miles Chen <miles.chen@mediatek.com>
      Reviewed-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
      Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Andrey Konovalov <andreyknvl@google.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Marco Elver <elver@google.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Song Bao Hua (Barry Song) <song.bao.hua@hisilicon.com>
      Cc: <stable@vger.kernel.org>	[5.4-]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  10. 17 Oct, 2020 1 commit
  11. 14 Oct, 2020 3 commits
  12. 04 Sep, 2020 1 commit
    • arm64: mte: Add PROT_MTE support to mmap() and mprotect() · 9f341931
      Catalin Marinas authored
      To enable tagging on a memory range, the user must explicitly opt in via
      a new PROT_MTE flag passed to mmap() or mprotect(). Since this is a new
      memory type in the AttrIndx field of a pte, simplify the or'ing of these
      bits over the protection_map[] attributes by making MT_NORMAL index 0.
      
      There are two conditions for arch_vm_get_page_prot() to return the
      MT_NORMAL_TAGGED memory type: (1) the user requested it via PROT_MTE,
      registered as VM_MTE in the vm_flags, and (2) the vma supports MTE,
      decided during the mmap() call (only) and registered as VM_MTE_ALLOWED.
      
      arch_calc_vm_prot_bits() is responsible for registering the user request
      as VM_MTE. The newly introduced arch_calc_vm_flag_bits() sets
      VM_MTE_ALLOWED if the mapping is MAP_ANONYMOUS. An MTE-capable
      filesystem (RAM-based) may be able to set VM_MTE_ALLOWED during its
      mmap() file ops call.
      
      In addition, update VM_DATA_DEFAULT_FLAGS to allow mprotect(PROT_MTE) on
      stack or brk area.
      
      The Linux mmap() syscall currently ignores unknown PROT_* flags. In the
      presence of MTE, an mmap(PROT_MTE) on a file which does not support MTE
      will not report an error and the memory will not be mapped as Normal
      Tagged. For consistency, mprotect(PROT_MTE) will not report an error
      either if the memory range does not support MTE. Two subsequent patches
      in the series will propose tightening of this behaviour.
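
      For illustration, opting in from userspace looks like the sketch below.
      PROT_MTE comes from the arm64 uapi headers; the 0x20 fallback is an
      assumption for toolchains whose headers predate MTE.

        #include <sys/mman.h>

        #ifndef PROT_MTE
        #define PROT_MTE 0x20 /* arm64-specific; normally from <asm/mman.h> */
        #endif

        int main(void)
        {
        	/* Anonymous mapping: VM_MTE_ALLOWED is set, so PROT_MTE takes effect. */
        	char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE | PROT_MTE,
        		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        	if (p == MAP_FAILED)
        		return 1;

        	/* Tagging can also be requested later on an existing range. */
        	return mprotect(p, 4096, PROT_READ | PROT_WRITE | PROT_MTE) ? 1 : 0;
        }
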
      Co-developed-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
      Signed-off-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will@kernel.org>
  13. 13 Aug, 2020 1 commit
  14. 10 Jun, 2020 2 commits
  15. 03 Jun, 2020 1 commit
    • /proc/PID/smaps: Add PMD migration entry parsing · c94b6923
      Huang Ying authored
      Currently, when reading /proc/PID/smaps, a PMD migration entry in the
      page table is simply ignored.  To improve the accuracy of
      /proc/PID/smaps, add parsing and processing for it.
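
      A sketch of the resulting logic in smaps_pmd_entry(), simplified from
      the patch: when the PMD is not present, decode a possible migration
      entry to find the page instead of ignoring it.

        struct page *page = NULL;

        if (pmd_present(*pmd)) {
        	/* FOLL_DUMP will return -EFAULT on huge zero page */
        	page = follow_trans_huge_pmd(vma, addr, pmd, FOLL_DUMP);
        } else if (unlikely(thp_migration_supported() && is_swap_pmd(*pmd))) {
        	swp_entry_t entry = pmd_to_swp_entry(*pmd);

        	if (is_migration_entry(entry))
        		page = migration_entry_to_page(entry);
        }
        if (IS_ERR_OR_NULL(page))
        	return;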
      
      To test the patch, we ran pmbench to consume 400 MB of memory in the
      background, then ran /usr/bin/migratepages and `cat /proc/PID/smaps`
      every second.  The following issue can be reproduced within 60 seconds.
      
      Before the patch, for the fully populated 400 MB anonymous VMA, some THP
      pages under migration may be lost, as below.
      
        7f3f6a7e5000-7f3f837e5000 rw-p 00000000 00:00 0
        Size:             409600 kB
        KernelPageSize:        4 kB
        MMUPageSize:           4 kB
        Rss:              407552 kB
        Pss:              407552 kB
        Shared_Clean:          0 kB
        Shared_Dirty:          0 kB
        Private_Clean:         0 kB
        Private_Dirty:    407552 kB
        Referenced:       301056 kB
        Anonymous:        407552 kB
        LazyFree:              0 kB
        AnonHugePages:    405504 kB
        ShmemPmdMapped:        0 kB
        FilePmdMapped:        0 kB
        Shared_Hugetlb:        0 kB
        Private_Hugetlb:       0 kB
        Swap:                  0 kB
        SwapPss:               0 kB
        Locked:                0 kB
        THPeligible:		1
        VmFlags: rd wr mr mw me ac
      
      After the patch, it is always:
      
        7f3f6a7e5000-7f3f837e5000 rw-p 00000000 00:00 0
        Size:             409600 kB
        KernelPageSize:        4 kB
        MMUPageSize:           4 kB
        Rss:              409600 kB
        Pss:              409600 kB
        Shared_Clean:          0 kB
        Shared_Dirty:          0 kB
        Private_Clean:         0 kB
        Private_Dirty:    409600 kB
        Referenced:       294912 kB
        Anonymous:        409600 kB
        LazyFree:              0 kB
        AnonHugePages:    407552 kB
        ShmemPmdMapped:        0 kB
        FilePmdMapped:        0 kB
        Shared_Hugetlb:        0 kB
        Private_Hugetlb:       0 kB
        Swap:                  0 kB
        SwapPss:               0 kB
        Locked:                0 kB
        THPeligible:		1
        VmFlags: rd wr mr mw me ac
      Signed-off-by: N"Huang, Ying" <ying.huang@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NZi Yan <ziy@nvidia.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Cc: "Jérôme Glisse" <jglisse@redhat.com>
      Cc: Yang Shi <yang.shi@linux.alibaba.com>
      Link: http://lkml.kernel.org/r/20200403123059.1846960-1-ying.huang@intel.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  16. 23 Apr, 2020 1 commit
  17. 08 Apr, 2020 4 commits
  18. 17 Mar, 2020 1 commit
  19. 04 Feb, 2020 1 commit
    • mm: pagewalk: add 'depth' parameter to pte_hole · b7a16c7a
      Steven Price authored
      The pte_hole() callback is called at multiple levels of the page tables.
      Code dumping the kernel page tables needs to know at what depth the
      missing entry is.  Add this as an extra parameter to pte_hole().  When
      the depth isn't known (e.g.  processing a vma) then -1 is passed.
      
      The depth that is reported is the actual level where the entry is missing
      (ignoring any folding that is in place), i.e.  any levels where
      PTRS_PER_P?D is set to 1 are ignored.
      
      Note that depth starts at 0 for a PGD so that PUD/PMD/PTE retain their
      natural numbers as levels 2/3/4.
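
      A walker callback using the new parameter might look like this sketch
      (the pr_info reporting is illustrative, not taken from the series):

        static int ptdump_hole(unsigned long addr, unsigned long next,
        		       int depth, struct mm_walk *walk)
        {
        	/* depth: 0 = PGD, 1 = P4D, 2 = PUD, 3 = PMD, 4 = PTE; -1 = unknown */
        	static const char * const level[] =
        		{ "PGD", "P4D", "PUD", "PMD", "PTE" };

        	if (depth >= 0)
        		pr_info("hole at %s level: %#lx-%#lx\n",
        			level[depth], addr, next);
        	return 0;
        }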
      
      Link: http://lkml.kernel.org/r/20191218162402.45610-16-steven.price@arm.com
      Signed-off-by: Steven Price <steven.price@arm.com>
      Tested-by: Zong Li <zong.li@sifive.com>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Alexandre Ghiti <alex@ghiti.fr>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Hogan <jhogan@kernel.org>
      Cc: James Morse <james.morse@arm.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: "Liang, Kan" <kan.liang@linux.intel.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Burton <paul.burton@mips.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  20. 25 Sep, 2019 2 commits
  21. 07 Sep, 2019 2 commits
  22. 19 Jul, 2019 1 commit
    • mm: thp: fix false negative of shmem vma's THP eligibility · c0630669
      Yang Shi authored
      Commit 7635d9cb ("mm, thp, proc: report THP eligibility for each
      vma") introduced the THPeligible bit in processes' smaps.  But, when
      checking the eligibility of a shmem vma, __transparent_hugepage_enabled()
      is called to override the result from shmem_huge_enabled().  This may
      result in the anonymous vma's THP flag overriding shmem's.  For example,
      running a simple test which creates THP for shmem, but with anonymous
      THP disabled, reading the process's smaps may show:
      
        7fc92ec00000-7fc92f000000 rw-s 00000000 00:14 27764 /dev/shm/test
        Size:               4096 kB
        ...
        [snip]
        ...
        ShmemPmdMapped:     4096 kB
        ...
        [snip]
        ...
        THPeligible:    0
      
      And, /proc/meminfo does show THP allocated and PMD mapped too:
      
        ShmemHugePages:     4096 kB
        ShmemPmdMapped:     4096 kB
      
      This doesn't make much sense.  The shmem objects should be treated
      separately from anonymous THP.  Calling shmem_huge_enabled() while
      checking MMF_DISABLE_THP sounds good enough.  And we can skip the stack
      and dax vma checks since we have already checked that the vma is shmem.

      Also check whether the vma is suitable for THP by calling
      transhuge_vma_suitable().

      There is also a minor fix to the smaps output format and documentation.
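
      The resulting check, sketched from the patch (the addr computation
      merely picks an address inside the vma for the size check):

        bool transparent_hugepage_enabled(struct vm_area_struct *vma)
        {
        	/* The addr is used to check if the vma size fits */
        	unsigned long addr = (vma->vm_end & HPAGE_PMD_MASK) -
        			     HPAGE_PMD_SIZE;

        	if (!transhuge_vma_suitable(vma, addr))
        		return false;
        	if (vma_is_anonymous(vma))
        		return __transparent_hugepage_enabled(vma);
        	/* shmem decides for itself; MMF_DISABLE_THP is checked in there */
        	if (vma_is_shmem(vma))
        		return shmem_huge_enabled(vma);

        	return false;
        }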
      
      Link: http://lkml.kernel.org/r/1560401041-32207-3-git-send-email-yang.shi@linux.alibaba.com
      Fixes: 7635d9cb ("mm, thp, proc: report THP eligibility for each vma")
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  23. 13 Jul, 2019 5 commits
  24. 03 Jul, 2019 1 commit
  25. 15 May, 2019 2 commits
    • mm/mmu_notifier: use correct mmu_notifier events for each invalidation · 7269f999
      Jérôme Glisse authored
      This updates each existing invalidation to use the correct mmu notifier
      event that represents what is happening to the CPU page table.  See the
      patch which introduced the events for the rationale behind this.
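
      For example (a sketch using the events introduced earlier in the
      series), a protection change and a zap now pass events matching what
      happens to the page table instead of the blanket MMU_NOTIFY_UNMAP:

        /* change_protection() path: pages stay, their protection changes */
        mmu_notifier_range_init(&range, MMU_NOTIFY_PROTECTION_VMA, 0,
        			vma, vma->vm_mm, start, end);

        /* zap path: the mappings for the range go away */
        mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0,
        			vma, vma->vm_mm, start, end);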
      
      Link: http://lkml.kernel.org/r/20190326164747.24405-7-jglisse@redhat.com
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
      Reviewed-by: Ira Weiny <ira.weiny@intel.com>
      Cc: Christian König <christian.koenig@amd.com>
      Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Cc: Jani Nikula <jani.nikula@linux.intel.com>
      Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Felix Kuehling <Felix.Kuehling@amd.com>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Ross Zwisler <zwisler@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krcmar <rkrcmar@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Christian Koenig <christian.koenig@amd.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/mmu_notifier: contextual information for event triggering invalidation · 6f4f13e8
      Jérôme Glisse authored
      CPU page table updates can happen for many reasons, not only as a result
      of a syscall (munmap(), mprotect(), mremap(), madvise(), ...) but also
      as a result of kernel activities (memory compression, reclaim,
      migration, ...).

      Users of the mmu notifier API track changes to the CPU page table and
      take specific actions for them.  However, the current API only provides
      the range of virtual addresses affected by the change, not why the
      change is happening.

      This patchset does the initial mechanical conversion of all the places
      that call mmu_notifier_range_init to also provide the default
      MMU_NOTIFY_UNMAP event as well as the vma if it is known (most
      invalidations happen against a given vma).  Passing down the vma allows
      the users of mmu notifier to inspect the new vma page protection.

      MMU_NOTIFY_UNMAP is always the safe default, as users of mmu notifier
      should assume that every mapping for the range is going away when that
      event happens.  A later patch converts the mm call paths to use more
      appropriate events for each call.

      This is done as 2 patches so that no call site is forgotten, especially
      as it uses the following coccinelle patch:
      
      %<----------------------------------------------------------------------
      @@
      identifier I1, I2, I3, I4;
      @@
      static inline void mmu_notifier_range_init(struct mmu_notifier_range *I1,
      +enum mmu_notifier_event event,
      +unsigned flags,
      +struct vm_area_struct *vma,
      struct mm_struct *I2, unsigned long I3, unsigned long I4) { ... }
      
      @@
      @@
      -#define mmu_notifier_range_init(range, mm, start, end)
      +#define mmu_notifier_range_init(range, event, flags, vma, mm, start, end)
      
      @@
      expression E1, E3, E4;
      identifier I1;
      @@
      <...
      mmu_notifier_range_init(E1,
      +MMU_NOTIFY_UNMAP, 0, I1,
      I1->vm_mm, E3, E4)
      ...>
      
      @@
      expression E1, E2, E3, E4;
      identifier FN, VMA;
      @@
      FN(..., struct vm_area_struct *VMA, ...) {
      <...
      mmu_notifier_range_init(E1,
      +MMU_NOTIFY_UNMAP, 0, VMA,
      E2, E3, E4)
      ...> }
      
      @@
      expression E1, E2, E3, E4;
      identifier FN, VMA;
      @@
      FN(...) {
      struct vm_area_struct *VMA;
      <...
      mmu_notifier_range_init(E1,
      +MMU_NOTIFY_UNMAP, 0, VMA,
      E2, E3, E4)
      ...> }
      
      @@
      expression E1, E2, E3, E4;
      identifier FN;
      @@
      FN(...) {
      <...
      mmu_notifier_range_init(E1,
      +MMU_NOTIFY_UNMAP, 0, NULL,
      E2, E3, E4)
      ...> }
      ---------------------------------------------------------------------->%
      
      Applied with:
      spatch --all-includes --sp-file mmu-notifier.spatch fs/proc/task_mmu.c --in-place
      spatch --sp-file mmu-notifier.spatch --dir kernel/events/ --in-place
      spatch --sp-file mmu-notifier.spatch --dir mm --in-place
      
      Link: http://lkml.kernel.org/r/20190326164747.24405-6-jglisse@redhat.com
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
      Reviewed-by: Ira Weiny <ira.weiny@intel.com>
      Cc: Christian König <christian.koenig@amd.com>
      Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Cc: Jani Nikula <jani.nikula@linux.intel.com>
      Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Felix Kuehling <Felix.Kuehling@amd.com>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Ross Zwisler <zwisler@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krcmar <rkrcmar@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Christian Koenig <christian.koenig@amd.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>