1. 04 Jun 2020, 6 commits
    • mm/hugetlb: avoid unnecessary check on pud and pmd entry in huge_pte_offset · 8ac0b81a
      Committed by Li Xinhai
      When huge_pte_offset() is called, the parameter sz can only be PUD_SIZE or
      PMD_SIZE.  If sz is PUD_SIZE and code can reach pud, then *pud must be
      none, or normal hugetlb entry, or non-present (migration or hwpoisoned)
      hugetlb entry, and we can directly return pud.  When sz is PMD_SIZE, pud
      must be none or present, and if code can reach pmd, we can directly return
      pmd.
      
      So after this patch the code is simplified by checking the parameter sz
      first, which avoids the unnecessary checks in the current code.  The
      semantics of the existing code are maintained.
      
      More details about relevant commits:
      commit 9b19df29 ("mm/hugetlb.c: make huge_pte_offset() consistent
      and document behaviour") changed the code path for pud and pmd handling,
      see comments about why this patch intends to change it.
      ...
      	pud = pud_offset(p4d, addr);
      	if (sz != PUD_SIZE && pud_none(*pud)) // [1]
      		return NULL;
      	/* hugepage or swap? */
      	if (pud_huge(*pud) || !pud_present(*pud)) // [2]
      		return (pte_t *)pud;
      
      	pmd = pmd_offset(pud, addr);
      	if (sz != PMD_SIZE && pmd_none(*pmd)) // [3]
      		return NULL;
      	/* hugepage or swap? */
      	if (pmd_huge(*pmd) || !pmd_present(*pmd)) // [4]
      		return (pte_t *)pmd;
      
      	return NULL; // [5]
      ...
      [1]: this is necessary, return NULL for sz == PMD_SIZE;
      [2]: if sz == PUD_SIZE, all valid values of pud entry will cause return;
      [3]: dead code, sz != PMD_SIZE never true;
      [4]: all valid values of pmd entry will cause return;
      [5]: dead code, because of check in [4].
      
      Now, this patch combines [1] and [2] for pud, and combines [3], [4] and
      [5] for pmd, avoiding the unnecessary checks.
      
      I don't try to catch any invalid values in the page table entry, as those
      will be checked by the caller; this avoids an extra branch in this
      function.  There is also no assert that sz equals PUD_SIZE or PMD_SIZE,
      since this function is only called for hugetlb mappings.
      
      Regarding commit 3c1d7e6c ("mm/hugetlb: fix an addressing exception
      caused by huge_pte_offset"): since we no longer read each entry more than
      once, the variables pud_entry and pmd_entry are not needed.
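
      For reference, a minimal sketch of the simplified flow described above
      (a hedged illustration, not necessarily character-for-character the
      final code):

      	pud = pud_offset(p4d, addr);
      	if (sz == PUD_SIZE)
      		/* must be none, a hugetlb entry or a non-present (swap) entry */
      		return (pte_t *)pud;
      	if (!pud_present(*pud))
      		return NULL;
      	/* sz is PMD_SIZE here and pud points to a pmd page table */

      	pmd = pmd_offset(pud, addr);
      	/* must be none, a hugetlb entry or a non-present (swap) entry */
      	return (pte_t *)pmd;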
      Signed-off-by: Li Xinhai <lixinhai.lxh@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Punit Agrawal <punit.agrawal@arm.com>
      Cc: Longpeng <longpeng2@huawei.com>
      Link: http://lkml.kernel.org/r/1587794313-16849-1-git-send-email-lixinhai.lxh@gmail.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8ac0b81a
    • hugetlbfs: fix changes to command line processing · c2833a5b
      Committed by Mike Kravetz
      Previously, a check for hugepages_supported was added before processing
      hugetlb command line parameters.  On some architectures such as powerpc,
      hugepages_supported() does not return true until after command line
      processing.  Therefore, no hugetlb command line parameters would be
      accepted.
      
      Remove the additional checks for hugepages_supported.  In hugetlb_init,
      print a warning if !hugepages_supported and command line parameters were
      specified.
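
      A hedged sketch of how such a check in hugetlb_init() can look (the
      details are an assumption, not a quote of the final code):

      	static int __init hugetlb_init(void)
      	{
      		if (!hugepages_supported()) {
      			if (hugetlb_max_hstate || default_hstate_max_huge_pages)
      				pr_warn("HugeTLB: huge pages not supported, ignoring associated command-line parameters\n");
      			return 0;
      		}
      		...
      	}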
      Reported-by: Sandipan Das <sandipan.osd@gmail.com>
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Link: http://lkml.kernel.org/r/b1f04f9f-fa46-c2a0-7693-4a0679d2a1ee@oracle.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c2833a5b
    • hugetlbfs: clean up command line processing · 282f4214
      Committed by Mike Kravetz
      With all hugetlb command line processing now done in a single file, clean
      up the code.
      
      - Make code match desired semantics
        - Update documentation with semantics
      - Make all warning and error messages start with 'HugeTLB:'.
      - Consistently name command line parsing routines.
      - Warn if !hugepages_supported() and command line parameters have
        been specified.
      - Add comments to code
        - Describe some of the subtle interactions
        - Describe semantics of command line arguments
      
      This patch also fixes issues with implicitly setting the number of
      gigantic huge pages to preallocate.  Previously, on x86, the command line
      
              hugepages=2 default_hugepagesz=1G
      
      would result in zero 1G pages being preallocated and,
      
              # grep HugePages_Total /proc/meminfo
              HugePages_Total:       0
              # sysctl -a | grep nr_hugepages
              vm.nr_hugepages = 2
              vm.nr_hugepages_mempolicy = 2
              # cat /proc/sys/vm/nr_hugepages
              2
      
      After this patch 2 gigantic pages will be preallocated and all the proc,
      sysfs, sysctl and meminfo files will accurately reflect this.
      
      To address the issue with gigantic pages, a small change in behavior was
      made to command line processing.  Previously the command line,
      
              hugepages=128 default_hugepagesz=2M hugepagesz=2M hugepages=256
      
      would result in the allocation of 256 2M huge pages.  The value 128 would
      be ignored without any warning.  After this patch, 128 2M pages will be
      allocated and a warning message will be displayed indicating the value of
      256 is ignored.  This change in behavior is required because allocation
      of implicitly specified gigantic pages must be done when
      default_hugepagesz= is encountered.  Previously the code waited until
      later in the boot process (hugetlb_init) to allocate pages of the default
      size.  However, the bootmem allocator required for gigantic allocations
      is no longer available by then.
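
      A hedged sketch of the resulting default_hugepagesz= handling (variable
      and helper names follow the surrounding text; the exact code is an
      assumption):

      	static int __init default_hugepagesz_setup(char *s)
      	{
      		unsigned long size = (unsigned long)memparse(s, NULL);

      		if (!arch_hugetlb_valid_size(size)) {
      			pr_err("HugeTLB: unsupported default_hugepagesz=%s\n", s);
      			return 0;
      		}
      		hugetlb_add_hstate(ilog2(size) - PAGE_SHIFT);

      		/*
      		 * A "hugepages=" value seen before any "hugepagesz=" applies to
      		 * the default size.  Gigantic pages need the bootmem allocator,
      		 * so allocate them now instead of waiting for hugetlb_init().
      		 */
      		if (default_hstate_max_huge_pages) {
      			default_hstate.max_huge_pages = default_hstate_max_huge_pages;
      			if (hstate_is_gigantic(&default_hstate))
      				hugetlb_hstate_alloc_pages(&default_hstate);
      		}
      		return 1;
      	}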
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Sandipan Das <sandipan@linux.ibm.com>
      Acked-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>	[s390]
      Acked-by: Will Deacon <will@kernel.org>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@c-s.fr>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Longpeng <longpeng2@huawei.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Nitesh Narayan Lal <nitesh@redhat.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Anders Roxell <anders.roxell@linaro.org>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Link: http://lkml.kernel.org/r/20200417185049.275845-5-mike.kravetz@oracle.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      282f4214
    • hugetlbfs: remove hugetlb_add_hstate() warning for existing hstate · 38237830
      Committed by Mike Kravetz
      hugetlb_add_hstate() prints a warning if the hstate already exists.  This
      was originally done as part of kernel command line parsing.  If
      'hugepagesz=' was specified more than once, the warning
      
      	pr_warn("hugepagesz= specified twice, ignoring\n");
      
      would be printed.
      
      Some architectures want to enable all huge page sizes.  They would call
      hugetlb_add_hstate for all supported sizes.  However, this was done after
      command line processing and as a result hstates could have already been
      created for some sizes.  To make sure no warnings were printed, there
      would often be code like:
      
      	if (!size_to_hstate(size))
      		hugetlb_add_hstate(ilog2(size) - PAGE_SHIFT);
      
      The only time we want to print the warning is as the result of command
      line processing.  So, remove the warning from hugetlb_add_hstate and add
      it to the single arch independent routine processing "hugepagesz=".  After
      this, calls to size_to_hstate() in arch specific code can be removed and
      hugetlb_add_hstate can be called without worrying about warning messages.
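
      A hedged illustration of the arch-side simplification described above
      (generic, not tied to a particular architecture):

      	/* Before: guard against an existing hstate to avoid the warning. */
      	if (!size_to_hstate(size))
      		hugetlb_add_hstate(ilog2(size) - PAGE_SHIFT);

      	/* After: hugetlb_add_hstate() simply returns if the hstate exists. */
      	hugetlb_add_hstate(ilog2(size) - PAGE_SHIFT);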
      
      [mike.kravetz@oracle.com: fix hugetlb initialization]
        Link: http://lkml.kernel.org/r/4c36c6ce-3774-78fa-abc4-b7346bf24348@oracle.com
        Link: http://lkml.kernel.org/r/20200428205614.246260-5-mike.kravetz@oracle.com
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Anders Roxell <anders.roxell@linaro.org>
      Acked-by: Mina Almasry <almasrymina@google.com>
      Acked-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>	[s390]
      Acked-by: Will Deacon <will@kernel.org>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@c-s.fr>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Longpeng <longpeng2@huawei.com>
      Cc: Nitesh Narayan Lal <nitesh@redhat.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Link: http://lkml.kernel.org/r/20200417185049.275845-4-mike.kravetz@oracle.com
      Link: http://lkml.kernel.org/r/20200428205614.246260-4-mike.kravetz@oracle.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      38237830
    • hugetlbfs: move hugepagesz= parsing to arch independent code · 359f2544
      Committed by Mike Kravetz
      Now that architectures provide arch_hugetlb_valid_size(), parsing of
      "hugepagesz=" can be done in architecture independent code.  Create a
      single routine to handle hugepagesz= parsing and remove all arch specific
      routines.  We can also remove the interface hugetlb_bad_size() as this is
      no longer used outside arch independent code.
      
      This also provides consistent behavior of hugetlbfs command line options.
      The hugepagesz= option should only be specified once for a specific size,
      but some architectures allow multiple instances.  This appears to be more
      of an oversight from when code was added by some architectures to set up
      ALL huge page sizes.
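
      A hedged sketch of the single arch independent parsing routine described
      above (close in spirit to the description; exact details are an
      assumption):

      	static int __init hugepagesz_setup(char *s)
      	{
      		unsigned long size;

      		size = (unsigned long)memparse(s, NULL);
      		if (!arch_hugetlb_valid_size(size)) {
      			pr_err("HugeTLB: unsupported hugepagesz=%s\n", s);
      			return 0;
      		}
      		hugetlb_add_hstate(ilog2(size) - PAGE_SHIFT);
      		return 1;
      	}
      	__setup("hugepagesz=", hugepagesz_setup);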
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Sandipan Das <sandipan@linux.ibm.com>
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Acked-by: Mina Almasry <almasrymina@google.com>
      Acked-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>	[s390]
      Acked-by: Will Deacon <will@kernel.org>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@c-s.fr>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Longpeng <longpeng2@huawei.com>
      Cc: Nitesh Narayan Lal <nitesh@redhat.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Anders Roxell <anders.roxell@linaro.org>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Link: http://lkml.kernel.org/r/20200417185049.275845-3-mike.kravetz@oracle.com
      Link: http://lkml.kernel.org/r/20200428205614.246260-3-mike.kravetz@oracle.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      359f2544
    • hugetlbfs: add arch_hugetlb_valid_size · ae94da89
      Committed by Mike Kravetz
      Patch series "Clean up hugetlb boot command line processing", v4.
      
      Longpeng(Mike) reported a weird message from hugetlb command line
      processing and proposed a solution [1].  While the proposed patch does
      address the specific issue, there are other related issues in command line
      processing.  As hugetlbfs evolved, updates to command line processing have
      been made to meet immediate needs and not necessarily in a coordinated
      manner.  The result is that some processing is done in arch specific code,
      some is done in arch independent code and coordination is problematic.
      Semantics can vary between architectures.
      
      The patch series does the following:
      - Define arch specific arch_hugetlb_valid_size routine used to validate
        passed huge page sizes.
      - Move hugepagesz= command line parsing out of arch specific code and into
        an arch independent routine.
      - Clean up command line processing to follow desired semantics and
        document those semantics.
      
      [1] https://lore.kernel.org/linux-mm/20200305033014.1152-1-longpeng2@huawei.com
      
      This patch (of 3):
      
      The architecture independent routine hugetlb_default_setup sets up the
      default huge page size.  It has no way to verify whether the passed value
      is valid, so it accepts it and attempts to validate it at a later time.
      requires undocumented cooperation between the arch specific and arch
      independent code.
      
      For architectures that support more than one huge page size, provide a
      routine arch_hugetlb_valid_size to validate a huge page size.
      hugetlb_default_setup can use this to validate passed values.
      
      arch_hugetlb_valid_size will also be used in a subsequent patch to move
      processing of "hugepagesz=" from arch specific code to a common routine
      in arch independent code.
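
      For illustration, a hedged sketch of what such a routine can look like on
      an x86-64-style architecture with 2MB and 1GB pages (treat the details as
      an assumption rather than the exact upstream implementation):

      	bool __init arch_hugetlb_valid_size(unsigned long size)
      	{
      		if (size == PMD_SIZE)
      			return true;
      		else if (size == PUD_SIZE && boot_cpu_has(X86_FEATURE_GBPAGES))
      			return true;
      		else
      			return false;
      	}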
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>	[s390]
      Acked-by: Will Deacon <will@kernel.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Longpeng <longpeng2@huawei.com>
      Cc: Christophe Leroy <christophe.leroy@c-s.fr>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Nitesh Narayan Lal <nitesh@redhat.com>
      Cc: Anders Roxell <anders.roxell@linaro.org>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Link: http://lkml.kernel.org/r/20200428205614.246260-1-mike.kravetz@oracle.com
      Link: http://lkml.kernel.org/r/20200428205614.246260-2-mike.kravetz@oracle.com
      Link: http://lkml.kernel.org/r/20200417185049.275845-1-mike.kravetz@oracle.com
      Link: http://lkml.kernel.org/r/20200417185049.275845-2-mike.kravetz@oracle.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ae94da89
  2. 27 Apr 2020, 1 commit
  3. 22 Apr 2020, 1 commit
    • mm/hugetlb: fix an addressing exception caused by huge_pte_offset · 3c1d7e6c
      Committed by Longpeng
      Our machine encountered a panic (addressing exception) after running for
      a long time, and the call trace is:
      
          RIP: hugetlb_fault+0x307/0xbe0
          RSP: 0018:ffff9567fc27f808  EFLAGS: 00010286
          RAX: e800c03ff1258d48 RBX: ffffd3bb003b69c0 RCX: e800c03ff1258d48
          RDX: 17ff3fc00eda72b7 RSI: 00003ffffffff000 RDI: e800c03ff1258d48
          RBP: ffff9567fc27f8c8 R08: e800c03ff1258d48 R09: 0000000000000080
          R10: ffffaba0704c22a8 R11: 0000000000000001 R12: ffff95c87b4b60d8
          R13: 00005fff00000000 R14: 0000000000000000 R15: ffff9567face8074
          FS:  00007fe2d9ffb700(0000) GS:ffff956900e40000(0000) knlGS:0000000000000000
          CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
          CR2: ffffd3bb003b69c0 CR3: 000000be67374000 CR4: 00000000003627e0
          DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
          DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
          Call Trace:
            follow_hugetlb_page+0x175/0x540
            __get_user_pages+0x2a0/0x7e0
            __get_user_pages_unlocked+0x15d/0x210
            __gfn_to_pfn_memslot+0x3c5/0x460 [kvm]
            try_async_pf+0x6e/0x2a0 [kvm]
            tdp_page_fault+0x151/0x2d0 [kvm]
           ...
            kvm_arch_vcpu_ioctl_run+0x330/0x490 [kvm]
            kvm_vcpu_ioctl+0x309/0x6d0 [kvm]
            do_vfs_ioctl+0x3f0/0x540
            SyS_ioctl+0xa1/0xc0
            system_call_fastpath+0x22/0x27
      
      For 1G hugepages, huge_pte_offset() wants to return NULL or pudp, but it
      may return a wrong 'pmdp' if there is a race.  Please look at the
      following code snippet:
      
          ...
          pud = pud_offset(p4d, addr);
          if (sz != PUD_SIZE && pud_none(*pud))
              return NULL;
          /* hugepage or swap? */
          if (pud_huge(*pud) || !pud_present(*pud))
              return (pte_t *)pud;
      
          pmd = pmd_offset(pud, addr);
          if (sz != PMD_SIZE && pmd_none(*pmd))
              return NULL;
          /* hugepage or swap? */
          if (pmd_huge(*pmd) || !pmd_present(*pmd))
              return (pte_t *)pmd;
          ...
      
      The following sequence would trigger this bug:
      
       - CPU0: sz = PUD_SIZE and *pud = 0 , continue
       - CPU0: "pud_huge(*pud)" is false
       - CPU1: calling hugetlb_no_page and set *pud to xxxx8e7(PRESENT)
       - CPU0: "!pud_present(*pud)" is false, continue
       - CPU0: pmd = pmd_offset(pud, addr) and may return a wrong pmdp
      
      However, we want CPU0 to return NULL or pudp in this case.
      
      We must make sure there is exactly one dereference of pud and pmd.
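
      A hedged sketch of the kind of fix described above: read each entry
      exactly once into a local variable and test only that copy (an
      illustration, not a byte-for-byte quote of the final patch):

      	pud_t *pud, pud_entry;
      	pmd_t *pmd, pmd_entry;
      	...
      	pud = pud_offset(p4d, addr);
      	pud_entry = READ_ONCE(*pud);
      	if (sz != PUD_SIZE && pud_none(pud_entry))
      		return NULL;
      	/* hugepage or swap? */
      	if (pud_huge(pud_entry) || !pud_present(pud_entry))
      		return (pte_t *)pud;

      	pmd = pmd_offset(pud, addr);
      	pmd_entry = READ_ONCE(*pmd);
      	if (sz != PMD_SIZE && pmd_none(pmd_entry))
      		return NULL;
      	/* hugepage or swap? */
      	if (pmd_huge(pmd_entry) || !pmd_present(pmd_entry))
      		return (pte_t *)pmd;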
      Signed-off-by: Longpeng <longpeng2@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Sean Christopherson <sean.j.christopherson@intel.com>
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/20200413010342.771-1-longpeng2@huawei.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3c1d7e6c
  4. 11 Apr 2020, 1 commit
    • mm: hugetlb: optionally allocate gigantic hugepages using cma · cf11e85f
      Committed by Roman Gushchin
      Commit 944d9fec ("hugetlb: add support for gigantic page allocation
      at runtime") has added the run-time allocation of gigantic pages.
      
      However, it actually works only during the early stages of the system's
      lifetime, when the majority of memory is free.  After some time the
      memory gets fragmented by non-movable pages, so the chances of finding a
      contiguous 1GB block get close to zero.  Even dropping caches manually
      doesn't help much.
      
      At large scale, rebooting servers in order to allocate gigantic hugepages
      is quite expensive and complex.  At the same time, keeping some constant
      percentage of memory in reserved hugepages even if the workload isn't
      using it is a big waste: not all workloads can benefit from using 1 GB
      pages.
      
      The following solution can solve the problem:
       1) At boot time a dedicated cma area* is reserved. The size is passed
          as a kernel argument.
       2) Run-time allocations of gigantic hugepages are performed using the
          cma allocator and the dedicated cma area.
      
      In this case gigantic hugepages can be allocated successfully with a
      high probability; however, the memory isn't completely wasted if nobody
      is using 1GB hugepages: it can be used for pagecache, anon memory, THPs,
      etc.
      
      * On a multi-node machine a per-node cma area is allocated on each node.
        Subsequent gigantic hugetlb allocations use the first available NUMA
        node if a mask isn't specified by the user.
      
      Usage:
      1) configure the kernel to allocate a cma area for hugetlb allocations:
         pass hugetlb_cma=10G as a kernel argument
      
      2) allocate hugetlb pages as usual, e.g.
         echo 10 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
      
      If the option isn't enabled or the allocation of the cma area fails, the
      current behavior of the system is preserved.
      
      x86 and arm64 are covered by this patch; other architectures can be
      added trivially later.
      
      The patch contains clean-ups and fixes proposed and implemented by Aslan
      Bakirov and Randy Dunlap.  It also contains ideas and suggestions
      proposed by Rik van Riel, Michal Hocko and Mike Kravetz.  Thanks!
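
      A hedged sketch of the runtime allocation path described above, using the
      CMA allocator when a per-node hugetlb CMA area exists (the hugetlb_cma[]
      array name and surrounding details are assumptions for illustration):

      	#ifdef CONFIG_CMA
      	static struct page *alloc_gigantic_from_cma(struct hstate *h, int nid)
      	{
      		struct page *page = NULL;

      		if (hugetlb_cma[nid])
      			page = cma_alloc(hugetlb_cma[nid], pages_per_huge_page(h),
      					 huge_page_order(h), true);
      		/* on failure, fall back to alloc_contig_range()-based allocation */
      		return page;
      	}
      	#endif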
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Andreas Schaufler <andreas.schaufler@gmx.de>
      Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
      Acked-by: Michal Hocko <mhocko@kernel.org>
      Cc: Aslan Bakirov <aslan@fb.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Link: http://lkml.kernel.org/r/20200407163840.92263-3-guro@fb.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cf11e85f
  5. 08 Apr 2020, 1 commit
  6. 03 Apr 2020, 15 commits
    • mm/hugetlb: remove unnecessary memory fetch in PageHeadHuge() · d4af73e3
      Committed by Vlastimil Babka
      Commit f1e61557 ("mm: pack compound_dtor and compound_order into one
      word in struct page") changed compound_dtor from a pointer to an array
      index in order to pack it.  To check whether a page has the hugetlbfs
      compound_dtor, we can just compare the index directly without fetching
      the function pointer.  Said commit did that with PageHuge() and we can do
      the same with PageHeadHuge() to make the code a bit smaller and faster.
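
      A hedged sketch of the resulting check (the compound_dtor index lives in
      the first tail page; the exact code is an assumption):

      	int PageHeadHuge(struct page *page_head)
      	{
      		if (!PageHead(page_head))
      			return 0;
      		return page_head[1].compound_dtor == HUGETLB_PAGE_DTOR;
      	}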
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Neha Agarwal <nehaagarwal@google.com>
      Link: http://lkml.kernel.org/r/20200311172440.6988-1-vbabka@suse.cz
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d4af73e3
    • mm/hugetlb.c: clean code by removing unnecessary initialization · 353b2de4
      Committed by Mateusz Nosek
      Previously the variable 'check_addr' was initialized but not read before
      being reassigned, so the initialization can be removed.
      Signed-off-by: Mateusz Nosek <mateusznosek0@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Link: http://lkml.kernel.org/r/20200303212354.25226-1-mateusznosek0@gmail.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      353b2de4
    • hugetlb: support file_region coalescing again · a9b3f867
      Committed by Mina Almasry
      An earlier patch in this series disabled file_region coalescing in order
      to hang the hugetlb_cgroup uncharge info on the file_region entries.
      
      This patch re-adds support for coalescing of file_region entries.
      Essentially, every time we add an entry, we call a recursive function
      that tries to coalesce the added region with the regions next to it.  The
      worst case call depth for this function is 3: one call to coalesce with
      the next region, one to coalesce with the previous region, and one to
      reach the base case.
      
      This is an important performance optimization as private mappings add
      their entries page by page, and we could incur big performance costs for
      large mappings with lots of file_region entries in their resv_map.
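
      A hedged sketch of the coalescing step described above (helper names such
      as has_same_uncharge_info() are illustrative assumptions):

      	static void coalesce_file_region(struct resv_map *resv, struct file_region *rg)
      	{
      		struct file_region *nrg, *prg;

      		nrg = list_next_entry(rg, link);
      		if (&nrg->link != &resv->regions && nrg->from == rg->to &&
      		    has_same_uncharge_info(nrg, rg)) {
      			rg->to = nrg->to;	/* absorb the next region */
      			list_del(&nrg->link);
      			kfree(nrg);
      			coalesce_file_region(resv, rg);
      			return;
      		}

      		prg = list_prev_entry(rg, link);
      		if (&prg->link != &resv->regions && prg->to == rg->from &&
      		    has_same_uncharge_info(prg, rg)) {
      			prg->to = rg->to;	/* absorb into the previous region */
      			list_del(&rg->link);
      			kfree(rg);
      			coalesce_file_region(resv, prg);
      		}
      	}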
      
      [almasrymina@google.com: fix CONFIG_CGROUP_HUGETLB ifdefs]
        Link: http://lkml.kernel.org/r/20200214204544.231482-1-almasrymina@google.com
      [almasrymina@google.com: remove check_coalesce_bug debug code]
        Link: http://lkml.kernel.org/r/20200219233610.13808-1-almasrymina@google.com
      Signed-off-by: Mina Almasry <almasrymina@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Sandipan Das <sandipan@linux.ibm.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Link: http://lkml.kernel.org/r/20200211213128.73302-7-almasrymina@google.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a9b3f867
    • hugetlb_cgroup: support noreserve mappings · 08cf9faf
      Committed by Mina Almasry
      Support MAP_NORESERVE accounting as part of the new counter.
      
      For each hugepage allocation, at allocation time we check if there is a
      reservation for this allocation or not.  If there is a reservation for
      this allocation, then this allocation was charged at reservation time, and
      we don't re-account it.  If there is no reservation for this allocation,
      we charge the appropriate hugetlb_cgroup.
      
      The hugetlb_cgroup to uncharge for this allocation is stored in
      page[3].private.  We use new APIs added in an earlier patch to set this
      pointer.
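
      A hedged sketch of the accessor pattern described above (helper names are
      assumptions; the point is that the uncharge pointer is stored in the
      private field of the page's third tail page):

      	static inline void set_hugetlb_cgroup_rsvd(struct page *page,
      						   struct hugetlb_cgroup *h_cg)
      	{
      		/* reservation uncharge info lives in page[3].private */
      		set_page_private(&page[3], (unsigned long)h_cg);
      	}

      	static inline struct hugetlb_cgroup *hugetlb_cgroup_from_page_rsvd(struct page *page)
      	{
      		return (struct hugetlb_cgroup *)page_private(&page[3]);
      	}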
      Signed-off-by: Mina Almasry <almasrymina@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Sandipan Das <sandipan@linux.ibm.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Link: http://lkml.kernel.org/r/20200211213128.73302-6-almasrymina@google.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      08cf9faf
    • hugetlb_cgroup: add accounting for shared mappings · 075a61d0
      Committed by Mina Almasry
      For shared mappings, the pointer to the hugetlb_cgroup to uncharge lives
      in the resv_map entries, in file_region->reservation_counter.
      
      After a call to region_chg, we charge the appropriate hugetlb_cgroup, and
      if successful, we pass on the hugetlb_cgroup info to a follow up
      region_add call.  When a file_region entry is added to the resv_map via
      region_add, we put the pointer to that cgroup in
      file_region->reservation_counter.  If charging doesn't succeed, we report
      the error to the caller, so that the kernel fails the reservation.
      
      On region_del, which is when the hugetlb memory is unreserved, we also
      uncharge the file_region->reservation_counter.
      
      [akpm@linux-foundation.org: forward declare struct file_region]
      Signed-off-by: Mina Almasry <almasrymina@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Sandipan Das <sandipan@linux.ibm.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Link: http://lkml.kernel.org/r/20200211213128.73302-5-almasrymina@google.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      075a61d0
    • hugetlb: disable region_add file_region coalescing · 0db9d74e
      Committed by Mina Almasry
      A follow up patch in this series adds hugetlb cgroup uncharge info to the
      file_region entries in resv->regions.  The cgroup uncharge info may
      differ for different regions, so they can no longer be coalesced at
      region_add time.  So, disable region coalescing in region_add in this
      patch.
      
      Behavior change:
      
      Say a resv_map exists like this [0->1], [2->3], and [5->6].
      
      Then a region_chg/add call comes in region_chg/add(f=0, t=5).
      
      Old code would generate resv->regions: [0->5], [5->6].
      New code would generate resv->regions: [0->1], [1->2], [2->3], [3->5],
      [5->6].
      
      Special care needs to be taken to handle the resv->adds_in_progress
      variable correctly.  In the past, only 1 region would be added for every
      region_chg and region_add call.  But now, each call may add multiple
      regions, so we can no longer increment adds_in_progress by 1 in
      region_chg, or decrement adds_in_progress by 1 after region_add or
      region_abort.  Instead, region_chg calls add_reservation_in_range() to
      count the number of regions needed and allocates those, and that info is
      passed to region_add and region_abort to decrement adds_in_progress
      correctly.
      
      We've also modified the assumption that region_add after region_chg never
      fails.  region_chg now pre-allocates at least 1 region for region_add.  If
      region_add needs more regions than region_chg has allocated for it, then
      it may fail.
      
      [almasrymina@google.com: fix file_region entry allocations]
        Link: http://lkml.kernel.org/r/20200219012736.20363-1-almasrymina@google.com
      Signed-off-by: Mina Almasry <almasrymina@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Sandipan Das <sandipan@linux.ibm.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
      Link: http://lkml.kernel.org/r/20200211213128.73302-4-almasrymina@google.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0db9d74e
    • hugetlb_cgroup: add reservation accounting for private mappings · e9fe92ae
      Committed by Mina Almasry
      Normally the pointer to the cgroup to uncharge hangs off the struct page,
      and gets queried when it's time to free the page.  With hugetlb_cgroup
      reservations, this is not possible, because a page can be reserved by one
      task and actually faulted in by another task.
      
      The best place to put the hugetlb_cgroup pointer to uncharge for
      reservations is in the resv_map.  But, because the resv_map has different
      semantics for private and shared mappings, the code path to
      charge/uncharge shared and private mappings is different.  This patch
      implements charging and uncharging for private mappings.
      
      For private mappings, the counter to uncharge is in
      resv_map->reservation_counter.  On initializing the resv_map this is set
      to NULL.  On reservation of a region in a private mapping, the task's
      hugetlb_cgroup is charged and the hugetlb_cgroup is placed in
      resv_map->reservation_counter.
      
      On hugetlb_vm_op_close, we uncharge resv_map->reservation_counter.
      
      [akpm@linux-foundation.org: forward declare struct resv_map]
      Signed-off-by: Mina Almasry <almasrymina@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Sandipan Das <sandipan@linux.ibm.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Link: http://lkml.kernel.org/r/20200211213128.73302-3-almasrymina@google.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e9fe92ae
    • hugetlb_cgroup: add interface for charge/uncharge hugetlb reservations · 1adc4d41
      Committed by Mina Almasry
      Augments hugetlb_cgroup_charge_cgroup to be able to charge either the
      hugetlb usage counter or the hugetlb reservation counter.
      
      Adds a new interface to uncharge a hugetlb_cgroup counter via
      hugetlb_cgroup_uncharge_counter.
      
      Integrates the counter with hugetlb_cgroup, via hugetlb_cgroup_init,
      hugetlb_cgroup_have_usage, and hugetlb_cgroup_css_offline.
      Signed-off-by: Mina Almasry <almasrymina@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Sandipan Das <sandipan@linux.ibm.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Link: http://lkml.kernel.org/r/20200211213128.73302-2-almasrymina@google.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1adc4d41
    • hugetlbfs: Use i_mmap_rwsem to address page fault/truncate race · 87bf91d3
      Committed by Mike Kravetz
      hugetlbfs page faults can race with truncate and hole punch operations.
      Current code in the page fault path attempts to handle this by 'backing
      out' operations if we encounter the race.  One obvious omission in the
      current code is removing a page newly added to the page cache.  This is
      pretty straightforward to address, but there is a more subtle and
      difficult issue of backing out hugetlb reservations.  To handle this
      correctly, the 'reservation state' before page allocation needs to be
      noted so that it can be properly backed out.  There are four distinct
      possibilities for reservation state: shared/reserved, shared/no-resv,
      private/reserved and private/no-resv.  Backing out a reservation may
      require memory allocation which could fail so that needs to be taken
      into account as well.
      
      Instead of writing the required complicated code for this rare
      occurrence, just eliminate the race.  i_mmap_rwsem is now held in read
      mode for the duration of page fault processing.  Hold i_mmap_rwsem in
      write mode when modifying i_size.  In this way, truncation can not
      proceed when page faults are being processed.  In addition, i_size
      will not change during fault processing so a single check can be made
      to ensure faults are not beyond (proposed) end of file.  Faults can
      still race with hole punch, but that race is handled by existing code
      and the use of hugetlb_fault_mutex.
      
      With this modification, checks for races with truncation in the page
      fault path can be simplified and removed.  remove_inode_hugepages no
      longer needs to take hugetlb_fault_mutex in the case of truncation.
      Comments are expanded to explain reasoning behind locking.
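
      A hedged illustration of the locking rule described above (the standard
      i_mmap_rwsem wrappers; surrounding code elided):

      	/* page fault path: i_size cannot change while the fault is handled */
      	i_mmap_lock_read(mapping);
      	...
      	i_mmap_unlock_read(mapping);

      	/* truncate path: exclude page faults while i_size is modified */
      	i_mmap_lock_write(mapping);
      	i_size_write(inode, offset);
      	...
      	i_mmap_unlock_write(mapping);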
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
      Link: http://lkml.kernel.org/r/20200316205756.146666-3-mike.kravetz@oracle.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      87bf91d3
    • hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization · c0d0381a
      Committed by Mike Kravetz
      Patch series "hugetlbfs: use i_mmap_rwsem for more synchronization", v2.
      
      While discussing the issue with huge_pte_offset [1], I remembered that
      there were more outstanding hugetlb races.  These issues are:
      
      1) For shared pmds, huge PTE pointers returned by huge_pte_alloc can become
         invalid via a call to huge_pmd_unshare by another thread.
      2) hugetlbfs page faults can race with truncation causing invalid global
         reserve counts and state.
      
      A previous attempt was made to use i_mmap_rwsem in this manner as
      described at [2].  However, those patches were reverted starting with [3]
      due to locking issues.
      
      To effectively use i_mmap_rwsem to address the above issues it needs to be
      held (in read mode) during page fault processing.  However, during fault
      processing we need to lock the page we will be adding.  Lock ordering
      requires we take page lock before i_mmap_rwsem.  Waiting until after
      taking the page lock is too late in the fault process for the
      synchronization we want to do.
      
      To address this lock ordering issue, the following patches change the lock
      ordering for hugetlb pages.  This is not too invasive as hugetlbfs
      processing is done separately from core mm in many places.  However, I don't
      really like this idea.  Much ugliness is contained in the new routine
      hugetlb_page_mapping_lock_write() of patch 1.
      
      The only other way I can think of to address these issues is by catching
      all the races.  After catching a race, cleanup, backout, retry ...  etc,
      as needed.  This can get really ugly, especially for huge page
      reservations.  At one time, I started writing some of the reservation
      backout code for page faults and it got so ugly and complicated I went
      down the path of adding synchronization to avoid the races.  Any other
      suggestions would be welcome.
      
      [1] https://lore.kernel.org/linux-mm/1582342427-230392-1-git-send-email-longpeng2@huawei.com/
      [2] https://lore.kernel.org/linux-mm/20181222223013.22193-1-mike.kravetz@oracle.com/
      [3] https://lore.kernel.org/linux-mm/20190103235452.29335-1-mike.kravetz@oracle.com
      [4] https://lore.kernel.org/linux-mm/1584028670.7365.182.camel@lca.pw/
      [5] https://lore.kernel.org/lkml/20200312183142.108df9ac@canb.auug.org.au/
      
      This patch (of 2):
      
      While looking at BUGs associated with invalid huge page map counts, it was
      discovered and observed that a huge pte pointer could become 'invalid' and
      point to another task's page table.  Consider the following:
      
      A task takes a page fault on a shared hugetlbfs file and calls
      huge_pte_alloc to get a ptep.  Suppose the returned ptep points to a
      shared pmd.
      
      Now, another task truncates the hugetlbfs file.  As part of truncation, it
      unmaps everyone who has the file mapped.  If the range being truncated is
      covered by a shared pmd, huge_pmd_unshare will be called.  For all but the
      last user of the shared pmd, huge_pmd_unshare will clear the pud pointing
      to the pmd.  If the task in the middle of the page fault is not the last
      user, the ptep returned by huge_pte_alloc now points to another task's
      page table or worse.  This leads to bad things such as incorrect page
      map/reference counts or invalid memory references.
      
      To fix, expand the use of i_mmap_rwsem as follows:
      - i_mmap_rwsem is held in read mode whenever huge_pmd_share is called.
        huge_pmd_share is only called via huge_pte_alloc, so callers of
        huge_pte_alloc take i_mmap_rwsem before calling.  In addition, callers
        of huge_pte_alloc continue to hold the semaphore until finished with
        the ptep.
      - i_mmap_rwsem is held in write mode whenever huge_pmd_unshare is called.
      
      One problem with this scheme is that it requires taking i_mmap_rwsem
      before taking the page lock during page faults.  This is not the order
      specified in the rest of mm code.  Handling of hugetlbfs pages is mostly
      isolated today.  Therefore, we use this alternative locking order for
      PageHuge() pages.
      
               mapping->i_mmap_rwsem
                 hugetlb_fault_mutex (hugetlbfs specific page fault mutex)
                   page->flags PG_locked (lock_page)
      
      To help with lock ordering issues, hugetlb_page_mapping_lock_write() is
      introduced to write lock the i_mmap_rwsem associated with a page.
      
      In most cases it is easy to get address_space via vma->vm_file->f_mapping.
      However, in the case of migration or memory errors for anon pages we do
      not have an associated vma.  A new routine _get_hugetlb_page_mapping()
      will use anon_vma to get address_space in these cases.
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
      Link: http://lkml.kernel.org/r/20200316205756.146666-2-mike.kravetz@oracle.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c0d0381a
    • mm/gup: allow to react to fatal signals · 71335f37
      Committed by Peter Xu
      The existing gup code does not react to fatal signals in many code
      paths.  For example, in one retry path of gup we're still using
      down_read() rather than down_read_killable().  Also, when doing page
      faults we don't pass in FAULT_FLAG_KILLABLE, which means that within the
      faulting process we'll wait in a non-killable way as well.  These
      were spotted by Linus during the code review of some other patches.
      
      Let's allow the gup code to react to fatal signals, to improve the
      responsiveness of threads that are being killed while in gup.
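
      A hedged sketch of the kind of change described above (illustrative, not
      an exhaustive list of the touched call sites):

      	/* in a gup retry path: give up if the task has a fatal signal */
      	if (fatal_signal_pending(current))
      		return -EINTR;
      	if (down_read_killable(&mm->mmap_sem))
      		return -EINTR;

      	/* when faulting on behalf of gup, make the wait killable */
      	fault_flags |= FAULT_FLAG_KILLABLE;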
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Brian Geffon <bgeffon@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Bobby Powers <bobbypowers@gmail.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Martin Cracauer <cracauer@cons.org>
      Cc: Marty McFadden <mcfadden8@llnl.gov>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Maya Gokhale <gokhale2@llnl.gov>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Link: http://lkml.kernel.org/r/20200220160256.9887-1-peterx@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      71335f37
    • mm/gup: allow VM_FAULT_RETRY for multiple times · 4426e945
      Committed by Peter Xu
      This is the gup counterpart of the change that allows VM_FAULT_RETRY to
      happen more than once.  One thing to mention is that we must check for a
      fatal signal here before retrying, because gup can be interrupted by it;
      otherwise we could loop forever.
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Brian Geffon <bgeffon@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Bobby Powers <bobbypowers@gmail.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Martin Cracauer <cracauer@cons.org>
      Cc: Marty McFadden <mcfadden8@llnl.gov>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Maya Gokhale <gokhale2@llnl.gov>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Link: http://lkml.kernel.org/r/20200220195357.16371-1-peterx@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4426e945
    • mm/gup: rename "nonblocking" to "locked" where proper · 4f6da934
      Committed by Peter Xu
      Patch series "mm: Page fault enhancements", v6.
      
      This series contains cleanups and enhancements to current page fault
      logic.  The whole idea comes from the discussion between Andrea and Linus
      on the bug reported by syzbot here:
      
        https://lkml.org/lkml/2017/11/2/833
      
      Basically it does two things:
      
        (a) Allows the page fault logic to be more interactive on not only
            SIGKILL, but also the rest of userspace signals, and,
      
        (b) Allows the page fault retry (VM_FAULT_RETRY) to happen for more
            than once.
      
      For (a): with the changes we should be able to react faster when page
      faults are working in parallel with userspace signals like SIGSTOP and
      SIGCONT (and more), and with that we can remove the buggy part in
      userfaultfd and benefit the whole page fault mechanism on faster signal
      processing to reach the userspace.
      
      For (b), we should be able to allow the page fault handler to loop even
      more than twice.  Some context: for now, since we have
      FAULT_FLAG_ALLOW_RETRY, we can retry the page fault once within the same
      interrupt context, but never more than twice.  Removing this assumption
      is not only a potential cleanup, since AFAIU the code itself doesn't
      really have this twice-only limitation (though that may have been a
      protective approach in the past); at the same time it'll greatly simplify
      future work like userfaultfd write-protect, where it's possible to retry
      more than twice (please have a look at [1] below for a possible user that
      might require the page fault to be handled a third time; if we can remove
      the retry limitation we can simply drop that patch and its complexity).
      
      This patch (of 16):
      
      There are plenty of places around __get_user_pages() that have a
      parameter "nonblocking" which does not really mean "it won't block"
      (because it can really block), but instead shows whether the mmap_sem is
      released by up_read() during the page fault handling, mostly when
      VM_FAULT_RETRY is returned.
      
      We have the correct naming in e.g.  get_user_pages_locked() or
      get_user_pages_remote() as "locked", however there are still many places
      that are using "nonblocking" as the name.
      
      Rename those places to "locked" where proper to better suit the
      functionality of the variable.  While at it, fix up some of the comments
      accordingly.
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Brian Geffon <bgeffon@google.com>
      Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Reviewed-by: Jerome Glisse <jglisse@redhat.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Martin Cracauer <cracauer@cons.org>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Bobby Powers <bobbypowers@gmail.com>
      Cc: Maya Gokhale <gokhale2@llnl.gov>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Marty McFadden <mcfadden8@llnl.gov>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Link: http://lkml.kernel.org/r/20200220155353.8676-2-peterx@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4f6da934
    • mm/gup: page->hpage_pinned_refcount: exact pin counts for huge pages · 47e29d32
      Committed by John Hubbard
      For huge pages (and in fact, any compound page), the GUP_PIN_COUNTING_BIAS
      scheme tends to overflow too easily: each tail page increments the head
      page->_refcount by GUP_PIN_COUNTING_BIAS (1024).  That limits the number
      of huge pages that can be pinned.
      
      This patch removes that limitation, by using an exact form of pin counting
      for compound pages of order > 1.  The "order > 1" is required because this
      approach uses the 3rd struct page in the compound page, and order 1
      compound pages only have two pages, so that won't work there.
      
      A new struct page field, hpage_pinned_refcount, has been added, replacing
      a padding field in the union (so no new space is used).
      
      This enhancement also has a useful side effect: huge pages and compound
      pages (of order > 1) do not suffer from the "potential false positives"
      problem that is discussed in the page_dma_pinned() comment block.  That is
      because these compound pages have extra space for tracking things, so they
      get exact pin counts instead of overloading page->_refcount.
      
      Documentation/core-api/pin_user_pages.rst is updated accordingly.
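
      A hedged sketch of the resulting helper (the field placement in the third
      struct page matches the description above; the exact code is an
      assumption):

      	static inline atomic_t *compound_pincount_ptr(struct page *page)
      	{
      		/* the exact pin count lives in the 3rd struct page of the compound */
      		return &page[2].hpage_pinned_refcount;
      	}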
      Suggested-by: Jan Kara <jack@suse.cz>
      Signed-off-by: John Hubbard <jhubbard@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Link: http://lkml.kernel.org/r/20200211001536.1027652-8-jhubbard@nvidia.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      47e29d32
    • mm/gup: track FOLL_PIN pages · 3faa52c0
      Committed by John Hubbard
      Add tracking of pages that were pinned via FOLL_PIN.  This tracking is
      implemented via overloading of page->_refcount: pins are added by adding
      GUP_PIN_COUNTING_BIAS (1024) to the refcount.  This provides a fuzzy
      indication of pinning, and it can have false positives (and that's OK).
      Please see the pre-existing Documentation/core-api/pin_user_pages.rst for
      details.
      
      As mentioned in pin_user_pages.rst, callers who effectively set FOLL_PIN
      (typically via pin_user_pages*()) are required to ultimately free such
      pages via unpin_user_page().
      
      Please also note the limitation, discussed in pin_user_pages.rst under the
      "TODO: for 1GB and larger huge pages" section.  (That limitation will be
      removed in a following patch.)
      
      The effect of a FOLL_PIN flag is similar to that of FOLL_GET, and may be
      thought of as "FOLL_GET for DIO and/or RDMA use".
      
      Pages that have been pinned via FOLL_PIN are identifiable via a new
      function call:
      
         bool page_maybe_dma_pinned(struct page *page);
      
      What to do in response to encountering such a page is left to later
      patchsets.  There is discussion about this in [1], [2], [3], and [4].
      
      This also changes a BUG_ON(), to a WARN_ON(), in follow_page_mask().
      
      [1] Some slow progress on get_user_pages() (Apr 2, 2019):
          https://lwn.net/Articles/784574/
      [2] DMA and get_user_pages() (LPC: Dec 12, 2018):
          https://lwn.net/Articles/774411/
      [3] The trouble with get_user_pages() (Apr 30, 2018):
          https://lwn.net/Articles/753027/
      [4] LWN kernel index: get_user_pages():
          https://lwn.net/Kernel/Index/#Memory_management-get_user_pages
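
      A hedged sketch of the fuzzy check named above, as introduced by this
      patch and before the exact pin counting added later (details are an
      assumption):

      	static inline bool page_maybe_dma_pinned(struct page *page)
      	{
      		/*
      		 * A refcount this large almost certainly means at least one
      		 * FOLL_PIN; false positives are acceptable by design.
      		 */
      		return page_ref_count(page) >= GUP_PIN_COUNTING_BIAS;
      	}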
      
      [jhubbard@nvidia.com: add kerneldoc]
        Link: http://lkml.kernel.org/r/20200307021157.235726-1-jhubbard@nvidia.com
      [imbrenda@linux.ibm.com: if pin fails, we need to unpin, a simple put_page will not be enough]
        Link: http://lkml.kernel.org/r/20200306132537.783769-2-imbrenda@linux.ibm.com
      [akpm@linux-foundation.org: fix put_compound_head defined but not used]
      Suggested-by: Jan Kara <jack@suse.cz>
      Suggested-by: Jérôme Glisse <jglisse@redhat.com>
      Signed-off-by: John Hubbard <jhubbard@nvidia.com>
      Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Link: http://lkml.kernel.org/r/20200211001536.1027652-7-jhubbard@nvidia.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3faa52c0
  7. 05 Jan 2020, 1 commit
    • mm/hugetlb: defer freeing of huge pages if in non-task context · c77c0a8a
      Committed by Waiman Long
      The following lockdep splat was observed when a certain hugetlbfs test
      was run:
      
        ================================
        WARNING: inconsistent lock state
        4.18.0-159.el8.x86_64+debug #1 Tainted: G        W --------- -  -
        --------------------------------
        inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
        swapper/30/0 [HC0[0]:SC1[1]:HE1:SE0] takes:
        ffffffff9acdc038 (hugetlb_lock){+.?.}, at: free_huge_page+0x36f/0xaa0
        {SOFTIRQ-ON-W} state was registered at:
          lock_acquire+0x14f/0x3b0
          _raw_spin_lock+0x30/0x70
          __nr_hugepages_store_common+0x11b/0xb30
          hugetlb_sysctl_handler_common+0x209/0x2d0
          proc_sys_call_handler+0x37f/0x450
          vfs_write+0x157/0x460
          ksys_write+0xb8/0x170
          do_syscall_64+0xa5/0x4d0
          entry_SYSCALL_64_after_hwframe+0x6a/0xdf
        irq event stamp: 691296
        hardirqs last  enabled at (691296): [<ffffffff99bb034b>] _raw_spin_unlock_irqrestore+0x4b/0x60
        hardirqs last disabled at (691295): [<ffffffff99bb0ad2>] _raw_spin_lock_irqsave+0x22/0x81
        softirqs last  enabled at (691284): [<ffffffff97ff0c63>] irq_enter+0xc3/0xe0
        softirqs last disabled at (691285): [<ffffffff97ff0ebe>] irq_exit+0x23e/0x2b0
      
        other info that might help us debug this:
         Possible unsafe locking scenario:
      
               CPU0
               ----
          lock(hugetlb_lock);
          <Interrupt>
            lock(hugetlb_lock);
      
         *** DEADLOCK ***
            :
        Call Trace:
         <IRQ>
         __lock_acquire+0x146b/0x48c0
         lock_acquire+0x14f/0x3b0
         _raw_spin_lock+0x30/0x70
         free_huge_page+0x36f/0xaa0
         bio_check_pages_dirty+0x2fc/0x5c0
         clone_endio+0x17f/0x670 [dm_mod]
         blk_update_request+0x276/0xe50
         scsi_end_request+0x7b/0x6a0
         scsi_io_completion+0x1c6/0x1570
         blk_done_softirq+0x22e/0x350
         __do_softirq+0x23d/0xad8
         irq_exit+0x23e/0x2b0
         do_IRQ+0x11a/0x200
         common_interrupt+0xf/0xf
         </IRQ>
      
      Both the hugetlb_lock and the subpool lock can be acquired in
      free_huge_page().  One way to solve the problem is to make both locks
      irq-safe.  However, Mike Kravetz had learned that the hugetlb_lock is
      held for a linear scan of ALL hugetlb pages during a cgroup reparenting
      operation.  So that is just too long to have irqs disabled unless we can
      break hugetlb_lock down into finer-grained locks with shorter lock hold
      times.
      
      Another alternative is to defer the freeing to a workqueue job.  This
      patch implements the deferred freeing by adding a free_hpage_workfn()
      work function to do the actual freeing.  The free_huge_page() call in a
      non-task context saves the page to be freed in the hpage_freelist linked
      list in a lockless manner using the llist APIs.
      
      The generic workqueue is used to process the work, but a dedicated
      workqueue can be used instead if it is desirable to have the huge page
      freed ASAP.
      
      Thanks to Kirill Tkhai <ktkhai@virtuozzo.com> for suggesting the use of
      the llist APIs, which simplify the code.
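
      A hedged sketch of the deferred-freeing scheme described above (helper
      names follow the description; treat the details as an assumption):

      	static LLIST_HEAD(hpage_freelist);

      	static void free_hpage_workfn(struct work_struct *work)
      	{
      		struct llist_node *node;

      		node = llist_del_all(&hpage_freelist);
      		while (node) {
      			struct page *page;

      			/* the llist node is carried in page->mapping */
      			page = container_of((struct address_space **)node,
      					    struct page, mapping);
      			node = node->next;
      			__free_huge_page(page);
      		}
      	}
      	static DECLARE_WORK(free_hpage_work, free_hpage_workfn);

      	void free_huge_page(struct page *page)
      	{
      		if (!in_task()) {
      			/* lockless push; kick the (generic) workqueue */
      			if (llist_add((struct llist_node *)&page->mapping,
      				      &hpage_freelist))
      				schedule_work(&free_hpage_work);
      			return;
      		}
      		__free_huge_page(page);	/* the original freeing path */
      	}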
      
      Link: http://lkml.kernel.org/r/20191217170331.30893-1-longman@redhat.com
      Signed-off-by: Waiman Long <longman@redhat.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Acked-by: Davidlohr Bueso <dbueso@suse.de>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c77c0a8a
  8. 02 Dec 2019, 7 commits
  9. 19 Oct 2019, 1 commit
  10. 25 Sep 2019, 1 commit
  11. 14 Aug 2019, 1 commit
    • hugetlbfs: fix hugetlb page migration/fault race causing SIGBUS · 4643d67e
      Committed by Mike Kravetz
      Li Wang discovered that LTP/move_page12 V2 sometimes triggers SIGBUS in
      the kernel-v5.2.3 testing.  This is caused by a race between hugetlb
      page migration and page fault.
      
      If a hugetlb page can not be allocated to satisfy a page fault, the task
      is sent SIGBUS.  This is normal hugetlbfs behavior.  A hugetlb fault
      mutex exists to prevent two tasks from trying to instantiate the same
      page.  This protects against the situation where there is only one
      hugetlb page, and both tasks would try to allocate.  Without the mutex,
      one would fail and SIGBUS even though the other fault would be
      successful.
      
      There is a similar race between hugetlb page migration and fault.
      Migration code will allocate a page for the target of the migration.  It
      will then unmap the original page from all page tables.  It does this
      unmap by first clearing the pte and then writing a migration entry.  The
      page table lock is held for the duration of this clear and write
      operation.  However, the beginning of the hugetlb page fault code
      optimistically checks the pte without taking the page table lock.  If it
      is clear (as it can be during the migration unmap operation), a hugetlb
      page allocation is attempted to satisfy the fault.  Note that the page
      which will eventually satisfy this fault was already allocated by the
      migration code.  However, the allocation within the fault path could
      fail which would result in the task incorrectly being sent SIGBUS.
      
      Ideally, we could take the hugetlb fault mutex in the migration code
      when modifying the page tables.  However, locks must be taken in the
      order of hugetlb fault mutex, page lock, page table lock.  This would
      require significant rework of the migration code.  Instead, the issue is
      addressed in the hugetlb fault code.  After failing to allocate a huge
      page, take the page table lock and check for huge_pte_none before
      returning an error.  This is the same check that must be made further in
      the code even if page allocation is successful.
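
      A hedged sketch of the check described above in the fault path (an
      illustration of the described fix, not an exact quote):

      	page = alloc_huge_page(vma, haddr, 0);
      	if (IS_ERR(page)) {
      		/*
      		 * Recheck under the page table lock: if the pte is no longer
      		 * none we raced with migration (or another fault) and the page
      		 * is already set up, so do not send SIGBUS.
      		 */
      		ptl = huge_pte_lock(h, mm, ptep);
      		if (!huge_pte_none(huge_ptep_get(ptep))) {
      			ret = 0;
      			spin_unlock(ptl);
      			goto out;
      		}
      		spin_unlock(ptl);
      		ret = vmf_error(PTR_ERR(page));
      		goto out;
      	}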
      
      Link: http://lkml.kernel.org/r/20190808000533.7701-1-mike.kravetz@oracle.com
      Fixes: 290408d4 ("hugetlb: hugepage migration core")
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reported-by: Li Wang <liwang@redhat.com>
      Tested-by: Li Wang <liwang@redhat.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Cyril Hrubis <chrubis@suse.cz>
      Cc: Xishi Qiu <xishi.qiuxishi@alibaba-inc.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4643d67e
  12. 29 Jun 2019, 1 commit
  13. 21 May 2019, 1 commit
  14. 15 May 2019, 2 commits