1. 21 April 2022, 1 commit
  2. 26 January 2022, 1 commit
      mm: mempolicy: fix THP allocations escaping mempolicy restrictions · 32a6b9a8
      Authored by Andrey Ryabinin
      stable inclusion
      from stable-v5.10.89
      commit ee6f34215c5dfa2257298cc362cd79e14af5a25a
      bugzilla: 186140 https://gitee.com/openeuler/kernel/issues/I4S8HA
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=ee6f34215c5dfa2257298cc362cd79e14af5a25a
      
      --------------------------------
      
      alloc_pages_vma() may try to allocate a THP page on the local NUMA
      node first:
      
      	page = __alloc_pages_node(hpage_node,
      		gfp | __GFP_THISNODE | __GFP_NORETRY, order);
      
      If that allocation fails, it retries while allowing remote memory:
      
      	if (!page && (gfp & __GFP_DIRECT_RECLAIM))
          		page = __alloc_pages_node(hpage_node,
      					gfp, order);
      
      However, this retry completely ignores the memory policy nodemask,
      allowing the allocation to escape the mempolicy restrictions.
      
      The first appearance of this bug seems to be the commit ac5b2c18
      ("mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings").
      
      The bug disappeared later in the commit 89c83fb5 ("mm, thp:
      consolidate THP gfp handling into alloc_hugepage_direct_gfpmask") and
      reappeared again in a slightly different form in the commit 76e654cc
      ("mm, page_alloc: allow hugepage fallback to remote nodes when
      madvised").
      
      Fix this by passing the correct nodemask to the __alloc_pages() call.
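      A minimal sketch of the fixed retry, assuming the nodemask computed
      earlier in alloc_pages_vma() via policy_nodemask() is available as
      nmask (simplified, not the verbatim upstream diff):

      	/* first attempt: local node only, compaction without reclaim */
      	page = __alloc_pages_node(hpage_node,
      		gfp | __GFP_THISNODE | __GFP_NORETRY, order);

      	/* retry on any allowed node, honouring the policy nodemask */
      	if (!page && (gfp & __GFP_DIRECT_RECLAIM))
      		page = __alloc_pages(gfp, order, hpage_node, nmask);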
      
      The demonstration/reproducer of the problem:
      
          $ mount -oremount,size=4G,huge=always /dev/shm/
          $ echo always > /sys/kernel/mm/transparent_hugepage/defrag
          $ cat mbind_thp.c
          #include <unistd.h>
          #include <sys/mman.h>
          #include <sys/stat.h>
          #include <fcntl.h>
          #include <assert.h>
          #include <stdlib.h>
          #include <stdio.h>
          #include <numaif.h>
      
          #define SIZE (2ULL << 30)
          int main(int argc, char **argv)
          {
              int fd;
              unsigned long long i;
              char *addr;
              pid_t pid;
              char buf[100];
              unsigned long nodemask = 1;
      
              fd = open("/dev/shm/test", O_RDWR|O_CREAT, 0600);
              assert(fd >= 0);
              assert(ftruncate(fd, SIZE) == 0);
      
              addr = mmap(NULL, SIZE, PROT_READ|PROT_WRITE,
                                 MAP_SHARED, fd, 0);
              assert(addr != MAP_FAILED);
      
              assert(mbind(addr, SIZE, MPOL_BIND, &nodemask, 2, MPOL_MF_STRICT|MPOL_MF_MOVE)==0);
              for (i = 0; i < SIZE; i+=4096) {
                addr[i] = 1;
              }
              pid = getpid();
              snprintf(buf, sizeof(buf), "grep shm /proc/%d/numa_maps", pid);
              system(buf);
              sleep(10000);
      
              return 0;
          }
          $ gcc mbind_thp.c -o mbind_thp -lnuma
          $ numactl -H
          available: 2 nodes (0-1)
          node 0 cpus: 0 2
          node 0 size: 1918 MB
          node 0 free: 1595 MB
          node 1 cpus: 1 3
          node 1 size: 2014 MB
          node 1 free: 1731 MB
          node distances:
          node   0   1
            0:  10  20
            1:  20  10
          $ rm -f /dev/shm/test; taskset -c 0 ./mbind_thp
          7fd970a00000 bind:0 file=/dev/shm/test dirty=524288 active=0 N0=396800 N1=127488 kernelpagesize_kB=4

      Despite the MPOL_BIND policy restricting the mapping to node 0, the
      numa_maps output above shows 127488 pages placed on node 1.
      
      Link: https://lkml.kernel.org/r/20211208165343.22349-1-arbn@yandex-team.com
      Fixes: ac5b2c18 ("mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings")
      Signed-off-by: Andrey Ryabinin <arbn@yandex-team.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
  3. 29 November 2021, 3 commits
  4. 14 July 2021, 6 commits
  5. 03 November 2020, 1 commit
  6. 14 October 2020, 2 commits
  7. 15 August 2020, 1 commit
  8. 13 August 2020, 5 commits
  9. 17 July 2020, 1 commit
  10. 10 June 2020, 3 commits
  11. 04 June 2020, 1 commit
  12. 08 April 2020, 6 commits
  13. 03 April 2020, 4 commits
  14. 18 February 2020, 2 commits
  15. 01 February 2020, 1 commit
  16. 14 January 2020, 1 commit
      mm, thp: tweak reclaim/compaction effort of local-only and all-node allocations · cc638f32
      Authored by Vlastimil Babka
      THP page faults now attempt a __GFP_THISNODE allocation first, which
      should only compact existing free memory, followed by another attempt
      that can allocate from any node using reclaim/compaction effort
      specified by global defrag setting and madvise.
      
      This patch makes the following changes to the scheme:
      
       - Before the patch, the first allocation relies on a check for
         pageblock order and __GFP_IO to prevent excessive reclaim. This
         however also affects the second attempt, which is not limited to a
         single node.
      
         Instead of that, reuse the existing check for costly order
         __GFP_NORETRY allocations, and make sure the first THP attempt uses
         __GFP_NORETRY. As a side-effect, all costly order __GFP_NORETRY
         allocations will bail out if compaction needs reclaim, while
         previously they only bailed out when compaction was deferred due to
         previous failures.
      
         This should be still acceptable within the __GFP_NORETRY semantics.
      
       - Before the patch, the second allocation attempt (on all nodes) was
         passing __GFP_NORETRY. This is redundant as the check for pageblock
         order (discussed above) was stronger. It's also contrary to
         madvise(MADV_HUGEPAGE) which means some effort to allocate THP is
         requested.
      
         After this patch, the second attempt doesn't pass __GFP_THISNODE nor
         __GFP_NORETRY.
      
      To sum up, THP page faults now try the following attempts (a sketch
      follows the list):

      1. local node only THP allocation with no reclaim, just compaction.
      2. for madvised VMAs, or when synchronous compaction is always
         enabled: THP allocation from any node, with effort determined by
         the global defrag setting and the VMA's madvise
      3. fallback to base pages on any node
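
      A simplified sketch of the resulting attempts in alloc_pages_vma(),
      illustrative rather than the exact upstream diff (hpage_node is the
      preferred local node chosen for the THP fault):

      	/* 1. local node only: compaction, bail out rather than reclaim */
      	page = __alloc_pages_node(hpage_node,
      		gfp | __GFP_THISNODE | __GFP_NORETRY, order);

      	/*
      	 * 2. madvise/defrag allowed direct reclaim: retry on any node
      	 * with full effort, without __GFP_THISNODE or __GFP_NORETRY.
      	 */
      	if (!page && (gfp & __GFP_DIRECT_RECLAIM))
      		page = __alloc_pages_node(hpage_node, gfp, order);

      	/* 3. if this also fails, the fault path falls back to base pages */

      Note that the mempolicy fix quoted earlier in this log later changed
      the second attempt to also pass the policy nodemask.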
      
      Link: http://lkml.kernel.org/r/08a3f4dd-c3ce-0009-86c5-9ee51aba8557@suse.cz
      Fixes: b39d0ee2 ("mm, page_alloc: avoid expensive reclaim when compaction may not succeed")
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  17. 02 December 2019, 1 commit
      mm/mempolicy.c: fix checking unmapped holes for mbind · f18da660
      Authored by Li Xinhai
      mbind() is required to report EFAULT if the range, specified by addr
      and len, contains unmapped holes.  In the current implementation, the
      following rules are applied for this check:

       1: Unmapped holes in any part of the specified range are reported as
          EFAULT when mbind() is called with a non-MPOL_DEFAULT policy;

       2: Unmapped holes in any part of the specified range are ignored (no
          EFAULT is reported) when mbind() is called with MPOL_DEFAULT;

       3: A range lying entirely within an unmapped hole is reported as
          EFAULT;

      Note that rule 2 does not fulfill the mbind() API definition, but since
      that behavior has existed for a long time (the internal flag
      MPOL_MF_DISCONTIG_OK exists for this purpose), this patch does not
      change it.
      
      In the current code, applications observe inconsistent behavior with
      respect to rules 1 and 2.  That inconsistency is fixed as detailed
      below.
      
      Cases of rule 1:
      
       - Hole at the head side of the range. The current code reports
         EFAULT; no change by this patch.
      
          [  vma  ][ hole ][  vma  ]
                      [  range  ]
      
       - Hole at the middle of the range. The current code reports EFAULT;
         no change by this patch.
      
          [  vma  ][ hole ][ vma ]
             [     range      ]
      
       - Hole at the tail side of the range. The current code does not
         report EFAULT; this patch fixes it.
      
          [  vma  ][ hole ][ vma ]
             [  range  ]
      
      Cases of rule 2:
      
       - Hole at the head side of the range. The current code reports
         EFAULT; this patch fixes it.
      
          [  vma  ][ hole ][  vma  ]
                      [  range  ]
      
       - Hole at the middle of the range. The current code does not report
         EFAULT; no change by this patch.
      
          [  vma  ][ hole ][ vma]
             [     range      ]
      
       - Hole at the tail side of the range. The current code does not
         report EFAULT; no change by this patch.
      
          [  vma  ][ hole ][ vma]
             [  range  ]
      
      This patch makes no changes to rule 3.
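
      As a hypothetical userspace illustration of rule 1 with a tail-side
      hole (not part of the original patch; the file name and page size are
      assumptions), the following program maps two pages, unmaps the second
      one, and calls mbind() across both; after this patch the call fails
      with EFAULT:

          #include <stdio.h>
          #include <errno.h>
          #include <string.h>
          #include <sys/mman.h>
          #include <numaif.h>

          int main(void)
          {
              long page = 4096;             /* assume 4 KiB pages */
              unsigned long nodemask = 1;   /* bind to node 0 only */
              char *addr;

              /* [ vma ][ hole ]: map two pages, then unmap the second one */
              addr = mmap(NULL, 2 * page, PROT_READ|PROT_WRITE,
                          MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
              if (addr == MAP_FAILED)
                  return 1;
              munmap(addr + page, page);

              /* the range covers the vma and the trailing unmapped hole */
              if (mbind(addr, 2 * page, MPOL_BIND, &nodemask, 2,
                        MPOL_MF_STRICT) != 0)
                  printf("mbind: %s (EFAULT expected after this patch)\n",
                         strerror(errno));
              return 0;
          }
          $ gcc mbind_hole.c -o mbind_hole -lnuma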
      
      The unmapped hole checking could also be handled by using .pte_hole()
      instead of .test_walk().  But .pte_hole() is called for holes both
      inside and outside of vmas, which incurs more cost, so this patch
      keeps the original design with .test_walk().
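
      A rough sketch of the hole detection in the .test_walk() callback as
      described above (abridged; the queue_pages fields shown approximate
      the bookkeeping used by this patch and are not a verbatim copy):

      	static int queue_pages_test_walk(unsigned long start, unsigned long end,
      					 struct mm_walk *walk)
      	{
      		struct queue_pages *qp = walk->private;
      		struct vm_area_struct *vma = walk->vma;
      		unsigned long flags = qp->flags;

      		/* a gap before the first visited vma is a head-side hole */
      		if (!qp->first) {
      			qp->first = vma;
      			if (!(flags & MPOL_MF_DISCONTIG_OK) &&
      			    qp->start < vma->vm_start)
      				return -EFAULT;
      		}

      		/* a gap after this vma but before qp->end is a middle/tail hole */
      		if (!(flags & MPOL_MF_DISCONTIG_OK) &&
      		    vma->vm_end < qp->end &&
      		    (!vma->vm_next || vma->vm_end < vma->vm_next->vm_start))
      			return -EFAULT;

      		/* ... existing policy and flag handling continues here ... */
      		return 0;
      	}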
      
      Link: http://lkml.kernel.org/r/1573218104-11021-3-git-send-email-lixinhai.lxh@gmail.com
      Fixes: 6f4576e3 ("mempolicy: apply page table walker on queue_pages_range()")
      Signed-off-by: Li Xinhai <lixinhai.lxh@gmail.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: linux-man <linux-man@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>