1. 25 6月, 2015 2 次提交
    • A
      mm: clarify that the function operates on hugepage pte · 8809aa2d
      Aneesh Kumar K.V 提交于
      We have confusing functions to clear pmd, pmd_clear_* and pmd_clear.  Add
      _huge_ to pmdp_clear functions so that we are clear that they operate on
      hugepage pte.
      
      We don't bother about other functions like pmdp_set_wrprotect,
      pmdp_clear_flush_young, because they operate on PTE bits and hence
      indicate they are operating on hugepage ptes
      Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8809aa2d
    • N
      mm: soft-offline: don't free target page in successful page migration · add05cec
      Naoya Horiguchi 提交于
      Stress testing showed that soft offline events for a process iterating
      "mmap-pagefault-munmap" loop can trigger
      VM_BUG_ON(PAGE_FLAGS_CHECK_AT_PREP) in __free_one_page():
      
        Soft offlining page 0x70fe1 at 0x70100008d000
        Soft offlining page 0x705fb at 0x70300008d000
        page:ffffea0001c3f840 count:0 mapcount:0 mapping:          (null) index:0x2
        flags: 0x1fffff80800000(hwpoison)
        page dumped because: VM_BUG_ON_PAGE(page->flags & ((1 << 25) - 1))
        ------------[ cut here ]------------
        kernel BUG at /src/linux-dev/mm/page_alloc.c:585!
        invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
        Modules linked in: cfg80211 rfkill crc32c_intel microcode ppdev parport_pc pcspkr serio_raw virtio_balloon parport i2c_piix4 virtio_blk virtio_net ata_generic pata_acpi floppy
        CPU: 3 PID: 1779 Comm: test_base_madv_ Not tainted 4.0.0-v4.0-150511-1451-00009-g82360a3730e6 #139
        RIP: free_pcppages_bulk+0x52a/0x6f0
        Call Trace:
          drain_pages_zone+0x3d/0x50
          drain_local_pages+0x1d/0x30
          on_each_cpu_mask+0x46/0x80
          drain_all_pages+0x14b/0x1e0
          soft_offline_page+0x432/0x6e0
          SyS_madvise+0x73c/0x780
          system_call_fastpath+0x12/0x17
        Code: ff 89 45 b4 48 8b 45 c0 48 83 b8 a8 00 00 00 00 0f 85 e3 fb ff ff 0f 1f 00 0f 0b 48 8b 7d 90 48 c7 c6 e8 95 a6 81 e8 e6 32 02 00 <0f> 0b 8b 45 cc 49 89 47 30 41 8b 47 18 83 f8 ff 0f 85 10 ff ff
        RIP  [<ffffffff811a806a>] free_pcppages_bulk+0x52a/0x6f0
         RSP <ffff88007a117d28>
        ---[ end trace 53926436e76d1f35 ]---
      
      When soft offline successfully migrates page, the source page is supposed
      to be freed.  But there is a race condition where a source page looks
      isolated (i.e.  the refcount is 0 and the PageHWPoison is set) but
      somewhat linked to pcplist.  Then another soft offline event calls
      drain_all_pages() and tries to free such hwpoisoned page, which is
      forbidden.
      
      This odd page state seems to happen due to the race between put_page() in
      putback_lru_page() and __pagevec_lru_add_fn().  But I don't want to play
      with tweaking drain code as done in commit 9ab3b598 "mm: hwpoison:
      drop lru_add_drain_all() in __soft_offline_page()", or to change page
      freeing code for this soft offline's purpose.
      
      Instead, let's think about the difference between hard offline and soft
      offline.  There is an interesting difference in how to isolate the in-use
      page between these, that is, hard offline marks PageHWPoison of the target
      page at first, and doesn't free it by keeping its refcount 1.  OTOH, soft
      offline tries to free the target page then marks PageHWPoison.  This
      difference might be the source of complexity and result in bugs like the
      above.  So making soft offline isolate with keeping refcount can be a
      solution for this problem.
      
      We can pass to page migration code the "reason" which shows the caller, so
      let's use this more to avoid calling putback_lru_page() when called from
      soft offline, which effectively does the isolation for soft offline.  With
      this change, target pages of soft offline never be reused without changing
      migratetype, so this patch also removes the related code.
      Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      add05cec
  2. 16 4月, 2015 1 次提交
  3. 15 4月, 2015 2 次提交
  4. 13 2月, 2015 2 次提交
    • M
      mm: convert p[te|md]_mknonnuma and remaining page table manipulations · 4d942466
      Mel Gorman 提交于
      With PROT_NONE, the traditional page table manipulation functions are
      sufficient.
      
      [andre.przywara@arm.com: fix compiler warning in pmdp_invalidate()]
      [akpm@linux-foundation.org: fix build with STRICT_MM_TYPECHECKS]
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Acked-by: NAneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Tested-by: NSasha Levin <sasha.levin@oracle.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4d942466
    • M
      mm: numa: do not dereference pmd outside of the lock during NUMA hinting fault · 5d833062
      Mel Gorman 提交于
      Automatic NUMA balancing depends on being able to protect PTEs to trap a
      fault and gather reference locality information.  Very broadly speaking
      it would mark PTEs as not present and use another bit to distinguish
      between NUMA hinting faults and other types of faults.  It was
      universally loved by everybody and caused no problems whatsoever.  That
      last sentence might be a lie.
      
      This series is very heavily based on patches from Linus and Aneesh to
      replace the existing PTE/PMD NUMA helper functions with normal change
      protections.  I did alter and add parts of it but I consider them
      relatively minor contributions.  At their suggestion, acked-bys are in
      there but I've no problem converting them to Signed-off-by if requested.
      
      AFAIK, this has received no testing on ppc64 and I'm depending on Aneesh
      for that.  I tested trinity under kvm-tool and passed and ran a few
      other basic tests.  At the time of writing, only the short-lived tests
      have completed but testing of V2 indicated that long-term testing had no
      surprises.  In most cases I'm leaving out detail as it's not that
      interesting.
      
      specjbb single JVM: There was negligible performance difference in the
      	benchmark itself for short runs. However, system activity is
      	higher and interrupts are much higher over time -- possibly TLB
      	flushes. Migrations are also higher. Overall, this is more overhead
      	but considering the problems faced with the old approach I think
      	we just have to suck it up and find another way of reducing the
      	overhead.
      
      specjbb multi JVM: Negligible performance difference to the actual benchmark
      	but like the single JVM case, the system overhead is noticeably
      	higher.  Again, interrupts are a major factor.
      
      autonumabench: This was all over the place and about all that can be
      	reasonably concluded is that it's different but not necessarily
      	better or worse.
      
      autonumabench
                                           3.18.0-rc5            3.18.0-rc5
                                       mmotm-20141119         protnone-v3r3
      User    NUMA01               32380.24 (  0.00%)    21642.92 ( 33.16%)
      User    NUMA01_THEADLOCAL    22481.02 (  0.00%)    22283.22 (  0.88%)
      User    NUMA02                3137.00 (  0.00%)     3116.54 (  0.65%)
      User    NUMA02_SMT            1614.03 (  0.00%)     1543.53 (  4.37%)
      System  NUMA01                 322.97 (  0.00%)     1465.89 (-353.88%)
      System  NUMA01_THEADLOCAL       91.87 (  0.00%)       49.32 ( 46.32%)
      System  NUMA02                  37.83 (  0.00%)       14.61 ( 61.38%)
      System  NUMA02_SMT               7.36 (  0.00%)        7.45 ( -1.22%)
      Elapsed NUMA01                 716.63 (  0.00%)      599.29 ( 16.37%)
      Elapsed NUMA01_THEADLOCAL      553.98 (  0.00%)      539.94 (  2.53%)
      Elapsed NUMA02                  83.85 (  0.00%)       83.04 (  0.97%)
      Elapsed NUMA02_SMT              86.57 (  0.00%)       79.15 (  8.57%)
      CPU     NUMA01                4563.00 (  0.00%)     3855.00 ( 15.52%)
      CPU     NUMA01_THEADLOCAL     4074.00 (  0.00%)     4136.00 ( -1.52%)
      CPU     NUMA02                3785.00 (  0.00%)     3770.00 (  0.40%)
      CPU     NUMA02_SMT            1872.00 (  0.00%)     1959.00 ( -4.65%)
      
      System CPU usage of NUMA01 is worse but it's an adverse workload on this
      machine so I'm reluctant to conclude that it's a problem that matters.  On
      the other workloads that are sensible on this machine, system CPU usage is
      great.  Overall time to complete the benchmark is comparable
      
                3.18.0-rc5  3.18.0-rc5
              mmotm-20141119protnone-v3r3
      User        59612.50    48586.44
      System        460.22     1537.45
      Elapsed      1442.20     1304.29
      
      NUMA alloc hit                 5075182     5743353
      NUMA alloc miss                      0           0
      NUMA interleave hit                  0           0
      NUMA alloc local               5075174     5743339
      NUMA base PTE updates        637061448   443106883
      NUMA huge PMD updates          1243434      864747
      NUMA page range updates     1273699656   885857347
      NUMA hint faults               1658116     1214277
      NUMA hint local faults          959487      754113
      NUMA hint local percent             57          62
      NUMA pages migrated            5467056    61676398
      
      The NUMA pages migrated look terrible but when I looked at a graph of the
      activity over time I see that the massive spike in migration activity was
      during NUMA01.  This correlates with high system CPU usage and could be
      simply down to bad luck but any modifications that affect that workload
      would be related to scan rates and migrations, not the protection
      mechanism.  For all other workloads, migration activity was comparable.
      
      Overall, headline performance figures are comparable but the overhead is
      higher, mostly in interrupts.  To some extent, higher overhead from this
      approach was anticipated but not to this degree.  It's going to be
      necessary to reduce this again with a separate series in the future.  It's
      still worth going ahead with this series though as it's likely to avoid
      constant headaches with Xen and is probably easier to maintain.
      
      This patch (of 10):
      
      A transhuge NUMA hinting fault may find the page is migrating and should
      wait until migration completes.  The check is race-prone because the pmd
      is deferenced outside of the page lock and while the race is tiny, it'll
      be larger if the PMD is cleared while marking PMDs for hinting fault.
      This patch closes the race.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5d833062
  5. 12 2月, 2015 1 次提交
    • N
      mm/hugetlb: take page table lock in follow_huge_pmd() · e66f17ff
      Naoya Horiguchi 提交于
      We have a race condition between move_pages() and freeing hugepages, where
      move_pages() calls follow_page(FOLL_GET) for hugepages internally and
      tries to get its refcount without preventing concurrent freeing.  This
      race crashes the kernel, so this patch fixes it by moving FOLL_GET code
      for hugepages into follow_huge_pmd() with taking the page table lock.
      
      This patch intentionally removes page==NULL check after pte_page.
      This is justified because pte_page() never returns NULL for any
      architectures or configurations.
      
      This patch changes the behavior of follow_huge_pmd() for tail pages and
      then tail pages can be pinned/returned.  So the caller must be changed to
      properly handle the returned tail pages.
      
      We could have a choice to add the similar locking to
      follow_huge_(addr|pud) for consistency, but it's not necessary because
      currently these functions don't support FOLL_GET flag, so let's leave it
      for future development.
      
      Here is the reproducer:
      
        $ cat movepages.c
        #include <stdio.h>
        #include <stdlib.h>
        #include <numaif.h>
      
        #define ADDR_INPUT      0x700000000000UL
        #define HPS             0x200000
        #define PS              0x1000
      
        int main(int argc, char *argv[]) {
                int i;
                int nr_hp = strtol(argv[1], NULL, 0);
                int nr_p  = nr_hp * HPS / PS;
                int ret;
                void **addrs;
                int *status;
                int *nodes;
                pid_t pid;
      
                pid = strtol(argv[2], NULL, 0);
                addrs  = malloc(sizeof(char *) * nr_p + 1);
                status = malloc(sizeof(char *) * nr_p + 1);
                nodes  = malloc(sizeof(char *) * nr_p + 1);
      
                while (1) {
                        for (i = 0; i < nr_p; i++) {
                                addrs[i] = (void *)ADDR_INPUT + i * PS;
                                nodes[i] = 1;
                                status[i] = 0;
                        }
                        ret = numa_move_pages(pid, nr_p, addrs, nodes, status,
                                              MPOL_MF_MOVE_ALL);
                        if (ret == -1)
                                err("move_pages");
      
                        for (i = 0; i < nr_p; i++) {
                                addrs[i] = (void *)ADDR_INPUT + i * PS;
                                nodes[i] = 0;
                                status[i] = 0;
                        }
                        ret = numa_move_pages(pid, nr_p, addrs, nodes, status,
                                              MPOL_MF_MOVE_ALL);
                        if (ret == -1)
                                err("move_pages");
                }
                return 0;
        }
      
        $ cat hugepage.c
        #include <stdio.h>
        #include <sys/mman.h>
        #include <string.h>
      
        #define ADDR_INPUT      0x700000000000UL
        #define HPS             0x200000
      
        int main(int argc, char *argv[]) {
                int nr_hp = strtol(argv[1], NULL, 0);
                char *p;
      
                while (1) {
                        p = mmap((void *)ADDR_INPUT, nr_hp * HPS, PROT_READ | PROT_WRITE,
                                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
                        if (p != (void *)ADDR_INPUT) {
                                perror("mmap");
                                break;
                        }
                        memset(p, 0, nr_hp * HPS);
                        munmap(p, nr_hp * HPS);
                }
        }
      
        $ sysctl vm.nr_hugepages=40
        $ ./hugepage 10 &
        $ ./movepages 10 $(pgrep -f hugepage)
      
      Fixes: e632a938 ("mm: migrate: add hugepage migration code to move_pages()")
      Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reported-by: NHugh Dickins <hughd@google.com>
      Cc: James Hogan <james.hogan@imgtec.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Steve Capper <steve.capper@linaro.org>
      Cc: <stable@vger.kernel.org>	[3.12+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e66f17ff
  6. 11 2月, 2015 1 次提交
  7. 17 12月, 2014 1 次提交
  8. 14 12月, 2014 1 次提交
    • H
      mm: unmapped page migration avoid unmap+remap overhead · 2ebba6b7
      Hugh Dickins 提交于
      Page migration's __unmap_and_move(), and rmap's try_to_unmap(), were
      created for use on pages almost certainly mapped into userspace.  But
      nowadays compaction often applies them to unmapped page cache pages: which
      may exacerbate contention on i_mmap_rwsem quite unnecessarily, since
      try_to_unmap_file() makes no preliminary page_mapped() check.
      
      Now check page_mapped() in __unmap_and_move(); and avoid repeating the
      same overhead in rmap_walk_file() - don't remove_migration_ptes() when we
      never inserted any.
      
      (The PageAnon(page) comment blocks now look even sillier than before, but
      clean that up on some other occasion.  And note in passing that
      try_to_unmap_one() does not use a migration entry when PageSwapCache, so
      remove_migration_ptes() will then not update that swap entry to newpage
      pte: not a big deal, but something else to clean up later.)
      
      Davidlohr remarked in "mm,fs: introduce helpers around the i_mmap_mutex"
      conversion to i_mmap_rwsem, that "The biggest winner of these changes is
      migration": a part of the reason might be all of that unnecessary taking
      of i_mmap_mutex in page migration; and it's rather a shame that I didn't
      get around to sending this patch in before his - this one is much less
      useful after Davidlohr's conversion to rwsem, but still good.
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2ebba6b7
  9. 10 10月, 2014 1 次提交
    • K
      mm/balloon_compaction: redesign ballooned pages management · d6d86c0a
      Konstantin Khlebnikov 提交于
      Sasha Levin reported KASAN splash inside isolate_migratepages_range().
      Problem is in the function __is_movable_balloon_page() which tests
      AS_BALLOON_MAP in page->mapping->flags.  This function has no protection
      against anonymous pages.  As result it tried to check address space flags
      inside struct anon_vma.
      
      Further investigation shows more problems in current implementation:
      
      * Special branch in __unmap_and_move() never works:
        balloon_page_movable() checks page flags and page_count.  In
        __unmap_and_move() page is locked, reference counter is elevated, thus
        balloon_page_movable() always fails.  As a result execution goes to the
        normal migration path.  virtballoon_migratepage() returns
        MIGRATEPAGE_BALLOON_SUCCESS instead of MIGRATEPAGE_SUCCESS,
        move_to_new_page() thinks this is an error code and assigns
        newpage->mapping to NULL.  Newly migrated page lose connectivity with
        balloon an all ability for further migration.
      
      * lru_lock erroneously required in isolate_migratepages_range() for
        isolation ballooned page.  This function releases lru_lock periodically,
        this makes migration mostly impossible for some pages.
      
      * balloon_page_dequeue have a tight race with balloon_page_isolate:
        balloon_page_isolate could be executed in parallel with dequeue between
        picking page from list and locking page_lock.  Race is rare because they
        use trylock_page() for locking.
      
      This patch fixes all of them.
      
      Instead of fake mapping with special flag this patch uses special state of
      page->_mapcount: PAGE_BALLOON_MAPCOUNT_VALUE = -256.  Buddy allocator uses
      PAGE_BUDDY_MAPCOUNT_VALUE = -128 for similar purpose.  Storing mark
      directly in struct page makes everything safer and easier.
      
      PagePrivate is used to mark pages present in page list (i.e.  not
      isolated, like PageLRU for normal pages).  It replaces special rules for
      reference counter and makes balloon migration similar to migration of
      normal pages.  This flag is protected by page_lock together with link to
      the balloon device.
      Signed-off-by: NKonstantin Khlebnikov <k.khlebnikov@samsung.com>
      Reported-by: NSasha Levin <sasha.levin@oracle.com>
      Link: http://lkml.kernel.org/p/53E6CEAA.9020105@oracle.com
      Cc: Rafael Aquini <aquini@redhat.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: <stable@vger.kernel.org>	[3.8+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d6d86c0a
  10. 03 10月, 2014 1 次提交
    • M
      mm: migrate: Close race between migration completion and mprotect · d3cb8bf6
      Mel Gorman 提交于
      A migration entry is marked as write if pte_write was true at the time the
      entry was created. The VMA protections are not double checked when migration
      entries are being removed as mprotect marks write-migration-entries as
      read. It means that potentially we take a spurious fault to mark PTEs write
      again but it's straight-forward. However, there is a race between write
      migrations being marked read and migrations finishing. This potentially
      allows a PTE to be write that should have been read. Close this race by
      double checking the VMA permissions using maybe_mkwrite when migration
      completes.
      
      [torvalds@linux-foundation.org: use maybe_mkwrite]
      Cc: stable@vger.kernel.org
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d3cb8bf6
  11. 09 8月, 2014 1 次提交
    • J
      mm: memcontrol: rewrite uncharge API · 0a31bc97
      Johannes Weiner 提交于
      The memcg uncharging code that is involved towards the end of a page's
      lifetime - truncation, reclaim, swapout, migration - is impressively
      complicated and fragile.
      
      Because anonymous and file pages were always charged before they had their
      page->mapping established, uncharges had to happen when the page type
      could still be known from the context; as in unmap for anonymous, page
      cache removal for file and shmem pages, and swap cache truncation for swap
      pages.  However, these operations happen well before the page is actually
      freed, and so a lot of synchronization is necessary:
      
      - Charging, uncharging, page migration, and charge migration all need
        to take a per-page bit spinlock as they could race with uncharging.
      
      - Swap cache truncation happens during both swap-in and swap-out, and
        possibly repeatedly before the page is actually freed.  This means
        that the memcg swapout code is called from many contexts that make
        no sense and it has to figure out the direction from page state to
        make sure memory and memory+swap are always correctly charged.
      
      - On page migration, the old page might be unmapped but then reused,
        so memcg code has to prevent untimely uncharging in that case.
        Because this code - which should be a simple charge transfer - is so
        special-cased, it is not reusable for replace_page_cache().
      
      But now that charged pages always have a page->mapping, introduce
      mem_cgroup_uncharge(), which is called after the final put_page(), when we
      know for sure that nobody is looking at the page anymore.
      
      For page migration, introduce mem_cgroup_migrate(), which is called after
      the migration is successful and the new page is fully rmapped.  Because
      the old page is no longer uncharged after migration, prevent double
      charges by decoupling the page's memcg association (PCG_USED and
      pc->mem_cgroup) from the page holding an actual charge.  The new bits
      PCG_MEM and PCG_MEMSW represent the respective charges and are transferred
      to the new page during migration.
      
      mem_cgroup_migrate() is suitable for replace_page_cache() as well,
      which gets rid of mem_cgroup_replace_page_cache().  However, care
      needs to be taken because both the source and the target page can
      already be charged and on the LRU when fuse is splicing: grab the page
      lock on the charge moving side to prevent changing pc->mem_cgroup of a
      page under migration.  Also, the lruvecs of both pages change as we
      uncharge the old and charge the new during migration, and putback may
      race with us, so grab the lru lock and isolate the pages iff on LRU to
      prevent races and ensure the pages are on the right lruvec afterward.
      
      Swap accounting is massively simplified: because the page is no longer
      uncharged as early as swap cache deletion, a new mem_cgroup_swapout() can
      transfer the page's memory+swap charge (PCG_MEMSW) to the swap entry
      before the final put_page() in page reclaim.
      
      Finally, page_cgroup changes are now protected by whatever protection the
      page itself offers: anonymous pages are charged under the page table lock,
      whereas page cache insertions, swapin, and migration hold the page lock.
      Uncharging happens under full exclusion with no outstanding references.
      Charging and uncharging also ensure that the page is off-LRU, which
      serializes against charge migration.  Remove the very costly page_cgroup
      lock and set pc->flags non-atomically.
      
      [mhocko@suse.cz: mem_cgroup_charge_statistics needs preempt_disable]
      [vdavydov@parallels.com: fix flags definition]
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Tested-by: NJet Chen <jet.chen@intel.com>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Tested-by: NFelipe Balbi <balbi@ti.com>
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0a31bc97
  12. 27 7月, 2014 1 次提交
    • H
      mm: fix direct reclaim writeback regression · 8bdd6380
      Hugh Dickins 提交于
      Shortly before 3.16-rc1, Dave Jones reported:
      
        WARNING: CPU: 3 PID: 19721 at fs/xfs/xfs_aops.c:971
                 xfs_vm_writepage+0x5ce/0x630 [xfs]()
        CPU: 3 PID: 19721 Comm: trinity-c61 Not tainted 3.15.0+ #3
        Call Trace:
          xfs_vm_writepage+0x5ce/0x630 [xfs]
          shrink_page_list+0x8f9/0xb90
          shrink_inactive_list+0x253/0x510
          shrink_lruvec+0x563/0x6c0
          shrink_zone+0x3b/0x100
          shrink_zones+0x1f1/0x3c0
          try_to_free_pages+0x164/0x380
          __alloc_pages_nodemask+0x822/0xc90
          alloc_pages_vma+0xaf/0x1c0
          handle_mm_fault+0xa31/0xc50
        etc.
      
       970   if (WARN_ON_ONCE((current->flags & (PF_MEMALLOC|PF_KSWAPD)) ==
       971                   PF_MEMALLOC))
      
      I did not respond at the time, because a glance at the PageDirty block
      in shrink_page_list() quickly shows that this is impossible: we don't do
      writeback on file pages (other than tmpfs) from direct reclaim nowadays.
      Dave was hallucinating, but it would have been disrespectful to say so.
      
      However, my own /var/log/messages now shows similar complaints
      
        WARNING: CPU: 1 PID: 28814 at fs/ext4/inode.c:1881 ext4_writepage+0xa7/0x38b()
        WARNING: CPU: 0 PID: 27347 at fs/ext4/inode.c:1764 ext4_writepage+0xa7/0x38b()
      
      from stressing some mmotm trees during July.
      
      Could a dirty xfs or ext4 file page somehow get marked PageSwapBacked,
      so fail shrink_page_list()'s page_is_file_cache() test, and so proceed
      to mapping->a_ops->writepage()?
      
      Yes, 3.16-rc1's commit 68711a74 ("mm, migration: add destination
      page freeing callback") has provided such a way to compaction: if
      migrating a SwapBacked page fails, its newpage may be put back on the
      list for later use with PageSwapBacked still set, and nothing will clear
      it.
      
      Whether that can do anything worse than issue WARN_ON_ONCEs, and get
      some statistics wrong, is unclear: easier to fix than to think through
      the consequences.
      
      Fixing it here, before the put_new_page(), addresses the bug directly,
      but is probably the worst place to fix it.  Page migration is doing too
      many parts of the job on too many levels: fixing it in
      move_to_new_page() to complement its SetPageSwapBacked would be
      preferable, except why is it (and newpage->mapping and newpage->index)
      done there, rather than down in migrate_page_move_mapping(), once we are
      sure of success? Not a cleanup to get into right now, especially not
      with memcg cleanups coming in 3.17.
      Reported-by: NDave Jones <davej@redhat.com>
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8bdd6380
  13. 24 6月, 2014 1 次提交
    • H
      mm: let mm_find_pmd fix buggy race with THP fault · f72e7dcd
      Hugh Dickins 提交于
      Trinity has reported:
      
          BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
          IP: __lock_acquire (kernel/locking/lockdep.c:3070 (discriminator 1))
          CPU: 6 PID: 16173 Comm: trinity-c364 Tainted: G        W
                                  3.15.0-rc1-next-20140415-sasha-00020-gaa90d09 #398
          lock_acquire (arch/x86/include/asm/current.h:14
                        kernel/locking/lockdep.c:3602)
          _raw_spin_lock (include/linux/spinlock_api_smp.h:143
                          kernel/locking/spinlock.c:151)
          remove_migration_pte (mm/migrate.c:137)
          rmap_walk (mm/rmap.c:1628 mm/rmap.c:1699)
          remove_migration_ptes (mm/migrate.c:224)
          migrate_pages (mm/migrate.c:922 mm/migrate.c:960 mm/migrate.c:1126)
          migrate_misplaced_page (mm/migrate.c:1733)
          __handle_mm_fault (mm/memory.c:3762 mm/memory.c:3812 mm/memory.c:3925)
          handle_mm_fault (mm/memory.c:3948)
          __get_user_pages (mm/memory.c:1851)
          __mlock_vma_pages_range (mm/mlock.c:255)
          __mm_populate (mm/mlock.c:711)
          SyS_mlockall (include/linux/mm.h:1799 mm/mlock.c:817 mm/mlock.c:791)
      
      I believe this comes about because, whereas collapsing and splitting THP
      functions take anon_vma lock in write mode (which excludes concurrent
      rmap walks), faulting THP functions (write protection and misplaced
      NUMA) do not - and mostly they do not need to.
      
      But they do use a pmdp_clear_flush(), set_pmd_at() sequence which, for
      an instant (indeed, for a long instant, given the inter-CPU TLB flush in
      there), leaves *pmd neither present not trans_huge.
      
      Which can confuse a concurrent rmap walk, as when removing migration
      ptes, seen in the dumped trace.  Although that rmap walk has a 4k page
      to insert, anon_vmas containing THPs are in no way segregated from
      4k-page anon_vmas, so the 4k-intent mm_find_pmd() does need to cope with
      that instant when a trans_huge pmd is temporarily absent.
      
      I don't think we need strengthen the locking at the THP end: it's easily
      handled with an ACCESS_ONCE() before testing both conditions.
      
      And since mm_find_pmd() had only one caller who wanted a THP rather than
      a pmd, let's slightly repurpose it to fail when it hits a THP or
      non-present pmd, and open code split_huge_page_address() again.
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Reported-by: NSasha Levin <sasha.levin@oracle.com>
      Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Bob Liu <bob.liu@oracle.com>
      Cc: Christoph Lameter <cl@gentwo.org>
      Cc: Dave Jones <davej@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f72e7dcd
  14. 05 6月, 2014 3 次提交
  15. 21 3月, 2014 1 次提交
    • H
      mm: fix swapops.h:131 bug if remap_file_pages raced migration · 7e09e738
      Hugh Dickins 提交于
      Add remove_linear_migration_ptes_from_nonlinear(), to fix an interesting
      little include/linux/swapops.h:131 BUG_ON(!PageLocked) found by trinity:
      indicating that remove_migration_ptes() failed to find one of the
      migration entries that was temporarily inserted.
      
      The problem comes from remap_file_pages()'s switch from vma_interval_tree
      (good for inserting the migration entry) to i_mmap_nonlinear list (no good
      for locating it again); but can only be a problem if the remap_file_pages()
      range does not cover the whole of the vma (zap_pte() clears the range).
      
      remove_migration_ptes() needs a file_nonlinear method to go down the
      i_mmap_nonlinear list, applying linear location to look for migration
      entries in those vmas too, just in case there was this race.
      
      The file_nonlinear method does need rmap_walk_control.arg to do this;
      but it never needed vma passed in - vma comes from its own iteration.
      Reported-and-tested-by: NDave Jones <davej@redhat.com>
      Reported-and-tested-by: NSasha Levin <sasha.levin@oracle.com>
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7e09e738
  16. 11 3月, 2014 1 次提交
    • J
      mm: fix GFP_THISNODE callers and clarify · e97ca8e5
      Johannes Weiner 提交于
      GFP_THISNODE is for callers that implement their own clever fallback to
      remote nodes.  It restricts the allocation to the specified node and
      does not invoke reclaim, assuming that the caller will take care of it
      when the fallback fails, e.g.  through a subsequent allocation request
      without GFP_THISNODE set.
      
      However, many current GFP_THISNODE users only want the node exclusive
      aspect of the flag, without actually implementing their own fallback or
      triggering reclaim if necessary.  This results in things like page
      migration failing prematurely even when there is easily reclaimable
      memory available, unless kswapd happens to be running already or a
      concurrent allocation attempt triggers the necessary reclaim.
      
      Convert all callsites that don't implement their own fallback strategy
      to __GFP_THISNODE.  This restricts the allocation a single node too, but
      at the same time allows the allocator to enter the slowpath, wake
      kswapd, and invoke direct reclaim if necessary, to make the allocation
      happen when memory is full.
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NRik van Riel <riel@redhat.com>
      Cc: Jan Stancek <jstancek@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e97ca8e5
  17. 28 1月, 2014 1 次提交
  18. 24 1月, 2014 2 次提交
  19. 22 1月, 2014 8 次提交
  20. 13 11月, 2014 1 次提交
  21. 22 12月, 2013 1 次提交
    • B
      aio/migratepages: make aio migrate pages sane · 8e321fef
      Benjamin LaHaise 提交于
      The arbitrary restriction on page counts offered by the core
      migrate_page_move_mapping() code results in rather suspicious looking
      fiddling with page reference counts in the aio_migratepage() operation.
      To fix this, make migrate_page_move_mapping() take an extra_count parameter
      that allows aio to tell the code about its own reference count on the page
      being migrated.
      
      While cleaning up aio_migratepage(), make it validate that the old page
      being passed in is actually what aio_migratepage() expects to prevent
      misbehaviour in the case of races.
      Signed-off-by: NBenjamin LaHaise <bcrl@kvack.org>
      8e321fef
  22. 19 12月, 2013 5 次提交
  23. 22 11月, 2013 1 次提交
    • D
      mm: thp: give transparent hugepage code a separate copy_page · 30b0a105
      Dave Hansen 提交于
      Right now, the migration code in migrate_page_copy() uses copy_huge_page()
      for hugetlbfs and thp pages:
      
             if (PageHuge(page) || PageTransHuge(page))
                      copy_huge_page(newpage, page);
      
      So, yay for code reuse.  But:
      
        void copy_huge_page(struct page *dst, struct page *src)
        {
              struct hstate *h = page_hstate(src);
      
      and a non-hugetlbfs page has no page_hstate().  This works 99% of the
      time because page_hstate() determines the hstate from the page order
      alone.  Since the page order of a THP page matches the default hugetlbfs
      page order, it works.
      
      But, if you change the default huge page size on the boot command-line
      (say default_hugepagesz=1G), then we might not even *have* a 2MB hstate
      so page_hstate() returns null and copy_huge_page() oopses pretty fast
      since copy_huge_page() dereferences the hstate:
      
        void copy_huge_page(struct page *dst, struct page *src)
        {
              struct hstate *h = page_hstate(src);
              if (unlikely(pages_per_huge_page(h) > MAX_ORDER_NR_PAGES)) {
        ...
      
      Mel noticed that the migration code is really the only user of these
      functions.  This moves all the copy code over to migrate.c and makes
      copy_huge_page() work for THP by checking for it explicitly.
      
      I believe the bug was introduced in commit b32967ff ("mm: numa: Add
      THP migration for the NUMA working set scanning fault case")
      
      [akpm@linux-foundation.org: fix coding-style and comment text, per Naoya Horiguchi]
      Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Reviewed-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Tested-by: NDave Jiang <dave.jiang@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      30b0a105