diff --git a/Documentation/vm/page_migration b/Documentation/vm/page_migration new file mode 100644 index 0000000000000000000000000000000000000000..c52820fcf500a8cd7733ca7859d5a9baae05ee46 --- /dev/null +++ b/Documentation/vm/page_migration @@ -0,0 +1,129 @@ +Page migration +-------------- + +Page migration allows the moving of the physical location of pages between +nodes in a numa system while the process is running. This means that the +virtual addresses that the process sees do not change. However, the +system rearranges the physical location of those pages. + +The main intend of page migration is to reduce the latency of memory access +by moving pages near to the processor where the process accessing that memory +is running. + +Page migration allows a process to manually relocate the node on which its +pages are located through the MF_MOVE and MF_MOVE_ALL options while setting +a new memory policy. The pages of process can also be relocated +from another process using the sys_migrate_pages() function call. The +migrate_pages function call takes two sets of nodes and moves pages of a +process that are located on the from nodes to the destination nodes. + +Manual migration is very useful if for example the scheduler has relocated +a process to a processor on a distant node. A batch scheduler or an +administrator may detect the situation and move the pages of the process +nearer to the new processor. At some point in the future we may have +some mechanism in the scheduler that will automatically move the pages. + +Larger installations usually partition the system using cpusets into +sections of nodes. Paul Jackson has equipped cpusets with the ability to +move pages when a task is moved to another cpuset. This allows automatic +control over locality of a process. If a task is moved to a new cpuset +then also all its pages are moved with it so that the performance of the +process does not sink dramatically (as is the case today). + +Page migration allows the preservation of the relative location of pages +within a group of nodes for all migration techniques which will preserve a +particular memory allocation pattern generated even after migrating a +process. This is necessary in order to preserve the memory latencies. +Processes will run with similar performance after migration. + +Page migration occurs in several steps. First a high level +description for those trying to use migrate_pages() and then +a low level description of how the low level details work. + +A. Use of migrate_pages() +------------------------- + +1. Remove pages from the LRU. + + Lists of pages to be migrated are generated by scanning over + pages and moving them into lists. This is done by + calling isolate_lru_page() or __isolate_lru_page(). + Calling isolate_lru_page increases the references to the page + so that it cannot vanish under us. + +2. Generate a list of newly allocates page to move the contents + of the first list to. + +3. The migrate_pages() function is called which attempts + to do the migration. It returns the moved pages in the + list specified as the third parameter and the failed + migrations in the fourth parameter. The first parameter + will contain the pages that could still be retried. + +4. The leftover pages of various types are returned + to the LRU using putback_to_lru_pages() or otherwise + disposed of. The pages will still have the refcount as + increased by isolate_lru_pages()! + +B. Operation of migrate_pages() +-------------------------------- + +migrate_pages does several passes over its list of pages. A page is moved +if all references to a page are removable at the time. + +Steps: + +1. Lock the page to be migrated + +2. Insure that writeback is complete. + +3. Make sure that the page has assigned swap cache entry if + it is an anonyous page. The swap cache reference is necessary + to preserve the information contain in the page table maps. + +4. Prep the new page that we want to move to. It is locked + and set to not being uptodate so that all accesses to the new + page immediately lock while we are moving references. + +5. All the page table references to the page are either dropped (file backed) + or converted to swap references (anonymous pages). This should decrease the + reference count. + +6. The radix tree lock is taken + +7. The refcount of the page is examined and we back out if references remain + otherwise we know that we are the only one referencing this page. + +8. The radix tree is checked and if it does not contain the pointer to this + page then we back out. + +9. The mapping is checked. If the mapping is gone then a truncate action may + be in progress and we back out. + +10. The new page is prepped with some settings from the old page so that accesses + to the new page will be discovered to have the correct settings. + +11. The radix tree is changed to point to the new page. + +12. The reference count of the old page is dropped because the reference has now + been removed. + +13. The radix tree lock is dropped. + +14. The page contents are copied to the new page. + +15. The remaining page flags are copied to the new page. + +16. The old page flags are cleared to indicate that the page does + not use any information anymore. + +17. Queued up writeback on the new page is triggered. + +18. If swap pte's were generated for the page then remove them again. + +19. The locks are dropped from the old and new page. + +20. The new page is moved to the LRU. + +Christoph Lameter, December 19, 2005. + diff --git a/include/linux/rmap.h b/include/linux/rmap.h index 9d6fbeef210472b21a7a974677303183747bc06e..0f1ea2d6ed8635ec2db577254a4265eee1534728 100644 --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -91,7 +91,7 @@ static inline void page_dup_rmap(struct page *page) * Called from mm/vmscan.c to handle paging out */ int page_referenced(struct page *, int is_locked); -int try_to_unmap(struct page *); +int try_to_unmap(struct page *, int ignore_refs); /* * Called from mm/filemap_xip.c to unmap empty zero page @@ -111,7 +111,7 @@ unsigned long page_address_in_vma(struct page *, struct vm_area_struct *); #define anon_vma_link(vma) do {} while (0) #define page_referenced(page,l) TestClearPageReferenced(page) -#define try_to_unmap(page) SWAP_FAIL +#define try_to_unmap(page, refs) SWAP_FAIL #endif /* CONFIG_MMU */ diff --git a/include/linux/swap.h b/include/linux/swap.h index e53fef7051e6f3722c6e059e9f7ebe4a9fbf8e0d..d359fc022433fec159261b43e7f1026aa6b3761c 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -191,6 +191,8 @@ static inline int zone_reclaim(struct zone *z, gfp_t mask, unsigned int order) #ifdef CONFIG_MIGRATION extern int isolate_lru_page(struct page *p); extern int putback_lru_pages(struct list_head *l); +extern int migrate_page(struct page *, struct page *); +extern void migrate_page_copy(struct page *, struct page *); extern int migrate_pages(struct list_head *l, struct list_head *t, struct list_head *moved, struct list_head *failed); #else diff --git a/mm/rmap.c b/mm/rmap.c index d85a99d28c0387a28d92538a88b92ddfef0efbb2..13fad5fcdf7996eab9812d5cb781d9157f8cc55e 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -52,6 +52,7 @@ #include #include #include +#include #include @@ -541,7 +542,8 @@ void page_remove_rmap(struct page *page) * Subfunctions of try_to_unmap: try_to_unmap_one called * repeatedly from either try_to_unmap_anon or try_to_unmap_file. */ -static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma) +static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma, + int ignore_refs) { struct mm_struct *mm = vma->vm_mm; unsigned long address; @@ -564,7 +566,8 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma) * skipped over this mm) then we should reactivate it. */ if ((vma->vm_flags & VM_LOCKED) || - ptep_clear_flush_young(vma, address, pte)) { + (ptep_clear_flush_young(vma, address, pte) + && !ignore_refs)) { ret = SWAP_FAIL; goto out_unmap; } @@ -698,7 +701,7 @@ static void try_to_unmap_cluster(unsigned long cursor, pte_unmap_unlock(pte - 1, ptl); } -static int try_to_unmap_anon(struct page *page) +static int try_to_unmap_anon(struct page *page, int ignore_refs) { struct anon_vma *anon_vma; struct vm_area_struct *vma; @@ -709,7 +712,7 @@ static int try_to_unmap_anon(struct page *page) return ret; list_for_each_entry(vma, &anon_vma->head, anon_vma_node) { - ret = try_to_unmap_one(page, vma); + ret = try_to_unmap_one(page, vma, ignore_refs); if (ret == SWAP_FAIL || !page_mapped(page)) break; } @@ -726,7 +729,7 @@ static int try_to_unmap_anon(struct page *page) * * This function is only called from try_to_unmap for object-based pages. */ -static int try_to_unmap_file(struct page *page) +static int try_to_unmap_file(struct page *page, int ignore_refs) { struct address_space *mapping = page->mapping; pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT); @@ -740,7 +743,7 @@ static int try_to_unmap_file(struct page *page) spin_lock(&mapping->i_mmap_lock); vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) { - ret = try_to_unmap_one(page, vma); + ret = try_to_unmap_one(page, vma, ignore_refs); if (ret == SWAP_FAIL || !page_mapped(page)) goto out; } @@ -825,16 +828,16 @@ static int try_to_unmap_file(struct page *page) * SWAP_AGAIN - we missed a mapping, try again later * SWAP_FAIL - the page is unswappable */ -int try_to_unmap(struct page *page) +int try_to_unmap(struct page *page, int ignore_refs) { int ret; BUG_ON(!PageLocked(page)); if (PageAnon(page)) - ret = try_to_unmap_anon(page); + ret = try_to_unmap_anon(page, ignore_refs); else - ret = try_to_unmap_file(page); + ret = try_to_unmap_file(page, ignore_refs); if (!page_mapped(page)) ret = SWAP_SUCCESS; diff --git a/mm/vmscan.c b/mm/vmscan.c index aa4b80dbe3ad6023d3d705cbad21ea73f69023ec..8f326ce2b690bf63953dde56bf365773fad52c21 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -483,7 +483,7 @@ static int shrink_list(struct list_head *page_list, struct scan_control *sc) if (!sc->may_swap) goto keep_locked; - switch (try_to_unmap(page)) { + switch (try_to_unmap(page, 0)) { case SWAP_FAIL: goto activate_locked; case SWAP_AGAIN: @@ -623,7 +623,7 @@ static int swap_page(struct page *page) struct address_space *mapping = page_mapping(page); if (page_mapped(page) && mapping) - if (try_to_unmap(page) != SWAP_SUCCESS) + if (try_to_unmap(page, 0) != SWAP_SUCCESS) goto unlock_retry; if (PageDirty(page)) { @@ -659,6 +659,154 @@ static int swap_page(struct page *page) retry: return -EAGAIN; } + +/* + * Page migration was first developed in the context of the memory hotplug + * project. The main authors of the migration code are: + * + * IWAMOTO Toshihiro + * Hirokazu Takahashi + * Dave Hansen + * Christoph Lameter + */ + +/* + * Remove references for a page and establish the new page with the correct + * basic settings to be able to stop accesses to the page. + */ +static int migrate_page_remove_references(struct page *newpage, + struct page *page, int nr_refs) +{ + struct address_space *mapping = page_mapping(page); + struct page **radix_pointer; + + /* + * Avoid doing any of the following work if the page count + * indicates that the page is in use or truncate has removed + * the page. + */ + if (!mapping || page_mapcount(page) + nr_refs != page_count(page)) + return 1; + + /* + * Establish swap ptes for anonymous pages or destroy pte + * maps for files. + * + * In order to reestablish file backed mappings the fault handlers + * will take the radix tree_lock which may then be used to stop + * processses from accessing this page until the new page is ready. + * + * A process accessing via a swap pte (an anonymous page) will take a + * page_lock on the old page which will block the process until the + * migration attempt is complete. At that time the PageSwapCache bit + * will be examined. If the page was migrated then the PageSwapCache + * bit will be clear and the operation to retrieve the page will be + * retried which will find the new page in the radix tree. Then a new + * direct mapping may be generated based on the radix tree contents. + * + * If the page was not migrated then the PageSwapCache bit + * is still set and the operation may continue. + */ + try_to_unmap(page, 1); + + /* + * Give up if we were unable to remove all mappings. + */ + if (page_mapcount(page)) + return 1; + + write_lock_irq(&mapping->tree_lock); + + radix_pointer = (struct page **)radix_tree_lookup_slot( + &mapping->page_tree, + page_index(page)); + + if (!page_mapping(page) || page_count(page) != nr_refs || + *radix_pointer != page) { + write_unlock_irq(&mapping->tree_lock); + return 1; + } + + /* + * Now we know that no one else is looking at the page. + * + * Certain minimal information about a page must be available + * in order for other subsystems to properly handle the page if they + * find it through the radix tree update before we are finished + * copying the page. + */ + get_page(newpage); + newpage->index = page->index; + newpage->mapping = page->mapping; + if (PageSwapCache(page)) { + SetPageSwapCache(newpage); + set_page_private(newpage, page_private(page)); + } + + *radix_pointer = newpage; + __put_page(page); + write_unlock_irq(&mapping->tree_lock); + + return 0; +} + +/* + * Copy the page to its new location + */ +void migrate_page_copy(struct page *newpage, struct page *page) +{ + copy_highpage(newpage, page); + + if (PageError(page)) + SetPageError(newpage); + if (PageReferenced(page)) + SetPageReferenced(newpage); + if (PageUptodate(page)) + SetPageUptodate(newpage); + if (PageActive(page)) + SetPageActive(newpage); + if (PageChecked(page)) + SetPageChecked(newpage); + if (PageMappedToDisk(page)) + SetPageMappedToDisk(newpage); + + if (PageDirty(page)) { + clear_page_dirty_for_io(page); + set_page_dirty(newpage); + } + + ClearPageSwapCache(page); + ClearPageActive(page); + ClearPagePrivate(page); + set_page_private(page, 0); + page->mapping = NULL; + + /* + * If any waiters have accumulated on the new page then + * wake them up. + */ + if (PageWriteback(newpage)) + end_page_writeback(newpage); +} + +/* + * Common logic to directly migrate a single page suitable for + * pages that do not use PagePrivate. + * + * Pages are locked upon entry and exit. + */ +int migrate_page(struct page *newpage, struct page *page) +{ + BUG_ON(PageWriteback(page)); /* Writeback must be complete */ + + if (migrate_page_remove_references(newpage, page, 2)) + return -EAGAIN; + + migrate_page_copy(newpage, page); + + return 0; +} + /* * migrate_pages * @@ -672,11 +820,6 @@ static int swap_page(struct page *page) * are movable anymore because t has become empty * or no retryable pages exist anymore. * - * SIMPLIFIED VERSION: This implementation of migrate_pages - * is only swapping out pages and never touches the second - * list. The direct migration patchset - * extends this function to avoid the use of swap. - * * Return: Number of pages not migrated when "to" ran empty. */ int migrate_pages(struct list_head *from, struct list_head *to, @@ -697,6 +840,9 @@ int migrate_pages(struct list_head *from, struct list_head *to, retry = 0; list_for_each_entry_safe(page, page2, from, lru) { + struct page *newpage = NULL; + struct address_space *mapping; + cond_resched(); rc = 0; @@ -704,6 +850,9 @@ int migrate_pages(struct list_head *from, struct list_head *to, /* page was freed from under us. So we are done. */ goto next; + if (to && list_empty(to)) + break; + /* * Skip locked pages during the first two passes to give the * functions holding the lock time to release the page. Later we @@ -740,12 +889,64 @@ int migrate_pages(struct list_head *from, struct list_head *to, } } + if (!to) { + rc = swap_page(page); + goto next; + } + + newpage = lru_to_page(to); + lock_page(newpage); + /* - * Page is properly locked and writeback is complete. + * Pages are properly locked and writeback is complete. * Try to migrate the page. */ - rc = swap_page(page); - goto next; + mapping = page_mapping(page); + if (!mapping) + goto unlock_both; + + /* + * Trigger writeout if page is dirty + */ + if (PageDirty(page)) { + switch (pageout(page, mapping)) { + case PAGE_KEEP: + case PAGE_ACTIVATE: + goto unlock_both; + + case PAGE_SUCCESS: + unlock_page(newpage); + goto next; + + case PAGE_CLEAN: + ; /* try to migrate the page below */ + } + } + /* + * If we have no buffer or can release the buffer + * then do a simple migration. + */ + if (!page_has_buffers(page) || + try_to_release_page(page, GFP_KERNEL)) { + rc = migrate_page(newpage, page); + goto unlock_both; + } + + /* + * On early passes with mapped pages simply + * retry. There may be a lock held for some + * buffers that may go away. Later + * swap them out. + */ + if (pass > 4) { + unlock_page(newpage); + newpage = NULL; + rc = swap_page(page); + goto next; + } + +unlock_both: + unlock_page(newpage); unlock_page: unlock_page(page); @@ -758,7 +959,10 @@ int migrate_pages(struct list_head *from, struct list_head *to, list_move(&page->lru, failed); nr_failed++; } else { - /* Success */ + if (newpage) { + /* Successful migration. Return page to LRU */ + move_to_lru(newpage); + } list_move(&page->lru, moved); } }