- 24 2月, 2013 40 次提交
-
-
由 Mel Gorman 提交于
Andrew Morton pointed out that page_xchg_last_nid() and reset_page_last_nid() were "getting nuttily large" and asked that it be investigated. reset_page_last_nid() is on the page free path and it would be unfortunate to make that path more expensive than it needs to be. Due to the internal use of page_xchg_last_nid() it is already too expensive but fortunately, it should also be impossible for the page->flags to be updated in parallel when we call reset_page_last_nid(). Instead of unlining the function, it uses a simplier implementation that assumes no parallel updates and should now be sufficiently short for inlining. page_xchg_last_nid() is called in paths that are already quite expensive (splitting huge page, fault handling, migration) and it is reasonable to uninline. There was not really a good place to place the function but mm/mmzone.c was the closest fit IMO. This patch saved 128 bytes of text in the vmlinux file for the kernel configuration I used for testing automatic NUMA balancing. Signed-off-by: NMel Gorman <mgorman@suse.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Michal Hocko 提交于
Memcg swap accounting is currently enabled by enable_swap_cgroup when the root cgroup is created. mem_cgroup_init acts as a memcg subsystem initializer which sounds like a much better place for enable_swap_cgroup as well. We already register memsw files from there so it makes a lot of sense to merge those two into a single enable_swap_cgroup function. This patch doesn't introduce any semantic changes. Signed-off-by: NMichal Hocko <mhocko@suse.cz> Cc: Zhouping Liu <zliu@redhat.com> Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Li Zefan <lizefan@huawei.com> Cc: CAI Qian <caiqian@redhat.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Michal Hocko 提交于
Zhouping Liu has reported that memsw files are exported even though swap accounting is runtime disabled if MEMCG_SWAP is enabled. This behavior has been introduced by commit af36f906 ("memcg: always create memsw files if CGROUP_MEM_RES_CTLR_SWAP") and it causes any attempt to open the file to return EOPNOTSUPP. Although EOPNOTSUPP should say be clear that memsw operations are not supported in the given configuration it is fair to say that this behavior could be quite confusing. Let's tear memsw files out of default cgroup files and add them only if the swap accounting is really enabled (either by MEMCG_SWAP_ENABLED or swapaccount=1 boot parameter). We can hook into mem_cgroup_init which is called when the memcg subsystem is initialized and which happens after boot command line is processed. Signed-off-by: NMichal Hocko <mhocko@suse.cz> Reported-by: NZhouping Liu <zliu@redhat.com> Tested-by: NZhouping Liu <zliu@redhat.com> Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Li Zefan <lizefan@huawei.com> Cc: CAI Qian <caiqian@redhat.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Paul Szabo 提交于
When calculating amount of dirtyable memory, min_free_kbytes should be subtracted because it is not intended for dirty pages. Addresses http://bugs.debian.org/695182 [akpm@linux-foundation.org: fix up min_free_kbytes extern declarations] [akpm@linux-foundation.org: fix min() warning] Signed-off-by: NPaul Szabo <psz@maths.usyd.edu.au> Acked-by: NRik van Riel <riel@redhat.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Konstantin Khlebnikov 提交于
The comment in commit 4fc3f1d6 ("mm/rmap, migration: Make rmap_walk_anon() and try_to_unmap_anon() more scalable") says: | Rename anon_vma_[un]lock() => anon_vma_[un]lock_write(), | to make it clearer that it's an exclusive write-lock in | that case - suggested by Rik van Riel. But that commit renames only anon_vma_lock() Signed-off-by: NKonstantin Khlebnikov <khlebnikov@openvz.org> Cc: Ingo Molnar <mingo@kernel.org> Reviewed-by: NRik van Riel <riel@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Shaohua Li 提交于
swap_lock is heavily contended when I test swap to 3 fast SSD (even slightly slower than swap to 2 such SSD). The main contention comes from swap_info_get(). This patch tries to fix the gap with adding a new per-partition lock. Global data like nr_swapfiles, total_swap_pages, least_priority and swap_list are still protected by swap_lock. nr_swap_pages is an atomic now, it can be changed without swap_lock. In theory, it's possible get_swap_page() finds no swap pages but actually there are free swap pages. But sounds not a big problem. Accessing partition specific data (like scan_swap_map and so on) is only protected by swap_info_struct.lock. Changing swap_info_struct.flags need hold swap_lock and swap_info_struct.lock, because scan_scan_map() will check it. read the flags is ok with either the locks hold. If both swap_lock and swap_info_struct.lock must be hold, we always hold the former first to avoid deadlock. swap_entry_free() can change swap_list. To delete that code, we add a new highest_priority_index. Whenever get_swap_page() is called, we check it. If it's valid, we use it. It's a pity get_swap_page() still holds swap_lock(). But in practice, swap_lock() isn't heavily contended in my test with this patch (or I can say there are other much more heavier bottlenecks like TLB flush). And BTW, looks get_swap_page() doesn't really need the lock. We never free swap_info[] and we check SWAP_WRITEOK flag. The only risk without the lock is we could swapout to some low priority swap, but we can quickly recover after several rounds of swap, so sounds not a big deal to me. But I'd prefer to fix this if it's a real problem. "swap: make each swap partition have one address_space" improved the swapout speed from 1.7G/s to 2G/s. This patch further improves the speed to 2.3G/s, so around 15% improvement. It's a multi-process test, so TLB flush isn't the biggest bottleneck before the patches. [arnd@arndb.de: fix it for nommu] [hughd@google.com: add missing unlock] [minchan@kernel.org: get rid of lockdep whinge on sys_swapon] Signed-off-by: NShaohua Li <shli@fusionio.com> Cc: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Seth Jennings <sjenning@linux.vnet.ibm.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Cc: Dan Magenheimer <dan.magenheimer@oracle.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: NArnd Bergmann <arnd@arndb.de> Signed-off-by: NHugh Dickins <hughd@google.com> Signed-off-by: NMinchan Kim <minchan@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Shaohua Li 提交于
When I use several fast SSD to do swap, swapper_space.tree_lock is heavily contended. This makes each swap partition have one address_space to reduce the lock contention. There is an array of address_space for swap. The swap entry type is the index to the array. In my test with 3 SSD, this increases the swapout throughput 20%. [akpm@linux-foundation.org: revert unneeded change to __add_to_swap_cache] Signed-off-by: NShaohua Li <shli@fusionio.com> Cc: Hugh Dickins <hughd@google.com> Acked-by: NRik van Riel <riel@redhat.com> Acked-by: NMinchan Kim <minchan@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Shaohua Li 提交于
According to akpm, this saves 1/2k text and makes things simple for the next patch. Numbers from Minchan: add/remove: 1/0 grow/shrink: 6/22 up/down: 92/-516 (-424) function old new delta page_mapping - 48 +48 do_task_stat 2292 2308 +16 page_remove_rmap 240 248 +8 load_elf_binary 4500 4508 +8 update_queue 532 536 +4 scsi_probe_and_add_lun 2892 2896 +4 lookup_fast 644 648 +4 vcs_read 1040 1036 -4 __ip_route_output_key 1904 1900 -4 ip_route_input_noref 2508 2500 -8 shmem_file_aio_read 784 772 -12 __isolate_lru_page 272 256 -16 shmem_replace_page 708 688 -20 mark_buffer_dirty 228 208 -20 __set_page_dirty_buffers 240 220 -20 __remove_mapping 276 256 -20 update_mmu_cache 500 476 -24 set_page_dirty_balance 92 68 -24 set_page_dirty 172 148 -24 page_evictable 88 64 -24 page_cache_pipe_buf_steal 248 224 -24 clear_page_dirty_for_io 340 316 -24 test_set_page_writeback 400 372 -28 test_clear_page_writeback 516 488 -28 invalidate_inode_page 156 128 -28 page_mkclean 432 400 -32 flush_dcache_page 360 328 -32 __set_page_dirty_nobuffers 324 280 -44 shrink_page_list 2412 2356 -56 Signed-off-by: NShaohua Li <shli@fusionio.com> Suggested-by: NAndrew Morton <akpm@linux-foundation.org> Cc: Hugh Dickins <hughd@google.com> Acked-by: NRik van Riel <riel@redhat.com> Cc: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Hugh Dickins 提交于
When correcting commit 04fa5d6a ("mm: migrate: check page_count of THP before migrating") Hugh Dickins noted that the control flow for transhuge migration was difficult to follow. Unconditionally calling put_page() in numamigrate_isolate_page() made the failure paths of both migrate_misplaced_transhuge_page() and migrate_misplaced_page() more complex that they should be. Further, he was extremely wary that an unlock_page() should ever happen after a put_page() even if the put_page() should never be the final put_page. Hugh implemented the following cleanup to simplify the path by calling putback_lru_page() inside numamigrate_isolate_page() if it failed to isolate and always calling unlock_page() within migrate_misplaced_transhuge_page(). There is no functional change after this patch is applied but the code is easier to follow and unlock_page() always happens before put_page(). [mgorman@suse.de: changelog only] Signed-off-by: NMel Gorman <mgorman@suse.de> Signed-off-by: NHugh Dickins <hughd@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Simon Jeons <simon.jeons@gmail.com> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Peter Zijlstra 提交于
page->_last_nid fits into page->flags on 64-bit. The unlikely 32-bit NUMA configuration with NUMA Balancing will still need an extra page field. As Peter notes "Completely dropping 32bit support for CONFIG_NUMA_BALANCING would simplify things, but it would also remove the warning if we grow enough 64bit only page-flags to push the last-cpu out." [mgorman@suse.de: minor modifications] Signed-off-by: NMel Gorman <mgorman@suse.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Simon Jeons <simon.jeons@gmail.com> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Peter Zijlstra 提交于
This is a preparation patch for moving page->_last_nid into page->flags that moves page flag layout information to a separate header. This patch is necessary because otherwise there would be a circular dependency between mm_types.h and mm.h. Signed-off-by: NMel Gorman <mgorman@suse.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Simon Jeons <simon.jeons@gmail.com> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Mel Gorman 提交于
The current definitions for count_vm_numa_events() is wrong for !CONFIG_NUMA_BALANCING as the following would miss the side-effect. count_vm_numa_events(NUMA_FOO, bar++); There are no such users of count_vm_numa_events() but this patch fixes it as it is a potential pitfall. Ideally both would be converted to static inline but NUMA_PTE_UPDATES is not defined if !CONFIG_NUMA_BALANCING and creating dummy constants just to have a static inline would be similarly clumsy. Signed-off-by: NMel Gorman <mgorman@suse.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Simon Jeons <simon.jeons@gmail.com> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Mel Gorman 提交于
Wanpeng Li pointed out that numamigrate_isolate_page() assumes that only one base page is being migrated when in fact it can also be checking THP. The consequences are that a migration will be attempted when a target node is nearly full and fail later. It's unlikely to be user-visible but it should be fixed. While we are there, migrate_balanced_pgdat() should treat nr_migrate_pages as an unsigned long as it is treated as a watermark. Signed-off-by: NMel Gorman <mgorman@suse.de> Suggested-by: NWanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Simon Jeons <simon.jeons@gmail.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Mel Gorman 提交于
s/me/be/ and clarify the comment a bit when we're changing it anyway. Signed-off-by: NMel Gorman <mgorman@suse.de> Suggested-by: NSimon Jeons <simon.jeons@gmail.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Kirill A. Shutemov 提交于
Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Ming Lei 提交于
If one storage interface or usb network interface(iSCSI case) exists in current configuration, memory allocation with GFP_KERNEL during usb_device_reset() might trigger I/O transfer on the storage interface itself and cause deadlock because the 'us->dev_mutex' is held in .pre_reset() and the storage interface can't do I/O transfer when the reset is triggered by other interface, or the error handling can't be completed if the reset is triggered by the storage itself (error handling path). Signed-off-by: NMing Lei <ming.lei@canonical.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: David Decotigny <david.decotigny@google.com> Cc: Tom Herbert <therbert@google.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Jens Axboe <axboe@kernel.dk> Cc: Minchan Kim <minchan@kernel.org> Cc: Alan Stern <stern@rowland.harvard.edu> Cc: Oliver Neukum <oneukum@suse.de> Reviewed-by: NJiri Kosina <jkosina@suse.cz> Cc: Mel Gorman <mel@csn.ul.ie> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Greg KH <greg@kroah.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Ming Lei 提交于
Apply the introduced memalloc_noio_save() and memalloc_noio_restore() to force memory allocation with no I/O during runtime_resume/runtime_suspend callback on device with the flag of 'memalloc_noio' set. Signed-off-by: NMing Lei <ming.lei@canonical.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: David Decotigny <david.decotigny@google.com> Cc: Tom Herbert <therbert@google.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Jens Axboe <axboe@kernel.dk> Cc: Minchan Kim <minchan@kernel.org> Cc: Alan Stern <stern@rowland.harvard.edu> Cc: Oliver Neukum <oneukum@suse.de> Cc: Jiri Kosina <jiri.kosina@suse.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Greg KH <greg@kroah.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Ming Lei 提交于
Deadlock might be caused by allocating memory with GFP_KERNEL in runtime_resume and runtime_suspend callback of network devices in iSCSI situation, so mark network devices and its ancestor as 'memalloc_noio' with the introduced pm_runtime_set_memalloc_noio(). Signed-off-by: NMing Lei <ming.lei@canonical.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: David Decotigny <david.decotigny@google.com> Cc: Tom Herbert <therbert@google.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Jens Axboe <axboe@kernel.dk> Cc: Minchan Kim <minchan@kernel.org> Cc: Alan Stern <stern@rowland.harvard.edu> Cc: Oliver Neukum <oneukum@suse.de> Cc: Jiri Kosina <jiri.kosina@suse.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Greg KH <greg@kroah.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Ming Lei 提交于
Apply the introduced pm_runtime_set_memalloc_noio on block device so that PM core will teach mm to not allocate memory with GFP_IOFS when calling the runtime_resume and runtime_suspend callback for block devices and its ancestors. Signed-off-by: NMing Lei <ming.lei@canonical.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Minchan Kim <minchan@kernel.org> Cc: Alan Stern <stern@rowland.harvard.edu> Cc: Oliver Neukum <oneukum@suse.de> Cc: Jiri Kosina <jiri.kosina@suse.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Greg KH <greg@kroah.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: David Decotigny <david.decotigny@google.com> Cc: Tom Herbert <therbert@google.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Ming Lei 提交于
Introduce the flag memalloc_noio in 'struct dev_pm_info' to help PM core to teach mm not allocating memory with GFP_KERNEL flag for avoiding probable deadlock. As explained in the comment, any GFP_KERNEL allocation inside runtime_resume() or runtime_suspend() on any one of device in the path from one block or network device to the root device in the device tree may cause deadlock, the introduced pm_runtime_set_memalloc_noio() sets or clears the flag on device in the path recursively. Signed-off-by: NMing Lei <ming.lei@canonical.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Alan Stern <stern@rowland.harvard.edu> Cc: Oliver Neukum <oneukum@suse.de> Cc: Jiri Kosina <jiri.kosina@suse.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Greg KH <greg@kroah.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: "David S. Miller" <davem@davemloft.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: David Decotigny <david.decotigny@google.com> Cc: Tom Herbert <therbert@google.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Ming Lei 提交于
This patch introduces PF_MEMALLOC_NOIO on process flag('flags' field of 'struct task_struct'), so that the flag can be set by one task to avoid doing I/O inside memory allocation in the task's context. The patch trys to solve one deadlock problem caused by block device, and the problem may happen at least in the below situations: - during block device runtime resume, if memory allocation with GFP_KERNEL is called inside runtime resume callback of any one of its ancestors(or the block device itself), the deadlock may be triggered inside the memory allocation since it might not complete until the block device becomes active and the involed page I/O finishes. The situation is pointed out first by Alan Stern. It is not a good approach to convert all GFP_KERNEL[1] in the path into GFP_NOIO because several subsystems may be involved(for example, PCI, USB and SCSI may be involved for usb mass stoarage device, network devices involved too in the iSCSI case) - during block device runtime suspend, because runtime resume need to wait for completion of concurrent runtime suspend. - during error handling of usb mass storage deivce, USB bus reset will be put on the device, so there shouldn't have any memory allocation with GFP_KERNEL during USB bus reset, otherwise the deadlock similar with above may be triggered. Unfortunately, any usb device may include one mass storage interface in theory, so it requires all usb interface drivers to handle the situation. In fact, most usb drivers don't know how to handle bus reset on the device and don't provide .pre_set() and .post_reset() callback at all, so USB core has to unbind and bind driver for these devices. So it is still not practical to resort to GFP_NOIO for solving the problem. Also the introduced solution can be used by block subsystem or block drivers too, for example, set the PF_MEMALLOC_NOIO flag before doing actual I/O transfer. It is not a good idea to convert all these GFP_KERNEL in the affected path into GFP_NOIO because these functions doing that may be implemented as library and will be called in many other contexts. In fact, memalloc_noio_flags() can convert some of current static GFP_NOIO allocation into GFP_KERNEL back in other non-affected contexts, at least almost all GFP_NOIO in USB subsystem can be converted into GFP_KERNEL after applying the approach and make allocation with GFP_NOIO only happen in runtime resume/bus reset/block I/O transfer contexts generally. [1], several GFP_KERNEL allocation examples in runtime resume path - pci subsystem acpi_os_allocate <-acpi_ut_allocate <-ACPI_ALLOCATE_ZEROED <-acpi_evaluate_object <-__acpi_bus_set_power <-acpi_bus_set_power <-acpi_pci_set_power_state <-platform_pci_set_power_state <-pci_platform_power_transition <-__pci_complete_power_transition <-pci_set_power_state <-pci_restore_standard_config <-pci_pm_runtime_resume - usb subsystem usb_get_status <-finish_port_resume <-usb_port_resume <-generic_resume <-usb_resume_device <-usb_resume_both <-usb_runtime_resume - some individual usb drivers usblp, uvc, gspca, most of dvb-usb-v2 media drivers, cpia2, az6007, .... That is just what I have found. Unfortunately, this allocation can only be found by human being now, and there should be many not found since any function in the resume path(call tree) may allocate memory with GFP_KERNEL. Signed-off-by: NMing Lei <ming.lei@canonical.com> Signed-off-by: NMinchan Kim <minchan@kernel.org> Cc: Alan Stern <stern@rowland.harvard.edu> Cc: Oliver Neukum <oneukum@suse.de> Cc: Jiri Kosina <jiri.kosina@suse.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Greg KH <greg@kroah.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: "David S. Miller" <davem@davemloft.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: David Decotigny <david.decotigny@google.com> Cc: Tom Herbert <therbert@google.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Zlatko Calusic 提交于
From: Zlatko Calusic <zlatko.calusic@iskon.hr> Commit 92df3a72 ("mm: vmscan: throttle reclaim if encountering too many dirty pages under writeback") introduced waiting on congested zones based on a sane algorithm in shrink_inactive_list(). What this means is that there's no more need for throttling and additional heuristics in balance_pgdat(). So, let's remove it and tidy up the code. Signed-off-by: NZlatko Calusic <zlatko.calusic@iskon.hr> Cc: Mel Gorman <mgorman@suse.de> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Johannes Weiner <jweiner@redhat.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Naoya Horiguchi 提交于
num_poisoned_pages counts up the number of pages isolated by memory errors. But for thp, only one subpage is isolated because memory error handler splits it, so it's wrong to add (1 << compound_trans_order). [akpm@linux-foundation.org: tweak comment] Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Tony Luck <tony.luck@intel.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Naoya Horiguchi 提交于
Currently soft_offline_page() is hard to maintain because it has many return points and goto statements. All of this mess come from get_any_page(). This function should only get page refcount as the name implies, but it does some page isolating actions like SetPageHWPoison() and dequeuing hugepage. This patch corrects it and introduces some internal subroutines to make soft offlining code more readable and maintainable. Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reviewed-by: NAndi Kleen <andi@firstfloor.org> Cc: Tony Luck <tony.luck@intel.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Xishi Qiu <qiuxishi@huawei.com> Cc: Jiang Liu <jiang.liu@huawei.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Xishi Qiu 提交于
Since MCE is an x86 concept, and this code is in mm/, it would be better to use the name num_poisoned_pages instead of mce_bad_pages. [akpm@linux-foundation.org: fix mm/sparse.c] Signed-off-by: NXishi Qiu <qiuxishi@huawei.com> Signed-off-by: NJiang Liu <jiang.liu@huawei.com> Suggested-by: NBorislav Petkov <bp@alien8.de> Reviewed-by: NWanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Andi Kleen <andi@firstfloor.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Xishi Qiu 提交于
There are too many return points randomly intermingled with some "goto done" return points. So adjust the function structure, one for the success path, the other for the failure path. Use atomic_long_inc instead of atomic_long_add. Signed-off-by: NXishi Qiu <qiuxishi@huawei.com> Signed-off-by: NJiang Liu <jiang.liu@huawei.com> Suggested-by: NAndrew Morton <akpm@linux-foundation.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Andi Kleen <andi@firstfloor.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Xishi Qiu 提交于
When doing $ echo paddr > /sys/devices/system/memory/soft_offline_page to offline a *free* page, the value of mce_bad_pages will be added, and the page is set HWPoison flag, but it is still managed by page buddy alocator. $ cat /proc/meminfo | grep HardwareCorrupted shows the value. If we offline the same page, the value of mce_bad_pages will be added *again*, this means the value is incorrect now. Assume the page is still free during this short time. soft_offline_page() get_any_page() "else if (is_free_buddy_page(p))" branch return 0 "goto done"; "atomic_long_add(1, &mce_bad_pages);" This patch: Move poisoned page check at the beginning of the function in order to fix the error. Signed-off-by: NXishi Qiu <qiuxishi@huawei.com> Signed-off-by: NJiang Liu <jiang.liu@huawei.com> Tested-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Andi Kleen <andi@firstfloor.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Minchan Kim 提交于
Several functions test MIGRATE_ISOLATE and some of those are hotpath but MIGRATE_ISOLATE is used only if we enable CONFIG_MEMORY_ISOLATION(ie, CMA, memory-hotplug and memory-failure) which are not common config option. So let's not add unnecessary overhead and code when we don't enable CONFIG_MEMORY_ISOLATION. Signed-off-by: NMinchan Kim <minchan@kernel.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: NMichal Nazarewicz <mina86@mina86.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Jiang Liu 提交于
Function put_page_bootmem() is used to free pages allocated by bootmem allocator, so it should increase totalram_pages when freeing pages into the buddy system. Signed-off-by: NJiang Liu <jiang.liu@huawei.com> Cc: Wen Congyang <wency@cn.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Jiang Liu <jiang.liu@huawei.com> Cc: Maciej Rutecki <maciej.rutecki@gmail.com> Cc: Chris Clayton <chris2553@googlemail.com> Cc: "Rafael J . Wysocki" <rjw@sisk.pl> Cc: Mel Gorman <mgorman@suse.de> Cc: Minchan Kim <minchan@kernel.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Jianguo Wu <wujianguo@huawei.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Jiang Liu 提交于
Now all users of "number of pages managed by the buddy system" have been converted to use zone->managed_pages, so set zone->present_pages to what it should be: present_pages = spanned_pages - absent_pages; Signed-off-by: NJiang Liu <jiang.liu@huawei.com> Cc: Wen Congyang <wency@cn.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Jiang Liu <jiang.liu@huawei.com> Cc: Maciej Rutecki <maciej.rutecki@gmail.com> Cc: Chris Clayton <chris2553@googlemail.com> Cc: "Rafael J . Wysocki" <rjw@sisk.pl> Cc: Mel Gorman <mgorman@suse.de> Cc: Minchan Kim <minchan@kernel.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Jianguo Wu <wujianguo@huawei.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Jiang Liu 提交于
Now we have zone->managed_pages for "pages managed by the buddy system in the zone", so replace zone->present_pages with zone->managed_pages if what the user really wants is number of allocatable pages. Signed-off-by: NJiang Liu <jiang.liu@huawei.com> Cc: Wen Congyang <wency@cn.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Jiang Liu <jiang.liu@huawei.com> Cc: Maciej Rutecki <maciej.rutecki@gmail.com> Cc: Chris Clayton <chris2553@googlemail.com> Cc: "Rafael J . Wysocki" <rjw@sisk.pl> Cc: Mel Gorman <mgorman@suse.de> Cc: Minchan Kim <minchan@kernel.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Jianguo Wu <wujianguo@huawei.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Tang Chen 提交于
mm/memblock.c: use CONFIG_HAVE_MEMBLOCK_NODE_MAP to protect movablecore_map in memblock_overlaps_region(). The definition of struct movablecore_map is protected by CONFIG_HAVE_MEMBLOCK_NODE_MAP but its use in memblock_overlaps_region() is not. So add CONFIG_HAVE_MEMBLOCK_NODE_MAP to protect the use of movablecore_map in memblock_overlaps_region(). Signed-off-by: NTang Chen <tangchen@cn.fujitsu.com> Reported-by: NStephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Tang Chen 提交于
We now provide an option for users who don't want to specify physical memory address in kernel commandline. /* * For movablemem_map=acpi: * * SRAT: |_____| |_____| |_________| |_________| ...... * node id: 0 1 1 2 * hotpluggable: n y y n * movablemem_map: |_____| |_________| * * Using movablemem_map, we can prevent memblock from allocating memory * on ZONE_MOVABLE at boot time. */ So user just specify movablemem_map=acpi, and the kernel will use hotpluggable info in SRAT to determine which memory ranges should be set as ZONE_MOVABLE. If all the memory ranges in SRAT is hotpluggable, then no memory can be used by kernel. But before parsing SRAT, memblock has already reserve some memory ranges for other purposes, such as for kernel image, and so on. We cannot prevent kernel from using these memory. So we need to exclude these ranges even if these memory is hotpluggable. Furthermore, there could be several memory ranges in the single node which the kernel resides in. We may skip one range that have memory reserved by memblock, but if the rest of memory is too small, then the kernel will fail to boot. So, make the whole node which the kernel resides in un-hotpluggable. Then the kernel has enough memory to use. NOTE: Using this way will cause NUMA performance down because the whole node will be set as ZONE_MOVABLE, and kernel cannot use memory on it. If users don't want to lose NUMA performance, just don't use it. [akpm@linux-foundation.org: fix warning] [akpm@linux-foundation.org: use strcmp()] Signed-off-by: NTang Chen <tangchen@cn.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Jiang Liu <jiang.liu@huawei.com> Cc: Jianguo Wu <wujianguo@huawei.com> Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Wu Jianguo <wujianguo@huawei.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Len Brown <lenb@kernel.org> Cc: "Brown, Len" <len.brown@intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Tang Chen 提交于
When implementing movablemem_map boot option, we introduced an array movablemem_map.map[] to store the memory ranges to be set as ZONE_MOVABLE. Since ZONE_MOVABLE is the latst zone of a node, if user didn't specify the whole node memory range, we need to extend it to the node end so that we can use it to prevent memblock from allocating memory in the ranges user didn't specify. We now implement movablemem_map boot option like this: /* * For movablemem_map=nn[KMG]@ss[KMG]: * * SRAT: |_____| |_____| |_________| |_________| ...... * node id: 0 1 1 2 * user specified: |__| |___| * movablemem_map: |___| |_________| |______| ...... * * Using movablemem_map, we can prevent memblock from allocating memory * on ZONE_MOVABLE at boot time. * * NOTE: In this case, SRAT info will be ingored. */ [akpm@linux-foundation.org: clean up code, fix build warning] Signed-off-by: NTang Chen <tangchen@cn.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Jiang Liu <jiang.liu@huawei.com> Cc: Jianguo Wu <wujianguo@huawei.com> Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Wu Jianguo <wujianguo@huawei.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Len Brown <lenb@kernel.org> Cc: "Brown, Len" <len.brown@intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Tang Chen 提交于
On linux, the pages used by kernel could not be migrated. As a result, if a memory range is used by kernel, it cannot be hot-removed. So if we want to hot-remove memory, we should prevent kernel from using it. The way now used to prevent this is specify a memory range by movablemem_map boot option and set it as ZONE_MOVABLE. But when the system is booting, memblock will allocate memory, and reserve the memory for kernel. And before we parse SRAT, and know the node memory ranges, memblock is working. And it may allocate memory in ranges to be set as ZONE_MOVABLE. This memory can be used by kernel, and never be freed. So, let's parse SRAT before memblock is called first. And it is early enough. The first call of memblock_find_in_range_node() is in: setup_arch() |-->setup_real_mode() so, this patch add a function early_parse_srat() to parse SRAT, and call it before setup_real_mode() is called. NOTE: 1) early_parse_srat() is called before numa_init(), and has initialized numa_meminfo. So DO NOT clear numa_nodes_parsed in numa_init() and DO NOT zero numa_meminfo in numa_init(), otherwise we will lose memory numa info. 2) I don't know why using count of memory affinities parsed from SRAT as a return value in original acpi_numa_init(). So I add a static variable srat_mem_cnt to remember this count and use it as the return value of the new acpi_numa_init() [mhocko@suse.cz: parse SRAT before memblock is ready fix] Signed-off-by: NTang Chen <tangchen@cn.fujitsu.com> Reviewed-by: NWen Congyang <wency@cn.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Jiang Liu <jiang.liu@huawei.com> Cc: Jianguo Wu <wujianguo@huawei.com> Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Wu Jianguo <wujianguo@huawei.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Len Brown <lenb@kernel.org> Cc: "Brown, Len" <len.brown@intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Tang Chen 提交于
Ensure the bootmem will not allocate memory from areas that may be ZONE_MOVABLE. The map info is from movablecore_map boot option. Signed-off-by: NTang Chen <tangchen@cn.fujitsu.com> Reviewed-by: NWen Congyang <wency@cn.fujitsu.com> Reviewed-by: NLai Jiangshan <laijs@cn.fujitsu.com> Tested-by: NLin Feng <linfeng@cn.fujitsu.com> Cc: Wu Jianguo <wujianguo@huawei.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Ingo Molnar <mingo@elte.hu> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Tang Chen 提交于
If kernelcore or movablecore is specified at the same time with movablemem_map, movablemem_map will have higher priority to be satisfied. This patch will make find_zone_movable_pfns_for_nodes() calculate zone_movable_pfn[] with the limit from zone_movable_limit[]. Signed-off-by: NTang Chen <tangchen@cn.fujitsu.com> Reviewed-by: NWen Congyang <wency@cn.fujitsu.com> Cc: Wu Jianguo <wujianguo@huawei.com> Reviewed-by: NLai Jiangshan <laijs@cn.fujitsu.com> Tested-by: NLin Feng <linfeng@cn.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Ingo Molnar <mingo@elte.hu> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Tang Chen 提交于
Introduce a new array zone_movable_limit[] to store the ZONE_MOVABLE limit from movablemem_map boot option for all nodes. The function sanitize_zone_movable_limit() will find out to which node the ranges in movable_map.map[] belongs, and calculates the low boundary of ZONE_MOVABLE for each node. Signed-off-by: NTang Chen <tangchen@cn.fujitsu.com> Signed-off-by: NLiu Jiang <jiang.liu@huawei.com> Reviewed-by: NWen Congyang <wency@cn.fujitsu.com> Cc: Wu Jianguo <wujianguo@huawei.com> Reviewed-by: NLai Jiangshan <laijs@cn.fujitsu.com> Tested-by: NLin Feng <linfeng@cn.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Ingo Molnar <mingo@elte.hu> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Tang Chen 提交于
Add functions to parse movablemem_map boot option. Since the option could be specified more then once, all the maps will be stored in the global variable movablemem_map.map array. And also, we keep the array in monotonic increasing order by start_pfn. And merge all overlapped ranges. [akpm@linux-foundation.org: improve comment] [akpm@linux-foundation.org: checkpatch fixes] [akpm@linux-foundation.org: remove unneeded parens] Signed-off-by: NTang Chen <tangchen@cn.fujitsu.com> Signed-off-by: NLai Jiangshan <laijs@cn.fujitsu.com> Reviewed-by: NWen Congyang <wency@cn.fujitsu.com> Tested-by: NLin Feng <linfeng@cn.fujitsu.com> Cc: Wu Jianguo <wujianguo@huawei.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Ingo Molnar <mingo@elte.hu> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Yasuaki Ishimatsu 提交于
During the implementation of SRAT support, we met a problem. In setup_arch(), we have the following call series: 1) memblock is ready; 2) some functions use memblock to allocate memory; 3) parse ACPI tables, such as SRAT. Before 3), we don't know which memory is hotpluggable, and as a result, we cannot prevent memblock from allocating hotpluggable memory. So, in 2), there could be some hotpluggable memory allocated by memblock. Now, we are trying to parse SRAT earlier, before memblock is ready. But I think we need more investigation on this topic. So in this v5, I dropped all the SRAT support, and v5 is just the same as v3, and it is based on 3.8-rc3. As we planned, we will support getting info from SRAT without users' participation at last. And we will post another patch-set to do so. And also, I think for now, we can add this boot option as the first step of supporting movable node. Since Linux cannot migrate the direct mapped pages, the only way for now is to limit the whole node containing only movable memory. Using SRAT is one way. But even if we can use SRAT, users still need an interface to enable/disable this functionality if they don't want to loose their NUMA performance. So I think, a user interface is always needed. For now, users can disable this functionality by not specifying the boot option. Later, we will post SRAT support, and add another option value "movablecore_map=acpi" to using SRAT. This patch: If system can create movable node which all memory of the node is allocated as ZONE_MOVABLE, setup_node_data() cannot allocate memory for the node's pg_data_t. So, use memblock_alloc_try_nid() instead of memblock_alloc_nid() to retry when the first allocation fails. Signed-off-by: NYasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Signed-off-by: NLai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: NTang Chen <tangchen@cn.fujitsu.com> Signed-off-by: NJiang Liu <jiang.liu@huawei.com> Cc: Wu Jianguo <wujianguo@huawei.com> Cc: Wen Congyang <wency@cn.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Ingo Molnar <mingo@elte.hu> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-