1. 24 2月, 2013 40 次提交
    • G
      memcg: prevent changes to move_charge_at_immigrate during task attach · ee5e8472
      Glauber Costa 提交于
      In memcg, we use the cgroup_lock basically to synchronize against
      attaching new children to a cgroup.  We do this because we rely on
      cgroup core to provide us with this information.
      
      We need to guarantee that upon child creation, our tunables are
      consistent.  For those, the calls to cgroup_lock() all live in handlers
      like mem_cgroup_hierarchy_write(), where we change a tunable in the
      group that is hierarchy-related.  For instance, the use_hierarchy flag
      cannot be changed if the cgroup already have children.
      
      Furthermore, those values are propagated from the parent to the child
      when a new child is created.  So if we don't lock like this, we can end
      up with the following situation:
      
      A                                   B
       memcg_css_alloc()                       mem_cgroup_hierarchy_write()
       copy use hierarchy from parent          change use hierarchy in parent
       finish creation.
      
      This is mainly because during create, we are still not fully connected
      to the css tree.  So all iterators and the such that we could use, will
      fail to show that the group has children.
      
      My observation is that all of creation can proceed in parallel with
      those tasks, except value assignment.  So what this patch series does is
      to first move all value assignment that is dependent on parent values
      from css_alloc to css_online, where the iterators all work, and then we
      lock only the value assignment.  This will guarantee that parent and
      children always have consistent values.  Together with an online test,
      that can be derived from the observation that the refcount of an online
      memcg can be made to be always positive, we should be able to
      synchronize our side without the cgroup lock.
      
      This patch:
      
      Currently, we rely on the cgroup_lock() to prevent changes to
      move_charge_at_immigrate during task migration.  However, this is only
      needed because the current strategy keeps checking this value throughout
      the whole process.  Since all we need is serialization, one needs only
      to guarantee that whatever decision we made in the beginning of a
      specific migration is respected throughout the process.
      
      We can achieve this by just saving it in mc.  By doing this, no kind of
      locking is needed.
      Signed-off-by: NGlauber Costa <glommer@parallels.com>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Hiroyuki Kamezawa <kamezawa.hiroyuki@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ee5e8472
    • G
      memcg: reduce the size of struct memcg 244-fold. · 45cf7ebd
      Glauber Costa 提交于
      In order to maintain all the memcg bookkeeping, we need per-node
      descriptors, which will in turn contain a per-zone descriptor.
      
      Because we want to statically allocate those, this array ends up being
      very big.  Part of the reason is that we allocate something large enough
      to hold MAX_NUMNODES, the compile time constant that holds the maximum
      number of nodes we would ever consider.
      
      However, we can do better in some cases if the firmware help us.  This
      is true for modern x86 machines; coincidentally one of the architectures
      in which MAX_NUMNODES tends to be very big.
      
      By using the firmware-provided maximum number of nodes instead of
      MAX_NUMNODES, we can reduce the memory footprint of struct memcg
      considerably.  In the extreme case in which we have only one node, this
      reduces the size of the structure from ~ 64k to ~2k.  This is
      particularly important because it means that we will no longer resort to
      the vmalloc area for the struct memcg on defconfigs.  We also have
      enough room for an extra node and still be outside vmalloc.
      
      One also has to keep in mind that with the industry's ability to fit
      more processors in a die as fast as the FED prints money, a nodes = 2
      configuration is already respectably big.
      
      [akpm@linux-foundation.org: add check for invalid nid, remove inline]
      Signed-off-by: NGlauber Costa <glommer@parallels.com>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: NGreg Thelen <gthelen@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ying Han <yinghan@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      45cf7ebd
    • M
      mm: init: report on last-nid information stored in page->flags · a4e1b4c6
      Mel Gorman 提交于
      Answering the question "how much space remains in the page->flags" is
      time-consuming.  mminit_loglevel can help answer the question but it
      does not take last_nid information into account.  This patch corrects it
      and while there it corrects the messages related to page flag usage,
      pgshifts and node/zone id.  When applied the relevant output looks
      something like this but will depend on the kernel configuration.
      
        mminit::pageflags_layout_widths Section 0 Node 9 Zone 2 Lastnid 9 Flags 25
        mminit::pageflags_layout_shifts Section 19 Node 9 Zone 2 Lastnid 9
        mminit::pageflags_layout_pgshifts Section 0 Node 55 Zone 53 Lastnid 44
        mminit::pageflags_layout_nodezoneid Node/Zone ID: 64 -> 53
        mminit::pageflags_layout_usage location: 64 -> 44 layout 44 -> 25 unused 25 -> 0 page-flags
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a4e1b4c6
    • M
      mm: uninline page_xchg_last_nid() · 4468b8f1
      Mel Gorman 提交于
      Andrew Morton pointed out that page_xchg_last_nid() and
      reset_page_last_nid() were "getting nuttily large" and asked that it be
      investigated.
      
      reset_page_last_nid() is on the page free path and it would be
      unfortunate to make that path more expensive than it needs to be.  Due
      to the internal use of page_xchg_last_nid() it is already too expensive
      but fortunately, it should also be impossible for the page->flags to be
      updated in parallel when we call reset_page_last_nid().  Instead of
      unlining the function, it uses a simplier implementation that assumes no
      parallel updates and should now be sufficiently short for inlining.
      
      page_xchg_last_nid() is called in paths that are already quite expensive
      (splitting huge page, fault handling, migration) and it is reasonable to
      uninline.  There was not really a good place to place the function but
      mm/mmzone.c was the closest fit IMO.
      
      This patch saved 128 bytes of text in the vmlinux file for the kernel
      configuration I used for testing automatic NUMA balancing.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4468b8f1
    • M
      memcg: clean up swap accounting initialization code · 6acc8b02
      Michal Hocko 提交于
      Memcg swap accounting is currently enabled by enable_swap_cgroup when
      the root cgroup is created.  mem_cgroup_init acts as a memcg subsystem
      initializer which sounds like a much better place for enable_swap_cgroup
      as well.  We already register memsw files from there so it makes a lot
      of sense to merge those two into a single enable_swap_cgroup function.
      
      This patch doesn't introduce any semantic changes.
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Zhouping Liu <zliu@redhat.com>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: CAI Qian <caiqian@redhat.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6acc8b02
    • M
      memcg: do not create memsw files if swap accounting is disabled · 2d11085e
      Michal Hocko 提交于
      Zhouping Liu has reported that memsw files are exported even though swap
      accounting is runtime disabled if MEMCG_SWAP is enabled.  This behavior
      has been introduced by commit af36f906 ("memcg: always create memsw
      files if CGROUP_MEM_RES_CTLR_SWAP") and it causes any attempt to open
      the file to return EOPNOTSUPP.  Although EOPNOTSUPP should say be clear
      that memsw operations are not supported in the given configuration it is
      fair to say that this behavior could be quite confusing.
      
      Let's tear memsw files out of default cgroup files and add them only if
      the swap accounting is really enabled (either by MEMCG_SWAP_ENABLED or
      swapaccount=1 boot parameter).  We can hook into mem_cgroup_init which
      is called when the memcg subsystem is initialized and which happens
      after boot command line is processed.
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Reported-by: NZhouping Liu <zliu@redhat.com>
      Tested-by: NZhouping Liu <zliu@redhat.com>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: CAI Qian <caiqian@redhat.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2d11085e
    • P
      page-writeback.c: subtract min_free_kbytes from dirtyable memory · 75f7ad8e
      Paul Szabo 提交于
      When calculating amount of dirtyable memory, min_free_kbytes should be
      subtracted because it is not intended for dirty pages.
      
      Addresses http://bugs.debian.org/695182
      
      [akpm@linux-foundation.org: fix up min_free_kbytes extern declarations]
      [akpm@linux-foundation.org: fix min() warning]
      Signed-off-by: NPaul Szabo <psz@maths.usyd.edu.au>
      Acked-by: NRik van Riel <riel@redhat.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      75f7ad8e
    • K
      mm/rmap: rename anon_vma_unlock() => anon_vma_unlock_write() · 08b52706
      Konstantin Khlebnikov 提交于
      The comment in commit 4fc3f1d6 ("mm/rmap, migration: Make
      rmap_walk_anon() and try_to_unmap_anon() more scalable") says:
      
      | Rename anon_vma_[un]lock() => anon_vma_[un]lock_write(),
      | to make it clearer that it's an exclusive write-lock in
      | that case - suggested by Rik van Riel.
      
      But that commit renames only anon_vma_lock()
      Signed-off-by: NKonstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      08b52706
    • S
      swap: add per-partition lock for swapfile · ec8acf20
      Shaohua Li 提交于
      swap_lock is heavily contended when I test swap to 3 fast SSD (even
      slightly slower than swap to 2 such SSD).  The main contention comes
      from swap_info_get().  This patch tries to fix the gap with adding a new
      per-partition lock.
      
      Global data like nr_swapfiles, total_swap_pages, least_priority and
      swap_list are still protected by swap_lock.
      
      nr_swap_pages is an atomic now, it can be changed without swap_lock.  In
      theory, it's possible get_swap_page() finds no swap pages but actually
      there are free swap pages.  But sounds not a big problem.
      
      Accessing partition specific data (like scan_swap_map and so on) is only
      protected by swap_info_struct.lock.
      
      Changing swap_info_struct.flags need hold swap_lock and
      swap_info_struct.lock, because scan_scan_map() will check it.  read the
      flags is ok with either the locks hold.
      
      If both swap_lock and swap_info_struct.lock must be hold, we always hold
      the former first to avoid deadlock.
      
      swap_entry_free() can change swap_list.  To delete that code, we add a
      new highest_priority_index.  Whenever get_swap_page() is called, we
      check it.  If it's valid, we use it.
      
      It's a pity get_swap_page() still holds swap_lock().  But in practice,
      swap_lock() isn't heavily contended in my test with this patch (or I can
      say there are other much more heavier bottlenecks like TLB flush).  And
      BTW, looks get_swap_page() doesn't really need the lock.  We never free
      swap_info[] and we check SWAP_WRITEOK flag.  The only risk without the
      lock is we could swapout to some low priority swap, but we can quickly
      recover after several rounds of swap, so sounds not a big deal to me.
      But I'd prefer to fix this if it's a real problem.
      
      "swap: make each swap partition have one address_space" improved the
      swapout speed from 1.7G/s to 2G/s.  This patch further improves the
      speed to 2.3G/s, so around 15% improvement.  It's a multi-process test,
      so TLB flush isn't the biggest bottleneck before the patches.
      
      [arnd@arndb.de: fix it for nommu]
      [hughd@google.com: add missing unlock]
      [minchan@kernel.org: get rid of lockdep whinge on sys_swapon]
      Signed-off-by: NShaohua Li <shli@fusionio.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Seth Jennings <sjenning@linux.vnet.ibm.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Cc: Dan Magenheimer <dan.magenheimer@oracle.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Signed-off-by: NMinchan Kim <minchan@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ec8acf20
    • S
      swap: make each swap partition have one address_space · 33806f06
      Shaohua Li 提交于
      When I use several fast SSD to do swap, swapper_space.tree_lock is
      heavily contended.  This makes each swap partition have one
      address_space to reduce the lock contention.  There is an array of
      address_space for swap.  The swap entry type is the index to the array.
      
      In my test with 3 SSD, this increases the swapout throughput 20%.
      
      [akpm@linux-foundation.org: revert unneeded change to  __add_to_swap_cache]
      Signed-off-by: NShaohua Li <shli@fusionio.com>
      Cc: Hugh Dickins <hughd@google.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Acked-by: NMinchan Kim <minchan@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      33806f06
    • S
      mm: don't inline page_mapping() · 9800339b
      Shaohua Li 提交于
      According to akpm, this saves 1/2k text and makes things simple for the
      next patch.
      
      Numbers from Minchan:
      
      add/remove: 1/0 grow/shrink: 6/22 up/down: 92/-516 (-424)
      function                                     old     new   delta
      page_mapping                                   -      48     +48
      do_task_stat                                2292    2308     +16
      page_remove_rmap                             240     248      +8
      load_elf_binary                             4500    4508      +8
      update_queue                                 532     536      +4
      scsi_probe_and_add_lun                      2892    2896      +4
      lookup_fast                                  644     648      +4
      vcs_read                                    1040    1036      -4
      __ip_route_output_key                       1904    1900      -4
      ip_route_input_noref                        2508    2500      -8
      shmem_file_aio_read                          784     772     -12
      __isolate_lru_page                           272     256     -16
      shmem_replace_page                           708     688     -20
      mark_buffer_dirty                            228     208     -20
      __set_page_dirty_buffers                     240     220     -20
      __remove_mapping                             276     256     -20
      update_mmu_cache                             500     476     -24
      set_page_dirty_balance                        92      68     -24
      set_page_dirty                               172     148     -24
      page_evictable                                88      64     -24
      page_cache_pipe_buf_steal                    248     224     -24
      clear_page_dirty_for_io                      340     316     -24
      test_set_page_writeback                      400     372     -28
      test_clear_page_writeback                    516     488     -28
      invalidate_inode_page                        156     128     -28
      page_mkclean                                 432     400     -32
      flush_dcache_page                            360     328     -32
      __set_page_dirty_nobuffers                   324     280     -44
      shrink_page_list                            2412    2356     -56
      Signed-off-by: NShaohua Li <shli@fusionio.com>
      Suggested-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Hugh Dickins <hughd@google.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9800339b
    • H
      mm: numa: cleanup flow of transhuge page migration · 340ef390
      Hugh Dickins 提交于
      When correcting commit 04fa5d6a ("mm: migrate: check page_count of
      THP before migrating") Hugh Dickins noted that the control flow for
      transhuge migration was difficult to follow.  Unconditionally calling
      put_page() in numamigrate_isolate_page() made the failure paths of both
      migrate_misplaced_transhuge_page() and migrate_misplaced_page() more
      complex that they should be.  Further, he was extremely wary that an
      unlock_page() should ever happen after a put_page() even if the
      put_page() should never be the final put_page.
      
      Hugh implemented the following cleanup to simplify the path by calling
      putback_lru_page() inside numamigrate_isolate_page() if it failed to
      isolate and always calling unlock_page() within
      migrate_misplaced_transhuge_page().
      
      There is no functional change after this patch is applied but the code
      is easier to follow and unlock_page() always happens before put_page().
      
      [mgorman@suse.de: changelog only]
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Simon Jeons <simon.jeons@gmail.com>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      340ef390
    • P
      mm: fold page->_last_nid into page->flags where possible · 75980e97
      Peter Zijlstra 提交于
      page->_last_nid fits into page->flags on 64-bit.  The unlikely 32-bit
      NUMA configuration with NUMA Balancing will still need an extra page
      field.  As Peter notes "Completely dropping 32bit support for
      CONFIG_NUMA_BALANCING would simplify things, but it would also remove
      the warning if we grow enough 64bit only page-flags to push the last-cpu
      out."
      
      [mgorman@suse.de: minor modifications]
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Simon Jeons <simon.jeons@gmail.com>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      75980e97
    • P
      mm: move page flags layout to separate header · bbeae5b0
      Peter Zijlstra 提交于
      This is a preparation patch for moving page->_last_nid into page->flags
      that moves page flag layout information to a separate header.  This
      patch is necessary because otherwise there would be a circular
      dependency between mm_types.h and mm.h.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Simon Jeons <simon.jeons@gmail.com>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bbeae5b0
    • M
      mm: numa: handle side-effects in count_vm_numa_events() for !CONFIG_NUMA_BALANCING · 3c0ff468
      Mel Gorman 提交于
      The current definitions for count_vm_numa_events() is wrong for
      !CONFIG_NUMA_BALANCING as the following would miss the side-effect.
      
      	count_vm_numa_events(NUMA_FOO, bar++);
      
      There are no such users of count_vm_numa_events() but this patch fixes
      it as it is a potential pitfall.  Ideally both would be converted to
      static inline but NUMA_PTE_UPDATES is not defined if
      !CONFIG_NUMA_BALANCING and creating dummy constants just to have a
      static inline would be similarly clumsy.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Simon Jeons <simon.jeons@gmail.com>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3c0ff468
    • M
      mm: numa: take THP into account when migrating pages for NUMA balancing · 3abef4e6
      Mel Gorman 提交于
      Wanpeng Li pointed out that numamigrate_isolate_page() assumes that only
      one base page is being migrated when in fact it can also be checking
      THP.
      
      The consequences are that a migration will be attempted when a target
      node is nearly full and fail later.  It's unlikely to be user-visible
      but it should be fixed.  While we are there, migrate_balanced_pgdat()
      should treat nr_migrate_pages as an unsigned long as it is treated as a
      watermark.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Suggested-by: NWanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Simon Jeons <simon.jeons@gmail.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3abef4e6
    • M
      mm: numa: fix minor typo in numa_next_scan · 34f0315a
      Mel Gorman 提交于
      s/me/be/ and clarify the comment a bit when we're changing it anyway.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Suggested-by: NSimon Jeons <simon.jeons@gmail.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      34f0315a
    • K
    • M
      usb: forbid memory allocation with I/O during bus reset · 4d769def
      Ming Lei 提交于
      If one storage interface or usb network interface(iSCSI case) exists in
      current configuration, memory allocation with GFP_KERNEL during
      usb_device_reset() might trigger I/O transfer on the storage interface
      itself and cause deadlock because the 'us->dev_mutex' is held in
      .pre_reset() and the storage interface can't do I/O transfer when the
      reset is triggered by other interface, or the error handling can't be
      completed if the reset is triggered by the storage itself (error
      handling path).
      Signed-off-by: NMing Lei <ming.lei@canonical.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: David Decotigny <david.decotigny@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Alan Stern <stern@rowland.harvard.edu>
      Cc: Oliver Neukum <oneukum@suse.de>
      Reviewed-by: NJiri Kosina <jkosina@suse.cz>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
      Cc: Greg KH <greg@kroah.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4d769def
    • M
      pm / runtime: force memory allocation with no I/O during Runtime PM callbcack · db88175f
      Ming Lei 提交于
      Apply the introduced memalloc_noio_save() and memalloc_noio_restore() to
      force memory allocation with no I/O during runtime_resume/runtime_suspend
      callback on device with the flag of 'memalloc_noio' set.
      Signed-off-by: NMing Lei <ming.lei@canonical.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: David Decotigny <david.decotigny@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Alan Stern <stern@rowland.harvard.edu>
      Cc: Oliver Neukum <oneukum@suse.de>
      Cc: Jiri Kosina <jiri.kosina@suse.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
      Cc: Greg KH <greg@kroah.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      db88175f
    • M
      net/core: apply pm_runtime_set_memalloc_noio on network devices · 9802c8e2
      Ming Lei 提交于
      Deadlock might be caused by allocating memory with GFP_KERNEL in
      runtime_resume and runtime_suspend callback of network devices in iSCSI
      situation, so mark network devices and its ancestor as 'memalloc_noio'
      with the introduced pm_runtime_set_memalloc_noio().
      Signed-off-by: NMing Lei <ming.lei@canonical.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: David Decotigny <david.decotigny@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Alan Stern <stern@rowland.harvard.edu>
      Cc: Oliver Neukum <oneukum@suse.de>
      Cc: Jiri Kosina <jiri.kosina@suse.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
      Cc: Greg KH <greg@kroah.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9802c8e2
    • M
      block/genhd.c: apply pm_runtime_set_memalloc_noio on block devices · 25e823c8
      Ming Lei 提交于
      Apply the introduced pm_runtime_set_memalloc_noio on block device so
      that PM core will teach mm to not allocate memory with GFP_IOFS when
      calling the runtime_resume and runtime_suspend callback for block
      devices and its ancestors.
      Signed-off-by: NMing Lei <ming.lei@canonical.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Alan Stern <stern@rowland.harvard.edu>
      Cc: Oliver Neukum <oneukum@suse.de>
      Cc: Jiri Kosina <jiri.kosina@suse.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
      Cc: Greg KH <greg@kroah.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: David Decotigny <david.decotigny@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      25e823c8
    • M
      pm / runtime: introduce pm_runtime_set_memalloc_noio() · e823407f
      Ming Lei 提交于
      Introduce the flag memalloc_noio in 'struct dev_pm_info' to help PM core
      to teach mm not allocating memory with GFP_KERNEL flag for avoiding
      probable deadlock.
      
      As explained in the comment, any GFP_KERNEL allocation inside
      runtime_resume() or runtime_suspend() on any one of device in the path
      from one block or network device to the root device in the device tree
      may cause deadlock, the introduced pm_runtime_set_memalloc_noio() sets
      or clears the flag on device in the path recursively.
      Signed-off-by: NMing Lei <ming.lei@canonical.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Alan Stern <stern@rowland.harvard.edu>
      Cc: Oliver Neukum <oneukum@suse.de>
      Cc: Jiri Kosina <jiri.kosina@suse.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
      Cc: Greg KH <greg@kroah.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: David Decotigny <david.decotigny@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e823407f
    • M
      mm: teach mm by current context info to not do I/O during memory allocation · 21caf2fc
      Ming Lei 提交于
      This patch introduces PF_MEMALLOC_NOIO on process flag('flags' field of
      'struct task_struct'), so that the flag can be set by one task to avoid
      doing I/O inside memory allocation in the task's context.
      
      The patch trys to solve one deadlock problem caused by block device, and
      the problem may happen at least in the below situations:
      
      - during block device runtime resume, if memory allocation with
        GFP_KERNEL is called inside runtime resume callback of any one of its
        ancestors(or the block device itself), the deadlock may be triggered
        inside the memory allocation since it might not complete until the block
        device becomes active and the involed page I/O finishes.  The situation
        is pointed out first by Alan Stern.  It is not a good approach to
        convert all GFP_KERNEL[1] in the path into GFP_NOIO because several
        subsystems may be involved(for example, PCI, USB and SCSI may be
        involved for usb mass stoarage device, network devices involved too in
        the iSCSI case)
      
      - during block device runtime suspend, because runtime resume need to
        wait for completion of concurrent runtime suspend.
      
      - during error handling of usb mass storage deivce, USB bus reset will
        be put on the device, so there shouldn't have any memory allocation with
        GFP_KERNEL during USB bus reset, otherwise the deadlock similar with
        above may be triggered.  Unfortunately, any usb device may include one
        mass storage interface in theory, so it requires all usb interface
        drivers to handle the situation.  In fact, most usb drivers don't know
        how to handle bus reset on the device and don't provide .pre_set() and
        .post_reset() callback at all, so USB core has to unbind and bind driver
        for these devices.  So it is still not practical to resort to GFP_NOIO
        for solving the problem.
      
      Also the introduced solution can be used by block subsystem or block
      drivers too, for example, set the PF_MEMALLOC_NOIO flag before doing
      actual I/O transfer.
      
      It is not a good idea to convert all these GFP_KERNEL in the affected
      path into GFP_NOIO because these functions doing that may be implemented
      as library and will be called in many other contexts.
      
      In fact, memalloc_noio_flags() can convert some of current static
      GFP_NOIO allocation into GFP_KERNEL back in other non-affected contexts,
      at least almost all GFP_NOIO in USB subsystem can be converted into
      GFP_KERNEL after applying the approach and make allocation with GFP_NOIO
      only happen in runtime resume/bus reset/block I/O transfer contexts
      generally.
      
      [1], several GFP_KERNEL allocation examples in runtime resume path
      
      - pci subsystem
      acpi_os_allocate
      	<-acpi_ut_allocate
      		<-ACPI_ALLOCATE_ZEROED
      			<-acpi_evaluate_object
      				<-__acpi_bus_set_power
      					<-acpi_bus_set_power
      						<-acpi_pci_set_power_state
      							<-platform_pci_set_power_state
      								<-pci_platform_power_transition
      									<-__pci_complete_power_transition
      										<-pci_set_power_state
      											<-pci_restore_standard_config
      												<-pci_pm_runtime_resume
      - usb subsystem
      usb_get_status
      	<-finish_port_resume
      		<-usb_port_resume
      			<-generic_resume
      				<-usb_resume_device
      					<-usb_resume_both
      						<-usb_runtime_resume
      
      - some individual usb drivers
      usblp, uvc, gspca, most of dvb-usb-v2 media drivers, cpia2, az6007, ....
      
      That is just what I have found.  Unfortunately, this allocation can only
      be found by human being now, and there should be many not found since
      any function in the resume path(call tree) may allocate memory with
      GFP_KERNEL.
      Signed-off-by: NMing Lei <ming.lei@canonical.com>
      Signed-off-by: NMinchan Kim <minchan@kernel.org>
      Cc: Alan Stern <stern@rowland.harvard.edu>
      Cc: Oliver Neukum <oneukum@suse.de>
      Cc: Jiri Kosina <jiri.kosina@suse.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
      Cc: Greg KH <greg@kroah.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: David Decotigny <david.decotigny@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      21caf2fc
    • Z
      mm: don't wait on congested zones in balance_pgdat() · 258401a6
      Zlatko Calusic 提交于
      From: Zlatko Calusic <zlatko.calusic@iskon.hr>
      
      Commit 92df3a72 ("mm: vmscan: throttle reclaim if encountering too
      many dirty pages under writeback") introduced waiting on congested zones
      based on a sane algorithm in shrink_inactive_list().
      
      What this means is that there's no more need for throttling and
      additional heuristics in balance_pgdat().  So, let's remove it and tidy
      up the code.
      Signed-off-by: NZlatko Calusic <zlatko.calusic@iskon.hr>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      258401a6
    • N
      mm/memory-failure.c: fix wrong num_poisoned_pages in handling memory error on thp · 4db0e950
      Naoya Horiguchi 提交于
      num_poisoned_pages counts up the number of pages isolated by memory
      errors.  But for thp, only one subpage is isolated because memory error
      handler splits it, so it's wrong to add (1 << compound_trans_order).
      
      [akpm@linux-foundation.org: tweak comment]
      Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4db0e950
    • N
      mm/memory-failure.c: clean up soft_offline_page() · af8fae7c
      Naoya Horiguchi 提交于
      Currently soft_offline_page() is hard to maintain because it has many
      return points and goto statements.  All of this mess come from
      get_any_page().
      
      This function should only get page refcount as the name implies, but it
      does some page isolating actions like SetPageHWPoison() and dequeuing
      hugepage.  This patch corrects it and introduces some internal
      subroutines to make soft offlining code more readable and maintainable.
      Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reviewed-by: NAndi Kleen <andi@firstfloor.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      af8fae7c
    • X
      memory-failure: use num_poisoned_pages instead of mce_bad_pages · 293c07e3
      Xishi Qiu 提交于
      Since MCE is an x86 concept, and this code is in mm/, it would be better
      to use the name num_poisoned_pages instead of mce_bad_pages.
      
      [akpm@linux-foundation.org: fix mm/sparse.c]
      Signed-off-by: NXishi Qiu <qiuxishi@huawei.com>
      Signed-off-by: NJiang Liu <jiang.liu@huawei.com>
      Suggested-by: NBorislav Petkov <bp@alien8.de>
      Reviewed-by: NWanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      293c07e3
    • X
      memory-failure: do code refactor of soft_offline_page() · fa8dd8a9
      Xishi Qiu 提交于
      There are too many return points randomly intermingled with some "goto
      done" return points.  So adjust the function structure, one for the
      success path, the other for the failure path.  Use atomic_long_inc
      instead of atomic_long_add.
      Signed-off-by: NXishi Qiu <qiuxishi@huawei.com>
      Signed-off-by: NJiang Liu <jiang.liu@huawei.com>
      Suggested-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fa8dd8a9
    • X
      memory-failure: fix an error of mce_bad_pages statistics · 0ebff32c
      Xishi Qiu 提交于
      When doing
      
          $ echo paddr > /sys/devices/system/memory/soft_offline_page
      
      to offline a *free* page, the value of mce_bad_pages will be added, and
      the page is set HWPoison flag, but it is still managed by page buddy
      alocator.
      
         $ cat /proc/meminfo | grep HardwareCorrupted
      
      shows the value.
      
      If we offline the same page, the value of mce_bad_pages will be added
      *again*, this means the value is incorrect now.  Assume the page is
      still free during this short time.
      
        soft_offline_page()
          get_any_page()
            "else if (is_free_buddy_page(p))" branch return 0
              "goto done";
                 "atomic_long_add(1, &mce_bad_pages);"
      
      This patch:
      
      Move poisoned page check at the beginning of the function in order to
      fix the error.
      Signed-off-by: NXishi Qiu <qiuxishi@huawei.com>
      Signed-off-by: NJiang Liu <jiang.liu@huawei.com>
      Tested-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0ebff32c
    • M
      mm: remove MIGRATE_ISOLATE check in hotpath · 194159fb
      Minchan Kim 提交于
      Several functions test MIGRATE_ISOLATE and some of those are hotpath but
      MIGRATE_ISOLATE is used only if we enable CONFIG_MEMORY_ISOLATION(ie,
      CMA, memory-hotplug and memory-failure) which are not common config
      option.  So let's not add unnecessary overhead and code when we don't
      enable CONFIG_MEMORY_ISOLATION.
      Signed-off-by: NMinchan Kim <minchan@kernel.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: NMichal Nazarewicz <mina86@mina86.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      194159fb
    • J
      mm: increase totalram_pages when free pages allocated by bootmem allocator · c60514b6
      Jiang Liu 提交于
      Function put_page_bootmem() is used to free pages allocated by bootmem
      allocator, so it should increase totalram_pages when freeing pages into
      the buddy system.
      Signed-off-by: NJiang Liu <jiang.liu@huawei.com>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Cc: Maciej Rutecki <maciej.rutecki@gmail.com>
      Cc: Chris Clayton <chris2553@googlemail.com>
      Cc: "Rafael J . Wysocki" <rjw@sisk.pl>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Jianguo Wu <wujianguo@huawei.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c60514b6
    • J
      mm: set zone->present_pages to number of existing pages in the zone · 306f2e9e
      Jiang Liu 提交于
      Now all users of "number of pages managed by the buddy system" have been
      converted to use zone->managed_pages, so set zone->present_pages to what
      it should be:
      
      	present_pages = spanned_pages - absent_pages;
      Signed-off-by: NJiang Liu <jiang.liu@huawei.com>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Cc: Maciej Rutecki <maciej.rutecki@gmail.com>
      Cc: Chris Clayton <chris2553@googlemail.com>
      Cc: "Rafael J . Wysocki" <rjw@sisk.pl>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Jianguo Wu <wujianguo@huawei.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      306f2e9e
    • J
      mm: use zone->present_pages instead of zone->managed_pages where appropriate · b40da049
      Jiang Liu 提交于
      Now we have zone->managed_pages for "pages managed by the buddy system
      in the zone", so replace zone->present_pages with zone->managed_pages if
      what the user really wants is number of allocatable pages.
      Signed-off-by: NJiang Liu <jiang.liu@huawei.com>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Cc: Maciej Rutecki <maciej.rutecki@gmail.com>
      Cc: Chris Clayton <chris2553@googlemail.com>
      Cc: "Rafael J . Wysocki" <rjw@sisk.pl>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Jianguo Wu <wujianguo@huawei.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b40da049
    • T
      mm/memblock.c: use CONFIG_HAVE_MEMBLOCK_NODE_MAP to protect movablecore_map in... · f7210e6c
      Tang Chen 提交于
      mm/memblock.c: use CONFIG_HAVE_MEMBLOCK_NODE_MAP to protect movablecore_map in memblock_overlaps_region().
      
      The definition of struct movablecore_map is protected by
      CONFIG_HAVE_MEMBLOCK_NODE_MAP but its use in memblock_overlaps_region()
      is not.  So add CONFIG_HAVE_MEMBLOCK_NODE_MAP to protect the use of
      movablecore_map in memblock_overlaps_region().
      Signed-off-by: NTang Chen <tangchen@cn.fujitsu.com>
      Reported-by: NStephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f7210e6c
    • T
      acpi, memory-hotplug: support getting hotplug info from SRAT · 01a178a9
      Tang Chen 提交于
      We now provide an option for users who don't want to specify physical
      memory address in kernel commandline.
      
               /*
                * For movablemem_map=acpi:
                *
                * SRAT:                |_____| |_____| |_________| |_________| ......
                * node id:                0       1         1           2
                * hotpluggable:           n       y         y           n
                * movablemem_map:              |_____| |_________|
                *
                * Using movablemem_map, we can prevent memblock from allocating memory
                * on ZONE_MOVABLE at boot time.
                */
      
      So user just specify movablemem_map=acpi, and the kernel will use
      hotpluggable info in SRAT to determine which memory ranges should be set
      as ZONE_MOVABLE.
      
      If all the memory ranges in SRAT is hotpluggable, then no memory can be
      used by kernel.  But before parsing SRAT, memblock has already reserve
      some memory ranges for other purposes, such as for kernel image, and so
      on.  We cannot prevent kernel from using these memory.  So we need to
      exclude these ranges even if these memory is hotpluggable.
      
      Furthermore, there could be several memory ranges in the single node
      which the kernel resides in.  We may skip one range that have memory
      reserved by memblock, but if the rest of memory is too small, then the
      kernel will fail to boot.  So, make the whole node which the kernel
      resides in un-hotpluggable.  Then the kernel has enough memory to use.
      
      NOTE: Using this way will cause NUMA performance down because the
            whole node will be set as ZONE_MOVABLE, and kernel cannot use memory
            on it.  If users don't want to lose NUMA performance, just don't use
            it.
      
      [akpm@linux-foundation.org: fix warning]
      [akpm@linux-foundation.org: use strcmp()]
      Signed-off-by: NTang Chen <tangchen@cn.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Cc: Jianguo Wu <wujianguo@huawei.com>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Wu Jianguo <wujianguo@huawei.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Len Brown <lenb@kernel.org>
      Cc: "Brown, Len" <len.brown@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      01a178a9
    • T
      acpi, memory-hotplug: extend movablemem_map ranges to the end of node · 27168d38
      Tang Chen 提交于
      When implementing movablemem_map boot option, we introduced an array
      movablemem_map.map[] to store the memory ranges to be set as
      ZONE_MOVABLE.
      
      Since ZONE_MOVABLE is the latst zone of a node, if user didn't specify
      the whole node memory range, we need to extend it to the node end so
      that we can use it to prevent memblock from allocating memory in the
      ranges user didn't specify.
      
      We now implement movablemem_map boot option like this:
      
              /*
               * For movablemem_map=nn[KMG]@ss[KMG]:
               *
               * SRAT:                |_____| |_____| |_________| |_________| ......
               * node id:                0       1         1           2
               * user specified:                |__|                 |___|
               * movablemem_map:                |___| |_________|    |______| ......
               *
               * Using movablemem_map, we can prevent memblock from allocating memory
               * on ZONE_MOVABLE at boot time.
               *
               * NOTE: In this case, SRAT info will be ingored.
               */
      
      [akpm@linux-foundation.org: clean up code, fix build warning]
      Signed-off-by: NTang Chen <tangchen@cn.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Cc: Jianguo Wu <wujianguo@huawei.com>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Wu Jianguo <wujianguo@huawei.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Len Brown <lenb@kernel.org>
      Cc: "Brown, Len" <len.brown@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      27168d38
    • T
      acpi, memory-hotplug: parse SRAT before memblock is ready · e8d19552
      Tang Chen 提交于
      On linux, the pages used by kernel could not be migrated.  As a result,
      if a memory range is used by kernel, it cannot be hot-removed.  So if we
      want to hot-remove memory, we should prevent kernel from using it.
      
      The way now used to prevent this is specify a memory range by
      movablemem_map boot option and set it as ZONE_MOVABLE.
      
      But when the system is booting, memblock will allocate memory, and
      reserve the memory for kernel.  And before we parse SRAT, and know the
      node memory ranges, memblock is working.  And it may allocate memory in
      ranges to be set as ZONE_MOVABLE.  This memory can be used by kernel,
      and never be freed.
      
      So, let's parse SRAT before memblock is called first.  And it is early
      enough.
      
      The first call of memblock_find_in_range_node() is in:
      
        setup_arch()
          |-->setup_real_mode()
      
      so, this patch add a function early_parse_srat() to parse SRAT, and call
      it before setup_real_mode() is called.
      
      NOTE:
      
      1) early_parse_srat() is called before numa_init(), and has initialized
         numa_meminfo.  So DO NOT clear numa_nodes_parsed in numa_init() and DO
         NOT zero numa_meminfo in numa_init(), otherwise we will lose memory
         numa info.
      
      2) I don't know why using count of memory affinities parsed from SRAT
         as a return value in original acpi_numa_init().  So I add a static
         variable srat_mem_cnt to remember this count and use it as the return
         value of the new acpi_numa_init()
      
      [mhocko@suse.cz: parse SRAT before memblock is ready fix]
      Signed-off-by: NTang Chen <tangchen@cn.fujitsu.com>
      Reviewed-by: NWen Congyang <wency@cn.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Cc: Jianguo Wu <wujianguo@huawei.com>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Wu Jianguo <wujianguo@huawei.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Len Brown <lenb@kernel.org>
      Cc: "Brown, Len" <len.brown@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e8d19552
    • T
      page_alloc: bootmem limit with movablecore_map · fb06bc8e
      Tang Chen 提交于
      Ensure the bootmem will not allocate memory from areas that may be
      ZONE_MOVABLE.  The map info is from movablecore_map boot option.
      Signed-off-by: NTang Chen <tangchen@cn.fujitsu.com>
      Reviewed-by: NWen Congyang <wency@cn.fujitsu.com>
      Reviewed-by: NLai Jiangshan <laijs@cn.fujitsu.com>
      Tested-by: NLin Feng <linfeng@cn.fujitsu.com>
      Cc: Wu Jianguo <wujianguo@huawei.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fb06bc8e
    • T
      page_alloc: make movablemem_map have higher priority · 42f47e27
      Tang Chen 提交于
      If kernelcore or movablecore is specified at the same time with
      movablemem_map, movablemem_map will have higher priority to be
      satisfied.  This patch will make find_zone_movable_pfns_for_nodes()
      calculate zone_movable_pfn[] with the limit from zone_movable_limit[].
      Signed-off-by: NTang Chen <tangchen@cn.fujitsu.com>
      Reviewed-by: NWen Congyang <wency@cn.fujitsu.com>
      Cc: Wu Jianguo <wujianguo@huawei.com>
      Reviewed-by: NLai Jiangshan <laijs@cn.fujitsu.com>
      Tested-by: NLin Feng <linfeng@cn.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      42f47e27