1. 08 Aug 2020, 9 commits
    • mm/page_alloc.c: extract the common part in pfn_to_bitidx() · 399b795b
      Committed by Wei Yang
      The return value calculation is the same whether or not SPARSEMEM is
      enabled.

      Just factor it out.
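
      A minimal sketch of the resulting helper in mm/page_alloc.c (slightly
      abbreviated; see the patch itself for the exact code):

         static inline int pfn_to_bitidx(struct page *page, unsigned long pfn)
         {
         #ifdef CONFIG_SPARSEMEM
         	pfn &= (PAGES_PER_SECTION - 1);
         #else
         	pfn = pfn - round_down(page_zone(page)->zone_start_pfn,
         			       pageblock_nr_pages);
         #endif /* CONFIG_SPARSEMEM */
         	/* The common part, previously duplicated in both branches. */
         	return (pfn >> pageblock_order) * NR_PAGEBLOCK_BITS;
         }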
      Signed-off-by: Wei Yang <richard.weiyang@linux.alibaba.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Link: http://lkml.kernel.org/r/20200623124201.8199-2-richard.weiyang@linux.alibaba.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/page_alloc: remove nr_free_pagecache_pages() · 56b9413b
      Committed by David Hildenbrand
      nr_free_pagecache_pages() isn't used outside page_alloc.c anymore - and
      the name does not really help to understand what's going on.  Let's
      open-code it instead and add a comment.
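
      For reference, the removed helper was a thin wrapper, so the
      open-coded call in build_all_zonelists() amounts to something like
      this (a sketch; the comment wording is illustrative, not the patch's):

         /* Was: vm_total_pages = nr_free_pagecache_pages(); */
         /*
          * All pages of the system reachable with a usable GFP mask:
          * roughly, everything that can hold pagecache pages.
          */
         vm_total_pages = nr_free_zone_pages(gfp_zone(GFP_HIGHUSER_MOVABLE));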
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
      Reviewed-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Huang Ying <ying.huang@intel.com>
      Link: http://lkml.kernel.org/r/20200619132410.23859-3-david@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: remove vm_total_pages · 0a18e607
      Committed by David Hildenbrand
      The global variable "vm_total_pages" is a relic from older days.  There is
      only a single user that reads the variable - build_all_zonelists() - and
      the first thing it does is update it.
      
      Use a local variable in build_all_zonelists() instead and remove the
      global variable.
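
      A sketch of the resulting function, with the zonelist rebuilding and
      the exact log message elided:

         void __ref build_all_zonelists(pg_data_t *pgdat)
         {
         	unsigned long vm_total_pages;	/* now local, no longer global */

         	/* ... rebuild the zonelists for all nodes ... */

         	/* Free pages usable for pagecache across all zones. */
         	vm_total_pages = nr_free_zone_pages(gfp_zone(GFP_HIGHUSER_MOVABLE));

         	pr_info("Built %u zonelists. Total pages: %ld\n",
         		nr_online_nodes, vm_total_pages);
         }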
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
      Reviewed-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Link: http://lkml.kernel.org/r/20200619132410.23859-2-david@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, page_alloc: skip ->watermark_boost for atomic order-0 allocations · f80b08fc
      Committed by Charan Teja Reddy
      When boosting is enabled, the rate of atomic order-0 allocation
      failures is observed to be high, because free levels in the system are
      checked with the ->watermark_boost offset applied.  This is not a
      problem for sleepable allocations, but for atomic allocations it looks
      like a regression.
      
      This problem is seen frequently on an Android kernel running on
      Snapdragon hardware with 4GB of RAM.  When no extfrag event has
      occurred in the system, the ->watermark_boost factor is zero, and the
      watermark configuration is:
      
         _watermark = (
                [WMARK_MIN] = 1272, --> ~5MB
                [WMARK_LOW] = 9067, --> ~36MB
                [WMARK_HIGH] = 9385), --> ~38MB
         watermark_boost = 0
      
      After launching some memory-hungry applications in Android that cause
      enough extfrag events, ->watermark_boost can reach its maximum, i.e.
      the default boost factor raises it to 150% of the high watermark:
      
         _watermark = (
                [WMARK_MIN] = 1272, --> ~5MB
                [WMARK_LOW] = 9067, --> ~36MB
                [WMARK_HIGH] = 9385), --> ~38MB
         watermark_boost = 14077, -->~57MB
      
      With the default system configuration, ~2MB of free memory suffices
      for an atomic order-0 allocation to succeed.  But boosting raises the
      min watermark to ~61MB, so for an atomic order-0 allocation to succeed
      the system needs a minimum of ~23MB of free memory (from the
      calculations in zone_watermark_ok(), min = 3/4 * (min/2)).  Failures
      are therefore observed even though the system has ~20MB of free
      memory.  In testing, this is reproducible as early as the first 300
      seconds after boot, and on lower-RAM configurations (<2GB) as early as
      the first 150 seconds.
      
      These failures can be avoided by excluding ->watermark_boost from the
      watermark calculations for atomic order-0 allocations.
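
      Concretely, this adds an early check to zone_watermark_fast() along
      these lines (a sketch of the patched logic, variable names as in that
      function):

         if (unlikely(!order && (gfp_mask & __GFP_ATOMIC) &&
         	     z->watermark_boost &&
         	     ((alloc_flags & ALLOC_WMARK_MASK) == WMARK_MIN))) {
         	/* Re-check against the min watermark, ignoring the boost. */
         	mark = z->_watermark[WMARK_MIN];
         	return __zone_watermark_ok(z, order, mark, highest_zoneidx,
         				   alloc_flags, free_pages);
         }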
      
      [akpm@linux-foundation.org: fix comment grammar, reflow comment]
      [charante@codeaurora.org: fix suggested by Mel Gorman]
        Link: http://lkml.kernel.org/r/31556793-57b1-1c21-1a9d-22674d9bd938@codeaurora.org
      Signed-off-by: Charan Teja Reddy <charante@codeaurora.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Vinayak Menon <vinmenon@codeaurora.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Link: http://lkml.kernel.org/r/1589882284-21010-1-git-send-email-charante@codeaurora.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • page_alloc: consider highatomic reserve in watermark fast · f27ce0e1
      Committed by Jaewon Kim
      zone_watermark_fast() was introduced by commit 48ee5f36 ("mm,
      page_alloc: shortcut watermark checks for order-0 pages").  It simply
      checks whether the number of free pages is above the watermark,
      without additional calculations such as reducing the watermark.
      
      It considered free CMA pages, but it did not consider the highatomic
      reserve.  This can exhaust the free pages, leaving only the high-order
      atomic free pages.
      
      Assume the reserved_highatomic pageblock is bigger than the min
      watermark and there are only a few free pages apart from the
      high-order atomic free pages.  Because zone_watermark_fast() passes
      the allocation without accounting for the high-order atomic free
      pages, a normal reclaimable allocation like GFP_HIGHUSER will consume
      all the free pages.  An order-0 atomic allocation may then fail.
      
      This means the min watermark is not protected against non-atomic
      allocations: an order-0 atomic allocation with ALLOC_HARDER can
      unexpectedly fail, and a __GFP_MEMALLOC allocation with
      ALLOC_NO_WATERMARKS can fail as well.
      
      To avoid the problem, zone_watermark_fast() should consider the
      highatomic reserve.  If the actual amount of high-order atomic free
      memory were counted accurately, like free CMA memory is, we could use
      that; this patch just uses nr_reserved_highatomic.  Additionally,
      introduce __zone_watermark_unusable_free() to factor out the parts
      common to zone_watermark_fast() and __zone_watermark_ok().
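
      The factored-out helper looks roughly like this:

         static inline long __zone_watermark_unusable_free(struct zone *z,
         				unsigned int order, unsigned int alloc_flags)
         {
         	const bool alloc_harder = (alloc_flags & (ALLOC_HARDER | ALLOC_OOM));
         	long unusable_free = (1 << order) - 1;

         	/*
         	 * If the caller has no ALLOC_HARDER rights, treat the whole
         	 * highatomic reserve as unusable.  This over-estimates the
         	 * reserve but avoids searching the free lists.
         	 */
         	if (likely(!alloc_harder))
         		unusable_free += z->nr_reserved_highatomic;

         #ifdef CONFIG_CMA
         	/* If the allocation can't use CMA areas, don't use free CMA pages. */
         	if (!(alloc_flags & ALLOC_CMA))
         		unusable_free += zone_page_state(z, NR_FREE_CMA_PAGES);
         #endif

         	return unusable_free;
         }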
      
      This is an example of an ALLOC_HARDER allocation failure on a
      v4.19-based kernel.
      
       Binder:9343_3: page allocation failure: order:0, mode:0x480020(GFP_ATOMIC), nodemask=(null)
       Call trace:
       [<ffffff8008f40f8c>] dump_stack+0xb8/0xf0
       [<ffffff8008223320>] warn_alloc+0xd8/0x12c
       [<ffffff80082245e4>] __alloc_pages_nodemask+0x120c/0x1250
       [<ffffff800827f6e8>] new_slab+0x128/0x604
       [<ffffff800827b0cc>] ___slab_alloc+0x508/0x670
       [<ffffff800827ba00>] __kmalloc+0x2f8/0x310
       [<ffffff80084ac3e0>] context_struct_to_string+0x104/0x1cc
       [<ffffff80084ad8fc>] security_sid_to_context_core+0x74/0x144
       [<ffffff80084ad880>] security_sid_to_context+0x10/0x18
       [<ffffff800849bd80>] selinux_secid_to_secctx+0x20/0x28
       [<ffffff800849109c>] security_secid_to_secctx+0x3c/0x70
       [<ffffff8008bfe118>] binder_transaction+0xe68/0x454c
       Mem-Info:
       active_anon:102061 inactive_anon:81551 isolated_anon:0
        active_file:59102 inactive_file:68924 isolated_file:64
        unevictable:611 dirty:63 writeback:0 unstable:0
        slab_reclaimable:13324 slab_unreclaimable:44354
        mapped:83015 shmem:4858 pagetables:26316 bounce:0
        free:2727 free_pcp:1035 free_cma:178
       Node 0 active_anon:408244kB inactive_anon:326204kB active_file:236408kB inactive_file:275696kB unevictable:2444kB isolated(anon):0kB isolated(file):256kB mapped:332060kB dirty:252kB writeback:0kB shmem:19432kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
       Normal free:10908kB min:6192kB low:44388kB high:47060kB active_anon:409160kB inactive_anon:325924kB active_file:235820kB inactive_file:276628kB unevictable:2444kB writepending:252kB present:3076096kB managed:2673676kB mlocked:2444kB kernel_stack:62512kB pagetables:105264kB bounce:0kB free_pcp:4140kB local_pcp:40kB free_cma:712kB
       lowmem_reserve[]: 0 0
       Normal: 505*4kB (H) 357*8kB (H) 201*16kB (H) 65*32kB (H) 1*64kB (H) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 10236kB
       138826 total pagecache pages
       5460 pages in swap cache
       Swap cache stats: add 8273090, delete 8267506, find 1004381/4060142
      
      This is an example of an ALLOC_NO_WATERMARKS allocation failure on a
      v4.14-based kernel.
      
       kswapd0: page allocation failure: order:0, mode:0x140000a(GFP_NOIO|__GFP_HIGHMEM|__GFP_MOVABLE), nodemask=(null)
       kswapd0 cpuset=/ mems_allowed=0
       CPU: 4 PID: 1221 Comm: kswapd0 Not tainted 4.14.113-18770262-userdebug #1
       Call trace:
       [<0000000000000000>] dump_backtrace+0x0/0x248
       [<0000000000000000>] show_stack+0x18/0x20
       [<0000000000000000>] __dump_stack+0x20/0x28
       [<0000000000000000>] dump_stack+0x68/0x90
       [<0000000000000000>] warn_alloc+0x104/0x198
       [<0000000000000000>] __alloc_pages_nodemask+0xdc0/0xdf0
       [<0000000000000000>] zs_malloc+0x148/0x3d0
       [<0000000000000000>] zram_bvec_rw+0x410/0x798
       [<0000000000000000>] zram_rw_page+0x88/0xdc
       [<0000000000000000>] bdev_write_page+0x70/0xbc
       [<0000000000000000>] __swap_writepage+0x58/0x37c
       [<0000000000000000>] swap_writepage+0x40/0x4c
       [<0000000000000000>] shrink_page_list+0xc30/0xf48
       [<0000000000000000>] shrink_inactive_list+0x2b0/0x61c
       [<0000000000000000>] shrink_node_memcg+0x23c/0x618
       [<0000000000000000>] shrink_node+0x1c8/0x304
       [<0000000000000000>] kswapd+0x680/0x7c4
       [<0000000000000000>] kthread+0x110/0x120
       [<0000000000000000>] ret_from_fork+0x10/0x18
       Mem-Info:
        active_anon:111826 inactive_anon:65557 isolated_anon:0
         active_file:44260 inactive_file:83422 isolated_file:0
         unevictable:4158 dirty:117 writeback:0 unstable:0
         slab_reclaimable:13943 slab_unreclaimable:43315
         mapped:102511 shmem:3299 pagetables:19566 bounce:0
         free:3510 free_pcp:553 free_cma:0
        Node 0 active_anon:447304kB inactive_anon:262228kB active_file:177040kB inactive_file:333688kB unevictable:16632kB isolated(anon):0kB isolated(file):0kB mapped:410044kB dirty:468kB writeback:0kB shmem:13196kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
        Normal free:14040kB min:7440kB low:94500kB high:98136kB reserved_highatomic:32768KB active_anon:447336kB inactive_anon:261668kB active_file:177572kB inactive_file:333768kB unevictable:16632kB writepending:480kB present:4081664kB managed:3637088kB mlocked:16632kB kernel_stack:47072kB pagetables:78264kB bounce:0kB free_pcp:2280kB local_pcp:720kB free_cma:0kB
        lowmem_reserve[]: 0 0
       Normal: 860*4kB (H) 453*8kB (H) 180*16kB (H) 26*32kB (H) 34*64kB (H) 6*128kB (H) 2*256kB (H) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 14232kB
      
      This is a trace log showing GFP_HIGHUSER consuming free pages right
      before the ALLOC_NO_WATERMARKS failure.
      
        <...>-22275 [006] ....   889.213383: mm_page_alloc: page=00000000d2be5665 pfn=970744 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO
        <...>-22275 [006] ....   889.213385: mm_page_alloc: page=000000004b2335c2 pfn=970745 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO
        <...>-22275 [006] ....   889.213387: mm_page_alloc: page=00000000017272e1 pfn=970278 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO
        <...>-22275 [006] ....   889.213389: mm_page_alloc: page=00000000c4be79fb pfn=970279 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO
        <...>-22275 [006] ....   889.213391: mm_page_alloc: page=00000000f8a51d4f pfn=970260 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO
        <...>-22275 [006] ....   889.213393: mm_page_alloc: page=000000006ba8f5ac pfn=970261 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO
        <...>-22275 [006] ....   889.213395: mm_page_alloc: page=00000000819f1cd3 pfn=970196 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO
        <...>-22275 [006] ....   889.213396: mm_page_alloc: page=00000000f6b72a64 pfn=970197 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO
      kswapd0-1207  [005] ...1   889.213398: mm_page_alloc: page= (null) pfn=0 order=0 migratetype=1 nr_free=3650 gfp_flags=GFP_NOWAIT|__GFP_HIGHMEM|__GFP_NOWARN|__GFP_MOVABLE
      
      [jaewon31.kim@samsung.com: remove redundant code for high-order]
        Link: http://lkml.kernel.org/r/20200623035242.27232-1-jaewon31.kim@samsung.com
      Reported-by: Yong-Taek Lee <ytk.lee@samsung.com>
      Suggested-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Jaewon Kim <jaewon31.kim@samsung.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Baoquan He <bhe@redhat.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Yong-Taek Lee <ytk.lee@samsung.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Link: http://lkml.kernel.org/r/20200619235958.11283-1-jaewon31.kim@samsung.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, page_alloc: use unlikely() in task_capc() · deba0487
      Committed by Vlastimil Babka
      Hugh noted that task_capc() could use unlikely(), as most of the time
      there is no capture in progress and we are in the page freeing hot
      path.  Indeed, adding unlikely() produces assembly that better matches
      the assumption and moves all the tests away from the hot path.

      I have also noticed that we don't need to test for
      cc->direct_compaction, as the only place we set
      current->capture_control is compact_zone_order(), which also always
      sets cc->direct_compaction true.
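
      The resulting function is roughly:

         static inline struct capture_control *task_capc(struct zone *zone)
         {
         	struct capture_control *capc = current->capture_control;

         	/* Most of the time there is no capture in progress. */
         	return unlikely(capc) &&
         		!(current->flags & PF_KTHREAD) &&
         		!capc->page &&
         		capc->cc->zone == zone ? capc : NULL;
         }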
      Suggested-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Hugh Dickins <hughd@google.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Li Wang <liwang@redhat.com>
      Link: http://lkml.kernel.org/r/4a24f7af-3aa5-6e80-4ae6-8f253b562039@suse.cz
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/sparse: cleanup the code surrounding memory_present() · c89ab04f
      Committed by Mike Rapoport
      After the removal of CONFIG_HAVE_MEMBLOCK_NODE_MAP we have two
      equivalent functions that call memory_present() for each region in
      memblock.memory: sparse_memory_present_with_active_regions() and
      memblocks_present().

      Moreover, all architectures have a call to either of these functions
      preceding the call to sparse_init(), and in most cases they are called
      one after the other.

      Mark the regions from memblock.memory as present during sparse_init()
      by making sparse_init() call memblocks_present(), make the
      memblocks_present() and memory_present() functions static, and remove
      the redundant sparse_memory_present_with_active_regions() function.

      Also remove the no-longer-required HAVE_MEMORY_PRESENT configuration
      option.
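
      Schematically, the change folds the call into sparse_init() (a sketch;
      the rest of the function is unchanged and elided):

         /* mm/sparse.c -- now the only caller of memblocks_present() */
         void __init sparse_init(void)
         {
         	/*
         	 * Previously every architecture called memblocks_present() or
         	 * sparse_memory_present_with_active_regions() before this.
         	 */
         	memblocks_present();

         	/* ... existing section and memmap setup continues here ... */
         }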
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20200712083130.22919-1-rppt@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcontrol: account kernel stack per node · 991e7673
      Committed by Shakeel Butt
      Currently the kernel stack is accounted per-zone.  There is no need to
      do that.  In addition, because the accounting is per-zone, memcg has
      to keep a separate MEMCG_KERNEL_STACK_KB.  Make the stat per-node and
      deprecate MEMCG_KERNEL_STACK_KB, since memcg_stat_item is an extension
      of node_stat_item.  Also localize the kernel stack stat updates to
      account_kernel_stack().
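
      A sketch of the localized accounting (roughly as in the patch; all
      stack pages of a task are on the same node, so one page suffices to
      find the right lruvec):

         static void account_kernel_stack(struct task_struct *tsk, int account)
         {
         	void *stack = task_stack_page(tsk);
         	struct vm_struct *vm = task_stack_vm_area(tsk);

         	/* All stack pages are in the same node. */
         	if (vm)
         		mod_lruvec_page_state(vm->pages[0], NR_KERNEL_STACK_KB,
         				      account * (THREAD_SIZE / 1024));
         	else
         		mod_lruvec_slab_state(stack, NR_KERNEL_STACK_KB,
         				      account * (THREAD_SIZE / 1024));
         }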
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Roman Gushchin <guro@fb.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Link: http://lkml.kernel.org/r/20200630161539.1759185-1-shakeelb@google.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcg: convert vmstat slab counters to bytes · d42f3245
      Committed by Roman Gushchin
      In order to prepare for per-object slab memory accounting, convert
      NR_SLAB_RECLAIMABLE and NR_SLAB_UNRECLAIMABLE vmstat items to bytes.
      
      To make it obvious, rename them to NR_SLAB_RECLAIMABLE_B and
      NR_SLAB_UNRECLAIMABLE_B (similar to NR_KERNEL_STACK_KB).
      
      Internally, global and per-node counters are stored in pages, while
      memcg and lruvec counters are stored in bytes.  This scheme may look
      odd, but only for now: once slab pages are shared between multiple
      cgroups, global and node counters will reflect the total number of
      slab pages, while memcg and lruvec counters will be used for per-memcg
      slab memory tracking, which accounts individual kernel objects.
      Keeping the global and node counters in pages helps avoid additional
      overhead.

      The size of slab memory shouldn't exceed 4GB on 32-bit machines, so it
      fits into the atomic_long_t used for vmstats.
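
      The page/byte split is captured by a predicate plus paired accessors;
      a sketch based on this series:

         /* Only the slab counters are byte-sized (at the API level) for now. */
         static __always_inline bool vmstat_item_in_bytes(int idx)
         {
         	return (idx == NR_SLAB_RECLAIMABLE_B ||
         		idx == NR_SLAB_UNRECLAIMABLE_B);
         }

         /*
          * Global and node counters are still stored in pages: callers that
          * want pages use the *_pages accessor, while the plain accessor
          * warns if used on a byte-sized item.
          */
         unsigned long node_page_state(struct pglist_data *pgdat,
         			      enum node_stat_item item)
         {
         	VM_WARN_ON_ONCE(vmstat_item_in_bytes(item));

         	return node_page_state_pages(pgdat, item);
         }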
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Link: http://lkml.kernel.org/r/20200623174037.3951353-4-guro@fb.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 17 Jul 2020, 1 commit
  3. 04 Jul 2020, 1 commit
  4. 09 Jun 2020, 1 commit
  5. 05 Jun 2020, 2 commits
    • mm: Allow to offline unmovable PageOffline() pages via MEM_GOING_OFFLINE · aa218795
      Committed by David Hildenbrand
      virtio-mem wants to be able to offline memory blocks of which some
      parts were unplugged (allocated via alloc_contig_range()), especially
      so it can later offline and remove completely unplugged memory blocks.
      The important part is that PageOffline() has to remain set until the
      section is offline, so these pages will never get accessed (e.g., when
      dumping).  The pages should not be handed back to the buddy (which
      would require clearing PageOffline() and would cause issues if
      offlining fails and the pages are suddenly back in the buddy).
      
      Let's make this possible by allowing any PageOffline() page to be
      isolated when offlining.  This way, we can reach the memory hotplug
      notifier MEM_GOING_OFFLINE, where the driver can signal that it is
      fine with offlining the page by dropping its reference count.
      PageOffline() pages with a reference count of 0 can then be skipped
      when offlining the pages (as if they were free, though they are not in
      the buddy).
      
      Anybody who uses PageOffline() pages and does not agree to offline
      them (e.g., Hyper-V balloon, XEN balloon, VMWare balloon for 2MB
      pages) will not decrement the reference count, making offlining fail
      when it tries to migrate such an unmovable page.  So there should be
      no observable change.  The same applies to balloon compaction users
      (movable PageOffline() pages): the pages will simply be migrated.
      
      Note 1: If offlining fails, a driver has to increment the reference
      	count again in MEM_CANCEL_OFFLINE.
      
      Note 2: A driver that makes use of this has to be aware that re-onlining
      	the memory block has to be handled by hooking into onlining code
      	(online_page_callback_t), resetting the page PageOffline() and
      	not giving them to the buddy.
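
      A hypothetical driver-side pattern (the callback and the
      my_driver_owns() helper are illustrative only, not taken from the
      patch):

         static int my_mem_notifier(struct notifier_block *nb,
         			   unsigned long action, void *arg)
         {
         	struct memory_notify *mhp = arg;
         	unsigned long pfn;

         	for (pfn = mhp->start_pfn;
         	     pfn < mhp->start_pfn + mhp->nr_pages; pfn++) {
         		struct page *page = pfn_to_page(pfn);

         		/* my_driver_owns() is hypothetical. */
         		if (!PageOffline(page) || !my_driver_owns(page))
         			continue;
         		if (action == MEM_GOING_OFFLINE)
         			page_ref_dec(page);	/* page may now be skipped */
         		else if (action == MEM_CANCEL_OFFLINE)
         			page_ref_inc(page);	/* offlining failed, see Note 1 */
         	}
         	return NOTIFY_OK;
         }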
      Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Tested-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Acked-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Anthony Yznaga <anthony.yznaga@oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Pingfan Liu <kernelfans@gmail.com>
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Link: https://lore.kernel.org/r/20200507140139.17083-7-david@redhat.com
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
    • virtio-mem: Paravirtualized memory hotunplug part 2 · 255f5985
      Committed by David Hildenbrand
      We also want to unplug online memory (contained in online memory blocks
      and, therefore, managed by the buddy), and eventually replug it later.
      
      When requested to unplug memory, we use alloc_contig_range() to allocate
      subblocks in online memory blocks (so we are the owner) and send them to
      our hypervisor. When requested to plug memory, we can replug such memory
      using free_contig_range() after asking our hypervisor.
      
      We also want to mark all allocated pages PG_offline, so nobody will
      touch them. To differentiate pages that were never onlined when
      onlining the memory block from pages allocated via alloc_contig_range(), we
      use PageDirty(). Based on this flag, virtio_mem_fake_online() can either
      online the pages for the first time or use free_contig_range().
      
      It is worth noting that there are no guarantees on how much memory can
      actually get unplugged again. All device memory might completely be
      fragmented with unmovable data, such that no subblock can get unplugged.
      
      We are not touching ZONE_MOVABLE.  If memory is onlined to
      ZONE_MOVABLE, it can only get unplugged after that memory has been
      offlined manually by user space.  In normal operation, virtio-mem
      memory is suggested to be onlined to ZONE_NORMAL.  In the future, we
      will try to make unplug more likely to succeed.
      
      Add a module parameter to control if online memory shall be touched.
      
      As we want to access alloc_contig_range()/free_contig_range() from
      kernel module context, export the symbols.
      
      Note: Whenever virtio-mem uses alloc_contig_range(), all affected pages
      are on the same node, in the same zone, and contain no holes.
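
      A sketch of the unplug side under those assumptions (the helper name
      and error handling are simplified, not the driver's actual code):

         static int virtio_mem_unplug_sketch(unsigned long start_pfn,
         				    unsigned long nr_pages)
         {
         	unsigned long pfn;
         	int rc;

         	/* Take the range out of the buddy; we become the owner. */
         	rc = alloc_contig_range(start_pfn, start_pfn + nr_pages,
         				MIGRATE_MOVABLE, GFP_KERNEL);
         	if (rc)
         		return rc;

         	for (pfn = start_pfn; pfn < start_pfn + nr_pages; pfn++) {
         		struct page *page = pfn_to_page(pfn);

         		/*
         		 * PG_offline: nobody may touch the page.  PG_dirty: it
         		 * came from alloc_contig_range(), so fake-onlining it
         		 * later means free_contig_range(), not onlining it anew.
         		 */
         		__SetPageOffline(page);
         		SetPageDirty(page);
         	}
         	/* ... tell the hypervisor the subblock is unplugged ... */
         	return 0;
         }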
      
      Acked-by: Michal Hocko <mhocko@suse.com> # to export contig range allocator API
      Tested-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Stefan Hajnoczi <stefanha@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Alexander Potapenko <glider@google.com>
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Link: https://lore.kernel.org/r/20200507140139.17083-6-david@redhat.com
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
  6. 04 Jun 2020, 26 commits