1. 28 May, 2016 1 commit
  2. 27 May, 2016 1 commit
  3. 21 May, 2016 2 commits
  4. 20 May, 2016 2 commits
    • V
      memory_hotplug: introduce CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE · 8604d9e5
      Committed by Vitaly Kuznetsov
      This patchset continues the work I started with commit 31bc3858
      ("memory-hotplug: add automatic onlining policy for the newly added
      memory").
      
      Initially I was going to stop there and bring the policy setting logic
      to userspace.  I ran into two issues along the way:

       1) It is possible to have memory hotplugged at boot (e.g.  with QEMU).
          These blocks stay offline if the onlining policy is turned on from
          userspace.

       2) My attempt to bring this policy setting to systemd failed; the
          systemd maintainers suggested changing the default in the kernel
          or ...  using tmpfiles.d to alter the policy (which looks like a
          hack to me):
              https://github.com/systemd/systemd/pull/2938

      Here I suggest adding a config option to set the default value for the
      policy and a kernel command line parameter to override it.
      
      This patch (of 2):
      
      Introduce a config option to set the default value for the memory
      hotplug onlining policy (/sys/devices/system/memory/auto_online_blocks).
      The reasons one would want to turn this option on are to have early
      onlining for hotpluggable memory available at boot and to not require
      any userspace actions to make memory hotplug work.
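
      As a rough illustration of the knob whose default this option sets, the
      sketch below reads and writes the policy file from userspace.  The
      helper names are hypothetical, and the assumption that the file accepts
      the strings "online" and "offline" is not stated in this message;
      writing to the real file requires root.

```python
from pathlib import Path

# Path named in the commit message above.
POLICY_FILE = Path("/sys/devices/system/memory/auto_online_blocks")

# Assumption: the values the policy file accepted at the time of this patch.
VALID_POLICIES = ("online", "offline")


def read_policy(path: Path = POLICY_FILE) -> str:
    """Return the current auto-onlining policy as a stripped string."""
    return path.read_text().strip()


def write_policy(policy: str, path: Path = POLICY_FILE) -> None:
    """Set the policy; raise ValueError for values outside VALID_POLICIES."""
    if policy not in VALID_POLICIES:
        raise ValueError(f"unknown policy: {policy!r}")
    path.write_text(policy + "\n")
```

      Passing an explicit path makes the helpers easy to try against a
      scratch file before pointing them at sysfs.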
      
      [akpm@linux-foundation.org: tweak Kconfig text]
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: David Vrabel <david.vrabel@citrix.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: Lennart Poettering <lennart@poettering.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Y
      mm: slab: remove ZONE_DMA_FLAG · a3187e43
      Committed by Yang Shi
      Now that we have the IS_ENABLED helper to check whether a Kconfig option
      is enabled, ZONE_DMA_FLAG no longer seems useful.

      The use of ZONE_DMA_FLAG in slab also looks pointless according to the
      comment [1] from Johannes Weiner, so remove it; ORing the passed-in
      flags with the cache gfp flags is already done in kmem_getpages().
      
      [1] https://lkml.org/lkml/2014/9/25/553
      
      Link: http://lkml.kernel.org/r/1462381297-11009-1-git-send-email-yang.shi@linaro.org
      Signed-off-by: Yang Shi <yang.shi@linaro.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  5. 18 March, 2016 3 commits
  6. 19 February, 2016 1 commit
    • D
      mm/core, x86/mm/pkeys: Add arch_validate_pkey() · 66d37570
      Committed by Dave Hansen
      The syscall-level code is passed a protection key and needs to
      return an appropriate error code if the protection key is bogus.
      We will be using this in subsequent patches.
      
      Note that this also begins a series of arch-specific calls that
      we need to expose in otherwise arch-independent code.  We create
      a linux/pkeys.h header where we will put *all* the stubs for
      these functions.
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: linux-mm@kvack.org
      Link: http://lkml.kernel.org/r/20160212210232.774EEAAB@viggo.jf.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  7. 18 February, 2016 1 commit
    • D
      mm/core, x86/mm/pkeys: Store protection bits in high VMA flags · 63c17fb8
      Committed by Dave Hansen
      vma->vm_flags is an 'unsigned long', so has space for 32 flags
      on 32-bit architectures.  The high 32 bits are unused on 64-bit
      platforms.  We've steered away from using the unused high VMA
      bits for things because we would have difficulty supporting it
      on 32-bit.
      
      Protection Keys are not available in 32-bit mode, so there is
      no concern about supporting this feature in 32-bit mode or on
      32-bit CPUs.
      
      This patch carves out 4 bits from the high half of
      vma->vm_flags and allows architectures to set a config option
      to make them available.
      
      Sparse complains about these constants unless we explicitly
      call them "UL".
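
      The bit carving described above can be modelled outside the kernel.
      The sketch below is only an illustration: the bit positions 32..35 and
      all names are assumptions made for the example, not taken from this
      message.

```python
# Assumption: the four carved-out flags sit at bit positions 32..35,
# i.e. the lowest four bits of the high half of a 64-bit vm_flags word.
HIGH_ARCH_BITS = (32, 33, 34, 35)
HIGH_ARCH_MASK = sum(1 << bit for bit in HIGH_ARCH_BITS)


def fits_in_unsigned_long(flag: int, long_bits: int) -> bool:
    """True if `flag` is representable in an unsigned long of `long_bits` bits."""
    return 0 <= flag < (1 << long_bits)


def set_high_arch(vm_flags: int, value: int) -> int:
    """Store a 4-bit value in the carved-out bits, leaving other flags alone."""
    if not 0 <= value <= 0xF:
        raise ValueError("value must fit in 4 bits")
    return (vm_flags & ~HIGH_ARCH_MASK) | (value << HIGH_ARCH_BITS[0])


def get_high_arch(vm_flags: int) -> int:
    """Read the 4-bit value back out of the carved-out bits."""
    return (vm_flags & HIGH_ARCH_MASK) >> HIGH_ARCH_BITS[0]
```

      fits_in_unsigned_long() shows why the bits are only usable on 64-bit:
      a flag at bit 32 does not fit in a 32-bit unsigned long.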
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Valentin Rothberg <valentinrothberg@gmail.com>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Xie XiuQi <xiexiuqi@huawei.com>
      Cc: linux-kernel@vger.kernel.org
      Cc: linux-mm@kvack.org
      Link: http://lkml.kernel.org/r/20160212210208.81AF00D5@viggo.jf.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  8. 06 February, 2016 1 commit
  9. 16 January, 2016 2 commits
  10. 07 November, 2015 1 commit
    • K
      mm: make compound_head() robust · 1d798ca3
      Committed by Kirill A. Shutemov
      Hugh has pointed out that a compound_head() call can be unsafe in some
      contexts. Here is one example:
      
      	CPU0					CPU1
      
      isolate_migratepages_block()
        page_count()
          compound_head()
            !!PageTail() == true
      					put_page()
      					  tail->first_page = NULL
            head = tail->first_page
      					alloc_pages(__GFP_COMP)
      					   prep_compound_page()
      					     tail->first_page = head
      					     __SetPageTail(p);
            !!PageTail() == true
          <head == NULL dereferencing>
      
      The race is purely theoretical. I don't think it's possible to trigger
      it in practice. But who knows.

      We can fix the race by changing how we encode PageTail() and
      compound_head() within struct page so that they can be updated in one
      shot.
      
      The patch introduces page->compound_head into the third double word
      block, in front of compound_dtor and compound_order. Bit 0 encodes
      PageTail() and the remaining bits are a pointer to the head page if
      bit 0 is set.

      The patch moves page->pmd_huge_pte out of the word, just in case an
      architecture defines pgtable_t as something that can have bit 0 set.

      hugetlb_cgroup uses page->lru.next in the second tail page to store a
      pointer to struct hugetlb_cgroup. The patch switches it to use
      page->private in the second tail page instead. The space is free since
      ->first_page is removed from the union.

      The patch also opens the possibility of removing the
      HUGETLB_CGROUP_MIN_ORDER limitation, since there's now space in the
      first tail page to store the struct hugetlb_cgroup pointer. But that's
      out of scope of the patch.
      
      That means page->compound_head shares storage space with:
      
       - page->lru.next;
       - page->next;
       - page->rcu_head.next;
      
      That's too long a list to be absolutely sure, but it looks like nobody
      uses bit 0 of the word.

      page->rcu_head.next is guaranteed [1] to have bit 0 clear as long as we
      use call_rcu(), call_rcu_bh(), call_rcu_sched(), or call_srcu(). But a
      future call_rcu_lazy() is not allowed, as it makes use of the bit and we
      could get a false positive PageTail().
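
      The single-word encoding can be sketched with plain integers standing
      in for addresses.  This is a simplified model under stated assumptions
      (hypothetical function names, no concurrency), not the kernel code
      itself; the point is that one read of the word yields both the tail
      flag and the head pointer.

```python
TAIL_BIT = 1  # bit 0 of the shared word marks a tail page


def encode_tail(head_addr: int) -> int:
    """Build the compound_head word for a tail page pointing at head_addr."""
    if head_addr & TAIL_BIT:
        raise ValueError("head address must have bit 0 clear")
    return head_addr | TAIL_BIT


def page_tail(word: int) -> bool:
    """PageTail() in this model: true when bit 0 of the word is set."""
    return bool(word & TAIL_BIT)


def compound_head(page_addr: int, word: int) -> int:
    """Decode from one snapshot of the word, so flag and pointer always agree."""
    if word & TAIL_BIT:
        return word - TAIL_BIT  # strip the tag to recover the head address
    return page_addr  # not a tail page: the page is its own head
```

      Because the flag and the pointer live in the same word, there is no
      window in which PageTail() is true while the head pointer is NULL.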
      
      [1] http://lkml.kernel.org/g/20150827163634.GD4029@linux.vnet.ibm.com

      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  11. 11 September, 2015 1 commit
    • V
      mm: introduce idle page tracking · 33c3fc71
      Committed by Vladimir Davydov
      Knowing the portion of memory that is not used by a certain application or
      memory cgroup (idle memory) can be useful for partitioning the system
      efficiently, e.g.  by setting memory cgroup limits appropriately.
      Currently, the only means to estimate the amount of idle memory provided
      by the kernel is /proc/PID/{clear_refs,smaps}: the user can clear the
      access bit for all pages mapped to a particular process by writing 1 to
      clear_refs, wait for some time, and then count smaps:Referenced.  However,
      this method has two serious shortcomings:
      
       - it does not count unmapped file pages
       - it affects the reclaimer logic
      
      To overcome these drawbacks, this patch introduces two new page flags,
      Idle and Young, and a new sysfs file, /sys/kernel/mm/page_idle/bitmap.
      A page's Idle flag can only be set from userspace by setting the bit in
      /sys/kernel/mm/page_idle/bitmap at the offset corresponding to the page,
      and it is cleared whenever the page is accessed either through page tables
      (it is cleared in page_referenced() in this case) or using the read(2)
      system call (mark_page_accessed()). Thus by setting the Idle flag for
      pages of a particular workload, which can be found e.g.  by reading
      /proc/PID/pagemap, waiting for some time to let the workload access its
      working set, and then reading the bitmap file, one can estimate the number
      of pages that are not used by the workload.
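
      To make "the offset corresponding to the page" concrete, here is a
      small sketch of the offset arithmetic.  It assumes a layout that this
      message does not spell out: the file is an array of native-endian
      8-byte words, with the page at PFN i mapped to bit i % 64 of word
      i / 64.  The function names are hypothetical.

```python
import struct

WORD_BYTES = 8   # assumed size of one bitmap array element
WORD_BITS = 64   # bits per element


def bitmap_position(pfn):
    """Return (byte offset of the 8-byte word in the file, bit index in it)."""
    if pfn < 0:
        raise ValueError("pfn must be non-negative")
    return (pfn // WORD_BITS) * WORD_BYTES, pfn % WORD_BITS


def pack_idle_bits(bits):
    """Pack bit indices (0..63) into one native-endian 8-byte word to write."""
    word = 0
    for bit in bits:
        if not 0 <= bit < WORD_BITS:
            raise ValueError("bit index out of range")
        word |= 1 << bit
    return struct.pack("=Q", word)


def is_idle(word_bytes, bit):
    """Test one bit of an 8-byte word read back from the bitmap file."""
    (word,) = struct.unpack("=Q", word_bytes)
    return bool((word >> bit) & 1)
```

      A tool would seek to the byte offset, write the packed word to mark
      pages idle, wait, and then read the same word back to test each bit.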
      
      The Young page flag is used to avoid interference with the memory
      reclaimer.  A page's Young flag is set whenever the Access bit of a page
      table entry pointing to the page is cleared by writing to the bitmap file.
      If page_referenced() is called on a Young page, it will add 1 to its
      return value, therefore concealing the fact that the Access bit was
      cleared.
      
      Note, since there is no room for extra page flags on 32 bit, this feature
      uses extended page flags when compiled on 32 bit.
      
      [akpm@linux-foundation.org: fix build]
      [akpm@linux-foundation.org: kpageidle requires an MMU]
      [akpm@linux-foundation.org: decouple from page-flags rework]
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Reviewed-by: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  12. 28 August, 2015 1 commit
    • D
      mm: ZONE_DEVICE for "device memory" · 033fbae9
      Committed by Dan Williams
      While pmem is usable as a block device or via DAX mappings to userspace
      there are several usage scenarios that can not target pmem due to its
      lack of struct page coverage. In preparation for "hot plugging" pmem
      into the vmemmap add ZONE_DEVICE as a new zone to tag these pages
      separately from the ones that are subject to standard page allocations.
      Importantly "device memory" can be removed at will by userspace
      unbinding the driver of the device.
      
      Having a separate zone prevents allocation and otherwise marks these
      pages as distinct from typical uniform memory.  Device memory has
      different lifetime and performance characteristics than RAM.  However,
      since we have run out of ZONES_SHIFT bits this functionality currently
      depends on sacrificing ZONE_DMA.
      
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Jerome Glisse <j.glisse@gmail.com>
      [hch: various simplifications in the arch interface]
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
  13. 17 August, 2015 1 commit
  14. 24 July, 2015 1 commit
  15. 01 July, 2015 1 commit
  16. 25 June, 2015 1 commit
  17. 15 April, 2015 1 commit
  18. 13 February, 2015 1 commit
    • G
      mm/zsmalloc: add statistics support · 0f050d99
      Committed by Ganesh Mahendran
      Keeping zsmalloc fragmentation low is our goal, but we still need
      debug code in zsmalloc to obtain quantitative data.

      This patch adds a new configuration CONFIG_ZSMALLOC_STAT to enable the
      statistics collection for developers.  Currently only the object
      statistics in each class are collected.  Users can get the information
      via debugfs.
      
           cat /sys/kernel/debug/zsmalloc/zram0/...
      
      For example:
      
      After I copied "jdk-8u25-linux-x64.tar.gz" to zram with ext4 filesystem:
       class  size obj_allocated   obj_used pages_used
           0    32             0          0          0
           1    48           256         12          3
           2    64            64         14          1
           3    80            51          7          1
           4    96           128          5          3
           5   112            73          5          2
           6   128            32          4          1
           7   144             0          0          0
           8   160             0          0          0
           9   176             0          0          0
          10   192             0          0          0
          11   208             0          0          0
          12   224             0          0          0
          13   240             0          0          0
          14   256            16          1          1
          15   272            15          9          1
          16   288             0          0          0
          17   304             0          0          0
          18   320             0          0          0
          19   336             0          0          0
          20   352             0          0          0
          21   368             0          0          0
          22   384             0          0          0
          23   400             0          0          0
          24   416             0          0          0
          25   432             0          0          0
          26   448             0          0          0
          27   464             0          0          0
          28   480             0          0          0
          29   496            33          1          4
          30   512             0          0          0
          31   528             0          0          0
          32   544             0          0          0
          33   560             0          0          0
          34   576             0          0          0
          35   592             0          0          0
          36   608             0          0          0
          37   624             0          0          0
          38   640             0          0          0
          40   672             0          0          0
          42   704             0          0          0
          43   720            17          1          3
          44   736             0          0          0
          46   768             0          0          0
          49   816             0          0          0
          51   848             0          0          0
          52   864            14          1          3
          54   896             0          0          0
          57   944            13          1          3
          58   960             0          0          0
          62  1024             4          1          1
          66  1088            15          2          4
          67  1104             0          0          0
          71  1168             0          0          0
          74  1216             0          0          0
          76  1248             0          0          0
          83  1360             3          1          1
          91  1488            11          1          4
          94  1536             0          0          0
         100  1632             5          1          2
         107  1744             0          0          0
         111  1808             9          1          4
         126  2048             4          4          2
         144  2336             7          3          4
         151  2448             0          0          0
         168  2720            15         15         10
         190  3072            28         27         21
         202  3264             0          0          0
         254  4096         36209      36209      36209
      
       Total               37022      36326      36288
      
      We can calculate the overall fragmentation from the last line:
          Total               37022      36326      36288
          (37022 - 36326) / 37022 = 1.88%

      Also, by analysing the objects allocated in every class we know why we
      got such low fragmentation: most of the allocated objects are in
      <class 254>, and there is only 1 page in a class 254 zspage.  So no
      fragmentation is introduced by allocating objects in class 254.
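
      The same calculation can be applied per class.  The sketch below is a
      minimal helper (the function name is an assumption made for
      illustration) fed with three rows copied from the table above.

```python
def fragmentation(obj_allocated: int, obj_used: int) -> float:
    """Fraction of allocated object slots that are not in use."""
    if obj_allocated == 0:
        return 0.0  # an empty class has nothing to fragment
    return (obj_allocated - obj_used) / obj_allocated


# Rows copied from the table above: name -> (obj_allocated, obj_used).
SAMPLE_ROWS = {
    "class 1": (256, 12),
    "class 254": (36209, 36209),
    "Total": (37022, 36326),
}

for name, (allocated, used) in SAMPLE_ROWS.items():
    print(f"{name:>9}: {fragmentation(allocated, used):.2%}")
```

      Small classes can look badly fragmented on their own, yet they barely
      move the total because class 254 dominates the object count.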
      
      And in future, we can collect other zsmalloc statistics as we need and
      analyse them.
      Signed-off-by: Ganesh Mahendran <opensource.ganesh@gmail.com>
      Suggested-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Seth Jennings <sjennings@variantweb.net>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  19. 07 January, 2015 2 commits
  20. 10 October, 2014 2 commits
    • K
      mm/balloon_compaction: add vmstat counters and kpageflags bit · 09316c09
      Committed by Konstantin Khlebnikov
      Always mark pages with PageBalloon even if balloon compaction is disabled
      and expose this mark in /proc/kpageflags as KPF_BALLOON.
      
      Also this patch adds three counters into /proc/vmstat: "balloon_inflate",
      "balloon_deflate" and "balloon_migrate".  They accumulate balloon
      activity.  Current size of balloon is (balloon_inflate - balloon_deflate)
      pages.
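
      A userspace reader could derive the balloon size from these counters
      as sketched below.  The parser and its names are assumptions made for
      illustration; it expects the usual "name value" lines of /proc/vmstat
      and treats missing counters as zero.

```python
def parse_vmstat(text: str) -> dict:
    """Parse 'name value' lines, as found in /proc/vmstat, into a dict."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def balloon_pages(counters: dict) -> int:
    """Current balloon size in pages: balloon_inflate - balloon_deflate."""
    return counters.get("balloon_inflate", 0) - counters.get("balloon_deflate", 0)
```

      On a live system the input would be the contents of /proc/vmstat; on
      kernels without these counters the result is simply 0.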
      
      All generic balloon code is now gathered under the CONFIG_MEMORY_BALLOON
      option.  It should be selected by a ballooning driver that wants to use
      this feature.  Currently virtio-balloon is the only user.
      Signed-off-by: Konstantin Khlebnikov <k.khlebnikov@samsung.com>
      Cc: Rafael Aquini <aquini@redhat.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • S
      mm: introduce a general RCU get_user_pages_fast() · 2667f50e
      Committed by Steve Capper
      This series implements general forms of get_user_pages_fast and
      __get_user_pages_fast in core code and activates them for arm and arm64.
      
      These are required for Transparent HugePages to function correctly, as a
      futex on a THP tail will otherwise result in an infinite loop (due to the
      core implementation of __get_user_pages_fast always returning 0).
      
      Unfortunately, a futex on THP tail can be quite common for certain
      workloads; thus THP is unreliable without a __get_user_pages_fast
      implementation.
      
      This series may also be beneficial for direct-IO heavy workloads and
      certain KVM workloads.
      
      This patch (of 6):
      
      get_user_pages_fast() attempts to pin user pages by walking the page
      tables directly and avoids taking locks.  Thus the walker needs to be
      protected from page table pages being freed from under it, and needs to
      block any THP splits.
      
      One way to achieve this is to have the walker disable interrupts, and rely
      on IPIs from the TLB flushing code blocking before the page table pages
      are freed.
      
      On some platforms we have hardware broadcast of TLB invalidations, thus
      the TLB flushing code doesn't necessarily need to broadcast IPIs; and
      spuriously broadcasting IPIs can hurt system performance if done too
      often.
      
      This problem has been solved on PowerPC and Sparc by batching up page
      table pages belonging to more than one mm_user, then scheduling an
      rcu_sched callback to free the pages.  This RCU page table free logic has
      been promoted to core code and is activated when one enables
      HAVE_RCU_TABLE_FREE.  Unfortunately, these architectures implement their
      own get_user_pages_fast routines.
      
      The RCU page table free logic coupled with an IPI broadcast on THP split
      (which is a rare event), allows one to protect a page table walker by
      merely disabling the interrupts during the walk.
      
      This patch provides a general RCU implementation of get_user_pages_fast
      that can be used by architectures that perform hardware broadcast of TLB
      invalidations.
      
      It is based heavily on the PowerPC implementation by Nick Piggin.
      
      [akpm@linux-foundation.org: various comment fixes]
      Signed-off-by: Steve Capper <steve.capper@linaro.org>
      Tested-by: Dann Frazier <dann.frazier@canonical.com>
      Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Christoffer Dall <christoffer.dall@linaro.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  21. 07 August, 2014 3 commits
  22. 05 June, 2014 3 commits
  23. 20 May, 2014 1 commit
  24. 15 May, 2014 1 commit
    • H
      parisc,metag: Do not hardcode maximum userspace stack size · 042d27ac
      Committed by Helge Deller
      This patch affects only architectures where the stack grows upwards
      (currently parisc and metag only). On those, do not hardcode the maximum
      initial stack size to 1GB for 32-bit processes, but make it configurable
      via a config option.

      The main problem with the hardcoded stack size is that we have two
      memory regions which grow upwards: stack and heap. To keep most of the
      memory available for heap in a flexmap memory layout, it makes no sense
      to reserve up to 1GB of the memory for stack, which then can't be used
      as heap.

      This patch makes the stack size for 32-bit processes configurable and
      uses 80MB as the default value, which has been in use during the last
      few years on parisc and which hasn't shown any problems yet.
      Signed-off-by: Helge Deller <deller@gmx.de>
      Signed-off-by: James Hogan <james.hogan@imgtec.com>
      Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
      Cc: linux-parisc@vger.kernel.org
      Cc: linux-metag@vger.kernel.org
      Cc: John David Anglin <dave.anglin@bell.net>
  25. 08 April, 2014 2 commits
  26. 11 March, 2014 1 commit
  27. 31 January, 2014 1 commit
    • M
      zsmalloc: move it under mm · bcf1647d
      Committed by Minchan Kim
      This patch moves zsmalloc under mm directory.
      
      First, this description explains why we needed a custom allocator.

      Zsmalloc is a new slab-based memory allocator for storing compressed
      pages.  It is designed for low fragmentation and a high allocation
      success rate for large objects that are still <= PAGE_SIZE.
      
      zsmalloc differs from the kernel slab allocator in two primary ways to
      achieve these design goals.
      
      zsmalloc never requires high order page allocations to back slabs, or
      "size classes" in zsmalloc terms.  Instead it allows multiple
      single-order pages to be stitched together into a "zspage" which backs
      the slab.  This allows for higher allocation success rate under memory
      pressure.
      
      Also, zsmalloc allows objects to span page boundaries within the zspage.
      This allows for lower fragmentation than could be had with the kernel
      slab allocator for objects between PAGE_SIZE/2 and PAGE_SIZE.  With the
      kernel slab allocator, if a page compresses to 60% of its original size,
      the memory savings gained through compression are lost to fragmentation
      because another object of the same size can't be stored in the leftover
      space.
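
      The 60% example can be checked with a little arithmetic.  The sketch
      below is a deliberately simplified model: it assumes 4096-byte pages
      and a 2458-byte object (60% of a page, rounded up), and it ignores
      size-class rounding and allocator metadata.  The function name and the
      page counts tried are illustrative only.

```python
PAGE_SIZE = 4096  # assumed page size in bytes


def wasted_fraction(obj_size: int, pages: int) -> float:
    """Share of a `pages`-page backing area left unused when it is filled
    with as many `obj_size`-byte objects as fit, with objects allowed to
    cross page boundaries inside the area."""
    span = pages * PAGE_SIZE
    objects = span // obj_size
    return (span - objects * obj_size) / span


obj_size = 2458  # about 60% of one page
for pages in (1, 2):
    print(f"{pages} page(s): {wasted_fraction(obj_size, pages):.1%} wasted")
```

      With one page per object, as in the slab case described above, roughly
      40% of each page is wasted; letting objects span two stitched pages
      cuts the waste to about 10% in this model.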
      
      This ability to span pages results in zsmalloc allocations not being
      directly addressable by the user.  The user is given a
      non-dereferenceable handle in response to an allocation request.  That
      handle must be mapped, using zs_map_object(), which returns a pointer to
      the mapped region that can be used.  The mapping is necessary since the
      object data may reside in two different non-contiguous pages.

      zsmalloc fulfills the allocation needs for zram perfectly.
      
      [sjenning@linux.vnet.ibm.com: borrow Seth's quote]
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Nitin Gupta <ngupta@vflare.org>
      Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Bob Liu <bob.liu@oracle.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Luigi Semenzato <semenzato@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Seth Jennings <sjenning@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  28. 19 December, 2013 1 commit