1. 14 12月, 2014 40 次提交
    • D
      syscalls: add selftest for execveat(2) · c9b26b81
      David Drysdale 提交于
      Signed-off-by: NDavid Drysdale <drysdale@google.com>
      Cc: Meredydd Luff <meredydd@senatehouse.org>
      Cc: Shuah Khan <shuah.kh@samsung.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Rich Felker <dalias@aerifal.cx>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c9b26b81
    • D
      x86: hook up execveat system call · 27d6ec7a
      David Drysdale 提交于
      Hook up x86-64, i386 and x32 ABIs.
      Signed-off-by: NDavid Drysdale <drysdale@google.com>
      Cc: Meredydd Luff <meredydd@senatehouse.org>
      Cc: Shuah Khan <shuah.kh@samsung.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Rich Felker <dalias@aerifal.cx>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      27d6ec7a
    • D
      syscalls: implement execveat() system call · 51f39a1f
      David Drysdale 提交于
      This patchset adds execveat(2) for x86, and is derived from Meredydd
      Luff's patch from Sept 2012 (https://lkml.org/lkml/2012/9/11/528).
      
      The primary aim of adding an execveat syscall is to allow an
      implementation of fexecve(3) that does not rely on the /proc filesystem,
      at least for executables (rather than scripts).  The current glibc version
      of fexecve(3) is implemented via /proc, which causes problems in sandboxed
      or otherwise restricted environments.
      
      Given the desire for a /proc-free fexecve() implementation, HPA suggested
      (https://lkml.org/lkml/2006/7/11/556) that an execveat(2) syscall would be
      an appropriate generalization.
      
      Also, having a new syscall means that it can take a flags argument without
      back-compatibility concerns.  The current implementation just defines the
      AT_EMPTY_PATH and AT_SYMLINK_NOFOLLOW flags, but other flags could be
      added in future -- for example, flags for new namespaces (as suggested at
      https://lkml.org/lkml/2006/7/11/474).
      
      Related history:
       - https://lkml.org/lkml/2006/12/27/123 is an example of someone
         realizing that fexecve() is likely to fail in a chroot environment.
       - http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=514043 covered
         documenting the /proc requirement of fexecve(3) in its manpage, to
         "prevent other people from wasting their time".
       - https://bugzilla.redhat.com/show_bug.cgi?id=241609 described a
         problem where a process that did setuid() could not fexecve()
         because it no longer had access to /proc/self/fd; this has since
         been fixed.
      
      This patch (of 4):
      
      Add a new execveat(2) system call.  execveat() is to execve() as openat()
      is to open(): it takes a file descriptor that refers to a directory, and
      resolves the filename relative to that.
      
      In addition, if the filename is empty and AT_EMPTY_PATH is specified,
      execveat() executes the file to which the file descriptor refers.  This
      replicates the functionality of fexecve(), which is a system call in other
      UNIXen, but in Linux glibc it depends on opening "/proc/self/fd/<fd>" (and
      so relies on /proc being mounted).
      
      The filename fed to the executed program as argv[0] (or the name of the
      script fed to a script interpreter) will be of the form "/dev/fd/<fd>"
      (for an empty filename) or "/dev/fd/<fd>/<filename>", effectively
      reflecting how the executable was found.  This does however mean that
      execution of a script in a /proc-less environment won't work; also, script
      execution via an O_CLOEXEC file descriptor fails (as the file will not be
      accessible after exec).
      
      Based on patches by Meredydd Luff.
      Signed-off-by: NDavid Drysdale <drysdale@google.com>
      Cc: Meredydd Luff <meredydd@senatehouse.org>
      Cc: Shuah Khan <shuah.kh@samsung.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Rich Felker <dalias@aerifal.cx>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      51f39a1f
    • N
      fat: fix data past EOF resulting from fsx testsuite · c0ef0cc9
      Namjae Jeon 提交于
      When running FSX with direct I/O mode, fsx resulted in DATA past EOF issues.
      
        fsx ./file2 -Z -r 4096 -w 4096
        ...
        ..
        truncating to largest ever: 0x907c
        fallocating to largest ever: 0x11137
        truncating to largest ever: 0x2c6fe
        truncating to largest ever: 0x2cfdf
        fallocating to largest ever: 0x40000
        Mapped Read: non-zero data past EOF (0x18628) page offset 0x629 is 0x2a4e
        ...
        ..
      
      The reason being, it is doing a truncate down, but the zeroing does not
      happen on the last block boundary when offset is not aligned.  Even though
      it calls truncate_setsize()->truncate_inode_pages()->
      truncate_inode_pages_range() and considers the partial zeroout but it
      retrieves the page using find_lock_page() - which only looks the page in
      the cache.  So, zeroing out does not happen in case of direct IO.
      
      Make a truncate page based around block_truncate_page for FAT filesystem
      and invoke that helper to zerout in case the offset is not aligned with
      the blocksize.
      Signed-off-by: NNamjae Jeon <namjae.jeon@samsung.com>
      Signed-off-by: NAmit Sahrawat <a.sahrawat@samsung.com>
      Acked-by: NOGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c0ef0cc9
    • J
      befs: remove dead code · f441ada0
      Jan Kara 提交于
      Coverity id: 1042674
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f441ada0
    • H
      mm/zbud: init user ops only when it is needed · 1dd61aa3
      Heesub Shin 提交于
      When zbud is initialized through the zpool wrapper, pool->ops which
      points to user-defined operations is always set regardless of whether it
      is specified from the upper layer. This causes zbud_reclaim_page() to
      iterate its loop for evicting pool pages out without any gain.
      
      This patch sets the user-defined ops only when it is needed, so that
      zbud_reclaim_page() can bail out the reclamation loop earlier if there
      is no user-defined operations specified.
      Signed-off-by: NHeesub Shin <heesub.shin@samsung.com>
      Acked-by: NDan Streetman <ddstreet@ieee.org>
      Cc: Seth Jennings <sjennings@variantweb.net>
      Cc: Sunae Seo <sunae.seo@samsung.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1dd61aa3
    • M
      mm/zswap: delete unnecessary check before calling free_percpu() · 442cc432
      Markus Elfring 提交于
      free_percpu() tests whether its argument is NULL and then returns
      immediately.  Thus the test around the call is not needed.
      
      This issue was detected by using the Coccinelle software.
      Signed-off-by: NMarkus Elfring <elfring@users.sourceforge.net>
      Cc: Seth Jennings <sjennings@variantweb.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      442cc432
    • M
      mm/zswap: add __init to some functions in zswap · dd01d7d8
      Mahendran Ganesh 提交于
      zswap_cpu_init/zswap_comp_exit/zswap_entry_cache_create is only called by
      __init init_zswap()
      Signed-off-by: NMahendran Ganesh <opensource.ganesh@gmail.com>
      Cc: Seth Jennings <sjennings@variantweb.net>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      dd01d7d8
    • G
      zram: use DEVICE_ATTR_[RW|RO|WO] to define zram sys device attribute · 083914ea
      Ganesh Mahendran 提交于
      In current zram, we use DEVICE_ATTR() to define sys device attributes.
      SO, we need to set (S_IRUGO | S_IWUSR) permission and other arguments
      manually.  Linux already provids the macro DEVICE_ATTR_[RW|RO|WO] to
      define sys device attribute.  It is simple and readable.
      
      This patch uses kernel defined macro DEVICE_ATTR_[RW|RO|WO] to define
      zram device attribute.
      Signed-off-by: NGanesh Mahendran <opensource.ganesh@gmail.com>
      Acked-by: NJerome Marchand <jmarchan@redhat.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      083914ea
    • G
      mm/zsmalloc: allocate exactly size of struct zs_pool · 18136656
      Ganesh Mahendran 提交于
      In zs_create_pool(), we allocate memory more then sizeof(struct zs_pool)
        ovhd_size = roundup(sizeof(*pool), PAGE_SIZE);
      
      This patch allocate memory of exactly needed size.
      Signed-off-by: NGanesh Mahendran <opensource.ganesh@gmail.com>
      Acked-by: NMinchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      18136656
    • G
      mm/zsmalloc: avoid duplicate assignment of prev_class · df8b5bb9
      Ganesh Mahendran 提交于
      In zs_create_pool(), prev_class is assigned (ZS_SIZE_CLASSES - 1) times.
      And the prev_class only references to the previous size_class.  So we do
      not need unnecessary assignement.
      
      This patch assigns *prev_class* when a new size_class structure is
      allocated and uses prev_class to check whether the first class has been
      allocated.
      
      [akpm@linux-foundation.org: remove now-unused ZS_SIZE_CLASSES]
      Signed-off-by: NGanesh Mahendran <opensource.ganesh@gmail.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Reviewed-by: NDan Streetman <ddstreet@ieee.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      df8b5bb9
    • M
      mm/zram: correct ZRAM_ZERO flag bit position · d49b1c25
      Mahendran Ganesh 提交于
      In struct zram_table_entry, the element *value* contains obj size and obj
      zram flags.  Bit 0 to bit (ZRAM_FLAG_SHIFT - 1) represent obj size, and
      bit ZRAM_FLAG_SHIFT to the highest bit of unsigned long represent obj
      zram_flags.  So the first zram flag(ZRAM_ZERO) should be from
      ZRAM_FLAG_SHIFT instead of (ZRAM_FLAG_SHIFT + 1).
      
      This patch fixes this cosmetic issue.
      
      Also fix a typo, "page in now accessed" -> "page is now accessed"
      Signed-off-by: NMahendran Ganesh <opensource.ganesh@gmail.com>
      Acked-by: NMinchan Kim <minchan@kernel.org>
      Acked-by: NWeijie Yang <weijie.yang@samsung.com>
      Acked-by: NSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d49b1c25
    • M
      mm/zsmalloc: support allocating obj with size of ZS_MAX_ALLOC_SIZE · 40f9fb8c
      Mahendran Ganesh 提交于
      I sent a patch [1] for unnecessary check in zsmalloc.  And Minchan Kim
      found zsmalloc even does not support allocating an obj with the size of
      ZS_MAX_ALLOC_SIZE in some situations.
      
      For example:
         In system with 64KB PAGE_SIZE and 32 bit of physical addr. Then:
         ZS_MIN_ALLOC_SIZE is 32 bytes which is calculated by:
            MAX(32, (ZS_MAX_PAGES_PER_ZSPAGE << PAGE_SHIFT >> OBJ_INDEX_BITS))
         ZS_MAX_ALLOC_SIZE is 64KB(in current code, is PAGE_SIZE)
         ZS_SIZE_CLASS_DELTA is 256 bytes
         So, ZS_SIZE_CLASSES = (ZS_MAX_ALLOC_SIZE - ZS_MIN_ALLOC_SIZE) /
                                ZS_SIZE_CLASS_DELTA + 1
                             = 256
      
         In zs_create_pool(), the max size obj which can be allocated will be:
            ZS_MIN_ALLOC_SIZE + i * ZS_SIZE_CLASS_DELTA = 32 + 255*256 = 65312
      
         We can see that 65312 < 65536 (ZS_MAX_ALLOC_SIZE). So we can NOT
         allocate objs with size ZS_MAX_ALLOC_SIZE(65536) which we promise upper
         users we can do.
      
       [1]  http://lkml.iu.edu/hypermail/linux/kernel/1411.2/03835.html
       [2]  http://lkml.iu.edu/hypermail/linux/kernel/1411.2/04534.html
      
      This patch fixes this issue by dynamiclly calculating zs_size_classes when
      module is loaded, allocates buffer with size ZS_MAX_ALLOC_SIZE.  Then the
      max obj(size is ZS_MAX_ALLOC_SIZE) can be stored in it.
      
      [akpm@linux-foundation.org: restore ZS_SIZE_CLASSES to fix bisectability]
      Signed-off-by: NMahendran Ganesh <opensource.ganesh@gmail.com>
      Suggested-by: NMinchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      40f9fb8c
    • M
      zsmalloc: correct fragile [kmap|kunmap]_atomic use · af4ee5e9
      Minchan Kim 提交于
      The kunmap_atomic should use virtual address getting by kmap_atomic.
      However, some pieces of code in zsmalloc uses modified address, not the
      one got by kmap_atomic for kunmap_atomic.
      
      It's okay for working because zsmalloc modifies the address inner
      PAGE_SIZE bounday so it works with current kmap_atomic's implementation.
      But it's still fragile with potential changing of kmap_atomic so let's
      correct it.
      
      I got a subtle bug when I implemented a new feature of zsmalloc
      (compaction) due to a link's mishandling (the link was over page
      boundary).  Although it was totally my mistake, it took a while to find
      the cause because an unpredictable kmapped address was unmapped causing an
      almost random crash.
      Signed-off-by: NMinchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Cc: Seth Jennings <sjennings@variantweb.net>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      af4ee5e9
    • S
      zsmalloc: fix zs_init cpu notifier error handling · b1b00a5b
      Sergey Senozhatsky 提交于
      Mahendran Ganesh reported that zpool-enabled zsmalloc should not call
      zpool_unregister_driver() from zs_init() if cpu notifier registration has
      failed, because error handling is performed before we register the driver
      via zpool_register_driver() call.
      
      Factor out cpu notifier registration and unregistration code and fix
      zs_init() error handling.
      
      link: http://lkml.iu.edu//hypermail/linux/kernel/1411.1/04156.html
      [akpm@linux-foundation.org: squash bogus gcc warning]
      [akpm@linux-foundation.org: use __init and __exit]
      Signed-off-by: NSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Reported-by: NMahendran Ganesh <opensource.ganesh@gmail.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b1b00a5b
    • K
      zram: implement rw_page operation of zram · 8c7f0102
      karam.lee 提交于
      This patch implements rw_page operation for zram block device.
      
      I implemented the feature in zram and tested it.  Test bed was the G2, LG
      electronic mobile device, whtich has msm8974 processor and 2GB memory.
      
      With a memory allocation test program consuming memory, the system
      generates swap.
      
      Operating time of swap_write_page() was measured.
      
      --------------------------------------------------
      |             |   operating time   | improvement |
      |             |  (20 runs average) |             |
      --------------------------------------------------
      |with patch   |    1061.15 us      |    +2.4%    |
      --------------------------------------------------
      |without patch|    1087.35 us      |             |
      --------------------------------------------------
      
      Each test(with paged_io,with BIO) result set shows normal distribution and
      has equal variance.  I mean the two values are valid result to compare.  I
      can say operation with paged I/O(without BIO) is faster 2.4% with
      confidence level 95%.
      
      [minchan@kernel.org: make rw_page opeartion return 0]
      [minchan@kernel.org: rely on the bi_end_io for zram_rw_page fails]
      [sergey.senozhatsky@gmail.com: code cleanup]
      [minchan@kernel.org: add comment]
      Signed-off-by: Nkaram.lee <karam.lee@lge.com>
      Acked-by: NMinchan Kim <minchan@kernel.org>
      Acked-by: NJerome Marchand <jmarchan@redhat.com>
      Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: <seungho1.park@lge.com>
      Signed-off-by: NMinchan Kim <minchan@kernel.org>
      Signed-off-by: NSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8c7f0102
    • K
      zram: change parameter from vaild_io_request() · 54850e73
      karam.lee 提交于
      This patch changes parameter of valid_io_request for common usage.  The
      purpose of valid_io_request() is to determine if bio request is valid or
      not.
      
      This patch use I/O start address and size instead of a BIO parameter for
      common usage.
      Signed-off-by: Nkaram.lee <karam.lee@lge.com>
      Acked-by: NMinchan Kim <minchan@kernel.org>
      Acked-by: NJerome Marchand <jmarchan@redhat.com>
      Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: <seungho1.park@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      54850e73
    • K
      zram: remove bio parameter from zram_bvec_rw() · b627cff3
      karam.lee 提交于
      Recently rw_page block device operation has been added.  This patchset
      implements rw_page operation for zram block device and does some clean-up.
      
      This patch (of 3):
      
      Remove an unnecessary parameter(bio) from zram_bvec_rw() and
      zram_bvec_read().  zram_bvec_read() doesn't use a bio parameter, so remove
      it.  zram_bvec_rw() calls a read/write operation not using bio, so a rw
      parameter replaces a bio parameter.
      Signed-off-by: Nkaram.lee <karam.lee@lge.com>
      Acked-by: NMinchan Kim <minchan@kernel.org>
      Acked-by: NJerome Marchand <jmarchan@redhat.com>
      Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: <seungho1.park@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b627cff3
    • J
      zsmalloc: merge size_class to reduce fragmentation · 9eec4cd5
      Joonsoo Kim 提交于
      zsmalloc has many size_classes to reduce fragmentation and they are in 16
      bytes unit, for example, 16, 32, 48, etc., if PAGE_SIZE is 4096.  And,
      zsmalloc has constraint that each zspage has 4 pages at maximum.
      
      In this situation, we can see interesting aspect.  Let's think about
      size_class for 1488, 1472, ..., 1376.  To prevent external fragmentation,
      they uses 4 pages per zspage and so all they can contain 11 objects at
      maximum.
      
      16384 (4096 * 4) = 1488 * 11 + remains
      16384 (4096 * 4) = 1472 * 11 + remains
      16384 (4096 * 4) = ...
      16384 (4096 * 4) = 1376 * 11 + remains
      
      It means that they have same characteristics and classification between
      them isn't needed.  If we use one size_class for them, we can reduce
      fragementation and save some memory since both the 1488 and 1472 sized
      classes can only fit 11 objects into 4 pages, and an object that's 1472
      bytes can fit into an object that's 1488 bytes, merging these classes to
      always use objects that are 1488 bytes will reduce the total number of
      size classes.  And reducing the total number of size classes reduces
      overall fragmentation, because a wider range of compressed pages can fit
      into a single size class, leaving less unused objects in each size class.
      
      For this purpose, this patch implement size_class merging.  If there is
      size_class that have same pages_per_zspage and same number of objects per
      zspage with previous size_class, we don't create new size_class.  Instead,
      we use previous, same characteristic size_class.  With this way, above
      example sizes (1488, 1472, ..., 1376) use just one size_class so we can
      get much more memory utilization.
      
      Below is result of my simple test.
      
      TEST ENV: EXT4 on zram, mount with discard option WORKLOAD: untar kernel
      source code, remove directory in descending order in size.  (drivers arch
      fs sound include net Documentation firmware kernel tools)
      
      Each line represents orig_data_size, compr_data_size, mem_used_total,
      fragmentation overhead (mem_used - compr_data_size) and overhead ratio
      (overhead to compr_data_size), respectively, after untar and remove
      operation is executed.
      
      * untar-nomerge.out
      
      orig_size compr_size used_size overhead overhead_ratio
      525.88MB 199.16MB 210.23MB  11.08MB 5.56%
      288.32MB  97.43MB 105.63MB   8.20MB 8.41%
      177.32MB  61.12MB  69.40MB   8.28MB 13.55%
      146.47MB  47.32MB  56.10MB   8.78MB 18.55%
      124.16MB  38.85MB  48.41MB   9.55MB 24.58%
      103.93MB  31.68MB  40.93MB   9.25MB 29.21%
       84.34MB  22.86MB  32.72MB   9.86MB 43.13%
       66.87MB  14.83MB  23.83MB   9.00MB 60.70%
       60.67MB  11.11MB  18.60MB   7.49MB 67.48%
       55.86MB   8.83MB  16.61MB   7.77MB 88.03%
       53.32MB   8.01MB  15.32MB   7.31MB 91.24%
      
      * untar-merge.out
      
      orig_size compr_size used_size overhead overhead_ratio
      526.23MB 199.18MB 209.81MB  10.64MB 5.34%
      288.68MB  97.45MB 104.08MB   6.63MB 6.80%
      177.68MB  61.14MB  66.93MB   5.79MB 9.47%
      146.83MB  47.34MB  52.79MB   5.45MB 11.51%
      124.52MB  38.87MB  44.30MB   5.43MB 13.96%
      104.29MB  31.70MB  36.83MB   5.13MB 16.19%
       84.70MB  22.88MB  27.92MB   5.04MB 22.04%
       67.11MB  14.83MB  19.26MB   4.43MB 29.86%
       60.82MB  11.10MB  14.90MB   3.79MB 34.17%
       55.90MB   8.82MB  12.61MB   3.79MB 42.97%
       53.32MB   8.01MB  11.73MB   3.73MB 46.53%
      
      As you can see above result, merged one has better utilization (overhead
      ratio, 5th column) and uses less memory (mem_used_total, 3rd column).
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Reviewed-by: NDan Streetman <ddstreet@ieee.org>
      Cc: Luigi Semenzato <semenzato@google.com>
      Cc: <juno.choi@lge.com>
      Cc: "seungho1.park" <seungho1.park@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9eec4cd5
    • R
      mm/memcontrol.c: remove unused mem_cgroup_lru_names_not_uptodate() · 70bc068c
      Rickard Strandqvist 提交于
      Remove unused mem_cgroup_lru_names_not_uptodate() and move BUILD_BUG_ON()
      to the beginning of memcg_stat_show().
      
      This was partially found by using a static code analysis program called
      cppcheck.
      Signed-off-by: NRickard Strandqvist <rickard_strandqvist@spectrumdigital.se>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      70bc068c
    • V
      memcg: fix possible use-after-free in memcg_kmem_get_cache() · 8135be5a
      Vladimir Davydov 提交于
      Suppose task @t that belongs to a memory cgroup @memcg is going to
      allocate an object from a kmem cache @c.  The copy of @c corresponding to
      @memcg, @mc, is empty.  Then if kmem_cache_alloc races with the memory
      cgroup destruction we can access the memory cgroup's copy of the cache
      after it was destroyed:
      
      CPU0				CPU1
      ----				----
      [ current=@t
        @mc->memcg_params->nr_pages=0 ]
      
      kmem_cache_alloc(@c):
        call memcg_kmem_get_cache(@c);
        proceed to allocation from @mc:
          alloc a page for @mc:
            ...
      
      				move @t from @memcg
      				destroy @memcg:
      				  mem_cgroup_css_offline(@memcg):
      				    memcg_unregister_all_caches(@memcg):
      				      kmem_cache_destroy(@mc)
      
          add page to @mc
      
      We could fix this issue by taking a reference to a per-memcg cache, but
      that would require adding a per-cpu reference counter to per-memcg caches,
      which would look cumbersome.
      
      Instead, let's take a reference to a memory cgroup, which already has a
      per-cpu reference counter, in the beginning of kmem_cache_alloc to be
      dropped in the end, and move per memcg caches destruction from css offline
      to css free.  As a side effect, per-memcg caches will be destroyed not one
      by one, but all at once when the last page accounted to the memory cgroup
      is freed.  This doesn't sound as a high price for code readability though.
      
      Note, this patch does add some overhead to the kmem_cache_alloc hot path,
      but it is pretty negligible - it's just a function call plus a per cpu
      counter decrement, which is comparable to what we already have in
      memcg_kmem_get_cache.  Besides, it's only relevant if there are memory
      cgroups with kmem accounting enabled.  I don't think we can find a way to
      handle this race w/o it, because alloc_page called from kmem_cache_alloc
      may sleep so we can't flush all pending kmallocs w/o reference counting.
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Acked-by: NChristoph Lameter <cl@linux.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8135be5a
    • M
      mm/memcontrol.c: fix defined but not used compiler warning · ae6e71d3
      Michele Curti 提交于
      test_mem_cgroup_node_reclaimable() is used only when MAX_NUMNODES > 1, so
      move it into the compiler if statement
      
      [akpm@linux-foundation.org: clean up layout]
      Signed-off-by: NMichele Curti <michele.curti@gmail.com>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ae6e71d3
    • M
      mm: fadvise: document the fadvise(FADV_DONTNEED) behaviour for partial pages · 441c228f
      Mel Gorman 提交于
      A random seek IO benchmark appeared to regress because of a change to
      readahead but the real problem was the benchmark.  To ensure the IO
      request accesssed disk, it used fadvise(FADV_DONTNEED) on a block boundary
      (512K) but the hint is ignored by the kernel.  This is correct but not
      necessarily obvious behaviour.  As much as I dislike comment patches, the
      explanation for this behaviour predates current git history.  Clarify why
      it behaves like this in case someone "fixes" fadvise or readahead for the
      wrong reasons.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      441c228f
    • D
      mm/vmalloc.c: fix memory ordering bug · 7e5b528b
      Dmitry Vyukov 提交于
      Read memory barriers must follow the read operations.
      Signed-off-by: NDmitry Vyukov <dvyukov@google.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Acked-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7e5b528b
    • O
      oom: kill the insufficient and no longer needed PT_TRACE_EXIT check · 6a2d5679
      Oleg Nesterov 提交于
      After the previous patch we can remove the PT_TRACE_EXIT check in
      oom_scan_process_thread(), it was added to handle the case when the
      coredumping was "frozen" by ptrace, but it doesn't really work.  If
      nothing else, we would need to check all threads which could share the
      same ->mm to make it more or less correct.
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: David Rientjes <rientjes@google.com>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6a2d5679
    • O
      oom: don't assume that a coredumping thread will exit soon · d003f371
      Oleg Nesterov 提交于
      oom_kill.c assumes that PF_EXITING task should exit and free the memory
      soon.  This is wrong in many ways and one important case is the coredump.
      A task can sleep in exit_mm() "forever" while the coredumping sub-thread
      can need more memory.
      
      Change the PF_EXITING checks to take SIGNAL_GROUP_COREDUMP into account,
      we add the new trivial helper for that.
      
      Note: this is only the first step, this patch doesn't try to solve other
      problems.  The SIGNAL_GROUP_COREDUMP check is obviously racy, a task can
      participate in coredump after it was already observed in PF_EXITING state,
      so TIF_MEMDIE (which also blocks oom-killer) still can be wrongly set.
      fatal_signal_pending() can be true because of SIGNAL_GROUP_COREDUMP so
      out_of_memory() and mem_cgroup_out_of_memory() shouldn't blindly trust it.
       And even the name/usage of the new helper is confusing, an exiting thread
      can only free its ->mm if it is the only/last task in thread group.
      
      [akpm@linux-foundation.org: add comment]
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d003f371
    • Z
      mm: remove the highmem zones' memmap in the highmem zone · ba914f48
      Zhong Hongbo 提交于
      Since 01cefaef ("mm: provide more accurate estimation
      of pages occupied by memmap") allocate the pages from lowmem for the
      highmem zones' memmap. So It is not need to reserver the memmap's for
      the highmem.
      
      A 2G DDR3 for the arm platform:
      On node 0 totalpages: 524288
      free_area_init_node: node 0, pgdat 80ccd380, node_mem_map 80d38000
        DMA zone: 3568 pages used for memmap
        DMA zone: 0 pages reserved
        DMA zone: 456704 pages, LIFO batch:31
        HighMem zone: 528 pages used for memmap
        HighMem zone: 67584 pages, LIFO batch:15
      
      On node 0 totalpages: 524288
      free_area_init_node: node 0, pgdat 80cd6f40, node_mem_map 80d42000
        DMA zone: 3568 pages used for memmap
        DMA zone: 0 pages reserved
        DMA zone: 456704 pages, LIFO batch:31
        HighMem zone: 67584 pages, LIFO batch:15
      Signed-off-by: NHongbo Zhong <hongbo.zhong@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ba914f48
    • H
      mm: unmapped page migration avoid unmap+remap overhead · 2ebba6b7
      Hugh Dickins 提交于
      Page migration's __unmap_and_move(), and rmap's try_to_unmap(), were
      created for use on pages almost certainly mapped into userspace.  But
      nowadays compaction often applies them to unmapped page cache pages: which
      may exacerbate contention on i_mmap_rwsem quite unnecessarily, since
      try_to_unmap_file() makes no preliminary page_mapped() check.
      
      Now check page_mapped() in __unmap_and_move(); and avoid repeating the
      same overhead in rmap_walk_file() - don't remove_migration_ptes() when we
      never inserted any.
      
      (The PageAnon(page) comment blocks now look even sillier than before, but
      clean that up on some other occasion.  And note in passing that
      try_to_unmap_one() does not use a migration entry when PageSwapCache, so
      remove_migration_ptes() will then not update that swap entry to newpage
      pte: not a big deal, but something else to clean up later.)
      
      Davidlohr remarked in "mm,fs: introduce helpers around the i_mmap_mutex"
      conversion to i_mmap_rwsem, that "The biggest winner of these changes is
      migration": a part of the reason might be all of that unnecessary taking
      of i_mmap_mutex in page migration; and it's rather a shame that I didn't
      get around to sending this patch in before his - this one is much less
      useful after Davidlohr's conversion to rwsem, but still good.
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2ebba6b7
    • D
      fs, seq_file: fallback to vmalloc instead of oom kill processes · 5cec38ac
      David Rientjes 提交于
      Since commit 058504ed ("fs/seq_file: fallback to vmalloc allocation"),
      seq_buf_alloc() falls back to vmalloc() when the kmalloc() for contiguous
      memory fails.  This was done to address order-4 slab allocations for
      reading /proc/stat on large machines and noticed because
      PAGE_ALLOC_COSTLY_ORDER < 4, so there is no infinite loop in the page
      allocator when allocating new slab for such high-order allocations.
      
      Contiguous memory isn't necessary for caller of seq_buf_alloc(), however.
      Other GFP_KERNEL high-order allocations that are <=
      PAGE_ALLOC_COSTLY_ORDER will simply loop forever in the page allocator and
      oom kill processes as a result.
      
      We don't want to kill processes so that we can allocate contiguous memory
      in situations when contiguous memory isn't necessary.
      
      This patch does the kmalloc() allocation with __GFP_NORETRY for high-order
      allocations.  This still utilizes memory compaction and direct reclaim in
      the allocation path, the only difference is that it will fail immediately
      instead of oom kill processes when out of memory.
      
      [akpm@linux-foundation.org: add comment]
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5cec38ac
    • J
      mm: vmscan: invoke slab shrinkers from shrink_zone() · 6b4f7799
      Johannes Weiner 提交于
      The slab shrinkers are currently invoked from the zonelist walkers in
      kswapd, direct reclaim, and zone reclaim, all of which roughly gauge the
      eligible LRU pages and assemble a nodemask to pass to NUMA-aware
      shrinkers, which then again have to walk over the nodemask.  This is
      redundant code, extra runtime work, and fairly inaccurate when it comes to
      the estimation of actually scannable LRU pages.  The code duplication will
      only get worse when making the shrinkers cgroup-aware and requiring them
      to have out-of-band cgroup hierarchy walks as well.
      
      Instead, invoke the shrinkers from shrink_zone(), which is where all
      reclaimers end up, to avoid this duplication.
      
      Take the count for eligible LRU pages out of get_scan_count(), which
      considers many more factors than just the availability of swap space, like
      zone_reclaimable_pages() currently does.  Accumulate the number over all
      visited lruvecs to get the per-zone value.
      
      Some nodes have multiple zones due to memory addressing restrictions.  To
      avoid putting too much pressure on the shrinkers, only invoke them once
      for each such node, using the class zone of the allocation as the pivot
      zone.
      
      For now, this integrates the slab shrinking better into the reclaim logic
      and gets rid of duplicative invocations from kswapd, direct reclaim, and
      zone reclaim.  It also prepares for cgroup-awareness, allowing
      memcg-capable shrinkers to be added at the lruvec level without much
      duplication of both code and runtime work.
      
      This changes kswapd behavior, which used to invoke the shrinkers for each
      zone, but with scan ratios gathered from the entire node, resulting in
      meaningless pressure quantities on multi-zone nodes.
      
      Zone reclaim behavior also changes.  It used to shrink slabs until the
      same amount of pages were shrunk as were reclaimed from the LRUs.  Now it
      merely invokes the shrinkers once with the zone's scan ratio, which makes
      the shrinkers go easier on caches that implement aging and would prefer
      feeding back pressure from recently used slab objects to unused LRU pages.
      
      [vdavydov@parallels.com: assure class zone is populated]
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6b4f7799
    • D
      mm,vmacache: count number of system-wide flushes · f5f302e2
      Davidlohr Bueso 提交于
      These flushes deal with sequence number overflows, such as for long lived
      threads.  These are rare, but interesting from a debugging PoV.  As such,
      display the number of flushes when vmacache debugging is enabled.
      Signed-off-by: NDavidlohr Bueso <dbueso@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f5f302e2
    • J
      Documentation: add new page_owner document · 16a7ade8
      Joonsoo Kim 提交于
      page owner is for the tracking about who allocated each page.  This
      document explains what is the page owner feature and what is the merit of
      it.  And, simple HOW-TO is also explained.  See the document for detailed
      information.
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Jungsoo Son <jungsoo.son@lge.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      16a7ade8
    • J
      mm/page_owner: correct owner information for early allocated pages · 61cf5feb
      Joonsoo Kim 提交于
      Extended memory to store page owner information is initialized some time
      later than that page allocator starts.  Until initialization, many pages
      can be allocated and they have no owner information.  This make debugging
      using page owner harder, so some fixup will be helpful.
      
      This patch fixes up this situation by setting fake owner information
      immediately after page extension is initialized.  Information doesn't tell
      the right owner, but, at least, it can tell whether page is allocated or
      not, more correctly.
      
      On my testing, this patch catches 13343 early allocated pages, although
      they are mostly allocated from page extension feature.  Anyway, after
      then, there is no page left that it is allocated and has no page owner
      flag.
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Jungsoo Son <jungsoo.son@lge.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      61cf5feb
    • J
      mm/page_owner: keep track of page owners · 48c96a36
      Joonsoo Kim 提交于
      This is the page owner tracking code which is introduced so far ago.  It
      is resident on Andrew's tree, though, nobody tried to upstream so it
      remain as is.  Our company uses this feature actively to debug memory leak
      or to find a memory hogger so I decide to upstream this feature.
      
      This functionality help us to know who allocates the page.  When
      allocating a page, we store some information about allocation in extra
      memory.  Later, if we need to know status of all pages, we can get and
      analyze it from this stored information.
      
      In previous version of this feature, extra memory is statically defined in
      struct page, but, in this version, extra memory is allocated outside of
      struct page.  It enables us to turn on/off this feature at boottime
      without considerable memory waste.
      
      Although we already have tracepoint for tracing page allocation/free,
      using it to analyze page owner is rather complex.  We need to enlarge the
      trace buffer for preventing overlapping until userspace program launched.
      And, launched program continually dump out the trace buffer for later
      analysis and it would change system behaviour with more possibility rather
      than just keeping it in memory, so bad for debug.
      
      Moreover, we can use page_owner feature further for various purposes.  For
      example, we can use it for fragmentation statistics implemented in this
      patch.  And, I also plan to implement some CMA failure debugging feature
      using this interface.
      
      I'd like to give the credit for all developers contributed this feature,
      but, it's not easy because I don't know exact history.  Sorry about that.
      Below is people who has "Signed-off-by" in the patches in Andrew's tree.
      
      Contributor:
      Alexander Nyberg <alexn@dsv.su.se>
      Mel Gorman <mgorman@suse.de>
      Dave Hansen <dave@linux.vnet.ibm.com>
      Minchan Kim <minchan@kernel.org>
      Michal Nazarewicz <mina86@mina86.com>
      Andrew Morton <akpm@linux-foundation.org>
      Jungsoo Son <jungsoo.son@lge.com>
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Jungsoo Son <jungsoo.son@lge.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      48c96a36
    • J
      stacktrace: introduce snprint_stack_trace for buffer output · 9a92a6ce
      Joonsoo Kim 提交于
      Current stacktrace only have the function for console output.  page_owner
      that will be introduced in following patch needs to print the output of
      stacktrace into the buffer for our own output format so so new function,
      snprint_stack_trace(), is needed.
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Jungsoo Son <jungsoo.son@lge.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9a92a6ce
    • J
      mm/nommu: use alloc_pages_exact() rather than its own implementation · dbc8358c
      Joonsoo Kim 提交于
      do_mmap_private() in nommu.c try to allocate physically contiguous pages
      with arbitrary size in some cases and we now have good abstract function
      to do exactly same thing, alloc_pages_exact().  So, change to use it.
      
      There is no functional change.  This is the preparation step for support
      page owner feature accurately.
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Jungsoo Son <jungsoo.son@lge.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      dbc8358c
    • J
      mm/debug-pagealloc: make debug-pagealloc boottime configurable · 031bc574
      Joonsoo Kim 提交于
      Now, we have prepared to avoid using debug-pagealloc in boottime.  So
      introduce new kernel-parameter to disable debug-pagealloc in boottime, and
      makes related functions to be disabled in this case.
      
      Only non-intuitive part is change of guard page functions.  Because guard
      page is effective only if debug-pagealloc is enabled, turning off
      according to debug-pagealloc is reasonable thing to do.
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Jungsoo Son <jungsoo.son@lge.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      031bc574
    • J
      mm/debug-pagealloc: prepare boottime configurable on/off · e30825f1
      Joonsoo Kim 提交于
      Until now, debug-pagealloc needs extra flags in struct page, so we need to
      recompile whole source code when we decide to use it.  This is really
      painful, because it takes some time to recompile and sometimes rebuild is
      not possible due to third party module depending on struct page.  So, we
      can't use this good feature in many cases.
      
      Now, we have the page extension feature that allows us to insert extra
      flags to outside of struct page.  This gets rid of third party module
      issue mentioned above.  And, this allows us to determine if we need extra
      memory for this page extension in boottime.  With these property, we can
      avoid using debug-pagealloc in boottime with low computational overhead in
      the kernel built with CONFIG_DEBUG_PAGEALLOC.  This will help our
      development process greatly.
      
      This patch is the preparation step to achive above goal.  debug-pagealloc
      originally uses extra field of struct page, but, after this patch, it will
      use field of struct page_ext.  Because memory for page_ext is allocated
      later than initialization of page allocator in CONFIG_SPARSEMEM, we should
      disable debug-pagealloc feature temporarily until initialization of
      page_ext.  This patch implements this.
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Jungsoo Son <jungsoo.son@lge.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e30825f1
    • J
      mm/page_ext: resurrect struct page extending code for debugging · eefa864b
      Joonsoo Kim 提交于
      When we debug something, we'd like to insert some information to every
      page.  For this purpose, we sometimes modify struct page itself.  But,
      this has drawbacks.  First, it requires re-compile.  This makes us
      hesitate to use the powerful debug feature so development process is
      slowed down.  And, second, sometimes it is impossible to rebuild the
      kernel due to third party module dependency.  At third, system behaviour
      would be largely different after re-compile, because it changes size of
      struct page greatly and this structure is accessed by every part of
      kernel.  Keeping this as it is would be better to reproduce errornous
      situation.
      
      This feature is intended to overcome above mentioned problems.  This
      feature allocates memory for extended data per page in certain place
      rather than the struct page itself.  This memory can be accessed by the
      accessor functions provided by this code.  During the boot process, it
      checks whether allocation of huge chunk of memory is needed or not.  If
      not, it avoids allocating memory at all.  With this advantage, we can
      include this feature into the kernel in default and can avoid rebuild and
      solve related problems.
      
      Until now, memcg uses this technique.  But, now, memcg decides to embed
      their variable to struct page itself and it's code to extend struct page
      has been removed.  I'd like to use this code to develop debug feature, so
      this patch resurrect it.
      
      To help these things to work well, this patch introduces two callbacks for
      clients.  One is the need callback which is mandatory if user wants to
      avoid useless memory allocation at boot-time.  The other is optional, init
      callback, which is used to do proper initialization after memory is
      allocated.  Detailed explanation about purpose of these functions is in
      code comment.  Please refer it.
      
      Others are completely same with previous extension code in memcg.
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Jungsoo Son <jungsoo.son@lge.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      eefa864b
    • J
      mm, gfp: escalatedly define GFP_HIGHUSER and GFP_HIGHUSER_MOVABLE · 2d48366b
      Jianyu Zhan 提交于
      GFP_USER, GFP_HIGHUSER and GFP_HIGHUSER_MOVABLE are escalatedly confined
      defined, also implied by their names:
      
      GFP_USER                                  = GFP_USER
      GFP_USER + __GFP_HIGHMEM                  = GFP_HIGHUSER
      GFP_USER + __GFP_HIGHMEM + __GFP_MOVABLE  = GFP_HIGHUSER_MOVABLE
      
      So just make GFP_HIGHUSER and GFP_HIGHUSER_MOVABLE escalatedly defined to
      reflect this fact.  It also makes the definition clear and texturally warn
      on any furture break-up of this escalated relastionship.
      Signed-off-by: NJianyu Zhan <jianyu.zhan@emc.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2d48366b