1. 14 10月, 2014 1 次提交
  2. 13 10月, 2014 2 次提交
  3. 10 10月, 2014 37 次提交
    • L
      Merge branch 'for-3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu · c798360c
      Linus Torvalds 提交于
      Pull percpu updates from Tejun Heo:
       "A lot of activities on percpu front.  Notable changes are...
      
         - percpu allocator now can take @gfp.  If @gfp doesn't contain
           GFP_KERNEL, it tries to allocate from what's already available to
           the allocator and a work item tries to keep the reserve around
           certain level so that these atomic allocations usually succeed.
      
           This will replace the ad-hoc percpu memory pool used by
           blk-throttle and also be used by the planned blkcg support for
           writeback IOs.
      
           Please note that I noticed a bug in how @gfp is interpreted while
           preparing this pull request and applied the fix 6ae833c7
           ("percpu: fix how @gfp is interpreted by the percpu allocator")
           just now.
      
         - percpu_ref now uses longs for percpu and global counters instead of
           ints.  It leads to more sparse packing of the percpu counters on
           64bit machines but the overhead should be negligible and this
           allows using percpu_ref for refcnting pages and in-memory objects
           directly.
      
         - The switching between percpu and single counter modes of a
           percpu_ref is made independent of putting the base ref and a
           percpu_ref can now optionally be initialized in single or killed
           mode.  This allows avoiding percpu shutdown latency for cases where
           the refcounted objects may be synchronously created and destroyed
           in rapid succession with only a fraction of them reaching fully
           operational status (SCSI probing does this when combined with
           blk-mq support).  It's also planned to be used to implement forced
           single mode to detect underflow more timely for debugging.
      
        There's a separate branch percpu/for-3.18-consistent-ops which cleans
        up the duplicate percpu accessors.  That branch causes a number of
        conflicts with s390 and other trees.  I'll send a separate pull
        request w/ resolutions once other branches are merged"
      
      * 'for-3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (33 commits)
        percpu: fix how @gfp is interpreted by the percpu allocator
        blk-mq, percpu_ref: start q->mq_usage_counter in atomic mode
        percpu_ref: make INIT_ATOMIC and switch_to_atomic() sticky
        percpu_ref: add PERCPU_REF_INIT_* flags
        percpu_ref: decouple switching to percpu mode and reinit
        percpu_ref: decouple switching to atomic mode and killing
        percpu_ref: add PCPU_REF_DEAD
        percpu_ref: rename things to prepare for decoupling percpu/atomic mode switch
        percpu_ref: replace pcpu_ prefix with percpu_
        percpu_ref: minor code and comment updates
        percpu_ref: relocate percpu_ref_reinit()
        Revert "blk-mq, percpu_ref: implement a kludge for SCSI blk-mq stall during probe"
        Revert "percpu: free percpu allocation info for uniprocessor system"
        percpu-refcount: make percpu_ref based on longs instead of ints
        percpu-refcount: improve WARN messages
        percpu: fix locking regression in the failure path of pcpu_alloc()
        percpu-refcount: add @gfp to percpu_ref_init()
        proportions: add @gfp to init functions
        percpu_counter: add @gfp to percpu_counter_init()
        percpu_counter: make percpu_counters_lock irq-safe
        ...
      c798360c
    • L
      Merge branch 'for-3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup · b211e9d7
      Linus Torvalds 提交于
      Pull cgroup updates from Tejun Heo:
       "Nothing too interesting.  Just a handful of cleanup patches"
      
      * 'for-3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
        Revert "cgroup: remove redundant variable in cgroup_mount()"
        cgroup: remove redundant variable in cgroup_mount()
        cgroup: fix missing unlock in cgroup_release_agent()
        cgroup: remove CGRP_RELEASABLE flag
        perf/cgroup: Remove perf_put_cgroup()
        cgroup: remove redundant check in cgroup_ino()
        cpuset: simplify proc_cpuset_show()
        cgroup: simplify proc_cgroup_show()
        cgroup: use a per-cgroup work for release agent
        cgroup: remove bogus comments
        cgroup: remove redundant code in cgroup_rmdir()
        cgroup: remove some useless forward declarations
        cgroup: fix a typo in comment.
      b211e9d7
    • L
      Merge branch 'for-3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata · d9428f09
      Linus Torvalds 提交于
      Pull libata update from Tejun Heo:
       "AHCI is getting per-port irq handling and locks for better
        scalability.  The gain is not huge but measureable with multiple high
        iops devices connected to the same host; however, the value of
        threaded IRQ handling seems negligible for AHCI and it likely will
        revert to non-threaded handling soon.
      
        Another noteworthy change is George Spelvin's "libata: Un-break ATA
        blacklist".  During 3.17 devel cycle, the libata blacklist glob
        matching got generalized and rewritten; unfortunately, the patch
        forgot to swap arguments to match the new match function and ended up
        breaking blacklist matching completely.  It got noticed only a couple
        days ago so it couldn't make for-3.17-fixes either.  :(
      
        Other than the above two, nothing too interesting - the usual cleanup
        churns and device-specific changes"
      
      * 'for-3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata: (22 commits)
        pata_serverworks: disable 64-KB DMA transfers on Broadcom OSB4 IDE Controller
        libata: Un-break ATA blacklist
        AHCI: Do not acquire ata_host::lock from single IRQ handler
        AHCI: Optimize single IRQ interrupt processing
        AHCI: Do not read HOST_IRQ_STAT reg in multi-MSI mode
        AHCI: Make few function names more descriptive
        AHCI: Move host activation code into ahci_host_activate()
        AHCI: Move ahci_host_activate() function to libahci.c
        AHCI: Pass SCSI host template as arg to ahci_host_activate()
        ata: pata_imx: Use the SIMPLE_DEV_PM_OPS() macro
        AHCI: Cleanup checking of multiple MSIs/SLM modes
        libata-sff: Fix controllers with no ctl port
        ahci_xgene: Fix the error print invalid resource for APM X-Gene SoC AHCI SATA Host Controller driver.
        libata: change ata_<foo>_printk routines to return void
        ata: qcom: Add device tree bindings information
        ahci-platform: Bump max number of clocks to 5
        ahci: ahci_p5wdh_workaround - constify DMI table
        libahci_platform: Staticize ahci_platform_<en/dis>able_phys()
        pata_platform: Remove useless irq_flags field
        pata_of_platform: Remove "electra-ide" quirk
        ...
      d9428f09
    • L
      Merge branch 'akpm' (fixes from Andrew Morton) · 0cf744bc
      Linus Torvalds 提交于
      Merge patch-bomb from Andrew Morton:
       - part of OCFS2 (review is laggy again)
       - procfs
       - slab
       - all of MM
       - zram, zbud
       - various other random things: arch, filesystems.
      
      * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (164 commits)
        nosave: consolidate __nosave_{begin,end} in <asm/sections.h>
        include/linux/screen_info.h: remove unused ORIG_* macros
        kernel/sys.c: compat sysinfo syscall: fix undefined behavior
        kernel/sys.c: whitespace fixes
        acct: eliminate compile warning
        kernel/async.c: switch to pr_foo()
        include/linux/blkdev.h: use NULL instead of zero
        include/linux/kernel.h: deduplicate code implementing clamp* macros
        include/linux/kernel.h: rewrite min3, max3 and clamp using min and max
        alpha: use Kbuild logic to include <asm-generic/sections.h>
        frv: remove deprecated IRQF_DISABLED
        frv: remove unused cpuinfo_frv and friends to fix future build error
        zbud: avoid accessing last unused freelist
        zsmalloc: simplify init_zspage free obj linking
        mm/zsmalloc.c: correct comment for fullness group computation
        zram: use notify_free to account all free notifications
        zram: report maximum used memory
        zram: zram memory size limitation
        zsmalloc: change return value unit of zs_get_total_size_bytes
        zsmalloc: move pages_allocated to zs_pool
        ...
      0cf744bc
    • G
      nosave: consolidate __nosave_{begin,end} in <asm/sections.h> · 7f8998c7
      Geert Uytterhoeven 提交于
      The different architectures used their own (and different) declarations:
      
          extern __visible const void __nosave_begin, __nosave_end;
          extern const void __nosave_begin, __nosave_end;
          extern long __nosave_begin, __nosave_end;
      
      Consolidate them using the first variant in <asm/sections.h>.
      Signed-off-by: NGeert Uytterhoeven <geert@linux-m68k.org>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7f8998c7
    • G
      include/linux/screen_info.h: remove unused ORIG_* macros · 578b25df
      Geert Uytterhoeven 提交于
      The ORIG_* macros definitions to access struct screen_info members and all
      of their users were removed 7 years ago by commit 3ea33510
      ("Remove magic macros for screen_info structure members"), but (only) the
      definitions reappeared a few days later in commit ee8e7cfe ("Make
      asm-x86/bootparam.h includable from userspace.").
      
      Remove them for good. Amen.
      Signed-off-by: NGeert Uytterhoeven <geert@linux-m68k.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      578b25df
    • S
      kernel/sys.c: compat sysinfo syscall: fix undefined behavior · 0baae41e
      Scotty Bauer 提交于
      Fix undefined behavior and compiler warning by replacing right shift 32
      with upper_32_bits macro
      Signed-off-by: NScotty Bauer <sbauer@eng.utah.edu>
      Cc: Clemens Ladisch <clemens@ladisch.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0baae41e
    • V
      kernel/sys.c: whitespace fixes · ec94fc3d
      vishnu.ps 提交于
      Fix minor errors and warning messages in kernel/sys.c.  These errors were
      reported by checkpatch while working with some modifications in sys.c
      file.  Fixing this first will help me to improve my further patches.
      
      ERROR: trailing whitespace - 9
      ERROR: do not use assignment in if condition - 4
      ERROR: spaces required around that '?' (ctx:VxO) - 10
      ERROR: switch and case should be at the same indent - 3
      
      total 26 errors & 3 warnings fixed.
      Signed-off-by: Nvishnu.ps <vishnu.ps@samsung.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ec94fc3d
    • Y
      acct: eliminate compile warning · 067b722f
      Ying Xue 提交于
      If ACCT_VERSION is not defined to 3, below warning appears:
        CC      kernel/acct.o
        kernel/acct.c: In function `do_acct_process':
        kernel/acct.c:475:24: warning: unused variable `ns' [-Wunused-variable]
      
      [akpm@linux-foundation.org: retain the local for code size improvements
      Signed-off-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      067b722f
    • I
      kernel/async.c: switch to pr_foo() · 27fb10ed
      Ionut Alexa 提交于
      Signed-off-by: NIonut Alexa <ionut.m.alexa@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      27fb10ed
    • M
      include/linux/blkdev.h: use NULL instead of zero · 61a04e5b
      Michele Curti 提交于
      Quite useless but it shuts up some warnings.
      Signed-off-by: NMichele Curti <michele.curti@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      61a04e5b
    • M
      include/linux/kernel.h: deduplicate code implementing clamp* macros · c185b07f
      Michal Nazarewicz 提交于
      Instead of open-coding clamp_t macro min_t and max_t the way clamp macro
      does and instead of open-coding clamp_val simply use clamp_t.
      Furthermore, normalise argument naming in the macros to be lo and hi.
      Signed-off-by: NMichal Nazarewicz <mina86@mina86.com>
      Cc: Mark Rustad <mark.d.rustad@intel.com>
      Cc: "Kirsher, Jeffrey T" <jeffrey.t.kirsher@intel.com>
      Cc: Hagen Paul Pfeifer <hagen@jauu.net>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c185b07f
    • M
      include/linux/kernel.h: rewrite min3, max3 and clamp using min and max · 2e1d06e1
      Michal Nazarewicz 提交于
      It appears that gcc is better at optimising a double call to min and max
      rather than open coded min3 and max3.  This can be observed here:
      
          $ cat min-max.c
          #define min(x, y) ({				\
          	typeof(x) _min1 = (x);			\
          	typeof(y) _min2 = (y);			\
          	(void) (&_min1 == &_min2);		\
          	_min1 < _min2 ? _min1 : _min2; })
          #define min3(x, y, z) ({			\
          	typeof(x) _min1 = (x);			\
          	typeof(y) _min2 = (y);			\
          	typeof(z) _min3 = (z);			\
          	(void) (&_min1 == &_min2);		\
          	(void) (&_min1 == &_min3);		\
          	_min1 < _min2 ? (_min1 < _min3 ? _min1 : _min3) : \
          		(_min2 < _min3 ? _min2 : _min3); })
      
          int fmin3(int x, int y, int z) { return min3(x, y, z); }
          int fmin2(int x, int y, int z) { return min(min(x, y), z); }
      
          $ gcc -O2 -o min-max.s -S min-max.c; cat min-max.s
          	.file	"min-max.c"
          	.text
          	.p2align 4,,15
          	.globl	fmin3
          	.type	fmin3, @function
          fmin3:
          .LFB0:
          	.cfi_startproc
          	cmpl	%esi, %edi
          	jl	.L5
          	cmpl	%esi, %edx
          	movl	%esi, %eax
          	cmovle	%edx, %eax
          	ret
          	.p2align 4,,10
          	.p2align 3
          .L5:
          	cmpl	%edi, %edx
          	movl	%edi, %eax
          	cmovle	%edx, %eax
          	ret
          	.cfi_endproc
          .LFE0:
          	.size	fmin3, .-fmin3
          	.p2align 4,,15
          	.globl	fmin2
          	.type	fmin2, @function
          fmin2:
          .LFB1:
          	.cfi_startproc
          	cmpl	%edi, %esi
          	movl	%edx, %eax
          	cmovle	%esi, %edi
          	cmpl	%edx, %edi
          	cmovle	%edi, %eax
          	ret
          	.cfi_endproc
          .LFE1:
          	.size	fmin2, .-fmin2
          	.ident	"GCC: (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3"
          	.section	.note.GNU-stack,"",@progbits
      
      fmin3 function, which uses open-coded min3 macro, is compiled into total
      of ten instructions including a conditional branch, whereas fmin2
      function, which uses two calls to min2 macro, is compiled into six
      instructions with no branches.
      
      Similarly, open-coded clamp produces the same code as clamp using min and
      max macros, but the latter is much shorter:
      
          $ cat clamp.c
          #define clamp(val, min, max) ({			\
          	typeof(val) __val = (val);		\
          	typeof(min) __min = (min);		\
          	typeof(max) __max = (max);		\
          	(void) (&__val == &__min);		\
          	(void) (&__val == &__max);		\
          	__val = __val < __min ? __min: __val;	\
          	__val > __max ? __max: __val; })
          #define min(x, y) ({				\
          	typeof(x) _min1 = (x);			\
          	typeof(y) _min2 = (y);			\
          	(void) (&_min1 == &_min2);		\
          	_min1 < _min2 ? _min1 : _min2; })
          #define max(x, y) ({				\
          	typeof(x) _max1 = (x);			\
          	typeof(y) _max2 = (y);			\
          	(void) (&_max1 == &_max2);		\
          	_max1 > _max2 ? _max1 : _max2; })
      
          int fclamp(int v, int min, int max) { return clamp(v, min, max); }
          int fclampmm(int v, int min, int max) { return min(max(v, min), max); }
      
          $ gcc -O2 -o clamp.s -S clamp.c; cat clamp.s
          	.file	"clamp.c"
          	.text
          	.p2align 4,,15
          	.globl	fclamp
          	.type	fclamp, @function
          fclamp:
          .LFB0:
          	.cfi_startproc
          	cmpl	%edi, %esi
          	movl	%edx, %eax
          	cmovge	%esi, %edi
          	cmpl	%edx, %edi
          	cmovle	%edi, %eax
          	ret
          	.cfi_endproc
          .LFE0:
          	.size	fclamp, .-fclamp
          	.p2align 4,,15
          	.globl	fclampmm
          	.type	fclampmm, @function
          fclampmm:
          .LFB1:
          	.cfi_startproc
          	cmpl	%edi, %esi
          	cmovge	%esi, %edi
          	cmpl	%edi, %edx
          	movl	%edi, %eax
          	cmovle	%edx, %eax
          	ret
          	.cfi_endproc
          .LFE1:
          	.size	fclampmm, .-fclampmm
          	.ident	"GCC: (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3"
          	.section	.note.GNU-stack,"",@progbits
      
          Linux mpn-glaptop 3.13.0-29-generic #53~precise1-Ubuntu SMP Wed Jun 4 22:06:25 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          gcc (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3
          Copyright (C) 2011 Free Software Foundation, Inc.
          This is free software; see the source for copying conditions.  There is NO
          warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
      
          -rwx------ 1 mpn eng 51224656 Jun 17 14:15 vmlinux.before
          -rwx------ 1 mpn eng 51224608 Jun 17 13:57 vmlinux.after
      
      48 bytes reduction.  The do_fault_around was a few instruction shorter
      and as far as I can tell saved 12 bytes on the stack, i.e.:
      
          $ grep -e rsp -e pop -e push do_fault_around.*
          do_fault_around.before.s:push   %rbp
          do_fault_around.before.s:mov    %rsp,%rbp
          do_fault_around.before.s:push   %r13
          do_fault_around.before.s:push   %r12
          do_fault_around.before.s:push   %rbx
          do_fault_around.before.s:sub    $0x38,%rsp
          do_fault_around.before.s:add    $0x38,%rsp
          do_fault_around.before.s:pop    %rbx
          do_fault_around.before.s:pop    %r12
          do_fault_around.before.s:pop    %r13
          do_fault_around.before.s:pop    %rbp
      
          do_fault_around.after.s:push   %rbp
          do_fault_around.after.s:mov    %rsp,%rbp
          do_fault_around.after.s:push   %r12
          do_fault_around.after.s:push   %rbx
          do_fault_around.after.s:sub    $0x30,%rsp
          do_fault_around.after.s:add    $0x30,%rsp
          do_fault_around.after.s:pop    %rbx
          do_fault_around.after.s:pop    %r12
          do_fault_around.after.s:pop    %rbp
      
      or here side-by-side:
      
          Before                    After
          push   %rbp               push   %rbp
          mov    %rsp,%rbp          mov    %rsp,%rbp
          push   %r13
          push   %r12               push   %r12
          push   %rbx               push   %rbx
          sub    $0x38,%rsp         sub    $0x30,%rsp
          add    $0x38,%rsp         add    $0x30,%rsp
          pop    %rbx               pop    %rbx
          pop    %r12               pop    %r12
          pop    %r13
          pop    %rbp               pop    %rbp
      
      There are also fewer branches:
      
          $ grep ^j do_fault_around.*
          do_fault_around.before.s:jae    ffffffff812079b7
          do_fault_around.before.s:jmp    ffffffff812079c5
          do_fault_around.before.s:jmp    ffffffff81207a14
          do_fault_around.before.s:ja     ffffffff812079f9
          do_fault_around.before.s:jb     ffffffff81207a10
          do_fault_around.before.s:jmp    ffffffff81207a63
          do_fault_around.before.s:jne    ffffffff812079df
      
          do_fault_around.after.s:jmp    ffffffff812079fd
          do_fault_around.after.s:ja     ffffffff812079e2
          do_fault_around.after.s:jb     ffffffff812079f9
          do_fault_around.after.s:jmp    ffffffff81207a4c
          do_fault_around.after.s:jne    ffffffff812079c8
      
      And here's with allyesconfig on a different machine:
      
          $ uname -a; gcc --version; ls -l vmlinux.*
          Linux erwin 3.14.7-mn #54 SMP Sun Jun 15 11:25:08 CEST 2014 x86_64 AMD Phenom(tm) II X3 710 Processor AuthenticAMD GNU/Linux
          gcc (GCC) 4.8.3
          Copyright (C) 2013 Free Software Foundation, Inc.
          This is free software; see the source for copying conditions.  There is NO
          warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
      
          -rwx------ 1 mpn eng 437027411 Jun 20 16:04 vmlinux.before
          -rwx------ 1 mpn eng 437026881 Jun 20 15:30 vmlinux.after
      
      530 bytes reduction.
      Signed-off-by: NMichal Nazarewicz <mina86@mina86.com>
      Signed-off-by: NHagen Paul Pfeifer <hagen@jauu.net>
      Acked-by: NSteven Rostedt <rostedt@goodmis.org>
      Cc: Hagen Paul Pfeifer <hagen@jauu.net>
      Cc: David Rientjes <rientjes@google.com>
      Cc: "Rustad, Mark D" <mark.d.rustad@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2e1d06e1
    • G
    • M
      frv: remove deprecated IRQF_DISABLED · 08e4cf4b
      Michael Opdenacker 提交于
      Remove the IRQF_DISABLED flag from FRV architecture code.  It's a NOOP
      since 2.6.35 and it will be removed one day.
      Signed-off-by: NMichael Opdenacker <michael.opdenacker@free-electrons.com>
      Cc: David Howells <dhowells@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      08e4cf4b
    • G
      frv: remove unused cpuinfo_frv and friends to fix future build error · 21f45660
      Geert Uytterhoeven 提交于
      Frv has a macro named cpu_data, interfering with variables and struct
      members with the same name:
      
      include/linux/pm_domain.h:75:24: error: expected identifier or '('
      before '&' token
        struct gpd_cpu_data *cpu_data;
      
      As struct cpuinfo_frv, boot_cpu_data, cpu_data, and current_cpu_data are
      not used, removed them to fix this.
      Signed-off-by: NGeert Uytterhoeven <geert+renesas@glider.be>
      Reported-by: Nkbuild test robot <fengguang.wu@intel.com>
      Cc: David Howells <dhowells@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      21f45660
    • C
      zbud: avoid accessing last unused freelist · f203c3b3
      Chao Yu 提交于
      For now, there are NCHUNKS of 64 freelists in zbud_pool, the last
      unbuddied[63] freelist linked with all zbud pages which have free chunks
      of 63.  Calculating according to context of num_free_chunks(), our max
      chunk number of unbuddied zbud page is 62, so none of zbud pages will be
      added/removed in last freelist, but still we will try to find an unbuddied
      zbud page in the last unused freelist, it is unneeded.
      
      This patch redefines NCHUNKS to 63 as free chunk number in one zbud page,
      hence we can decrease size of zpool and avoid accessing the last unused
      freelist whenever failing to allocate zbud from freelist in zbud_alloc.
      Signed-off-by: NChao Yu <chao2.yu@samsung.com>
      Cc: Seth Jennings <sjennings@variantweb.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f203c3b3
    • D
      zsmalloc: simplify init_zspage free obj linking · 5538c562
      Dan Streetman 提交于
      Change zsmalloc init_zspage() logic to iterate through each object on each
      of its pages, checking the offset to verify the object is on the current
      page before linking it into the zspage.
      
      The current zsmalloc init_zspage free object linking code has logic that
      relies on there only being one page per zspage when PAGE_SIZE is a
      multiple of class->size.  It calculates the number of objects for the
      current page, and iterates through all of them plus one, to account for
      the assumed partial object at the end of the page.  While this currently
      works, the logic can be simplified to just link the object at each
      successive offset until the offset is larger than PAGE_SIZE, which does
      not rely on PAGE_SIZE being a multiple of class->size.
      Signed-off-by: NDan Streetman <ddstreet@ieee.org>
      Acked-by: NMinchan Kim <minchan@kernel.org>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Seth Jennings <sjennings@variantweb.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5538c562
    • W
      mm/zsmalloc.c: correct comment for fullness group computation · 6dd9737e
      Wang Sheng-Hui 提交于
      The letter 'f' in "n <= N/f" stands for fullness_threshold_frac, not
      1/fullness_threshold_frac.
      Signed-off-by: NWang Sheng-Hui <shhuiw@gmail.com>
      Acked-by: NMinchan Kim <minchan@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6dd9737e
    • S
      zram: use notify_free to account all free notifications · 015254da
      Sergey Senozhatsky 提交于
      `notify_free' device attribute accounts the number of slot free
      notifications and internally represents the number of zram_free_page()
      calls.  Slot free notifications are sent only when device is used as a
      swap device, hence `notify_free' is used only for swap devices.  Since
      f4659d8e (zram: support REQ_DISCARD) ZRAM handles yet another one
      free notification (also via zram_free_page() call) -- REQ_DISCARD
      requests, which are sent by a filesystem, whenever some data blocks are
      discarded.  However, there is no way to know the number of notifications
      in the latter case.
      
      Use `notify_free' to account the number of pages freed by
      zram_bio_discard() and zram_slot_free_notify().  Depending on usage
      scenario `notify_free' represents:
      
       a) the number of pages freed because of slot free notifications, which is
         equal to the number of swap_slot_free_notify() calls, so there is no
         behaviour change
      
       b) the number of pages freed because of REQ_DISCARD notifications
      Signed-off-by: NSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: NMinchan Kim <minchan@kernel.org>
      Acked-by: NJerome Marchand <jmarchan@redhat.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Chao Yu <chao2.yu@samsung.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      015254da
    • M
      zram: report maximum used memory · 461a8eee
      Minchan Kim 提交于
      Normally, zram user could get maximum memory usage zram consumed via
      polling mem_used_total with sysfs in userspace.
      
      But it has a critical problem because user can miss peak memory usage
      during update inverval of polling.  For avoiding that, user should poll it
      with shorter interval(ie, 0.0000000001s) with mlocking to avoid page fault
      delay when memory pressure is heavy.  It would be troublesome.
      
      This patch adds new knob "mem_used_max" so user could see the maximum
      memory usage easily via reading the knob and reset it via "echo 0 >
      /sys/block/zram0/mem_used_max".
      Signed-off-by: NMinchan Kim <minchan@kernel.org>
      Reviewed-by: NDan Streetman <ddstreet@ieee.org>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: <juno.choi@lge.com>
      Cc: <seungho1.park@lge.com>
      Cc: Luigi Semenzato <semenzato@google.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Seth Jennings <sjennings@variantweb.net>
      Reviewed-by: NDavid Horner <ds2horner@gmail.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      461a8eee
    • M
      zram: zram memory size limitation · 9ada9da9
      Minchan Kim 提交于
      Since zram has no control feature to limit memory usage, it makes hard to
      manage system memrory.
      
      This patch adds new knob "mem_limit" via sysfs to set up the a limit so
      that zram could fail allocation once it reaches the limit.
      
      In addition, user could change the limit in runtime so that he could
      manage the memory more dynamically.
      
      Initial state is no limit so it doesn't break old behavior.
      
      [akpm@linux-foundation.org: fix typo, per Sergey]
      Signed-off-by: NMinchan Kim <minchan@kernel.org>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: <juno.choi@lge.com>
      Cc: <seungho1.park@lge.com>
      Cc: Luigi Semenzato <semenzato@google.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Seth Jennings <sjennings@variantweb.net>
      Cc: David Horner <ds2horner@gmail.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9ada9da9
    • M
      zsmalloc: change return value unit of zs_get_total_size_bytes · 722cdc17
      Minchan Kim 提交于
      zs_get_total_size_bytes returns a amount of memory zsmalloc consumed with
      *byte unit* but zsmalloc operates *page unit* rather than byte unit so
      let's change the API so benefit we could get is that reduce unnecessary
      overhead (ie, change page unit with byte unit) in zsmalloc.
      
      Since return type is pages, "zs_get_total_pages" is better than
      "zs_get_total_size_bytes".
      Signed-off-by: NMinchan Kim <minchan@kernel.org>
      Reviewed-by: NDan Streetman <ddstreet@ieee.org>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: <juno.choi@lge.com>
      Cc: <seungho1.park@lge.com>
      Cc: Luigi Semenzato <semenzato@google.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Seth Jennings <sjennings@variantweb.net>
      Cc: David Horner <ds2horner@gmail.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      722cdc17
    • M
      zsmalloc: move pages_allocated to zs_pool · 13de8933
      Minchan Kim 提交于
      Currently, zram has no feature to limit memory so theoretically zram can
      deplete system memory.  Users have asked for a limit several times as even
      without exhaustion zram makes it hard to control memory usage of the
      platform.  This patchset adds the feature.
      
      Patch 1 makes zs_get_total_size_bytes faster because it would be used
      frequently in later patches for the new feature.
      
      Patch 2 changes zs_get_total_size_bytes's return unit from bytes to page
      so that zsmalloc doesn't need unnecessary operation(ie, << PAGE_SHIFT).
      
      Patch 3 adds new feature.  I added the feature into zram layer, not
      zsmalloc because limiation is zram's requirement, not zsmalloc so any
      other user using zsmalloc(ie, zpool) shouldn't affected by unnecessary
      branch of zsmalloc.  In future, if every users of zsmalloc want the
      feature, then, we could move the feature from client side to zsmalloc
      easily but vice versa would be painful.
      
      Patch 4 adds news facility to report maximum memory usage of zram so that
      this avoids user polling frequently via /sys/block/zram0/ mem_used_total
      and ensures transient max are not missed.
      
      This patch (of 4):
      
      pages_allocated has counted in size_class structure and when user of
      zsmalloc want to see total_size_bytes, it should gather all of count from
      each size_class to report the sum.
      
      It's not bad if user don't see the value often but if user start to see
      the value frequently, it would be not a good deal for performance pov.
      
      This patch moves the count from size_class to zs_pool so it could reduce
      memory footprint (from [255 * 8byte] to [sizeof(atomic_long_t)]).
      Signed-off-by: NMinchan Kim <minchan@kernel.org>
      Reviewed-by: NDan Streetman <ddstreet@ieee.org>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: <juno.choi@lge.com>
      Cc: <seungho1.park@lge.com>
      Cc: Luigi Semenzato <semenzato@google.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Seth Jennings <sjennings@variantweb.net>
      Reviewed-by: NDavid Horner <ds2horner@gmail.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      13de8933
    • D
      m68k: call find_vma with the mmap_sem held in sys_cacheflush() · cd2567b6
      Davidlohr Bueso 提交于
      Performing vma lookups without taking the mm->mmap_sem is asking for
      trouble.  While doing the search, the vma in question can be modified or
      even removed before returning to the caller.  Take the lock (shared) in
      order to avoid races while iterating through the vmacache and/or rbtree.
      In addition, this guarantees that the address space will remain intact
      during the CPU flushing.
      Signed-off-by: NDavidlohr Bueso <davidlohr@hp.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      cd2567b6
    • C
      vmstat: on-demand vmstat workers V8 · 7cc36bbd
      Christoph Lameter 提交于
      vmstat workers are used for folding counter differentials into the zone,
      per node and global counters at certain time intervals.  They currently
      run at defined intervals on all processors which will cause some holdoff
      for processors that need minimal intrusion by the OS.
      
      The current vmstat_update mechanism depends on a deferrable timer firing
      every other second by default which registers a work queue item that runs
      on the local CPU, with the result that we have 1 interrupt and one
      additional schedulable task on each CPU every 2 seconds If a workload
      indeed causes VM activity or multiple tasks are running on a CPU, then
      there are probably bigger issues to deal with.
      
      However, some workloads dedicate a CPU for a single CPU bound task.  This
      is done in high performance computing, in high frequency financial
      applications, in networking (Intel DPDK, EZchip NPS) and with the advent
      of systems with more and more CPUs over time, this may become more and
      more common to do since when one has enough CPUs one cares less about
      efficiently sharing a CPU with other tasks and more about efficiently
      monopolizing a CPU per task.
      
      The difference of having this timer firing and workqueue kernel thread
      scheduled per second can be enormous.  An artificial test measuring the
      worst case time to do a simple "i++" in an endless loop on a bare metal
      system and under Linux on an isolated CPU with dynticks and with and
      without this patch, have Linux match the bare metal performance (~700
      cycles) with this patch and loose by couple of orders of magnitude (~200k
      cycles) without it[*].  The loss occurs for something that just calculates
      statistics.  For networking applications, for example, this could be the
      difference between dropping packets or sustaining line rate.
      
      Statistics are important and useful, but it would be great if there would
      be a way to not cause statistics gathering produce a huge performance
      difference.  This patche does just that.
      
      This patch creates a vmstat shepherd worker that monitors the per cpu
      differentials on all processors.  If there are differentials on a
      processor then a vmstat worker local to the processors with the
      differentials is created.  That worker will then start folding the diffs
      in regular intervals.  Should the worker find that there is no work to be
      done then it will make the shepherd worker monitor the differentials
      again.
      
      With this patch it is possible then to have periods longer than
      2 seconds without any OS event on a "cpu" (hardware thread).
      
      The patch shows a very minor increased in system performance.
      
      hackbench -s 512 -l 2000 -g 15 -f 25 -P
      
      Results before the patch:
      
      Running in process mode with 15 groups using 50 file descriptors each (== 750 tasks)
      Each sender will pass 2000 messages of 512 bytes
      Time: 4.992
      Running in process mode with 15 groups using 50 file descriptors each (== 750 tasks)
      Each sender will pass 2000 messages of 512 bytes
      Time: 4.971
      Running in process mode with 15 groups using 50 file descriptors each (== 750 tasks)
      Each sender will pass 2000 messages of 512 bytes
      Time: 5.063
      
      Hackbench after the patch:
      
      Running in process mode with 15 groups using 50 file descriptors each (== 750 tasks)
      Each sender will pass 2000 messages of 512 bytes
      Time: 4.973
      Running in process mode with 15 groups using 50 file descriptors each (== 750 tasks)
      Each sender will pass 2000 messages of 512 bytes
      Time: 4.990
      Running in process mode with 15 groups using 50 file descriptors each (== 750 tasks)
      Each sender will pass 2000 messages of 512 bytes
      Time: 4.993
      
      [fengguang.wu@intel.com: cpu_stat_off can be static]
      Signed-off-by: NChristoph Lameter <cl@linux.com>
      Reviewed-by: NGilad Ben-Yossef <gilad@benyossef.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Mike Frysinger <vapier@gentoo.org>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Hakan Akkan <hakanakkan@gmail.com>
      Cc: Max Krasnyansky <maxk@qti.qualcomm.com>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: NFengguang Wu <fengguang.wu@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7cc36bbd
    • J
      CMA: document cma=0 · f0d6d1f6
      Jean Delvare 提交于
      It isn't obvious that CMA can be disabled on the kernel's command line, so
      document it.
      Signed-off-by: NJean Delvare <jdelvare@suse.de>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Akinobu Mita <akinobu.mita@gmail.com>
      Cc: Chuck Ebbert <cebbert.lkml@gmail.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f0d6d1f6
    • S
      fs/buffer.c: increase the buffer-head per-CPU LRU size · 86cf78d7
      Sebastien Buisson 提交于
      Increase the buffer-head per-CPU LRU size to allow efficient filesystem
      operations that access many blocks for each transaction.  For example,
      creating a file in a large ext4 directory with quota enabled will access
      multiple buffer heads and will overflow the LRU at the default 8-block LRU
      size:
      
      * parent directory inode table block (ctime, nlinks for subdirs)
      * new inode bitmap
      * inode table block
      * 2 quota blocks
      * directory leaf block (not reused, but pollutes one cache entry)
      * 2 levels htree blocks (only one is reused, other pollutes cache)
      * 2 levels indirect/index blocks (only one is reused)
      
      The buffer-head per-CPU LRU size is raised to 16, as it shows in metadata
      performance benchmarks up to 10% gain for create, 4% for lookup and 7% for
      destroy.
      Signed-off-by: NLiang Zhen <liang.zhen@intel.com>
      Signed-off-by: NAndreas Dilger <andreas.dilger@intel.com>
      Signed-off-by: NSebastien Buisson <sebastien.buisson@bull.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      86cf78d7
    • M
      mm: mempolicy: skip inaccessible VMAs when setting MPOL_MF_LAZY · 2c0346a3
      Mel Gorman 提交于
      PROT_NUMA VMAs are skipped to avoid problems distinguishing between
      present, prot_none and special entries.  MPOL_MF_LAZY is not visible from
      userspace since commit a720094d ("mm: mempolicy: Hide MPOL_NOOP and
      MPOL_MF_LAZY from userspace for now") but it should still skip VMAs the
      same way task_numa_work does.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NRik van Riel <riel@redhat.com>
      Acked-by: NHugh Dickins <hughd@google.com>
      Acked-by: NPeter Zijlstra <peterz@infradead.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2c0346a3
    • K
      selftests/vm/transhuge-stress: stress test for memory compaction · 0085d61f
      Konstantin Khlebnikov 提交于
      This tool induces memory fragmentation via sequential allocation of
      transparent huge pages and splitting off everything except their last
      sub-pages.  It easily generates pressure to the memory compaction code.
      
      $ perf stat -e 'compaction:*' -e 'migrate:*' ./transhuge-stress
      transhuge-stress: allocate 7858 transhuge pages, using 15716 MiB virtual memory and 61 MiB of ram
      transhuge-stress: 1.653 s/loop, 0.210 ms/page,   9504.828 MiB/s	7858 succeed,    0 failed, 2439 different pages
      transhuge-stress: 1.537 s/loop, 0.196 ms/page,  10226.227 MiB/s	7858 succeed,    0 failed, 2364 different pages
      transhuge-stress: 1.658 s/loop, 0.211 ms/page,   9479.215 MiB/s	7858 succeed,    0 failed, 2179 different pages
      transhuge-stress: 1.617 s/loop, 0.206 ms/page,   9716.992 MiB/s	7858 succeed,    0 failed, 2421 different pages
      ^C./transhuge-stress: Interrupt
      
       Performance counter stats for './transhuge-stress':
      
               1.744.051      compaction:mm_compaction_isolate_migratepages
                   1.014      compaction:mm_compaction_isolate_freepages
               1.744.051      compaction:mm_compaction_migratepages
                   1.647      compaction:mm_compaction_begin
                   1.647      compaction:mm_compaction_end
               1.744.051      migrate:mm_migrate_pages
                       0      migrate:mm_numa_migrate_ratelimit
      
             7,964696835 seconds time elapsed
      Signed-off-by: NKonstantin Khlebnikov <koct9i@gmail.com>
      Cc: Rafael Aquini <aquini@redhat.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Shuah Khan <shuahkh@osg.samsung.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0085d61f
    • K
      mm/balloon_compaction: add vmstat counters and kpageflags bit · 09316c09
      Konstantin Khlebnikov 提交于
      Always mark pages with PageBalloon even if balloon compaction is disabled
      and expose this mark in /proc/kpageflags as KPF_BALLOON.
      
      Also this patch adds three counters into /proc/vmstat: "balloon_inflate",
      "balloon_deflate" and "balloon_migrate".  They accumulate balloon
      activity.  Current size of balloon is (balloon_inflate - balloon_deflate)
      pages.
      
      All generic balloon code now gathered under option CONFIG_MEMORY_BALLOON.
      It should be selected by ballooning driver which wants use this feature.
      Currently virtio-balloon is the only user.
      Signed-off-by: NKonstantin Khlebnikov <k.khlebnikov@samsung.com>
      Cc: Rafael Aquini <aquini@redhat.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      09316c09
    • K
      mm/balloon_compaction: remove balloon mapping and flag AS_BALLOON_MAP · 9d1ba805
      Konstantin Khlebnikov 提交于
      Now ballooned pages are detected using PageBalloon().  Fake mapping is no
      longer required.  This patch links ballooned pages to balloon device using
      field page->private instead of page->mapping.  Also this patch embeds
      balloon_dev_info directly into struct virtio_balloon.
      Signed-off-by: NKonstantin Khlebnikov <k.khlebnikov@samsung.com>
      Cc: Rafael Aquini <aquini@redhat.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9d1ba805
    • K
      mm/balloon_compaction: redesign ballooned pages management · d6d86c0a
      Konstantin Khlebnikov 提交于
      Sasha Levin reported KASAN splash inside isolate_migratepages_range().
      Problem is in the function __is_movable_balloon_page() which tests
      AS_BALLOON_MAP in page->mapping->flags.  This function has no protection
      against anonymous pages.  As result it tried to check address space flags
      inside struct anon_vma.
      
      Further investigation shows more problems in current implementation:
      
      * Special branch in __unmap_and_move() never works:
        balloon_page_movable() checks page flags and page_count.  In
        __unmap_and_move() page is locked, reference counter is elevated, thus
        balloon_page_movable() always fails.  As a result execution goes to the
        normal migration path.  virtballoon_migratepage() returns
        MIGRATEPAGE_BALLOON_SUCCESS instead of MIGRATEPAGE_SUCCESS,
        move_to_new_page() thinks this is an error code and assigns
        newpage->mapping to NULL.  Newly migrated page lose connectivity with
        balloon an all ability for further migration.
      
      * lru_lock erroneously required in isolate_migratepages_range() for
        isolation ballooned page.  This function releases lru_lock periodically,
        this makes migration mostly impossible for some pages.
      
      * balloon_page_dequeue have a tight race with balloon_page_isolate:
        balloon_page_isolate could be executed in parallel with dequeue between
        picking page from list and locking page_lock.  Race is rare because they
        use trylock_page() for locking.
      
      This patch fixes all of them.
      
      Instead of fake mapping with special flag this patch uses special state of
      page->_mapcount: PAGE_BALLOON_MAPCOUNT_VALUE = -256.  Buddy allocator uses
      PAGE_BUDDY_MAPCOUNT_VALUE = -128 for similar purpose.  Storing mark
      directly in struct page makes everything safer and easier.
      
      PagePrivate is used to mark pages present in page list (i.e.  not
      isolated, like PageLRU for normal pages).  It replaces special rules for
      reference counter and makes balloon migration similar to migration of
      normal pages.  This flag is protected by page_lock together with link to
      the balloon device.
      Signed-off-by: NKonstantin Khlebnikov <k.khlebnikov@samsung.com>
      Reported-by: NSasha Levin <sasha.levin@oracle.com>
      Link: http://lkml.kernel.org/p/53E6CEAA.9020105@oracle.com
      Cc: Rafael Aquini <aquini@redhat.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: <stable@vger.kernel.org>	[3.8+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d6d86c0a
    • S
      arm64: mm: enable RCU fast_gup · 29e56940
      Steve Capper 提交于
      Activate the RCU fast_gup for ARM64.  We also need to force THP splits to
      broadcast an IPI s.t.  we block in the fast_gup page walker.  As THP
      splits are comparatively rare, this should not lead to a noticeable
      performance degradation.
      
      Some pre-requisite functions pud_write and pud_page are also added.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: NSteve Capper <steve.capper@linaro.org>
      Tested-by: NDann Frazier <dann.frazier@canonical.com>
      Acked-by: NCatalin Marinas <catalin.marinas@arm.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Christoffer Dall <christoffer.dall@linaro.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      29e56940
    • S
      arm64: mm: enable HAVE_RCU_TABLE_FREE logic · 5e5f6dc1
      Steve Capper 提交于
      In order to implement fast_get_user_pages we need to ensure that the page
      table walker is protected from page table pages being freed from under it.
      
      This patch enables HAVE_RCU_TABLE_FREE, any page table pages belonging to
      address spaces with multiple users will be call_rcu_sched freed.  Meaning
      that disabling interrupts will block the free and protect the fast gup
      page walker.
      Signed-off-by: NSteve Capper <steve.capper@linaro.org>
      Tested-by: NDann Frazier <dann.frazier@canonical.com>
      Acked-by: NCatalin Marinas <catalin.marinas@arm.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Christoffer Dall <christoffer.dall@linaro.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5e5f6dc1
    • S
      arm: mm: enable RCU fast_gup · b8cd51af
      Steve Capper 提交于
      Activate the RCU fast_gup for ARM.  We also need to force THP splits to
      broadcast an IPI s.t.  we block in the fast_gup page walker.  As THP
      splits are comparatively rare, this should not lead to a noticeable
      performance degradation.
      
      Some pre-requisite functions pud_write and pud_page are also added.
      Signed-off-by: NSteve Capper <steve.capper@linaro.org>
      Reviewed-by: NCatalin Marinas <catalin.marinas@arm.com>
      Cc: Dann Frazier <dann.frazier@canonical.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Christoffer Dall <christoffer.dall@linaro.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b8cd51af
    • S
      arm: mm: enable HAVE_RCU_TABLE_FREE logic · a0ad5496
      Steve Capper 提交于
      In order to implement fast_get_user_pages we need to ensure that the page
      table walker is protected from page table pages being freed from under it.
      
      This patch enables HAVE_RCU_TABLE_FREE, any page table pages belonging to
      address spaces with multiple users will be call_rcu_sched freed.  Meaning
      that disabling interrupts will block the free and protect the fast gup
      page walker.
      Signed-off-by: NSteve Capper <steve.capper@linaro.org>
      Reviewed-by: NCatalin Marinas <catalin.marinas@arm.com>
      Cc: Dann Frazier <dann.frazier@canonical.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Christoffer Dall <christoffer.dall@linaro.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a0ad5496