1. 07 8月, 2014 40 次提交
    • A
      printk: fix some comments · 0b90fec3
      Alex Elder 提交于
      Fix a few comments that don't accurately describe their corresponding
      code.  It also fixes some minor typographical errors.
      Signed-off-by: NAlex Elder <elder@linaro.org>
      Reviewed-by: NPetr Mladek <pmladek@suse.cz>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Jan Kara <jack@suse.cz>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0b90fec3
    • A
      printk: rename DEFAULT_MESSAGE_LOGLEVEL · 42a9dc0b
      Alex Elder 提交于
      Commit a8fe19eb ("kernel/printk: use symbolic defines for console
      loglevels") makes consistent use of symbolic values for printk() log
      levels.
      
      The naming scheme used is different from the one used for
      DEFAULT_MESSAGE_LOGLEVEL though.  Change that symbol name to be
      MESSAGE_LOGLEVEL_DEFAULT for consistency.  And because the value of that
      symbol comes from a similarly-named config option, rename
      CONFIG_DEFAULT_MESSAGE_LOGLEVEL as well.
      Signed-off-by: NAlex Elder <elder@linaro.org>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Jan Kara <jack@suse.cz>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Petr Mladek <pmladek@suse.cz>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      42a9dc0b
    • A
      printk: tweak do_syslog() to match comments · e97e1267
      Alex Elder 提交于
      In do_syslog() there's a path used by kmsg_poll() and kmsg_read() that
      only needs to know whether there's any data available to read (and not
      its size).  These callers only check for non-zero return.  As a
      shortcut, do_syslog() returns the difference between what has been
      logged and what has been "seen."
      
      The comments say that the "count of records" should be returned but it's
      not.  Instead it returns (log_next_idx - syslog_idx), which is a
      difference between buffer offsets--and the result could be negative.
      
      The behavior is the same (it'll be zero or not in the same cases), but
      the count of records is more meaningful and it matches what the comments
      say.  So change the code to return that.
      Signed-off-by: NAlex Elder <elder@linaro.org>
      Cc: Petr Mladek <pmladek@suse.cz>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Joe Perches <joe@perches.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e97e1267
    • L
      printk: allow increasing the ring buffer depending on the number of CPUs · 23b2899f
      Luis R. Rodriguez 提交于
      The default size of the ring buffer is too small for machines with a
      large amount of CPUs under heavy load.  What ends up happening when
      debugging is the ring buffer overlaps and chews up old messages making
      debugging impossible unless the size is passed as a kernel parameter.
      An idle system upon boot up will on average spew out only about one or
      two extra lines but where this really matters is on heavy load and that
      will vary widely depending on the system and environment.
      
      There are mechanisms to help increase the kernel ring buffer for tracing
      through debugfs, and those interfaces even allow growing the kernel ring
      buffer per CPU.  We also have a static value which can be passed upon
      boot.  Relying on debugfs however is not ideal for production, and
      relying on the value passed upon bootup is can only used *after* an
      issue has creeped up.  Instead of being reactive this adds a proactive
      measure which lets you scale the amount of contributions you'd expect to
      the kernel ring buffer under load by each CPU in the worst case
      scenario.
      
      We use num_possible_cpus() to avoid complexities which could be
      introduced by dynamically changing the ring buffer size at run time,
      num_possible_cpus() lets us use the upper limit on possible number of
      CPUs therefore avoiding having to deal with hotplugging CPUs on and off.
      This introduces the kernel configuration option LOG_CPU_MAX_BUF_SHIFT
      which is used to specify the maximum amount of contributions to the
      kernel ring buffer in the worst case before the kernel ring buffer flips
      over, the size is specified as a power of 2.  The total amount of
      contributions made by each CPU must be greater than half of the default
      kernel ring buffer size (1 << LOG_BUF_SHIFT bytes) in order to trigger
      an increase upon bootup.  The kernel ring buffer is increased to the
      next power of two that would fit the required minimum kernel ring buffer
      size plus the additional CPU contribution.  For example if LOG_BUF_SHIFT
      is 18 (256 KB) you'd require at least 128 KB contributions by other CPUs
      in order to trigger an increase of the kernel ring buffer.  With a
      LOG_CPU_BUF_SHIFT of 12 (4 KB) you'd require at least anything over > 64
      possible CPUs to trigger an increase.  If you had 128 possible CPUs the
      amount of minimum required kernel ring buffer bumps to:
      
         ((1 << 18) + ((128 - 1) * (1 << 12))) / 1024 = 764 KB
      
      Since we require the ring buffer to be a power of two the new required
      size would be 1024 KB.
      
      This CPU contributions are ignored when the "log_buf_len" kernel
      parameter is used as it forces the exact size of the ring buffer to an
      expected power of two value.
      
      [pmladek@suse.cz: fix build]
      Signed-off-by: NLuis R. Rodriguez <mcgrof@suse.com>
      Signed-off-by: NPetr Mladek <pmladek@suse.cz>
      Tested-by: NDavidlohr Bueso <davidlohr@hp.com>
      Tested-by: NPetr Mladek <pmladek@suse.cz>
      Reviewed-by: NDavidlohr Bueso <davidlohr@hp.com>
      Cc: Andrew Lunn <andrew@lunn.ch>
      Cc: Stephen Warren <swarren@wwwdotorg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Petr Mladek <pmladek@suse.cz>
      Cc: Joe Perches <joe@perches.com>
      Cc: Arun KS <arunks.linux@gmail.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Davidlohr Bueso <davidlohr@hp.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      23b2899f
    • L
      printk: make dynamic units clear for the kernel ring buffer · f5405172
      Luis R. Rodriguez 提交于
      Signed-off-by: NLuis R. Rodriguez <mcgrof@suse.com>
      Suggested-by: NDavidlohr Bueso <davidlohr@hp.com>
      Cc: Andrew Lunn <andrew@lunn.ch>
      Cc: Stephen Warren <swarren@wwwdotorg.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Petr Mladek <pmladek@suse.cz>
      Cc: Joe Perches <joe@perches.com>
      Cc: Arun KS <arunks.linux@gmail.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Davidlohr Bueso <davidlohr@hp.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f5405172
    • L
      printk: move power of 2 practice of ring buffer size to a helper · c0a318a3
      Luis R. Rodriguez 提交于
      In practice the power of 2 practice of the size of the kernel ring
      buffer remains purely historical but not a requirement, specially now
      that we have LOG_ALIGN and use it for both static and dynamic
      allocations.  It could have helped with implicit alignment back in the
      days given the even the dynamically sized ring buffer was guaranteed to
      be aligned so long as CONFIG_LOG_BUF_SHIFT was set to produce a
      __LOG_BUF_LEN which is architecture aligned, since log_buf_len=n would
      be allowed only if it was > __LOG_BUF_LEN and we always ended up
      rounding the log_buf_len=n to the next power of 2 with
      roundup_pow_of_two(), any multiple of 2 then should be also architecture
      aligned.  These assumptions of course relied heavily on
      CONFIG_LOG_BUF_SHIFT producing an aligned value but users can always
      change this.
      
      We now have precise alignment requirements set for the log buffer size
      for both static and dynamic allocations, but lets upkeep the old
      practice of using powers of 2 for its size to help with easy expected
      scalable values and the allocators for dynamic allocations.  We'll reuse
      this later so move this into a helper.
      Signed-off-by: NLuis R. Rodriguez <mcgrof@suse.com>
      Cc: Andrew Lunn <andrew@lunn.ch>
      Cc: Stephen Warren <swarren@wwwdotorg.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Petr Mladek <pmladek@suse.cz>
      Cc: Joe Perches <joe@perches.com>
      Cc: Arun KS <arunks.linux@gmail.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Davidlohr Bueso <davidlohr@hp.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c0a318a3
    • L
      printk: make dynamic kernel ring buffer alignment explicit · 70300177
      Luis R. Rodriguez 提交于
      We have to consider alignment for the ring buffer both for the default
      static size, and then also for when an dynamic allocation is made when
      the log_buf_len=n kernel parameter is passed to set the size
      specifically to a size larger than the default size set by the
      architecture through CONFIG_LOG_BUF_SHIFT.
      
      The default static kernel ring buffer can be aligned properly if
      architectures set CONFIG_LOG_BUF_SHIFT properly, we provide ranges for
      the size though so even if CONFIG_LOG_BUF_SHIFT has a sensible aligned
      value it can be reduced to a non aligned value.  Commit 6ebb017d
      ("printk: Fix alignment of buf causing crash on ARM EABI") by Andrew
      Lunn ensures the static buffer is always aligned and the decision of
      alignment is done by the compiler by using __alignof__(struct log).
      
      When log_buf_len=n is used we allocate the ring buffer dynamically.
      Dynamic allocation varies, for the early allocation called before
      setup_arch() memblock_virt_alloc() requests a page aligment and for the
      default kernel allocation memblock_virt_alloc_nopanic() requests no
      special alignment, which in turn ends up aligning the allocation to
      SMP_CACHE_BYTES, which is L1 cache aligned.
      
      Since we already have the required alignment for the kernel ring buffer
      though we can do better and request explicit alignment for LOG_ALIGN.
      This does that to be safe and make dynamic allocation alignment
      explicit.
      Signed-off-by: NLuis R. Rodriguez <mcgrof@suse.com>
      Tested-by: NPetr Mladek <pmladek@suse.cz>
      Acked-by: NPetr Mladek <pmladek@suse.cz>
      Cc: Andrew Lunn <andrew@lunn.ch>
      Cc: Stephen Warren <swarren@wwwdotorg.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Petr Mladek <pmladek@suse.cz>
      Cc: Joe Perches <joe@perches.com>
      Cc: Arun KS <arunks.linux@gmail.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Davidlohr Bueso <davidlohr@hp.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      70300177
    • G
    • J
      fs.h, drivers/hwmon/asus_atk0110.c: fix DEFINE_SIMPLE_ATTRIBUTE semicolon definition and use · 68be3029
      Joe Perches 提交于
      The DEFINE_SIMPLE_ATTRIBUTE macro should not end in a ; Fix the one use
      in the kernel tree that did not have a semicolon.
      Signed-off-by: NJoe Perches <joe@perches.com>
      Acked-by: NGuenter Roeck <linux@roeck-us.net>
      Acked-by: NLuca Tettamanti <kronos.it@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      68be3029
    • J
      ./Makefile: tell gcc optimizer to never introduce new data races · 69102311
      Jiri Kosina 提交于
      We have been chasing a memory corruption bug, which turned out to be
      caused by very old gcc (4.3.4), which happily turned conditional load
      into a non-conditional one, and that broke correctness (the condition
      was met only if lock was held) and corrupted memory.
      
      This particular problem with that particular code did not happen when
      never gccs were used.  I've brought this up with our gcc folks, as I
      wanted to make sure that this can't really happen again, and it turns
      out it actually can.
      
      Quoting Martin Jambor <mjambor@suse.cz>:
       "More current GCCs are more careful when it comes to replacing a
        conditional load with a non-conditional one, most notably they check
        that a store happens in each iteration of _a_ loop but they assume
        loops are executed.  They also perform a simple check whether the
        store cannot trap which currently passes only for non-const
        variables.  A simple testcase demonstrating it on an x86_64 is for
        example the following:
      
        $ cat cond_store.c
      
        int g_1 = 1;
      
        int g_2[1024] __attribute__((section ("safe_section"), aligned (4096)));
      
        int c = 4;
      
        int __attribute__ ((noinline))
        foo (void)
        {
          int l;
          for (l = 0; (l != 4); l++) {
            if (g_1)
              return l;
            for (g_2[0] = 0; (g_2[0] >= 26); ++g_2[0])
              ;
          }
          return 2;
        }
      
        int main (int argc, char* argv[])
        {
          if (mprotect (g_2, sizeof(g_2), PROT_READ) == -1)
            {
              int e = errno;
              error (e, e, "mprotect error %i", e);
            }
          foo ();
          __builtin_printf("OK\n");
          return 0;
        }
        /* EOF */
        $ ~/gcc/trunk/inst/bin/gcc cond_store.c -O2 --param allow-store-data-races=0
        $ ./a.out
        OK
        $ ~/gcc/trunk/inst/bin/gcc cond_store.c -O2 --param allow-store-data-races=1
        $ ./a.out
        Segmentation fault
      
        The testcase fails the same at least with 4.9, 4.8 and 4.7.  Therefore
        I would suggest building kernels with this parameter set to zero. I
        also agree with Jikos that the default should be changed for -O2.  I
        have run most of the SPEC 2k6 CPU benchmarks (gamess and dealII
        failed, at -O2, not sure why) compiled with and without this option
        and did not see any real difference between respective run-times"
      
      Hopefully the default will be changed in newer gccs, but let's force it
      for kernel builds so that we are on a safe side even when older gcc are
      used.
      
      The code in question was out-of-tree printk-in-NMI (yeah, surprise
      suprise, once again) patch written by Petr Mladek, let me quote his
      comment from our internal bugzilla:
      
       "I have spent few days investigating inconsistent state of kernel ring buffer.
        It went out that it was caused by speculative store generated by
        gcc-4.3.4.
      
        The problem is in assembly generated for make_free_space(). The functions is
        called the following way:
      
        + vprintk_emit();
            + log = MAIN_LOG; // with logbuf_lock
               or
               log = NMI_LOG; // with nmi_logbuf_lock
               cont_add(log, ...);
                + cont_flush(log, ...);
                    + log_store(log, ...);
                          + log_make_free_space(log, ...);
      
        If called with log = NMI_LOG then only nmi_log_* global variables are safe to
        modify but the generated code does store also into (main_)log_* global
        variables:
      
        <log_make_free_space>:
               55                      push   %rbp
               89 f6                   mov    %esi,%esi
      
               48 8b 05 03 99 51 01    mov    0x1519903(%rip),%rax       # ffffffff82620868 <nmi_log_next_id>
               44 8b 1d ec 98 51 01    mov    0x15198ec(%rip),%r11d      # ffffffff82620858 <log_next_idx>
               8b 35 36 60 14 01       mov    0x1146036(%rip),%esi       # ffffffff8224cfa8 <log_buf_len>
               44 8b 35 33 60 14 01    mov    0x1146033(%rip),%r14d      # ffffffff8224cfac <nmi_log_buf_len>
               4c 8b 2d d0 98 51 01    mov    0x15198d0(%rip),%r13       # ffffffff82620850 <log_next_seq>
               4c 8b 25 11 61 14 01    mov    0x1146111(%rip),%r12       # ffffffff8224d098 <log_buf>
               49 89 c2                mov    %rax,%r10
               48 21 c2                and    %rax,%rdx
               48 8b 1d 0c 99 55 01    mov    0x155990c(%rip),%rbx       # ffffffff826608a0 <nmi_log_buf>
               49 c1 ea 20             shr    $0x20,%r10
               48 89 55 d0             mov    %rdx,-0x30(%rbp)
               44 29 de                sub    %r11d,%esi
               45 29 d6                sub    %r10d,%r14d
               4c 8b 0d 97 98 51 01    mov    0x1519897(%rip),%r9	# ffffffff82620840 <log_first_seq>
               eb 7e                   jmp    ffffffff81107029	<log_make_free_space+0xe9>
        [...]
               85 ff                   test   %edi,%edi                  # edi = 1 for NMI_LOG
               4c 89 e8                mov    %r13,%rax
               4c 89 ca                mov    %r9,%rdx
               74 0a                   je     ffffffff8110703d	<log_make_free_space+0xfd>
               8b 15 27 98 51 01       mov    0x1519827(%rip),%edx       # ffffffff82620860 <nmi_log_first_id>
               48 8b 45 d0             mov    -0x30(%rbp),%rax
               48 39 c2                cmp    %rax,%rdx                  # end of loop
               0f 84 da 00 00 00       je     ffffffff81107120 <log_make_free_space+0x1e0>
        [...]
               85 ff                   test   %edi,%edi                  # edi = 1 for NMI_LOG
               4c 89 0d 17 97 51 01    mov    %r9,0x1519717(%rip)        # ffffffff82620840 <log_first_seq>
                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^
                                       KABOOOM
               74 35                   je     ffffffff81107160		 <log_make_free_space+0x220>
      
        It stores log_first_seq when edi == NMI_LOG. This instructions are used also
        when edi == MAIN_LOG but the store is done speculatively before the condition
        is decided.  It is unsafe because we do not have "logbuf_lock" in NMI context
        and some other process migh modify "log_first_seq" in parallel"
      
      I believe that the best course of action is both
      
       - building kernel (and anything multi-threaded, I guess) with that
         optimization turned off
       - persuade gcc folks to change the default for future releases
      Signed-off-by: NJiri Kosina <jkosina@suse.cz>
      Cc: Martin Jambor <mjambor@suse.cz>
      Cc: Petr Mladek <pmladek@suse.cz>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Marek Polacek <polacek@redhat.com>
      Cc: Jakub Jelinek <jakub@redhat.com>
      Cc: Steven Noonan <steven@uplinklabs.net>
      Cc: Richard Biener <richard.guenther@gmail.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      69102311
    • D
      mm/zpool: update zswap to use zpool · 12d79d64
      Dan Streetman 提交于
      Change zswap to use the zpool api instead of directly using zbud.  Add a
      boot-time param to allow selecting which zpool implementation to use,
      with zbud as the default.
      Signed-off-by: NDan Streetman <ddstreet@ieee.org>
      Tested-by: NSeth Jennings <sjennings@variantweb.net>
      Cc: Weijie Yang <weijie.yang@samsung.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      12d79d64
    • D
      mm/zpool: zbud/zsmalloc implement zpool · c795779d
      Dan Streetman 提交于
      Update zbud and zsmalloc to implement the zpool api.
      
      [fengguang.wu@intel.com: make functions static]
      Signed-off-by: NDan Streetman <ddstreet@ieee.org>
      Tested-by: NSeth Jennings <sjennings@variantweb.net>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Weijie Yang <weijie.yang@samsung.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c795779d
    • D
      mm/zpool: implement common zpool api to zbud/zsmalloc · af8d417a
      Dan Streetman 提交于
      Add zpool api.
      
      zpool provides an interface for memory storage, typically of compressed
      memory.  Users can select what backend to use; currently the only
      implementations are zbud, a low density implementation with up to two
      compressed pages per storage page, and zsmalloc, a higher density
      implementation with multiple compressed pages per storage page.
      Signed-off-by: NDan Streetman <ddstreet@ieee.org>
      Tested-by: NSeth Jennings <sjennings@variantweb.net>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Weijie Yang <weijie.yang@samsung.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      af8d417a
    • D
      mm/zbud: change zbud_alloc size type to size_t · 99eef8e9
      Dan Streetman 提交于
      Change the type of the zbud_alloc() size param from unsigned int to
      size_t.
      
      Technically, this should not make any difference, as the zbud
      implementation already restricts the size to well within either type's
      limits; but as zsmalloc (and kmalloc) use size_t, and zpool will use
      size_t, this brings the size parameter type in line with zsmalloc/zpool.
      Signed-off-by: NDan Streetman <ddstreet@ieee.org>
      Acked-by: NSeth Jennings <sjennings@variantweb.net>
      Tested-by: NSeth Jennings <sjennings@variantweb.net>
      Cc: Weijie Yang <weijie.yang@samsung.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      99eef8e9
    • W
      zram: replace global tb_lock with fine grain lock · d2d5e762
      Weijie Yang 提交于
      Currently, we use a rwlock tb_lock to protect concurrent access to the
      whole zram meta table.  However, according to the actual access model,
      there is only a small chance for upper user to access the same
      table[index], so the current lock granularity is too big.
      
      The idea of optimization is to change the lock granularity from whole
      meta table to per table entry (table -> table[index]), so that we can
      protect concurrent access to the same table[index], meanwhile allow the
      maximum concurrency.
      
      With this in mind, several kinds of locks which could be used as a
      per-entry lock were tested and compared:
      
      Test environment:
      x86-64 Intel Core2 Q8400, system memory 4GB, Ubuntu 12.04,
      kernel v3.15.0-rc3 as base, zram with 4 max_comp_streams LZO.
      
      iozone test:
      iozone -t 4 -R -r 16K -s 200M -I +Z
      (1GB zram with ext4 filesystem, take the average of 10 tests, KB/s)
      
            Test       base      CAS    spinlock    rwlock   bit_spinlock
      -------------------------------------------------------------------
       Initial write  1381094   1425435   1422860   1423075   1421521
             Rewrite  1529479   1641199   1668762   1672855   1654910
                Read  8468009  11324979  11305569  11117273  10997202
             Re-read  8467476  11260914  11248059  11145336  10906486
        Reverse Read  6821393   8106334   8282174   8279195   8109186
         Stride read  7191093   8994306   9153982   8961224   9004434
         Random read  7156353   8957932   9167098   8980465   8940476
      Mixed workload  4172747   5680814   5927825   5489578   5972253
        Random write  1483044   1605588   1594329   1600453   1596010
              Pwrite  1276644   1303108   1311612   1314228   1300960
               Pread  4324337   4632869   4618386   4457870   4500166
      
      To enhance the possibility of access the same table[index] concurrently,
      set zram a small disksize(10MB) and let threads run with large loop
      count.
      
      fio test:
      fio --bs=32k --randrepeat=1 --randseed=100 --refill_buffers
      --scramble_buffers=1 --direct=1 --loops=3000 --numjobs=4
      --filename=/dev/zram0 --name=seq-write --rw=write --stonewall
      --name=seq-read --rw=read --stonewall --name=seq-readwrite
      --rw=rw --stonewall --name=rand-readwrite --rw=randrw --stonewall
      (10MB zram raw block device, take the average of 10 tests, KB/s)
      
          Test     base     CAS    spinlock    rwlock  bit_spinlock
      -------------------------------------------------------------
      seq-write   933789   999357   1003298    995961   1001958
       seq-read  5634130  6577930   6380861   6243912   6230006
         seq-rw  1405687  1638117   1640256   1633903   1634459
        rand-rw  1386119  1614664   1617211   1609267   1612471
      
      All the optimization methods show a higher performance than the base,
      however, it is hard to say which method is the most appropriate.
      
      On the other hand, zram is mostly used on small embedded system, so we
      don't want to increase any memory footprint.
      
      This patch pick the bit_spinlock method, pack object size and page_flag
      into an unsigned long table.value, so as to not increase any memory
      overhead on both 32-bit and 64-bit system.
      
      On the third hand, even though different kinds of locks have different
      performances, we can ignore this difference, because: if zram is used as
      zram swapfile, the swap subsystem can prevent concurrent access to the
      same swapslot; if zram is used as zram-blk for set up filesystem on it,
      the upper filesystem and the page cache also prevent concurrent access
      of the same block mostly.  So we can ignore the different performances
      among locks.
      Acked-by: NSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Reviewed-by: NDavidlohr Bueso <davidlohr@hp.com>
      Signed-off-by: NWeijie Yang <weijie.yang@samsung.com>
      Signed-off-by: NMinchan Kim <minchan@kernel.org>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d2d5e762
    • M
      zram: use size_t instead of u16 · 023b409f
      Minchan Kim 提交于
      Some architectures (eg, hexagon and PowerPC) could use PAGE_SHIFT of 16
      or more.  In these cases u16 is not sufficiently large to represent a
      compressed page's size so use size_t.
      Signed-off-by: NMinchan Kim <minchan@kernel.org>
      Reported-by: NWeijie Yang <weijie.yang@samsung.com>
      Acked-by: NSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      023b409f
    • S
      zram: remove unused SECTOR_SIZE define · a830eff7
      Sergey Senozhatsky 提交于
      Drop SECTOR_SIZE define, because it's not used.
      Signed-off-by: NSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Weijie Yang <weijie.yang@samsung.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a830eff7
    • S
      zram: rename struct `table' to `zram_table_entry' · cb8f2eec
      Sergey Senozhatsky 提交于
      Andrew Morton has recently noted that `struct table' actually represents
      table entry and, thus, should be renamed.  Rename to `zram_table_entry'.
      Signed-off-by: NSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Weijie Yang <weijie.yang@samsung.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      cb8f2eec
    • M
      mm/highmem: make kmap cache coloring aware · 15de36a4
      Max Filippov 提交于
      User-visible effect:
       Architectures that choose this method of maintaining cache coherency
      (MIPS and xtensa currently) are able to use high memory on cores with
      aliasing data cache.  Without this fix such architectures can not use
      high memory (in case of xtensa it means that at most 128 MBytes of
      physical memory is available).
      
      The problem:
       VIPT cache with way size larger than MMU page size may suffer from
      aliasing problem: a single physical address accessed via different
      virtual addresses may end up in multiple locations in the cache.
      Virtual mappings of a physical address that always get cached in
      different cache locations are said to have different colors.  L1 caching
      hardware usually doesn't handle this situation leaving it up to
      software.  Software must avoid this situation as it leads to data
      corruption.
      
      What can be done:
       One way to handle this is to flush and invalidate data cache every time
      page mapping changes color.  The other way is to always map physical
      page at a virtual address with the same color.  Low memory pages already
      have this property.  Giving architecture a way to control color of high
      memory page mapping allows reusing of existing low memory cache alias
      handling code.
      
      How this is done with this patch:
       Provide hooks that allow architectures with aliasing cache to align
      mapping address of high pages according to their color.  Such
      architectures may enforce similar coloring of low- and high-memory page
      mappings and reuse existing cache management functions to support
      highmem.
      
      This code is based on the implementation of similar feature for MIPS by
      Leonid Yegoshin.
      Signed-off-by: NMax Filippov <jcmvbkbc@gmail.com>
      Cc: Leonid Yegoshin <Leonid.Yegoshin@imgtec.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Marc Gauthier <marc@cadence.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Steven Hill <Steven.Hill@imgtec.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      15de36a4
    • P
      mmu_notifier: add call_srcu and sync function for listener to delay call and sync · b972216e
      Peter Zijlstra 提交于
      When kernel device drivers or subsystems want to bind their lifespan to
      t= he lifespan of the mm_struct, they usually use one of the following
      methods:
      
      1. Manually calling a function in the interested kernel module.  The
         funct= ion call needs to be placed in mmput.  This method was rejected
         by several ker= nel maintainers.
      
      2. Registering to the mmu notifier release mechanism.
      
      The problem with the latter approach is that the mmu_notifier_release
      cal= lback is called from__mmu_notifier_release (called from exit_mmap).
      That functi= on iterates over the list of mmu notifiers and don't expect
      the release call= back function to remove itself from the list.
      Therefore, the callback function= in the kernel module can't release the
      mmu_notifier_object, which is actuall= y the kernel module's object
      itself.  As a result, the destruction of the kernel module's object must
      to be done in a delayed fashion.
      
      This patch adds support for this delayed callback, by adding a new
      mmu_notifier_call_srcu function that receives a function ptr and calls
      th= at function with call_srcu.  In that function, the kernel module
      releases its object.  To use mmu_notifier_call_srcu, the calling module
      needs to call b= efore that a new function called
      mmu_notifier_unregister_no_release that as its= name implies,
      unregisters a notifier without calling its notifier release call= back.
      
      This patch also adds a function that will call barrier_srcu so those
      kern= el modules can sync with mmu_notifier.
      Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
      Signed-off-by: NJérôme Glisse <jglisse@redhat.com>
      Signed-off-by: NOded Gabbay <oded.gabbay@amd.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b972216e
    • C
      mm: BUG when __kmap_atomic_idx equals KM_TYPE_NR · 1d352bfd
      Chintan Pandya 提交于
      __kmap_atomic_idx is per_cpu variable.  Each CPU can use KM_TYPE_NR
      entries from FIXMAP i.e.  from 0 to KM_TYPE_NR - 1.  Allowing
      __kmap_atomic_idx to over- shoot to KM_TYPE_NR can mess up with next
      CPU's 0th entry which is a bug.  Hence BUG_ON if __kmap_atomic_idx >=
      KM_TYPE_NR.
      
      Fix the off-by-on in this test.
      Signed-off-by: NChintan Pandya <cpandya@codeaurora.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1d352bfd
    • J
      mm: memcontrol: clean up reclaim size variable use in try_charge() · 61e02c74
      Johannes Weiner 提交于
      Charge reclaim and OOM currently use the charge batch variable, but
      batching is already disabled at that point.  To simplify the charge
      logic, the batch variable is reset to the original request size when
      reclaim is entered, so it's functionally equal, but it's misleading.
      
      Switch reclaim/OOM to nr_pages, which is the original request size.
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      61e02c74
    • S
      kernel/smp.c:on_each_cpu_cond(): fix warning in fallback path · 618fde87
      Sasha Levin 提交于
      The rarely-executed memry-allocation-failed callback path generates a
      WARN_ON_ONCE() when smp_call_function_single() succeeds.  Presumably
      it's supposed to warn on failures.
      Signed-off-by: NSasha Levin <sasha.levin@oracle.com>
      Cc: Christoph Lameter <cl@gentwo.org>
      Cc: Gilad Ben-Yossef <gilad@benyossef.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Tejun Heo <htejun@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      618fde87
    • R
      mm: change confusing #ifdef use in __access_remote_vm · dbffcd03
      Rik van Riel 提交于
      This patch changes confusing #ifdef use in __access_remote_vm into
      merely ugly #ifdef use.
      
      Addresses bug https://bugzilla.kernel.org/show_bug.cgi?id=81651Signed-off-by: NRik van Riel <riel@redhat.com>
      Reported-by: NDavid Binderman <dcb314@hotmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      dbffcd03
    • P
      mm: softdirty: respect VM_SOFTDIRTY in PTE holes · 68b5a652
      Peter Feiner 提交于
      After a VMA is created with the VM_SOFTDIRTY flag set, /proc/pid/pagemap
      should report that the VMA's virtual pages are soft-dirty until
      VM_SOFTDIRTY is cleared (i.e., by the next write of "4" to
      /proc/pid/clear_refs).  However, pagemap ignores the VM_SOFTDIRTY flag
      for virtual addresses that fall in PTE holes (i.e., virtual addresses
      that don't have a PMD, PUD, or PGD allocated yet).
      
      To observe this bug, use mmap to create a VMA large enough such that
      there's a good chance that the VMA will occupy an unused PMD, then test
      the soft-dirty bit on its pages.  In practice, I found that a VMA that
      covered a PMD's worth of address space was big enough.
      
      This patch adds the necessary VMA lookup to the PTE hole callback in
      /proc/pid/pagemap's page walk and sets soft-dirty according to the VMAs'
      VM_SOFTDIRTY flag.
      Signed-off-by: NPeter Feiner <pfeiner@google.com>
      Acked-by: NCyrill Gorcunov <gorcunov@openvz.org>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Hugh Dickins <hughd@google.com>
      Acked-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      68b5a652
    • K
      mm: mark fault_around_bytes __read_mostly · 3a91053a
      Kirill A. Shutemov 提交于
      fault_around_bytes can only be changed via debugfs.  Let's mark it
      read-mostly.
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Suggested-by: NDavid Rientjes <rientjes@google.com>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Andrey Ryabinin <a.ryabinin@samsung.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3a91053a
    • K
      mm: close race between do_fault_around() and fault_around_bytes_set() · aecd6f44
      Kirill A. Shutemov 提交于
      Things can go wrong if fault_around_bytes will be changed under
      do_fault_around(): between fault_around_mask() and fault_around_pages().
      
      Let's read fault_around_bytes only once during do_fault_around() and
      calculate mask based on the reading.
      
      Note: fault_around_bytes can only be updated via debug interface.  Also
      I've tried but was not able to trigger a bad behaviour without the
      patch.  So I would not consider this patch as urgent.
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Andrey Ryabinin <a.ryabinin@samsung.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      aecd6f44
    • J
      memcg, vmscan: Fix forced scan of anonymous pages · 2ab051e1
      Jerome Marchand 提交于
      When memory cgoups are enabled, the code that decides to force to scan
      anonymous pages in get_scan_count() compares global values (free,
      high_watermark) to a value that is restricted to a memory cgroup (file).
      It make the code over-eager to force anon scan.
      
      For instance, it will force anon scan when scanning a memcg that is
      mainly populated by anonymous page, even when there is plenty of file
      pages to get rid of in others memcgs, even when swappiness == 0.  It
      breaks user's expectation about swappiness and hurts performance.
      
      This patch makes sure that forced anon scan only happens when there not
      enough file pages for the all zone, not just in one random memcg.
      
      [hannes@cmpxchg.org: cleanups]
      Signed-off-by: NJerome Marchand <jmarchan@redhat.com>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2ab051e1
    • J
      mm, vmscan: fix an outdated comment still mentioning get_scan_ratio · 7c0db9e9
      Jerome Marchand 提交于
      Quite a while ago, get_scan_ratio() has been renamed get_scan_count(),
      however a comment in shrink_active_list() still mention it.  This patch
      fixes the outdated comment.
      Signed-off-by: NJerome Marchand <jmarchan@redhat.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7c0db9e9
    • D
      mm, oom: remove unnecessary exit_state check · fb794bcb
      David Rientjes 提交于
      The oom killer scans each process and determines whether it is eligible
      for oom kill or whether the oom killer should abort because of
      concurrent memory freeing.  It will abort when an eligible process is
      found to have TIF_MEMDIE set, meaning it has already been oom killed and
      we're waiting for it to exit.
      
      Processes with task->mm == NULL should not be considered because they
      are either kthreads or have already detached their memory and killing
      them would not lead to memory freeing.  That memory is only freed after
      exit_mm() has returned, however, and not when task->mm is first set to
      NULL.
      
      Clear TIF_MEMDIE after exit_mm()'s mmput() so that an oom killed process
      is no longer considered for oom kill, but only until exit_mm() has
      returned.  This was fragile in the past because it relied on
      exit_notify() to be reached before no longer considering TIF_MEMDIE
      processes.
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fb794bcb
    • L
      mm: fix potential infinite loop in dissolve_free_huge_pages() · d0177639
      Li Zhong 提交于
      It is possible for some platforms, such as powerpc to set HPAGE_SHIFT to
      0 to indicate huge pages not supported.
      
      When this is the case, hugetlbfs could be disabled during boot time:
      hugetlbfs: disabling because there are no supported hugepage sizes
      
      Then in dissolve_free_huge_pages(), order is kept maximum (64 for
      64bits), and the for loop below won't end: for (pfn = start_pfn; pfn <
      end_pfn; pfn += 1 << order)
      
      As suggested by Naoya, below fix checks hugepages_supported() before
      calling dissolve_free_huge_pages().
      
      [rientjes@google.com: no legitimate reason to call dissolve_free_huge_pages() when !hugepages_supported()]
      Signed-off-by: NLi Zhong <zhong@linux.vnet.ibm.com>
      Acked-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Cc: <stable@vger.kernel.org>	[3.12+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d0177639
    • D
      mm, thp: restructure thp avoidance of light synchronous migration · 8fe78048
      David Rientjes 提交于
      __GFP_NO_KSWAPD, once the way to determine if an allocation was for thp
      or not, has gained more users.  Their use is not necessarily wrong, they
      are trying to do a memory allocation that can easily fail without
      disturbing kswapd, so the bit has gained additional usecases.
      
      This restructures the check to determine whether MIGRATE_SYNC_LIGHT
      should be used for memory compaction in the page allocator.  Rather than
      testing solely for __GFP_NO_KSWAPD, test for all bits that must be set
      for thp allocations.
      
      This also moves the check to be done only after the page allocator is
      aborted for deferred or contended memory compaction since setting
      migration_mode for this case is pointless.
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8fe78048
    • D
      mm, oom: rename zonelist locking functions · e972a070
      David Rientjes 提交于
      try_set_zonelist_oom() and clear_zonelist_oom() are not named properly
      to imply that they require locking semantics to avoid out_of_memory()
      being reordered.
      
      zone_scan_lock is required for both functions to ensure that there is
      proper locking synchronization.
      
      Rename try_set_zonelist_oom() to oom_zonelist_trylock() and rename
      clear_zonelist_oom() to oom_zonelist_unlock() to imply there is proper
      locking semantics.
      
      At the same time, convert oom_zonelist_trylock() to return bool instead
      of int since only success and failure are tested.
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e972a070
    • D
      mm, oom: ensure memoryless node zonelist always includes zones · 8d060bf4
      David Rientjes 提交于
      With memoryless node support being worked on, it's possible that for
      optimizations that a node may not have a non-NULL zonelist.  When
      CONFIG_NUMA is enabled and node 0 is memoryless, this means the zonelist
      for first_online_node may become NULL.
      
      The oom killer requires a zonelist that includes all memory zones for
      the sysrq trigger and pagefault out of memory handler.
      
      Ensure that a non-NULL zonelist is always passed to the oom killer.
      
      [akpm@linux-foundation.org: fix non-numa build]
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8d060bf4
    • W
      memory-hotplug: sh: suitable memory should go to ZONE_MOVABLE · 6e90b58b
      Wang Nan 提交于
      This patch introduces zone_for_memory() to arch_add_memory() on sh to
      ensure new, higher memory added into ZONE_MOVABLE if movable zone has
      already setup.
      Signed-off-by: NWang Nan <wangnan0@huawei.com>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: "Mel Gorman" <mgorman@suse.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6e90b58b
    • W
      memory-hotplug: ppc: suitable memory should go to ZONE_MOVABLE · f51202de
      Wang Nan 提交于
      This patch introduces zone_for_memory() to arch_add_memory() on powerpc
      to ensure new, higher memory added into ZONE_MOVABLE if movable zone has
      already setup.
      Signed-off-by: NWang Nan <wangnan0@huawei.com>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: "Mel Gorman" <mgorman@suse.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f51202de
    • W
      memory-hotplug: ia64: suitable memory should go to ZONE_MOVABLE · ed562ae6
      Wang Nan 提交于
      This patch introduces zone_for_memory() to arch_add_memory() on ia64 to
      ensure new, higher memory added into ZONE_MOVABLE if movable zone has
      already setup.
      Signed-off-by: NWang Nan <wangnan0@huawei.com>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: "Mel Gorman" <mgorman@suse.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ed562ae6
    • W
      memory-hotplug: x86_32: suitable memory should go to ZONE_MOVABLE · 03d4be64
      Wang Nan 提交于
      This patch introduces zone_for_memory() to arch_add_memory() on x86_32
      to ensure new, higher memory added into ZONE_MOVABLE if movable zone has
      already setup.
      Signed-off-by: NWang Nan <wangnan0@huawei.com>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: "Mel Gorman" <mgorman@suse.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      03d4be64
    • W
      memory-hotplug: x86_64: suitable memory should go to ZONE_MOVABLE · 9bfc4113
      Wang Nan 提交于
      This patch introduces zone_for_memory() to arch_add_memory() on x86_64
      to ensure new, higher memory added into ZONE_MOVABLE if movable zone has
      already setup.
      Signed-off-by: NWang Nan <wangnan0@huawei.com>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: "Mel Gorman" <mgorman@suse.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9bfc4113
    • W
      memory-hotplug: add zone_for_memory() for selecting zone for new memory · 63264400
      Wang Nan 提交于
      This series of patches fixes a problem when adding memory in bad manner.
      For example: for a x86_64 machine booted with "mem=400M" and with 2GiB
      memory installed, following commands cause problem:
      
        # echo 0x40000000 > /sys/devices/system/memory/probe
       [   28.613895] init_memory_mapping: [mem 0x40000000-0x47ffffff]
        # echo 0x48000000 > /sys/devices/system/memory/probe
       [   28.693675] init_memory_mapping: [mem 0x48000000-0x4fffffff]
        # echo online_movable > /sys/devices/system/memory/memory9/state
        # echo 0x50000000 > /sys/devices/system/memory/probe
       [   29.084090] init_memory_mapping: [mem 0x50000000-0x57ffffff]
        # echo 0x58000000 > /sys/devices/system/memory/probe
       [   29.151880] init_memory_mapping: [mem 0x58000000-0x5fffffff]
        # echo online_movable > /sys/devices/system/memory/memory11/state
        # echo online> /sys/devices/system/memory/memory8/state
        # echo online> /sys/devices/system/memory/memory10/state
        # echo offline> /sys/devices/system/memory/memory9/state
       [   30.558819] Offlined Pages 32768
        # free
                    total       used       free     shared    buffers     cached
       Mem:        780588 18014398509432020     830552          0          0      51180
       -/+ buffers/cache: 18014398509380840     881732
       Swap:            0          0          0
      
      This is because the above commands probe higher memory after online a
      section with online_movable, which causes ZONE_HIGHMEM (or ZONE_NORMAL
      for systems without ZONE_HIGHMEM) overlaps ZONE_MOVABLE.
      
      After the second online_movable, the problem can be observed from
      zoneinfo:
      
        # cat /proc/zoneinfo
        ...
        Node 0, zone  Movable
          pages free     65491
                min      250
                low      312
                high     375
                scanned  0
                spanned  18446744073709518848
                present  65536
                managed  65536
        ...
      
      This series of patches solve the problem by checking ZONE_MOVABLE when
      choosing zone for new memory.  If new memory is inside or higher than
      ZONE_MOVABLE, makes it go there instead.
      
      After applying this series of patches, following are free and zoneinfo
      result (after offlining memory9):
      
        bash-4.2# free
                      total       used       free     shared    buffers     cached
         Mem:        780956      80112     700844          0          0      51180
         -/+ buffers/cache:      28932     752024
         Swap:            0          0          0
      
        bash-4.2# cat /proc/zoneinfo
      
        Node 0, zone      DMA
          pages free     3389
                min      14
                low      17
                high     21
                scanned  0
                spanned  4095
                present  3998
                managed  3977
            nr_free_pages 3389
        ...
          start_pfn:         1
          inactive_ratio:    1
        Node 0, zone    DMA32
          pages free     73724
                min      341
                low      426
                high     511
                scanned  0
                spanned  98304
                present  98304
                managed  92958
            nr_free_pages 73724
          ...
          start_pfn:         4096
          inactive_ratio:    1
        Node 0, zone   Normal
          pages free     32630
                min      120
                low      150
                high     180
                scanned  0
                spanned  32768
                present  32768
                managed  32768
            nr_free_pages 32630
        ...
          start_pfn:         262144
          inactive_ratio:    1
        Node 0, zone  Movable
          pages free     65476
                min      241
                low      301
                high     361
                scanned  0
                spanned  98304
                present  65536
                managed  65536
            nr_free_pages 65476
        ...
          start_pfn:         294912
          inactive_ratio:    1
      
      This patch (of 7):
      
      Introduce zone_for_memory() in arch independent code for
      arch_add_memory() use.
      
      Many arch_add_memory() function simply selects ZONE_HIGHMEM or
      ZONE_NORMAL and add new memory into it.  However, with the existance of
      ZONE_MOVABLE, the selection method should be carefully considered: if
      new, higher memory is added after ZONE_MOVABLE is setup, the default
      zone and ZONE_MOVABLE may overlap each other.
      
      should_add_memory_movable() checks the status of ZONE_MOVABLE.  If it
      has already contain memory, compare the address of new memory and
      movable memory.  If new memory is higher than movable, it should be
      added into ZONE_MOVABLE instead of default zone.
      Signed-off-by: NWang Nan <wangnan0@huawei.com>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: "Mel Gorman" <mgorman@suse.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      63264400