1. 11 Jun, 2014 5 commits
  2. 07 Jun, 2014 12 commits
    • S
      svcrdma: refactor marshalling logic · 0bf48289
      Steve Wise committed
      This patch refactors the NFSRDMA server marshalling logic to
      remove the intermediary map structures.  It also fixes an existing bug
      where the NFSRDMA server was not minding the device fast register page
      list length limitations.
      Signed-off-by: Tom Tucker <tom@opengridcomputing.com>
      Signed-off-by: Steve Wise <swise@opengridcomputing.com>
      0bf48289
    • J
      nfs4: remove unused CHANGE_SECURITY_LABEL · 999e5683
      J. Bruce Fields committed
      This constant has the wrong value.  And we don't use it.  And it's been
      removed from the 4.2 spec anyway.
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
      999e5683
    • G
      idle: remove cpu_idle() forward declarations · ae022622
      Geert Uytterhoeven committed
      After all architectures were converted to the generic idle framework,
      commit d190e819 ("idle: Remove GENERIC_IDLE_LOOP config switch")
      removed the last caller of cpu_idle().  The forward declarations in
      header files were forgotten.
      Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ae022622
    • C
      mm: introduce kmemleak_update_trace() · ffe2c748
      Catalin Marinas committed
      The memory allocation stack trace is not always useful for debugging a
      memory leak (e.g.  radix_tree_preload).  This function, when called,
      updates the stack trace for an already allocated object.
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ffe2c748
    • J
      key: convert use of typedef ctl_table to struct ctl_table · d6f50c95
      Joe Perches committed
      This typedef is unnecessary and should just be removed.
      Signed-off-by: Joe Perches <joe@perches.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d6f50c95
    • M
      ipc/shm.c: increase the defaults for SHMALL, SHMMAX · 060028ba
      Manfred Spraul committed
      System V shared memory
      
      a) can be abused to trigger out-of-memory conditions and the standard
         measures against out-of-memory do not work:
      
          - it is not possible to use setrlimit to limit the size of shm segments.
      
          - segments can exist without association with any processes, thus
            the oom-killer is unable to free that memory.
      
      b) is typically used for shared information - today often multiple GB.
         (e.g. database shared buffers)
      
      The current default is a maximum segment size of 32 MB and a maximum
      total size of 8 GB.  This is often too much for a) and not enough for
      b), which means that lots of users must change the defaults.
      
      This patch increases the default limits (nearly) to the maximum, which
      is perfect for case b).  The defaults are used after boot and as the
      initial value for each new namespace.
      
      Admins/distros that need a protection against a) should reduce the
      limits and/or enable shm_rmid_forced.
      
      Unix has historically required setting these limits for shared memory,
      and Linux inherited such behavior.  The consequence of this is added
      complexity for users and administrators.  One very common example is
      database setup/installation documents and scripts, where users must
      manually calculate the values for these limits.  This also requires
      (some) knowledge of how the underlying memory management works, so
      the limits often end up flat out wrong.
      Disabling these limits sooner could have saved companies a lot of time,
      headaches and money for support.  But it's never too late: simplify
      users' lives now.
      
      Further notes:
      - The patch only changes the defaults; overrides behave as before:
              # sysctl kernel.shmmax=33554432
        would recreate the previous limit for SHMMAX (for the current namespace).
      
      - Disabling sysv shm allocation is possible with:
              # sysctl kernel.shmall=0
        (not a new feature, also per-namespace)
      
      - The limits are intentionally set to a value slightly less than ULONG_MAX,
        to avoid triggering overflows in user space apps.
        [not unreasonable, see http://marc.info/?l=linux-mm&m=139638334330127]
      Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
      Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
      Reported-by: Davidlohr Bueso <davidlohr@hp.com>
      Acked-by: Michael Kerrisk <mtk.manpages@gmail.com>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      060028ba
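      The arithmetic behind the old defaults quoted in the message above can be
      checked with a small userspace sketch. It relies on kernel.shmmax being
      expressed in bytes and kernel.shmall in pages; the 4 KiB page size, the
      helper names and the margin passed to near_ulong_max() are assumptions made
      for illustration and are not taken from the patch.

```c
#include <assert.h>
#include <limits.h>

/* Old defaults quoted in the commit message. */
#define OLD_SHMMAX_BYTES  (32UL * 1024 * 1024)          /* 32 MB per segment */
#define OLD_TOTAL_BYTES   (8ULL * 1024 * 1024 * 1024)   /* 8 GB in total */

/* Assumed page size for this sketch; real systems may differ. */
#define ASSUMED_PAGE_SIZE 4096ULL

/* Convert a byte limit into a page count (the unit used by kernel.shmall). */
static unsigned long long bytes_to_pages(unsigned long long bytes,
                                         unsigned long long page_size)
{
    assert(page_size != 0);
    return bytes / page_size;
}

/*
 * Illustrative "slightly less than ULONG_MAX" limit. The margin is a
 * caller-supplied assumption, not the value used by the kernel.
 */
static unsigned long near_ulong_max(unsigned long margin)
{
    assert(margin < ULONG_MAX);
    return ULONG_MAX - margin;
}
```

      Under these assumptions the 32 MB segment limit is 33554432 bytes, and the
      8 GB total corresponds to 2097152 pages.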
    • L
      idr: reorder the fields · dcbff5d1
      Lai Jiangshan committed
      idr_layer->layer is always accessed in the read path, so move it to the
      front.

      idr_layer->bitmap is moved to the bottom, and rcu_head shares storage with
      bitmap because they are never accessed at the same time.

      idr->id_free/id_free_cnt/lock are free list fields, and are moved to the
      bottom.  They will be removed in the near future.
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dcbff5d1
    • O
      signals: introduce kernel_sigaction() · b4e74264
      Oleg Nesterov committed
      Now that allow_signal() is really trivial we can unify it with
      disallow_signal().  Add the new helper, kernel_sigaction(), and
      reimplement allow_signal/disallow_signal as trivial wrappers.

      This saves one EXPORT_SYMBOL() and the new helper can have more users.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b4e74264
    • O
      signals: mv {dis,}allow_signal() from sched.h/exit.c to signal.[ch] · 0341729b
      Oleg Nesterov committed
      Move the declaration/definition of allow_signal/disallow_signal to
      signal.h/signal.c.  The new place is more logical and allows the
      static helpers in signal.c to be used (see the next changes).

      While at it, make them return void and remove the valid_signal() check.
      Nobody checks the returned value, and in-kernel users must not pass the
      wrong signal number.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0341729b
    • O
      signals: kill sigfindinword() · 36fac0a2
      Oleg Nesterov committed
      It has no users and it doesn't look useful.  I do not know why/when it was
      introduced, I can't even find any user in the git history.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      36fac0a2
    • M
      ptrace: fix fork event messages across pid namespaces · 4e52365f
      Matthew Dempsky committed
      When tracing a process in another pid namespace, it's important for fork
      event messages to contain the child's pid as seen from the tracer's pid
      namespace, not the parent's.  Otherwise, the tracer won't be able to
      correlate the fork event with later SIGTRAP signals it receives from the
      child.
      
      We still risk a race condition if a ptracer from a different pid
      namespace attaches after we compute the pid_t value.  However, sending a
      bogus fork event message in this unlikely scenario is still a vast
      improvement over the status quo where we always send bogus fork event
      messages to debuggers in a different pid namespace than the forking
      process.
      Signed-off-by: Matthew Dempsky <mdempsky@chromium.org>
      Acked-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Julien Tinnes <jln@chromium.org>
      Cc: Roland McGrath <mcgrathr@chromium.org>
      Cc: Jan Kratochvil <jan.kratochvil@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4e52365f
    • M
      drivers/rtc/rtc-cmos.c: drivers/char/rtc.c features for DECstation support · 31632dbd
      Maciej W. Rozycki committed
      This brings in drivers/char/rtc.c functionality required for DECstation
      and, should the maintainers decide to switch, Alpha systems to use
      rtc-cmos.
      
      Specifically these features are made available:
      
      * RTC iomem rather than x86/PCI port I/O mapping, controlled with the
        RTC_IOMAPPED macro as with the original driver.  The DS1287A chip in all
        DECstation systems is mapped in the host bus address space as a
        contiguous block of 64 32-bit words of which the least significant byte
        accesses the RTC chip for both reads and writes.  All the address and
        data window register accesses are made transparently by the chipset glue
        logic so that the device appears directly mapped on the host bus.
      
      * A way to set the size of the address space explicitly with the
        newly-added `address_space' member of the platform part of the RTC
        device structure.  This avoids the unreliable heuristic that does not
        work in a setup where the RTC is not explicitly accessed with the usual
        address and data window register pair.
      
      * The ability to use the RTC periodic interrupt as a system clock
        device, which is implemented by arch/mips/kernel/cevt-ds1287.c for
        DECstation systems and takes the RTC interrupt away from the RTC driver.
        Eventually hooking back to the clock device's interrupt handler should
        be possible for the purpose of the alarm clock and possibly also
        update-in-progress interrupt, but this is not done by this change.

        o To avoid interfering with the clock interrupt all the places where
          the RTC interrupt mask is fiddled with are only executed if an IRQ
          has been assigned to the RTC driver.
      
        o To avoid changing the clock setup Register A is not fiddled with
          if CMOS_RTC_FLAGS_NOFREQ is set in the newly-added `flags' member of
          the platform part of the RTC device structure.  Originally, in
          drivers/char/rtc.c, this was keyed with the absence of the RTC
          interrupt, just like the interrupt mask, but there only the periodic
          interrupt frequency is set, whereas rtc-cmos also sets the divider
          bits.  Therefore a new flag is introduced so that systems where the
          RTC interrupt is not usable rather than used as a system clock device
          can fully initialise the RTC.
      
      * A small clean-up is made to the IRQ assignment code that makes the IRQ
        number hardcoded to -1 rather than arbitrary -ENXIO (or whatever error
        happens to be returned by platform_get_irq) where no IRQ has been
        assigned to the RTC driver (NO_IRQ might be another candidate, but it
        looks like this macro has inconsistent or missing definitions and
        limited use and might therefore be unsafe).
      
      Verified to work correctly with a DECstation 5000/240 system.
      
      [akpm@linux-foundation.org: fix weird code layout]
      Signed-off-by: Maciej W. Rozycki <macro@linux-mips.org>
      Cc: Alessandro Zummo <a.zummo@towertech.it>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      31632dbd
  3. 06 Jun, 2014 3 commits
    • J
      blk-mq: bump max tag depth to 10K tags · a4391c64
      Jens Axboe committed
      For some scsi-mq cases, the tag map can be huge. So increase the
      max number of tags we support.
      
      Additionally, don't fail with EINVAL if a user requests too many
      tags. Warn that the tag depth has been adjusted down, and store
      the new value inside the tag_set passed in.
      Signed-off-by: Jens Axboe <axboe@fb.com>
      a4391c64
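      The "adjust and warn instead of failing" behaviour described above can be
      sketched in userspace as follows. The structure, function name and the
      10240 limit are hypothetical stand-ins chosen for illustration (the message
      only says "10K tags"); this is not the blk-mq code itself.

```c
#include <assert.h>
#include <stdio.h>

/* Assumed upper bound for this sketch; the commit message says "10K tags". */
#define MAX_TAG_DEPTH 10240u

/* Hypothetical, much reduced stand-in for a tag set description. */
struct tag_set {
    unsigned int queue_depth;
};

/*
 * Clamp an oversized request instead of rejecting it. The adjusted depth
 * is stored back into the caller's structure so the caller can see it.
 */
static int tag_set_init(struct tag_set *set)
{
    assert(set != NULL);

    if (set->queue_depth > MAX_TAG_DEPTH) {
        fprintf(stderr, "tag depth %u too large, reducing to %u\n",
                set->queue_depth, MAX_TAG_DEPTH);
        set->queue_depth = MAX_TAG_DEPTH;
    }
    return 0; /* no EINVAL-style failure for a too-large depth */
}
```

      A caller that asks for 65536 tags gets a successful return and finds
      queue_depth lowered to the maximum; a caller that asks for 64 is left
      untouched.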
    • J
      block: add blk_rq_set_block_pc() · f27b087b
      Jens Axboe committed
      With the optimizations around not clearing the full request at alloc
      time, we are leaving some of the needed init for REQ_TYPE_BLOCK_PC
      up to the user allocating the request.
      
      Add a blk_rq_set_block_pc() that sets the command type to
      REQ_TYPE_BLOCK_PC, and properly initializes the members associated
      with this type of request. Update callers to use this function instead
      of manipulating rq->cmd_type directly.
      
      Includes fixes from Christoph Hellwig <hch@lst.de> for my half-assed
      attempt.
      Signed-off-by: Jens Axboe <axboe@fb.com>
      f27b087b
    • J
      block: add notion of a chunk size for request merging · 762380ad
      Jens Axboe committed
      Some drivers have different limits on what size a request should
      optimally be, depending on the offset of the request, similar to
      dividing a device into chunks. Add a setting that allows the driver
      to inform the block layer of such a chunk size. The block layer will
      then prevent merging across the chunks.
      
      This is needed to optimally support NVMe with a non-zero stripe size.
      Signed-off-by: Jens Axboe <axboe@fb.com>
      762380ad
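      One way to picture the chunk rule is a helper that reports how many sectors
      still fit before the next chunk boundary, plus a merge check built on it.
      This is a minimal sketch that assumes a power-of-two chunk size; the
      function names are hypothetical and are not the block layer's own.

```c
#include <assert.h>

/*
 * Sectors that can still be covered starting at `offset` without crossing
 * into the next chunk. Assumes chunk_sectors is a non-zero power of two.
 */
static unsigned int max_sectors_at_offset(unsigned long long offset,
                                          unsigned int chunk_sectors)
{
    assert(chunk_sectors != 0 &&
           (chunk_sectors & (chunk_sectors - 1)) == 0);
    return chunk_sectors - (unsigned int)(offset & (chunk_sectors - 1));
}

/*
 * A request starting at `start` with `cur_sectors` sectors may absorb
 * `extra_sectors` more only if the result stays inside one chunk.
 */
static int can_merge(unsigned long long start, unsigned int cur_sectors,
                     unsigned int extra_sectors, unsigned int chunk_sectors)
{
    return cur_sectors + extra_sectors <=
           max_sectors_at_offset(start, chunk_sectors);
}
```

      With a 256-sector chunk, a request at sector 192 has 64 sectors of room, so
      growing a 32-sector request by another 64 sectors would be refused.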
  4. 05 Jun, 2014 20 commits
    • A
      647f010b
    • O
      kthreads: kill CLONE_KERNEL, change kernel_thread(kernel_init) to avoid CLONE_SIGHAND · 34a1b723
      Oleg Nesterov committed
      1. Remove CLONE_KERNEL, it has no users and it is dangerous.
      
         The (old) comment says "List of flags we want to share for kernel
         threads" but this is not true, we do not want to share ->sighand by
         default. This flag can only be used if the caller is sure that both
         parent/child will never play with signals (say, allow_signal/etc).
      
      2. Change rest_init() to clone kernel_init() without CLONE_SIGHAND.
      
         In this case CLONE_SIGHAND does not really hurt, and it looks like
         optimization because copy_sighand() can avoid kmem_cache_alloc().
      
         But in fact this only adds a minor pessimization. kernel_init()
         is going to exec the init process, and de_thread() will need to
         unshare ->sighand and do kmem_cache_alloc(sighand_cachep) anyway,
         but it needs to do more work and take tasklist_lock and siglock.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Acked-by: Steven Rostedt <rostedt@goodmis.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      34a1b723
    • B
      kernel/printk: use symbolic defines for console loglevels · a8fe19eb
      Borislav Petkov committed
      ... instead of naked numbers.
      
      Stuff in sysrq.c used to set it to 8 which is supposed to mean above
      default level so set it to DEBUG instead as we're terminating/killing all
      tasks and we want to be verbose there.
      
      Also, correct the check in x86_64_start_kernel which should be >= as
      we're clearly issuing the string there for all debug levels, not only
      the magical 10.
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Acked-by: Kees Cook <keescook@chromium.org>
      Acked-by: Randy Dunlap <rdunlap@infradead.org>
      Cc: Joe Perches <joe@perches.com>
      Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a8fe19eb
    • D
      Documentation: expand/clarify debug documentation · 6e099f55
      Dan Streetman committed
      The pr_debug() and related debug print macros all differ from the normal
      pr_XXX() macros, in that the normal ones print unconditionally, while
      the debug macros are compiled out unless DEBUG is defined or
      CONFIG_DYNAMIC_DEBUG is set.  This isn't obvious, and the only way to
      find this out is either to review the actual printk.h code or to read
      CodingStyle, and the message there doesn't highlight the fact.
      
      Change Documentation/CodingStyle to clearly indicate that pr_debug() and
      related debug printing macros behave differently than all other pr_XXX()
      macros, and attempt to clarify when and where the different debug
      printing methods might be used.
      
      Add short comment to printk.h above the pr_XXX() macros indicating that
      while these macros print unconditionally, pr_debug() does not.
      Signed-off-by: Dan Streetman <ddstreet@ieee.org>
      Cc: Joe Perches <joe@perches.com>
      Cc: Fabian Frederick <fabf@skynet.be>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6e099f55
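      The difference between an unconditional print macro and a debug macro that
      is compiled out can be illustrated with a userspace analogue. The names
      log_info(), log_debug() and log_line() are hypothetical; the sketch only
      models the "compiled out unless DEBUG is defined" half of the behaviour and
      does not model CONFIG_DYNAMIC_DEBUG.

```c
#include <assert.h>
#include <stdio.h>

static int lines_printed; /* how many lines were really emitted */

static int log_line(const char *level, const char *msg)
{
    assert(level != NULL && msg != NULL);
    lines_printed++;
    return printf("%s: %s\n", level, msg);
}

/* Always prints, like the ordinary pr_XXX() style macros. */
#define log_info(msg) log_line("info", msg)

/* Prints only when DEBUG is defined at compile time; otherwise a no-op. */
#ifdef DEBUG
#define log_debug(msg) log_line("debug", msg)
#else
#define log_debug(msg) ((void)0)
#endif

static void probe_device(void)
{
    log_info("device found");
    log_debug("register dump follows");
}
```

      Built without -DDEBUG, probe_device() emits a single line; built with
      -DDEBUG, it emits two.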
    • J
      printk: Add printk_deferred_once · c224815d
      John Stultz committed
      Two of the three printk_deferred uses are really printk_once style
      uses, so add a printk_deferred_once macro to simplify those call
      sites.
      Signed-off-by: John Stultz <john.stultz@linaro.org>
      Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Jiri Bohac <jbohac@suse.cz>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c224815d
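      The "once" pattern that such a macro wraps can be sketched in userspace
      with a per-call-site static flag. print_deferred() and
      print_deferred_once() below are hypothetical stand-ins, and the plain
      bool flag makes no attempt at the concurrency concerns a kernel
      implementation has to consider.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

static int messages_emitted; /* counts messages that were really printed */

/* Hypothetical stand-in for a deferred print function. */
static void print_deferred(const char *msg)
{
    assert(msg != NULL);
    messages_emitted++;
    fputs(msg, stderr);
}

/* Emit the message at most once per call site. */
#define print_deferred_once(msg)                \
    do {                                        \
        static bool already_printed;            \
        if (!already_printed) {                 \
            already_printed = true;             \
            print_deferred(msg);                \
        }                                       \
    } while (0)

static void report_clock_problem(void)
{
    print_deferred_once("clock went backwards\n");
}
```

      Calling report_clock_problem() repeatedly prints the message only on the
      first call, because the flag lives in the expansion at that call site.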
    • J
      printk: rename printk_sched to printk_deferred · aac74dc4
      John Stultz committed
      After learning we'll need some sort of deferred printk functionality in
      the timekeeping core, Peter suggested we rename the printk_sched function
      so it can be reused by needed subsystems.
      
      This only changes the function name. No logic changes.
      Signed-off-by: John Stultz <john.stultz@linaro.org>
      Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Jiri Bohac <jbohac@suse.cz>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      aac74dc4
    • K
      kernel/user.c: drop unused field 'files' from user_struct · b300a4ea
      Kirill A. Shutemov committed
      Nobody seems to have used it for a long time. Let's drop it.
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b300a4ea
    • J
      compiler.h: avoid sparse errors in __compiletime_error_fallback() · 2c0d259e
      James Hogan committed
      Usually, BUG_ON and friends aren't even evaluated in sparse, but recently
      compiletime_assert_atomic_type() was added, and that now results in a
      sparse warning every time it is used.
      
      The reason turns out to be the temporary variable: once it is used, sparse
      no longer considers the value to be a constant, which results in a warning
      and an error.  The error is the more annoying part of this as it suppresses
      any further warnings in the same file, hiding other problems.
      
      Unfortunately the condition cannot be simply expanded out to avoid the
      temporary variable since it breaks compiletime_assert on old versions of
      GCC such as GCC 4.2.4 which the latest metag compiler is based on.
      
      Therefore #ifndef __CHECKER__ out the __compiletime_error_fallback which
      uses the potentially negative size array to trigger a conditional compiler
      error, so that sparse doesn't see it.
      Signed-off-by: James Hogan <james.hogan@imgtec.com>
      Cc: Johannes Berg <johannes.berg@intel.com>
      Cc: Daniel Santos <daniel.santos@pobox.com>
      Cc: Luciano Coelho <luciano.coelho@intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Acked-by: Johannes Berg <johannes@sipsolutions.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2c0d259e
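      The negative-size-array trick mentioned above, and the idea of hiding it
      from the checker, can be sketched like this. The macro name is a
      hypothetical simplification, and unlike the case described in the message
      it takes a constant expression directly rather than going through a
      temporary variable.

```c
#include <assert.h>

#ifndef __CHECKER__
/*
 * If `condition` is true the array size becomes -1 and compilation fails;
 * if it is false the size is 1 and the statement has no runtime effect.
 */
# define compiletime_error_fallback(condition) \
    do { ((void)sizeof(char[1 - 2 * !!(condition)])); } while (0)
#else
/* Hidden from the static checker, which then never sees the array. */
# define compiletime_error_fallback(condition) do { } while (0)
#endif

static int checked_int_width(void)
{
    /* Compile-time check: refuse to build if int is narrower than 2 bytes. */
    compiletime_error_fallback(sizeof(int) < 2);

    /* Runtime counterpart of the same check, for comparison. */
    assert(sizeof(int) >= 2);
    return (int)sizeof(int);
}
```

      Changing the condition to something true, such as sizeof(int) < 64, should
      make the build fail at the macro's expansion when __CHECKER__ is not
      defined.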
    • F
      mm/zbud.c: make size unsigned like unique callsite · 50417c55
      Fabian Frederick committed
      zbud_alloc is only called by zswap_frontswap_store with unsigned int len.
      Change function parameter + update >= 0 check.
      Signed-off-by: Fabian Frederick <fabf@skynet.be>
      Acked-by: Seth Jennings <sjennings@variantweb.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      50417c55
    • N
      hugetlb: rename hugepage_migration_support() to ..._supported() · 100873d7
      Naoya Horiguchi committed
      We already have a function named hugepages_supported(), and the similar
      name hugepage_migration_support() is a bit uncomfortable, so let's rename
      it to hugepage_migration_supported().
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      100873d7
    • K
      mm/rmap.c: cleanup ttu_flags · daa5ba76
      Konstantin Khlebnikov committed
      Transform the action part of ttu_flags into individual bits.  These flags
      aren't part of any user-space visible API or even trace events.
      Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      daa5ba76
    • J
      mm/vmscan.c: use DIV_ROUND_UP for calculation of zone's balance_gap and correct comments. · 4be89a34
      Jianyu Zhan committed
      Currently, we use (zone->managed_pages + KSWAPD_ZONE_BALANCE_GAP_RATIO-1)
      / KSWAPD_ZONE_BALANCE_GAP_RATIO to avoid a zero gap value.  It's better to
      use DIV_ROUND_UP macro for neater code and clear meaning.
      
      Besides, the gap value is calculated against the per-zone "managed pages",
      not "present pages".  This patch also corrects the comment and does some
      rephrasing.
      Signed-off-by: Jianyu Zhan <nasa4836@gmail.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: Rafael Aquini <aquini@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4be89a34
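      The equivalence between the open-coded expression and a round-up division
      macro can be checked with a short sketch. GAP_RATIO is a hypothetical
      stand-in for KSWAPD_ZONE_BALANCE_GAP_RATIO, with the value 100 assumed
      purely for illustration.

```c
#include <assert.h>

/* Integer division that rounds up instead of truncating. */
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/* Assumed ratio for this sketch only. */
#define GAP_RATIO 100UL

/* The open-coded form quoted in the commit message. */
static unsigned long balance_gap_open_coded(unsigned long managed_pages)
{
    return (managed_pages + GAP_RATIO - 1) / GAP_RATIO;
}

/* The same computation expressed with the macro. */
static unsigned long balance_gap(unsigned long managed_pages)
{
    return DIV_ROUND_UP(managed_pages, GAP_RATIO);
}

/* Both forms must agree for any input that does not overflow. */
static void check_equivalence(unsigned long managed_pages)
{
    assert(balance_gap(managed_pages) ==
           balance_gap_open_coded(managed_pages));
}
```

      Rounding up is what keeps the gap non-zero for small zones: with the
      assumed ratio, a zone with a single managed page still gets a gap of 1,
      where plain truncating division would give 0.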
    • A
      include/linux/gfp.h: exclude duplicate header · b7596fb4
      Andy Shevchenko committed
      mmdebug.h is included twice.
      Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b7596fb4
    • M
      mm: non-atomically mark page accessed during page cache allocation where possible · 2457aec6
      Mel Gorman committed
      aops->write_begin may allocate a new page and make it visible only to have
      mark_page_accessed called almost immediately after.  Once the page is
      visible the atomic operations are necessary, which is noticeable overhead
      when writing to an in-memory filesystem like tmpfs but should also be
      noticeable with fast storage.  The objective of the patch is to initialise
      the accessed information with non-atomic operations before the page is
      visible.
      
      The bulk of filesystems directly or indirectly use
      grab_cache_page_write_begin or find_or_create_page for the initial
      allocation of a page cache page.  This patch adds an init_page_accessed()
      helper which behaves like the first call to mark_page_accessed() but may
      be called before the page is visible and can be done non-atomically.
      
      The primary APIs of concern in this case are the following and are used
      by most filesystems.
      
      	find_get_page
      	find_lock_page
      	find_or_create_page
      	grab_cache_page_nowait
      	grab_cache_page_write_begin
      
      All of them are very similar in detail, so the patch creates a core helper
      pagecache_get_page() which takes a flags parameter that affects its
      behavior such as whether the page should be marked accessed or not.  The
      old API is preserved but is basically a thin wrapper around this core
      function.
      
      Each of the filesystems is then updated to avoid calling
      mark_page_accessed when it is known that the VM interfaces have already
      done the job.  There is a slight snag in that the timing of the
      mark_page_accessed() has now changed so in rare cases it's possible a page
      gets to the end of the LRU as PageReferenced whereas previously it might
      have been repromoted.  This is expected to be rare but it's worth the
      filesystem people thinking about it in case they see a problem with the
      timing change.  It is also the case that some filesystems may be marking
      pages accessed that previously did not but it makes sense that filesystems
      have consistent behaviour in this regard.
      
      The test case used to evaluate this is a simple dd of a large file done
      multiple times with the file deleted on each iteration.  The size of the
      file is 1/10th physical memory to avoid dirty page balancing.  In the
      async case it will be possible that the workload completes without even
      hitting the disk and will have variable results but highlight the impact
      of mark_page_accessed for async IO.  The sync results are expected to be
      more stable.  The exception is tmpfs where the normal case is for the "IO"
      to not hit the disk.
      
      The test machine was single socket and UMA to avoid any scheduling or NUMA
      artifacts.  Throughput and wall times are presented for sync IO, only wall
      times are shown for async as the granularity reported by dd and the
      variability is unsuitable for comparison.  As async results were variable
      due to writeback timings, I'm only reporting the maximum figures.  The sync
      results were stable enough to make the mean and stddev uninteresting.
      
      The performance results are reported based on a run with no profiling.
      Profile data is based on a separate run with oprofile running.
      
      async dd
                                          3.15.0-rc3            3.15.0-rc3
                                             vanilla           accessed-v2
      ext3    Max      elapsed     13.9900 (  0.00%)     11.5900 ( 17.16%)
      tmpfs   Max      elapsed      0.5100 (  0.00%)      0.4900 (  3.92%)
      btrfs   Max      elapsed     12.8100 (  0.00%)     12.7800 (  0.23%)
      ext4    Max      elapsed     18.6000 (  0.00%)     13.3400 ( 28.28%)
      xfs     Max      elapsed     12.5600 (  0.00%)      2.0900 ( 83.36%)
      
      The XFS figure is a bit strange as it managed to avoid a worst case by
      sheer luck but the average figures looked reasonable.
      
              samples percentage
      ext3       86107    0.9783  vmlinux-3.15.0-rc4-vanilla        mark_page_accessed
      ext3       23833    0.2710  vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
      ext3        5036    0.0573  vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
      ext4       64566    0.8961  vmlinux-3.15.0-rc4-vanilla        mark_page_accessed
      ext4        5322    0.0713  vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
      ext4        2869    0.0384  vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
      xfs        62126    1.7675  vmlinux-3.15.0-rc4-vanilla        mark_page_accessed
      xfs         1904    0.0554  vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
      xfs          103    0.0030  vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
      btrfs      10655    0.1338  vmlinux-3.15.0-rc4-vanilla        mark_page_accessed
      btrfs       2020    0.0273  vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
      btrfs        587    0.0079  vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
      tmpfs      59562    3.2628  vmlinux-3.15.0-rc4-vanilla        mark_page_accessed
      tmpfs       1210    0.0696  vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
      tmpfs         94    0.0054  vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
      
      [akpm@linux-foundation.org: don't run init_page_accessed() against an uninitialised pointer]
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Tested-by: Prabhakar Lad <prabhakar.csengg@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2457aec6
    • M
      mm: shmem: avoid atomic operation during shmem_getpage_gfp · 07a42788
      Mel Gorman committed
      shmem_getpage_gfp uses an atomic operation to set the SwapBacked flag
      before the page is even added to the LRU or made visible.  This is
      unnecessary, as there is nothing it could possibly race against.  Use an
      unlocked variant.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      07a42788
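      The idea behind the shmem change can be sketched in user-space C.  This
      is only an illustration, not the kernel code: struct demo_page,
      DEMO_SWAPBACKED and the demo_* helpers are hypothetical names, and C11
      atomics stand in for the kernel's page-flag helpers.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical flag bit; it stands in for the SwapBacked page flag. */
#define DEMO_SWAPBACKED (1UL << 0)

struct demo_page {
	atomic_ulong flags;
};

/* Atomic read-modify-write: required once other CPUs can see the page. */
static inline void demo_set_swapbacked(struct demo_page *page)
{
	atomic_fetch_or(&page->flags, DEMO_SWAPBACKED);
}

/*
 * Unlocked variant: a plain load followed by a plain store.  This is only
 * safe while the caller is the sole owner of the page.
 */
static inline void demo_set_swapbacked_unlocked(struct demo_page *page)
{
	unsigned long v;

	v = atomic_load_explicit(&page->flags, memory_order_relaxed);
	atomic_store_explicit(&page->flags, v | DEMO_SWAPBACKED,
			      memory_order_relaxed);
}

/* A freshly allocated page has not been published, so nothing can race. */
static inline struct demo_page *demo_alloc_page(void)
{
	struct demo_page *page = malloc(sizeof(*page));

	if (!page)
		return NULL;
	atomic_init(&page->flags, 0);
	demo_set_swapbacked_unlocked(page);
	return page;
}
```

      Both helpers leave the same bit set; the difference is only that the
      unlocked one avoids a locked instruction on the allocation path.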
    • M
      mm: page_alloc: convert hot/cold parameter and immediate callers to bool · b745bc85
      Mel Gorman committed
      cold is a bool, so make it one.  Make the likely case the "if" part of
      the block instead of the "else", as the optimisation manual recommends.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b745bc85
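      A rough user-space sketch of the pattern in the hot/cold change: the
      parameter is a bool and the common (hot) case is the "if" branch.  The
      list type and the demo_* functions are hypothetical stand-ins for the
      kernel's list_head and per-cpu free lists, assuming hot pages go to the
      head of the list and cold pages to the tail.

```c
#include <assert.h>
#include <stdbool.h>

/* Minimal circular doubly linked list with a sentinel head. */
struct demo_list_head {
	struct demo_list_head *next, *prev;
};

static inline void demo_list_init(struct demo_list_head *head)
{
	head->next = head;
	head->prev = head;
}

/* Link entry between two known neighbours. */
static inline void demo_list_insert(struct demo_list_head *entry,
				    struct demo_list_head *prev,
				    struct demo_list_head *next)
{
	next->prev = entry;
	entry->next = next;
	entry->prev = prev;
	prev->next = entry;
}

/*
 * Hot pages are assumed to be the likely case, so they are handled in the
 * "if" branch: they go to the head of the list and are reused first.
 * Cold pages go to the tail.
 */
static inline void demo_free_hot_cold(struct demo_list_head *page,
				      struct demo_list_head *pcp, bool cold)
{
	assert(page != pcp);

	if (!cold)
		demo_list_insert(page, pcp, pcp->next);
	else
		demo_list_insert(page, pcp->prev, pcp);
}
```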
    • M
      mm: page_alloc: use unsigned int for order in more places · 7aeb09f9
      Mel Gorman committed
      X86 prefers the use of unsigned types for iterators, and the code tends
      to mix signed and unsigned types for page order.  This converts a number
      of sites in mm/page_alloc.c to use unsigned int for order where possible.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7aeb09f9
    • M
      mm: page_alloc: reduce number of times page_to_pfn is called · dc4b0caf
      Mel Gorman committed
      In the free path we calculate page_to_pfn multiple times. Reduce that.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dc4b0caf
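      The page_to_pfn change amounts to computing the frame number once and
      passing it down.  A minimal sketch, assuming a flat array of page
      structs; every demo_* name and constant is hypothetical, and the lookup
      counter exists only to make the effect visible.  Depending on the
      memory model, the real conversion may cost more than this pointer
      subtraction.

```c
#include <assert.h>

struct demo_page {
	unsigned long private;
};

#define DEMO_NR_PAGES		64UL
#define DEMO_PAGEBLOCK_ORDER	3U	/* assumed value, for illustration */

static struct demo_page demo_mem_map[DEMO_NR_PAGES];
static unsigned long demo_pfn_lookups;	/* counts conversions */

static inline unsigned long demo_page_to_pfn(const struct demo_page *page)
{
	assert(page >= demo_mem_map && page < demo_mem_map + DEMO_NR_PAGES);
	demo_pfn_lookups++;
	return (unsigned long)(page - demo_mem_map);
}

/* Helpers take the pfn as a parameter instead of recomputing it. */
static inline unsigned long demo_buddy_pfn(unsigned long pfn,
					   unsigned int order)
{
	return pfn ^ (1UL << order);
}

static inline unsigned long demo_pageblock_nr(unsigned long pfn)
{
	return pfn >> DEMO_PAGEBLOCK_ORDER;
}

/* Free path: convert once, then hand the pfn to every helper. */
static inline unsigned long demo_free_one(struct demo_page *page,
					  unsigned int order)
{
	unsigned long pfn = demo_page_to_pfn(page);

	page->private = demo_pageblock_nr(pfn);
	return demo_buddy_pfn(pfn, order);
}
```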
    • M
      mm: page_alloc: use word-based accesses for get/set pageblock bitmaps · e58469ba
      Mel Gorman committed
      The test_bit operations in get/set pageblock flags are expensive.  This
      patch reads the bitmap on a word basis and uses shifts and masks to isolate
      the bits of interest.  Similarly, masks are used to update a local copy of
      the bitmap word, and cmpxchg then writes it back if no other changes
      have been made in parallel.
      
      In a test running dd onto tmpfs the overhead of the pageblock-related
      functions went from 1.27% in profiles to 0.5%.
      
      In addition to the performance benefits, this patch closes races that are
      possible between:
      
      a) get_ and set_pageblock_migratetype(), where get_pageblock_migratetype()
         reads some of the bits before and the rest of the bits after
         set_pageblock_migratetype() has updated them.
      
      b) set_pageblock_migratetype() and set_pageblock_skip(), where the non-atomic
         read-modify-update set bit operation in set_pageblock_skip() will cause
         lost updates to some bits changed in the set_pageblock_migratetype().
      
      Joonsoo Kim first reported case a) via code inspection.  Vlastimil
      Babka's testing with a debug patch showed that either a) or b) occurs
      roughly once per run of mmtests' stress-highalloc benchmark (although not
      necessarily in the same pageblock).  Furthermore, during development of
      unrelated compaction patches, it was observed that with frequent calls to
      {start,undo}_isolate_page_range() the race occurs several thousand
      times and has resulted in NULL pointer dereferences in move_freepages()
      and free_one_page() in places where free_list[migratetype] is
      manipulated by e.g.  list_move().  Further debugging confirmed that
      migratetype had an invalid value of 6, causing out-of-bounds access to the
      free_list array.
      
      That confirmed that the race exists, although it may be extremely rare,
      and is currently only fatal where page isolation is performed due to
      memory hot remove.  Races on pageblocks being updated by
      set_pageblock_migratetype(), where both old and new migratetype are
      lower than MIGRATE_RESERVE, currently cannot result in an invalid value
      being observed, although theoretically they may still lead to
      unexpected creation or destruction of MIGRATE_RESERVE pageblocks.
      Furthermore, things could suddenly get worse when memory isolation is
      used more, or when new migratetypes are added.
      
      After this patch, the race has no longer been observed in testing.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Reported-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Reported-and-tested-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e58469ba
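      The word-based scheme described in e58469ba can be sketched in
      user-space C.  This is not the kernel implementation: the demo_* names,
      the four bits per pageblock and the bitmap size are assumptions, and
      C11 atomic_compare_exchange_weak stands in for cmpxchg.  Retrying when
      the exchange fails is the usual compare-and-swap idiom and is assumed
      here rather than taken from the text above.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>

#define DEMO_BITS_PER_LONG	(sizeof(unsigned long) * CHAR_BIT)
#define DEMO_NR_BLOCK_BITS	4UL	/* assumed bits per pageblock */
#define DEMO_NR_BLOCKS		32UL
#define DEMO_NR_WORDS \
	((DEMO_NR_BLOCKS * DEMO_NR_BLOCK_BITS + DEMO_BITS_PER_LONG - 1) / \
	 DEMO_BITS_PER_LONG)

/* Four bits divide the word size evenly, so a group never spans two words. */
static atomic_ulong demo_bitmap[DEMO_NR_WORDS];

/* Read the whole word once, then isolate the bits with a shift and a mask. */
static inline unsigned long demo_get_block_bits(unsigned long block,
						unsigned long mask)
{
	unsigned long bitidx = block * DEMO_NR_BLOCK_BITS;
	unsigned long shift = bitidx % DEMO_BITS_PER_LONG;
	unsigned long word;

	assert(block < DEMO_NR_BLOCKS);
	word = atomic_load(&demo_bitmap[bitidx / DEMO_BITS_PER_LONG]);
	return (word >> shift) & mask;
}

/*
 * Build the new word from a local copy and publish it with a
 * compare-and-swap.  On failure, old is refreshed with the current word and
 * the update is recomputed, so changes made in parallel to other bits of
 * the same word are not lost.
 */
static inline void demo_set_block_bits(unsigned long block,
				       unsigned long flags,
				       unsigned long mask)
{
	unsigned long bitidx = block * DEMO_NR_BLOCK_BITS;
	unsigned long shift = bitidx % DEMO_BITS_PER_LONG;
	atomic_ulong *wordp = &demo_bitmap[bitidx / DEMO_BITS_PER_LONG];
	unsigned long old, updated;

	assert(block < DEMO_NR_BLOCKS);
	assert((flags & ~mask) == 0);

	old = atomic_load(wordp);
	do {
		updated = (old & ~(mask << shift)) | (flags << shift);
	} while (!atomic_compare_exchange_weak(wordp, &old, updated));
}
```

      Because readers see the whole word in one load and writers replace it
      in one exchange, neither a torn read as in case a) nor a lost update as
      in case b) can happen in this model.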
    • M
      mm: page_alloc: use jump labels to avoid checking number_of_cpusets · 664eedde
      Mel Gorman committed
      If cpusets are not in use then we still check a global variable on every
      page allocation.  Use jump labels to avoid the overhead.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      664eedde